___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import pandas as pd
import numpy as np
import sklearn

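# Adjust the path if the file is not in the working directory,
# e.g. '../TextFiles/moviereviews2.tsv' as given in the task description
doc=pd.read_csv("moviereviews2.tsv",sep="\t")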
print(len(doc))
print(doc.head())



6000
  label                                             review
0   pos  I loved this movie and will watch it again. Or...
1   pos  A warm, touching movie that has a fantasy-like...
2   pos  I was not expecting the powerful filmmaking ex...
3   neg  This so-called "documentary" tries to tell tha...
4   pos  This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
doc.isnull().sum()




label      0
review    20
dtype: int64

In [3]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks=[]
for i,lb,rv in doc.itertuples():
    # NaN reviews are floats, so only test actual strings
    if isinstance(rv,str) and rv.isspace():  # whitespace-only review
        blanks.append(i)
doc.drop(blanks,inplace=True)
len(doc)
    








6000
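The same check can be written without a row-by-row loop. The following is a minimal sketch, assuming a small made-up DataFrame (`toy` is hypothetical and stands in for `doc`). Note that `str.strip().eq("")` also flags truly empty strings, which `isspace()` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `doc`: one NaN review and one whitespace-only review
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film.", np.nan, "   ", "Dull and slow."],
})

# NaN reviews are flagged by isnull()
is_nan = toy["review"].isnull()

# Stripping whitespace leaves "" for blank reviews; NaN never compares equal to ""
is_blank = toy["review"].str.strip().eq("")

print(is_nan.sum(), is_blank.sum())

# Keep only rows that are neither NaN nor blank
clean = toy[~(is_nan | is_blank)]
print(len(clean))
```

On the real data the two masks could be applied to `doc` in the same way, which would cover Task #2 and Task #3 in one step.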

### Task #3: Remove NaN values:

In [4]:
doc.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [5]:
print(doc['label'].head())

0    pos
1    pos
2    pos
3    neg
4    pos
Name: label, dtype: object


### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [13]:
from sklearn.model_selection import train_test_split

x=doc['review']
y=doc['label']


x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=42)
print(type(x_test))


<class 'pandas.core.series.Series'>
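As a quick sanity check on the split sizes, here is a small sketch that splits placeholder data of the same length as the cleaned dataset. It assumes 5980 rows remain (6000 loaded minus the 20 `NaN` reviews, with no whitespace-only rows found); `indices` is a hypothetical stand-in for the reviews.

```python
from sklearn.model_selection import train_test_split

# 5980 rows remain once the 20 NaN reviews are dropped from the 6000 loaded
n_rows = 5980
indices = list(range(n_rows))  # placeholder data; only the sizes matter here

train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=42)

# A float test_size is rounded up to a whole number of test samples
print(len(train_idx), len(test_idx))
```

The test set size of 1974 should match the total `support` shown in the classification report later in this notebook.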


### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [7]:

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

text_classifier=Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])
text_classifier.fit(x_train,y_train)
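To see what the pipeline does under the hood, the sketch below compares it with the same two steps done by hand. The tiny training corpus is made up for illustration, and `random_state=0` is an assumption added so both classifiers are trained identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data, for illustration only
train_text = ["loved it, wonderful acting", "a great and touching film",
              "terrible plot, awful pacing", "boring and a waste of time"]
train_labels = ["pos", "pos", "neg", "neg"]
new_text = ["a wonderful film", "awful and boring"]

# Pipeline: fit() vectorizes the text, then trains the classifier on the result
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LinearSVC(random_state=0))])
pipe.fit(train_text, train_labels)
pipe_pred = pipe.predict(new_text)

# The same two steps done by hand
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
clf = LinearSVC(random_state=0).fit(X_train, train_labels)
# New text is only transformed with the fitted vocabulary, never refitted
manual_pred = clf.predict(vectorizer.transform(new_text))

print(list(pipe_pred), list(manual_pred))
```

The main convenience of the pipeline is that raw strings can be passed straight to `fit` and `predict`, so the vectorizer fitted on the training set is reused automatically on the test set.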












### Task #7: Run predictions and analyze the results

In [8]:
# Form a prediction set
y_pred=text_classifier.predict(x_test)
print(y_pred)

['neg' 'pos' 'pos' ... 'pos' 'pos' 'pos']


In [9]:
# Report the confusion matrix
from sklearn.metrics import confusion_matrix,accuracy_score

print(confusion_matrix(y_test,y_pred))



[[900  91]
 [ 63 920]]
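In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, with string labels sorted alphabetically by default (`neg` before `pos`). A small sketch with made-up labels (`y_true` and `y_hat` are hypothetical) shows the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = ["neg", "neg", "pos", "pos", "pos"]
y_hat = ["neg", "pos", "pos", "pos", "neg"]

# Passing labels explicitly fixes the row/column order
cm = confusion_matrix(y_true, y_hat, labels=["neg", "pos"])
print(cm)

# Treating "pos" as the positive class, ravel() unpacks the four cells row by row
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

Read the same way, the matrix above says that 900 negative reviews were classified correctly, 91 negative reviews were predicted as positive, 63 positive reviews were predicted as negative, and 920 positive reviews were classified correctly.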


In [10]:
# Print a classification report
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))



              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974
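The per-class figures in the report can be reproduced from the confusion matrix printed earlier. A minimal sketch, with the matrix values copied in by hand (`scores` is a hypothetical helper dictionary):

```python
import numpy as np

# Confusion matrix from the notebook: rows = true label, columns = predicted label
cm = np.array([[900, 91],
               [63, 920]])
labels = ["neg", "pos"]

scores = {}
for i, name in enumerate(labels):
    precision = cm[i, i] / cm[:, i].sum()  # correct / everything predicted as this class
    recall = cm[i, i] / cm[i, :].sum()     # correct / everything truly in this class
    f1 = 2 * precision * recall / (precision + recall)
    scores[name] = (precision, recall, f1)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The row sums of the matrix (991 and 983) are the `support` values in the report.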



In [11]:
# Print the overall accuracy
print(accuracy_score(y_test,y_pred))



0.9219858156028369
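Overall accuracy is the number of correct predictions (the diagonal of the confusion matrix) divided by the total number of test samples. A short sketch using the matrix values from this notebook:

```python
import numpy as np

# Confusion matrix from the notebook: rows = true label, columns = predicted label
cm = np.array([[900, 91],
               [63, 920]])

correct = np.trace(cm)  # 900 + 920 correctly classified reviews
total = cm.sum()        # all 1974 test reviews
accuracy = correct / total
print(correct, total, accuracy)
```

This works out to 1820 / 1974, or roughly 0.922, in line with the value reported by `accuracy_score`.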


In [16]:
print(text_classifier.predict(["the movie is great."]))

['pos']


## Great job!