___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment closely follows the Text Classification Project we just completed, and it uses a very similar dataset.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist only of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('../TextFiles/moviereviews2.tsv',sep='\t')
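
As a quick sanity check on the loading step, the sketch below parses a tiny in-memory TSV the same way. The two sample rows are made up for illustration. Only the `label` and `review` column names come from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-delimited layout (not real data)
sample_tsv = "label\treview\npos\tA fun, well-acted film.\nneg\tDull and far too long.\n"

# sep='\t' splits fields on tabs, so commas inside a review stay in one field
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t')

print(sample_df.shape)
print(list(sample_df.columns))
```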

### Task #2: Check for missing values:

In [3]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [5]:
df.dropna(inplace=True)
df.isnull().sum()

label     0
review    0
dtype: int64

In [7]:
len(df)

5980
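
One detail worth knowing: `dropna` keeps the original index labels, which is why 5980 rows remain while the index in later output still runs up to 5999. A minimal sketch on a made-up four-row DataFrame (the `toy` data is purely illustrative):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing review
toy = pd.DataFrame({'label': ['pos', 'neg', 'pos', 'neg'],
                    'review': ['good', np.nan, 'fine', 'bad']})

cleaned = toy.dropna()            # without inplace=True, toy itself is unchanged
print(len(cleaned))               # number of rows left
print(list(cleaned.index))        # index labels keep the gap left by the dropped row

# Optional: renumber rows 0..n-1 if contiguous labels are preferred
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```

Resetting the index is optional here; the rest of this notebook works with the gapped index as-is.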

In [12]:
# Check for whitespace strings (it's OK if there aren't any!):
blank = []
for i,lb,rv in df.itertuples():
    if rv.isspace() or lb.isspace():
        blank.append(i)

blank

[]
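
The same check can be written without an explicit loop using the pandas string accessor. This is a sketch on a made-up DataFrame, assuming every review is already a string (that is, NaN rows have been dropped first):

```python
import pandas as pd

# Made-up data: rows 1 and 2 contain only whitespace
toy = pd.DataFrame({'label': ['pos', 'neg', 'pos'],
                    'review': ['good film', '   ', '\t\n']})

# isnull() does not flag whitespace-only strings, so a separate check is needed
print(toy.isnull().sum().sum())

# Boolean mask: True where the review consists only of whitespace characters
mask = toy['review'].str.isspace()
blank_rows = list(toy[mask].index)
print(blank_rows)
```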

### Task #3: Remove NaN values:

In [13]:
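# NaN rows were already removed with dropna() above;
# this also drops any whitespace-only rows (none were found here)
df.drop(blank,inplace=True)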

### Task #4: Take a quick look at the `label` column:

In [15]:
df['label']

0       pos
1       pos
2       pos
3       neg
4       pos
5       neg
6       neg
7       neg
8       pos
9       pos
10      pos
11      neg
12      pos
13      neg
14      pos
15      pos
16      pos
17      neg
18      neg
19      pos
20      neg
21      neg
22      pos
23      neg
24      neg
25      pos
26      pos
27      neg
28      pos
29      neg
       ... 
5970    pos
5971    pos
5972    neg
5973    pos
5974    neg
5975    neg
5976    pos
5977    neg
5978    pos
5979    neg
5980    pos
5981    pos
5982    neg
5983    neg
5984    neg
5985    pos
5986    neg
5987    neg
5988    neg
5989    pos
5990    pos
5991    pos
5992    neg
5993    pos
5994    neg
5995    pos
5996    neg
5997    neg
5998    pos
5999    pos
Name: label, Length: 5980, dtype: object
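
A more compact way to inspect a label column is to count each class. The sketch below uses a made-up Series; on the real data the equivalent call would be `df['label'].value_counts()`.

```python
import pandas as pd

# Made-up labels, for illustration only
toy_labels = pd.Series(['pos', 'neg', 'pos', 'pos', 'neg'], name='label')

# value_counts() returns one row per distinct value with its frequency
counts = toy_labels.value_counts()
print(counts)
```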

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [19]:
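from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

# Note: this split uses test_size=0.3 rather than the suggested 0.33,
# so the results below will differ slightly from the solution notebook
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=42)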
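
To illustrate what the two settings do, here is a small sketch on made-up data: `test_size` sets the fraction of rows held out, and a fixed `random_state` makes the shuffle repeatable. With 5980 rows and `test_size=.3`, the held-out set has 1794 reviews, which matches the support total in the classification report further down.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples, for illustration only
X_toy = [f"review {i}" for i in range(10)]
y_toy = ['pos', 'neg'] * 5

# Two splits with identical settings
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)

print(len(a_train), len(a_test))   # sizes of the two partitions
print(a_test == b_test)            # same seed, same held-out samples
```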

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

text_cls = Pipeline([('tfidf',TfidfVectorizer()),('lr_svc',LinearSVC())])

model = text_cls.fit(X_train,y_train)
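
For readers wondering what the pipeline does internally, the sketch below compares it with the equivalent manual steps on a made-up mini corpus. The texts, labels, and `random_state=0` are assumptions added for illustration and are not part of the assessment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up mini corpus, for illustration only
train_texts = ["great movie loved it", "wonderful acting and great story",
               "terrible plot hated it", "awful boring movie"]
train_labels = ["pos", "pos", "neg", "neg"]
new_texts = ["loved the great story", "boring and terrible"]

# Pipeline version: vectorizing and classifying happen in one fit/predict call
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('svc', LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual version: fit the vectorizer on training text, reuse it on new text
vec = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))

print(list(pipe_pred) == list(manual_pred))
```

The practical benefit of the pipeline is that raw strings can be passed straight to `predict`, with no risk of forgetting to apply the fitted vectorizer.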

### Task #7: Run predictions and analyze the results

In [29]:
# Form a prediction set
pred = model.predict(X_test)


In [30]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,pred))


[[821  78]
 [ 58 837]]


In [31]:
# Print a classification report
print(metrics.classification_report(y_test,pred))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       899
         pos       0.91      0.94      0.92       895

   micro avg       0.92      0.92      0.92      1794
   macro avg       0.92      0.92      0.92      1794
weighted avg       0.92      0.92      0.92      1794
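
The per-class figures in this report can be reproduced by hand from the confusion matrix printed earlier. A minimal sketch in plain Python, relying on scikit-learn's convention that rows are true labels and columns are predictions, in sorted label order:

```python
# Confusion matrix copied from the output above (row = true, column = predicted)
cm = [[821, 78],
      [58, 837]]
labels = ['neg', 'pos']

results = {}
for i, name in enumerate(labels):
    hits = cm[i][i]                    # reviews of this class predicted correctly
    predicted = cm[0][i] + cm[1][i]    # column sum: everything predicted as this class
    support = sum(cm[i])               # row sum: everything truly in this class
    precision = hits / predicted
    recall = hits / support
    results[name] = (round(precision, 2), round(recall, 2), support)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} support={support}")
```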



In [None]:
# Print the overall accuracy
# accuracy_score needs the true labels and the predictions
print(metrics.accuracy_score(y_test,pred))
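
Accuracy is the fraction of test reviews classified correctly, so it can also be derived directly from the confusion matrix reported earlier. A short sketch in plain Python:

```python
# Confusion matrix copied from the earlier output (row = true, column = predicted)
cm = [[821, 78],
      [58, 837]]

correct = cm[0][0] + cm[1][1]          # diagonal: correctly classified reviews
total = sum(sum(row) for row in cm)    # all reviews in the test set
accuracy = correct / total
print(f"{accuracy:.4f}")               # about 0.92, in line with the report
```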

## Great job!