___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that either contain `NaN` data or consist entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep="\t")
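As a rough illustration of what `sep="\t"` does, the sketch below parses a tiny in-memory sample rather than the real file. The three rows are made up for illustration; only the column names `label` and `review` come from the dataset. An empty `review` field is read as `NaN`, which is the kind of row Task #2 looks for.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same tab-delimited layout
sample_tsv = (
    "label\treview\n"
    "pos\tA fun ride.\n"
    "neg\t\n"
    "neg\tToo long, too dull.\n"
)

toy = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# The empty review field on the second row is parsed as NaN
print(toy.isnull().sum())
```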




### Task #2: Check for missing values:

In [4]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [11]:
# Check for whitespace strings (it's OK if there aren't any!):
non_null = df.dropna()  # drop NaN rows once so .str.isspace() returns a clean boolean mask
print(non_null[non_null.review.str.isspace()].shape)
print(non_null[non_null.label.str.isspace()].shape)




(0, 2)
(0, 2)
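
If whitespace-only reviews did turn up, one possible way to drop them together with the `NaN` rows is sketched below on a made-up four-row frame. The `toy` data and the `blank_mask` name are illustrative, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one whitespace-only review and one NaN review
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["great film", "   ", np.nan, "dull plot"],
})

# True where the review is empty after stripping whitespace.
# Unlike .str.isspace(), this also flags zero-length strings.
# NaN reviews compare as False here, so dropna() handles them below.
blank_mask = toy["review"].str.strip().eq("")

cleaned = toy[~blank_mask].dropna(subset=["review"])
print(cleaned)
```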


### Task #3: Remove NaN values:

In [12]:
df.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [13]:
df.label

0       pos
1       pos
2       pos
3       neg
4       pos
       ... 
5995    pos
5996    neg
5997    neg
5998    pos
5999    pos
Name: label, Length: 5980, dtype: object
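
A frequency count is often a quicker check of class balance than scanning the column. The sketch below uses a made-up `labels` Series; on the real data the same method could be called on `df.label`.

```python
import pandas as pd

# Hypothetical labels, standing in for df.label
labels = pd.Series(["pos", "neg", "pos", "pos", "neg"], name="label")

# Raw counts per class
counts = labels.value_counts()
print(counts)

# Proportions instead of raw counts
print(labels.value_counts(normalize=True))
```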

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`.

In [20]:
from sklearn.model_selection import train_test_split

X = df["review"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
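
For reference, the split sizes implied by `test_size=0.33` can be worked out by hand. This sketch assumes scikit-learn rounds a fractional test size up to the next whole sample and gives the remainder to the training set, which agrees with the support of 1974 in the classification report. The 5980 is the row count after dropping the `NaN` reviews.

```python
import math

n_samples = 5980   # rows left after df.dropna()
test_size = 0.33

# Assumed rounding rule: the test share is rounded up
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_test, n_train)
```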
    



### Task #6: Build a pipeline to vectorize the data, then train a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [21]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfid", TfidfVectorizer()),  # convert raw review text to TF-IDF features
    ("clf", LinearSVC()),         # linear support vector classifier
])

pipeline.fit(X_train, y_train)








Pipeline(steps=[('tfid', TfidfVectorizer()), ('clf', LinearSVC())])
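
Because the pipeline bundles the vectorizer with the classifier, it accepts raw strings at prediction time. The sketch below shows this on a made-up four-review corpus. The names `toy_reviews`, `toy_labels`, and `toy_pipeline` are illustrative and not part of the assessment, and its predictions say nothing about performance on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical miniature training set
toy_reviews = [
    "a great and wonderful film",
    "wonderful acting, great story",
    "a terrible and awful film",
    "awful acting, terrible story",
]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
toy_pipeline.fit(toy_reviews, toy_labels)

# New raw text goes straight in; the fitted vectorizer is applied automatically
new_preds = toy_pipeline.predict(["great wonderful", "terrible awful"])
print(new_preds)
```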

### Task #7: Run predictions and analyze the results

In [22]:
# Form a prediction set
predictions = pipeline.predict(X_test)

In [25]:
# Report the confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, predictions))

[[900  91]
 [ 63 920]]
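
To read this matrix: scikit-learn puts true labels on the rows and predicted labels on the columns, with classes in sorted order (`neg`, then `pos`). The sketch below recomputes precision and recall by hand from the four counts above. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# rows = true label, columns = predicted label, order = [neg, pos]
matrix = [[900, 91],
          [63, 920]]

true_neg_as_neg, true_neg_as_pos = matrix[0]
true_pos_as_neg, true_pos_as_pos = matrix[1]

# Precision: of everything predicted as a class, how much was right
precision_neg = true_neg_as_neg / (true_neg_as_neg + true_pos_as_neg)
precision_pos = true_pos_as_pos / (true_pos_as_pos + true_neg_as_pos)

# Recall: of everything truly in a class, how much was found
recall_neg = true_neg_as_neg / (true_neg_as_neg + true_neg_as_pos)
recall_pos = true_pos_as_pos / (true_pos_as_pos + true_pos_as_neg)

print(round(precision_neg, 2), round(recall_neg, 2))  # 0.93 0.91
print(round(precision_pos, 2), round(recall_pos, 2))  # 0.91 0.94
```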


In [26]:
# Print a classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [29]:
# Print the overall accuracy
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

0.9219858156028369
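
The same figure can be recovered from the confusion matrix, since accuracy is the number of correct predictions (the diagonal) divided by the total number of test samples. A quick check using the counts reported earlier:

```python
# Diagonal of the confusion matrix = correctly classified reviews
correct = 900 + 920

# All four cells = every review in the test set
total = 900 + 91 + 63 + 920

accuracy = correct / total
print(accuracy)
```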

## Great job!

