# Text Classification Assessment
This assessment is very much like the Text Classification Project we just completed, and the dataset is very similar.

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that contain either `NaN` data or strings made up entirely of whitespace.

For more information on this dataset visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame
For this exercise you can load the dataset from `'../TextFiles/moviereviews2.tsv'`.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews2.tsv', sep='\t')

df

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |
| 5 | neg | Lately they have been trying to hock this film... |
| 6 | neg | This is without a doubt the worst movie I have... |
| 7 | neg | PLAN B has the appearance of a quickly made, u... |
| 8 | pos | At least something good came out of Damon Runy... |
| 9 | pos | The story of Cinderella is one of my favorites... |


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

label      0
review    20
dtype: int64

In [3]:
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []

for i, label, review in df.itertuples():
    # NaN entries are floats, so only test actual strings
    if isinstance(review, str) and review.isspace():
        blanks.append(i)

len(blanks)

0
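
The same whitespace check can also be written without an explicit loop. The sketch below uses pandas' vectorized `Series.str.isspace()` on a small made-up DataFrame (`toy` and its three rows are illustrative, not taken from the dataset). Missing values do not count as blank here, so the result is filled with `False` before it is used as a mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one normal review, one whitespace-only, one NaN
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'review': ['Great film.', '   ', np.nan],
})

# Missing values may come back as NaN, so treat them as "not blank"
is_blank = toy['review'].str.isspace().fillna(False).astype(bool)

# Index labels of rows whose review is only whitespace
blanks = toy.index[is_blank.to_numpy()].tolist()
print(blanks)
```

The loop version and this one should agree on which rows are blank; the vectorized form is mainly a matter of style on a dataset this small.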

### Task #3: Remove NaN values:

In [4]:
df.dropna(inplace=True)

### Task #4: Take a quick look at the `label` column:

In [5]:
df['label'].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. To compare your results to the solution notebook, use `test_size=0.33, random_state=42`

In [6]:
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
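
As a quick sanity check on the split sizes: with `test_size=0.33`, scikit-learn rounds the test count up, so the 5980 remaining rows should yield 1974 test rows and 4006 training rows. The sketch below reproduces that with placeholder data (`rows` is a stand-in list, not the reviews themselves).

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the DataFrame after dropna
rows = list(range(5980))

train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=42)

print(len(train_rows), len(test_rows))
```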

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

## Remove Punctuation and Stopwords

In [7]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jamiezeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
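def clean(mess):
    # Drop punctuation characters, then rejoin into a single string
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    # Load the stopword list once per call instead of once per word
    stop_words = set(stopwords.words('english'))
    # Return the remaining words as a list of tokens
    return [word for word in nopunc.split() if word.lower() not in stop_words]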

In [9]:
df.head()

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


In [10]:
df['review'] = df['review'].apply(clean)

In [11]:
df.head()

| | label | review |
|---|---|---|
| 0 | pos | [loved, movie, watch, Original, twist, Plot, M... |
| 1 | pos | [warm, touching, movie, fantasylike, qualitybr... |
| 2 | pos | [expecting, powerful, filmmaking, experience, ... |
| 3 | neg | [socalled, documentary, tries, tell, USA, fake... |
| 4 | pos | [show, escape, reality, past, ten, years, sadl... |
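
Cleaning `df['review']` at this point does not change `X_train` or `X_test`, which were created in Task #5. One way to apply this kind of cleaning to the text the model actually sees is to pass a function as the vectorizer's `analyzer`. The sketch below is a minimal illustration only: `simple_clean`, the tiny `STOP_WORDS` set, and the four toy reviews are hypothetical stand-ins (in the notebook one would presumably use the NLTK stopword list). Unlike `clean`, it also lowercases the tokens.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical, deliberately tiny stopword set (stand-in for NLTK's list)
STOP_WORDS = {'i', 'this', 'the', 'a', 'was', 'it', 'and'}

def simple_clean(text):
    """Strip punctuation, lowercase, and drop stopwords; return a token list."""
    no_punc = ''.join(ch for ch in text if ch not in string.punctuation)
    return [w.lower() for w in no_punc.split() if w.lower() not in STOP_WORDS]

# Toy training data, not taken from moviereviews2.tsv
texts = [
    'I loved this movie, it was wonderful!',
    'A wonderful and touching film.',
    'This was the worst movie. Terrible!',
    'Terrible acting and a boring plot.',
]
labels = ['pos', 'pos', 'neg', 'neg']

# When analyzer is a callable, the vectorizer uses it in place of its
# built-in preprocessing and tokenization
sketch = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=simple_clean)),
    ('linsvc', LinearSVC()),
])
sketch.fit(texts, labels)

# Inspect which tokens survived the cleaning step
vocab = sketch.named_steps['tfidf'].vocabulary_
print(sorted(vocab))
```

Putting the cleaning inside the pipeline means the same function runs on both the training data and anything later passed to `predict`, so the two cannot drift apart.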


## Pipeline

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

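# Note: X_train still holds the raw review text from the Task #5 split;
# the token lists stored in df['review'] above are not passed to this pipeline.
model = Pipeline([('tfidf', TfidfVectorizer()), ('linsvc', LinearSVC())])
model.fit(X_train, y_train)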

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('linsvc',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
            

### Task #7: Run predictions and analyze the results

In [14]:
# Form a prediction set
pre = model.predict(X_test)

In [15]:
# Import the metrics used to evaluate the predictions
from sklearn.metrics import confusion_matrix, classification_report

In [16]:
# Print the confusion matrix, then the classification report
print(confusion_matrix(y_test, pre))
print(classification_report(y_test, pre))

[[900  91]
 [ 63 920]]
              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [17]:
# Print the overall accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, pre))

0.9219858156028369
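
The accuracy score can be recovered directly from the confusion matrix printed earlier: correct predictions sit on the diagonal, and the total is the sum of all four cells. A short check using only the numbers reported above:

```python
# Confusion matrix from the output above: rows are true labels (neg, pos),
# columns are predicted labels (neg, pos)
cm = [[900, 91],
      [63, 920]]

correct = cm[0][0] + cm[1][1]          # true neg + true pos
total = sum(sum(row) for row in cm)    # all test reviews

accuracy = correct / total
neg_precision = cm[0][0] / (cm[0][0] + cm[1][0])  # of predicted neg, share truly neg
neg_recall = cm[0][0] / (cm[0][0] + cm[0][1])     # of true neg, share predicted neg

print(total, round(accuracy, 4), round(neg_precision, 2), round(neg_recall, 2))
# → 1974 0.922 0.93 0.91
```

These values match the `accuracy`, `precision`, and `recall` entries for `neg` in the classification report.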
