# Movie Review Analysis as Text Classification

The **moviereviews2.tsv** dataset contains the text of 6,000 movie reviews: 3,000 positive and 3,000 negative. The text has been preprocessed and stored as a tab-delimited file. As before, labels are given as `pos` and `neg`.

We've included 20 reviews that are either `NaN` or strings made up entirely of whitespace.

For more information on this dataset, visit http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('../TextFiles/moviereviews2.tsv', sep='\t')

df.head()

,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isna().sum()

label      0
review    20
dtype: int64

In [10]:
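# Check for whitespace-only strings (it's OK if there aren't any!):
blank = []  # index labels of reviews that contain only whitespace
for i, label, rv in df.itertuples():
    # NaN entries are floats, not strings, so skip them before calling .isspace()
    if isinstance(rv, str) and rv.isspace():
        blank.append(i)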
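
The whitespace check above can also be written without an explicit loop. The following is a minimal sketch on a hypothetical four-row frame (the `toy` rows are made up for illustration and are not taken from moviereviews2.tsv). It assumes missing reviews should simply not count as blank, so they are filled with an empty string before the test.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['Great film.', '   ', np.nan, 'Dull and slow.'],
})

# ''.isspace() is False, so filling NaN with '' keeps missing rows out of the mask
blank_mask = toy['review'].fillna('').str.isspace()
blank_idx = toy.index[blank_mask].tolist()
print(blank_idx)

# Drop the whitespace-only rows first, then the NaN rows
cleaned = toy.drop(blank_idx).dropna()
print(len(cleaned))
```

Keeping the index labels in a list, as the loop does, makes it easy to pass them straight to `DataFrame.drop`.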


### Task #3: Remove NaN values:

In [9]:
# Drop every row that contains a NaN value
df = df.dropna()
df

,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...
...,...,...
5995,pos,"Of the three remakes of this plot, I like them..."
5996,neg,Poor Whoopi Goldberg. Imagine her at a friend'...
5997,neg,"Honestly before I watched this movie, I had he..."
5998,pos,This movie is essentially shot on a hand held ...


### Task #4: Take a quick look at the `label` column:

In [11]:
df['label']

0       pos
1       pos
2       pos
3       neg
4       pos
       ... 
5995    pos
5996    neg
5997    neg
5998    pos
5999    pos
Name: label, Length: 5980, dtype: object

### Task #5: Split the data into train & test sets:
You may use whatever settings you like. Here we use `test_size=0.33, random_state=42`.

In [12]:
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
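
The split above is purely random, so the `pos`/`neg` ratio in the test set can drift slightly from the ratio in the full data. If an exact class balance matters, `train_test_split` also accepts a `stratify` argument. The sketch below shows the idea on hypothetical toy data (the `X_toy` and `y_toy` names and values are illustrative only); it is an optional variation, not the setting used in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 positive and 20 negative "reviews"
X_toy = [f"review {i}" for i in range(40)]
y_toy = ['pos'] * 20 + ['neg'] * 20

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(Counter(y_tr))
print(Counter(y_te))
```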



### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. Here we use `LinearSVC` as the classification model.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import LinearSVC
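# Chain TF-IDF vectorization and a linear SVM classifier.
# Note: the variable SVC is this pipeline, not the sklearn.svm.SVC class.
SVC = Pipeline([('Tf_vec', TfidfVectorizer()), ('SVC', LinearSVC())])
SVC.fit(X_train, y_train)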
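
To make clear what the pipeline does on our behalf, here is a minimal sketch that runs the same two steps by hand and compares the predictions. The toy documents, labels, and variable names are hypothetical, and `random_state=0` is set only so both classifiers are seeded identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, not taken from moviereviews2.tsv
train_docs = [
    "loved it, wonderful acting",
    "great story and great cast",
    "terrible plot, awful pacing",
    "boring and awful",
]
train_labels = ['pos', 'pos', 'neg', 'neg']
test_docs = ["wonderful cast", "awful plot"]

# Pipeline version: one object vectorizes and classifies raw strings
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC(random_state=0))])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

# Manual version of the same two steps
vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learn vocabulary and IDF weights from training text only
clf = LinearSVC(random_state=0).fit(X_tr, train_labels)
manual_pred = clf.predict(vec.transform(test_docs))  # reuse the fitted vocabulary on new text

print(list(pipe_pred) == list(manual_pred))
```

The practical benefit of the pipeline is that the test text is always transformed with the vocabulary learned from the training text, so the two cannot get out of sync.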





### Task #7: Run predictions and analyze the results

In [14]:
# Form a prediction set
pred = SVC.predict(X_test)

In [15]:
# Report the confusion matrix
print(confusion_matrix(y_test, pred))


[[900  91]
 [ 63 920]]


In [16]:
# Print a classification report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [17]:
# Print the overall accuracy
print(accuracy_score(y_test, pred))

0.9219858156028369
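
The confusion matrix, classification report, and accuracy score above all describe the same four counts. As a quick sanity check, the sketch below recomputes the reported metrics directly from the matrix values printed earlier. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with labels in sorted order (`neg`, then `pos`).

```python
import numpy as np

# Values copied from the confusion matrix output above
# rows: true neg, true pos; columns: predicted neg, predicted pos
cm = np.array([[900, 91],
               [63, 920]])

correct = cm.diagonal()               # correctly classified counts per class
recall = correct / cm.sum(axis=1)     # divide by row totals (true counts, i.e. support)
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.trace() / cm.sum()

for name, p, r, f in zip(['neg', 'pos'], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, these agree with the classification report, and the accuracy matches the `accuracy_score` output.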
