# Movie Review Classification 

The **moviereviews2.tsv** dataset contains the text of 6000 movie reviews. 3000 are positive, 3000 are negative, and the text has been preprocessed as a tab-delimited file. As before, labels are given as `pos` and `neg`.

The file includes 20 reviews that either contain `NaN` data or consist entirely of whitespace.

Dataset source: http://ai.stanford.edu/~amaas/data/sentiment/

### Task #1: Perform imports and load the dataset into a pandas DataFrame

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# The file is tab-delimited, so set the separator explicitly
reviews = pd.read_csv('moviereviews2.tsv', sep='\t')

In [11]:
reviews.head()

| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |


### Task #2: Check for missing values:

In [9]:
# Check for NaN values:
reviews.isnull().sum()

label      0
review    20
dtype: int64
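
The count above only covers fields that pandas parsed as missing. The sketch below shows why a separate whitespace check is still needed. It uses a small made-up TSV string (`sample_tsv` and `demo` are illustrative names, not part of the dataset): an empty field is read as `NaN`, while a field containing only spaces is kept as an ordinary string.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample: row 1 has an empty review, row 2 is only spaces
sample_tsv = "label\treview\npos\tGreat film\nneg\t\npos\t   \n"

demo = pd.read_csv(StringIO(sample_tsv), sep='\t')

# Only the empty field is counted as missing
null_counts = demo.isnull().sum()
print(null_counts)

# The whitespace-only review is still a string, so isnull() does not flag it
print(repr(demo.loc[2, 'review']))
```

In this toy case `isnull()` reports one missing review even though two of the three rows carry no usable text.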

In [14]:
# Add an explicit row-number column as the first column
reviews.insert(0, 'ID', range(len(reviews)))

In [15]:
reviews.columns

Index(['ID', 'label', 'review'], dtype='object')

In [25]:
reviews.head()

| | ID | label | review |
|---|---|---|---|
| 0 | 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | 4 | pos | This show has been my escape from reality for ... |


In [24]:
reviews.items()

<generator object DataFrame.items at 0x00000014ED17B970>

In [33]:
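# Check for whitespace-only strings
blanks = []

# itertuples() yields (index, ID, label, review) for each row
for i, _id, lbl, rvw in reviews.itertuples():
    # NaN reviews are floats, so only test actual strings
    if isinstance(rvw, str) and rvw.isspace():
        blanks.append(i)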

In [34]:
len(blanks)

0
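
The loop above relies on the extra `ID` column so that each tuple unpacks into four values. As an alternative sketch (not the approach used in this notebook), the same check can be written against the `review` column alone. The `find_blank_rows` helper and the toy frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def find_blank_rows(df, column):
    """Return index labels of rows whose `column` value is a whitespace-only string."""
    # isinstance() guards against NaN, which is a float and has no isspace()
    mask = df[column].apply(lambda value: isinstance(value, str) and value.isspace())
    return df.index[mask].tolist()


# Toy data: one normal review, one NaN, one whitespace-only review
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'review': ['Loved it', np.nan, '  \t '],
})

blank_rows = find_blank_rows(toy, 'review')
print(blank_rows)
```

The returned labels could then be passed to `DataFrame.drop` if any blank rows were found.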

### Task #3: Remove NaN values:

In [35]:
reviews.dropna(inplace=True)

In [37]:
# The helper ID column was only needed for the whitespace check
reviews.drop('ID', axis=1, inplace=True)

### Task #4: Take a quick look at the `label` column:

In [38]:
reviews['label'].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

### Task #5: Split the data into train & test sets:
This notebook uses `test_size=0.33, random_state=42`.

In [39]:
from sklearn.model_selection import train_test_split
X = reviews['review']
y = reviews['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
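
As a quick sanity check on the split sizes: 5,980 reviews remain after the `NaN` rows are dropped (2,990 per label), and `test_size=0.33` rounds up to 1,974 test rows, leaving 4,006 for training. The sketch below reproduces that arithmetic on placeholder data (`dummy_X` and `dummy_y` are stand-ins for the real columns).

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same length as the cleaned dataset (2990 pos + 2990 neg)
n_rows = 5980
dummy_X = list(range(n_rows))
dummy_y = ['pos', 'neg'] * (n_rows // 2)

dX_train, dX_test, dy_train, dy_test = train_test_split(
    dummy_X, dummy_y, test_size=0.33, random_state=42
)

print(len(dX_train), len(dX_test))
```

The test-set size matches the total support of 1974 shown in the classification report.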






### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Step 1 converts raw text to TF-IDF features; step 2 classifies them
clf = Pipeline([('TfidfVector', TfidfVectorizer()), ('LSVC', LinearSVC())])

clf.fit(X_train, y_train)

Pipeline(steps=[('TfidfVector', TfidfVectorizer()), ('LSVC', LinearSVC())])
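
Because the vectorizer lives inside the pipeline, `clf` accepts raw strings: `fit` learns the TF-IDF vocabulary from the training text only, and `predict` reuses that vocabulary on new text. Here is a minimal sketch on a made-up six-review corpus (the `toy_*` names and texts are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

toy_reviews = [
    'a wonderful and moving film',
    'wonderful acting, great story',
    'great fun from start to finish',
    'a dull and boring mess',
    'boring plot, terrible acting',
    'terrible, dull, a waste of time',
]
toy_labels = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']

toy_clf = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

# Raw text goes straight in; the pipeline vectorizes it before classifying
toy_pred = toy_clf.predict(['what a wonderful story'])
print(toy_pred)

# The fitted vectorizer is reachable by its step name
vocab_size = len(toy_clf.named_steps['tfidf'].vocabulary_)
print(vocab_size)
```

Keeping both steps in one object also means the test reviews never influence the vocabulary or the IDF weights.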

### Task #7: Run predictions and analyze the results

In [42]:
# Form a prediction set
predictions = clf.predict(X_test)

In [44]:
# Report the confusion matrix
# Rows are true labels and columns are predicted labels, in the order neg, pos
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
confusion_matrix(y_test, predictions)


array([[900,  91],
       [ 63, 920]], dtype=int64)

In [45]:
# Print a classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974
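
The figures in the report can be recomputed by hand from the confusion matrix shown earlier, where rows are true labels and columns are predicted labels in the order `neg`, `pos`. This short sketch uses only the four counts from that output (variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the notebook output: rows = true, columns = predicted
cm = np.array([[900, 91],
               [63, 920]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / row sum (true count)
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / column sum (predicted count)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(['neg', 'pos'], precision, recall, f1):
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
print(f'accuracy={accuracy:.4f}')
```

The row sums (991 and 983) are the `support` values in the report, and the diagonal holds the correctly classified reviews for each label.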



In [46]:
# Print the overall accuracy
accuracy_score(y_test, predictions)

0.9219858156028369