# Movie reviews classification

In [60]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from IPython.display import Markdown, display
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## Data exploration

In [61]:
df = pd.read_csv('moviereviews.tsv', sep='\t')
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [62]:
# Size of the dataset
len(df)

2000

In [63]:
# Display a review
display(Markdown('> '+df['review'][0]))

> how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternates between unfunny scenes of the brothers bickering over what to do with their inheritance and endless action sequences as the two take on their increasingly determined furry foe . 
whatever promise the film starts with soon deteriorates into boring dialogue , terrible overacting , and increasingly uninspired slapstick that becomes all sound and fury , signifying nothing . 
the script becomes so unspeakably bad that the best line poor lee evens can utter after another run in with the rodent is : " i hate that mouse " . 
oh cringe ! 
this is home alone all over again , and ten times worse . 
one touching scene early on is worth mentioning . 
we follow the mouse through a maze of walls and pipes until he arrives at his makeshift abode somewhere in a wall . 
he jumps into a tiny bed , pulls up a makeshift sheet and snuggles up to sleep , seemingly happy and just wanting to be left alone . 
it's a magical little moment in an otherwise soulless film . 
a message to speilberg : if you want dreamworks to be associated with some kind of artistic credibility , then either give all concerned in mouse hunt a swift kick up the arse or hire yourself some decent writers and directors . 
this kind of rubbish will just not do at all . 


In [64]:
# Take a look at the distribution of the data
df['label'].value_counts()

neg    1000
pos    1000
Name: label, dtype: int64

## Data cleaning

In [65]:
# Check for missing values
df.isnull().sum()

label      0
review    35
dtype: int64

In [66]:
# Remove missing values
df.dropna(inplace=True)
df.isnull().sum()

label     0
review    0
dtype: int64

In [67]:
# Detect reviews that contain only whitespace
blanks = []  # index labels of the blank reviews

for i, lb, rv in df.itertuples():  # each row unpacks to (index, label, review)
    if isinstance(rv, str):        # skip non-string values such as NaN
        if rv.isspace():           # True when the review is whitespace only
            blanks.append(i)       # record the row's index label

print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [68]:
# Remove the whitespace-only reviews
df.drop(blanks, inplace=True)
len(df)

1938
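
The same whitespace check can also be expressed without an explicit loop. The snippet below is a minimal sketch on a made-up three-row DataFrame (the `toy` data is an assumption, not taken from `moviereviews.tsv`); it uses the pandas string accessor `str.isspace()` to build a boolean mask.

```python
import pandas as pd

# Made-up rows for illustration only (not from moviereviews.tsv)
toy = pd.DataFrame({
    'label': ['neg', 'pos', 'neg'],
    'review': ['fine film', '   ', '\n'],
})

# True where the review consists solely of whitespace characters
blank_mask = toy['review'].str.isspace()

# Keep only the rows that contain actual text
toy_clean = toy[~blank_mask]

print(blank_mask.tolist())
print(len(toy_clean))
```

As with the loop, a truly empty string (`''`) is not flagged, because `isspace()` returns `False` for it.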

In [69]:
# Check that the classes are still balanced after cleaning
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

## Model building

### Train test split

In [70]:
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

A common practice in Natural Language Processing is to filter out stopwords, because they can influence the results and slow down processing. I therefore define a list of stopwords and pass it to the vectorizer when building the pipelines.

In [71]:
# Create my list of stopwords for model building
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', 
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his', 
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or', 
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this', 
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
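
To make the effect of such a list concrete, here is a minimal sketch that fits `TfidfVectorizer` on two made-up sentences, once without and once with a short stop-word list. The sentences and the name `toy_stopwords` are illustrative assumptions and are much smaller than the real data and list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences (assumption: illustration only, not from the dataset)
toy_docs = ["the film is a delight", "the plot is a mess"]
toy_stopwords = ["the", "is", "a"]  # hypothetical, far shorter than the real list

# Vocabulary learned without any stop-word filtering
vocab_all = sorted(TfidfVectorizer().fit(toy_docs).vocabulary_)

# Vocabulary learned after dropping the stop words
vocab_filtered = sorted(
    TfidfVectorizer(stop_words=toy_stopwords).fit(toy_docs).vocabulary_
)

print(vocab_all)       # ['delight', 'film', 'is', 'mess', 'plot', 'the']
print(vocab_filtered)  # ['delight', 'film', 'mess', 'plot']
```

With the default `token_pattern`, `TfidfVectorizer` already ignores single-character tokens, which is why `a` is absent even from the unfiltered vocabulary.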

### Naïve Bayes model

#### Build pipeline to vectorize the data

In [72]:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                        ('clf', MultinomialNB()),
                       ])

#### Feed the training data through the pipeline

In [73]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', MultinomialNB())])

#### Analysis of the results

In [74]:
# Analysis of the predictions
predictions = text_clf_nb.predict(X_test)

# Confusion matrix (rows: true labels, columns: predicted labels)
print(metrics.confusion_matrix(y_test, predictions))

[[282  26]
 [105 227]]


In [75]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.73      0.92      0.81       308
         pos       0.90      0.68      0.78       332

    accuracy                           0.80       640
   macro avg       0.81      0.80      0.79       640
weighted avg       0.82      0.80      0.79       640



In [76]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7953125


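Based on the text alone, the Naïve Bayes classifier correctly classifies reviews as positive or negative **79.5%** of the time, which is a reasonable result given the relatively small amount of data. The confusion matrix shows that most of its errors are positive reviews predicted as negative (105 of 332).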
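
The accuracy figure can be recovered directly from the confusion matrix printed earlier. The sketch below hard-codes that matrix (the variable names `cm_nb` and `accuracy_nb` are mine) and divides the correct predictions on the diagonal by the total number of test reviews.

```python
import numpy as np

# Naïve Bayes confusion matrix as printed above
# (rows: true neg/pos, columns: predicted neg/pos)
cm_nb = np.array([[282, 26],
                  [105, 227]])

# Accuracy is the share of predictions that fall on the diagonal
accuracy_nb = np.trace(cm_nb) / cm_nb.sum()

print(accuracy_nb)  # 0.7953125, i.e. (282 + 227) / 640
```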

### Linear Support Vector Classifier

#### Build the pipeline

In [77]:
# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                          ('clf', LinearSVC()),
                         ])

#### Feed the training data through the pipeline

In [78]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', LinearSVC())])

In [79]:
# Form a prediction set
predictions = text_clf_lsvc.predict(X_test)

In [80]:
# Report the confusion matrix (metrics is already imported at the top)
print(metrics.confusion_matrix(y_test, predictions))

[[256  52]
 [ 48 284]]


In [81]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.84      0.83      0.84       308
         pos       0.85      0.86      0.85       332

    accuracy                           0.84       640
   macro avg       0.84      0.84      0.84       640
weighted avg       0.84      0.84      0.84       640



In [82]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.84375


The Linear Support Vector Classifier performs better than Naïve Bayes, with an accuracy of **0.84** versus **0.80**.
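
Where the gap comes from is easier to see per class. As a rough sketch, the snippet below recomputes recall for each class from the two confusion matrices reported above; `recall_per_class` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

# Confusion matrices as printed above
# (rows: true neg/pos, columns: predicted neg/pos)
cm_nb = np.array([[282, 26], [105, 227]])
cm_lsvc = np.array([[256, 52], [48, 284]])


def recall_per_class(cm):
    """Correct predictions per class divided by the true count of that class."""
    return np.diag(cm) / cm.sum(axis=1)


recall_nb = recall_per_class(cm_nb)
recall_lsvc = recall_per_class(cm_lsvc)

print(recall_nb.round(2))    # [0.92 0.68]
print(recall_lsvc.round(2))  # [0.83 0.86]
```

Naïve Bayes finds slightly more of the negative reviews but misses about a third of the positive ones, whereas the Linear SVC is balanced across both classes.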


**NB:** I also built the Linear SVC pipeline without removing the stopwords and obtained **0.847** accuracy. This suggests that not all stopwords impair classification performance.
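
A comparison of this kind could be set up along the following lines. This is only a sketch under stated assumptions: `lsvc_accuracy` is a hypothetical helper, the toy corpus is made up so that the snippet runs on its own, and `random_state=0` is added for repeatability (the pipeline above uses `LinearSVC()` with defaults). In the notebook, the helper would be called with `X_train`, `y_train`, `X_test`, `y_test` and the `stopwords` list instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def lsvc_accuracy(X_train, y_train, X_test, y_test, stop_words=None):
    """Fit a TF-IDF + LinearSVC pipeline and return its test accuracy."""
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', LinearSVC(random_state=0)),  # assumption: fixed seed
    ])
    pipeline.fit(X_train, y_train)
    return accuracy_score(y_test, pipeline.predict(X_test))


# Tiny made-up corpus (assumption: illustration only)
toy_train = ["a great and moving film", "a dull and boring film",
             "great acting and a great story", "boring plot and dull acting"]
toy_labels = ["pos", "neg", "pos", "neg"]
toy_test = ["a great story", "a dull plot"]
toy_truth = ["pos", "neg"]

# Same model, with and without stop-word filtering
acc_with = lsvc_accuracy(toy_train, toy_labels, toy_test, toy_truth,
                         stop_words=["a", "and"])
acc_without = lsvc_accuracy(toy_train, toy_labels, toy_test, toy_truth)

print(acc_with, acc_without)
```

Keeping everything but `stop_words` identical between the two runs makes any difference in accuracy attributable to the filtering alone.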