# Text Classification

The goal of this project is to develop a **classification model to predict the positive/negative labels** of movie reviews. This prediction will be **based solely on the text content** of the reviews.

The data used in this project is the polarity dataset v2.0 from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

#### 1. Perform initial imports

In [1]:
import numpy as np
import pandas as pd

#### 2. Load data

In [2]:
df = pd.read_csv("data/moviereviews.tsv", sep='\t')

#### 3. Check the dataframe

In [3]:
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [4]:
len(df)

2000

In [5]:
# check number of both labels

df['label'].value_counts()

pos    1000
neg    1000
Name: label, dtype: int64

In [6]:
# check first negative review

print(df['review'][0])

how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternate

In [7]:
# check first positive review

print(df['review'][2])

this has been an extraordinary year for australian films . 
 " shine " has just scooped the pool at the australian film institute awards , picking up best film , best actor , best director etc . to that we can add the gritty " life " ( the anguish , courage and friendship of a group of male prisoners in the hiv-positive section of a jail ) and " love and other catastrophes " ( a low budget gem about straight and gay love on and near a university campus ) . 
i can't recall a year in which such a rich and varied celluloid library was unleashed from australia . 
 " shine " was one bookend . 
stand by for the other one : " dead heart " . 
>from the opening credits the theme of division is established . 
the cast credits have clear and distinct lines separating their first and last names . 
bryan | brown . 
in a desert settlement , hundreds of kilometres from the nearest town , there is an uneasy calm between the local aboriginals and the handful of white settlers who live nearby . 

#### 4. Check missing values

In [8]:
df.isnull().sum()

label      0
review    35
dtype: int64

There are 35 missing reviews. A row without review text gives the classifier nothing to learn from, so we should delete these rows.
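As a small illustration of this step, the sketch below counts and drops missing reviews on a made-up three-row frame. The `toy` dataframe is hypothetical and only stands in for `df`. Passing `subset=["review"]` is an optional refinement that restricts the check to the review column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reviews dataframe
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos"],
    "review": ["fine film", np.nan, "good cast"],
})

# Number of missing values per column
missing = toy.isnull().sum()
print(missing)

# Drop only the rows whose review text is missing
cleaned = toy.dropna(subset=["review"])
print(len(cleaned))
```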

In [9]:
# remove rows with missing reviews

df.dropna(inplace = True)

In [10]:
# check missing values

df.isnull().sum()

label     0
review    0
dtype: int64

#### 5. Check empty strings

In [11]:
# using the isspace() method

empty_strings = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        empty_strings.append(i)

In [12]:
print(empty_strings)
print(len(empty_strings))

[57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
27


There are 27 reviews that consist only of whitespace. These reviews are identified by the indices in the `empty_strings` list. We should remove them.
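The same check can also be written without an explicit loop by using the pandas string accessor. This is only a sketch on a hypothetical toy frame (`toy` is not part of the project data), but the same mask should work on `df['review']` once the missing values have been dropped.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews dataframe
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["great film", "   ", "\n", "dull plot"],
})

# Boolean mask: True where the review contains only whitespace
blank_mask = toy["review"].str.isspace()

# Index labels of the blank reviews, comparable to the empty_strings list
blank_idx = toy.index[blank_mask].tolist()
print(blank_idx)

# Drop the blank rows by index label
cleaned = toy.drop(blank_idx)
print(len(cleaned))
```

Note that `isspace()` returns `False` for a zero-length string, so a truly empty `""` would need a separate check such as `toy["review"].str.strip() == ""`.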

In [13]:
# remove rows with empty strings

df.drop(empty_strings, inplace = True)

In [14]:
# check length

len(df)

1938

In [15]:
# check number of both labels

df['label'].value_counts()

pos    969
neg    969
Name: label, dtype: int64

We now have 1938 movie reviews (969 are positive and 969 are negative).

#### 6. Split the data into train and test sets

In [16]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
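The split above is purely random, so the two classes are not guaranteed to be equally represented in the test set. If a balanced test set matters, `train_test_split` accepts a `stratify` argument. The sketch below shows this on made-up data. The `toy_X` and `toy_y` names are hypothetical, and this is a variation on the step above, not what the notebook itself runs.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive and 10 negative placeholder reviews
toy_X = [f"review {i}" for i in range(20)]
toy_y = ["pos"] * 10 + ["neg"] * 10

# stratify=toy_y keeps the class proportions the same in both splits
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print(Counter(ty_train))
print(Counter(ty_test))
```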

#### 7. Build pipeline to vectorize the data and train/fit the model

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])

text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
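To get a feel for what the `tfidf` step feeds into the classifier, here is a minimal sketch on two made-up sentences (`toy_docs` is hypothetical, not taken from the dataset). Each document becomes one row of weights, and a term that appears in every document (here "movie") gets a lower weight than a term that is specific to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents (hypothetical, not from the dataset)
toy_docs = ["a great great movie", "a terrible movie"]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)  # sparse matrix, one row per document

# Learned vocabulary as (term, column index) pairs;
# the default tokenizer ignores one-letter tokens such as "a"
print(sorted(toy_vec.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view of the weights; rows are L2-normalised by default
print(toy_matrix.toarray().round(2))
```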

#### 8. Make predictions with the test set

In [18]:
predictions = text_clf.predict(X_test)

#### 9. Evaluate the predictions

In [19]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [20]:
# confusion matrix

print(confusion_matrix(y_test, predictions))

[[235  47]
 [ 41 259]]


In [21]:
# classification report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

   micro avg       0.85      0.85      0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [22]:
# accuracy score

print(accuracy_score(y_test, predictions))

0.8487972508591065


Based solely on the text content of the reviews, we've managed to correctly classify **84.9%** of the test reviews as positive or negative.
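The numbers in the classification report and the accuracy score can be reproduced by hand from the confusion matrix. The sketch below hard-codes the matrix printed earlier and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`neg`, `pos`).

```python
import numpy as np

# Confusion matrix from the output above
# rows: true neg, true pos; columns: predicted neg, predicted pos
cm = np.array([[235, 47],
               [41, 259]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision per class: correct predictions over everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: correct predictions over everything that truly is that class
recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(precision.round(2))
print(recall.round(2))
```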

#### 10. Test the fitted model on new data

In [23]:
# make up some reviews and test the model

my_review = "The movie was great! The main actors were superb and the storyline was convincing."
my_review_2 = "Terrible movie! A complete waste of money."

reviews = [my_review, my_review_2]

print(text_clf.predict(reviews))

['pos' 'neg']


Even though this is a very simple model, it labels both made-up reviews as expected.
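Beyond the predicted label, it can be useful to see how far a review falls from the decision boundary. A fitted pipeline ending in `LinearSVC` exposes `decision_function` for this. The sketch below trains on a tiny made-up corpus so that it runs on its own (`toy_reviews`, `toy_labels` and `toy_clf` are hypothetical names). With the real model, the same call would be `text_clf.decision_function(reviews)`.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only to keep the example self-contained
toy_reviews = ["great superb film", "wonderful great acting",
               "terrible awful plot", "dreadful terrible script"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()),
                    ("clf", LinearSVC())])
toy_clf.fit(toy_reviews, toy_labels)

new_reviews = ["great acting", "terrible plot"]
toy_preds = toy_clf.predict(new_reviews)

# Signed distance to the separating hyperplane:
# positive values point to the second class in classes_ ('pos'), negative to 'neg'
toy_scores = toy_clf.decision_function(new_reviews)

for text, label, score in zip(new_reviews, toy_preds, toy_scores):
    print(f"{label}  {score:+.2f}  {text}")
```

A score close to zero suggests the model is unsure about that review, while a large absolute value suggests a more clear-cut case.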