# Text Classification Project
Now we're at the point where we should be able to:
* Read in a collection of documents - a *corpus*
* Transform text into numerical vector data using a pipeline
* Create a classifier
* Fit/train the classifier
* Test the classifier on new data
* Evaluate performance

For this project we'll use the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

In this exercise we'll try to develop a classification model as we did for the SMSSpamCollection dataset - that is, we'll try to predict the Positive/Negative labels based on text content alone. In an upcoming section we'll apply *Sentiment Analysis* to train models that have a deeper understanding of each review.

## Perform imports and load the dataset
The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('UPDATED_NLP_COURSE/TextFiles/moviereviews.tsv', sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [2]:
len(df)  # the number of reviews

2000
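The 1000/1000 class balance stated above can be checked with `value_counts()`. The sketch below runs on a tiny stand-in DataFrame (`toy_df` and its rows are hypothetical, not taken from the dataset); on the real data the same call would be `df['label'].value_counts()`.

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame (hypothetical rows, not the real data)
toy_df = pd.DataFrame({
    'label': ['neg', 'neg', 'pos', 'pos', 'neg', 'pos'],
    'review': ['bad', 'awful', 'great', 'fine', 'dull', 'fun'],
})

# Count how many rows carry each label
label_counts = toy_df['label'].value_counts()
print(label_counts)
```

A balanced dataset matters here because plain accuracy is only a fair summary when neither class dominates.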

In [4]:
# Take a look at a typical review. This one is labeled "pos" (positive):
from IPython.display import Markdown, display
display(Markdown('> '+df['review'][2]))

> this has been an extraordinary year for australian films . 
 " shine " has just scooped the pool at the australian film institute awards , picking up best film , best actor , best director etc . to that we can add the gritty " life " ( the anguish , courage and friendship of a group of male prisoners in the hiv-positive section of a jail ) and " love and other catastrophes " ( a low budget gem about straight and gay love on and near a university campus ) . 
i can't recall a year in which such a rich and varied celluloid library was unleashed from australia . 
 " shine " was one bookend . 
stand by for the other one : " dead heart " . 
>from the opening credits the theme of division is established . 
the cast credits have clear and distinct lines separating their first and last names . 
bryan | brown . 
in a desert settlement , hundreds of kilometres from the nearest town , there is an uneasy calm between the local aboriginals and the handful of white settlers who live nearby . 
the local police officer has the task of enforcing " white man's justice " to the aboriginals . 
these are people with a proud 40 , 000 year heritage behind them . 
naturally , this includes their own system of justice ; key to which is " payback " . 
an eye for an eye . 
revenge . 
usually extracted by the spearing through of the recipient's thigh . 
brown , as the officer , manages quite well to keep the balance . 
he admits that he has to 'bend the rules' a bit , including actively encouraging at least one brutal " payback " . 
 ( be warned that this scene , near the start , is not for the squeamish ) . 
the local priest - an aboriginal , but in the " white fellas " church - has a foot on either side of the line . 
he is , figuratively and literally , in both camps . 
ernie dingo brings a great deal of understanding to this role as the man in the middle . 
he is part churchman and part politician . 
however the tension , like the heat , flies and dust , is always there . 
whilst her husband - the local teacher - is in church , white lady kate ( milliken ) and her aborginal friend tony , ( pedersen ) have gone off into the hills . 
he takes her to a sacred site , even today strictly men-only . 
she appears to not know this . 
tony tells her that this is a special place , an initiation place . 
he then makes love to her , surrounded by ancient rock art . 
the community finds out about this sacrilegious act and it's payback time . 
the fuse is lit and the brittle inter-racial peace is shattered . 
everyone is affected in the fall out . 
to say more is to give away the details of this finely crafted film . 
suffice to say it's a rewarding experience . 
bryan brown , acting and co-producing , is the pivotal character . 
his officer is real , human and therefore flawed . 
brown comments that he expects audiences to feel warmth towards the man , then suddenly feel angry about him . 
it wasn't long ago that i visited central australia - ayers rock ( uluru ) and alice springs - for the first time . 
the wide-screen cinematography shows the dead heart of australia in a way that captures it's vicious beauty , but never deteriorates into a moving slide show , in which the gorgeous background dominates those pesky actors in the foreground . 
the cultural clash has provided the thesis for many a film ; from the western to the birdcage . 
at least three excellent australian films have covered the aboriginal people and the line between them and we anglo-saxon 'invaders' : " jedda " , " the chant of jimmie blacksmith " and " the last wave " . 
in a year when the race 'debate' has reared up in australia , it is nourishing to see such an intelligent , non-judgemental film as " dead heart " . 
the aboriginal priest best sums this up . 
he is asked to say if he is a " black fella or white fella " . 


In [8]:
# Check for missing values:
df.isnull().sum()

label      0
review    35
dtype: int64

In [9]:
df.dropna(inplace=True)

In [10]:
# Confirm that no missing values remain:
df.isnull().sum()

label     0
review    0
dtype: int64

In [11]:
# Detect & remove reviews that contain only whitespace
blanks = []  # start with an empty list

# itertuples() yields (index, label, review) for each row
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():  # skip non-strings such as NaN
        blanks.append(i)                      # record the index of each blank review

print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [12]:
df.drop(blanks,inplace=True)
len(df)  # the original data had 2,000 rows; 35 NaN and 27 blank reviews were removed

1938
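The same cleanup can also be written without an explicit loop. This is a minimal sketch on a hypothetical four-row frame (`toy_df` is an assumption for illustration), using the pandas string accessor. One difference from the loop above: `str.strip() == ''` also catches completely empty strings, which `isspace()` does not.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one NaN review and one whitespace-only review
toy_df = pd.DataFrame({
    'label': ['neg', 'pos', 'neg', 'pos'],
    'review': ['a dull film', '   ', np.nan, 'a fine film'],
})

# Drop NaN reviews first, as in the cells above
toy_df = toy_df.dropna()

# True where the review is empty or consists only of whitespace
is_blank = toy_df['review'].str.strip() == ''

# Keep only the rows with real text
cleaned = toy_df[~is_blank]
print(len(cleaned))
```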

In [13]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X = df['review']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [15]:
# Check the size of each dataset
print(f"Training data: {X_train.shape} Test data: {X_test.shape}")

Training data: (1356,) Test data: (582,)
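The split above is purely random, so the pos/neg ratio in the test set can drift slightly from that of the full data. If an exact ratio is wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variation, not what the notebook does, and it uses hypothetical toy data (`toy_X`, `toy_y`).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 placeholder reviews, 10 per class (hypothetical)
toy_X = pd.Series([f'review {i}' for i in range(20)])
toy_y = pd.Series(['neg'] * 10 + ['pos'] * 10)

# stratify=toy_y keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y)

print(y_te.value_counts())
```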


In [17]:
# Import the pipeline components: a TF-IDF vectorizer and a linear SVM classifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [18]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  
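To make clear what the pipeline does on `fit` and `predict`, the sketch below performs the two steps by hand and compares the result with an equivalent pipeline. The corpus (`toy_docs`, `toy_labels`) is hypothetical, and `random_state=0` is added only so both classifiers are trained identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus, for illustration only
toy_docs = ['a wonderful , moving film', 'a dull and lifeless film',
            'wonderful acting and a moving story', 'dull story , lifeless acting']
toy_labels = ['pos', 'neg', 'pos', 'neg']

# Step 1: learn the vocabulary and IDF weights, then vectorize the text
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(toy_docs)
print(X_tfidf.shape)  # (number of documents, vocabulary size)

# Step 2: train the classifier on the TF-IDF matrix
clf = LinearSVC(random_state=0)
clf.fit(X_tfidf, toy_labels)

# New text must go through transform() (not fit_transform()) before predict()
manual_pred = clf.predict(vectorizer.transform(['a wonderful story']))

# The pipeline chains the same two steps behind a single fit/predict interface
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LinearSVC(random_state=0))])
pipe.fit(toy_docs, toy_labels)
pipe_pred = pipe.predict(['a wonderful story'])

print(manual_pred, pipe_pred)
```

Because the vectorizer is fitted inside the pipeline on the training text only, no vocabulary or IDF information from the test set leaks into training.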

In [19]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [20]:
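# Report the confusion matrix
from sklearn import metrics
# print(metrics.confusion_matrix(y_test,predictions))
# Rows are actual labels, columns are predicted labels (sorted alphabetically: neg, pos).
# A separate name is used so the reviews DataFrame `df` is not overwritten.
cm_df = pd.DataFrame(metrics.confusion_matrix(y_test,predictions), index=['neg','pos'], columns=['neg','pos'])
cm_df

| | neg | pos |
|---|---|---|
| neg | 235 | 47 |
| pos | 41 | 259 |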


In [21]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582
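The figures in the report follow directly from the four confusion-matrix counts. As a worked check, the sketch below recomputes them with NumPy from the counts shown above (235, 47, 41, 259); the variable names are illustrative.

```python
import numpy as np

# Counts from the confusion matrix above: rows = actual, columns = predicted
cm = np.array([[235, 47],    # actual neg
               [41, 259]])   # actual pos

# Precision: correct predictions of a class / all predictions of that class
precision = cm.diagonal() / cm.sum(axis=0)

# Recall: correct predictions of a class / all actual members of that class
recall = cm.diagonal() / cm.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: all correct predictions / all predictions
accuracy = cm.trace() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(accuracy)
```

For example, recall for `neg` is 235 / (235 + 47) = 235 / 282, which rounds to 0.83.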



In [22]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.8487972508591065


In [24]:
# Try the model on a short, made-up positive review
text_clf.predict(["I loved the movie, was amazing"])

array(['pos'], dtype=object)

In [25]:
# Try the model on a short, made-up negative review
text_clf.predict(["I hate the movie, was the worst"])

array(['neg'], dtype=object)

In [26]:
text_clf.predict(["A movie I really wanted to love was terrible. I'm sure the producers had the best intentions, but the execution was lacking."])

array(['neg'], dtype=object)
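`predict` returns only the label. `LinearSVC` also exposes `decision_function`, which gives a signed score per sample: in the binary case a positive score maps to the second class in `classes_` and a negative score to the first, and a larger magnitude means the sample lies further from the decision boundary. The sketch below uses a hypothetical toy pipeline (`toy_clf`) in place of `text_clf` so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus, for illustration only
toy_docs = ['a wonderful , moving film', 'a dull and lifeless film',
            'wonderful acting and a moving story', 'dull story , lifeless acting']
toy_labels = ['pos', 'neg', 'pos', 'neg']

toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC(random_state=0))])
toy_clf.fit(toy_docs, toy_labels)

samples = ['a wonderful film', 'a dull film']
scores = toy_clf.decision_function(samples)  # signed distance-like scores
preds = toy_clf.predict(samples)             # labels derived from the sign
classes = toy_clf.named_steps['clf'].classes_

for text, score, label in zip(samples, scores, preds):
    print(f'{text!r}: score={score:.3f} -> {label}')
```

On the real model the equivalent call would be `text_clf.decision_function([...])`, which can help when a review mixes praise and criticism and the predicted label alone hides how close the call was.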