# Text Classification Project

This notebook shows how to perform text classification with bag-of-words and TF-IDF features. It uses the [Movie Review Dataset from Cornell](https://www.cs.cornell.edu/people/pabo/movie-review-data/). Nothing new is introduced here; it is simply a more complex example than the one in the previous notebook.
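To make the two representations concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. Bag-of-words is a per-document term count; TF-IDF rescales each count by how rare the term is across documents. The IDF formula used here, `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, is my understanding of the default `TfidfVectorizer` settings in scikit-learn. The corpus, the whitespace tokenization and the helper names (`idf`, `tfidf`) are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset)
docs = [
    "a great film with a great cast",
    "a dull film",
    "great fun",
]

# Bag-of-words: one term-count mapping per document (naive whitespace tokenization)
tokenized = [doc.split() for doc in docs]
bows = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each term appear?
n_docs = len(docs)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed to match scikit-learn's default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(bow):
    # Weight raw counts by IDF, then scale the vector to unit L2 norm
    weights = {term: count * idf(term) for term, count in bow.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

tfidf_vectors = [tfidf(bow) for bow in bows]

print(bows[0])
print({term: round(w, 3) for term, w in tfidf_vectors[1].items()})
```

In the second document, "dull" appears in only one document of the corpus, so it ends up with a larger weight than "film", which appears in two.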

Overview of contents:

1. Load and Explore Dataset
2. Train/Test Split
3. Build Pipelines: Naive Bayes & Support Vector Machines with `TfidfVectorizer(stop_words)`
4. Evaluate Pipelines
    - 4.1 Naive Bayes
    - 4.2 Support Vector Machines

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided through a download link; I haven't found a repository to fork from.*

## 1. Load and Explore Dataset

In [38]:
import numpy as np
import pandas as pd

In [39]:
df = pd.read_csv('../data/moviereviews.tsv', sep='\t') # TAB separator
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [40]:
# Check some reviews
i = 2
print(df['label'][i])
print(df['review'][i])

pos
this has been an extraordinary year for australian films . 
 " shine " has just scooped the pool at the australian film institute awards , picking up best film , best actor , best director etc . to that we can add the gritty " life " ( the anguish , courage and friendship of a group of male prisoners in the hiv-positive section of a jail ) and " love and other catastrophes " ( a low budget gem about straight and gay love on and near a university campus ) . 
i can't recall a year in which such a rich and varied celluloid library was unleashed from australia . 
 " shine " was one bookend . 
stand by for the other one : " dead heart " . 
>from the opening credits the theme of division is established . 
the cast credits have clear and distinct lines separating their first and last names . 
bryan | brown . 
in a desert settlement , hundreds of kilometres from the nearest town , there is an uneasy calm between the local aboriginals and the handful of white settlers who live nearby . 
the

In [41]:
# Total number of reviews
len(df)

2000

In [42]:
# Count the NaN values in each column
df.isnull().sum()

label      0
review    35
dtype: int64

In [43]:
# We need to remove NULL items
df.dropna(inplace=True)
len(df)

1965

In [44]:
# Some empty reviews contain only whitespace, which dropna() does not catch
# We check for them manually in a for-loop
blanks = []  # start with an empty list
for i, lb, rv in df.itertuples():             # iterate over (index, label, review)
    if isinstance(rv, str) and rv.isspace():  # skip non-strings; test 'review' for whitespace
        blanks.append(i)                      # collect the matching index labels

In [45]:
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]
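The same cleaning can probably be done without an explicit loop, using pandas' vectorized string methods. The sketch below runs on a hypothetical four-row frame (`toy`) rather than the real data, and it handles missing and whitespace-only reviews in a single mask. One difference from the loop: a completely empty string `''` is also flagged here, whereas `str.isspace()` returns `False` for it.

```python
import pandas as pd

# Hypothetical toy frame standing in for the movie review data
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['great film', '   ', None, 'dull'],
})

# Treat missing reviews as empty strings, strip whitespace, flag what is left empty
is_blank = toy['review'].fillna('').str.strip().eq('')

# Keep only the rows that contain real text
toy_clean = toy[~is_blank]

print(is_blank.tolist())
print(toy_clean.index.tolist())
```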


In [46]:
# Remove blank entries
df.drop(blanks, inplace=True)

In [47]:
# Final number of items
len(df)

1938

In [48]:
# Is the dataset balanced? Target is 50/50, great!
df['label'].value_counts()

neg    969
pos    969
Name: label, dtype: int64

## 2. Train/Test Split

In [49]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
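As a quick sanity check on the split sizes: with 1938 cleaned reviews and `test_size=0.33`, the test set should hold 640 reviews, which matches the `support` total in the classification reports of section 4. The rounding rule in the sketch below (fractional test size rounded up, remainder used for training) is my understanding of how `train_test_split` behaves, not something verified against its source here.

```python
import math

n_samples = 1938   # rows left after cleaning
test_size = 0.33

n_test = math.ceil(test_size * n_samples)   # assumed: fractional test size is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_test, n_train)
```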

## 3. Build Pipelines: Naive Bayes & Support Vector Machines with `TfidfVectorizer`

When building our `Pipeline`, we can pass **stop words** to the `TfidfVectorizer`:
- `TfidfVectorizer(stop_words='english')` to use scikit-learn's built-in list,
- or `TfidfVectorizer(stop_words=['a', 'and', 'the'])` to use a custom list.
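The effect of the `stop_words` option is easiest to see on the learned vocabulary. The following sketch fits three vectorizers on two made-up sentences (`toy_docs` is an assumption, not part of the dataset) and prints which terms each one keeps.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences, for illustration only
toy_docs = [
    "the movie was a delight",
    "the plot and the acting were a mess",
]

# No stop-word filtering (the default)
vec_all = TfidfVectorizer().fit(toy_docs)

# Custom stop-word list
vec_custom = TfidfVectorizer(stop_words=['a', 'and', 'the']).fit(toy_docs)

# scikit-learn's built-in English list
vec_english = TfidfVectorizer(stop_words='english').fit(toy_docs)

# vocabulary_ maps each kept term to its column index
print(sorted(vec_all.vocabulary_))
print(sorted(vec_custom.vocabulary_))
print(sorted(vec_english.vocabulary_))
```

Note that the default token pattern already ignores single-character tokens, so "a" should be absent from all three vocabularies regardless of the stop-word setting.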

In [62]:
# List of default stop words in scikit-learn
from sklearn.feature_extraction import text
print(text.ENGLISH_STOP_WORDS)

frozenset({'con', 'hundred', 'behind', 'everyone', 'least', 'namely', 'as', 'always', 'would', 'hasnt', 'until', 'again', 'back', 'toward', 'empty', 'may', 'eg', 'all', 'none', 'eleven', 'found', 'per', 'side', 'there', 'therein', 'thereafter', 'seeming', 'very', 'the', 'wherein', 'everywhere', 'up', 'several', 'among', 'interest', 'be', 'mostly', 'has', 'four', 'anyway', 'thereby', 'but', 'whom', 'afterwards', 'than', 'couldnt', 'twenty', 'beforehand', 'out', 'thence', 'amoungst', 'hers', 'latterly', 'too', 'something', 'whenever', 'upon', 'system', 'whether', 'not', 'other', 'cry', 'her', 'seems', 'who', 'call', 'neither', 'some', 'full', 'serious', 'take', 'then', 'yours', 'whither', 'which', 'whoever', 'since', 'what', 'it', 'ever', 'nowhere', 'hereafter', 'sometimes', 'three', 'yet', 'he', 'one', 'less', 'bill', 'often', 'six', 'hereby', 'forty', 'were', 'fill', 'where', 'indeed', 'ie', 'and', 'you', 'few', 'here', 'with', 'of', 'such', 'become', 'hereupon', 'below', 'across', 'be

In [63]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LinearSVC()),
])
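A `Pipeline` bundles the vectorizer and the classifier so that raw strings go in and labels come out. As a rough sketch of what happens under the hood, the example below trains the same model twice on made-up sentences, once through a pipeline and once step by step, and compares the predictions. The training texts, labels and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, for illustration only
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a boring and predictable plot",
    "terrible pacing and dull dialogue",
]
train_labels = ["pos", "pos", "neg", "neg"]
new_texts = ["a great and moving story", "dull and boring"]

# With a pipeline: raw strings in, labels out
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# The same steps by hand
vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(train_texts)        # learn vocabulary and IDF on training text only
clf = MultinomialNB().fit(X_train_tfidf, train_labels)
manual_pred = clf.predict(vec.transform(new_texts))   # reuse the fitted vocabulary on new text

print(list(pipe_pred), list(manual_pred))
```

The practical benefit is that the vocabulary and IDF weights are learned from the training text only and then reused at prediction time, which is easy to get wrong when the steps are wired together manually.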

## 4. Evaluate Pipelines

### 4.1 Naive Bayes

In [64]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('clf', MultinomialNB())])

In [65]:
# Form a prediction set
predictions = text_clf_nb.predict(X_test)

In [66]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[274  34]
 [ 94 238]]


In [67]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.74      0.89      0.81       308
         pos       0.88      0.72      0.79       332

    accuracy                           0.80       640
   macro avg       0.81      0.80      0.80       640
weighted avg       0.81      0.80      0.80       640



In [68]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.8
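The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes them in plain Python, assuming scikit-learn's convention that rows are the true classes and columns the predicted classes, with labels in sorted order (`neg`, `pos`). The row sums 308 and 332 agree with the `support` column, which is consistent with that reading.

```python
# Confusion matrix from the Naive Bayes run above
# rows = true class (neg, pos), columns = predicted class (neg, pos)
cm = [[274, 34],
      [94, 238]]
labels = ['neg', 'pos']

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total   # correct predictions / all predictions

report = {}
for k, label in enumerate(labels):
    true_pos = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as this class
    actual_k = sum(cm[k])               # row sum: everything that truly is this class
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Read this way, the Naive Bayes model labels 94 positive reviews as negative but only 34 negative reviews as positive, which explains the low recall on `pos`.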


### 4.2 Support Vector Machines

In [69]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer(stop_words='english')),
                ('clf', LinearSVC())])

In [70]:
# Form a prediction set
predictions = text_clf_lsvc.predict(X_test)

In [71]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[252  56]
 [ 52 280]]


In [72]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.83      0.82      0.82       308
         pos       0.83      0.84      0.84       332

    accuracy                           0.83       640
   macro avg       0.83      0.83      0.83       640
weighted avg       0.83      0.83      0.83       640



In [73]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.83125
