# Text Classification Exercise: Movie Reviews

## Introduction

This exercise uses the data from Kaggle's [IMDB Movie reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) competition.

**Description of the data:**

- **`labeledTrainData.tsv.zip`** contains the dataset.
- Each observation in this dataset is a review of a movie by a user.
- The **sentiment** column is the sentiment of the review (1 -> positive and 0 -> negative).
- The **review** column is the text of the review.

**Goal:** Predict the sentiment of a review from its text.

## Task 1

Read **`labeledTrainData.tsv.zip`** into a pandas DataFrame and examine it. Please note that pandas can directly read tsv/csv files inside a zip file.

## Task 2

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review** as the feature and the **sentiment** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
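
To see why the hint matters, here is a minimal sketch on a made-up three-row DataFrame (`toy_df` and its contents are hypothetical, not the Kaggle data). Iterating over a DataFrame yields its column labels, so CountVectorizer treats the column name as the only document:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the real reviews (hypothetical examples).
toy_df = pd.DataFrame({
    "review": ["great movie", "terrible plot", "great acting"],
    "sentiment": [1, 0, 1],
})

# A Series yields one document per row.
dtm_series = CountVectorizer().fit_transform(toy_df["review"])

# Iterating a DataFrame yields its column labels, so the vectorizer
# sees a single "document" consisting of the word "review".
dtm_frame = CountVectorizer().fit_transform(toy_df[["review"]])

print(dtm_series.shape)  # one row per review
print(dtm_frame.shape)   # one row for the column label
```

Selecting with single brackets (`toy_df["review"]` or `toy_df.review`) returns a Series, which is the form the later tasks expect.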

## Task 3

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

## Task 4

Use multinomial Naive Bayes to **predict the sentiment** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

## Task 5

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.
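
One possible way to compute this is sketched below on a small made-up label Series (`y_toy` is a hypothetical stand-in for `y_test`): the null accuracy is simply the share of the most frequent class.

```python
import pandas as pd

# Hypothetical labels standing in for y_test.
y_toy = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# Share of each class; the largest share is the null accuracy.
class_shares = y_toy.value_counts(normalize=True)
null_accuracy = class_shares.max()
print(null_accuracy)
```

On the real data the same expression would be `y_test.value_counts(normalize=True).max()`. For the split used later in this notebook, the classification report lists 3187 negative reviews among 6250 test rows, so the null accuracy works out to 3187 / 6250 ≈ 0.51.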

## Task 6

Use different **tuning parameters** (e.g. `max_df`, `min_df`, `max_features`) to build models and compare their test accuracy.

**Hint:** You can write a function that accepts a vectorizer as a parameter and:

- creates DTMs for the training and test data,
- trains a model (SVM),
- calculates and prints the testing accuracy.

Call this function with vectorizer objects created using different tuning parameters. Use the TF-IDF vectorizer for this task.
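
The function described in the hint could look roughly like the sketch below. The name `tokenize_test`, the choice of `LinearSVC` as the SVM, and the tiny corpus are all assumptions made for illustration; the exercise does not prescribe them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def tokenize_test(vect, X_train, X_test, y_train, y_test):
    """Fit `vect` on the training text, train an SVM, return test accuracy."""
    # Learn the vocabulary on the training text only, then reuse it for the test text.
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)
    print("Features:", X_train_dtm.shape[1])

    # LinearSVC is one SVM variant (an assumption, not the required model).
    clf = LinearSVC()
    clf.fit(X_train_dtm, y_train)
    accuracy = clf.score(X_test_dtm, y_test)
    print("Testing accuracy:", accuracy)
    return accuracy


# Tiny hypothetical corpus so the sketch runs on its own.
train_text = ["great film", "loved it", "awful film", "hated it"] * 5
train_y = [1, 1, 0, 0] * 5
test_text = ["great", "awful"]
test_y = [1, 0]

# Try a few vectorizer settings and compare the printed results.
for vect in (TfidfVectorizer(),
             TfidfVectorizer(max_features=3),
             TfidfVectorizer(min_df=2, max_df=0.9)):
    tokenize_test(vect, train_text, test_text, train_y, test_y)
```

With the real data, the call would pass the review Series and sentiment labels from the train/test split instead of the toy lists.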

In [1]:
#from google.colab import drive
#drive.mount('/gdrive')

In [2]:
import pandas as pd
import numpy as np

In [3]:
# read file into pandas using a relative path. Please change the path as needed
#sms_df = pd.read_table('/gdrive/My Drive/Statistical NLP AIML/labeledTrainData.tsv.zip')
sms_df = pd.read_table('labeledTrainData.tsv.zip')

In [4]:
sms_df.shape

(25000, 3)

In [5]:
sms_df.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# split X and y into training and testing sets
sms_train, sms_test, y_train, y_test = train_test_split(sms_df.review, sms_df.sentiment, random_state=2)

In [8]:
# Training data
print(sms_train.shape)
print(y_train.shape)

(18750,)
(18750,)


In [9]:
#Test Data
print(sms_test.shape)
print(y_test.shape)

(6250,)
(6250,)


### 3. Tokenization & Vectorization

Use **CountVectorizer** to turn the review text into numeric features.

In [10]:
# import and instantiate CountVectorizer (default parameters, plus English stop-word removal)
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
cvect = CountVectorizer(stop_words='english')

In [11]:
# Fit CountVectorizer on the training reviews
cvect.fit(sms_train)

# Check the vocabulary size
len(cvect.vocabulary_)

66361

Build Document-term Matrix (DTM)

In [12]:
# Convert the training reviews into count vectors
X_train_ct = cvect.transform(sms_train)

In [13]:
#Size of Document Term Matrix
X_train_ct.shape

(18750, 66361)

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
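
The sparsity described in the quote can be checked directly. The following is a small sketch on three made-up sentences (`docs` is hypothetical); it counts the stored non-zero entries of the matrix and compares them with the total number of cells.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents; real corpora are far sparser than this toy case.
docs = ["the film was great", "the plot was thin", "great cast"]
dtm = CountVectorizer().fit_transform(docs)

# nnz is the number of explicitly stored (non-zero) entries.
n_cells = dtm.shape[0] * dtm.shape[1]
density = dtm.nnz / n_cells
print(sparse.issparse(dtm), dtm.shape, f"non-zero share: {density:.2f}")
```

The same ratio for the training matrix in this notebook would be `X_train_ct.nnz / (X_train_ct.shape[0] * X_train_ct.shape[1])`.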

Convert the test reviews into numeric features as well.

In [14]:
X_test_ct = cvect.transform(sms_test)

In [15]:
print(X_train_ct.shape)
print(X_test_ct.shape)

(18750, 66361)
(6250, 66361)


In [16]:
from sklearn.naive_bayes import MultinomialNB
# use Naive Bayes to predict the sentiment
nb = MultinomialNB()
nb.fit(X_train_ct, y_train)
y_pred_class = nb.predict(X_test_ct)
print("Number of Features")
print(X_train_ct.shape[1])
print("Training Accuracy")
print(nb.score(X_train_ct,y_train))
print("Testing Accuracy")
print(nb.score(X_test_ct,y_test))

Number of Features
66361
Training Accuracy
0.91696
Testing Accuracy
0.85616


In [21]:
# Use a scikit-learn Pipeline to chain CountVectorizer and MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score
from textblob import TextBlob, Word
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]
sms_df = pd.read_table('labeledTrainData.tsv.zip')
sms_train, sms_test, y_train, y_test = train_test_split(sms_df.review, sms_df.sentiment, random_state=2)
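# Note: with a callable analyzer, CountVectorizer ignores stop_words and
# ngram_range; of the options below, only min_df takes effect.
pipe = Pipeline([
    ("cv", CountVectorizer(analyzer=split_into_lemmas, stop_words='english', ngram_range=(1, 2), min_df=6)),
    ("nb", MultinomialNB()),
])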
pipe.fit(sms_train,y_train)
print(len(pipe['cv'].vocabulary_))
print("Training Accuracy")
print(pipe.score(sms_train,y_train))
print("Testing Accuracy")
print(pipe.score(sms_test,y_test))
predicted = pipe.predict(sms_test)
print(confusion_matrix(y_test,predicted))
print(classification_report(y_test,predicted))
#scoresdt = cross_val_score(pipe,sms_train,y_train,cv=10)
#print(scoresdt)
#print("Average Cross Validation Accuracy")
#print(np.mean(scoresdt))

14426
Training Accuracy
0.8625066666666666
Testing Accuracy
0.83488
[[2770  417]
 [ 615 2448]]
              precision    recall  f1-score   support

           0       0.82      0.87      0.84      3187
           1       0.85      0.80      0.83      3063

    accuracy                           0.83      6250
   macro avg       0.84      0.83      0.83      6250
weighted avg       0.84      0.83      0.83      6250
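
One caveat about the pipeline above: scikit-learn documents that `stop_words` applies only when `analyzer='word'` and that `ngram_range` applies only when the analyzer is not a callable. The sketch below illustrates this on two made-up sentences, using a hypothetical `split_on_space` function as a stand-in for a custom analyzer such as `split_into_lemmas`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat"]


def split_on_space(text):
    # Hypothetical stand-in for a custom analyzer such as split_into_lemmas.
    return text.lower().split()


# With a callable analyzer, stop_words and ngram_range are not applied.
custom = CountVectorizer(analyzer=split_on_space, stop_words="english", ngram_range=(1, 2))
custom.fit(docs)

# With the built-in word analyzer, both settings take effect.
builtin = CountVectorizer(stop_words="english", ngram_range=(1, 2))
builtin.fit(docs)

print(sorted(custom.vocabulary_))   # still contains "the", no bigrams
print(sorted(builtin.vocabulary_))  # "the" removed, bigrams present
```

If stop-word removal or n-grams are wanted together with lemmatization, one option is to handle them inside the custom analyzer itself.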

