### This notebook illustrates the steps in building a Naive Bayes model for classifying IMDB Movie Review sentiment (negative/positive). 

In [3]:
import pandas as pd

#### LOAD DATASETS ####

train_data_file = "train.csv"
test_data_file = "test.csv"

# Import train and test dataset into data frames and print out their lengths
train_data_df = pd.read_csv(train_data_file)
test_data_df = pd.read_csv(test_data_file)
print ("Train set: ",len(train_data_df))
print ("Test set: ",len(test_data_df))

# print out top 5 rows of the train set
display(train_data_df.head(5))

Train set:  39998
Test set:  10001


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Count-based feature extraction using scikit-learn CountVectorizer

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# examine the features extracted using CountVectorizer for a small subset of instances
text = train_data_df.iloc[0:3]["review"]
print(text,"\n")
vectorizer = CountVectorizer(ngram_range = (1,1)) # adjust this ngram range to observe different n-grams

# tokenize and build vocabulary; the integer paired with each n-gram is its feature (column) index
vectorizer.fit(text)
print(vectorizer.vocabulary_,"\n")

# create feature vector representation
vector = vectorizer.transform(text)

# summarize vector shape: (number of documents, vocabulary size)
print(vector.shape,"\n") 

# complete vectors
print(vector.toarray())

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
Name: review, dtype: object 

{'one': 190, 'of': 187, 'the': 282, 'other': 196, 'reviewers': 227, 'has': 120, 'mentioned': 173, 'that': 280, 'after': 4, 'watching': 313, 'just': 147, 'oz': 199, 'episode': 79, 'you': 337, 'll': 159, 'be': 22, 'hooked': 128, 'they': 286, 'are': 13, 'right': 228, 'as': 16, 'this': 289, 'is': 140, 'exactly': 83, 'what': 318, 'happened': 118, 'with': 325, 'me': 172, 'br': 27, 'first': 94, 'thing': 287, 'struck': 267, 'about': 0, 'was': 311, 'its': 143, 'brutality': 29, 'and': 11, 'unflinching': 303, 'scenes': 234, 'violence': 309, 'which': 321, 'set': 243, 'in': 133, 'from': 100, 'word': 331, 'go': 108, 'trust': 299, 'not': 186, 'show': 250, 'for': 97, 'faint': 89, 'hearted': 123, 'or': 192, 'timid': 292, 'pulls': 217, 'no': 185, 'punches': 218, 'regards': 225, 'to': 293, 'drugs': 72, 'sex': 245

Interpret the feature vectors. What do they tell you about the individual examples?
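
One way to read a count vector is to pair each non-zero column with its vocabulary word. The sketch below redoes the counting by hand with the standard library, assuming `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, vocabulary indices in alphabetical order). The toy sentences and helper names are made up for illustration.

```python
import re
from collections import Counter

# Toy documents (hypothetical, not taken from the IMDB data)
docs = ["A wonderful little film", "The film was not wonderful, not at all"]

def tokenize(doc):
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, indexed in alphabetical order
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_tokens)}

def count_vector(doc):
    # One count per vocabulary word, in column order
    counts = Counter(tokenize(doc))
    return [counts[word] for word in all_tokens]

vectors = [count_vector(doc) for doc in docs]

# Read each row back as (word, count) pairs for its non-zero columns
for row in vectors:
    print([(all_tokens[i], c) for i, c in enumerate(row) if c > 0])
```

Each row therefore says which vocabulary words occur in that review and how often; word order and any single-character tokens such as "a" are lost.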

### Feature extraction using CountVectorizer for training and testing data on reviews


In [5]:
# use review for model building

train_text = train_data_df["review"]
test_text = test_data_df["review"]

# set the n-gram range
vectorizer = CountVectorizer(ngram_range = (1,1))

# create training data representation
train_data_cv = vectorizer.fit_transform(train_text.values.astype('U'))
print(train_data_cv.shape,"\n") 

# create test data representation
test_data_cv = vectorizer.transform(test_text.values.astype('U'))
print(test_data_cv.shape,"\n") 

print(test_data_cv)

(39998, 92513) 

(10001, 92513) 

  (0, 47)	1
  (0, 133)	1
  (0, 162)	1
  (0, 1667)	2
  (0, 1728)	2
  (0, 2096)	1
  (0, 3399)	1
  (0, 4031)	1
  (0, 4148)	4
  (0, 4462)	1
  (0, 4650)	1
  (0, 5370)	2
  (0, 5596)	3
  (0, 5959)	3
  (0, 7970)	1
  (0, 8128)	1
  (0, 8232)	1
  (0, 8318)	1
  (0, 8929)	1
  (0, 10989)	10
  (0, 12447)	2
  (0, 12547)	1
  (0, 13557)	1
  (0, 14485)	1
  (0, 14778)	1
  :	:
  (10000, 81880)	4
  (10000, 81899)	13
  (10000, 81960)	1
  (10000, 82018)	1
  (10000, 82072)	2
  (10000, 82168)	3
  (10000, 82595)	1
  (10000, 82848)	6
  (10000, 83557)	1
  (10000, 83960)	1
  (10000, 84402)	1
  (10000, 84471)	1
  (10000, 84974)	1
  (10000, 85612)	1
  (10000, 87960)	1
  (10000, 89162)	2
  (10000, 89690)	1
  (10000, 89701)	1
  (10000, 90077)	1
  (10000, 90141)	1
  (10000, 90257)	1
  (10000, 90620)	3
  (10000, 90786)	1
  (10000, 91083)	1
  (10000, 91847)	1


Note the difference between how training and test data are vectorized (`fit_transform` vs. `transform`). 
Why is this needed?

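#### `fit_transform` learns the vocabulary (the features) from the training set and encodes the training data with it
#### `transform` then applies that same vocabulary to the test set, so both matrices share the same columns and nothing is learned from the test data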
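
The split can be sketched without scikit-learn: the vocabulary is learned from the training texts only, and the test texts are then encoded against that fixed vocabulary. Words never seen in training are dropped, and both matrices end up with the same number of columns. The names `fit_vocabulary` and `transform` below are hypothetical stand-ins, not the library's implementation.

```python
import re

def tokenize(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

def fit_vocabulary(train_docs):
    # Learn the word -> column mapping from the training documents only
    words = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {word: col for col, word in enumerate(words)}

def transform(docs, vocabulary):
    # Encode documents against a fixed vocabulary; unknown words are ignored
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["great movie", "terrible movie"]    # toy data
test_docs = ["great soundtrack, great movie"]      # "soundtrack" never appears in training

vocab = fit_vocabulary(train_docs)
x_train_toy = transform(train_docs, vocab)
x_test_toy = transform(test_docs, vocab)

print(vocab)
print(x_test_toy)
```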

## Building a Naive Bayes model using unigram features (bag-of-words)

In [6]:
from sklearn.naive_bayes import MultinomialNB
# define features (x) and true labels (y) for the train and test sets
x_train = train_data_cv
y_train = train_data_df["sentiment"]
x_test = test_data_cv
y_test = test_data_df["sentiment"]

# build model on the training data
model = MultinomialNB()
model.fit(x_train, y_train)

# predict the labels for the test data
predictions = model.predict(x_test)
predictions

array(['negative', 'negative', 'negative', ..., 'negative', 'negative',
       'negative'], dtype='<U8')
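
For intuition about what `MultinomialNB` computes, the following is a minimal from-scratch sketch on toy data. It assumes add-one (Laplace) smoothing, which matches the library's default `alpha=1.0`. All names and sentences are illustrative, and this is not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical)
train = [
    ("a wonderful touching film", "positive"),
    ("wonderful acting and a great story", "positive"),
    ("a boring and awful film", "negative"),
    ("awful acting terrible story", "negative"),
]

class_doc_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)            # per-class word frequencies
for text, label in train:
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    # log P(label) + sum of log P(word | label), with add-alpha smoothing
    log_prob = math.log(class_doc_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word in vocab:                      # words unseen in training are skipped
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            log_prob += math.log(smoothed)
    return log_prob

def predict(text):
    # Pick the class with the highest (unnormalized) log posterior
    return max(class_doc_counts, key=lambda label: log_posterior(text, label))

print(predict("a great and touching story"))
print(predict("terrible boring film"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters for long reviews.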

### Saving and Loading the model

In [7]:
import pickle

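# save the model to disk; the with-block closes the file handle afterwards
filename = 'NaiveBayesModelBOW.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# some time later...
# load the model from disk (only unpickle files from a source you trust)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)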
result = loaded_model.score(x_test, y_test)
print(result)

0.8497150284971503


### Evaluating the model

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix 

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

Accuracy score:  0.8497150284971503
Individual label performance: 
              precision    recall  f1-score   support

    negative       0.83      0.88      0.85      4994
    positive       0.87      0.82      0.84      5007

    accuracy                           0.85     10001
   macro avg       0.85      0.85      0.85     10001
weighted avg       0.85      0.85      0.85     10001

[[4408  586]
 [ 917 4090]]
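
The per-class figures in the report can be re-derived from the confusion matrix above. This sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with classes in alphabetical order (negative, then positive).

```python
# Confusion matrix copied from the output above
# rows = true label, columns = predicted label, order = [negative, positive]
cm = [[4408, 586],
      [917, 4090]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy: {accuracy:.4f}")

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)   # column sum
    actually_label = sum(cm[i])                      # row sum (the "support")
    precision = true_pos / predicted_as_label
    recall = true_pos / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={actually_label}")
```

Read this way, the matrix shows 917 positive reviews predicted as negative against 586 negative reviews predicted as positive, which is why recall is lower for the positive class.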


### Inspecting the reviews that the model misclassified

In [9]:
labels = test_data_df["sentiment"]
inputs = test_data_df["review"]

# enumerate(inputs) yields (row position, review text) pairs, so `sample` is a tuple
for sample, prediction, label in zip(enumerate(inputs), predictions, labels):
    if prediction != label:
        print("Sample", 'has been classified as', prediction, 'and should be', label, sample) 
        print ("\n")

Sample has been classified as negative and should be positive (20, 'This movie is stuffed full of stock Horror movie goodies: chained lunatics, pre-meditated murder, a mad (vaguely lesbian) female scientist with an even madder father who wears a mask because of his horrible disfigurement, poisoning, spooky castles, werewolves (male and female), adultery, slain lovers, Tibetan mystics, the half-man/half-plant victim of some unnamed experiment, grave robbing, mind control, walled up bodies, a car crash on a lonely road, electrocution, knights in armour - the lot, all topped off with an incredibly awful score and some of the worst Foley work ever done.<br /><br />The script is incomprehensible (even by badly dubbed Spanish Horror movie standards) and some of the editing is just bizarre. In one scene where the lead female evil scientist goes to visit our heroine in her bedroom for one of the badly dubbed: "That is fantastical. I do not understand. Explain to me again how this is..." exposi

### Feature extraction using TfidfVectorizer 

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# this time, we vectorize using TF-IDF
text = train_data_df.iloc[0:3]["review"]
print(text,"\n")

tf = TfidfVectorizer()
tf.fit(text)

print(tf.vocabulary_,"\n")

# encode document
data = tf.transform(text)

# summarize encoded vector
print(data.shape,"\n") 
print(data.toarray())

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
Name: review, dtype: object 

{'one': 190, 'of': 187, 'the': 282, 'other': 196, 'reviewers': 227, 'has': 120, 'mentioned': 173, 'that': 280, 'after': 4, 'watching': 313, 'just': 147, 'oz': 199, 'episode': 79, 'you': 337, 'll': 159, 'be': 22, 'hooked': 128, 'they': 286, 'are': 13, 'right': 228, 'as': 16, 'this': 289, 'is': 140, 'exactly': 83, 'what': 318, 'happened': 118, 'with': 325, 'me': 172, 'br': 27, 'first': 94, 'thing': 287, 'struck': 267, 'about': 0, 'was': 311, 'its': 143, 'brutality': 29, 'and': 11, 'unflinching': 303, 'scenes': 234, 'violence': 309, 'which': 321, 'set': 243, 'in': 133, 'from': 100, 'word': 331, 'go': 108, 'trust': 299, 'not': 186, 'show': 250, 'for': 97, 'faint': 89, 'hearted': 123, 'or': 192, 'timid': 292, 'pulls': 217, 'no': 185, 'punches': 218, 'regards': 225, 'to': 293, 'drugs': 72, 'sex': 245
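
To see where the TF-IDF weights come from, here is a hand-rolled sketch. It assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The toy documents and helper names are invented for illustration.

```python
import math
import re
from collections import Counter

docs = ["good film", "good good plot", "bad film"]   # toy corpus

def tokenize(doc):
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: the number of documents each term appears in
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed IDF: terms that occur in fewer documents get a larger weight
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]             # L2-normalize so each row has unit length

for doc, toks in zip(docs, tokenized):
    print(doc, [round(v, 3) for v in tfidf_row(toks)])
```

Unlike the raw counts, these weights are real-valued and scaled per document, so a word shared by most reviews contributes less than a word that is specific to a few of them.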

### Feature extraction using TfidfVectorizer for training and testing data on reviews

In [11]:
train_data_tfidf = tf.fit_transform(train_text)
print(train_data_tfidf.shape,"\n") 

test_data_tfidf = tf.transform(test_text)
print(test_data_tfidf.shape,"\n") 

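idf = tf.idf_

# print out feature names (the words) and the IDF values
# get_feature_names_out() requires scikit-learn >= 1.0; older versions raise
# AttributeError and only provide get_feature_names(), so fall back to it if needed
get_names = getattr(tf, "get_feature_names_out", None) or tf.get_feature_names
print(dict(zip(get_names(), idf)))

(39998, 92513) 

(10001, 92513) 
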

## Building and evaluating a Naive Bayes model using the TF-IDF representation

In [None]:
# define features (x) and true labels (y) for the train and test sets
x_train = train_data_tfidf
y_train = train_data_df["sentiment"]
x_test = test_data_tfidf
y_test = test_data_df["sentiment"]


model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print ("Accuracy score: ", accuracy_score(y_test, predictions))
print ("Individual label performance: ")
print (classification_report(y_test, predictions))
print (confusion_matrix(y_test, predictions))

What can you say about the performance of Naive Bayes models created using count vectors and tf-idf vectors?

## Exploring EmoLex lexicon

You can try **EmoLex**'s interactive playground here: https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm under the heading **An Interactive Visualizer**. 

How can we incorporate the information in EmoLex in our classification models?
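
One possible approach, sketched here under stated assumptions, is to count how many words of each review fall into each emotion category and append those counts to the bag-of-words or TF-IDF vector as extra features. The `toy_lexicon` below is a hand-made stand-in, not actual EmoLex entries; the real lexicon would need to be downloaded and parsed into a similar word-to-categories mapping.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon in the spirit of EmoLex (NOT real EmoLex data):
# each word maps to the emotion/sentiment categories it is associated with
toy_lexicon = {
    "wonderful": {"joy", "positive"},
    "love": {"joy", "positive"},
    "awful": {"disgust", "negative"},
    "boring": {"negative"},
    "scary": {"fear", "negative"},
}
categories = sorted({cat for cats in toy_lexicon.values() for cat in cats})

def lexicon_features(review):
    # Count, per category, how many tokens of the review appear in the lexicon
    counts = Counter()
    for token in re.findall(r"\b\w\w+\b", review.lower()):
        counts.update(toy_lexicon.get(token, ()))
    return [counts[cat] for cat in categories]

print(categories)
print(lexicon_features("A wonderful film, I love it"))
print(lexicon_features("Boring, awful and not even scary"))
```

Because these features are non-negative counts, they remain valid input for `MultinomialNB`. They could be placed next to the sparse matrix, for example with `scipy.sparse.hstack`, before fitting the model.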