# Text Classification with Naive Bayes

The goal of this lab is to build a Naive Bayes model that classifies movie reviews as positive or negative,
and then to test the classifier on new movie reviews.

The dataset is from the following publication: "Thumbs up? Sentiment Classification using Machine Learning
Techniques". Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
Proceedings of EMNLP, pp. 79-86, 2002.
Similar datasets can be found on [this site](https://www.cs.cornell.edu/people/pabo/movie-review-data/).

Each review is stored in a separate text file. All the files are grouped into 2 subfolders: *pos* and *neg*.
The dataset can be downloaded from here: [movies_reviews](https://drive.google.com/file/d/1rAJqDC8p6b5RWwoUT-0HwsxWk-b3j8cR/view?usp=sharing).

- First, create a new Jupyter notebook and implement a module that reads the files and stores their contents in 2 string arrays, one per subfolder.
- Next, convert the words in each document into a vector of word occurrences.
You can use the code with stop words from the clustering demo, or you can use the `sklearn` module `feature_extraction.text`,
in particular the `CountVectorizer` (for this one you need to remove stop words) or the `TfidfTransformer`.
The latter assigns a score to each word based on its frequency across all the documents,
so words that occur in every document (typically the stop words) get the lowest weight: exactly zero under the textbook definition idf = log(N/df), and the minimum nonzero weight under the smoothed variant that `sklearn` uses by default. This makes stop-word removal less critical. You can find a nice explanation and an example of the tf-idf score [here](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3). Whichever vectorization technique you choose, explain it in your own words in a separate markdown cell in your notebook.
- Once you have a vector for each review, add the label *pos* or *neg*, depending on the directory (as we did in the cat/dog classification demo), and then divide the dataset into training and test sets.
- Now use the training set to build a Naive Bayes model. You can use the `sklearn` module `naive_bayes` from [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes) to accomplish this task. Carefully select the correct classifier for the data at hand by reading about the different classification options.
- Measure the accuracy on both the training and the test data. Try to reach an accuracy of at least 0.80.
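
The tf-idf weighting described in the list above can be checked on a tiny example. The sketch below uses a made-up three-document corpus (the documents are illustrative, not taken from the dataset) and `TfidfVectorizer` with its default settings. It shows that, with scikit-learn's default smoothed idf, a word that appears in every document gets the smallest idf value (1.0) rather than zero, while rarer words get larger values.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" and "was" occur in every document.
docs = [
    "the movie was great",
    "the movie was awful",
    "the plot was thin",
]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
vec.fit(docs)

# Map each vocabulary term to its learned idf weight.
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}

# scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
expected_great = math.log((1 + n_docs) / (1 + 1)) + 1  # "great" occurs in 1 document

for term in ("the", "movie", "great"):
    print(term, round(idf[term], 3))
```

Because the weight of ubiquitous words is small but not zero, passing `stop_words='english'` to the vectorizer can still be a reasonable choice.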

Finally, find 5 new movie reviews on the internet which include a numeric or star rating (known to be positive or negative), and try to classify them into positive/negative using your classifier. Report and discuss the results in a separate markdown cell.

For starters, you can look at a demo of text classification [here](https://heartbeat.fritz.ai/understanding-naive-bayes-its-applications-in-text-classification-part-1-ec9caea4baae).

In [19]:
import pandas as pd
import numpy as np
import os
from glob import glob
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import classification_report

In [20]:
files = glob('movies_reviews/*/*')
res = []
for file in files:
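    # The label is the parent folder name (pos or neg); os.path works
    # with both Windows and POSIX path separators.
    cl = os.path.basename(os.path.dirname(file))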
    with open(file, 'r') as f:
        res.append([cl, f.read()])
df = pd.DataFrame(res, columns=['sentiment', 'review'])


In [21]:
df.head()

|   | sentiment | review |
|---|-----------|--------|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


In [22]:
df['sentiment'].value_counts()

pos    1005
neg    1000
Name: sentiment, dtype: int64

In [23]:
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(df['review'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

(1604, 39373) (401, 39373)


In [24]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.72      0.85      0.78       196
         pos       0.82      0.68      0.75       205

    accuracy                           0.76       401
   macro avg       0.77      0.76      0.76       401
weighted avg       0.77      0.76      0.76       401



In [25]:
vec = CountVectorizer(stop_words='english', max_df=.99, min_df=0.01)
X = vec.fit_transform(df['review'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

nb = BernoulliNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

(1604, 4361) (401, 4361)
              precision    recall  f1-score   support

         neg       0.77      0.83      0.80       196
         pos       0.83      0.77      0.79       205

    accuracy                           0.80       401
   macro avg       0.80      0.80      0.80       401
weighted avg       0.80      0.80      0.80       401



In [26]:
vec = CountVectorizer(stop_words='english', max_df=.99, min_df=0.01)
X = vec.fit_transform(df['review'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

nb = BernoulliNB(alpha=1)
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

(1604, 4361) (401, 4361)
              precision    recall  f1-score   support

         neg       0.77      0.83      0.80       196
         pos       0.83      0.77      0.79       205

    accuracy                           0.80       401
   macro avg       0.80      0.80      0.80       401
weighted avg       0.80      0.80      0.80       401



In [27]:
vec = CountVectorizer(stop_words='english', max_df=.99, min_df=0.01, ngram_range=(1,3))
X = vec.fit_transform(df['review'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

nb = BernoulliNB(alpha=1)
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

(1604, 4894) (401, 4894)
              precision    recall  f1-score   support

         neg       0.77      0.83      0.80       196
         pos       0.83      0.76      0.79       205

    accuracy                           0.80       401
   macro avg       0.80      0.80      0.80       401
weighted avg       0.80      0.80      0.80       401



## TFIDF 

In [28]:
vec = TfidfVectorizer(stop_words='english')
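# GaussianNB needs dense input; toarray() returns a plain ndarray
# (todense() returns np.matrix, which recent scikit-learn versions reject).
X = vec.fit_transform(df['review']).toarray()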
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

(1604, 39373) (401, 39373)


In [29]:
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.64      0.59      0.62       196
         pos       0.64      0.68      0.66       205

    accuracy                           0.64       401
   macro avg       0.64      0.64      0.64       401
weighted avg       0.64      0.64      0.64       401



In [30]:
nb = BernoulliNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.72      0.85      0.78       196
         pos       0.82      0.68      0.75       205

    accuracy                           0.76       401
   macro avg       0.77      0.76      0.76       401
weighted avg       0.77      0.76      0.76       401
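
The report above is identical to the one obtained earlier with `BernoulliNB` on raw counts. A likely reason is that `BernoulliNB` binarizes its input (its `binarize` parameter defaults to `0.0`), so it only sees whether a word is present, not its count or tf-idf weight. The sketch below illustrates this on a made-up four-document corpus (the documents and labels are hypothetical), and also fits `MultinomialNB`, which is imported above but not used and which does take the feature values into account.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data, only for illustration.
docs = [
    "a great great movie with a wonderful cast",
    "wonderful story and great acting",
    "a boring movie with a terrible plot",
    "terrible acting and a boring story",
]
labels = ["pos", "pos", "neg", "neg"]
new_docs = ["great cast but boring boring boring plot"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
X_counts = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

# Both vectorizers tokenize the same way, so the nonzero pattern is identical.
same_pattern = np.array_equal(X_counts.toarray() > 0, X_tfidf.toarray() > 0)

# BernoulliNB thresholds features at 0, so both fits see the same 0/1 matrix.
bern_counts = BernoulliNB().fit(X_counts, labels)
bern_tfidf = BernoulliNB().fit(X_tfidf, labels)
proba_counts = bern_counts.predict_proba(count_vec.transform(new_docs))
proba_tfidf = bern_tfidf.predict_proba(tfidf_vec.transform(new_docs))
print("BernoulliNB probabilities match:", np.allclose(proba_counts, proba_tfidf))

# MultinomialNB uses the counts themselves (repeated words weigh more).
multi = MultinomialNB().fit(X_counts, labels)
print("MultinomialNB prediction:", multi.predict(count_vec.transform(new_docs)))
```

Whether `MultinomialNB` does better on the movie reviews is not shown in this notebook and would need to be measured on the same train/test split.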



In [31]:
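# Build the path with os.path.join so it works on any OS and avoids the invalid '\g' escape.
df_extra = pd.read_csv(os.path.join('github_reviews', 'gbReviewsSample.csv'))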
df_extra.head()

|   | Reviews | Sentiment |
|---|---------|-----------|
| 0 | Who would have thought that a movie about a ma... | pos |
| 1 | After realizing what is going on around us ...... | pos |
| 2 | I grew up watching the original Disney Cindere... | neg |
| 3 | David Mamet wrote the screenplay and made his ... | pos |
| 4 | Admittedly, I didn't have high expectations of... | neg |


In [32]:
vec = CountVectorizer(stop_words='english', max_df=.99, min_df=0.01)
X = vec.fit_transform(df['review'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, shuffle=True)
print(X_train.shape, X_test.shape)

nb = BernoulliNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))

(1604, 4361) (401, 4361)
              precision    recall  f1-score   support

         neg       0.77      0.83      0.80       196
         pos       0.83      0.77      0.79       205

    accuracy                           0.80       401
   macro avg       0.80      0.80      0.80       401
weighted avg       0.80      0.80      0.80       401



## Additional Data

In [33]:
X_extra = vec.transform(df_extra['Reviews'])
y_pred = nb.predict(X_extra)

### Prediction

In [34]:
y_pred

array(['neg', 'neg', 'neg', 'pos', 'neg'], dtype='<U3')

### Accuracy 

In [35]:
df_extra['Sentiment'].values

array(['pos', 'pos', 'neg', 'pos', 'neg'], dtype=object)
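
The predicted and true labels shown above can be compared directly to get an accuracy figure. The sketch below copies the two arrays from the cell outputs (so it runs on its own) and uses `accuracy_score` from `sklearn.metrics`. In the notebook itself, the same call could be made on `df_extra['Sentiment']` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Values copied from the two cell outputs above.
y_true = np.array(['pos', 'pos', 'neg', 'pos', 'neg'])
y_pred = np.array(['neg', 'neg', 'neg', 'pos', 'neg'])

# Fraction of reviews whose predicted label matches the true label.
acc = accuracy_score(y_true, y_pred)

# Positions of the reviews that were classified incorrectly.
misclassified = np.flatnonzero(y_true != y_pred)

print("accuracy:", acc)
print("misclassified rows:", misclassified)
```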

In [36]:
df_extra['Reviews'].iloc[0]

'Who would have thought that a movie about a man who drives a couple hundreds of miles on his lawn mower to see his brother, could possibly be good cinema? I certainly didn\'t. I thought I knew what to expect: one of the most boring experiences of my life. Well I was as wrong as I haven\'t been wrong too often yet, because this is one of the best, most realistic and honest Hollywood films I\'ve ever seen...<br /><br />Giving a short resume of "The Straight Story" isn\'t very difficult. It\'s about an old and stubborn man who steps on his lawn mower and drives off to another state to pay his brother a visit when he hears that the man has had a severe stroke. That\'s already special on itself, but what makes it even more special is the fact that he hasn\'t seen his brother in ten years because of some stupid argument. In the meantime he has his share of bad luck and problems, but he also meets a lot of people whose lives he influences in one way or another with his philosophical approach

### Test Results



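On the five new reviews, the best model misclassified two: the first two reviews are positive but were predicted as negative. One possible explanation is that these reviews mix negative and positive words, so the individual words do not reflect the overall sentiment. The first review, for example, contains phrases such as "most boring experiences" and "stupid argument" although it is favorable overall. Since the model's test accuracy is about 0.80 and the sample contains only five reviews, getting two out of five wrong is not surprising.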
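
As a rough sanity check of the reasoning above, the sketch below computes how likely it is to get at most three of five reviews right with a classifier whose accuracy is 0.80. It assumes that each review is classified correctly with probability 0.80, independently of the others, which is a simplification: the new reviews come from a different source than the training data, so the true accuracy on them may differ.

```python
from math import comb

n = 5     # number of new reviews
p = 0.80  # assumed per-review probability of a correct prediction


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k correct predictions out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of 3 or fewer correct predictions (i.e. at least 2 errors).
p_at_most_3 = sum(binom_pmf(k, n, p) for k in range(0, 4))
print(round(p_at_most_3, 5))
```

Under these assumptions the outcome observed here happens in roughly a quarter of such five-review samples, so it is compatible with a test accuracy of 0.80. A larger sample would be needed to estimate accuracy on new reviews reliably.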