
# Naive Bayes: Text Classification


The goal of this lab is to build a Naive Bayes model that classifies movie reviews as positive or negative, and then to test the classifier on new movie reviews.
The dataset, movies_reviews, can be downloaded from the README.

### 1. Preprocess the data

In [146]:
data_dir = "/Users/elaine/Desktop/ML2020labs/movies_reviews"

In [147]:
import os
# List the contents of the dataset directory
print(os.listdir(data_dir))

['.DS_Store', 'neg', 'pos', 'new_review']


In [148]:
pos_files = []
pos_path_files = data_dir +"/pos/"
for path in os.listdir(pos_path_files):
    if '.txt' in path:
        pos_files.append(os.path.join(pos_path_files, path))
        
neg_files = []
neg_path_files = data_dir +"/neg/"
for path in os.listdir(neg_path_files):
    if '.txt' in path:
        neg_files.append(os.path.join(neg_path_files, path))
        
len(neg_files), len(pos_files) 

(1000, 1005)

Here I collected the file paths into two lists, one per class.
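The listing step can also be wrapped in a small helper. The sketch below is one possible variant, not the code used in this lab: the function name `list_txt_files` is hypothetical, it matches on the file name ending in `.txt` (stricter than `'.txt' in path`), and it sorts the result because `os.listdir` returns names in arbitrary order. It runs on a temporary directory so it does not need the real dataset.

```python
import os
import tempfile

def list_txt_files(directory):
    """Return sorted full paths of the files in `directory` whose names end in .txt."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".txt")
    )

# Demo on a throwaway directory (the file names here are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.txt", "a.txt", ".DS_Store", "notes.txt.bak"]:
        open(os.path.join(tmp, name), "w").close()
    demo_files = [os.path.basename(p) for p in list_txt_files(tmp)]

print(demo_files)  # ['a.txt', 'b.txt']
```

Sorting makes the order of the reviews reproducible across machines, which matters once labels are assigned by position.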


In [149]:
# Read the first 1000 reviews of each class (pos has 1005 files, so 5 are unused)
pos_review = ['' for i in range(1000)]
neg_review = ['' for i in range(1000)]
for i in range(1000):
    with open(pos_files[i], "r") as f:
        pos_review[i] = f.read()

for i in range(1000):
    with open(neg_files[i], "r") as f:
        neg_review[i] = f.read()

# print(pos_review[0])
# print(neg_review[0])

Now I have two lists, pos_review and neg_review.
Both of them store the review texts.
Each element in a list is the content of one file.
Next I want to preprocess the data.
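Reading the files can be sketched as a reusable function. This is a minimal illustration rather than the lab's own code: `read_reviews` is a hypothetical name, and `encoding="utf-8"` is an assumption about the dataset files. The demo writes three made-up reviews to a temporary directory and reads back the first two.

```python
import os
import tempfile

def read_reviews(paths, limit=None):
    """Read up to `limit` files and return their contents as a list of strings."""
    reviews = []
    for path in paths[:limit]:
        # The context manager closes the file even if reading fails.
        with open(path, "r", encoding="utf-8") as f:  # encoding is an assumption
            reviews.append(f.read())
    return reviews

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["great film", "dull plot", "fine acting"]):
        path = os.path.join(tmp, f"review{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    demo_reviews = read_reviews(paths, limit=2)

print(demo_reviews)  # ['great film', 'dull plot']
```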

In [150]:
import re
import string

In [151]:
# Preprocess the data, since CountVectorizer doesn't remove all the characters we don't want
for i in range(1000):
    review = pos_review[i]
    review = review.lower()
    # Remove standalone numbers. Note: replacing with '' joins the neighboring words,
    # whereas the neg loop below replaces with ' '
    review = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", '', review)
    review = re.sub(r'[^\w\s]', '', review)  # remove punctuation
    review = review.strip()
    review = review.replace("_", "")  # \w keeps underscores, so drop them here
    pos_review[i] = review

for i in range(1000):
    review = neg_review[i]
    review = review.lower()
    review = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", ' ', review)
    review = re.sub(r'[^\w\s]', '', review)
    review = review.strip()
    review = review.replace("_", "")
    neg_review[i] = review
    
# this also works
# review = neg_review[i] 
# review = re.sub(r"""\w*\d\w*""", '', review)
# punc_lower = re.sub('[%s]' % re.escape(string.punctuation), '', review.lower())
# review = neg_review[i] 
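To see what each cleaning step does, the same regular expressions can be applied to a single sentence. The helper name `clean_review`, its `number_fill` parameter, and the sample sentence are all illustrative and not part of the lab code. The second call shows how replacing numbers with an empty string glues the neighboring words together.

```python
import re

def clean_review(text, number_fill=" "):
    """Lowercase, drop standalone numbers, punctuation and underscores."""
    text = text.lower()
    # Standalone numbers: at the start, between whitespace, or at the end.
    text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", number_fill, text)
    # Punctuation: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # \w also matches underscores, so they are removed separately.
    text = text.replace("_", "")
    return text.strip()

sample = "I saw it 3 times, a must_see movie!"
with_space = clean_review(sample)
with_empty = clean_review(sample, number_fill="")

print(with_space)  # i saw it times a mustsee movie
print(with_empty)  # i saw ittimes a mustsee movie
```

Keeping the cleaning in one function also guarantees that training, test, and new reviews are all processed identically.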


Split Dataset into Test and Training Set

In [152]:
Y_train = ['pos']*800 + ['neg']*800
Y_test = ['pos']*200 + ['neg']*200
X_train = pos_review[:800] + neg_review[:800]
X_test = pos_review[800:] + neg_review[800:]
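The hard-coded 800/200 split relies on both classes having exactly 1000 reviews. A sketch that derives the label lists from the actual list lengths is shown below; `split_by_class` is a hypothetical helper and the toy lists stand in for the real reviews.

```python
def split_by_class(pos, neg, n_train):
    """Take the first n_train items of each class for training, the rest for testing."""
    X_train = pos[:n_train] + neg[:n_train]
    X_test = pos[n_train:] + neg[n_train:]
    # Labels are built from the real slice lengths so they always line up with X.
    y_train = ["pos"] * len(pos[:n_train]) + ["neg"] * len(neg[:n_train])
    y_test = ["pos"] * len(pos[n_train:]) + ["neg"] * len(neg[n_train:])
    return X_train, X_test, y_train, y_test

toy_pos = ["p1", "p2", "p3", "p4", "p5"]
toy_neg = ["n1", "n2", "n3", "n4", "n5"]
Xtr, Xte, ytr, yte = split_by_class(toy_pos, toy_neg, n_train=4)

print(len(Xtr), len(Xte))  # 8 2
print(Xte, yte)            # ['p5', 'n5'] ['pos', 'neg']
```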

In [153]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer from sklearn, dropping English stop words
Cvectorizer = CountVectorizer(stop_words='english')
# Learn the vocabulary on the training set only, then reuse it for the test set
X_train = Cvectorizer.fit_transform(X_train)
X_test = Cvectorizer.transform(X_test)

# Print the dimensions of the processed data to check that the code ran successfully
print(X_train.shape)
# print(Cvectorizer.get_feature_names())
# X_train.toarray()

(1600, 43021)


I used CountVectorizer. Cvectorizer.fit_transform(X_train) turns X_train from a list of texts into a matrix in which each row is a vector of word counts for one review.
For example, given the raw texts:
['i like apple', 'you like banana', 'he likes apple']
CountVectorizer with its default settings gives:
[[1,0,0,1,0,0]
[0,1,0,1,0,1]
[1,0,1,0,1,0]]
where the feature names (CountVectorizer.get_feature_names(), or get_feature_names_out() in newer scikit-learn versions) are
['apple','banana','he','like','likes','you']
The vocabulary is sorted alphabetically, 'i' is dropped because the default tokenizer only keeps tokens of two or more characters, and 'like' and 'likes' count as different words.
CountVectorizer also does some preprocessing of its own, such as lowercasing the text and splitting it on punctuation and whitespace.
It can also exclude stop words. In this lab I chose the built-in 'english' list: Cvectorizer = CountVectorizer(stop_words='english')
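The counting itself can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not scikit-learn's implementation: it assumes the default behavior of lowercasing, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically, and it leaves out stop-word removal. The function name `bag_of_words` is hypothetical.

```python
import re

# Tokens of two or more word characters, mirroring the assumed default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(texts):
    """Return (vocabulary, count matrix) for a list of texts."""
    tokenized = [TOKEN_PATTERN.findall(text.lower()) for text in texts]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(
    ['i like apple', 'you like banana', 'he likes apple']
)
print(vocabulary)  # ['apple', 'banana', 'he', 'like', 'likes', 'you']
print(matrix)      # [[1, 0, 0, 1, 0, 0], [0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]
```

Each column corresponds to one vocabulary word, so the 43021 columns reported above mean 43021 distinct tokens were found in the training reviews.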

### 2. Build Models

I tried both of the models that I think can be used for text classification.
The first one is CategoricalNB and the other one is MultinomialNB.

In [154]:
from sklearn.naive_bayes import CategoricalNB
from sklearn import metrics

clf1 = CategoricalNB()
# Convert the sparse count matrix to a dense array before fitting
X_train = X_train.toarray()
clf1.fit(X_train, Y_train)
# Predictions are made on the training data, so this is training accuracy
Y_predict = clf1.predict(X_train)

print("Accuracy for CategoricalNB", metrics.accuracy_score(Y_train, Y_predict))

Accuracy for CategoricalNB 0.976875


In [155]:
from sklearn.naive_bayes import MultinomialNB

clf2 = MultinomialNB().fit(X_train, Y_train)
# Predictions are made on the training data, so this is training accuracy
Y_predict = clf2.predict(X_train)
print("Accuracy for MultinomialNB", metrics.accuracy_score(Y_train, Y_predict))

Accuracy for MultinomialNB 0.981875
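For intuition about what a multinomial Naive Bayes model computes, here is a from-scratch sketch on toy data. It is not scikit-learn's implementation: the function names, the toy reviews, and the choice to ignore words outside the training vocabulary are all assumptions made for illustration. It uses add-one (Laplace) smoothing, which corresponds to `alpha=1.0`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class names."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # log P(class)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # log P(word | class) with additive smoothing
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

toy_docs = [["great", "fun", "great"], ["boring", "slow"],
            ["fun", "moving"], ["slow", "dull", "boring"]]
toy_labels = ["pos", "neg", "pos", "neg"]
prior, likelihood = train_multinomial_nb(toy_docs, toy_labels)

print(predict(["great", "fun"], prior, likelihood))  # pos
print(predict(["dull", "slow"], prior, likelihood))  # neg
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together, which matters for long reviews.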


### 3. Test the Models

Both models achieve high accuracy on the training data, so I tried both in the testing part.

In [156]:
# X_test = X_test.toarray()
# prediction_test1 = clf1.predict(X_test[0])
# score1 = metrics.accuracy_score(Y_test, prediction_test1)
# print('Total accuracy classification score with CategoricalNB: {}'.format(score1))

However, I found that this model (CategoricalNB) does not work in this situation, so the cell above is left commented out.
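The error itself is not shown here, so the following is only an assumption about the cause. A categorical model treats each word count (0, 1, 2, ...) as a separate category and learns one probability per category seen in training. If a test review contains a word more often than any training review did, there is no probability to look up. The toy sketch below illustrates that idea; it is conceptual and does not reproduce scikit-learn's internals.

```python
# Counts of one word across five training reviews (made-up numbers).
train_counts = [0, 1, 2, 1, 0]

# Categories 0, 1 and 2 were seen during training.
n_categories = max(train_counts) + 1

# One placeholder probability slot per category seen in training.
prob_table = [1.0 / n_categories] * n_categories

def lookup(count):
    """Return the stored probability for a count category."""
    return prob_table[count]

seen_ok = lookup(2)  # a category seen in training works

try:
    lookup(5)        # a count never seen in training has no slot
    unseen_failed = False
except IndexError:
    unseen_failed = True

print(n_categories, unseen_failed)  # 3 True
```

MultinomialNB does not have this limitation because it models word frequencies directly instead of treating each count value as its own category.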

In [157]:
prediction_test2 = clf2.predict(X_test)
score2 = metrics.accuracy_score(Y_test, prediction_test2)
print('Total accuracy classification score with MultinomialNB: {}'.format(score2))

Total accuracy classification score with MultinomialNB: 0.8275
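An accuracy of 0.8275 on 400 test reviews corresponds to 331 correct predictions, noticeably below the training accuracy of about 0.98. Accuracy alone does not show which class the errors fall in. The sketch below computes accuracy and per-class confusion counts by hand; the labels and predictions are made up for illustration and are not the lab's results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Toy labels and predictions, not taken from the lab.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

acc = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)

print(round(acc, 4))            # 0.6667
print(counts[("pos", "neg")])   # 1 positive review predicted as negative
print(counts[("neg", "pos")])   # 1 negative review predicted as positive
```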


### 4. New Data

I added 5 new reviews of one of my favorite movies, Flipped.

In [158]:
new_files = []
new_path_files = data_dir +"/new_review/"
for path in os.listdir(new_path_files):
    if '.txt' in path:
        print(path)
        new_files.append(os.path.join(new_path_files, path))
        
new_review = ['' for i in range(5)]
for i in range(5):
    with open(new_files[i], "r") as f:
        new_review[i] = f.read()

# Apply the same preprocessing steps as for the training reviews
for i in range(5):
    review = new_review[i]
    review = review.lower()
    review = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", '', review)
    review = re.sub(r'[^\w\s]', '', review)
    review = review.strip()
    review = review.replace("_", "")
    new_review[i] = review

review3.txt
review2.txt
review1.txt
review5.txt
review4.txt


In [159]:
X_new = Cvectorizer.transform(new_review) 
prediction_new = clf2.predict(X_new)
Y_new = ['pos','pos','pos','neg','pos']
score3 = metrics.accuracy_score(Y_new, prediction_new)
print('Total accuracy classification score with MultinomialNB: {}'.format(score3))
print(prediction_new)
print(Y_new)

Total accuracy classification score with MultinomialNB: 1.0
['pos' 'pos' 'pos' 'neg' 'pos']
['pos', 'pos', 'pos', 'neg', 'pos']


All of the reviews are classified correctly!
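One caveat: `os.listdir` returns file names in arbitrary order (here review3, review2, review1, review5, review4), so a hard-coded label list has to match that order exactly. A safer pattern is to key the labels by file name. The sketch below uses hypothetical file names and labels purely for illustration.

```python
import os

# Hypothetical labels keyed by file name (placeholders, not the lab's files).
labels_by_name = {
    "review_a.txt": "pos",
    "review_b.txt": "neg",
    "review_c.txt": "pos",
}

def labels_in_order(paths, labels_by_name):
    """Return the label for each path, in the same order as the paths."""
    return [labels_by_name[os.path.basename(p)] for p in paths]

# Paths in whatever order the directory listing happened to return them.
paths = [
    "new_review/review_c.txt",
    "new_review/review_a.txt",
    "new_review/review_b.txt",
]
ordered = labels_in_order(paths, labels_by_name)

print(ordered)  # ['pos', 'pos', 'neg']
```

With only five reviews, a perfect score is encouraging but says little about general performance; the 400-review test set gives the more reliable estimate.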