## Naive Bayes Classifier for Sentiment Analysis


We create a naive Bayes classifier to perform binary sentiment analysis on movie reviews. We use Pang & Lee’s (2005) polarity dataset 2.0, which consists of 1000 positive and 1000 negative movie reviews. Note that these reviews have already been pre-processed so that tokenization has already been done. Each review is in its own file, each sentence is on its own line, and each token is followed by a space. Specifically:

construct a baseline naive Bayes classifier that uses only words as features. This is the type of naive Bayes classifier we discussed in class, which is equivalent to a unigram language model for each class (positive and negative). To construct it, we get the frequencies of every word in the vocabulary in each class, add 1 to all word frequencies in each class (Laplace smoothing), and then normalize each frequency by dividing by the total to get a probability distribution. Finally, take the log of each probability to get a log-probability of each word for each class. [Note that this might be easiest using a vector library like python’s numpy.] To apply the naive Bayes classifier, We calculate the log probability of a test review under each class by summing the log probabilities of each word in the review. (Ignore new words in the test reviews that weren’t in training, since the models don’t assign them a probability.) Finally, classify according to whichever log probability is higher. (Recall that log probabilities are always negative, and -2 is higher than -4.)

### Set Up Algorithm Function

In [96]:
import string
import numpy as np

In [97]:
#Model Training
def modelTrain(string_pos,string_neg):
    token_dict = {}
    # read positive reviews
    for word in string_pos.split():
        # skip punctuations
        if word in string.punctuation:
            continue
        
        # update dictionary
        if word not in token_dict.keys():
            token_dict[word] = np.array([1,1])
        else:
            token_dict[word] += np.array([1,0])
    
    # read negative reviews
    for word in string_neg.split():
        # skip punctuations
        if word in string.punctuation:
            continue
        
        # update dictionary
        if word not in token_dict.keys():
            token_dict[word] = np.array([1,1])
        else:
            token_dict[word] += np.array([0,1])
    
    return token_dict




In [98]:
#Log Transformation
def modelLogTransform(token_dict):
    # transform count into log_prob
    prob_dict = {}
    count = np.array([0,0])
    for word in token_dict.keys(): 
        count += token_dict[word]
        
    for word in token_dict.keys():
        prob_dict[word] = np.log( token_dict[word]/count )
    
    return prob_dict


In [99]:
# Model Prediction
def modelPredict(string_input,prob_dict):
    log_output = np.array([0.,0.])
    
    for word in string_input.split():
        if word in string.punctuation:
            continue
        
        if word in prob_dict.keys():
            log_output += prob_dict[word]
    
    pred_class = 'Positive' if log_output[0] > log_output[1] else 'Negative'
    
    return pred_class,log_output #two D nested array



Train on the first 100 examples in each class (those with filenames beginning with cv0), classify the 200 reviews with filenames that begin with cv6 and cv7 and report performance: precision, recall, and F score. Then, do the same for training on the first 300 examples, the first 500 examples, and finally, the first 600 examples (cv0 to cv5), in each case testing on those that begin with cv6 and cv7. 

### Train Using First 100 Reviews


In [83]:
# train model using first 100 positive and negative reviews, aka files starting with 'cv0'

import os
#specify path to list of documents:
dirpath_pos = 'review_polarity/txt_sentoken/pos/'
dirpath_neg = 'review_polarity/txt_sentoken/neg/'

In [94]:
#read first 100 positive and negative reviews and join into two long string of positive reviews and negative reviews
long_string_pos=''
for document in os.listdir(dirpath_pos):
    if document[:3]=='cv0':
        #print(document)
        review=open(dirpath_pos+document, 'r').read()
        long_string_pos=long_string_pos+' '+review
print(long_string_pos)  

long_string_neg=''
for document in os.listdir(dirpath_neg):
    if document[:3]=='cv0':
        #print(document)
        review=open(dirpath_neg+document, 'r').read()
        long_string_neg=long_string_neg+' '+review
print(long_string_neg)  

 films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as

In [100]:
#train model to obtain token_dict from first 100 positive/negative review

token_dict1=modelTrain(long_string_pos, long_string_neg)

#logtransform the token_dict to get prob_dict:

prob_dict1=modelLogTransform(token_dict1)

### Testing of Model 1

In [111]:
#test model using 'cv6' and 'cv7'
#positive reviews for testing:
list_of_review_pos=[]
for document in os.listdir(dirpath_pos):
    if (document[:3]=='cv6') | (document[:3]=='cv7'):
        #print(document)
        review=open(dirpath_pos+document, 'r').read()
        list_of_review_pos.append(review)
#print(list_of_review_pos) 

#negative reviews for testing:
list_of_review_neg=[]
for document in os.listdir(dirpath_neg):
    if (document[:3]=='cv6') | (document[:3]=='cv7'):
        #print(document)
        review=open(dirpath_neg+document, 'r').read()
        list_of_review_neg.append(review)
#print(list_of_review_neg) 

In [115]:
#collect list of predicted class for positive reviews and negative reviews:
list_of_prediction_pos=[]
for review in list_of_review_pos:
    list_of_prediction_pos.append(modelPredict(review, prob_dict1)[0])

print(list_of_prediction_pos)

list_of_prediction_neg=[]
for review in list_of_review_neg:
    list_of_prediction_neg.append(modelPredict(review, prob_dict1)[0])

print(list_of_prediction_neg)

['Positive', 'Negative', 'Negative', 'Negative', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Negative', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Po

In [124]:
#model accuracy: precision, recall, F score
tp=list_of_prediction_pos.count('Positive')
tn=list_of_prediction_neg.count('Negative')
fp=list_of_prediction_neg.count('Positive')
fn=list_of_prediction_pos.count('Negative')

precision1 = tp / (tp + fp)

recall1 = tp / (tp + fn)

F_score1=2*((precision1*recall1)/(precision1+recall1))

print('precision: ', precision1)
print('recall: ', recall1)
print('F-score: ', F_score1)

precision:  0.7207207207207207
recall:  0.8
F-score:  0.7582938388625592


### Train Model Using First 600 Reviews

In [None]:
# train model using first 600 positive and negative reviews, aka files starting with 'cv0' to 'cv5'

import os
#specify path to list of documents:
dirpath_pos = 'review_polarity/txt_sentoken/pos/'
dirpath_neg = 'review_polarity/txt_sentoken/neg/'

In [130]:
#read first 100 positive and negative reviews and join into two long string of positive reviews and negative review
long_string_pos=''
list_of_review_idx=['cv0','cv1','cv2','cv3','cv4','cv5']
for document in os.listdir(dirpath_pos):
    if any(idx == document[:3] for idx in list_of_review_idx):
        #print(document)
        review=open(dirpath_pos+document, 'r').read()
        long_string_pos=long_string_pos+' '+review
print(long_string_pos)  

long_string_neg=''
list_of_review_idx=['cv0','cv1','cv2','cv3','cv4','cv5']
for document in os.listdir(dirpath_neg):
    if any(idx == document[:3] for idx in list_of_review_idx):
        #print(document)
        review=open(dirpath_neg+document, 'r').read()
        long_string_neg=long_string_neg+' '+review
print(long_string_neg)  


 films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
getting the hughes brothers to direct this seems almost as

In [132]:
#train model to obtain token_dict from first 600 positive/negative review

token_dict2=modelTrain(long_string_pos, long_string_neg)

#logtransform the token_dict to get prob_dict:

prob_dict2=modelLogTransform(token_dict2)

### Testing Model 2

In [133]:
#test model using 'cv6' and 'cv7'
#positive reviews for testing:
list_of_review_pos=[]
for document in os.listdir(dirpath_pos):
    if (document[:3]=='cv6') | (document[:3]=='cv7'):
        #print(document)
        review=open(dirpath_pos+document, 'r').read()
        list_of_review_pos.append(review)
#print(list_of_review_pos) 

#negative reviews for testing:
list_of_review_neg=[]
for document in os.listdir(dirpath_neg):
    if (document[:3]=='cv6') | (document[:3]=='cv7'):
        #print(document)
        review=open(dirpath_neg+document, 'r').read()
        list_of_review_neg.append(review)
#print(list_of_review_neg) 

In [134]:
#collect list of predicted class for positive reviews and negative reviews:
list_of_prediction_pos=[]
for review in list_of_review_pos:
    list_of_prediction_pos.append(modelPredict(review, prob_dict2)[0])

print(list_of_prediction_pos)

list_of_prediction_neg=[]
for review in list_of_review_neg:
    list_of_prediction_neg.append(modelPredict(review, prob_dict2)[0])

print(list_of_prediction_neg)

['Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Negative', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Positive', 'Po

In [135]:
#model accuracy: precision, recall, F score
tp=list_of_prediction_pos.count('Positive')
tn=list_of_prediction_neg.count('Negative')
fp=list_of_prediction_neg.count('Positive')
fn=list_of_prediction_pos.count('Negative')

precision1 = tp / (tp + fp)

recall1 = tp / (tp + fn)

F_score1=2*((precision1*recall1)/(precision1+recall1))

print('precision: ', precision1)
print('recall: ', recall1)
print('F-score: ', F_score1)

precision:  0.8104265402843602
recall:  0.855
F-score:  0.8321167883211679
