In [3]:
import pandas as pd
import os
import math
import random
import numpy as np

# Naive Bayes Background

Bayes' rule (for document d and class c) is:
$$
    P(c|d) = \frac{P(d|c) P(c)}{P(d)}
$$

The denominator $P(d)$ is the same for every class, so it can be dropped when comparing classes. The rule then becomes a proportionality:
$$
    P(c|d) \propto P(d|c)P(c)
$$

$C_{MAP}$ (Maximum A Posteriori) is the most likely class, and can be calculated using:

$$
    C_{MAP} = \operatorname{argmax}_{c}[ P(d|c)P(c) ]
$$

In this example, each document is a review. 
The features of this document will be the words contained in the review. A review with n words will have the following probability:
$$
    P(c|d) \propto P(x_{1}, x_{2}, \ldots, x_{n}|c)P(c)
$$

Since independence is assumed with Naive Bayes, this formula can be simplified to:

$$
    P(c|d) \propto P(x_{1}|c)P(x_{2}|c) \cdots P(x_{n}|c)P(c)
$$

Some reviews have hundreds of words, so the product of these probabilities can become too small to represent accurately. We therefore add the log probabilities instead. Because the logarithm is monotonic, this does not affect the ranking of the classes. We will use the following formula:


$$
    C_{MAP} = \operatorname{argmax}_{c_{j}}\left[ \log P(c_{j}) + \sum_i \log P(x_{i}|c_{j}) \right]
$$
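To illustrate why the log form is used, here is a minimal sketch with made-up numbers. The list `word_probabilities` is a hypothetical stand-in for the per-word likelihoods of a 400-word review; it is not taken from the dataset.

```python
import math

# Hypothetical per-word likelihoods for a 400-word review (illustrative values only)
word_probabilities = [1e-4] * 400

# Multiplying the probabilities directly underflows to 0.0 in double precision
product = 1.0
for p in word_probabilities:
    product *= p

# Summing the logs keeps the score finite, so classes can still be ranked
log_score = sum(math.log(p) for p in word_probabilities)

print(product)    # 0.0
print(log_score)  # a large negative but finite number, about -3684.1
```

Once the product has underflowed, every class would score 0.0 and no comparison would be possible, whereas the log scores remain comparable.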

### Notations and Formulas

$$
P(c_{j}) = \frac{\text{number of documents in class }c_{j}}{\text{total number of documents in all classes}} = \text{What proportion of all documents belongs to }c_{j}?
$$

$$
    P(w_{j}|c_{i}) = \frac{\text{number of times word }w_{j}\text{ appears in class }c_{i}}{\text{total number of words that appear in class }c_{i}} = \text{What proportion of all words in }c_{i}\text{ is }w_{j}?
$$
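The two estimates can be sketched on a toy corpus. Everything in this block (the `toy_training` data and the helper names `class_prior` and `word_likelihood`) is hypothetical and only meant to make the counting concrete. No smoothing is applied here.

```python
from collections import Counter

# Hypothetical toy corpus: a few tiny "reviews" per class
toy_training = {
    "pos": ["great film", "great acting great plot"],
    "neg": ["bad film"],
}

def class_prior(cls, training):
    # Documents in the class divided by documents in all classes
    total_docs = sum(len(docs) for docs in training.values())
    return len(training[cls]) / total_docs

def word_likelihood(word, cls, training):
    # Occurrences of the word in the class divided by all word occurrences in the class
    counts = Counter(w for doc in training[cls] for w in doc.split())
    return counts[word] / sum(counts.values())

print(class_prior("pos", toy_training))               # 2 of 3 documents are positive
print(word_likelihood("great", "pos", toy_training))  # 3 of the 6 positive words
```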

### Steps Taken to Improve Performance

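* <b>Smoothing the data</b>: Although we have a large dataset, a word in a test review may never appear in the training data for one of the classes. Without smoothing, such a word would have a probability of 0, and its log would be undefined. A word that appears in neither class's training data receives the same count of 1 under both classes, so it has little effect on the comparison. We use a Laplace-style smoothing, which prevents zero probabilities by adding 1. Note that the denominator used here adds 1, whereas the standard form of Laplace smoothing adds the vocabulary size $|V|$:

$$ 
    P(w_{j}|c_{i}) = \frac{(\text{number of times word }w_{j}\text{ appears in class }c_{i}) + 1}{(\text{total number of words that appear in class }c_{i}) + 1} 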
$$

* <b>Ignoring punctuation</b>: The reviews have been tokenized so that each feature (word or punctuation mark) is separated by whitespace. This allows punctuation to be skipped easily with a simple 'if' clause. 

I chose not to remove stop words (such as 'the' or 'a'), as doing so has been shown to have little or no benefit to the performance of a Naive Bayes classifier.
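The two steps above can be sketched as follows. The helper names `smoothed_log_likelihood` and `is_punctuation`, the `vocabulary_size` parameter and the toy counts are all assumptions made for illustration. By default the denominator adds 1, as in this notebook's formula. Passing a vocabulary size gives the more common textbook variant.

```python
import math
import string

def smoothed_log_likelihood(word, counts, vocabulary_size=None):
    # counts maps each word to its frequency within one class
    total = sum(counts.values())
    # Default: add 1 to the denominator, as in the formula used in this notebook.
    # With vocabulary_size set, the denominator adds |V| instead (textbook add-one smoothing).
    denominator = total + (1 if vocabulary_size is None else vocabulary_size)
    return math.log((counts.get(word, 0) + 1) / denominator)

def is_punctuation(token):
    # True only for non-empty tokens made up entirely of ASCII punctuation
    return bool(token) and all(ch in string.punctuation for ch in token)

toy_counts = {"good": 3, "film": 1}  # hypothetical class counts
print(smoothed_log_likelihood("unseen", toy_counts))  # finite, no log(0) error
print(is_punctuation("--"), is_punctuation("lindo's"))
```

Checking every character, rather than comparing against a fixed list, also catches tokens such as `--` or `!`, while leaving words like `lindo's` untouched.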


# Importing and Concatenating Reviews

In [4]:
pos_directory = '../Part_3/review_polarity/txt_sentoken/pos'
positive_reviews_list = []

for filename in os.listdir(pos_directory):
    file_path = os.path.join(pos_directory, filename)
    # Use a context manager so each file is closed after it is read
    with open(file_path, "r") as file:
        review = [line.rstrip() for line in file]
    # Join the lines of one review into a single string
    positive_reviews_list.append(" ".join(review))

In [5]:
neg_directory = '../Part_3/review_polarity/txt_sentoken/neg'
negative_reviews_list = []

for filename in os.listdir(neg_directory):
    file_path = os.path.join(neg_directory, filename)
    # Use a context manager so each file is closed after it is read
    with open(file_path, "r") as file:
        review = [line.rstrip() for line in file]
    # Join the lines of one review into a single string
    negative_reviews_list.append(" ".join(review))

# Assigning Important Variables

In [6]:
## Assigning the test and training data
negative_training_data = negative_reviews_list[:900]
positive_training_data = positive_reviews_list[:900]
negative_test_data = negative_reviews_list[900:]
positive_test_data = positive_reviews_list[900:]

# Creating Model Functions & Running Model

In [7]:
def create_frequency_dictionary(list_of_reviews):
    review_dictionary = {}
    for review in list_of_reviews:
        words = review.split()
        for word in words:
            if word in review_dictionary:
                review_dictionary[word] += 1
            else:
                review_dictionary[word] = 1
    return review_dictionary

In [8]:
def result_for_class(review, training_data):
    # Punctuation tokens are skipped when scoring a review
    punctuation = ['.', ',', ':', '"', '&', '?', '-', '(', ')', "'", '/']
    # Log prior: both classes have the same number of training reviews
    class_result = math.log(0.5)
    dictionary = create_frequency_dictionary(training_data)
    # Compute the total word count once instead of once per word
    total_words = sum(dictionary.values())
    for word in review.split():
        if word in punctuation:
            continue
        # Add-one smoothing: a word missing from the dictionary counts as 0 + 1
        class_result += math.log((dictionary.get(word, 0) + 1) / (total_words + 1))
    return class_result
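Because `result_for_class` rebuilds the frequency dictionary from the training data on every call, scoring many reviews repeats the same work. One possible variant, sketched here with the hypothetical helpers `build_class_model` and `score_review`, builds the counts once per class and reuses them. It is intended to produce the same scores as the function above, but it is not the code that generated the results in this notebook.

```python
import math

PUNCTUATION = {'.', ',', ':', '"', '&', '?', '-', '(', ')', "'", '/'}

def build_class_model(list_of_reviews):
    # Count every token once, and cache the total so it is not recomputed per word
    counts = {}
    for review in list_of_reviews:
        for word in review.split():
            counts[word] = counts.get(word, 0) + 1
    return counts, sum(counts.values())

def score_review(review, model, log_prior=math.log(0.5)):
    counts, total_words = model
    score = log_prior
    for word in review.split():
        if word in PUNCTUATION:
            continue
        # Same add-one smoothing as result_for_class
        score += math.log((counts.get(word, 0) + 1) / (total_words + 1))
    return score

# Usage with made-up training data
toy_model = build_class_model(["good film .", "good"])
print(score_review("good movie .", toy_model))
```

In the notebook, the models would be built once from `positive_training_data` and `negative_training_data` and then passed to `score_review` inside the evaluation loops.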

In [7]:
accuracy = 0
total = 0

for review in positive_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result > negative_result:
        accuracy += 1
    total += 1

In [8]:
print("The model's accuracy in identifying positive reviews is: " + str(accuracy / total * 100) + "%")

The model's accuracy in identifying positive reviews is: 78.0%


The model's performance on the positive reviews is modest: it classifies only 78% of them correctly. Let's see if it performs better on the negative reviews.

In [10]:
accuracy = 0
total = 0

for review in negative_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result < negative_result:
        accuracy += 1
    total += 1

In [12]:
print("The model's accuracy in identifying negative reviews is: " + str(accuracy / total * 100) + "%")

The model's accuracy in identifying negative reviews is: 89.0%


The model performs better at identifying negative reviews, with an accuracy of 89%.

The following confusion matrix outlines the performance of the model:


|                | Predicted Positives | Predicted Negatives |
| -------------- | ------------------- | ------------------- |
|Actual Positives|         78%         |         22%         |
|Actual Negatives|         11%         |         89%         |
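As a cross-check, a confusion matrix can be tallied directly from labels. The sketch below assumes 100 test reviews per class (an assumption, though it is consistent with the reported percentages) and reconstructs the predictions from the two accuracies. The function `confusion_matrix` is a hypothetical helper, not a library call. Rows are actual classes and columns are predicted classes.

```python
def confusion_matrix(true_labels, predicted_labels):
    # matrix[actual][predicted] holds the number of reviews in that cell
    matrix = {"pos": {"pos": 0, "neg": 0}, "neg": {"pos": 0, "neg": 0}}
    for actual, predicted in zip(true_labels, predicted_labels):
        matrix[actual][predicted] += 1
    return matrix

# Labels reconstructed from the reported accuracies, assuming 100 test reviews per class
true_labels = ["pos"] * 100 + ["neg"] * 100
predicted_labels = ["pos"] * 78 + ["neg"] * 22 + ["neg"] * 89 + ["pos"] * 11

matrix = confusion_matrix(true_labels, predicted_labels)
overall_accuracy = (matrix["pos"]["pos"] + matrix["neg"]["neg"]) / len(true_labels)

print(matrix)
print(overall_accuracy)  # 0.835
```

Under that assumption, each row sums to 100 and the overall accuracy across both classes is 83.5%.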

# Analysis of Errors in Output

In [9]:
positive_reviews_misclassified = []

for review in positive_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result < negative_result:
        positive_reviews_misclassified.append(review)

In [10]:
negative_reviews_misclassified = []

for review in negative_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result > negative_result:
        negative_reviews_misclassified.append(review)

In [23]:
positive_reviews_sample_ids = random.sample(range(len(positive_reviews_misclassified)), 5)

In [24]:
positive_reviews_sample_ids

[2, 10, 12, 3, 21]

In [62]:
negative_reviews_sample_ids = random.sample(list(np.arange(0,len(negative_reviews_misclassified))), 5)

In [63]:
negative_reviews_sample_ids

[7, 10, 4, 9, 5]

## Analysis of Misclassified Positive Reviews

In each sampled review, we will look at the tokens with the highest count in the review, then we will compare the frequency of these words within the positive review dictionary and the negative review dictionary.

Let's have an initial look at the misclassified review at index 12.

In [66]:
review_dict1 = create_frequency_dictionary([positive_reviews_misclassified[12]])

In [32]:
# sorted(review_dict1.items(), key=lambda x: x[1], reverse=True)[:20]

The 20 most frequent tokens in this first review are:

1. ('.', 46) -> punctuation

2. ('the', 36) -> determiner
 
3. (',', 23) -> punctuation
 
4. ('to', 19) -> particle
 
5. ('and', 16) -> conjunction
 
6. ('in', 14) -> preposition
 
7. ('a', 13) -> determiner
 
8. ('of', 11) -> preposition
 
9. ('"', 10) -> punctuation
 
10. ('(', 9) -> punctuation
 
11. (')', 9) -> punctuation
 
12. ('that', 9) -> determiner/conjunction
 
13. ('is', 9) -> verb
 
14. ('film', 8) -> noun
 
15. ('as', 8) -> conjunction/preposition
 
16. ('car', 7) -> noun
 
17. ('his', 7) -> pronoun
 
18. ('just', 7) -> adjective/adverb
 
19. ('memphis', 6) -> noun
 
20. ('have', 6) -> verb

This is expected, as punctuation and determiners/particles/prepositions etc. are more common than nouns or adjectives. 

We will have a look at a few of the most common nouns, verbs, adjectives and adverbs.

* ('is', 9)
* ('film', 8)
* ('car', 7)
* ('just', 7)
* ('memphis', 6)
* ('have', 6)

In [72]:
positive_dictionary = create_frequency_dictionary(positive_training_data)
negative_dictionary = create_frequency_dictionary(negative_training_data)

In [48]:
print("'is' appears in the positive dictionary " + str(positive_dictionary['is']) + " times")
print("'is' appears in the negative dictionary " + str(negative_dictionary['is']) + " times")

'is' appears in the positive dictionary 12549 times
'is' appears in the negative dictionary 9952 times


In [49]:
print("'film' appears in the positive dictionary " + str(positive_dictionary['film']) + " times")
print("'film' appears in the negative dictionary " + str(negative_dictionary['film']) + " times")

'film' appears in the positive dictionary 4376 times
'film' appears in the negative dictionary 3598 times


In [50]:
print("'car' appears in the positive dictionary " + str(positive_dictionary['car']) + " times")
print("'car' appears in the negative dictionary " + str(negative_dictionary['car']) + " times")

'car' appears in the positive dictionary 112 times
'car' appears in the negative dictionary 165 times


In [51]:
print("'just' appears in the positive dictionary " + str(positive_dictionary['just']) + " times")
print("'just' appears in the negative dictionary " + str(negative_dictionary['just']) + " times")

'just' appears in the positive dictionary 1197 times
'just' appears in the negative dictionary 1390 times


In [52]:
print("'memphis' appears in the positive dictionary " + str(positive_dictionary['memphis']) + " times")
print("'memphis' appears in the negative dictionary " + str(negative_dictionary['memphis']) + " times")

KeyError: 'memphis'
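The `KeyError` above is raised because indexing a dictionary with a key it does not contain fails. A minimal sketch of a safer lookup uses `dict.get` with a default of 0. The helper name `describe_counts` and the toy dictionaries are hypothetical.

```python
def describe_counts(word, positive_dict, negative_dict):
    # dict.get returns the default (0) instead of raising KeyError for a missing word
    pos_count = positive_dict.get(word, 0)
    neg_count = negative_dict.get(word, 0)
    return (f"'{word}' appears {pos_count} times in the positive dictionary "
            f"and {neg_count} times in the negative dictionary")

# Made-up dictionaries for illustration
toy_positive = {"film": 5}
toy_negative = {"film": 3, "memphis": 2}
print(describe_counts("memphis", toy_positive, toy_negative))
```

With this helper, a word that is missing from one dictionary is reported with a count of 0 rather than interrupting the cell.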

In [53]:
print("'have' appears in the positive dictionary " + str(positive_dictionary['have']) + " times")
print("'have' appears in the negative dictionary " + str(negative_dictionary['have']) + " times")

'have' appears in the positive dictionary 1992 times
'have' appears in the negative dictionary 2408 times


It is difficult to find a reasonable explanation for the misclassification of the review based on the most common words.

Another possible method to understand this error is to identify the words that appear far more often in the negative reviews than in the positive reviews.

In [88]:
def find_ratio_positive_review(word, positive_dict, negative_dict):

    # This method finds the ratio of the word's count in the negative dictionary
    # to its count in the positive dictionary
    # A result less than 1 means that the word appears more in the positive dictionary
    # A result more than 1 means that the word appears more in the negative dictionary

    if word not in negative_dict:
        return 0  # placeholder score: the word is absent from the negative dictionary (or from both)
    elif word not in positive_dict:
        return 2  # placeholder score: the word appears only in the negative dictionary

    return negative_dict[word] / positive_dict[word]

In [89]:
find_ratio_positive_review('happy', positive_dictionary, negative_dictionary)

0.5614035087719298

This result of ~0.56 means that the word 'happy' appears in the negative dictionary about 56% as often as it appears in the positive dictionary, so it is noticeably more common in the positive reviews.

In [90]:
score_dict = {}

In [91]:
for word in positive_reviews_misclassified[12].split():
    score_dict[word] = find_ratio_positive_review(word, positive_dictionary, negative_dictionary)

In [98]:
sorted(score_dict.items(), key=lambda x: x[1], reverse=True)

[('bruckheimer', 18.0),
 ('memphis', 17.0),
 ('jolie', 15.0),
 ('angelina', 7.0),
 ('shiny', 6.0),
 ('hottest', 4.0),
 ('mindset', 3.5),
 ('jumping', 3.142857142857143),
 ('dominic', 3.0),
 ('improvise', 3.0),
 ('cage', 2.9166666666666665),
 ('bad', 2.833846153846154),
 ('remake', 2.5454545454545454),
 ('seconds', 2.4210526315789473),
 ('none', 2.3278688524590163),
 ('kalifornia', 2.0),
 ('nicolas', 2.0),
 ('assorted', 2.0),
 ('lindo', 2.0),
 ("lindo's", 2),
 ('chases', 2.0),
 ('ramp', 2),
 ('filmmakers', 1.9482758620689655),
 ('armageddon', 1.9473684210526316),
 ('unfortunately', 1.8956521739130434),
 ('christopher', 1.8571428571428572),
 ('cars', 1.8421052631578947),
 ('loud', 1.78125),
 ('halfway', 1.7777777777777777),
 ('monkey', 1.736842105263158),
 ('speed', 1.7037037037037037),
 ('partner', 1.631578947368421),
 ('postman', 1.5714285714285714),
 ('oh', 1.5277777777777777),
 ('plot', 1.5265225933202358),
 ('sixty', 1.5),
 ("anyone's", 1.5),
 ('car', 1.4732142857142858),
 ('add', 1

Out of the 334 unique words in the review, it appears there are 133 words that scored above 1, 12 words that are scored exactly 1 and 189 words that scored less than 1. 

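This means that more of the review's unique words are associated with positive reviews than with negative reviews. On its own, this does not explain why the review was classified as negative, although a count of unique words ignores both how often each word occurs in the review and how strongly it leans towards one class.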
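One way to take frequency and strength into account is to look at each word's contribution to the difference between the two class scores. The sketch below is an assumption about how such an analysis could be done, not part of the original notebook. `word_contributions` is a hypothetical helper that uses the same add-one smoothing as `result_for_class`. A positive value pushes the review towards the negative class.

```python
import math
from collections import Counter

def word_contributions(review, positive_dict, negative_dict, skip=()):
    pos_total = sum(positive_dict.values())
    neg_total = sum(negative_dict.values())
    contributions = {}
    for word, count in Counter(review.split()).items():
        if word in skip:
            continue
        log_p_pos = math.log((positive_dict.get(word, 0) + 1) / (pos_total + 1))
        log_p_neg = math.log((negative_dict.get(word, 0) + 1) / (neg_total + 1))
        # Weighted by how many times the word occurs in this review
        contributions[word] = count * (log_p_neg - log_p_pos)
    return contributions

# Made-up dictionaries and review for illustration
toy_positive = {"good": 3, "film": 1}
toy_negative = {"bad": 3, "film": 1}
scores = word_contributions("bad bad film", toy_positive, toy_negative)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Since both classes share the same prior, the sum of these contributions equals the negative score minus the positive score, so the largest values point to the words most responsible for a negative classification.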