In [26]:
import pandas as pd
import os
import math

# Naive Bayes Background

Bayes' rule (for document d and class c) is:
$$
    P(c|d) = \frac{P(d|c) P(c)}{P(d)}
$$

The denominator $P(d)$ is the same for every class, so it can be dropped when comparing classes. The posterior is then proportional to:
$$
    P(c|d) \propto P(d|c)P(c)
$$

$C_{MAP}$ (Maximum A Posteriori) is the most likely class, and can be calculated using:

$$
    C_{MAP} = argmax_{c}[ P(d|c)P(c) ]
$$

In this example, each document is a review. 
The features of a document are the words contained in the review. For a review with n words, the posterior is proportional to:
$$
    P(c|d) \propto P(x_{1}, x_{2}, ..., x_{n}|c)P(c)
$$

Since Naive Bayes assumes that the words are conditionally independent given the class, this formula can be simplified to:

$$
    P(c|d) \propto P(x_{1}|c)P(x_{2}|c)...P(x_{n}|c)P(c)
$$

Some reviews contain hundreds of words, so the product of these probabilities can become small enough to underflow. We therefore add the log probabilities instead. Because the logarithm is monotonic, this does not affect the ranking of the classes. We will use the following formula:

$$
    C_{MAP} = argmax_{c_{j}}[\log P(c_{j}) + \sum_{i} \log P(x_{i}|c_{j})]
$$
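To illustrate why the log form is used, here is a minimal sketch with made-up likelihood values (the `1e-4` figure and the 400-word length are assumptions for illustration, not taken from the dataset). It shows the direct product underflowing while the sum of logs stays finite.

```python
import math

# Hypothetical per-word likelihoods for a 400-word review (illustrative values only).
word_probabilities = [1e-4] * 400

# Multiplying directly underflows to 0.0 in double precision.
product = 1.0
for p in word_probabilities:
    product *= p

# Summing logs keeps the score finite, so classes can still be ranked.
log_score = sum(math.log(p) for p in word_probabilities)

print(product)
print(log_score)
```

Once the product reaches 0.0 for both classes, the comparison between them carries no information, which is the practical reason to compare log scores instead.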

### Notations and Formulas

$$
P(c_{j}) = \frac{\text{number of documents in class }c_{j}}{\text{total number of documents in all classes}} = \text{What proportion of all documents belong to class }c_{j}?
$$

$$
    P(w_{j}|c_{i}) = \frac{\text{number of times word }w_{j}\text{ appears in class }c_{i}}{\text{total number of words that appear in class }c_{i}} = \text{What proportion of the words in class }c_{i}\text{ is }w_{j}?
$$
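As a rough sketch of how these two estimates can be computed, the example below uses a tiny invented corpus. The names `toy_documents`, `priors` and `word_likelihoods` are hypothetical and are not part of this notebook's code.

```python
from collections import Counter

# Hypothetical toy corpus (not the review dataset).
toy_documents = {
    "pos": ["good fun film", "good acting"],
    "neg": ["bad film"],
}

total_documents = sum(len(docs) for docs in toy_documents.values())

# P(c_j): share of all documents that belong to each class.
priors = {label: len(docs) / total_documents for label, docs in toy_documents.items()}

def word_likelihoods(documents):
    """P(w_j | c_i): share of the word tokens in a class that are w_j."""
    counts = Counter(word for doc in documents for word in doc.split())
    total_words = sum(counts.values())
    return {word: count / total_words for word, count in counts.items()}

likelihoods = {label: word_likelihoods(docs) for label, docs in toy_documents.items()}

print(priors)
print(likelihoods["pos"])
```

Here "good" accounts for 2 of the 5 word tokens in the positive class, so its unsmoothed likelihood is 2/5.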

### Steps Taken to Improve Performance

* <b>Smoothing the data</b>: Although we have a large dataset, it is possible that certain features (words) in a test review never appear in the training data for a class. Without smoothing, such a word would have probability 0, and its logarithm would be undefined. We use a Laplace-style add-one smoothing, which prevents 0-probabilities by adding 1 to the numerator and the denominator. (Textbook Laplace smoothing adds the vocabulary size $|V|$ to the denominator instead; the variant below is the one implemented in this notebook.)

$$ 
    P(w_{j}|c_{i}) = \frac{(\text{number of times word }w_{j}\text{ appears in class }c_{i}) + 1}{(\text{total number of words that appear in class }c_{i}) + 1} 
$$

* <b>Ignoring punctuation</b>: The reviews have been formatted so that each feature (word or punctuation mark) is separated by whitespace. This allows punctuation to be skipped with a simple 'if' clause. 

I chose not to remove stop words (such as 'the' or 'a'), as removing them has been shown to have little or no benefit for the performance of a Naive Bayes classifier.
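The effect of add-one smoothing can be sketched on invented counts. The example compares the `(count + 1) / (total + 1)` formula used in this notebook with the usual add-one form that adds the vocabulary size to the denominator. The counts, the three-word vocabulary and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical word counts for one class (illustrative only).
class_counts = {"good": 3, "film": 2}
total_words = sum(class_counts.values())  # 5 tokens
vocabulary = ["good", "film", "boring"]   # assumed vocabulary; "boring" is unseen

def notebook_estimate(word):
    """(count + 1) / (total + 1), the formula used in this notebook."""
    return (class_counts.get(word, 0) + 1) / (total_words + 1)

def textbook_estimate(word):
    """(count + 1) / (total + |V|), the usual add-one (Laplace) form."""
    return (class_counts.get(word, 0) + 1) / (total_words + len(vocabulary))

notebook_total = sum(notebook_estimate(w) for w in vocabulary)
textbook_total = sum(textbook_estimate(w) for w in vocabulary)

# Either variant gives an unseen word a finite log probability.
unseen_log = math.log(notebook_estimate("boring"))

print(notebook_total, textbook_total, unseen_log)
```

Both variants avoid taking the log of zero. Only the second sums to 1 over the vocabulary, which matters when the two classes have noticeably different word totals.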


# Importing and Concatenating Reviews

In [4]:
pos_directory = '../Part_3/review_polarity/txt_sentoken/pos'
positive_reviews_list = []

for filename in os.listdir(pos_directory):
    file_path = os.path.join(pos_directory, filename)
    # Use a context manager so each file is closed after reading
    with open(file_path, "r") as file:
        review = [line.rstrip() for line in file]
    # Join the lines of one review into a single string
    positive_reviews_list.append(" ".join(review))

In [5]:
neg_directory = '../Part_3/review_polarity/txt_sentoken/neg'
negative_reviews_list = []

for filename in os.listdir(neg_directory):
    file_path = os.path.join(neg_directory, filename)
    # Use a context manager so each file is closed after reading
    with open(file_path, "r") as file:
        review = [line.rstrip() for line in file]
    # Join the lines of one review into a single string
    negative_reviews_list.append(" ".join(review))

# Assigning Important Variables

In [7]:
## Assigning the test and training data
negative_training_data = negative_reviews_list[:900]
positive_training_data = positive_reviews_list[:900]
negative_test_data = negative_reviews_list[900:]
positive_test_data = positive_reviews_list[900:]

In [58]:
def create_frequency_dictionary(list_of_reviews):
    review_dictionary = {}
    for review in list_of_reviews:
        words = review.split()
        for word in words:
            if word in review_dictionary:
                review_dictionary[word] += 1
            else:
                review_dictionary[word] = 1
    return review_dictionary

In [70]:
def result_for_class(review, training_data):
    # Tokens to skip when scoring a review
    punctuation = ['.', ',', ':', '"', '&', '?', '-', '(', ')', "'", '/']
    # Start from the log prior; both classes are treated as equally likely
    class_result = math.log(0.5)
    dictionary = create_frequency_dictionary(training_data)
    # Compute the total word count once instead of once per word
    total_words = sum(dictionary.values())
    for word in review.split():
        if word in punctuation:
            continue
        # Add-one smoothing; a word unseen in this class has a count of 0
        class_result += math.log((dictionary.get(word, 0) + 1) / (total_words + 1))
    return class_result
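Since `result_for_class` rebuilds the frequency dictionary on every call, one possible refinement is to count the training words once per class and reuse the counts. The sketch below is a self-contained illustration on invented sentences. The `train` and `score` helpers are hypothetical names, not functions from this notebook, though they apply the same scoring formula.

```python
import math
from collections import Counter

PUNCTUATION = {'.', ',', ':', '"', '&', '?', '-', '(', ')', "'", '/'}

def train(list_of_reviews):
    """Count word occurrences once; return (counts, total word count)."""
    counts = Counter(word for review in list_of_reviews for word in review.split())
    return counts, sum(counts.values())

def score(review, counts, total_words, prior=0.5):
    """Log score of a review for one class, using (count + 1) / (total + 1)."""
    result = math.log(prior)
    for word in review.split():
        if word in PUNCTUATION:
            continue
        result += math.log((counts.get(word, 0) + 1) / (total_words + 1))
    return result

# Hypothetical toy training data (not the review dataset).
pos_counts, pos_total = train(["a great and moving film", "great fun"])
neg_counts, neg_total = train(["a dull and boring film", "boring mess"])

review = "great film , great fun"
pos_score = score(review, pos_counts, pos_total)
neg_score = score(review, neg_counts, neg_total)
predicted = "pos" if pos_score > neg_score else "neg"
print(predicted)
```

With the counts computed up front, scoring a review only needs dictionary lookups, so the cost per test review no longer grows with the size of the training set.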

In [68]:
accuracy = 0
total = 0

for review in positive_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result > negative_result:
        accuracy += 1
    total += 1

In [69]:
accuracy / total

0.78

The model's performance on the positive reviews is modest: it classifies only 78% of them correctly. Let's see if it performs better on the negative reviews.

In [49]:
accuracy = 0
total = 0

for review in negative_test_data:
    positive_result = result_for_class(review, positive_training_data)
    negative_result = result_for_class(review, negative_training_data)
    if positive_result < negative_result:
        accuracy += 1
    total += 1

In [50]:
accuracy / total

0.89

The model seems to have a better performance in identifying negative reviews, with an accuracy of 89%.

The following confusion matrix outlines the performance of the model. Each row is normalized by the number of test reviews with that actual label, so each row sums to 100%:


|                | Predicted Positives | Predicted Negatives |
| -------------- | ------------------- | ------------------- |
|Actual Positives|         78%         |         22%         |
|Actual Negatives|         11%         |         89%         |
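A row-normalized confusion matrix of this kind can be built from (actual, predicted) label pairs. The sketch below uses a handful of invented pairs rather than the model's real predictions, and the names `pairs`, `row_rates` and `confusion` are hypothetical.

```python
from collections import Counter

# Hypothetical (actual, predicted) label pairs, for illustration only.
pairs = [
    ("pos", "pos"), ("pos", "neg"), ("pos", "pos"),
    ("neg", "neg"), ("neg", "pos"), ("neg", "neg"), ("neg", "neg"),
]

cell_counts = Counter(pairs)

def row_rates(actual):
    """Share of reviews with this actual label assigned to each predicted label."""
    row_total = sum(n for (a, _), n in cell_counts.items() if a == actual)
    return {predicted: cell_counts[(actual, predicted)] / row_total
            for predicted in ("pos", "neg")}

confusion = {actual: row_rates(actual) for actual in ("pos", "neg")}
print(confusion)
```

Normalizing by row means the diagonal entries are the per-class accuracies, and the off-diagonal entry in each row is simply one minus that accuracy.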