# Multinomial Naive Bayes for IMDB classification

Your multinomial Naïve Bayes algorithm should:

1. **Build a model** from the positive and negative movie reviews in the `train` folder.
2. **Classify** the movie reviews in the `test` folder.
3. **Compare the predictions** with the true class labels and **calculate the accuracy of the model** separately for positive and negative movie reviews.

The dataset contains 25,000 movie reviews in the train folder, split evenly between positive and negative. 

---
## Import Statements

In [1]:
import os
import re
from collections import Counter, defaultdict
from math import log

from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

---
## Stage 1 – File Parsing and Vocabulary Composition

In [2]:
allWords = Counter()
posWords = Counter()
negWords = Counter()
stats = {}  # a defaultdict without a factory behaves like a plain dict, so use one directly

classes = ['pos', 'neg']
data = [('pos', posWords), ('neg', negWords)]


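def read_file(path, file):
    # Read one review and lowercase it; the context manager closes the file handle
    with open(os.path.join(path, file), 'r', encoding='utf8') as f:
        return f.read().lower()

def preprocess_reviews(txt):
    # Raw strings stop Python from interpreting the regex escapes (\s, \[, \]) as string escapes
    REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
    REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")
    
    txt = REPLACE_NO_SPACE.sub("", txt)      # delete punctuation outright
    txt = REPLACE_WITH_SPACE.sub(" ", txt)   # turn HTML line breaks, hyphens and slashes into spaces
    
    return txt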

def tokenize_input(txt, mode='re'):
    # mode: 're' (regex word matches), 'str' (whitespace split) or
    # 'nltk' (stemmed trigrams plus whitespace-split unigrams)
    words = []
    txt = preprocess_reviews(txt)
    
    if mode == 're':
        words = re.findall(r'\w+', txt)
    elif mode == 'str':
        words = txt.split()
    elif mode == 'nltk':
        wordsFiltered = []
        stopWords = set(stopwords.words('english'))
        wordsToken = wordpunct_tokenize(txt)
        
        # Stop-word filtering is currently disabled (condition commented out)
        for w in wordsToken:
            #if w not in stopWords:
            wordsFiltered.append(w)
        
        # Reduce every token to its stem
        stem = PorterStemmer()
        #stem = WordNetLemmatizer()
        for w in wordsFiltered:
            rootWord = stem.stem(w)
            words.append(rootWord)
        
        # Features: trigrams of the stems, followed by the whitespace-split unigrams
        words = [' '.join(gram) for gram in ngrams(words, 3)]
        #words += re.findall(r'\w+', txt)
        words += txt.split()
        
    return words

def count(c, cnt, stats):
    path = './data/train/{}/'.format(c)
    files = os.listdir(path)
    stats[c] = len(files)
    
    for file in files:
        words = tokenize_input(read_file(path, file), mode='nltk')
        allWords.update(words)
        cnt.update(words)

for c, cnt in data:
    count(c, cnt, stats)
        
print('Number of distinct words in allWords: {}, posWords: {}, negWords: {}'.format(
    len(allWords), len(posWords), len(negWords)))
print('Absolute number of words in allWords: {}, posWords: {}, negWords: {}'.format(
    sum(allWords.values()), sum(posWords.values()), sum(negWords.values())))

Number of distinct words in allWords: 3604433, posWords: 2013363, negWords: 1925884
Absolute number of words in allWords: 11605357, posWords: 5875296, negWords: 5730061
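
To make the feature construction in `tokenize_input(..., mode='nltk')` concrete, here is a minimal standard-library sketch of the same idea: word trigrams followed by whitespace-split unigrams. It leaves out stemming and NLTK's tokenizer, and `toy_features` is a hypothetical name used only for illustration. Because most trigrams occur only once in a corpus, this kind of feature set is a plausible reason for the very large number of distinct entries reported above.

```python
def toy_features(txt):
    """Return word trigrams followed by unigrams for a lowercased string."""
    tokens = txt.lower().split()
    # Zipping three staggered views of the token list yields consecutive triples
    trigrams = [' '.join(gram) for gram in zip(tokens, tokens[1:], tokens[2:])]
    return trigrams + tokens

features = toy_features("This movie was really great")
print(features)
```

A five-word input yields three trigrams and five unigrams; inputs shorter than three words yield unigrams only.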


In [3]:
# Ensure that every word seen during training has an entry in both counters (posWords and negWords)
for w in allWords:
    if w not in posWords: posWords[w] = 0
    if w not in negWords: negWords[w] = 0

print('Number of distinct words in allWords: {}, posWords: {}, negWords: {}'.format(
    len(allWords), len(posWords), len(negWords)))
print('Absolute number of words in allWords: {}, posWords: {}, negWords: {}'.format(
    sum(allWords.values()), sum(posWords.values()), sum(negWords.values())))

Number of distinct words in allWords: 3604433, posWords: 3604433, negWords: 3604433
Absolute number of words in allWords: 11605357, posWords: 5875296, negWords: 5730061
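
The effect of this cell can be reproduced on a toy `Counter`. Reading a missing key returns 0 without storing it, whereas assigning 0 creates an entry, which is why the distinct counts grow while the absolute counts stay the same. The counter contents below are made up for illustration.

```python
from collections import Counter

pos = Counter({'great': 3, 'fun': 1})
vocab = Counter({'great': 3, 'fun': 1, 'boring': 2})

# Reading a missing key returns 0 and does not add the key
assert pos['boring'] == 0
size_before = len(pos)

# Explicit assignment stores the key with a zero count
for w in vocab:
    if w not in pos:
        pos[w] = 0

size_after = len(pos)
total = sum(pos.values())
print(size_before, size_after, total)
```

The zero-count entries matter in the next stage: with add-one smoothing, every vocabulary word then receives a non-zero probability in both classes.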


---
## Stage 2 – Word Probability Calculations

<img src="multinomial_bayes.png">
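
For reference, the quantities computed by `fit` and `predict` in the cells below correspond to the standard multinomial Naive Bayes estimates with add-one (Laplace) smoothing. They are written out here on the assumption that the figure shows the same formulas:

$$
\hat{P}(c) = \frac{N_c}{N}, \qquad
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

$$
\hat{c} = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{i} \log \hat{P}(w_i \mid c) \right]
$$

Here $N_c$ is the number of training reviews in class $c$ (`stats['pos']`, `stats['neg']`), $N$ is the total number of training reviews, $\mathrm{count}(w, c)$ is the number of occurrences of word $w$ in class $c$, $V$ is the vocabulary (`stats['voc']` holds $|V|$), and $w_i$ ranges over the words of the review being classified.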

In [4]:
def fit(data, posWords, negWords, allWords):
    # Add total number of positive words, negative words and vocabulary size to stats
    stats['w_pos'] = sum(posWords.values())
    stats['w_neg'] = sum(negWords.values())
    stats['voc'] = len(allWords)

    # Probability dictionary p_: holds the class priors and all conditional probabilities
    stats['p_'] = {}
    stats['p_']['p_pos'] = stats['pos'] / (stats['pos'] + stats['neg'])
    stats['p_']['p_neg'] = stats['neg'] / (stats['pos'] + stats['neg'])
    
    for c, cnt in data:
        print('Calculating conditional probabilities: ', c)
        for k, v in cnt.items():
            # Add-one (Laplace) smoothing: (count + 1) / (words in class + vocabulary size)
            p_id = 'p_{}_{}'.format(c, k)
            stats['p_'][p_id] = (v + 1)/(stats['w_'+c] + stats['voc'])

In [5]:
fit(data, posWords, negWords, allWords)

Calculating conditional probabilities:  pos
Calculating conditional probabilities:  neg
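
The add-one smoothing applied in `fit` can be checked by hand on a toy counter. The sketch below uses made-up counts for a single class and hypothetical variable names; it only illustrates the arithmetic, not the notebook's data.

```python
from collections import Counter

# Toy class counts, including a zero-count word as produced in Stage 1
pos_counts = Counter({'great': 3, 'fun': 1, 'boring': 0})

vocab_size = len(pos_counts)           # |V| = 3
total_pos = sum(pos_counts.values())   # 4 word occurrences in the class

# (count + 1) / (total words in class + vocabulary size)
smoothed = {w: (n + 1) / (total_pos + vocab_size) for w, n in pos_counts.items()}
print(smoothed)
```

The three values are 4/7, 2/7 and 1/7: the word that never occurred in the class still gets a non-zero probability, and the probabilities sum to 1 over the vocabulary.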


---
## Stage 3 – Classifying Unseen Documents and Basic Evaluation

In [6]:
def predict(c, words):
    # Start each class score at its log prior
    prediction = {}
    prediction['pos'] = log(stats['p_']['p_pos'])
    prediction['neg'] = log(stats['p_']['p_neg'])
    
    for cl in classes:
        for w in words:
            p_id = 'p_{}_{}'.format(cl, w)
            # Words never seen during training have no entry and are skipped
            if p_id in stats['p_']:
                prediction[cl] += log(stats['p_'][p_id])

    # True if the highest-scoring class matches the true label c
    return c == max(prediction, key=prediction.get)


def test(c, cnt):
    path = './data/test/{}/'.format(c)
    files = os.listdir(path)
    correct = 0
    
    for file in files:
        words = tokenize_input(read_file(path, file), mode='nltk')
        if predict(c, words):
            correct +=1

    print('Accuracy for class {}: {}'.format(c, correct/len(files)))

In [8]:
for c, cnt in data:
    test(c, cnt)

Accuracy for class pos: 0.8475
Accuracy for class neg: 0.9025
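
`predict` adds log-probabilities instead of multiplying raw probabilities. A short sketch of why this matters: with many features per review, the product of small probabilities underflows to zero in floating point, while the sum of logs stays finite and comparable between classes. The probability value and feature count below are arbitrary illustration values.

```python
from math import log

p = 1e-5         # a small per-word probability (illustrative)
n_words = 1000   # number of features in one long review (illustrative)

product = 1.0
for _ in range(n_words):
    product *= p  # the true value 1e-5000 is far below the smallest positive float

log_sum = n_words * log(p)

print(product)
print(log_sum)
```

The product collapses to `0.0`, so two classes could no longer be told apart, whereas the log score remains an ordinary negative number.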


---

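**Some result history**

The labels appear to name the `tokenize_input` settings used for each run: `NLTK` for `wordpunct_tokenize`, `3gram` for word trigrams, `STR` for appended `txt.split()` tokens, `RE` for appended `re.findall` tokens, `STOP`/`NOSTOP` for stop-word removal on/off, `STEM` for the Porter stemmer, and `preprocess` for `preprocess_reviews`. The first entry matches the code and output shown above.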

```
NLTK 3gram + STR + NOSTOP + STEM + preprocess
Accuracy for class pos: 0.8475
Accuracy for class neg: 0.9025

NLTK 3gram + STR + NOSTOP + STEM
Accuracy for class pos: 0.838
Accuracy for class neg: 0.907

NLTK
Accuracy for class pos: 0.793
Accuracy for class neg: 0.9085

NLTK + RE
Accuracy for class pos: 0.8095
Accuracy for class neg: 0.8965
 
NLTK + STR
Accuracy for class pos: 0.8095
Accuracy for class neg: 0.8965
    
NLTK + STR + RE
Accuracy for class pos: 0.801
Accuracy for class neg: 0.891

NLTK 3gram + STR + STOP + STEM + preprocess
Accuracy for class pos: 0.809
Accuracy for class neg: 0.8645
```
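
To compare the runs with a single number, one option is balanced accuracy, the mean of the two per-class accuracies. The sketch below copies the values from the history above; the `history` dictionary and the choice of balanced accuracy as the ranking criterion are illustrative assumptions, not part of the original experiments.

```python
# (pos accuracy, neg accuracy) copied from the result history
history = {
    'NLTK 3gram + STR + NOSTOP + STEM + preprocess': (0.8475, 0.9025),
    'NLTK 3gram + STR + NOSTOP + STEM': (0.838, 0.907),
    'NLTK': (0.793, 0.9085),
    'NLTK + RE': (0.8095, 0.8965),
    'NLTK + STR': (0.8095, 0.8965),
    'NLTK + STR + RE': (0.801, 0.891),
    'NLTK 3gram + STR + STOP + STEM + preprocess': (0.809, 0.8645),
}

# Balanced accuracy: unweighted mean of the per-class accuracies
balanced = {name: (pos + neg) / 2 for name, (pos, neg) in history.items()}

# Best configuration first
ranking = sorted(balanced, key=balanced.get, reverse=True)
for name in ranking:
    print('{:.4f}  {}'.format(balanced[name], name))
```

Under this criterion the configuration shown in the notebook ranks first, and the same configuration with stop-word removal enabled ranks last.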