There are two ways to estimate a word's probability. One: combine all positive reviews into a single corpus and divide the number of times the word appears by the total number of words in that corpus. **Or: for each word, divide the number of positive reviews it appears in by the total number of positive reviews.** Both give a good (albeit different) measure of probability that we can use. This solution takes the latter approach.
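
To make the difference between the two estimates concrete, here is a minimal sketch on a made-up four-review corpus. The `reviews` list and the variable names are illustrative assumptions, not part of the dataset or of the code that follows.

```python
from collections import Counter

# Toy stand-in for the positive reviews (hypothetical data).
reviews = [
    "good good movie",
    "good plot",
    "bad movie",
    "fine acting",
]
docs = [r.split() for r in reviews]

# Approach 1: occurrences of the word over all words in the combined corpus.
term_counts = Counter(w for doc in docs for w in doc)
total_words = float(sum(term_counts.values()))
term_freq = {w: c / total_words for w, c in term_counts.items()}

# Approach 2: number of reviews containing the word over the number of reviews.
doc_counts = Counter(w for doc in docs for w in set(doc))
doc_freq = {w: c / float(len(docs)) for w, c in doc_counts.items()}

print(term_freq["good"])  # 3 of the 9 words
print(doc_freq["good"])   # 2 of the 4 reviews
```

The second estimate ignores repeats inside a single review, which is why the counting code later loops over `set(content)` rather than `content`.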

In [None]:
import glob
import os
from collections import defaultdict
import re
import numpy as np

**Given a filename, this function removes non-alphabetic characters and returns a list of the words in the file:**

In [None]:
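def processFile(filename):
    # Read the whole file; the with-block closes it afterwards
    with open(filename, 'r') as f:
        content = f.read()
    # Keep only letters, spaces and newlines (A-z would also keep [ \ ] ^ _ `)
    content = re.sub('[^A-Za-z \n]','',content)
    return content.split()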
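
A quick way to see what the cleaning step keeps is to run the substitution on a short string. The sketch below uses a made-up sample string and compares the range `A-z` with `A-Za-z`; the former also lets through the six punctuation characters that sit between `Z` and `a` in ASCII.

```python
import re

sample = "A [so-called] plot_twist, 10/10!\n"

# The range A-z spans ASCII 65-122, which includes [ \ ] ^ _ ` between Z and a.
wide = re.sub('[^A-z \n]', '', sample)
# Listing the two letter ranges separately keeps only letters, spaces, newlines.
strict = re.sub('[^A-Za-z \n]', '', sample)

print(wide.split())    # ['A', '[socalled]', 'plot_twist']
print(strict.split())  # ['A', 'socalled', 'plottwist']
```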

**This function loops through the files in our training set and updates a dictionary of counts:**

In [None]:
def readReview(path,dic):
    for filename in glob.glob(os.path.join(path, '*.txt')):
        content=processFile(filename)
        for w in set(content):
            dic[w]+=1

**We call the above function to build the positive and negative dictionaries:**

In [None]:
path1 = '/Users/jb/Desktop/review_polarity/txt_sentoken/pos'
path2 = '/Users/jb/Desktop/review_polarity/txt_sentoken/neg'
posdict = defaultdict(int)
negdict = defaultdict(int)
readReview(path1,posdict)
readReview(path2,negdict)

**The next step is important. For each word that is in the positive dictionary but not in the negative one, we set its value in the negative dictionary to a small value, 0.2 (and vice versa):**

In [None]:
for k,v in posdict.items():
    if not negdict[k]:
        negdict[k]=0.2
for k,v in negdict.items():
    if not posdict[k]:
        posdict[k]=0.2
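
The reason for the floor shows up once counts are turned into log probabilities. The following sketch assumes the class size of 1000 reviews used in this notebook; the variable names are illustrative.

```python
import numpy as np

n_reviews = 1000.0  # class size used in this notebook

# Without a floor, a word never seen in a class has probability 0 ...
with np.errstate(divide='ignore'):
    unseen = np.log(0 / n_reviews)  # -inf: one such word would veto the class

# ... whereas the 0.2 floor gives a large but finite penalty.
floored = np.log(0.2 / n_reviews)
seen_once = np.log(1 / n_reviews)

print(unseen, floored, seen_once)
```

Any floor below 1 keeps an unseen word less likely than a word seen in a single review, so 0.2 is one reasonable choice among many.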

**Next we convert each count into a probability (dividing by the number of reviews in the class) and take its log:**

In [None]:
for k,v in posdict.items():
    posdict[k]=np.log(v/1000.0)
for k,v in negdict.items():
    negdict[k]=np.log(v/1000.0)
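
Working with logs is what lets us add scores instead of multiplying probabilities. A small sketch of why that matters numerically, using 400 made-up per-word probabilities:

```python
import numpy as np

# 400 hypothetical per-word probabilities of 0.01 each (illustrative values).
probs = np.full(400, 0.01)

product = np.prod(probs)         # 1e-800 is below the float64 range, so this is 0.0
log_sum = np.sum(np.log(probs))  # stays finite: 400 * log(0.01)

print(product, log_sum)
```

With the raw product every long review would score exactly zero for both classes, while the sum of logs still separates them.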

**Now, given a test review, we go through the distinct words in that review and add their positive log probabilities to a positive score and their negative log probabilities to a negative score.**

In [None]:
def predictionPercentage(path,posdict,negdict,isPositive=True):
    files = glob.glob(os.path.join(path, '*.txt'))
    poscount = 0  # number of reviews classified as positive
    for filename in files:
        content=processFile(filename)
        posscore = 0
        negscore = 0
        # Sum log probabilities over the distinct words of the review
        for w in set(content):
            posscore+=posdict[w]
            negscore+=negdict[w]
        # Scores are negative, so the 0.97 factor raises the bar for "positive"
        if (posscore>0.97*negscore):
            poscount+=1
    # Percentage classified correctly for the class this folder holds
    if isPositive:
        return 100*poscount/len(files)
    else:
        return 100-100*poscount/len(files)

> Create a test set from the original data, and mix reviews if desired

In [None]:
path1test = '/Users/jb/Desktop/review_polarity/txt_sentoken/pos/test'
print (predictionPercentage(path1test,posdict,negdict))
path2test = '/Users/jb/Desktop/review_polarity/txt_sentoken/neg/test'
print (predictionPercentage(path2test,posdict,negdict, False))

**Typically, if positive_score > negative_score, we call the review positive, but the cutoff can be varied. With the plain cutoff, the prediction percentages for the two classes were too different (97 and 45), so I changed the threshold to:**
`posscore>0.97*negscore`

**Both scores are sums of log probabilities and therefore negative, so the 0.97 factor makes it harder for a review to be called positive.**

**I found the 0.97 by searching over a few different values.**
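
Such a search can be sketched as a loop over candidate factors. The helper `accuracy_at` and the score pairs below are hypothetical and only illustrate the idea; they are not the values or the procedure actually used for this notebook.

```python
def accuracy_at(factor, pos_scores, neg_scores):
    """Percent of positive and of negative reviews classified correctly.

    Each element is a (posscore, negscore) pair for one review.
    """
    pos_hits = sum(1 for p, n in pos_scores if p > factor * n)
    neg_hits = sum(1 for p, n in neg_scores if not p > factor * n)
    return (100.0 * pos_hits / len(pos_scores),
            100.0 * neg_hits / len(neg_scores))

# Made-up log-probability scores, for illustration only.
pos_scores = [(-100.0, -104.0), (-200.0, -203.0), (-150.0, -149.0)]
neg_scores = [(-100.0, -101.0), (-300.0, -295.0), (-120.0, -118.0)]

# Lowering the factor trades positive-class accuracy for negative-class accuracy.
for factor in (1.0, 0.99, 0.98, 0.97):
    print(factor, accuracy_at(factor, pos_scores, neg_scores))
```

Tuning the factor on the same reviews used to report accuracy gives an optimistic number, so a separate validation split is the safer place to run this loop.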

**80% is really good for a model that can be built from scratch in less than an hour. Naive Bayes ignores any interaction between the features. There are many ways to introduce a bit of interaction. For image data, we could do pooling. For text data, we will use n-grams. Below I add 2-grams to the feature set. A 2-gram is just a pair of adjacent words.**

**"how are you" as a bag of words is ['how','are','you']**

**"how are you" as 2-grams is ['how are','are you']**
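
The same example in code, as a minimal sketch. The helper name `two_grams` is made up for illustration.

```python
def two_grams(words):
    # Pair each word with its right-hand neighbour.
    return [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]

words = "how are you".split()
print(words)             # ['how', 'are', 'you']
print(two_grams(words))  # ['how are', 'are you']
```

A review with n words yields n-1 pairs, and a one-word review yields none.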

**Everything else remains the same. If you scroll down, you'll see that we have improved accuracy to about 85%.**

In [None]:
def readReview(path,dic):
    for filename in glob.glob(os.path.join(path, '*.txt')):
        content=processFile(filename)
        for w in set(content):
            dic[w]+=1
        twogram = [content[i]+' '+content[i+1] for i in range(len(content)-1)]
        for w in set(twogram):
            dic[w]+=1

In [None]:
def predictionPercentage(path,posdict,negdict,isPositive=True):
    files = glob.glob(os.path.join(path, '*.txt'))
    poscount = 0  # number of reviews classified as positive
    for filename in files:
        content=processFile(filename)
        posscore = 0
        negscore = 0
        # Single words first ...
        for w in set(content):
            posscore+=posdict[w]
            negscore+=negdict[w]
        # ... then pairs of adjacent words
        twogram = [content[i]+' '+content[i+1] for i in range(len(content)-1)]
        for w in set(twogram):
            posscore+=posdict[w]
            negscore+=negdict[w]
        if (posscore>0.975*negscore):
            poscount+=1
    # Percentage classified correctly for the class this folder holds
    if isPositive:
        return 100*poscount/len(files)
    else:
        return 100-100*poscount/len(files)

In [None]:
posdict = defaultdict(int)
negdict = defaultdict(int)
readReview(path1,posdict)
readReview(path2,negdict)

In [None]:
for k,v in posdict.items():
    if not negdict[k]:
        negdict[k]=0.2
for k,v in negdict.items():
    if not posdict[k]:
        posdict[k]=0.2

In [None]:
for k,v in posdict.items():
    posdict[k]=np.log(v/900.0)
for k,v in negdict.items():
    negdict[k]=np.log(v/900.0)

In [None]:
print (predictionPercentage(path1test,posdict,negdict))
print (predictionPercentage(path2test,posdict,negdict,False))