## Naive Bayes classification

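Can you tell whether a restaurant review is positive or negative? Which words are most indicative? 
This notebook examines review data from Yelp, provided as part of the Yelp Academic Challenge.

By Bayes' rule, and with the naive assumption that the words $w_1, ..., w_N$ of a review are conditionally independent given its polarity, the probability that a review is positive is: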

\begin{align}
P(positive|w_1, w_2, w_3, ..., w_N) & = \frac{P(w_1, w_2, w_3, ..., w_N | positive)P(positive)}{P(w_1, w_2, w_3, ..., w_N)} \\
&= \frac{P(w_1, w_2, w_3, ..., w_N | positive)P(positive)}{P(w_1, w_2, w_3, ..., w_N | positive)P(positive) + P(w_1, w_2, w_3, ..., w_N | negative)P(negative)} \\
&\approx \frac{P(positive)\prod_i P(w_i|positive)}{P(positive)\prod_i P(w_i|positive) + P(negative)\prod_i P(w_i|negative) }
\end{align}
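
To make the last line of the derivation concrete, here is a minimal sketch that evaluates it for a two-word review. The priors and per-word probabilities are hypothetical toy numbers, not estimates from the Yelp data, and `posterior_positive` is a name introduced only for this illustration.

```python
import math

# Hypothetical toy numbers, not estimated from the Yelp data.
prior = {"positive": 0.5, "negative": 0.5}
word_given_class = {
    "positive": {"great": 0.05, "food": 0.02},
    "negative": {"great": 0.01, "food": 0.02},
}

def posterior_positive(words):
    # P(class) * product of P(w_i | class), one score per class.
    score = {
        label: prior[label] * math.prod(word_given_class[label][w] for w in words)
        for label in prior
    }
    # Normalize by the sum over both classes, as in the denominator above.
    return score["positive"] / (score["positive"] + score["negative"])

print(posterior_positive(["great", "food"]))
```

With these toy numbers the positive score is 0.0005 and the negative score is 0.0001, so the posterior is 5/6. The word "food" is equally likely in both classes and has no effect on the result.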

### Create a pattern

The first step is to count the words in each of the two *training* sub-collections. They are in two files, `positive.txt` and `negative.txt`. 

Each line in these files is one review. 

Break reviews into words using regular expressions.



In [2]:
import re
def word_pattern(string): 
    pos_pattern = re.compile(r"\w+")  # raw string, so \w is not treated as a string escape
    return pos_pattern.findall(string)
word_pattern("Hello y'all, my name is Jenna!")
        

['Hello', 'y', 'all', 'my', 'name', 'is', 'Jenna']
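
Note that `\w+` splits "y'all" into two tokens. If contractions should stay whole, one possible variant is to allow apostrophes inside a token. This is only a sketch: `word_pattern_keep_apostrophes` is a hypothetical name, the rest of this notebook keeps using `word_pattern`, and switching tokenizers would change every count reported below.

```python
import re

# Hypothetical alternative pattern: word characters plus apostrophes.
APOSTROPHE_PATTERN = re.compile(r"[\w']+")

def word_pattern_keep_apostrophes(string):
    return APOSTROPHE_PATTERN.findall(string)

print(word_pattern_keep_apostrophes("Hello y'all, my name is Jenna!"))
# ['Hello', "y'all", 'my', 'name', 'is', 'Jenna']
```

A side effect of this pattern is that quotation marks written as apostrophes stay attached to the quoted word.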

### Specify variables to store data

To apply Bayes' rule to estimate the probability of polarity given a review, we need the probability of words given polarity and the probability of positive or negative polarity.

In [3]:
from collections import Counter

num_pos_reviews = 0 
pos_word_count = Counter()

num_neg_reviews = 0
neg_word_count = Counter()

### Read data from files


In [4]:
##prob_pos = number of positive reviews / number of all reviews 

with open("positive.txt", "r") as pos_reader:
    for line in pos_reader: 
        review_words_pos = word_pattern(line)
        pos_word_count.update(review_words_pos)
        num_pos_reviews += 1
        
print(pos_word_count.most_common(5))

with open("negative.txt", "r") as neg_reader:
    for line in neg_reader: 
        review_words_neg = word_pattern(line)
        neg_word_count.update(review_words_neg)
        num_neg_reviews += 1

print(neg_word_count.most_common(5))
    
##prob_neg = number of negative reviews / number of all reviews 

[('the', 16168), ('and', 15149), ('I', 12937), ('a', 10532), ('to', 9411)]
[('the', 6275), ('I', 5076), ('and', 4483), ('to', 4348), ('a', 3326)]


### Prior probability

Calculate the probability that a randomly selected review is positive. 

In [5]:
prob_pos = num_pos_reviews / (num_pos_reviews + num_neg_reviews)
print(prob_pos)

0.8026043819760231


At first this seems surprising because I expected it to be around 50%. The value simply reflects the training data: `positive.txt` contains about four times as many reviews as `negative.txt`.

### Count word totals


In [6]:
positive_tokens = sum(pos_word_count.values())
print(positive_tokens)

negative_tokens = sum(neg_word_count.values())
print(negative_tokens)


415727
153876


### Calculate word probability (version 1.0)

`word_prob`: takes three arguments: a word, a counter, and a sum. Returns the probability of the word estimated from that counter.

In [7]:
def word_prob(word, counter, _sum): 
    word_count = counter[word]
    return word_count/_sum

print(word_prob("delicious", pos_word_count, positive_tokens))
print(word_prob("delicious", neg_word_count, negative_tokens))
print(word_prob("manager", pos_word_count, positive_tokens))
print(word_prob("manager", neg_word_count, negative_tokens))
print(word_prob("edgy", pos_word_count, positive_tokens))
print(word_prob("edgy", neg_word_count, negative_tokens))
print(word_prob("vile", pos_word_count, positive_tokens))
print(word_prob("vile", neg_word_count, negative_tokens))

0.001282091372463179
5.8488653201278955e-05
0.00016837972996702164
0.0011047856715797136
4.810849427629189e-06
0.0
0.0
6.498739244586551e-06
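
Comparing the two estimates for the same word gives a rough sense of how indicative that word is. The sketch below takes the ratio of the probabilities printed above for "delicious" and "manager"; the `probs` and `ratios` names are introduced here for illustration only.

```python
# Probabilities copied from the output above:
# (P(word | positive), P(word | negative)).
probs = {
    "delicious": (0.001282091372463179, 5.8488653201278955e-05),
    "manager": (0.00016837972996702164, 0.0011047856715797136),
}

# A ratio above 1 means the word is relatively more common in positive reviews.
ratios = {word: p_pos / p_neg for word, (p_pos, p_neg) in probs.items()}

for word, ratio in ratios.items():
    print(f"{word}: {ratio:.2f}")
```

"delicious" comes out roughly 22 times more likely in positive reviews, while "manager" is between 6 and 7 times more likely in negative reviews. The ratio is undefined for words such as "edgy" and "vile", which have a probability of 0 in one of the collections.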


### Probability of a sequence (version 1.0)

`review_prob`: Takes a string containing multiple words (for example "the food was delicious"), a counter, and a sum. Returns the product of those probabilities.

In [8]:
def review_prob(string, counter, _sum):
    string_words = word_pattern(string)
    word_probs = 1
    for word in string_words: 
        word_probs = word_probs*word_prob(word, counter, _sum)
    return word_probs

print(review_prob("I loved the carnitas", pos_word_count, positive_tokens))
print(review_prob("I loved the carnitas", neg_word_count, negative_tokens))
print(review_prob("but then the manager came out and told us", pos_word_count, positive_tokens))
print(review_prob("but then the manager came out and told us", neg_word_count, negative_tokens))
print(review_prob("the ambience was edgy but the food was vile", pos_word_count, positive_tokens))
print(review_prob("the ambience was edgy but the food was vile", neg_word_count, negative_tokens))


1.2254483237140511e-11
5.113217890059061e-13
9.508351338741093e-25
2.4727715407612935e-22
0.0
0.0


It makes sense that the review containing "edgy" and "vile" has a probability of 0 under both models: "vile" has a probability of 0 in positive reviews and "edgy" has a probability of 0 in negative reviews, and a single zero factor makes the whole product zero. More generally, every additional word multiplies the product by a number smaller than 1, so longer reviews and reviews with rarer words receive smaller probabilities. 
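
Even without zero counts, multiplying many small probabilities runs into floating-point underflow. The following sketch uses a hypothetical 100-word review in which every word has probability 1e-5 (made-up values, not taken from the Yelp counts) to show the product collapsing to 0.0 while the sum of logs stays finite.

```python
import math

# Hypothetical per-word probabilities for a 100-word review.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p

# Summing logs is equivalent to taking the log of the product.
log_sum = sum(math.log10(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # close to -500
```

This is one reason to work with log probabilities for longer texts.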

### Calculate word probability (version 2.0)

`word_log_prob`: Calculates the log probability of a word.

`review_log_prob`: Calculates the sum of the log probabilities of the words in a string.

In [9]:
import math
def word_log_prob(word, counter,_sum): 
    word_count = counter[word]
    return math.log10(word_count/_sum)

def review_log_prob(string, counter, _sum):
    string_words = word_pattern(string)
    log_word_probs = 0
    for word in string_words: 
        log_word_probs = log_word_probs + word_log_prob(word, counter, _sum)
    return log_word_probs

print(word_log_prob("delicious", pos_word_count, positive_tokens))
print(word_log_prob("delicious", neg_word_count, negative_tokens))
print(word_log_prob("manager", pos_word_count, positive_tokens))
print(word_log_prob("manager", neg_word_count, negative_tokens))
print(word_log_prob("edgy", pos_word_count, positive_tokens))
#print(word_log_prob("edgy", neg_word_count, negative_tokens))
#print(word_log_prob("vile", pos_word_count, positive_tokens))
# The two lines above are commented out because they raise an error: "edgy" has a
# probability of 0 in negative reviews and "vile" has a probability of 0 in positive
# reviews, and the log of 0 is undefined.
print(word_log_prob("vile", neg_word_count, negative_tokens))

print(review_log_prob("I loved the carnitas", pos_word_count, positive_tokens))
print(review_log_prob("I loved the carnitas", neg_word_count, negative_tokens))
print(review_log_prob("but then the manager came out and told us", pos_word_count, positive_tokens))
print(review_log_prob("but then the manager came out and told us", neg_word_count, negative_tokens))
#print(review_log_prob("the ambience was edgy but the food was vile", pos_word_count, positive_tokens))
#print(review_log_prob("the ambience was edgy but the food was vile", neg_word_count, negative_tokens))

# Again, these raise an error because you can't take the log of a zero probability.

-2.8920810222879743
-4.232928378875812
-3.7737101913002897
-2.956721966936863
-5.317778235650565
-5.187170888315137
-10.911704997915798
-12.29130570024548
-24.02189477923062
-21.606816006230684


### Collection statistics

In [10]:
both_counts = pos_word_count + neg_word_count
both_tokens = positive_tokens + negative_tokens
vocabulary_size = len(both_counts)
print(vocabulary_size)

24773


### Calculate word probability (version 3.0)

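`smoothed_word_log_prob`: Takes the same arguments as `word_prob` (word, counter, sum of the counter), but adds 1 to the count for the word before taking the log, so that unseen words no longer have a probability of 0. The counter itself is not modified; the 1 is added inside the function. The implementation below also adds 1 to the sum. 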

In [11]:
def smoothed_word_log_prob(word, counter, _sum):
    word_count = counter[word] + 1
    return math.log10(word_count/(_sum+1))

print(smoothed_word_log_prob("delicious", pos_word_count, positive_tokens))
print(smoothed_word_log_prob("delicious", neg_word_count, negative_tokens))
print(smoothed_word_log_prob("manager", pos_word_count, positive_tokens))
print(smoothed_word_log_prob("manager", neg_word_count, negative_tokens))
print(smoothed_word_log_prob("edgy", pos_word_count, positive_tokens))
print(smoothed_word_log_prob("edgy", neg_word_count, negative_tokens))
print(smoothed_word_log_prob("vile", pos_word_count, positive_tokens))
print(smoothed_word_log_prob("vile", neg_word_count, negative_tokens))

-2.8912680189474136
-4.1871737106725595
-3.7675509272568948
-2.9541776002804054
-5.141688021256308
-5.1871737106725595
-5.61880927597597
-4.886143715008578
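
The function above adds 1 to the total, which removes the zero probabilities but does not make the smoothed probabilities sum to 1 over the vocabulary. The textbook form of add-one (Laplace) smoothing adds the vocabulary size to the denominator instead. Here is a minimal sketch of that variant, assuming a hypothetical function name `laplace_word_log_prob` and toy counts. It is not the function used for the outputs in this notebook and would give different numbers.

```python
import math
from collections import Counter

def laplace_word_log_prob(word, counter, _sum, vocabulary_size):
    # Add 1 to the word's count and vocabulary_size to the total, so the
    # smoothed probabilities over the whole vocabulary sum to 1.
    return math.log10((counter[word] + 1) / (_sum + vocabulary_size))

# Hypothetical toy counts, not the Yelp data.
toy_counts = Counter({"good": 3, "bad": 1})
toy_total = sum(toy_counts.values())
toy_vocab = ["good", "bad", "vile"]  # "vile" is unseen in toy_counts

total_prob = sum(
    10 ** laplace_word_log_prob(w, toy_counts, toy_total, len(toy_vocab))
    for w in toy_vocab
)
print(total_prob)  # approximately 1.0
```

In this notebook the fourth argument would be `vocabulary_size` from the collection statistics cell.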


### Test the classifier
Given a review, determine whether that review is more likely to be positive or negative. 

In [12]:
def smoothed_review_log_prob(string, counter, _sum):
    string_words = word_pattern(string)
    log_word_probs = 0
    for word in string_words: 
        log_word_probs = log_word_probs + smoothed_word_log_prob(word, counter, _sum)
    return log_word_probs


words_g_pos = smoothed_review_log_prob("the ambience was edgy but the food was disgusting", pos_word_count, positive_tokens)
# log10 of P(positive) * P(words | positive): logs add where probabilities multiply
pos_conditional = math.log10(prob_pos) + words_g_pos


words_g_neg = smoothed_review_log_prob("the ambience was edgy but the food was disgusting", neg_word_count, negative_tokens)
# log10 of P(negative) * P(words | negative)
neg_conditional = math.log10(1-prob_pos) + words_g_neg

# A log ratio above 0 favors positive; below 0 favors negative
ratio2 = pos_conditional - neg_conditional
print(ratio2)

-0.3026502767593442


In [13]:
ratiofind = (math.log10(.9)+words_g_pos) - (math.log10(1-.9)+words_g_neg)
print(ratiofind)

0.04242821371768102


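With the prior estimated from the data (P(Pos) of about 0.80), the log ratio is -0.3027, which is less than 0, so the review is more likely to be negative. With a new suggested prior of P(Pos) = 0.9 and P(Neg) = 0.1, the log ratio is 0.0424, which is greater than 0, so the review would be more likely to be positive instead.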
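
A log ratio can be converted back into a posterior probability, which may be easier to interpret than the sign alone. The sketch below assumes the ratio is a base-10 log of the posterior odds, as computed in the cells above; `posterior_from_log10_ratio` is a hypothetical helper name.

```python
def posterior_from_log10_ratio(log_ratio):
    # log_ratio = log10 P(pos | review) - log10 P(neg | review),
    # so the odds are 10 ** log_ratio and P(pos | review) = odds / (1 + odds).
    return 1 / (1 + 10 ** (-log_ratio))

print(posterior_from_log10_ratio(-0.3026502767593442))  # prior estimated from the data
print(posterior_from_log10_ratio(0.04242821371768102))  # prior of 0.9
```

The first value is about 0.33 and the second about 0.52, so under either prior the classifier is not strongly confident about this review.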