# Naive Bayes: An Introduction

The problem: I want to know whether a sentence is about food or animals. This is a standard classification problem: I want to classify unknown sentences into one of two categories, "Food" or "Animal".

We could write a bunch of rules to do so, or we can train a machine learning classifier on pre-labeled data.

The question: given four labeled sentences:

I like to eat broccoli and bananas. : Food  
I ate a banana and spinach smoothie for breakfast. : Food  
Chinchillas and kittens are cute. : Animal  
My sister adopted a kitten yesterday.  : Animal

Into which category should we classify the following sentence:

My cute kittens like to eat bananas. : ??

We can use a Naive Bayes calculation to do this. Python has a number of nice packages for this, which we will learn on Wednesday, but we can do the calculation by hand as well.

**Note: Don't worry if you don't follow every line of code below. The goal is to gain an intuition about machine learning. On Wednesday we'll learn how to implement a variety of algorithms using Python's scikit-learn.**


We can think about this problem first in the form of a (variation on a) likelihood table.

In [1]:
#import a module that lets us print out nice tables
from tabulate import tabulate
#food/animal: number of sentences in each category that contain the word
#likelihood: share of all four sentences that contain the word
rows = [['and', 2, 1, '=3/4', .75], ['banana', 1, 0, '=1/4', .25], ['kitten', 0, 1, '=1/4', .25]]
table = tabulate(rows, headers=['word', 'food', 'animal', 'likelihood (fraction)', 'likelihood (decimal)'], tablefmt='orgtbl')
print(table)

| word   |   food |   animal | likelihood (fraction)   |   likelihood (decimal) |
|--------+--------+----------+-------------------------+------------------------|
| and    |      2 |        1 | =3/4                    |                   0.75 |
| banana |      1 |        0 | =1/4                    |                   0.25 |
| kitten |      0 |        1 | =1/4                    |                   0.25 |
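The counts in the table can be reproduced directly from the four labeled sentences. The sketch below is one way to do it in plain Python; the names `labeled`, `tokenize`, and `likelihood_row` are made up for this illustration, and the tokenizer (lowercase, letters only) is an assumption rather than the method used later in the notebook.

```python
import re

# The four labeled training sentences from the example above.
labeled = [
    ("I like to eat broccoli and bananas.", "food"),
    ("I ate a banana and spinach smoothie for breakfast.", "food"),
    ("Chinchillas and kittens are cute.", "animal"),
    ("My sister adopted a kitten yesterday.", "animal"),
]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters as words.
    # Returning a set means we only record whether a word is present.
    return set(re.findall(r"[a-z]+", text.lower()))

def likelihood_row(word):
    # Count the sentences in each category that contain the word.
    food = sum(1 for s, c in labeled if c == "food" and word in tokenize(s))
    animal = sum(1 for s, c in labeled if c == "animal" and word in tokenize(s))
    # Share of all sentences that contain the word.
    return food, animal, (food + animal) / len(labeled)

for word in ["and", "banana", "kitten"]:
    print(word, *likelihood_row(word))
```

Note that "banana" and "bananas" are treated as different words here, which is why "banana" is counted in only one Food sentence.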


The formula behind Naive Bayes (Bayes' theorem):

P(cat1 | word1) = P(word1 | cat1) * P(cat1) / P(word1)
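The formula can be wrapped in a small helper so the three ingredients are named explicitly. This is only a sketch: `bayes_posterior` and its parameter names are hypothetical, chosen for this illustration.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(cat | word) = P(word | cat) * P(cat) / P(word).

    likelihood: P(word | cat), share of the category's sentences containing the word
    prior:      P(cat), share of all sentences that belong to the category
    evidence:   P(word), share of all sentences containing the word
    """
    return likelihood * prior / evidence

# Example: "banana" appears in 1 of 2 Food sentences, Food is 2 of 4 sentences,
# and "banana" appears in 1 of 4 sentences overall.
print(bayes_posterior(1/2, 2/4, 1/4))
```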

We can calculate this for each of the three words above:

P(Food | banana) = P(banana | Food) * P(Food) / P(banana)

In [2]:
p_food_banana =  (1/2) * (2/4) / (1/4)
print(p_food_banana)

1.0


P(Food | and) = P(and | Food) * P(Food) / P(and)

In [3]:
p_food_and =  (2/2) * (2/4) / (3/4)
print(p_food_and)

0.6666666666666666


P(Food | kitten) = P(kitten | Food) * P(Food) / P(kitten)

In [4]:
p_food_kitten =  (0/2) * (2/4) / (1/4)
print(p_food_kitten)

0.0


To estimate how likely it is that a toy sentence, 'banana and kitten', is about Food, we use a simple rule: add up the per-word probabilities and divide by the number of words.
P(Food | 'banana and kitten')

In [5]:
print((p_food_banana+p_food_and+p_food_kitten)/3)

0.5555555555555555


We would need to provide a cutoff that determines which category the sentence falls into. We could put the cutoff at .50, which would put this sentence in the "Food" category.
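The averaging and cutoff steps can be sketched as a single function. The name `classify_by_cutoff` is hypothetical, and treating a score exactly equal to the cutoff as "Food" is an assumption, since the text does not say how ties are handled.

```python
def classify_by_cutoff(word_probs, cutoff=0.5):
    # Average the per-word P(Food | word) values.
    score = sum(word_probs) / len(word_probs)
    # Assumption: a score at or above the cutoff counts as "Food".
    label = "Food" if score >= cutoff else "Animal"
    return score, label

# Per-word values for 'banana', 'and', 'kitten' computed above.
score, label = classify_by_cutoff([1.0, 2/3, 0.0])
print(score, label)
```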

We can automate this by transforming our text into a boolean document-term matrix (DTM) and determining the category for any example sentence.
Question: What does a boolean DTM mean?
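One way to think about the question: a count DTM records how many times each word appears in each document, while a boolean DTM records only whether it appears at all. The toy documents and the names below (`docs`, `count_dtm`, `bool_dtm`) are made up for this illustration, and the letters-only tokenizer is an assumption.

```python
import re
from collections import Counter

# Two toy documents; "banana" is repeated in the first one.
docs = ["banana banana kitten", "kitten and banana"]

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters as words.
    return re.findall(r"[a-z]+", text.lower())

# One column per distinct word, in alphabetical order.
vocab = sorted({w for d in docs for w in tokenize(d)})

# Count DTM: one row per document, each cell is the number of occurrences.
count_dtm = [[Counter(tokenize(d))[w] for w in vocab] for d in docs]

# Boolean DTM: each cell is 1 if the word occurs at least once, else 0.
bool_dtm = [[int(c > 0) for c in row] for row in count_dtm]

print(vocab)
print(count_dtm)
print(bool_dtm)
```

The only cell that differs between the two matrices is "banana" in the first document: 2 in the count DTM, 1 in the boolean DTM.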

In [6]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()

text_list = ['I like to eat broccoli and bananas', 'I ate a banana and spinach smoothie for breakfast.', 'Chinchillas and kittens are cute.', 'My sister adopted a kitten yesterday.']

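#get_feature_names_out() is the current method name; older scikit-learn versions used get_feature_names()
dtm_df = pandas.DataFrame(countvec.fit_transform(text_list).toarray(), columns=countvec.get_feature_names_out())
#astype returns a new DataFrame, so assign the result
#convert to boolean (word present or not), then back to 0/1 so we can sum the columns later
dtm_df = dtm_df.astype(bool).astype(int)
dtm_df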

,adopted,and,are,ate,banana,bananas,breakfast,broccoli,chinchillas,cute,...,for,kitten,kittens,like,my,sister,smoothie,spinach,to,yesterday
0,0,1,0,0,0,1,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0,1,0,1,1,0,1,0,0,0,...,1,0,0,0,0,0,1,1,0,0
2,0,1,1,0,0,0,0,0,1,1,...,0,0,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0,0,1


In [7]:
#create a category vector, containing the labels for our labeled sentences
#add this to our dtm
dtm_df['cat_vector'] = ['f', 'f', 'a', 'a']
dtm_df

,adopted,and,are,ate,banana,bananas,breakfast,broccoli,chinchillas,cute,...,kitten,kittens,like,my,sister,smoothie,spinach,to,yesterday,cat_vector
0,0,1,0,0,0,1,0,1,0,0,...,0,0,1,0,0,0,0,1,0,f
1,0,1,0,1,1,0,1,0,0,0,...,0,0,0,0,0,1,1,0,0,f
2,0,1,1,0,0,0,0,0,1,1,...,0,1,0,0,0,0,0,0,0,a
3,1,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0,1,a


In [8]:
#calculate the likelihood that each word occurs in each category
grouped = dtm_df.groupby('cat_vector').sum()/2
grouped

cat_vector,adopted,and,are,ate,banana,bananas,breakfast,broccoli,chinchillas,cute,...,for,kitten,kittens,like,my,sister,smoothie,spinach,to,yesterday
a,0.5,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.5,...,0.0,0.5,0.5,0.0,0.5,0.5,0.0,0.0,0.0,0.5
f,0.0,1.0,0.0,0.5,0.5,0.5,0.5,0.5,0.0,0.0,...,0.5,0.0,0.0,0.5,0.0,0.0,0.5,0.5,0.5,0.0


In [9]:
#Do the above calculation for each word in our test sentence
#we'll do the pre-processing in the variable assignment stage

test_sentence = 'my cute kittens like to eat bananas'
columns = test_sentence.split()
grouped_test = grouped[columns]
grouped_test

cat_vector,my,cute,kittens,like,to,eat,bananas
a,0.5,0.5,0.5,0.0,0.0,0.0,0.0
f,0.0,0.0,0.0,0.5,0.5,0.5,0.5


In [10]:
#calculate the likelihood that a word occurs in a sentence at all
dtm_df['kitten'].sum()/4

0.25

In [11]:
#apply naive bayes formula
#first create a new dataframe; .copy() makes it independent of grouped_test,
#so assigning into it does not trigger pandas' SettingWithCopyWarning
new_grouped = grouped_test.copy()
for e in columns:
    #P(Food | word) = P(word | Food) * P(Food) / P(word)
    new_grouped.loc['f', e] = grouped_test.loc['f', e] * ((2/4) / (dtm_df[e].sum()/4))
new_grouped.loc['f']

my         0.0
cute       0.0
kittens    0.0
like       1.0
to         1.0
eat        1.0
bananas    1.0
Name: f, dtype: float64

In [12]:
#sum likelihoods and divide by the number of words in the sentence
new_grouped.loc['f'].sum()/len(new_grouped.columns)

0.5714285714285714

If our cutoff is .50, this sentence would be classified into the "Food" category.
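The whole procedure can also be sketched without pandas or scikit-learn, which makes each step explicit and lets us score the "Animal" category as well. Everything named here (`train`, `tokenize`, `score`) is hypothetical. The tokenizer is an assumption meant to approximate `CountVectorizer`'s default behavior (lowercase, words of two or more characters), and giving a score of 0 to words never seen in training is a choice made for this sketch, not something the notebook specifies.

```python
import re

# Labeled training sentences: 'f' = Food, 'a' = Animal.
train = [
    ("I like to eat broccoli and bananas", "f"),
    ("I ate a banana and spinach smoothie for breakfast.", "f"),
    ("Chinchillas and kittens are cute.", "a"),
    ("My sister adopted a kitten yesterday.", "a"),
]

def tokenize(text):
    # Assumption: lowercase, keep words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def score(sentence, category):
    # Boolean view of each training sentence: the set of words it contains.
    docs = [set(tokenize(s)) for s, _ in train]
    labels = [c for _, c in train]
    n_docs = len(train)
    n_cat = labels.count(category)
    prior = n_cat / n_docs                      # P(cat)
    words = tokenize(sentence)
    if not words:
        return 0.0
    total = 0.0
    for w in words:
        n_w = sum(w in d for d in docs)
        if n_w == 0:
            continue                            # unseen word contributes 0 (assumption)
        n_w_cat = sum(w in d for d, c in zip(docs, labels) if c == category)
        likelihood = n_w_cat / n_cat            # P(word | cat)
        evidence = n_w / n_docs                 # P(word)
        total += likelihood * prior / evidence  # P(cat | word)
    # Average the per-word values over the words in the sentence.
    return total / len(words)

test_sentence = "my cute kittens like to eat bananas"
print("Food:", score(test_sentence, "f"))
print("Animal:", score(test_sentence, "a"))
```

With this data the Food score matches the value computed above, and the Animal score comes out lower, so both the cutoff rule and a direct comparison of the two scores point to "Food".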