## Naive Bayes

- prior: what can be inferred about a situation before any new information arrives
- posterior: the updated inference made after gaining new information
- Bayes' theorem: start with the probability of some event, $P(A)$. We then observe a related event $R$, whose probability given $A$ is $P(R|A)$. Bayes' theorem infers the probability of $A$ given $R$, $P(A|R)$.
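
The last bullet can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and the numbers passed to it are hypothetical and are not taken from these notes.

```python
from fractions import Fraction


def posterior(prior_a, p_r_given_a, p_r_given_not_a):
    """Return P(A|R) from P(A), P(R|A) and P(R|not A) using Bayes' theorem."""
    # P(R) by the law of total probability
    p_r = p_r_given_a * prior_a + p_r_given_not_a * (1 - prior_a)
    return p_r_given_a * prior_a / p_r


# hypothetical numbers: P(A) = 1/2, P(R|A) = 3/4, P(R|not A) = 1/4
p_a_given_r = posterior(Fraction(1, 2), Fraction(3, 4), Fraction(1, 4))
print(p_a_given_r)  # 3/4
```

Using `Fraction` keeps the arithmetic exact, which makes the result easy to compare with a hand calculation.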

.6, and the probability that Brenda did not wear red is .4. We can put these together and get

![bayes0](img/bayes0.png)

![](img/bayes1.png)

## Naive Bayes Example
- this theorem is best explained using an example. 
![](img/bayes2.png)


Suppose we have received these emails: 3 spam and 5 not spam. Call the not-spam emails ham.
We can figure out the probability of an email being spam, $P(spam) =\frac{3}{8}$, and ham, $P(ham) =\frac{5}{8}$.
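
As a small sketch, the same priors can be computed from a list of labels. The list `labels` is an assumption that simply mirrors the 3 spam and 5 ham emails described here.

```python
from collections import Counter
from fractions import Fraction

# labels of the 8 emails in the example: 3 spam and 5 ham
labels = ['spam'] * 3 + ['ham'] * 5

counts = Counter(labels)
total = sum(counts.values())

# prior probability of each class = class count / total count
priors = {label: Fraction(n, total) for label, n in counts.items()}
print(priors['spam'], priors['ham'])  # 3/8 5/8
```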

Now we look at the spam emails: $\frac{1}{3}$ of them contain the word 'easy' and the rest, $\frac{2}{3}$, do not, so $P('easy'|spam)=\frac{1}{3}$. Multiplying by the prior gives the probability that an email is spam and contains 'easy': $P(spam \cap 'easy')=\frac{1}{3}*\frac{3}{8}=\frac{1}{8}$, and $P(spam \cap 'easy'^c)=\frac{2}{3}*\frac{3}{8}=\frac{1}{4}$.
We do the same for ham: $\frac{1}{5}$ of the ham emails contain the word 'easy' and $\frac{4}{5}$ do not, so $P(ham \cap 'easy') = \frac{5}{8}*\frac{1}{5}=\frac{1}{8}$ and $P(ham \cap 'easy'^c) = \frac{5}{8}*\frac{4}{5}=\frac{1}{2}$.

We can do the same for the word 'money': $P(spam \cap 'money')= \frac{1}{4}$, $P(spam \cap 'money'^c) = \frac{1}{8}$, $P(ham \cap 'money')= \frac{1}{8}$ and $P(ham \cap 'money'^c)= \frac{1}{2}$.

These are the things we know: how often spam and ham emails contain the word 'easy' or 'money'. We wish to infer whether an email that contains the word 'easy' or 'money' is spam. In other words, we know $P('easy'|spam)$ and wish to infer $P(spam|'easy')$.

We can do this with Bayes' theorem: we take the event that we wish to know about and normalize it over all the emails that contain the word.

$P(spam|'easy')=\frac{P(spam \cap 'easy')}{P(spam \cap 'easy') + P(ham \cap 'easy')}=\frac{\frac{1}{8}}{\frac{1}{8} + \frac{1}{8}}=\frac{1}{2}$

$P(ham|'easy')=\frac{1}{2}$

$P(spam|'money') = \frac{2}{3}$

$P(ham|'money') = \frac{1}{3}$
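
The normalization above can be checked with a short sketch that works directly from email counts. The 'easy' counts come from the text; the 'money' counts (2 of 3 spam, 1 of 5 ham) are inferred from the probabilities quoted in these notes, and the names `n_with_word` and `class_given_word` are hypothetical.

```python
from fractions import Fraction

n_emails = {'spam': 3, 'ham': 5}
# number of emails of each class that contain each word
# ('money' counts are inferred from the probabilities quoted in the notes)
n_with_word = {
    'easy': {'spam': 1, 'ham': 1},
    'money': {'spam': 2, 'ham': 1},
}
total = sum(n_emails.values())


def class_given_word(word):
    """Return {class: probability that an email containing `word` has that class}."""
    # joint probability of (class and word) = emails of that class with the word / all emails
    joint = {c: Fraction(n_with_word[word][c], total) for c in n_emails}
    evidence = sum(joint.values())  # probability that an email contains the word
    return {c: joint[c] / evidence for c in joint}


easy = class_given_word('easy')
money = class_given_word('money')
print(easy['spam'], easy['ham'])    # 1/2 1/2
print(money['spam'], money['ham'])  # 2/3 1/3
```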

## Naive Bayes Algorithm 

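First, let's remember that $P(A \& B) = P(A \cap B) = P(A)P(B)$ iff A and B are independent, that $P(A|B)P(B) = P(B|A)P(A)$, and therefore that $P(A|B)\propto P(B|A)P(A)$ when B is fixed.

We can use this: we want $P(spam|'easy', 'money')$ and know that it is proportional to $P('easy','money'|spam)P(spam)$. The naive assumption is that the words are independent of each other given the class, so $P('easy','money'|spam)$ factors into $P('easy'|spam)P('money'|spam)$. That is to say: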

$P(spam|'easy', 'money') \propto P('easy','money'|spam)P(spam) \\
 P(spam|'easy', 'money') \propto P('easy'|spam)P('money'|spam)P(spam)\\
 P(spam|'easy', 'money') \propto \frac{1}{3}*\frac{2}{3}*\frac{3}{8} \\
 P(spam|'easy', 'money') \propto \frac{1}{12}$
 
Similarly, $P(ham|'easy', 'money') \propto P('easy'|ham)P('money'|ham)P(ham) = \frac{1}{5}*\frac{1}{5}*\frac{5}{8} = \frac{1}{40}$
 
Now we have values proportional to $P(spam|'easy', 'money')$ and $P(ham|'easy', 'money')$. An email is either spam or ham, so the two posteriors must sum to 1, and we normalize:

$P(spam|'easy', 'money') = \frac{P('easy'|spam)P('money'|spam)P(spam)}{P('easy'|spam)P('money'|spam)P(spam) + P('easy'|ham)P('money'|ham)P(ham)} \\ 
 P(spam|'easy', 'money') = \frac{\frac{1}{12}}{\frac{1}{12} + \frac{1}{40}}\\
 P(spam|'easy', 'money') = \frac{10}{13}$

Similarly, $P(ham|'easy', 'money') = \frac{3}{13}$
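
The two-word calculation can be reproduced with exact fractions. This is a sketch under the values used in this section; the ham probability for 'money' ($\frac{1}{5}$) is inferred from the $\frac{1}{40}$ result, and the dictionary names are illustrative.

```python
from fractions import Fraction

priors = {'spam': Fraction(3, 8), 'ham': Fraction(5, 8)}
# per-class word probabilities P(word|class) used in the text
# (the ham value for 'money' is inferred from the 1/40 result)
likelihoods = {
    'spam': {'easy': Fraction(1, 3), 'money': Fraction(2, 3)},
    'ham': {'easy': Fraction(1, 5), 'money': Fraction(1, 5)},
}

# unnormalized scores: P(easy|class) * P(money|class) * P(class)
scores = {c: likelihoods[c]['easy'] * likelihoods[c]['money'] * priors[c]
          for c in priors}
# normalize so that the two posteriors sum to 1
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

print(scores['spam'], scores['ham'])          # 1/12 1/40
print(posteriors['spam'], posteriors['ham'])  # 10/13 3/13
```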
 
From this example we can see that if we wish to find the probability of an email being spam or ham given some list of words $w_1, w_2, ..., w_n$, we first find<br>
$P(spam|w_1,w_2,..., w_n) \propto P(w_1,w_2,...,w_n|spam)P(spam) \\ 
 P(spam|w_1,w_2,..., w_n) \propto P(w_1|spam)P(w_2|spam)*...*P(w_n|spam)P(spam)$

then<br>
$P(ham|w_1,w_2,..., w_n) \propto P(w_1,w_2,...,w_n|ham)P(ham) \\ 
 P(ham|w_1,w_2,..., w_n) \propto P(w_1|ham)P(w_2|ham)*...*P(w_n|ham)P(ham)$

and normalize: <br>
probability of spam <br>
$\frac{P(w_1|spam)P(w_2|spam)*...*P(w_n|spam)P(spam)}{P(w_1|spam)P(w_2|spam)*...*P(w_n|spam)P(spam)+ P(w_1|ham)P(w_2|ham)*...*P(w_n|ham)P(ham)}$

probability of ham<br>
$\frac{P(w_1|ham)P(w_2|ham)*...*P(w_n|ham)P(ham)}{P(w_1|spam)P(w_2|spam)*...*P(w_n|spam)P(spam)+ P(w_1|ham)P(w_2|ham)*...*P(w_n|ham)P(ham)}$
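
The general recipe (multiply the per-word probabilities by the prior for each class, then normalize) might be sketched as a single function. The name `naive_bayes_posteriors` and its argument layout are assumptions made for this illustration; the values in the usage lines are the ones from the two-word example.

```python
from fractions import Fraction


def naive_bayes_posteriors(words, likelihoods, priors):
    """Return {class: P(class | words)} under the naive independence assumption."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[c][w]  # multiply in P(w|class)
        scores[c] = score
    total = sum(scores.values())  # normalizing constant
    return {c: s / total for c, s in scores.items()}


priors = {'spam': Fraction(3, 8), 'ham': Fraction(5, 8)}
likelihoods = {
    'spam': {'easy': Fraction(1, 3), 'money': Fraction(2, 3)},
    'ham': {'easy': Fraction(1, 5), 'money': Fraction(1, 5)},
}
result = naive_bayes_posteriors(['easy', 'money'], likelihoods, priors)
print(result['spam'], result['ham'])  # 10/13 3/13
```

In this sketch a word that is missing from `likelihoods` raises a `KeyError`, and a word with probability 0 in every class would make the normalizing constant 0. Handling such words is left out on purpose.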

In [1]:
# dataset: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
import pandas as pd
import zipfile, requests, io
 
zip_file_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'

r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))

df = pd.read_table(z.open('SMSSpamCollection'), sep='\t', names=['label','sms_message'])

# Output printing out first 5 rows
df.head()


| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
# map ham to 0 spam to 1
df['label0'] = df.label.map({'ham':0, 'spam':1})
df.head()

| | label | sms_message | label0 |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [14]:
## before moving on implement bag of words 

import string
import pprint
from collections import Counter

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# letters to lower case
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
# print(lower_case_documents)

# remove punctuation (Python 3): the translation table deletes every punctuation character
# Python 2 equivalent: i.translate(None, string.punctuation)
sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))
# print(sans_punctuation_documents)

#tokenize 
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))
#print(preprocessed_documents)

#count the word frequency 
frequency_list = []
for i in preprocessed_documents:
    frequency_list.append(Counter(i))
    
#pprint.pprint(frequency_list)
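
The per-document counters can be turned into a document-term matrix by fixing a column order for the vocabulary. The sketch below starts from token lists equivalent to `preprocessed_documents` in the cell above; the names `docs`, `vocabulary` and `matrix` are illustrative.

```python
from collections import Counter

# token lists equivalent to preprocessed_documents in the cell above
docs = [['hello', 'how', 'are', 'you'],
        ['win', 'money', 'win', 'from', 'home'],
        ['call', 'me', 'now'],
        ['hello', 'call', 'hello', 'you', 'tomorrow']]

counts = [Counter(doc) for doc in docs]
# sorted vocabulary: one column per distinct word
vocabulary = sorted(set(word for doc in docs for word in doc))
# one row per document, one count per vocabulary word
# (a Counter returns 0 for words it has not seen)
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so the second document gets a 2 in the 'win' column and 0 for every word it does not contain.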

In [18]:
## do the same with sklearn sklearn.feature_extraction.text.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

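count_vector = CountVectorizer()

# learn the vocabulary from the documents
count_vector.fit(documents)
# count_vector.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

# rows are documents, columns are vocabulary words
doc_array = count_vector.transform(documents).toarray()
# doc_array

frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
# frequency_matrix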



In [20]:
# split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
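
To illustrate why the test data is only transformed and never fitted, here is a pure-Python sketch of the fit/transform split. The helpers `fit_vocabulary` and `transform` and the toy messages are hypothetical, tokenization is reduced to lowercasing and splitting on whitespace, and this is not a description of how `CountVectorizer` works internally.

```python
def fit_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def transform(documents, vocabulary):
    """Count vocabulary words per document; words outside the vocabulary are dropped."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for w in doc.lower().split():
            if w in vocabulary:
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows


# hypothetical toy messages, not taken from the SMS dataset
train = ['win money now', 'call me now']
test = ['win a prize now']

vocab = fit_vocabulary(train)        # columns are fixed by the training data
train_rows = transform(train, vocab)
test_rows = transform(test, vocab)   # 'a' and 'prize' are unseen, so they are ignored

print(vocab)
print(test_rows)
```

Because the columns are fixed by the training data, the training and test matrices have the same width, and nothing about the test messages leaks into the features.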
