# Naive Bayes



In [1]:
import pandas as pd
import numpy as np

eat = pd.read_csv("../data/does_james_eat.csv")
eat

Unnamed: 0,is_hungry,has_poptarts,is_driving,will_eat
0,yes,yes,no,yes
1,yes,no,no,yes
2,yes,no,yes,no
3,no,yes,yes,no
4,no,no,yes,no
5,yes,no,no,yes
6,yes,yes,no,yes
7,no,no,yes,no
8,yes,no,no,yes
9,yes,yes,no,yes
10,no,yes,no,yes


## Bayes' Theorem

$$ P(c|x) = \frac{P(x|c)\times{P(c)}}{P(x)}$$

$ P(c|x) $: posterior probability; probability of a class (c) given a predictor (x)
- example: probability that I will eat given that I am hungry

$ P(x|c) $: likelihood; probability of the predictor (x) given the class (c)
- example: probability that I am hungry given that I will eat

$ P(c) $: prior probability of the class

$ P(x) $: prior probability of the predictor



## Let's try this out
What is the probability that I will eat, given that I am hungry?

$$ P(\text{eat} \mid \text{hungry}) = \frac{P(\text{hungry} \mid \text{eat})\times{P(\text{eat})}}{P(\text{hungry})} $$

In [2]:
prob_will_eat = 7 / 11
prob_hungry_given_will_eat = 6/7
prob_hungry = 7/11  # P(hungry) = P(hungry | eat) * P(eat) + P(hungry | won't eat) * P(won't eat)

prob_will_eat_given_hungry = prob_will_eat * prob_hungry_given_will_eat/prob_hungry
print(prob_will_eat_given_hungry)

0.8571428571428571
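
The hard-coded fractions above can also be recovered by counting rows. The following is a minimal sketch, assuming the eleven rows of the table are typed in by hand as `(is_hungry, will_eat)` pairs; the names `rows`, `prior`, `likelihood`, and `evidence` are illustrative and not part of the notebook.

```python
from fractions import Fraction

# (is_hungry, will_eat) for rows 0..10 of the table, typed in by hand
rows = [
    ("yes", "yes"), ("yes", "yes"), ("yes", "no"), ("no", "no"),
    ("no", "no"), ("yes", "yes"), ("yes", "yes"), ("no", "no"),
    ("yes", "yes"), ("yes", "yes"), ("no", "yes"),
]

n = len(rows)
n_eat = sum(1 for hungry, eat in rows if eat == "yes")
n_hungry = sum(1 for hungry, eat in rows if hungry == "yes")
n_hungry_and_eat = sum(
    1 for hungry, eat in rows if hungry == "yes" and eat == "yes"
)

prior = Fraction(n_eat, n)                      # P(eat)
likelihood = Fraction(n_hungry_and_eat, n_eat)  # P(hungry | eat)
evidence = Fraction(n_hungry, n)                # P(hungry)

posterior = likelihood * prior / evidence       # P(eat | hungry)
print(prior, likelihood, evidence, posterior)   # 7/11 6/7 7/11 6/7
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the fractions in the cell above.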


What is the probability that I won't eat, given that I am hungry?

$$ P(\text{won't eat} \mid \text{hungry}) = \frac{P(\text{hungry} \mid \text{won't eat})\times{P(\text{won't eat})}}{P(\text{hungry})} $$

In [3]:
prob_wont_eat = 4 / 11
prob_hungry_given_wont_eat = 1/4
prob_hungry = 7/11  # P(hungry) = P(hungry | eat) * P(eat) + P(hungry | won't eat) * P(won't eat)

prob_wont_eat_given_hungry = prob_wont_eat * prob_hungry_given_wont_eat/prob_hungry
print(prob_wont_eat_given_hungry)

0.14285714285714288
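
As a worked check on the value $7/11$ used for the denominator: by the law of total probability it is the sum of the two class-weighted likelihoods, which is also why the two posteriors sum to 1.

$$ P(\text{hungry}) = P(\text{hungry} \mid \text{eat}) P(\text{eat}) + P(\text{hungry} \mid \text{won't eat}) P(\text{won't eat}) = \frac{6}{7}\times\frac{7}{11} + \frac{1}{4}\times\frac{4}{11} = \frac{7}{11} $$

$$ P(\text{eat} \mid \text{hungry}) + P(\text{won't eat} \mid \text{hungry}) = \frac{6}{7} + \frac{1}{7} = 1 $$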


Note: the prior probability of the predictor, $ P(x) $, is the same for every class and can be difficult to calculate, so it is often dropped from the calculation. What remains is no longer a probability but a score that is proportional to the posterior probability $ P(c|x) $.

In other words,

$$  P(x|c)\times{P(c)} \propto P(c|x) $$

So whereas we would previously pick the target class with the highest probability given that I am hungry, we now pick the class with the highest score.

In [4]:
prob_will_eat = 7 / 11
prob_hungry_given_will_eat = 6/7

score_will_eat_given_hungry = prob_will_eat * prob_hungry_given_will_eat
print("score proportional to P(eat | hungry):", score_will_eat_given_hungry)

prob_wont_eat = 4 / 11
prob_hungry_given_wont_eat = 1/4

score_wont_eat_given_hungry = prob_wont_eat * prob_hungry_given_wont_eat
print("score proportional to P(wont eat | hungry):", score_wont_eat_given_hungry)

score proportional to P(eat | hungry): 0.5454545454545454
score proportional to P(wont eat | hungry): 0.09090909090909091
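
If the actual probabilities are needed, the scores can be normalized so that they sum to 1. A minimal sketch, reusing the same fractions as the cell above (the variable names are illustrative):

```python
from fractions import Fraction

score_eat = Fraction(7, 11) * Fraction(6, 7)       # P(eat) * P(hungry | eat)
score_wont_eat = Fraction(4, 11) * Fraction(1, 4)  # P(won't eat) * P(hungry | won't eat)

total = score_eat + score_wont_eat                 # this sum equals P(hungry)
posterior_eat = score_eat / total
posterior_wont_eat = score_wont_eat / total

print(total, posterior_eat, posterior_wont_eat)    # 7/11 6/7 1/7
```

The normalized values match the posteriors computed earlier with the explicit denominator, and the ranking of the classes is the same before and after normalization.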


In [5]:
buffer = eat['will_eat'] == 'yes'
print("will eat")
print(eat[buffer])

print("won't eat")
print(eat[~buffer])

will eat
   is_hungry has_poptarts is_driving will_eat
0        yes          yes         no      yes
1        yes           no         no      yes
5        yes           no         no      yes
6        yes          yes         no      yes
8        yes           no         no      yes
9        yes          yes         no      yes
10        no          yes         no      yes
won't eat
  is_hungry has_poptarts is_driving will_eat
2       yes           no        yes       no
3        no          yes        yes       no
4        no           no        yes       no
7        no           no        yes       no


What if we want a prediction (eat or won't eat) given multiple inputs, such as being hungry and having poptarts?

If we assume the inputs are independent of each other given the class, it just becomes

$$ p(\text{eat} \mid X) \propto p(\text{hungry} \mid \text{eat}) \times p(\text{poptarts} \mid \text{eat}) \times p(\text{eat})$$

and

$$ p(\text{won't eat} \mid X) \propto p(\text{hungry} \mid \text{won't eat}) \times p(\text{poptarts} \mid \text{won't eat}) \times p(\text{won't eat})$$

- $ p(\text{hungry} \mid \text{eat}) = \frac{6}{7}$
- $ p(\text{poptarts} \mid \text{eat}) = \frac{4}{7}$
- $ p(\text{driving} \mid \text{eat}) = \frac{0}{7}$

- $ p(\text{hungry} \mid \text{won't eat}) = \frac{1}{4}$
- $ p(\text{poptarts} \mid \text{won't eat}) = \frac{1}{4} $
- $ p(\text{driving} \mid \text{won't eat}) = \frac{4}{4} $

- $ p(\text{eat}) =  \frac{7}{11} $
- $ p(\text{won't eat}) = \frac{4}{11} $


So let's plug those values in: given a new observation with hungry = 'yes' and poptarts = 'yes', what should we predict about whether or not eating will happen?




In [6]:
prob_will_eat = 7 / 11
prob_wont_eat = 4 / 11
prob_hungry_given_will_eat = 6/7
prob_poptart_given_will_eat = 4/7
prob_hungry_given_wont_eat = 1/4
prob_poptart_given_wont_eat = 1/4

score_will_eat = prob_will_eat * prob_hungry_given_will_eat * prob_poptart_given_will_eat
score_wont_eat = prob_wont_eat * prob_hungry_given_wont_eat * prob_poptart_given_wont_eat

print('score will eat:', score_will_eat)
print("score won't eat:", score_wont_eat)



score will eat: 0.3116883116883116
score won't eat: 0.022727272727272728
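
The same scores can be computed for any combination of inputs by counting rows instead of typing fractions. This is a minimal sketch in plain Python, assuming the eleven rows of the table are copied in by hand; `FEATURES`, `ROWS`, and `nb_score` are illustrative names, not part of the notebook.

```python
from fractions import Fraction

# Columns: is_hungry, has_poptarts, is_driving, will_eat (rows 0..10, typed in by hand)
FEATURES = ["is_hungry", "has_poptarts", "is_driving"]
ROWS = [
    ("yes", "yes", "no", "yes"),
    ("yes", "no", "no", "yes"),
    ("yes", "no", "yes", "no"),
    ("no", "yes", "yes", "no"),
    ("no", "no", "yes", "no"),
    ("yes", "no", "no", "yes"),
    ("yes", "yes", "no", "yes"),
    ("no", "no", "yes", "no"),
    ("yes", "no", "no", "yes"),
    ("yes", "yes", "no", "yes"),
    ("no", "yes", "no", "yes"),
]


def nb_score(observation, label):
    """Return P(label) times the product of P(feature = value | label), by counting."""
    in_class = [row for row in ROWS if row[-1] == label]
    score = Fraction(len(in_class), len(ROWS))  # prior P(label)
    for name, value in observation.items():
        col = FEATURES.index(name)
        matches = sum(1 for row in in_class if row[col] == value)
        score *= Fraction(matches, len(in_class))  # P(feature = value | label)
    return score


obs = {"is_hungry": "yes", "has_poptarts": "yes"}
score_will_eat = nb_score(obs, "yes")
score_wont_eat = nb_score(obs, "no")
print(score_will_eat, score_wont_eat)  # 24/77 1/44
```

Here `24/77` and `1/44` are the exact forms of the two scores printed above. No smoothing is applied in this sketch, so a feature value never seen with a class gives a score of 0.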


Okay, what if we take all of the inputs, that is, a new observation with hungry = 'yes', poptarts = 'yes', and driving = 'yes'?



In [7]:
prob_will_eat = 7 / 11
prob_wont_eat = 4 / 11
prob_hungry_given_will_eat = 6/7
prob_poptart_given_will_eat = 4/7
prob_driving_given_will_eat = 0/7
prob_hungry_given_wont_eat = 1/4
prob_poptart_given_wont_eat = 1/4
prob_driving_given_wont_eat = 4/4

score_will_eat = prob_will_eat * prob_hungry_given_will_eat * prob_poptart_given_will_eat * prob_driving_given_will_eat
score_wont_eat = prob_wont_eat * prob_hungry_given_wont_eat * prob_poptart_given_wont_eat * prob_driving_given_wont_eat

print('score will eat:', score_will_eat)
print("score won't eat:", score_wont_eat)

score will eat: 0.0
score won't eat: 0.022727272727272728


**Why? Why did we get 0 for our score for `will eat`?**

Well, it has to do with $ P(driving | eat)$. This value is $ 0 $, and $ 0 $ multiplied by any other value is $ 0 $. How do we prevent this? We add a smoothing value $ \alpha $ to every count so that no conditional probability is estimated as exactly $ 0 $ (Laplace smoothing). Each input here takes two values (yes or no), so with $ \alpha = 1 $ each estimate becomes $ \frac{\text{count} + 1}{n_{\text{class}} + 2} $. The class priors are left unchanged.

In [8]:
prob_will_eat = 7 / 11
prob_wont_eat = 4 / 11
prob_hungry_given_will_eat = 7/9    # (6 + 1) / (7 + 2)
prob_poptart_given_will_eat = 5/9   # (4 + 1) / (7 + 2)
prob_driving_given_will_eat = 1/9   # (0 + 1) / (7 + 2)
prob_hungry_given_wont_eat = 2/6    # (1 + 1) / (4 + 2)
prob_poptart_given_wont_eat = 2/6   # (1 + 1) / (4 + 2)
prob_driving_given_wont_eat = 5/6   # (4 + 1) / (4 + 2)

score_will_eat = prob_will_eat * prob_hungry_given_will_eat * prob_poptart_given_will_eat * prob_driving_given_will_eat
score_wont_eat = prob_wont_eat * prob_hungry_given_wont_eat * prob_poptart_given_wont_eat * prob_driving_given_wont_eat

print('score will eat:', round(score_will_eat, 4))
print("score won't eat:", round(score_wont_eat, 4))

score will eat: 0.0306
score won't eat: 0.0337
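
The smoothed estimates can also be derived from the raw counts. This is a minimal sketch, assuming the standard add-$\alpha$ (Laplace) estimate $(\text{count} + \alpha) / (n_{\text{class}} + \alpha k)$, where $k$ is the number of values a feature can take ($k = 2$ for yes/no inputs) and the class priors are left unsmoothed. The helper `smoothed` and the dictionaries are illustrative names, not part of the notebook.

```python
from fractions import Fraction


def smoothed(count, class_total, alpha=1, n_values=2):
    """Add-alpha estimate of P(feature = value | class)."""
    return Fraction(count + alpha, class_total + alpha * n_values)


# Number of "yes" values per class for (hungry, poptarts, driving)
yes_counts = {"eat": (6, 4, 0), "wont_eat": (1, 1, 4)}
class_totals = {"eat": 7, "wont_eat": 4}
n_rows = 11

scores = {}
for label, counts in yes_counts.items():
    score = Fraction(class_totals[label], n_rows)  # unsmoothed prior
    for count in counts:
        score *= smoothed(count, class_totals[label])
    scores[label] = score

print({label: round(float(s), 4) for label, s in scores.items()})
```

With these assumptions no factor is ever exactly zero, so a single unseen combination such as driving while eating can no longer wipe out the whole score.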


## Naive Bayes

- For years, the best spam filtering methods used naive Bayes.

- Classification technique based on Bayes’ Theorem **with an assumption of independence among predictors** - hence the Naive. 
    - The presence of a particular feature in a class is unrelated to the presence of any other feature.
    - This is like saying: if you are hungry, the symptoms (growling stomach, mouth watering, weakness, ...) manifest independently of one another.

E.g., you receive a spam email that contains the words "Money", "URGENT!", and "Prize!". Even if these features depend on each other or on other features, naive Bayes treats each of them as contributing independently to the probability that the email is spam.

- Naive Bayes is easy to build and useful for very large data sets. 

- Despite its simplicity, naive Bayes can be competitive with much more sophisticated classification methods, and it works well with text data.

## Naive Bayes Classifier


Before understanding the theory, let's try `scikit-learn`'s implementation of Naive Bayes on Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

#### We will use `CountVectorizer` to get a bag-of-words (BOW) representation. We will come back to this later in more detail!

- `CountVectorizer` converts text data into feature vectors where
    - each feature is a unique word in the text  
    - each feature value represents the frequency or presence/absence of the word in the given message         
    
<img src='../images/bag-of-words.png' width="600">

[Source](https://web.stanford.edu/~jurafsky/slp3/4.pdf)     
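
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is a simplification and not how `CountVectorizer` is implemented: the tokenizer (lowercase runs of letters and digits) and the function name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter


def bag_of_words(messages):
    """Return (vocabulary, count vectors) for a list of messages."""
    # Simplified tokenizer: lowercase, keep runs of letters and digits
    tokenized = [re.findall(r"[a-z0-9]+", m.lower()) for m in messages]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocab, vectors = bag_of_words(["URGENT! Free prize", "free free lunch"])
print(vocab)    # ['free', 'lunch', 'prize', 'urgent']
print(vectors)  # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

Each message becomes one row of counts, and word order is discarded, which is exactly the information loss the "bag" in bag-of-words refers to.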


In [9]:
# And import the libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# pip install git+git://github.com/mgelbart/plot-classifier.git
from plot_classifier import plot_classifier

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# train test split and cross validation
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

pd.set_option("display.max_colwidth", 200)

In [10]:
sms_df = pd.read_csv("../data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

In [11]:
train_df, test_df = train_test_split(sms_df, test_size=0.2, random_state=123)
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]

train_df.head()

Unnamed: 0,target,sms
385,ham,It took Mr owl 3 licks
4003,ham,Well there's a pattern emerging of my friends telling me to drive up and come smoke with them and then telling me that I'm a weed fiend/make them smoke too much/impede their doing other things so ...
1283,ham,Yes i thought so. Thanks.
2327,spam,"URGENT! Your mobile number *************** WON a å£2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm"
1103,ham,Aiyah sorry lor... I watch tv watch until i forgot 2 check my phone.


In [12]:
from sklearn.naive_bayes import MultinomialNB

pipe_nb = make_pipeline(CountVectorizer(), MultinomialNB())

pipe_nb.fit(X_train, y_train)
print("Training Acc.: ", pipe_nb.score(X_train, y_train))
print("Test Acc.: ", pipe_nb.score(X_test, y_test))

Training Acc.:  0.9932690150325331
Test Acc.:  0.9865470852017937


### Naive Bayes `predict`

- Given a new message, we want to predict whether it's spam or non spam (ham).
- Example: Predict whether the following message is spam or non spam (ham). 
> "URGENT! Free!!"

In [13]:
deploy_test = ["URGENT! Free!!", "I like learning about stats!"]
pipe_nb.predict(deploy_test)

array(['spam', 'ham'], dtype='<U4')

### Probabilistic classifiers: `predict` by hand 

- What is it doing under the hood?
- Let's look at an example with a toy dataset. 

In [14]:
X = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Block 2 has interesting courses.",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
    "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!",
    "Block 2 has been interesting so far.",
]
y = ["spam", "non spam", "non spam", "spam", "spam", "non spam"]

In this quick example, we aren't going to look at all of the possible words but only four of them. In other words, each observation will have just four input features, each giving the number of occurrences of one of those four words.

In [15]:
pipe_nb_toy = make_pipeline(CountVectorizer(max_features = 4, stop_words='english'), MultinomialNB())
pipe_nb_toy.fit(X, y);

In [16]:
data = pipe_nb_toy['countvectorizer'].transform(X)
train_bow_df = pd.DataFrame(data.toarray(), columns=pipe_nb_toy['countvectorizer'].get_feature_names_out(), index=X)
train_bow_df['target'] = y
train_bow_df

Unnamed: 0,block,free,prize,urgent,target
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,0,1,1,spam
Lol you are always so convincing.,0,0,0,0,non spam
Block 2 has interesting courses.,1,0,0,0,non spam
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,0,1,1,1,spam
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!,0,1,0,0,spam
Block 2 has been interesting so far.,1,0,0,0,non spam


Suppose we are given text messages in `deploy_test` and we want to find the targets for these examples, how do we do it using naive Bayes?

First, let's get a numeric representation of our text messages.

In [17]:
deploy_test = ["URGENT! Free!!", "I like Week 5 block better."]
data = pipe_nb_toy['countvectorizer'].transform(deploy_test).toarray()
bow_df = pd.DataFrame(data, columns=pipe_nb_toy['countvectorizer'].get_feature_names_out(), index=deploy_test)
bow_df

Unnamed: 0,block,free,prize,urgent
URGENT! Free!!,0,1,0,1
I like Week 5 block better.,1,0,0,0


### Naive Bayes prediction idea

Suppose we want to predict whether the following message is "spam" or "non spam".
> "URGENT! Free!!"

Representation of the message: `[0, 1, 0, 1]`

To predict the class, naive Bayes calculates the following probability scores using Bayes' theorem.

- $P(\text{spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$
- $P(\text{non spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$
- **It picks the label with the higher probability score**.

### Applying Bayes' theorem 

Uses Bayes' theorem to calculate probabilities:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

$$P(\text{spam} \mid \text{message}) = \frac{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})}{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)}$$

$$P(\text{non spam} \mid \text{message}) = \frac{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{non spam}) \times P(\text{non spam})}{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)}$$

- $P(\text{message})$: marginal probability that a message has the given set of words 
    - Hard to calculate but can be ignored in our scenario as it occurs in the denominator for both $P(\text{spam} \mid \text{message})$ and $P(\text{non spam} \mid \text{message})$.
    - So we ignore the denominator in both cases. 
    
### Let's focus on $P(\text{spam} \mid \text{message})$

- After ignoring the denominator: 
$$P(\text{spam} \mid \text{message}) \propto P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})$$

- To calculate $P(\text{spam} \mid \text{message})$, we need:  
    - $P(\text{spam})$: marginal probability that a message is spam
    - $P(\text{message}\mid\text{spam})$: conditional probability that message has words $w_1, w_2, \dots, w_d$, given that it is spam.
        - Hard to estimate directly because it would require a huge number of parameters and an impossibly large training set. But we need it.
        - With $d$ binary features, there are $2^d$ possible "text messages".
        - We cannot possibly have training data covering all of them.
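
To get a feel for the numbers, the sketch below counts how many free probabilities would have to be estimated with and without an independence assumption. It assumes $d$ binary features and two classes; the function names are illustrative.

```python
def n_parameters_full_joint(d, n_classes=2):
    """Free parameters for a full joint table over d binary features, per class."""
    return n_classes * (2 ** d - 1)


def n_parameters_independent(d, n_classes=2):
    """Free parameters when each binary feature is modelled separately per class."""
    return n_classes * d


for d in (4, 20, 100):
    print(d, n_parameters_full_joint(d), n_parameters_independent(d))
```

Even for the four-word toy vocabulary used in this section the full table needs 30 numbers against 8, and the gap grows exponentially with the vocabulary size.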

### Going back to estimating $P(\text{spam} | \text{message})$

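With the naive Bayes assumption, the features are treated as conditionally independent given the class, so the joint likelihood factors into a product of per-word terms. In the multinomial model used here, a word with count 0 contributes a factor of 1, so only the words that appear in the message matter. To estimate $P(\text{spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$, written more briefly as $P(\text{spam} \mid \text{free}, \text{urgent})$, we need the following: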
1. Prior probability: $P(\text{spam})$ 
2. Conditional probabilities (calculated from training): 
    1. $P(\text{free} = 1 \mid \text{spam})$
    2. $P(\text{urgent} = 1 \mid \text{spam})$

We use our training data to calculate these probabilities. 

In [18]:
train_bow_df

Unnamed: 0,block,free,prize,urgent,target
URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!,0,0,1,1,spam
Lol you are always so convincing.,0,0,0,0,non spam
Block 2 has interesting courses.,1,0,0,0,non spam
URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!,0,1,1,1,spam
Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!,0,1,0,0,spam
Block 2 has been interesting so far.,1,0,0,0,non spam


- Prior probability
    - $P(\text{spam}) = 3/6$
    - $P(\text{non spam}) = 3/6$
    
- Conditional probabilities: occurrences of the word in messages of the class, divided by the total number of occurrences of our four words in that class (6 for spam, 2 for non spam)
    - $P(\text{block} \mid \text{spam}) = 0/6$
    - $P(\text{free}  \mid \text{spam}) = 2/6 = 1/3$
    - $P(\text{prize} \mid \text{spam}) = 2/6 = 1/3$
    - $P(\text{urgent} \mid \text{spam}) = 2/6 = 1/3$
    - $P(\text{block} \mid \text{non spam}) = 2/2$
    - $P(\text{free}  \mid \text{non spam}) = 0/2$
    - $P(\text{prize} \mid \text{non spam}) = 0/2$
    - $P(\text{urgent} \mid \text{non spam}) = 0/2$

In [19]:
## URGENT! free

spam_prior = 3/6
free_spam = 2/6
urgent_spam = 2/6
print(round(spam_prior * free_spam * urgent_spam, 4))

notspam_prior = 3/6
free_notspam = 0/2
urgent_notspam = 0/2
print(round(notspam_prior * free_notspam * urgent_notspam, 4))

0.0556
0.0
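
The counting behind these estimates can be sketched in plain Python. This assumes the multinomial view, where $P(\text{word} \mid \text{class})$ is the word's occurrences in the class divided by all occurrences of the four vocabulary words in that class, with no smoothing. The rows are copied by hand from `train_bow_df`; `VOCAB`, `TRAIN`, `word_probs`, and `score` are illustrative names.

```python
from fractions import Fraction

VOCAB = ["block", "free", "prize", "urgent"]
# Bag-of-words rows copied from train_bow_df: (counts, target)
TRAIN = [
    ([0, 0, 1, 1], "spam"),
    ([0, 0, 0, 0], "non spam"),
    ([1, 0, 0, 0], "non spam"),
    ([0, 1, 1, 1], "spam"),
    ([0, 1, 0, 0], "spam"),
    ([1, 0, 0, 0], "non spam"),
]


def word_probs(label):
    """P(word | label): occurrences of the word over all word occurrences in the class."""
    totals = [0] * len(VOCAB)
    for counts, target in TRAIN:
        if target == label:
            totals = [t + c for t, c in zip(totals, counts)]
    return {w: Fraction(t, sum(totals)) for w, t in zip(VOCAB, totals)}


def score(message_counts, label):
    """Prior times the product of P(word | label) ** count over the vocabulary."""
    n_label = sum(1 for _, target in TRAIN if target == label)
    result = Fraction(n_label, len(TRAIN))
    probs = word_probs(label)
    for word, count in zip(VOCAB, message_counts):
        result *= probs[word] ** count  # a count of 0 contributes a factor of 1
    return result


message = [0, 1, 0, 1]  # "URGENT! Free!!"
print(score(message, "spam"), score(message, "non spam"))  # 1/18 0
```

Counting this way gives $P(\text{free} \mid \text{spam}) = 2/6$, because "free" occurs in two spam messages, and the non spam score collapses to 0 because neither word was ever seen in a non spam message.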


### What is our toy pipeline's prediction? 

In [20]:
deploy_test = ["URGENT! Free!!"]
pipe_nb_toy.predict(deploy_test)

array(['spam'], dtype='<U8')

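### Adding an $\alpha$
Let's deal with those zero counts. We add $\alpha = 1$ to each word count so that no conditional probability is $0$. To keep the probabilities for each class summing to 1, the denominator grows by $\alpha$ times the number of words in the vocabulary (4 here): $6 + 4 = 10$ for spam and $2 + 4 = 6$ for non spam. The priors are unchanged.

In [21]:
## URGENT! free

spam_prior = 3/6
free_spam = 3/10      # (2 + 1) / (6 + 4)
urgent_spam = 3/10    # (2 + 1) / (6 + 4)
print(round(spam_prior * free_spam * urgent_spam, 4))

notspam_prior = 3/6
free_notspam = 1/6    # (0 + 1) / (2 + 4)
urgent_notspam = 1/6  # (0 + 1) / (2 + 4)
print(round(notspam_prior * free_notspam * urgent_notspam, 4))

0.045
0.0139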
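
Putting the pieces together, the sketch below computes smoothed scores in log space and then normalizes them into posteriors. It assumes the add-$\alpha$ estimate $(\text{count} + \alpha) / (\text{total} + \alpha \cdot |V|)$ with $|V| = 4$ vocabulary words, and per-class word totals summed by hand from `train_bow_df`. The names `WORD_COUNTS`, `CLASS_COUNTS`, and `log_score` are illustrative, and this is not scikit-learn's code.

```python
import math

VOCAB = ["block", "free", "prize", "urgent"]
# Total word occurrences per class, summed by hand from train_bow_df
WORD_COUNTS = {"spam": [0, 2, 2, 2], "non spam": [2, 0, 0, 0]}
CLASS_COUNTS = {"spam": 3, "non spam": 3}


def log_score(message_counts, label, alpha=1.0):
    """log P(label) + sum over words of count * log P(word | label), add-alpha smoothed."""
    counts = WORD_COUNTS[label]
    denom = sum(counts) + alpha * len(VOCAB)
    total = math.log(CLASS_COUNTS[label] / sum(CLASS_COUNTS.values()))
    for count_in_class, count_in_message in zip(counts, message_counts):
        total += count_in_message * math.log((count_in_class + alpha) / denom)
    return total


message = [0, 1, 0, 1]  # "URGENT! Free!!"
logs = {label: log_score(message, label) for label in WORD_COUNTS}

# Normalize so the two scores sum to 1 (subtract the max for numerical stability)
shift = max(logs.values())
unnorm = {label: math.exp(v - shift) for label, v in logs.items()}
norm = sum(unnorm.values())
posteriors = {label: u / norm for label, u in unnorm.items()}
print(posteriors)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when a message contains many words, and it leaves the ranking of the classes unchanged.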
