# Naïve Bayes
Probabilistic classifiers can give you a prediction and a probability - something you can't get from nearest neighbour classifiers.

## Lazy vs Eager Learners
Lazy learners store the training set and must consult it every time they classify a new instance, while eager learners build a model from the training data and use that model to classify new data. Classification with an eager learner tends to be faster.

## Probability
Prior probability of a hypothesis: P(h). For a coin: P(heads) = 0.5. Rolling 1 with a die: P(1) = 1/6. Probability person is woman if even number of men and women: P(Female) = 0.5.

If you look up the university a person attends (say Stanford) and find that it is 86% female, you may revise the probability of a random person there being female. You can denote this as P(h | D), the probability of hypothesis h given data D: P(Female | attends Stanford) = 0.86. The formula is:

$$
P(A|B) = \frac{P(A \cap B)}
{P(B)}
$$

Terms
- P(h) is the prior probability before evidence
- P(h|D) is the posterior probability of h after we observe data D. Also called conditional probability.

## Laptop and Phone Probabilities

| Name  |  Laptop |  Phone |
|:-:|:-:|:-:|
| Kate  | PC  | Android  |
| Tom  | PC |  Android |
| Harry  | PC |  Android |
| Annika  | Mac  | iPhone  |
| Naomi  | Mac  | Android  |
| Joe  | Mac  | iPhone  |
| Chakotay  | Mac  | iPhone  |
| Neelix  | Mac  | Android  |
| Kes  | PC | iPhone  |
| B’Elanna  | Mac  | iPhone  |

### Probability randomly selected person has an iPhone?
P(iPhone) = 5/10 = 0.5

### Probability randomly selected person has an iPhone if they already have a Mac?
$$
P(iPhone|mac) = \frac{P(mac \cap iPhone)}
{P(mac)}
$$

4 people have both a Mac and an iPhone: $P(mac \cap iPhone) = 4/10 = 0.4$  
Probability random person has a mac: $P(mac) = 6/10 = 0.6$

Probability that someone uses an iPhone if that person has a Mac:
$$
P(iPhone|mac) = \frac{0.4}
{0.6} = 0.667
$$

### Probability  someone owns a Mac if they also own an iPhone?
$$
P(mac|iPhone) = \frac{P(iPhone \cap mac)}
{P(iPhone)} = \frac{4/10}
{5/10} = \frac{0.4}{0.5} = 0.8
$$
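The two conditional probabilities above can also be checked by counting rows. The following is a minimal sketch using only the standard library; the `conditional` helper and the predicate names are illustrative choices, not part of the original notebook.

```python
# Ownership table from the section above: (name, laptop, phone).
people = [
    ("Kate", "PC", "Android"),
    ("Tom", "PC", "Android"),
    ("Harry", "PC", "Android"),
    ("Annika", "Mac", "iPhone"),
    ("Naomi", "Mac", "Android"),
    ("Joe", "Mac", "iPhone"),
    ("Chakotay", "Mac", "iPhone"),
    ("Neelix", "Mac", "Android"),
    ("Kes", "PC", "iPhone"),
    ("B'Elanna", "Mac", "iPhone"),
]


def conditional(rows, event, given):
    """Estimate P(event | given) by counting; both arguments are predicates on a row."""
    given_rows = [r for r in rows if given(r)]
    both = [r for r in given_rows if event(r)]
    return len(both) / len(given_rows)


def has_mac(row):
    return row[1] == "Mac"


def has_iphone(row):
    return row[2] == "iPhone"


p_iphone_given_mac = conditional(people, has_iphone, has_mac)
p_mac_given_iphone = conditional(people, has_mac, has_iphone)
print(round(p_iphone_given_mac, 3), round(p_mac_given_iphone, 3))  # 0.667 0.8
```

Counting within the rows that satisfy the condition is equivalent to dividing the joint probability by the probability of the condition, because the total number of people cancels out.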

## Shopping Cart Example
We want to decide whether to show you an ad for green tea, based on how likely you are to purchase it.  
P(D): probability that some training data will be observed. If half the people live in post code 88005, then the probability of someone being from that post code is P(88005) = 0.5.  
P(D|H): probability that a value holds given the hypothesis. For example, the probability that someone lives in 88005 given they've bought green tea is P(88005|green tea).

In [124]:
import pandas as pd
import math

customers = {   'Customer ID': range(10),
                'Zipcode': [88005, 88001, 88001, 88005, 88003, 88005, 88005, 88001, 88005, 88003],
                'Bought Organic': [True, False, True, False, True, False, False, False, True, True],
                'Bought Green Tea': [True, False, True, False, False, True, False, False, True, True]}

df = pd.DataFrame(customers)
df

Unnamed: 0,Bought Green Tea,Bought Organic,Customer ID,Zipcode
0,True,True,0,88005
1,False,False,1,88001
2,True,True,2,88001
3,False,False,3,88005
4,False,True,4,88003
5,True,False,5,88005
6,False,False,6,88005
7,False,False,7,88001
8,True,True,8,88005
9,True,True,9,88003


## Probability someone who bought green tea lives in post code 88005
P(D|H): P(88005 | green tea)

In [28]:
num_bought_green_tea = sum(df['Bought Green Tea'])
bought_tea_zip = df[(df['Bought Green Tea']==True) & (df['Zipcode']==88005)]
bought_tea_zip

Unnamed: 0,Bought Green Tea,Bought Organic,Customer ID,Zipcode
0,True,True,0,88005
5,True,False,5,88005
8,True,True,8,88005


In [29]:
num_bought_tea_and_live_88005 = len(bought_tea_zip['Zipcode'])

print('{} / {} = {}'.format(num_bought_tea_and_live_88005, num_bought_green_tea, num_bought_tea_and_live_88005 / num_bought_green_tea))

3 / 5 = 0.6


## Opposite: Probability someone who did NOT buy green tea lives in post code 88005
P(D|H): P(88005 | !green tea)

In [30]:
num_didnt_buy_green_tea = len(df['Bought Green Tea']) - sum(df['Bought Green Tea'])
no_tea_zip = df[(df['Bought Green Tea']==False) & (df['Zipcode']==88005)]
no_tea_zip

Unnamed: 0,Bought Green Tea,Bought Organic,Customer ID,Zipcode
3,False,False,3,88005
6,False,False,6,88005


In [33]:
num_didnt_buy_tea_live_88005 = len(no_tea_zip['Zipcode'])
print('{} / {} = {}'.format(num_didnt_buy_tea_live_88005, num_didnt_buy_green_tea, num_didnt_buy_tea_live_88005 / num_didnt_buy_green_tea))

2 / 5 = 0.4


## Probability of someone living in post code 88001?

In [37]:
lives_88001 = df[df['Zipcode'] == 88001]
lives_88001

Unnamed: 0,Bought Green Tea,Bought Organic,Customer ID,Zipcode
1,False,False,1,88001
2,True,True,2,88001
7,False,False,7,88001


In [43]:
p = len(lives_88001['Zipcode']) / len(df['Zipcode'])
p

0.3

# Bayes Theorem
Describes the relationship between P(h), P(h | D), P(D), and P(D | h)

$$
P(h | D) = \frac{P(D | h)P(h)}{P(D)}
$$

Can use this theorem to decide between multiple hypotheses. For example, if you're given some data you can use this to determine what sport someone plays.

In the tea example, we have two hypotheses:
1. They will buy green tea: P(buy tea | 88005)
2. They will not buy green tea: P(-buy tea | 88005)

Once we calculate that the probability they will buy green tea is 0.6, we can say it's likely they will make the purchase.

## Electronics Store
3 sales fliers to send in email, they can show:
1. Laptop
2. Desktop
3. Tablet

Using the information we have about the customer, we want to send the flier that is most likely to generate a sale. The hypotheses for which flier is best:  
$P(laptop | D) = \frac{P(D | laptop)P(laptop)}{P(D)}$  
$P(desktop | D) = \frac{P(D | desktop)P(desktop)}{P(D)}$  
$P(tablet | D) = \frac{P(D | tablet)P(tablet)}{P(D)}$

We can refer to these hypotheses as h₁, h₂, h₃ etc. They tend to be the different classes we want to predict, e.g. different sports, or has disease vs does not have disease. Once we've calculated all the probabilities, we can pick the hypothesis that is most likely, called **the maximum a posteriori hypothesis** or $h_{MAP}$.

$$
h_{MAP} = \arg\max_{h \in H} P(h|D)
$$

H is the set of all hypotheses, so this says: for each hypothesis h in H, compute P(h|D) and pick the hypothesis with the highest probability. With Bayes' theorem, we convert this to:

$$
h_{MAP} = \arg\max_{h \in H} \frac{P(D|h)P(h)}
{P(D)}
$$

P(D) is independent of the hypotheses: it is the same for every h. So you can skip the division and compare the numerators P(D|h)P(h) directly; the hypothesis with the largest numerator is the most likely.

## Determine if patient has cancer or not
- 0.8% of people have some form of cancer
- The test gives a binary result: positive or negative
- Cancer present: the test returns a correct positive result 98% of the time
- Cancer absent: the test returns a correct negative result 97% of the time

Therefore I can work out:
- P(cancer) = 0.008
- P(-cancer) = 0.992
- P(POS|cancer) = 0.98
- P(NEG|cancer) = 0.02
- P(POS|-cancer) = 0.03
- P(NEG|-cancer) = 0.97

If you have a blood test done and the test result is positive, is it more likely that you have cancer than that you don't?
- P(POS|cancer) P(cancer) = 0.98 * 0.008 = 0.00784
- P(POS|-cancer) P(-cancer) = 0.03 * 0.992 = 0.02976

The second value is larger, so the most likely hypothesis is that the person does not have cancer. To determine how likely the person is to have cancer, normalise:

$$
P(cancer|POS) = \frac{0.00784}{0.00784 + 0.02976} = 0.21
$$

There is a 21% chance of having cancer.
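The cancer calculation can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original notebook; the dictionary keys and variable names are assumptions chosen for readability.

```python
# Figures stated in the bullet list above.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03  # 1 - 0.97, the false positive rate

# Numerator of Bayes' theorem, P(POS | h) * P(h), for each hypothesis.
scores = {
    "cancer": p_pos_given_cancer * p_cancer,
    "no cancer": p_pos_given_no_cancer * (1 - p_cancer),
}

# The MAP hypothesis is the one with the largest numerator.
h_map = max(scores, key=scores.get)

# Dividing by the sum of the numerators (which equals P(POS)) gives posteriors.
total = sum(scores.values())
posterior = {h: s / total for h, s in scores.items()}

print(h_map, round(posterior["cancer"], 2))  # no cancer 0.21
```

Because the denominator P(POS) is shared by both hypotheses, it is only needed when you want an actual probability rather than just the winner.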

## Why use Bayes Theorem?

The example of the green tea and zipcode from earlier presented us with two hypotheses:
- P(h₁|D) = P(buygreentea|88005)
- P(h₂|D) = P(-buygreentea|88005)

The first can be rewritten as

$$
P(buygreentea | 88005) = \frac{P(88005 | buygreentea)P(buygreentea)}
{P(88005)}
$$

Since we can calculate it directly from data in the table, why use the equation? Because normally it's hard to calculate P(h|D) directly.
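For this small table both routes are available, so it is easy to check that Bayes' theorem agrees with the direct count. The sketch below uses plain Python lists copied from the customer data; the variable names are illustrative.

```python
# Zipcode and green tea columns from the customer table above.
zipcodes = [88005, 88001, 88001, 88005, 88003, 88005, 88005, 88001, 88005, 88003]
bought_tea = [True, False, True, False, False, True, False, False, True, True]
n = len(zipcodes)

# Number of customers who both live in 88005 and bought green tea.
tea_and_zip = sum(1 for z, t in zip(zipcodes, bought_tea) if t and z == 88005)

p_tea = sum(bought_tea) / n                    # P(h)
p_zip = zipcodes.count(88005) / n              # P(D)
p_zip_given_tea = tea_and_zip / sum(bought_tea)  # P(D | h)

# P(h | D) computed two ways.
direct = tea_and_zip / zipcodes.count(88005)
via_bayes = p_zip_given_tea * p_tea / p_zip

print(round(direct, 3), round(via_bayes, 3))  # 0.6 0.6
```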

## Naïve Bayes and Green Tea Problem
Use more evidence than a single piece of data when calculating the probability with Bayes theorem. In the tea example, we have two types of evidence: zip code and whether a person has purchased organic items. To calculate the probability of a hypothesis, we multiply the individual probabilities.

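What is the probability that someone who lives in the 88005 zipcode and bought organic items will buy green tea?

Dropping the common denominator P(88005 & organic), we compare the numerators:

- P(tea|88005 & organic) ∝ P(88005|tea)P(organic|tea)P(tea) = .6(.8)(.5) = .24
- P(-tea|88005 & organic) ∝ P(88005|-tea)P(organic|-tea)P(-tea) = .4(.2)(.5) = .04

In the table, 1 of the 5 people who did not buy green tea bought organic items, so P(organic|-tea) = .2. A person who lives in 88005 and buys organic is more likely to buy green tea than not.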
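The naive Bayes scores for the two hypotheses can be computed straight from the customer table. This is a minimal sketch with the three columns copied as plain lists; the `score` helper is a hypothetical name, and the values it returns are unnormalised numerators rather than probabilities.

```python
# Columns from the customer table above.
zipcodes = [88005, 88001, 88001, 88005, 88003, 88005, 88005, 88001, 88005, 88003]
organic = [True, False, True, False, True, False, False, False, True, True]
tea = [True, False, True, False, False, True, False, False, True, True]


def score(bought_tea):
    """Return P(88005 | h) * P(organic | h) * P(h) for h = bought_tea."""
    rows = [(z, o) for z, o, t in zip(zipcodes, organic, tea) if t == bought_tea]
    prior = len(rows) / len(tea)
    p_zip = sum(1 for z, _ in rows if z == 88005) / len(rows)
    p_organic = sum(1 for _, o in rows if o) / len(rows)
    return p_zip * p_organic * prior


score_tea = score(True)
score_no_tea = score(False)
print("buys tea" if score_tea > score_no_tea else "does not buy tea")
```

Multiplying the per-feature likelihoods is what makes the method "naive": it treats zipcode and organic purchases as independent once the class is known.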

## Exercise Wearable Example
A company makes two models of health wearable: the i100 (heart rate, GPS, etc.) and the i500 (features of the i100 plus blood oxygen levels and a 3G connection to the site). When someone buys a model, they fill out a questionnaire.

In [125]:
questionaire = {'Main Interest': ['both', 'both', 'health', 'appearance', 'appearance', 'appearance', 'health', 'both', 'both', 'appearance', 'both', 'health', 'health', 'appearance', 'health'],
                'Current Exercise Level': ['sedentary', 'sedentary', 'sedentary', 'active', 'moderate', 'moderate', 'moderate', 'active', 'moderate', 'active', 'active', 'active', 'sedentary', 'active', 'sedentary'],
                'How Motivated': ['moderate', 'moderate', 'moderate', 'moderate', 'aggressive', 'aggressive', 'aggressive', 'moderate', 'aggressive', 'aggressive', 'aggressive', 'moderate', 'aggressive', 'moderate', 'moderate'],
                'Comfotable with Tech': ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'no'],
                'Model #': ['i100', 'i100', 'i500', 'i500', 'i500', 'i100', 'i500', 'i100', 'i500', 'i500', 'i500', 'i500', 'i500', 'i100', 'i100']}

In [126]:
df = pd.DataFrame(questionaire)
df

Unnamed: 0,Comfotable with Tech,Current Exercise Level,How Motivated,Main Interest,Model #
0,yes,sedentary,moderate,both,i100
1,no,sedentary,moderate,both,i100
2,yes,sedentary,moderate,health,i500
3,yes,active,moderate,appearance,i500
4,yes,moderate,aggressive,appearance,i500
5,no,moderate,aggressive,appearance,i100
6,no,moderate,aggressive,health,i500
7,yes,active,moderate,both,i100
8,yes,moderate,aggressive,both,i500
9,yes,active,aggressive,appearance,i500


Using naive Bayes, which model should you recommend for someone whose:
- main interest is health
- current exercise level is moderate
- moderately motivated
- comfortable with technological devices

Two hypotheses:
- P(i100|health, modexercise, modmotivation, techcomfortable)
- P(i500|health, modexercise, modmotivation, techcomfortable)

Probabilities
- P(i100) = 6/15 = 0.4
- P(i500) = 9/15 = 0.6

In [127]:
n = df.shape[0]
model_grouped = df.groupby('Model #')  # group by a single column label so get_group takes a plain key
i100 = model_grouped.get_group('i100')
i500 = model_grouped.get_group('i500')
i500

Unnamed: 0,Comfotable with Tech,Current Exercise Level,How Motivated,Main Interest,Model #
2,yes,sedentary,moderate,health,i500
3,yes,active,moderate,appearance,i500
4,yes,moderate,aggressive,appearance,i500
6,no,moderate,aggressive,health,i500
8,yes,moderate,aggressive,both,i500
9,yes,active,aggressive,appearance,i500
10,no,active,aggressive,both,i500
11,no,active,moderate,health,i500
12,yes,sedentary,aggressive,health,i500


In [128]:
# conditional probability: number of people with health as priority that had also bought an i100
P = {   'i100': i100.shape[0] / n,
        'i500': i500.shape[0] / n,
        'health|i100': i100[i100['Main Interest'] == 'health'].shape[0] / i100.shape[0],
        'modexercise|i100': i100[i100['Current Exercise Level'] == 'moderate'].shape[0] / i100.shape[0],
        'modmotivation|i100': i100[i100['How Motivated'] == 'moderate'].shape[0] / i100.shape[0],
        'techcomfortable|i100': i100[i100['Comfotable with Tech'] == 'yes'].shape[0] / i100.shape[0],
    
        'health|i500': i500[i500['Main Interest'] == 'health'].shape[0] / i500.shape[0],
        'modexercise|i500': i500[i500['Current Exercise Level'] == 'moderate'].shape[0] / i500.shape[0],
        'modmotivation|i500': i500[i500['How Motivated'] == 'moderate'].shape[0] / i500.shape[0],
        'techcomfortable|i500': i500[i500['Comfotable with Tech'] == 'yes'].shape[0] / i500.shape[0],
    }
P

{'health|i100': 0.16666666666666666,
 'health|i500': 0.4444444444444444,
 'i100': 0.4,
 'i500': 0.6,
 'modexercise|i100': 0.16666666666666666,
 'modexercise|i500': 0.3333333333333333,
 'modmotivation|i100': 0.8333333333333334,
 'modmotivation|i500': 0.3333333333333333,
 'techcomfortable|i100': 0.3333333333333333,
 'techcomfortable|i500': 0.6666666666666666}

In [129]:
# Unnormalised score for P(i100 | evidence): product of the likelihoods and the prior
p_i100 = P['health|i100'] * P['modexercise|i100'] * P['modmotivation|i100'] * P['techcomfortable|i100'] * P['i100']

# Unnormalised score for P(i500 | evidence)
p_i500 = P['health|i500'] * P['modexercise|i500'] * P['modmotivation|i500'] * P['techcomfortable|i500'] * P['i500']

print('Probability of i100', p_i100)
print('Probability of i500', p_i500)
print('The best recommendation is the i500')

Probability of i100 0.0030864197530864196
Probability of i500 0.019753086419753083
The best recommendation is the i500
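The two numbers printed above are unnormalised scores, not probabilities that sum to 1. If you want actual posterior probabilities, one option is to divide each score by their sum, as in this small sketch (the dictionary layout is an assumption; the score values are copied from the output above).

```python
# Unnormalised naive Bayes scores copied from the output above.
scores = {"i100": 0.0030864197530864196, "i500": 0.019753086419753083}

# The sum of the scores plays the role of P(evidence).
total = sum(scores.values())
posteriors = {model: s / total for model, s in scores.items()}

best = max(posteriors, key=posteriors.get)
print(best, {m: round(p, 3) for m, p in posteriors.items()})
# i500 {'i100': 0.135, 'i500': 0.865}
```

Normalising does not change which model wins; it only makes the size of the gap easier to interpret.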


## Estimating Probabilities
Probabilities in Naive Bayes are estimates of the true probabilities (probabilities from entire population). We can estimate the true probability via random representative sampling. This normally works, but when the true probability is tiny, these estimates can be poor.

### 0 Probability
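When an estimated probability is 0, it dominates the naive Bayes calculation: because the probabilities are multiplied together, a single 0 makes the whole product 0 no matter what the other values are. Probabilities based on sample ratios can also be biased underestimates.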

## Snooping Bill
If 0 Democrats voted for a snooping bill, then P(votedYes|Democrat) = 0. If we then try to predict whether someone is a Democrat from all the bills they voted on, and they voted for the snooping bill, the score for Democrat is 0 regardless of how they voted on every other bill.

P(votedNo|Democrat) = number that are both Democrats and voted no on the snooping bill / number that are Democrats

$$
P(x|y) = \frac{n_c}{n}
$$

n is the total number of instances of class y, and $n_c$ is the number of instances of class y that have value x. When $n_c=0$ we have a problem. To solve this, we can rewrite the equation:

$$
P(x|y) = \frac{n_c + mp}{n + m}
$$

- m = constant (equivalent sample size). Value can vary, but you can use the number of different values an attribute takes: e.g. if the response is boolean, we can say m=2
- p = prior estimate of probability. We usually assume uniform probability, so if we assume there's a 50% chance of voting yes or no, then p=1/2
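The smoothed estimate is a one-line function. The sketch below is illustrative: the function and parameter names are assumptions, and the figure of 200 Democrats is a made-up sample size used only to show the effect.

```python
def m_estimate(count_with_value, class_total, m, p):
    """Smoothed estimate of P(x | y).

    count_with_value: instances of class y that have value x (the numerator count)
    class_total: total instances of class y (n)
    m: equivalent sample size
    p: prior estimate of the probability
    """
    return (count_with_value + m * p) / (class_total + m)


# Hypothetical numbers: 200 Democrats in the sample, none voted yes.
raw = 0 / 200
smoothed = m_estimate(0, 200, m=2, p=0.5)
print(raw, round(smoothed, 5))  # 0.0 0.00495
```

The smoothed value stays small but is no longer exactly 0, so it cannot wipe out the evidence from every other attribute.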

## Numbers

Naive Bayes uses categorical data (data in discrete categories). If your data is along a scale, you need to be able to group things so you can count them. You have two options:

1. Make categories: discretize a continuous attribute
2. Assume a Gaussian distribution and use its probability density function
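As a rough sketch of the first option, a continuous value can be mapped to a category with a couple of cut points. The cut points (80 and 110, in $1000s) and the labels below are assumptions picked for illustration, and the sample values are incomes of the kind used in the next cell.

```python
import bisect

# Example incomes in $1000s.
incomes = [60, 75, 90, 125, 100, 90, 150, 85, 100, 120, 95, 90, 85, 70, 45]

# Assumed cut points: below 80 is "low", 80 up to (not including) 110 is
# "medium", 110 and above is "high".
edges = [80, 110]
labels = ["low", "medium", "high"]


def discretize(value):
    """Map a numeric value to the label of the bin it falls into."""
    return labels[bisect.bisect_right(edges, value)]


categories = [discretize(v) for v in incomes]
print(list(zip(incomes, categories))[:4])
```

Once the values are categories, they can be counted exactly like the other attributes. The choice of cut points affects the estimates, so it is worth trying a few.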

In [130]:
df['Income ($1000s)'] = [60, 75, 90, 125, 100, 90, 150, 85, 100, 120, 95, 90, 85, 70, 45]
df

Unnamed: 0,Comfotable with Tech,Current Exercise Level,How Motivated,Main Interest,Model #,Income ($1000s)
0,yes,sedentary,moderate,both,i100,60
1,no,sedentary,moderate,both,i100,75
2,yes,sedentary,moderate,health,i500,90
3,yes,active,moderate,appearance,i500,125
4,yes,moderate,aggressive,appearance,i500,100
5,no,moderate,aggressive,appearance,i100,90
6,no,moderate,aggressive,health,i500,150
7,yes,active,moderate,both,i100,85
8,yes,moderate,aggressive,both,i500,100
9,yes,active,aggressive,appearance,i500,120


If we wanted to identify the typical purchaser of an i500, we could find their average income:

In [131]:
income_i500 = df[df['Model #']=='i500']['Income ($1000s)']
income_i500.mean()

106.11111111111111

### Standard Deviation

In [136]:
def standard_deviation(x):
    return math.sqrt( (sum( [math.pow(xi - x.mean(), 2) for xi in x] )) / (len(x)) )

sd = standard_deviation(income_i500)
sd

20.10773452317095

### Standard Score

In [138]:
def standard_score(x, std):
    return [((xi - x.mean()) / std) for xi in x]
    
z_scores = standard_score(income_i500, sd)
z_scores

[-0.80123949779352976,
 0.93938423879241395,
 -0.30391843019754589,
 2.1826869077823736,
 -0.30391843019754589,
 0.69072370499442193,
 -0.55257896399553785,
 -0.80123949779352976,
 -1.0499000315915217]

## Population vs Sample Standard Deviation
If you have data on the complete population, you can use the SD formula above, which divides by n. If you only have a sample, divide by n-1 instead. For the i500 incomes this gives a sample standard deviation of 21.327 rather than 20.108.
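Python's standard library provides both versions, which makes the difference easy to see. The sketch below uses the nine i500 incomes from the table; the list is copied by hand rather than read from the DataFrame.

```python
import statistics

# Incomes ($1000s) of the nine i500 purchasers.
income_i500 = [90, 125, 100, 150, 100, 120, 95, 90, 85]

population_sd = statistics.pstdev(income_i500)  # divides by n
sample_sd = statistics.stdev(income_i500)       # divides by n - 1

print(round(population_sd, 3), round(sample_sd, 3))  # 20.108 21.327
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.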

## Gaussian Distribution
Otherwise called the normal distribution or bell curve. People tend to assume that the data follows a normal distribution which has the following properties:
- 68% of instances fall within 1 std of the mean
- 95% within 2 std of the mean

Using the mean of 106.1 and the sample std of 21.3, roughly 95% of the people who purchase an i500 earn between about \$63k and \$149k. So if you asked how likely it is that someone who bought an i500 earns \$100k, you'd say it was pretty likely. We can calculate the probability with the following equation:

$$
P(x_i|y_j) = \frac{1}
{\sqrt{2\pi}\,\sigma_{ij}}
e^{\frac{-(x_i-\mu_{ij})^2}
{2\sigma^2_{ij}}}
$$

Suppose we want to calculate P(100k|i500): the probability that a person earns \$100k given they bought an i500. We've already calculated the mean income of the people who bought the i500 and the standard deviation (here the sample version, dividing by n-1). The mean is represented as $\mu$ and the std as $\sigma$.

- $\mu_{ij}$ = 106.111
- $\sigma_{ij}$ = 21.327
- $x_i$ = 100
- e = constant that is the base of the natural logarithm: value ~ 2.718

$$
P(x_i|y_j) = \frac{1}
{\sqrt{2\pi}(21.327)}
e^{\frac{-(100-106.111)^2}
{2(21.327)^2}}
$$

$$
P(x_i|y_j) = \frac{1}
{53.458}
e^{-0.0411}
$$

$$
P(x_i|y_j) = 0.0180
$$

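The value for an income of \$100k among people who bought an i500 is 0.0180. Strictly this is a probability density rather than a probability, but it can be used in the naive Bayes product in the same way.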
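The worked calculation above can be reproduced with a short function. This is a minimal sketch of the Gaussian density using only the `math` module; the function name is an illustrative choice.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation at x."""
    coefficient = 1 / (math.sqrt(2 * math.pi) * sd)
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return coefficient * math.exp(exponent)


# Mean and sample standard deviation of the i500 incomes, in $1000s.
density = gaussian_pdf(100, mean=106.111, sd=21.327)
print(round(density, 4))  # 0.018
```

In a naive Bayes classifier this value would be multiplied with the categorical likelihoods for the i500 class, and the same function would be called with the i100 mean and standard deviation for the competing class.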

## Bayes vs KNN
### Bayes
- Simple implementation (counting things)
- Requires less training data
- Works well and is fast
- Cannot learn interactions between features - e.g. cannot learn that I like foods with cheese and foods with rice but not with both

### KNN
- Simple, no assumptions about structure in data
- Requires large amount of memory for training set
- Useful when training set is large
- KNN is flexible and can be used in a wide variety of fields from recommendation engines, proteomics, image classification etc