# Naive Bayes by hand

In [88]:
import numpy as np
import pandas as pd

Imagine you have 4 apples with these attributes:

In [89]:
apples = [
    "red round",
    "red round",
    "green sour round",
    "green round",
]

and 3 bananas with these attributes:

In [90]:
bananas = [
    "yellow skinny",
    "yellow skinny",
    "green skinny"
]

Split each description into a list of words, giving a list of lists:

In [91]:
apples = [a.split() for a in apples]
bananas = [b.split() for b in bananas]

**Q.** What is the sorted set of all attributes (the vocabulary $V$)?

<details>
<summary>Solution</summary>
['green', 'red', 'round', 'skinny', 'sour', 'yellow']
    
You can compute it like this:
    
```
Va = set(np.concatenate(apples).ravel())
Vb = set(np.concatenate(bananas).ravel())
V = sorted(Va.union(Vb))
```
</details>

**Q.** What is the "fruit vector" for the "red round" apple?

Each column value is 1 if the word is mentioned and 0 otherwise. Assume the columns follow the sorted vocabulary order.

<details>
<summary>Solution</summary>
    The row vector is <tt>[0, 1, 1, 0, 0, 0]</tt> for "red round"
</details>

**Q.** What is the "fruit vector" for the "green sour round" apple?

Each column value is 1 if the word is mentioned and 0 otherwise. Assume the columns follow the sorted vocabulary order.

<details>
<summary>Solution</summary>
    The row vector is <tt>[1, 0, 1, 0, 1, 0]</tt> for "green sour round"
</details>
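
The two fruit-vector questions above can be checked with a few lines of Python. This is a minimal sketch: `fruit_vector` is a hypothetical helper name that does not appear in the notebook, and the vocabulary is typed in directly from the solution to the first question.

```python
# Vocabulary in sorted order, copied from the solution above
V = ['green', 'red', 'round', 'skinny', 'sour', 'yellow']

def fruit_vector(description, vocab):
    """Hypothetical helper: 0/1 list with a 1 wherever a vocab word is mentioned."""
    words = set(description.split())
    return [1 if w in words else 0 for w in vocab]

red_round = fruit_vector("red round", V)
green_sour_round = fruit_vector("green sour round", V)
print(red_round)         # [0, 1, 1, 0, 0, 0]
print(green_sour_round)  # [1, 0, 1, 0, 1, 0]
```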

Now let's look at all the fruit vectors together with the fruit target column (0 = apple, 1 = banana):

In [92]:
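# Vocabulary: sorted set of every word used in any description
V = sorted({w for row in apples + bananas for w in row})
# One row per fruit, one 0/1 column per vocabulary word
data = np.zeros((7,len(V)))
for i,row in enumerate(apples+bananas):
    for w in row:
        data[i,V.index(w)] = 1
df = pd.DataFrame(data,columns=V,dtype=int)
df['fruit'] = [0,0,0,0,1,1,1]  # target: 0 = apple, 1 = banana
df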

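|   | green | red | round | skinny | sour | yellow | fruit |
|---|-------|-----|-------|--------|------|--------|-------|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |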


**Q.** What is a good estimate of P(apple) and P(banana)?

<details>
<summary>Solution</summary>
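Use the fraction of each fruit in the data: 4 of the 7 fruits are apples and 3 are bananas, so P_apple = 4/7 and P_banana = 3/7.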
</details>

In [93]:
P_apple = 4/7
P_banana = 3/7
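
As a quick check, the same priors can be derived from the target column instead of being typed in by hand. This is only a sketch: the `fruit` array repeats the labels from the table above, and the names `P_apple_est` and `P_banana_est` are made up here so they do not clash with the notebook's variables.

```python
import numpy as np

# Target column from the table above: 0 = apple, 1 = banana
fruit = np.array([0, 0, 0, 0, 1, 1, 1])

# Prior estimate = fraction of rows carrying each label
P_apple_est = np.mean(fruit == 0)
P_banana_est = np.mean(fruit == 1)
print(P_apple_est, P_banana_est)
```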

**Q.** What is a good estimate of P(red|apple) and P(red|banana)?

<details>
<summary>Solution</summary>
It is probably best to take the ratio of the number of apples that are red to the total number of apples. With binary vector values, it feels wrong to estimate this as we did for document classification, where we would count how many times, say, "red" appears in apple rows and divide by the total number of words in apple descriptions. So P(red|apple) = 2/4, because 2 of the 4 apples are red, and P(red|banana) = 0/3, because none of the 3 bananas are red.
</details>

**Q.** What is a good estimate of P(green|apple) and P(green|banana)?

<details>
<summary>Solution</summary>
2 of the 4 apples are green and 1 of the 3 bananas is green, so P(green|apple) = 2/4 and P(green|banana) = 1/3.
</details>
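
The solution for P(red|apple) contrasts two ways of estimating the probability. The sketch below computes both for the apples so the difference is concrete; the variable names are illustrative and not part of the notebook.

```python
# Apple descriptions, already split into words
apples = [
    ["red", "round"],
    ["red", "round"],
    ["green", "sour", "round"],
    ["green", "round"],
]

# Estimate used here: fraction of apples whose description mentions "red"
frac_apples_red = sum("red" in a for a in apples) / len(apples)

# Document-classification style: occurrences of "red" divided by
# the total number of words across all apple descriptions
n_red = sum(a.count("red") for a in apples)
n_words = sum(len(a) for a in apples)
word_freq_red = n_red / n_words

print(frac_apples_red)  # 2 of 4 apples
print(word_freq_red)    # 2 of 9 words
```

The first estimate answers "how likely is an apple to be red?", while the second answers "how likely is a word drawn from an apple description to be 'red'?".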

In [101]:
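# Fraction of the 4 apples that mention each word.
# The 'fruit' entry is the label column summed along with the words; ignore it.
P_w_apple = df[df.fruit==0].sum(axis=0) / 4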
P_w_apple

green     0.50
red       0.50
round     1.00
skinny    0.00
sour      0.25
yellow    0.00
fruit     0.00
dtype: float64

In [100]:
# Fraction of the 3 bananas that mention each word (again, ignore the 'fruit' entry)
P_w_banana = df[df.fruit==1].sum(axis=0) / 3

In [104]:
P_w_apple['red']*P_w_apple['round']

0.5

In [105]:
P_w_banana['red']*P_w_banana['round']

0.0
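
The two products above are the likelihoods of "red round" under each class. The usual next step in naive Bayes is to weight each likelihood by the class prior and pick the larger score. The sketch below does that under some assumptions: the probabilities are typed in from the table rather than read from `df`, `score` is a hypothetical helper, only the mentioned words are multiplied (as in the cells above, so absent words contribute nothing), and no smoothing is applied.

```python
# Priors and per-word estimates, copied from the counts in the table
P_apple, P_banana = 4/7, 3/7
P_w_apple = {"green": 2/4, "red": 2/4, "round": 4/4,
             "skinny": 0/4, "sour": 1/4, "yellow": 0/4}
P_w_banana = {"green": 1/3, "red": 0/3, "round": 0/3,
              "skinny": 3/3, "sour": 0/3, "yellow": 2/3}

def score(words, prior, P_w):
    """Hypothetical helper: prior times the product of P(w|class) for mentioned words."""
    s = prior
    for w in words:
        s *= P_w[w]
    return s

words = ["red", "round"]
apple_score = score(words, P_apple, P_w_apple)     # (4/7) * 0.5 * 1.0
banana_score = score(words, P_banana, P_w_banana)  # (3/7) * 0.0 * 0.0
prediction = "apple" if apple_score > banana_score else "banana"
print(apple_score, banana_score, prediction)
```

Because no banana is red or round, the banana score is exactly zero however large its prior is. Smoothing is the usual way to soften zero estimates like this one.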