# Naive Bayes

### Basic Assumptions

In [16]:
# create a random beatles song
import random

words = 'me you she love loves yeah oh yellow submarine'.split()

song = []
for i in range(20):
    # number = random.randint(0, 8)  # uniform distribution
    word = random.choices(words, weights=[1, 1, 1, 1, 1, 5, 1, 1, 1])  # weighted: 'yeah' is 5x as likely
    song += word  # random.choices returns a one-element list
    
' '.join(song)

# big brother: Talktotransformer

'yeah yellow love she you you oh you yeah love you loves she yeah love she you loves yeah love'

## 1. Prior

**Prior: we know nothing about the song yet**

$p(A)$ and $p(B)$ are estimated from the class frequencies in the training data.

We are looking for $p(A)$, the probability that a song is by Abba.

We consider two artists: **A** - Abba and **B** - Beatles

In [24]:
abba_songs = 100
beatles_songs = 100

pA = abba_songs / (abba_songs + beatles_songs)
pA

0.5
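
As a minimal sketch, the same prior can be estimated directly from a list of training labels. The `labels` list is a hypothetical stand-in for the real training data, built to match the counts above.

```python
from collections import Counter

# Hypothetical training labels: one entry per song (100 Abba, 100 Beatles)
labels = ['A'] * 100 + ['B'] * 100

counts = Counter(labels)
total = sum(counts.values())

# prior = relative class frequency
priors = {artist: n / total for artist, n in counts.items()}
print(priors)
```

With unbalanced classes the priors would differ, and the more frequent artist would start with an advantage before any word is seen.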

## 2. With information we get a conditional probability

We want to know the **probability that a song containing the word love is by Abba**:

$p(A|w)$

This is different from **the probability that an Abba song contains the word love**:

$p(w|A)$

In [25]:
import pandas as pd

In [26]:
abba_songs_with_love = 50
beatles_songs_with_love = 100

# p(love|Abba) the probability of love under condition that it is an Abba song
p_love_A = abba_songs_with_love / abba_songs

# p(Abba|love)
p_A_love = abba_songs_with_love / (abba_songs_with_love + beatles_songs_with_love)

p_love_A, p_A_love

(0.5, 0.3333333333333333)

## 3. Bayes Theorem

Bayes' theorem is a statistical tool for converting one conditional probability into the other:

$p(A|w) = \frac{p(w|A) \cdot p(A)}{p(w)}$

$p(B|w) = \frac{p(w|B) \cdot p(B)}{p(w)}$

We assume that $p(w|A)$ and $p(A)$ are known.

The *marginal probability* $p(w)$ is usually ignored: it is the same for every class, so it does not change which class gets the highest score.

For $p(w|A)$ we can use the TF-IDF instead of the count.
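
To check the formula against the counts from the cells above, here is a small worked sketch. The marginal $p(w)$ is computed with the law of total probability; the variable names are chosen for this illustration only.

```python
# Numbers from the cells above
p_A, p_B = 0.5, 0.5
p_love_A = 50 / 100    # p(love | Abba)
p_love_B = 100 / 100   # p(love | Beatles)

# marginal probability p(love) via the law of total probability
p_love = p_love_A * p_A + p_love_B * p_B

# Bayes' theorem
p_A_love = p_love_A * p_A / p_love
p_B_love = p_love_B * p_B / p_love
print(p_A_love, p_B_love)
```

The result for `p_A_love` agrees with the value obtained directly from the counts (50 of the 150 songs containing love are by Abba).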

## 4. What if we have multiple words?

$p(A|song) = \frac{p(song|A) \cdot p(A)}{p(song)}$

What is $p(song|A)$ ?

In [30]:
words = ['love', 'you', 'yeah']
data = [[0.5, 0.25, 0],
        [1.0, 0.33, 0.5]]

df = pd.DataFrame(data, columns = words, index = ['A', 'B'])
df

|   | love | you  | yeah |
|---|------|------|------|
| A | 0.5  | 0.25 | 0.0  |
| B | 1.0  | 0.33 | 0.5  |


Naive Bayes assumption:

$p(song|A) = p(w_1|A) \cdot p(w_2|A) \cdot p(w_3|A)$

We assume that songs are written by stringing individual words together at random.

-> We assume that all words are independent events, given the artist.

In [31]:
song = "love you"

# go through all words in the vocabulary
# if the word occurs: p(w)
# if the word does not occur: 1-p(w)
p_abba = 0.5 * 0.25 * (1 - 0.0)
p_abba

0.125

In [34]:
p_beatles = 1 * 0.33 * (1 - 0.5)
p_beatles

0.165
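
The two hand calculations above can be generalized and turned into posteriors. This is a sketch under the same presence/absence assumption; `song_likelihood` is a hypothetical helper name, and the priors are the 0.5 / 0.5 values from section 1.

```python
def song_likelihood(song, word_probs):
    """p(song | artist) under the naive assumption (presence/absence per word)."""
    present = set(song.split())
    likelihood = 1.0
    for word, p in word_probs.items():
        # word occurs: p(w), word does not occur: 1 - p(w)
        likelihood *= p if word in present else (1 - p)
    return likelihood

# p(w | artist), copied from the table above
probs = {
    'A': {'love': 0.5, 'you': 0.25, 'yeah': 0.0},
    'B': {'love': 1.0, 'you': 0.33, 'yeah': 0.5},
}
priors = {'A': 0.5, 'B': 0.5}

song = 'love you'
scores = {a: song_likelihood(song, probs[a]) * priors[a] for a in probs}

# dividing by the sum of the scores plays the role of p(song)
total = sum(scores.values())
posteriors = {a: s / total for a, s in scores.items()}
print(posteriors)
```

Because both priors are equal, the ranking of the artists is already decided by the likelihoods 0.125 and 0.165; the normalization only rescales them so that they sum to 1.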

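## Problem: zero probabilities?

If a word never occurs in an artist's training songs (like *yeah* for Abba above), its probability is 0, and a single zero wipes out the whole product.

We use a **smoothing term**: 

* we assume that every word occurs at least k times
* so that every probability is always > 0
* in other words, we pretend that the artist attached k copies of each word in the vocabulary to the song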

In [35]:
# training data + k copies of the vocabulary
k = 1  # smoothing term
song = 'love yeah' + ' love you yeah' * k

If we increase the smoothing term, the differences between the probabilities decrease.

If we decrease the smoothing term, the differences between the probabilities increase.

For a very high smoothing term, the probability $p(A|song)$ for a song converges to the prior $p(A)$.

**The smoothing term is a regularization hyperparameter!**
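
The effect of the smoothing term can be sketched numerically. Two things here are assumptions of this example, not taken from the notebook: the song counts (chosen to be consistent with the probability table in section 4, with 100 songs per artist) and the exact smoothing formula, which adds k songs containing the word and k songs not containing it.

```python
def smoothed(count, n_songs, k):
    # add-k smoothing for a present/absent feature (assumed variant):
    # k extra songs with the word, k extra songs without it
    return (count + k) / (n_songs + 2 * k)

def posterior_abba(k):
    n_songs = 100
    # Hypothetical counts of songs containing each word
    counts = {'A': {'love': 50, 'you': 25, 'yeah': 0},
              'B': {'love': 100, 'you': 33, 'yeah': 50}}
    song = {'love', 'you'}
    scores = {}
    for artist, word_counts in counts.items():
        score = 0.5  # prior p(A) = p(B) = 0.5
        for word, count in word_counts.items():
            p = smoothed(count, n_songs, k)
            score *= p if word in song else (1 - p)
        scores[artist] = score
    return scores['A'] / (scores['A'] + scores['B'])

for k in [0, 1, 10, 100, 10000]:
    print(k, round(posterior_abba(k), 4))
```

As k grows, every smoothed word probability drifts toward 0.5 for both artists, so the likelihoods become indistinguishable and the posterior approaches the prior.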

## Problem: floating-point precision

In [38]:
# create a very small float number
(0.1 ** 100) * (0.2 ** 90)
# Python floats have limited precision and range; multiplying many small numbers eventually underflows to 0.0

1.2379400392853934e-163

In [47]:
0.2 + 0.1  # 0.1 and 0.2 have periodic (repeating) binary expansions, so they cannot be stored exactly
# be careful with floating point precision...

0.30000000000000004

Problematic:

$p(w_1) \cdot p(w_2) \cdots p(w_n)$

**Calculate log-probabilities instead:**

$e^{\log p(w_1) + \log p(w_2) + \dots + \log p(w_n)}$

In [48]:
import math

0.1 * 0.2 * 0.3

0.006000000000000001

In [52]:
math.exp(math.log(0.1) + math.log(0.2) + math.log(0.3))

0.006000000000000004
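
With only three factors the two results are practically identical. The difference shows up with many factors, as in this sketch with 1000 made-up word probabilities of 0.01 each (the numbers are purely illustrative):

```python
import math

probs = [0.01] * 1000  # 1000 small word probabilities (illustrative values)

product = math.prod(probs)                   # too small for a float: underflows
log_score = sum(math.log(p) for p in probs)  # stays comfortably representable

print(product, log_score)
```

For classification it is not even necessary to apply `exp` at the end: the logarithm is monotonic, so the class with the highest sum of log-probabilities is also the class with the highest product.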

## Import from sklearn

In [None]:
from sklearn.naive_bayes import MultinomialNB
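
A minimal usage sketch, assuming scikit-learn is installed. The four lyric snippets are a hypothetical toy corpus invented for this illustration, not real training data. Note that `MultinomialNB` models word counts, whereas the hand calculation above used the presence or absence of each word, which is closer to `BernoulliNB`. The `alpha` parameter is the smoothing term.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, invented for illustration only
lyrics = [
    "love you love you",
    "you and me",
    "yeah yeah yeah love",
    "yellow submarine yeah",
]
artists = ["A", "A", "B", "B"]

# turn each text into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lyrics)

# alpha is the smoothing term (alpha=1.0 is add-one smoothing)
model = MultinomialNB(alpha=1.0)
model.fit(X, artists)

new_song = vectorizer.transform(["yeah yeah submarine"])
prediction = model.predict(new_song)[0]
probabilities = model.predict_proba(new_song)[0]  # ordered like model.classes_
print(prediction, dict(zip(model.classes_, probabilities)))
```

Internally the model works with log-probabilities, for the floating-point reasons discussed in the previous section.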