## Naive Bayes Document Classification

## Classifying Articles 

-------------------------------
### Here are our articles
#### Music Articles:

* 'the song was popular'
* 'band leaders disagreed on sound'
* 'played for a sold out arena stadium'

#### Politics Articles

* 'world leaders met last week'
* 'the election was close'
* 'the officials agreed on a compromise'
--------------------------------------------------------
Let's try to predict the class of one example phrase:


* "world leaders agreed to fund the stadium"

To do this, we'll use a bag-of-words representation and assume that, given the class, words are independent of one another.
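As a rough illustration of what a bag of words looks like, the sketch below counts how often each token appears and discards word order. The helper name `bag_of_words` is hypothetical, and splitting on whitespace after lowercasing is an assumption; a real tokenizer may differ.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Word order is lost; only the per-word counts remain.
test_bag = bag_of_words('world leaders agreed to fund the stadium')
print(test_bag)
```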

In [1]:
from collections import defaultdict
import numpy as np
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

test_statement = 'world leaders agreed to fund the stadium'

In [2]:
#labels : 'music' 'politics'
#features: words
test_statement_2 = 'officials met at the arena'

<img src ="./resources/naive_bayes_icon.png">

### Another way of looking at it
<img src = "./resources/another_one.png">

## So, in the context of our problem...



## $ P(politics | phrase) = \frac{P(phrase|politics)P(politics)}{P(phrase)}$

## $ P(politics) = \frac{ \# politics}{\# all\ articles} $

*where phrase is our test statement*

Estimating the parameters of our model (using Maximum a Posteriori):

<img src = "./resources/solving_theta.png" width="400">

### How should we calculate P(politics)?

This is the prior: the probability that an article belongs to a given class before we look at any of its words. We have three articles of each type, so we estimate an equal prior probability of 0.5 for each class.

In [3]:
p_politics = len(politics)/(len(politics) + len(music))

In [4]:
p_politics

0.5

In [5]:
p_music = len(music)/(len(politics) + len(music))

In [6]:
p_music

0.5

### How do you think we should calculate: $ P(phrase | politics) $ ?

## $ P(phrase | politics) = \prod_{i=1}^{d} P(word_{i} | politics) $

### We need to make a *Naive* assumption.

### $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art.} {\#\ of\ total\ words\ in\ politics\ art.} $

### Can you foresee any issues with this?
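One issue appears as soon as the phrase contains a word that never occurs in a class. The sketch below is a minimal illustration, assuming whitespace tokenization; the names `politics_counts` and `unsmoothed_word_prob` are hypothetical. A single unseen word such as 'to' or 'fund' has a count of zero, which drives the whole product to zero.

```python
from collections import Counter

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

# Count every word across all politics articles.
politics_counts = Counter(word for doc in politics for word in doc.split())
total_politics_words = sum(politics_counts.values())

def unsmoothed_word_prob(word):
    """Raw relative frequency of a word in the politics articles."""
    return politics_counts[word] / total_politics_words

phrase = 'world leaders agreed to fund the stadium'
likelihood = 1.0
for word in phrase.split():
    likelihood *= unsmoothed_word_prob(word)

# 'to', 'fund' and 'stadium' never appear in politics, so the product collapses.
print(likelihood)
```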

## Laplace Smoothing
## $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art. + \alpha} {\#\ of\ total\ words\ in\ politics\ art. + \alpha d} $

## $ P(word_{i} | music) = \frac{\#\ of\ word_{i}\ in\ music\ art. + \alpha} {\#\ of\ total\ words\ in\ music\ art. + \alpha d} $

This correction process is called Laplace smoothing:
* d : number of features (in this instance, the number of unique words in the vocabulary)
* $\alpha$ can be any number greater than 0 (it is usually 1)
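The smoothed estimate can be sketched as follows. This is one possible reading of the formula, not a definitive implementation: it assumes $\alpha = 1$, whitespace tokenization, and that `d` is the number of unique words in the training articles only. The helper names `word_counts` and `smoothed_word_prob` are hypothetical.

```python
from collections import Counter

music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

def word_counts(docs):
    """Count every whitespace-separated word across a list of articles."""
    return Counter(word for doc in docs for word in doc.split())

music_counts = word_counts(music)
politics_counts = word_counts(politics)

# Assumption: d is the size of the training vocabulary (unique words in both classes).
d = len(set(music_counts) | set(politics_counts))

def smoothed_word_prob(word, counts, d, alpha=1.0):
    """Laplace-smoothed P(word | class): (count + alpha) / (total words + alpha * d)."""
    total_words = sum(counts.values())
    return (counts[word] + alpha) / (total_words + alpha * d)

# A word the class has never seen now gets a small non-zero probability.
print(smoothed_word_prob('world', politics_counts, d))
print(smoothed_word_prob('fund', music_counts, d))
```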


#### Now let's work through this calculation

<img src="./resources/IMG_0041.jpg">

In [None]:
# p(phrase|politics)

In [None]:
# P(politics | world leaders agreed to fund the stadium)

### $ P(politics | phrase) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $
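One way to fill in this calculation is sketched below. It multiplies the class prior by the Laplace-smoothed probability of each word in the phrase. The assumptions are $\alpha = 1$, whitespace tokenization, and a vocabulary built from the training articles only; the names `class_score`, `score_politics` and `score_music` are hypothetical.

```python
from collections import Counter

music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

phrase = 'world leaders agreed to fund the stadium'

def word_counts(docs):
    """Count every whitespace-separated word across a list of articles."""
    return Counter(word for doc in docs for word in doc.split())

def class_score(phrase, counts, prior, d, alpha=1.0):
    """Unnormalized posterior: prior times the product of smoothed word probabilities."""
    total_words = sum(counts.values())
    score = prior
    for word in phrase.split():
        score *= (counts[word] + alpha) / (total_words + alpha * d)
    return score

music_counts = word_counts(music)
politics_counts = word_counts(politics)
d = len(set(music_counts) | set(politics_counts))  # training vocabulary size (assumption)

p_politics = len(politics) / (len(politics) + len(music))
p_music = len(music) / (len(politics) + len(music))

score_politics = class_score(phrase, politics_counts, p_politics, d)
score_music = class_score(phrase, music_counts, p_music, d)
print(score_politics, score_music)
```

These scores are proportional to the posteriors rather than equal to them, because the shared denominator $P(phrase)$ is left out. That is enough for comparing classes.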

#### Determining the winner of our model:

<img src = "./resources/solvingforyhat.png" width= "400">

In [29]:
# likelihood_m and likelihood_p are the music and politics scores,
# assumed to have been computed in the exercise cells above
classes = ['Music', 'Politics']
probabilities = [likelihood_m, likelihood_p]
prediction_class_idx = np.argmax(probabilities)  # index of the highest probability
prediction = classes[prediction_class_idx]
prediction

'Politics'
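The argmax only needs the relative size of the two scores. To recover actual posterior probabilities, the scores can be divided by their sum, which plays the role of $P(phrase)$ in Bayes' rule. The sketch below assumes $\alpha = 1$, a training vocabulary of 25 words and equal priors of 0.5; the numeric scores are illustrative values worked out under those assumptions, and `normalize` is a hypothetical helper.

```python
def normalize(scores):
    """Divide each unnormalized class score by their sum so they add up to 1."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Illustrative scores, assuming alpha = 1, d = 25 and equal priors of 0.5.
scores = {
    'Music': 0.5 * 8 / 41**7,
    'Politics': 0.5 * 24 / 40**7,
}

posteriors = normalize(scores)
print(posteriors)
```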