## Naive Bayes Classification

Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles: music and politics.

We have a problem though! Only some of our articles are labeled with their category (music or politics). Is there a way we can use machine learning to help us label the rest of our data quickly?

Let's start with one example phrase:


#### "world leaders agreed to fund the stadium"

How can we build a model that labels this for us, rather than labeling each phrase by hand?

In [163]:
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']
test_statement = 'world leaders agreed to fund the stadium'



#### $ P(politics | phrase) = \frac{P(phrase|politics)P(politics)}{P(phrase)}$

#### $ P(politics) = \frac{ \# politics}{\# all\ articles} $

*where phrase is our test statement*

<img src = "./resources/solvingforyhat.png" width= "400">

<img src = "./resources/solving_theta.png" width="400">

### How should we calculate P(politics)?

In [152]:
p_politics = len(politics)/(len(music)+len(politics))
p_music = 1- p_politics

### How do you think we should calculate: $ P(phrase | politics) $ ?

## $ P(phrase | politics) = \prod_{i=1}^{d} P(word_{i} | politics) $

### We need to make a *Naive* assumption.

### $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art.} {\#\ of\ total\ words\ in\ politics\ art.} $

### Can you foresee any issues with this?

## $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art. + \alpha} {\#\ of\ total\ words\ in\ politics\ art. + \alpha d} $

This correction process is called Laplace smoothing:
* d : number of features (here, the number of distinct words in the combined vocabulary, not the number of words in the phrase)
* $\alpha$ can be any number greater than 0 (it is usually 1)
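
As a small sketch of what the smoothing does, the snippet below compares the raw and the smoothed estimate for a word that never appears in a category. The helper name `smoothed_word_prob` and the toy phrases are illustrative assumptions, not part of the notebook's own code.

```python
def smoothed_word_prob(word, articles, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(word | category).

    `articles` is a list of strings from one category and `vocab_size`
    is d, the number of distinct words across all categories.
    """
    words = " ".join(articles).split()
    return (words.count(word) + alpha) / (len(words) + alpha * vocab_size)


# Toy data, for illustration only
politics_demo = ["the election was close", "world leaders met"]
music_demo = ["the song was popular"]
vocab = set(" ".join(politics_demo + music_demo).split())

politics_words = " ".join(politics_demo).split()

# "song" never appears in the politics phrases, so the raw estimate is 0
# and would wipe out the whole product of word probabilities.
raw = politics_words.count("song") / len(politics_words)
smoothed = smoothed_word_prob("song", politics_demo, len(vocab), alpha=1.0)
print(raw, smoothed)
```

With smoothing, every word in the vocabulary gets a small non-zero probability, and the probabilities over the vocabulary still sum to 1.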


#### Now let's compute this

In [157]:

def vocab_maker(category):
    """returns the vocabulary for a given type of article"""
    vocab_category = set()
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
        
voc_music = vocab_maker(music)
voc_pol = vocab_maker(politics)
total_vocabulary = voc_music.union(voc_pol)


In [160]:
voc_all = voc_music.union(voc_pol)

In [161]:
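total_vocab_count = len(voc_all)
# total number of words (repeats included) in each category, as in the denominator of the smoothed formula
total_music_count = sum(len(art.split()) for art in music)
total_politics_count = sum(len(art.split()) for art in politics)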

In [95]:
# P(politics | world leaders agreed to fund the stadium)

In [179]:
from collections import defaultdict

def find_number_words_in_category(phrase, category):
    """counts how often each word of the phrase occurs across the articles of a category"""
    statement = phrase.split()
    cat_word_list = ' '.join(category).split()
    word_count = defaultdict(int)
    for word in statement:
        # words that never occur in the category still get an entry with count 0
        word_count[word] += cat_word_list.count(word)
    return word_count
                
            

In [200]:
test_music_word_count = find_number_words_in_category(test_statement,music)


In [201]:
test_music_word_count

defaultdict(int,
            {'world': 0,
             'leaders': 1,
             'agreed': 0,
             'to': 0,
             'fund': 0,
             'the': 1,
             'stadium': 1})

In [182]:
test_politic_word_count = find_number_words_in_category(test_statement,politics)

In [183]:
test_politic_word_count

defaultdict(int,
            {'world': 1,
             'leaders': 1,
             'agreed': 1,
             'to': 0,
             'fund': 0,
             'the': 2,
             'stadium': 0})

In [194]:
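import numpy as np

def find_likelihood(category_count, test_category_count, alpha):
    """smoothed P(phrase | category) from the per-word counts of the phrase"""
    # numerator: product over the phrase's words of (count in category + alpha)
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    # denominator: (words in category + alpha * d), once per word in the phrase
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))

    return num/denom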

In [198]:
likelihood_m = find_likelihood(total_music_count,test_music_word_count,1)

In [197]:
likelihood_p = find_likelihood(total_politics_count,test_politic_word_count,1)

### $ P(politics | article) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $

In [208]:
likelihood_p * p_politics > likelihood_m * p_music

True
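
The comparison above leaves out the denominator $P(phrase)$ because it is the same for both categories. If actual posterior probabilities are wanted, the two scores can be divided by their sum. The sketch below shows this; `normalize_scores` is a hypothetical helper and the input numbers are made up for illustration, not the notebook's actual values.

```python
def normalize_scores(scores):
    """Turn unnormalized scores P(category) * P(phrase | category)
    into posterior probabilities that sum to 1."""
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}


# Made-up scores, for illustration only
posteriors = normalize_scores({"politics": 0.003, "music": 0.001})
print(posteriors)
```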

The probabilities we end up with are often exceedingly small, and multiplying many of them together can underflow to zero in floating-point arithmetic. Taking logs turns the product into a sum, which avoids this.

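### $ \log P(politics | article) = \log P(politics) + \sum_{i=1}^{d} \log P(word_{i} | politics) - \log P(article) $

*The last term is the same for every category, so it can be ignored when comparing categories.*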
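
A short sketch of the log-space computation, assuming made-up per-word probabilities: the direct product collapses to zero for a long document, while the sum of logs stays finite. The function name `log_score` is illustrative.

```python
import math


def log_score(prior, word_probs):
    """log P(category) + sum of log P(word_i | category)."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


# Made-up probabilities for a 100-word document, for illustration only
tiny_probs = [1e-5] * 100

product = 0.5
for p in tiny_probs:
    product *= p  # becomes too small for a float and ends up as 0.0

print(product)                     # direct product
print(log_score(0.5, tiny_probs))  # log-space score
```

Because the log is monotonic, the category with the largest log score is also the category with the largest probability.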







### Different Types of Naive Bayes Classifiers

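Multinomial Naive Bayes Classifier: this is the example we just did! It models how many times each word occurs in a document, so its features are counts, whereas the Bernoulli classifier described next only records whether each word occurs. This classifier cannot handle negative feature values!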



The Bernoulli Naive Bayes Classifier: used when your features are binary (0 or 1). In the context of a text-based classification task, each feature records whether or not a word appears in a document at all, and the model estimates the probability of that word occurring in each category.

<img src = "./resources/bernoulli_nb_formula.svg">


There is also the Gaussian Naive Bayes Classifier, which assumes that each feature is normally distributed within each class. It is typically used for continuous features.
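
To make the difference between the multinomial and Bernoulli feature representations concrete, here is a minimal sketch in plain Python. The sentence and the three-word vocabulary are made up for illustration.

```python
from collections import Counter

# Made-up document and vocabulary, for illustration only
doc = "the band played the song the crowd loved"
vocabulary = ["the", "song", "stadium"]

counts = Counter(doc.split())

# Multinomial features: how many times each vocabulary word occurs
multinomial_features = [counts[w] for w in vocabulary]

# Bernoulli features: whether each vocabulary word occurs at all
bernoulli_features = [int(counts[w] > 0) for w in vocabulary]

print(multinomial_features)
print(bernoulli_features)
```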


## General machine learning information 


Whenever we fit a machine learning model, we should split the data into a training set and a testing set. Evaluating the model on data it has not seen lets us detect overfitting.


<img src="./resources/train_test.png">

In [203]:
from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y)
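
A minimal sketch of the split on toy data follows. The phrases and labels are made up for illustration, and the values of `test_size`, `random_state`, and `stratify` are choices made for this example rather than anything the notebook prescribes.

```python
from sklearn.model_selection import train_test_split

# Toy data, for illustration only (1 = politics, 0 = music)
phrases = [
    "world leaders met", "the election was close",
    "officials agreed on a compromise", "the senate voted",
    "the song was popular", "the band went on tour",
    "played for a sold out arena", "the album was released",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the data; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    phrases, labels, test_size=0.25, random_state=42, stratify=labels
)
print(len(X_train), len(X_test))
```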


### Using Naive Bayes in sklearn

In [204]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

model = GaussianNB()
model.fit(X_train,y_train)
model.predict(X_test)

In [None]:
y_train = [1,0,0,0,1,1,1,1,1]
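
For the music and politics phrases from the start of this section, a hedged end-to-end sketch in sklearn might look like the following. It uses `CountVectorizer` to turn the phrases into word counts and `MultinomialNB` rather than `GaussianNB`, since the features are counts. Note that `CountVectorizer` tokenizes differently from a plain `split()` (by default it drops one-letter words such as "a"), so the numbers will not match the hand calculation exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

texts = music + politics
labels = ['music'] * len(music) + ['politics'] * len(politics)

# Build the vocabulary and the word-count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# alpha=1.0 corresponds to Laplace smoothing
model = MultinomialNB(alpha=1.0)
model.fit(X, labels)

test_statement = 'world leaders agreed to fund the stadium'
prediction = model.predict(vectorizer.transform([test_statement]))
print(prediction)
```

Words in the test statement that never appeared in the training phrases (here "to" and "fund") are ignored by the fitted vectorizer.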