# Naive Bayes and NLP Modeling

Before returning to our Satire/No Satire example, let's consider an example with a smaller but similar scope.

Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles:  music and politics.

We have a problem though! Only some of our articles have their category (music or politics). Is there a way we can use Machine Learning to help us label our data quickly?

-------------------------------
### Here are our articles
#### Music Articles:

* 'the song was popular'
* 'band leaders disagreed on sound'
* 'played for a sold out arena stadium'

#### Politics Articles

* 'world leaders met last week'
* 'the election was close'
* 'the officials agreed on a compromise'
--------------------------------------------------------
Let's try and predict one example phrase:


* "world leaders agreed to fund the stadium"

How can we make a model that labels this for us rather than having to go through by hand?

In [2]:
from collections import defaultdict
import numpy as np
music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

test_statement = 'world leaders agreed to fund the stadium'

In [3]:
#labels : 'music' 'politics'
#features: words
test_statement_2 = 'officials met at the arena'

<img src ="./resources/naive_bayes_icon.png">

### Another way of looking at it
<img src = "./resources/another_one.png">

## So, in the context of our problem......



$\large P(politics | phrase) = \frac{P(phrase|politics)P(politics)}{P(phrase)}$

$\large P(politics) = \frac{ \# politics}{\# all\ articles} $

*where phrase is our test statement*

<img src = "./resources/solving_theta.png" width="400">

### How should we calculate P(politics)?

This is the prior: the share of articles that fall into each category. We have three articles of each type, so we assume that both categories are equally likely.

In [None]:
p_politics = len(politics)/(len(politics) + len(music))
p_music = len(music)/(len(politics) + len(music))

### How do you think we should calculate: $ P(phrase | politics) $ ?

In [123]:
# we need to break the phrases down into individual words


 $\large P(phrase | politics) = \prod_{i=1}^{d} P(word_{i} | politics) $

In [127]:
### We need to make a *Naive* assumption.

 $\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art.} {\#\ of\ total\ words\ in\ politics\ art.} $

### Can you foresee any issues with this?

In [125]:
# we can't have a probability of 0


## Laplace Smoothing
 $\large P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ art. + \alpha} {\#\ of\ total\ words\ in\ politics\ art. + \alpha d} $

 $\large P(word_{i} | music) = \frac{\#\ of\ word_{i}\ in\ music\ art. + \alpha} {\#\ of\ total\ words\ in\ music\ art. + \alpha d} $

This correction process is called Laplace smoothing:
* d : number of features (in this instance total number of vocabulary words)
* $\alpha$ can be any number greater than 0 (it is usually 1)
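
To make the effect of smoothing concrete, here is a minimal sketch that applies the formula above to the toy articles. The helper name `smoothed_word_prob` is made up for this illustration, and it counts every word occurrence in a category as the formula states. With $\alpha = 0$ an unseen word gets probability 0, which would wipe out the whole product; with $\alpha = 1$ it gets a small positive value.

```python
from collections import Counter

music = ['the song was popular',
         'band leaders disagreed on sound',
         'played for a sold out arena stadium']
politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

def smoothed_word_prob(word, articles, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(word | category) (illustrative helper)."""
    tokens = ' '.join(articles).split()
    counts = Counter(tokens)
    return (counts[word] + alpha) / (len(tokens) + alpha * vocab_size)

# d: number of distinct words across both categories
vocab = set(' '.join(music + politics).split())
d = len(vocab)

# 'stadium' never appears in a politics article
p_unsmoothed = smoothed_word_prob('stadium', politics, d, alpha=0)
p_politics_word = smoothed_word_prob('stadium', politics, d)
p_music_word = smoothed_word_prob('stadium', music, d)
print(d, p_unsmoothed, p_politics_word, p_music_word)
```

Even after smoothing, 'stadium' remains more probable under music than under politics, which is the behavior we want.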


#### Now let's work through this calculation

<img src="./resources/IMG_0041.jpg">

p(phrase|politics)

In [128]:
def vocab_maker(category):
    """returns the vocabulary for a given type of article"""
    vocab_category = set()
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
        
voc_music = vocab_maker(music)
voc_pol = vocab_maker(politics)
# total_vocabulary = voc_music.union(voc_pol)


In [129]:
voc_music

{'a',
 'arena',
 'band',
 'disagreed',
 'for',
 'leaders',
 'on',
 'out',
 'played',
 'popular',
 'sold',
 'song',
 'sound',
 'stadium',
 'the',
 'was'}

In [131]:
voc_all = voc_music.union(voc_pol)

In [132]:
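total_vocab_count = len(voc_all)
# Note: len() of a vocabulary set counts distinct words, not total words.
# For music both are 16; for politics 'the' appears twice, so this gives 14 rather than 15.
total_music_count = len(voc_music)
total_politics_count = len(voc_pol)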

In [133]:
#P(politics | leaders agreed to fund the stadium)

In [134]:
def find_number_words_in_category(phrase, category):
    """Count how often each word of `phrase` occurs across the articles in `category`."""
    statement = phrase.split()
    cat_word_list = ' '.join(category).split()
    word_count = defaultdict(int)
    for word in statement:
        # adding 0 still creates the key, so unseen words appear with a count of 0
        word_count[word] += cat_word_list.count(word)
    return word_count
                
            

In [135]:
test_music_word_count = find_number_words_in_category(test_statement,music)


In [136]:
test_music_word_count

defaultdict(int,
            {'world': 0,
             'leaders': 1,
             'agreed': 0,
             'to': 0,
             'fund': 0,
             'the': 1,
             'stadium': 1})

In [137]:
test_politic_word_count = find_number_words_in_category(test_statement,politics)

In [138]:
test_politic_word_count

defaultdict(int,
            {'world': 1,
             'leaders': 1,
             'agreed': 1,
             'to': 0,
             'fund': 0,
             'the': 2,
             'stadium': 0})

In [139]:
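def find_likelihood(category_count, test_category_count, alpha):
    # numerator: product of the smoothed counts, one factor per word in the phrase
    # (np.prod replaces np.product, which was removed in NumPy 2.0)
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    # denominator: the smoothed category total, raised to the number of words
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))
    return num/denom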

In [140]:
likelihood_m = find_likelihood(total_music_count,test_music_word_count,1)

In [141]:
likelihood_p = find_likelihood(total_politics_count,test_politic_word_count,1)

In [142]:
print(likelihood_m)
print(likelihood_p)

4.107740405680756e-11
1.748875897714495e-10


 $\large P(politics | article) \propto P(politics) \times \prod_{i=1}^{d} P(word_{i} | politics) $

#### Determining the winner of our model:

<img src = "./resources/solvingforyhat.png" width= "400">

In [143]:
p_politics = .5
p_music = .5

In [144]:
# p(politics|article)  > p(music|article)
likelihood_p * p_politics  > likelihood_m * p_music

True
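
The comparison above only tells us which class wins. As a rough sketch, the two scores can also be turned into posterior probabilities by dividing each by their sum, since $P(phrase)$ is the total of likelihood times prior over both classes. The likelihood values are copied from the printed output above; the names `score_m`, `score_p`, and `evidence` are introduced here for illustration.

```python
# Likelihoods copied from the printed output above
likelihood_m = 4.107740405680756e-11
likelihood_p = 1.748875897714495e-10
p_music = 0.5
p_politics = 0.5

# Unnormalized scores: likelihood * prior
score_m = likelihood_m * p_music
score_p = likelihood_p * p_politics

# P(phrase) is the sum of the scores over all classes, so dividing by it
# turns the scores into probabilities that sum to 1
evidence = score_m + score_p
posterior_music = score_m / evidence
posterior_politics = score_p / evidence
print(round(posterior_politics, 3), round(posterior_music, 3))
```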

The probabilities we end up with are often exceedingly small, so we take logs. This turns the product into a sum and avoids the numerical underflow that multiplying many small numbers can cause.

$\large log(P(politics | article)) = log(P(politics)) + \sum_{i=1}^{d}log( P(word_{i} | politics)) $
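
A small sketch of why the log form helps, using 100 made-up word probabilities (hypothetical values, not taken from our articles): the direct product underflows to zero in floating point, while the sum of logs stays finite and can still be compared across classes.

```python
import math

# 100 hypothetical word probabilities, each small, as in a long document
word_probs = [1e-5] * 100

# Multiplying them directly underflows to 0.0
direct_product = math.prod(word_probs)

# Summing their logs keeps the score finite
log_score = sum(math.log(p) for p in word_probs)
print(direct_product, log_score)
```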





Good Resource: https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

# Back to Satire

In [150]:
import pandas as pd
import numpy as np
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.head()

Unnamed: 0,body,target
0,Noting that the resignation of James Mattis as...,1
1,Desperate to unwind after months of nonstop wo...,1
2,"Nearly halfway through his presidential term, ...",1
3,Attempting to make amends for gross abuses of ...,1
4,Decrying the Senate’s resolution blaming the c...,1


Like always, we will perform a train test split...

In [151]:
X=corpus.body
y=corpus.target

In [152]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, test_size=.25)

...and preprocess the training set like we learned.

In [153]:
import nltk
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords

In [154]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
token_docs = [regexp_tokenize(doc, pattern) for doc in X_train]
sw = stopwords.words('english')
sw.extend(['would', 'one', 'say'])

In [155]:
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer 
  

def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [156]:
def doc_preparer(doc, stop_words=sw):
    '''
    
    :param doc: a document from the satire corpus 
    :return: a document string with words which have been 
            lemmatized, 
            parsed for stopwords, 
            made lowercase,
            and stripped of punctuation and numbers.
    '''
    
    regex_token = RegexpTokenizer(r"([a-zA-Z]+(?:’[a-z]+)?)")
    doc = regex_token.tokenize(doc)
    doc = [word.lower() for word in doc]
    doc = [word for word in doc if word not in stop_words]
    doc = pos_tag(doc)
    doc = [(word[0], get_wordnet_pos(word[1])) for word in doc]
    lemmatizer = WordNetLemmatizer() 
    doc = [lemmatizer.lemmatize(word[0], word[1]) for word in doc]
    return ' '.join(doc)

In [195]:
token_docs = [doc_preparer(doc, sw) for doc in X_train]

In [196]:
from sklearn.feature_extraction.text import CountVectorizer

For demonstration purposes, we will limit our count vectorizer to 5 words (the top 5 words by frequency).
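
As a rough illustration of what limiting the vocabulary does, the sketch below builds a count matrix restricted to the `k` most frequent terms of a tiny made-up corpus. It is a simplified stand-in, not scikit-learn's implementation: the function `top_k_counts` and the documents are hypothetical, and it tokenizes by splitting on whitespace only.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (not from the satire corpus)
docs = ['say say say trump trump people', 'say year trump', 'say state']

def top_k_counts(docs, k):
    """Count matrix restricted to the k most frequent terms in the corpus."""
    tokenized = [doc.split() for doc in docs]
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    # Columns are the k most frequent terms, listed alphabetically
    vocab = sorted(term for term, _ in corpus_counts.most_common(k))
    rows = []
    for doc in tokenized:
        doc_counts = Counter(doc)
        rows.append([doc_counts[term] for term in vocab])
    return vocab, rows

vocab, rows = top_k_counts(docs, k=2)
print(vocab)
print(rows)
```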

In [197]:
# Secondary train-test split to build our best model
X_t, X_val, y_t, y_val = train_test_split(token_docs, y_train, test_size=.25, random_state=42)

In [198]:
cv = CountVectorizer(max_features=5)
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

In [199]:
X_t_vec

Unnamed: 0,people,say,state,trump,year
159,3,0,0,0,0
246,1,0,0,7,1
640,0,4,1,0,4
809,2,10,2,0,7
130,0,0,0,0,0
...,...,...,...,...,...
148,1,0,0,0,1
300,0,1,0,0,0
356,1,3,0,0,0
36,1,4,0,0,3


In [200]:
X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)


# Knowledge Check

The word `say` shows up in our count vectorizer, even though we added it to our stopword list. What is going on?

# Multinomial Naive Bayes

Let's break down MNB with our `X_t_vec` and `y_t` arrays in mind.

What are the priors for each class as calculated from these arrays?

In [202]:
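# Class priors: the share of training documents in each class
prior_0 = (y_t == 0).mean()
prior_1 = (y_t == 1).mean()
# log prior of class 1 (289 of the 562 training documents)
np.log(prior_1)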

-0.665075161781259

Let's train our model.

In [203]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
mnb.__dict__

{'alpha': 1.0,
 'fit_prior': True,
 'class_prior': None,
 'n_features_': 5,
 'classes_': array([0, 1]),
 'class_count_': array([273., 289.]),
 'feature_count_': array([[ 211., 1419.,  371.,  283.,  348.],
        [ 385.,  241.,  111.,  152.,  264.]]),
 'feature_log_prob_': array([[-2.52081091, -0.61898504, -1.95850333, -2.22842295, -2.02232526],
        [-1.09861229, -1.56551193, -2.33595079, -2.02401174, -1.47471983]]),
 'class_log_prior_': array([-0.72203005, -0.66507516])}
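
The fitted attributes above can be reproduced by hand, which shows that the model applies the same Laplace smoothing we used in the toy example. The sketch below copies `class_count_` and `feature_count_` from the output and recomputes the log priors and the smoothed log probabilities with plain Python; the variable names are chosen for this illustration.

```python
import math

# Values copied from the fitted model's attributes shown above
class_count = [273.0, 289.0]
feature_count = [[211.0, 1419.0, 371.0, 283.0, 348.0],
                 [385.0, 241.0, 111.0, 152.0, 264.0]]
alpha = 1.0

# Log prior: log of each class's share of the training documents
n_docs = sum(class_count)
class_log_prior = [math.log(c / n_docs) for c in class_count]

feature_log_prob = []
for row in feature_count:
    # total word count in the class + alpha * number of features
    denom = sum(row) + alpha * len(row)
    feature_log_prob.append([math.log((c + alpha) / denom) for c in row])

print([round(v, 8) for v in class_log_prior])
print([round(v, 8) for v in feature_log_prob[0]])
print([round(v, 8) for v in feature_log_prob[1]])
```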

In [204]:
# https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

In [205]:
random_sample = X_val_vec.sample(1, random_state=40)
random_sample.head()

Unnamed: 0,people,say,state,trump,year
520,1,7,6,0,1


The posterior probabilities we want to compare look like this:

$$ \Large P(satire|count\_people, count\_say...count\_year)$$

$$ \Large P(not\_satire|count\_people, count\_say...count\_year)$$

In [206]:
likelihood_nosat = mnb.feature_log_prob_[0]*random_sample
likelihood_sat =  mnb.feature_log_prob_[1]*random_sample
likelihood_nosat = likelihood_nosat.sum(axis=1)
likelihood_sat = likelihood_sat.sum(axis=1)

print(likelihood_nosat, likelihood_sat)

520   -20.627051
dtype: float64 520   -27.54762
dtype: float64


In [207]:
likelihood_nosat + mnb.class_log_prior_[0]

520   -21.349081
dtype: float64

In [208]:
likelihood_sat + mnb.class_log_prior_[1]

520   -28.212696
dtype: float64
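
These two numbers are log joint scores, not probabilities. A minimal sketch of how they could be normalized into class probabilities uses the log-sum-exp trick, with the values copied from the outputs above. The result should be comparable to what `mnb.predict_proba(random_sample)` reports, although that is an assumption that has not been checked here.

```python
import math

# Log joint scores for the sampled document, copied from the outputs above
log_joint_nosat = -21.349081
log_joint_sat = -28.212696

# Log-sum-exp: subtract the maximum before exponentiating for stability
m = max(log_joint_nosat, log_joint_sat)
log_evidence = m + math.log(math.exp(log_joint_nosat - m) +
                            math.exp(log_joint_sat - m))

# Normalized posterior probabilities for each class
p_nosat = math.exp(log_joint_nosat - log_evidence)
p_sat = math.exp(log_joint_sat - log_evidence)
print(p_nosat, p_sat)
```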

In [209]:
mnb.predict(random_sample)

array([0])

In [210]:
y_val.loc[random_sample.index]

520    0
Name: target, dtype: int64

In [211]:

y_hat = mnb.predict(X_val_vec)
y_hat

array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

In [212]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

accuracy_score(y_val, y_hat)

0.8297872340425532

In [213]:
f1_score(y_val, y_hat)

0.8202247191011236

In [214]:
confusion_matrix(y_val, y_hat)

array([[83, 16],
       [16, 73]])
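
The accuracy and F1 scores reported above follow directly from this confusion matrix. The sketch below recomputes them by hand, reading the matrix with true classes in rows and predicted classes in columns.

```python
# Confusion matrix from above: rows are true classes, columns are predictions
tn, fp = 83, 16
fn, tp = 16, 73

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted satire that is satire
recall = tp / (tp + fn)      # share of actual satire that was found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, f1)
```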

That performs very well for only having 5 features.

Let's see what happens when we increase our feature set

In [215]:
cv = CountVectorizer()
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)
X_t_vec.shape

(562, 14819)

In [216]:
X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

In [217]:
mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)
y_hat

array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0])

In [218]:
accuracy_score(y_val, y_hat)

0.9627659574468085

In [219]:
f1_score(y_val, y_hat)

0.96045197740113

In [220]:
confusion_matrix(y_val, y_hat)

array([[96,  3],
       [ 4, 85]])

That performs very well. 

Let's see whether we can maintain that level of accuracy with fewer words.

In [221]:
cv = CountVectorizer(min_df=.05, max_df=.95)
X_t_vec = cv.fit_transform(X_t)
X_t_vec  = pd.DataFrame.sparse.from_spmatrix(X_t_vec)
X_t_vec.columns = sorted(cv.vocabulary_)
X_t_vec.set_index(y_t.index, inplace=True)

X_val_vec = cv.transform(X_val)
X_val_vec  = pd.DataFrame.sparse.from_spmatrix(X_val_vec)
X_val_vec.columns = sorted(cv.vocabulary_)
X_val_vec.set_index(y_val.index, inplace=True)

mnb = MultinomialNB()

mnb.fit(X_t_vec, y_t)
y_hat = mnb.predict(X_val_vec)

f1_score(y_val, y_hat)

0.9378531073446328

In [222]:
X_t_vec.shape

(562, 650)
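
The `min_df` and `max_df` settings shrink the vocabulary by document frequency: terms that appear in too few or too many documents are dropped. The sketch below imitates that filter on a made-up corpus. The function `filter_by_doc_freq` and the documents are hypothetical, and the boundary handling is an assumption about how proportions are applied rather than a copy of scikit-learn's code.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (not from the satire corpus)
docs = ['trump say state', 'people say year',
        'state say people', 'year say budget']

def filter_by_doc_freq(docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(term for term, df in doc_freq.items()
                  if min_df * n_docs <= df <= max_df * n_docs)

kept = filter_by_doc_freq(docs, min_df=0.5, max_df=0.95)
print(kept)
```

Here 'say' is dropped because it appears in every document, and 'trump' and 'budget' are dropped because each appears in only one.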

In [223]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_t_vec, y_t)
rf.score(X_val_vec, y_val)

0.9680851063829787

# Bonus NLP EDA

In [122]:
policies = pd.read_csv('data/2020_policies_feb_24.csv')
policies.head()

Unnamed: 0.1,Unnamed: 0,name,policy,candidate
0,0,100% Clean Energy for America,"As published on Medium on September 3rd, 2019:...",warren
1,1,A Comprehensive Agenda to Boost America’s Smal...,Small businesses are the heart of our economy....,warren
2,2,A Fair and Welcoming Immigration System,"As published on Medium on July 11th, 2019:\nIm...",warren
3,3,A Fair Workweek for America’s Part-Time Workers,Working families all across the country are ge...,warren
4,4,A Great Public School Education for Every Student,I attended public school growing up in Oklahom...,warren


# Question set 1:
After removing punctuation and ridding the text of numbers and other text of low semantic value, answer the following questions.

1. Which document has the greatest average word length?
2. What is the average word length of the entire corpus?
3. Which is greater, the average word length for the documents in the Warren or Sanders campaigns? 


Proceed through the remaining standard preprocessing steps in whatever manner you see fit. Make sure to:
- Make text lowercase
- Remove stopwords
- Stem or lemmatize

# Question set 2:
1. What are the most common words across the corpus?
2. What are the most common words across each campaign?

> To answer these questions, you may find the nltk `FreqDist` class helpful.

3. Use the FreqDist plot method to make a frequency plot for the corpus as a whole.  
4. Based on that plot, should any more words be added to our stopword library?


# Question set 3:

1. What are the most common bigrams in the corpus?
2. What are the most common bigrams in the Warren campaign and the Sanders campaign, respectively?
3. Answer questions 1 and 2 for trigrams.

> Hint: You may find it useful to leverage the nltk.collocations functions

After answering the questions, transform the data into a document-term matrix using either `CountVectorizer` or `TfidfVectorizer`.

Run a Multinomial Naive Bayes classifier and judge how accurately our models can separate documents from the two campaigns.