# 11/11 Notebook - Naive Bayes Classification

Hi everyone! In this notebook we'll be taking a look at another interdisciplinary aspect of NLP, the `Naive Bayes Model`, which combines NLP and Statistics to create fairly intelligent models. We'll also get some experience with external APIs, specifically `Tweepy`, which we'll use for our own data collection

**Fun Fact: When I was making this notebook, I tried getting Twitter data from Donald Trump. However, I was unable to because his Twitter is so ridiculous**

Objectives:
- Gain experience with external `APIs` in Python
- Learn the fundamentals of the `Naive Bayes` model
- Classify and predict tweets between two people

To finish this notebook, you'll have to complete the following methods:
1. `build_dict()`
2. `get_bayes_constants()`
3. `calc_word_prob()`
4. `calc_likelihood()`
5. `build_likelihood_dict()`
6. `predict_tweet()`

## Part 1: Accessing the Twitter Data

In this example, we'll be looking at the Twitter data from `Kanye West` and `Joe Biden`, to see if we can find any significant and recurrent differences between their uses of language

To accomplish this, we need to install `Tweepy`, a library that makes it very simple to access the `Twitter API`

In [1]:
pip install tweepy -q

Note: you may need to restart the kernel to use updated packages.


Now we'll read in some data that we need to access the API

In [2]:
# import library
import json
# open the file and parse it (the with-block closes the file for us)
with open("api_data.json") as api_file:
    api_json = json.load(api_file)
# read data
api_key = api_json["api_key"]
api_secret = api_json["api_secret"]
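The cell above expects `api_data.json` to hold two fields, `api_key` and `api_secret`. Here is a minimal sketch of a loader that fails early when one of them is missing. The helper name `load_credentials`, the placeholder values, and the temporary file are assumptions for illustration, not part of the notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("api_key", "api_secret")

def load_credentials(path):
    """Read the two credential fields and fail early if one is missing."""
    with open(path) as f:
        data = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"missing credential fields: {missing}")
    return data["api_key"], data["api_secret"]

# Demo with placeholder values written to a temporary file (not real credentials)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "api_data.json")
    with open(demo_path, "w") as f:
        json.dump({"api_key": "YOUR_KEY", "api_secret": "YOUR_SECRET"}, f)
    demo_key, demo_secret = load_credentials(demo_path)

print(demo_key, demo_secret)  # YOUR_KEY YOUR_SECRET
```

Keeping the credentials in a separate file like this also means they never have to appear in the notebook itself.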

The credentials are loaded in, so we can activate the API

First let's load in the `Tweepy` libraries we need

In [3]:
import tweepy
from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor
import datetime

As well as some additional libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk

Now we can create an API object to access tweets in real time

In [5]:
auth = tweepy.AppAuthHandler(api_key, api_secret)
api = tweepy.API(auth)

We're going to be comparing the tweets from `Joe Biden` and `Kanye West`, so let's collect a sample of their most recent tweets

We'll define a helper function to help us accomplish this

In [6]:
def get_tweets(handles, num_tweets):
    
    # initialize the dictionary
    tweet_dict = dict()
    
    # iterate through each twitter handle
    for handle in handles:
        # get the tweets
        tweets = tweepy.Cursor(api.user_timeline, screen_name = handle, include_rts = False).items(num_tweets)
        # iterate through each tweet and add it to the dictionary
        for tweet in tweets:
            tweet_dict[tweet.id] = [tweet.text, handle]
    
    # create a pandas dataframe
    return pd.DataFrame.from_dict(tweet_dict, orient = "index", columns = ["tweet", "handle"])

Now we can run the cell below with our helper function to get all of the tweets

*Note: The next cell might take a couple of minutes to finish running*

In [7]:
tweet_data = get_tweets(["kanyewest", "joebiden"], 5000)

Now we'll split the data by candidate, and into testing and training data

In [8]:
# split the kanye tweets
kanye_tweets = tweet_data[tweet_data["handle"] == "kanyewest"].to_numpy()
np.random.shuffle(kanye_tweets)
kanye_train = kanye_tweets[0:int(0.8 * len(kanye_tweets))]
kanye_test = kanye_tweets[int(0.8 * len(kanye_tweets)):]

# split the biden tweets
biden_tweets = tweet_data[tweet_data["handle"] == "joebiden"].to_numpy()
np.random.shuffle(biden_tweets)
biden_train = biden_tweets[0:int(0.8 * len(biden_tweets))]
biden_test = biden_tweets[int(0.8 * len(biden_tweets)):]

# combine our data
train_data = np.concatenate((kanye_train, biden_train))
test_data = np.concatenate((kanye_test, biden_test))

# one more shuffle for good measure
np.random.shuffle(train_data)
np.random.shuffle(test_data)

Finally, we need to do one more split: into inputs and outputs (x and y)

In [9]:
# split training data
train_data_x = train_data[:, 0]
train_data_y = train_data[:, 1]

# split testing data
test_data_x = test_data[:, 0]
test_data_y = test_data[:, 1]
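The shuffles above are unseeded, so every run produces a different train/test split and slightly different accuracies. If you want repeatable numbers, one option is a seeded split. The sketch below uses only the standard library; the helper name `split_train_test` and the toy rows are assumptions for illustration.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut it at train_fraction."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy rows in the same (tweet, handle) layout the notebook uses
rows = [(f"tweet {i}", "kanyewest") for i in range(10)]
train, test = split_train_test(rows)
print(len(train), len(test))  # 8 2
```

Calling the helper twice with the same seed returns the same split, which makes it easier to compare changes to the model.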

## Part 2: Setting Up the Naive Bayes Metrics

The first thing we need to do is create a dictionary to store the frequency of words for each twitter handle

We're going to need some additional libraries from `nltk` to aid us with this

In [10]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
import string

We've also provided a helper function to process the tweet into different, cleaned words

In [11]:
def process_tweet(tweet):
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and     # remove stopwords
            word not in string.punctuation and    # remove punctuation
            word.isalpha()):                      # keep alphabetic tokens only
            tweets_clean.append(word)

    return tweets_clean
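To see what the regex steps in `process_tweet()` do on their own, here is a small sketch that applies just those four substitutions, with no NLTK required. The helper name `strip_noise` and the sample tweets are made up for illustration.

```python
import re

def strip_noise(tweet):
    """Apply only the regex steps of process_tweet (no tokenizing or stopwords)."""
    tweet = re.sub(r'\$\w*', '', tweet)                  # drop tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)               # drop a leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)   # drop a link and the rest of its line
    tweet = re.sub(r'#', '', tweet)                      # keep the hashtag word, drop '#'
    return tweet

sample = "RT $GE is up! #winning https://example.com/abc"
print(repr(strip_noise(sample)))                    # 'is up! winning '
print(repr(strip_noise("see https://x.co now")))    # 'see '
```

Note the second example: because the hyperlink pattern ends in `.*`, any words that follow a link on the same line are removed along with it.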

Recall our schema for the frequency dictionary:

`{ "word" : [kanye_count, biden_count]}`
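As a quick illustration of that schema, here is a toy run on three made-up, already-tokenized tweets. The data and the variable names are assumptions for the example; the real function works on raw tweets via `process_tweet()`.

```python
# Hypothetical, already-tokenized tweets: (tokens, handle)
toy_tweets = [
    (["bless", "up"], "kanyewest"),
    (["vote", "today"], "joebiden"),
    (["vote", "bless"], "joebiden"),
]

toy_freqs = {}
for tokens, handle in toy_tweets:
    column = 1 if handle == "joebiden" else 0  # 0 = Kanye, 1 = Biden
    for word in tokens:
        counts = toy_freqs.setdefault(word, [0, 0])
        counts[column] += 1

print(toy_freqs)
# {'bless': [1, 1], 'up': [1, 0], 'vote': [0, 2], 'today': [0, 1]}
```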

Complete the function `build_dict()` below, which does the following:
1. Sets the values of `tweet` and `handle`, the 0th and 1st elements of `entry`, respectively
2. Sets the value of `is_biden`, which checks if the `handle` is Joe Biden's handle
3. Gets `counts` from the dictionary, `[0, 0]` if it doesn't exist
4. Increments the appropriate index of `counts`
5. Update the `freqs` dictionary

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li><code>tweet</code> is the 0th element of <code>entry</code> and <code>handle</code> is the 1st element of <code>entry</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li><code>is_biden</code> should be true when <code>handle</code> equals <code>"joebiden"</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 3</b></font>
</summary>
<p>
<ul>
    <li><code>counts</code> can be found with <code>freqs.get()</code>, with parameters <code>word</code> and <code>[0, 0]</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 4</b></font>
</summary>
<p>
<ul>
    <li>Increment this value by 1</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 5</b></font>
</summary>
<p>
<ul>
    <li>Set <code>freqs[word]</code> equal to <code>counts</code></li>
</ul>
</p>
</details>

In [12]:
def build_dict(tweets):
    # declare a dictionary
    freqs = dict()
    
    # iterate through each (tweet, handle) entry
    for entry in tweets:
        
        # grab the tweet and handle
        tweet = entry[0]
        handle = entry[1]
        
        for word in process_tweet(tweet):
            # check if the tweet is a biden tweet
            is_biden = handle == "joebiden"
            
            # get the counts from the dictionary, [0, 0] if the word isn't there yet
            counts = freqs.get(word, [0, 0])
            # increment the count (False indexes slot 0 for Kanye, True indexes slot 1 for Biden)
            counts[is_biden] += 1
            
            # update the dictionary
            freqs[word] = counts
    
    # return the dictionary
    return freqs

Let's store this dictionary in `word_freqs`

In [13]:
word_freqs = build_dict(train_data)

In [14]:
# run this cell to test your code 
# (by the time you run this cell the tweet counts might have changed, but I tried to make it fairly general)
if (word_freqs["kanye"][0] >= 10 and word_freqs["trump"][1] >= 144):
    print("Lookin' good")
else:
    print("Looking bad bad oh no so bad")

Lookin' good


To calculate the probability that a given word is a "Kanye Word" or a "Biden Word", we would normally need the following formulas:

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{kanye})&space;=&space;\frac{freq_{word_{kanye}}}{count_{kanye}}" title="P(word_{kanye}) = \frac{freq_{word_{kanye}}}{count_{kanye}}" />

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{biden})&space;=&space;\frac{freq_{word_{biden}}}{count_{biden}}" title="P(word_{biden}) = \frac{freq_{word_{biden}}}{count_{biden}}" />

To make our model more robust, we'll add a `smoothing` term, changing our formulas to:

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{kanye})&space;=&space;\frac{freq_{word_{kanye}}&space;&plus;&space;1}{count_{kanye}&space;&plus;&space;count_{unique}}" title="P(word_{kanye}) = \frac{freq_{word_{kanye}} + 1}{count_{kanye} + count_{unique}}" />

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{biden})&space;=&space;\frac{freq_{word_{biden}}&space;&plus;&space;1}{count_{biden}&space;&plus;&space;count_{unique}}" title="P(word_{biden}) = \frac{freq_{word_{biden}} + 1}{count_{biden} + count_{unique}}" />
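To see why the smoothing term matters, consider a word that Kanye never used in the training data. Without smoothing its probability is exactly 0, which later makes the log of the ratio undefined. The sketch below plugs toy numbers into the smoothed formula; the helper name `smoothed_prob` and the numbers are assumptions for illustration.

```python
def smoothed_prob(word_count, class_count, unique_count):
    """Add-one smoothing in the form used by the formulas above."""
    return (word_count + 1) / (class_count + unique_count)

# Toy numbers: count_kanye = 6, count_unique = 4, and a word Kanye never used
count_kanye, count_unique = 6, 4
unsmoothed = 0 / count_kanye
smoothed = smoothed_prob(0, count_kanye, count_unique)
print(unsmoothed, smoothed)  # 0.0 0.1
```

The smoothed value is small but nonzero, so a single unseen word can no longer zero out a whole tweet.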

Complete the function `get_bayes_constants()`, which does the following:
1. Finds `num_unique`, the number of unique words in `freqs`
2. Sets `word` and `counts` using `item`, representing each entry in `freqs`
3. Increments `num_kanye_words` and `num_biden_words` if the word has a `count` greater than 0

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li><code>num_unique</code> would be the length of the dictionary!</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li><code>word</code> is the 0th element of <code>item</code> and <code>counts</code> is the 1st element of <code>item</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 3</b></font>
</summary>
<p>
<ul>
    <li>You can increment <code>num_kanye_words</code> by <code>counts[0] > 0 </code></li>
    <li>You can increment <code>num_biden_words</code> by <code>counts[1] > 0 </code></li>
</ul>
</p>
</details>

In [15]:
def get_bayes_constants(freqs):
    # finds the number of unique words in the dictionary
    num_unique = len(freqs)

    # initializes variables
    num_kanye_words = 0
    num_biden_words = 0
    
    # iterates through dictionary
    for item in freqs.items():
        
        # gets the word and count
        word = item[0]
        counts = item[1]
        
        # increments words when appropriate
        num_kanye_words += (counts[0] > 0)
        num_biden_words += (counts[1] > 0)
        
    # returns values
    return (num_unique, num_kanye_words, num_biden_words)

In [16]:
# run this cell to test
# (by the time you run this cell the tweet counts might have changed, but I tried to make it fairly general)
num_unique, num_kanye_words, num_biden_words = get_bayes_constants(word_freqs)
if (num_unique > 3000 and num_kanye_words > 1500 and num_biden_words > 2000):
    print("You're an actual lexicographic warlock")
else:
    print("Better go back to wizarding school :(")

You're an actual lexicographic warlock


Now we have the values for $count_{unique}$, $count_{kanye}$, and $count_{biden}$ (returned by the code as `num_unique`, `num_kanye_words`, and `num_biden_words`), so we can calculate $P(word_{kanye})$ and $P(word_{biden})$ using the formulas above

The formulas have been repasted below for your convenience

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{kanye})&space;=&space;\frac{freq_{word_{kanye}}&space;&plus;&space;1}{count_{kanye}&space;&plus;&space;count_{unique}}" title="P(word_{kanye}) = \frac{freq_{word_{kanye}} + 1}{count_{kanye} + count_{unique}}" />

<img src="https://latex.codecogs.com/gif.latex?\dpi{200}&space;P(word_{biden})&space;=&space;\frac{freq_{word_{biden}}&space;&plus;&space;1}{count_{biden}&space;&plus;&space;count_{unique}}" title="P(word_{biden}) = \frac{freq_{word_{biden}} + 1}{count_{biden} + count_{unique}}" />

Complete the function `calc_word_prob()`, which does the following:
1. Finds the value of `count` using `freqs`, returning `[0, 0]` if the word does not exist
2. Calculates `word_kanye_numerator` and `word_kanye_denominator` using the formula above
3. Calculates `word_biden_numerator` and `word_biden_denominator` using the formula above
4. Divides the values in `2` and `3` appropriately to calculate `p_word_kanye` and `p_word_biden`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>You can use the dictionary's <code>get()</code> function with <code>word</code> and <code>[0, 0]</code> as parameters</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 2</b></font>
</summary>
<p>
<ul>
    <li><code>word_kanye_numerator = (count[0] + 1)</code></li>
    <li><code>word_kanye_denominator = num_kanye_words + num_unique</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 3</b></font>
</summary>
<p>
<ul>
    <li><code>word_biden_numerator = (count[1] + 1)</code></li>
    <li><code>word_biden_denominator = num_biden_words + num_unique</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 4</b></font>
</summary>
<p>
<ul>
    <li><code>p_word_kanye = word_kanye_numerator / word_kanye_denominator</code></li>
    <li><code>p_word_biden = word_biden_numerator / word_biden_denominator</code></li>
</ul>
</p>
</details>

In [17]:
def calc_word_prob(word, freqs):
    # get our needed constants
    num_unique, num_kanye_words, num_biden_words = get_bayes_constants(freqs)
    
    # gets the count in the dictionary
    count = freqs.get(word, [0, 0])
    
    # calculates the kanye numerator and denominator
    word_kanye_numerator = (count[0] + 1)
    word_kanye_denominator = num_kanye_words + num_unique
    
    # calculates the biden numerator and denominator
    word_biden_numerator = (count[1] + 1)
    word_biden_denominator = num_biden_words + num_unique
    
    # calculates the probabilities
    p_word_kanye = word_kanye_numerator / word_kanye_denominator
    p_word_biden = word_biden_numerator / word_biden_denominator
    
    # returns the probabilities
    return (p_word_kanye, p_word_biden)

In [18]:
# run this cell to test your code
# (by the time you run this cell the tweet counts might have changed, but I tried to make it fairly general)
trump_prob = calc_word_prob("trump", word_freqs)
kanye_prob = calc_word_prob("kanye", word_freqs)
jesus_prob = calc_word_prob("jesus", word_freqs)

if (trump_prob[1] > trump_prob[0] and kanye_prob[0] > kanye_prob[1] and jesus_prob[0] > jesus_prob[1]):
    print("That's some spunky code you got there! (good job)")
else:
    print("Probabili-deez! iykyk (try again)")

That's some spunky code you got there! (good job)


There's one more metric we need before we can start training the model, and that's `likelihood`. This value is the log of a ratio: how likely a given word is to have come from `Kanye` divided by how likely it is to have come from `Biden`. We take the `natural log` of the ratio because the log is strictly increasing, so it doesn't change which side wins; it just turns a ratio above 1 into a positive number and a ratio below 1 into a negative one

Theory aside, the formula for `likelihood` is:

<img src="https://latex.codecogs.com/gif.latex?\dpi{300}&space;likelihood&space;=&space;log(\frac{P(word_{kanye})}{P(word_{biden})})" title="likelihood = log(\frac{P(word_{kanye})}{P(word_{biden})})" />
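A quick way to build intuition for the sign of this value is to try a few made-up probabilities. The helper name `log_ratio` and the numbers are assumptions for illustration.

```python
import math

def log_ratio(p_kanye, p_biden):
    """Natural log of P(word | Kanye) / P(word | Biden)."""
    return math.log(p_kanye / p_biden)

print(log_ratio(0.02, 0.01) > 0)  # True: the word leans Kanye
print(log_ratio(0.01, 0.02) < 0)  # True: the word leans Biden
print(log_ratio(0.01, 0.01))      # 0.0: the word is neutral
```

Swapping the two probabilities flips the sign but keeps the magnitude, which is what lets us simply add these values up across the words of a tweet later on.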

Complete the function `calc_likelihood()`, which does the following:

1. Stores the probabilities in `probs` by using `calc_word_prob()` 
2. Uses array indexing to get the values for `kanye_prob` and `biden_prob`
3. Calculates `likelihood` using the above formula

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>Set <code>probs</code> using the helper function with parameters <code>word</code> and <code>freqs</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li><code>kanye_prob</code> is the 0th element of <code>probs</code>, while <code>biden_prob</code> is the first</li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints for Step 3</b></font>
</summary>
<p>
<ul>
    <li>Use <code>np.log()</code> to take the natural log</li>
    <li>Set <code>likelihood</code> equal to the natural log of <code>kanye_prob</code> divided by <code>biden_prob</code></li>
</ul>
</p>
</details>

In [19]:
def calc_likelihood(word, freqs):
    # uses the helper function to get the probabilities
    probs = calc_word_prob(word, freqs)
    
    # indexes the kanye and biden probabilities
    kanye_prob = probs[0]
    biden_prob = probs[1]
    
    # calculate the log of the probability ratio
    likelihood = np.log(kanye_prob / biden_prob)
    
    # return the value
    return likelihood

In [20]:
# run this cell to test your code
# (by the time you run this cell the tweet counts might have changed, but I tried to make it fairly general)
if calc_likelihood("god", word_freqs) > 0 and calc_likelihood("president", word_freqs) < 0:
    print("Nice job buster!")
else:
    print("Back to the lab again, bustee")

Nice job buster!


## Part 3: "Training" the Model

Technically, we're not really training a model per se. There's no regression going on here; all we need to do is create a dictionary that associates every `word` with a `likelihood`

Complete the function `build_likelihood_dict()`, which does the following:
1. Finds the value of `word` using array indexing with `item`
2. Calculates the `likelihood` using the helper function `calc_likelihood`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li><code>word</code> is the 0th element of <code>item</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li>Set <code>likelihood</code> equal to the return value of the helper function, with <code>word</code> and <code>freqs</code> as parameters</li>
</ul>
</p>
</details>

In [21]:
def build_likelihood_dict(freqs):
    # initializes the dictionary
    likelihood_dict = dict()
    
    # iterates through each item in the dictionary
    for item in freqs.items():
        # grab the word
        word = item[0]
        
        # calculate the likelihood and store it in the dictionary
        likelihood = calc_likelihood(word, freqs)
        likelihood_dict[word] = likelihood
    
    return likelihood_dict

In [22]:
# store the dictionary we need for testing
likelihoods = build_likelihood_dict(word_freqs)
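One thing worth noticing: `calc_word_prob()` calls `get_bayes_constants()` for every word, and that function loops over the whole dictionary each time, so building the likelihood dictionary does far more work than it needs to. Below is a sketch of an alternative that computes the constants once and should give the same values. The name `build_likelihood_dict_fast` and the toy data are assumptions, and it uses `math.log` in place of `np.log` to stay dependency-free.

```python
import math

def build_likelihood_dict_fast(freqs):
    """Same formulas as the notebook, but the constants are computed only once."""
    num_unique = len(freqs)
    num_kanye_words = sum(1 for counts in freqs.values() if counts[0] > 0)
    num_biden_words = sum(1 for counts in freqs.values() if counts[1] > 0)

    kanye_denominator = num_kanye_words + num_unique
    biden_denominator = num_biden_words + num_unique

    likelihood_dict = {}
    for word, (kanye_count, biden_count) in freqs.items():
        p_word_kanye = (kanye_count + 1) / kanye_denominator
        p_word_biden = (biden_count + 1) / biden_denominator
        likelihood_dict[word] = math.log(p_word_kanye / p_word_biden)
    return likelihood_dict

# Toy frequency dictionary in the {word: [kanye_count, biden_count]} schema
toy_freqs = {"bless": [3, 0], "vote": [0, 5], "today": [1, 1]}
toy_likelihoods = build_likelihood_dict_fast(toy_freqs)
print(toy_likelihoods["bless"] > 0, toy_likelihoods["vote"] < 0)  # True True
```

On a few thousand unique words the difference is between one pass over the dictionary and one pass per word.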

## Part 4: Testing the Model

Now that we have the dictionary we need, we can test our model with our testing data we declared in the beginning. To get the prediction of a new tweet, we'll sum the likelihood of each word in the tweet. Mathematically:


<img src="https://latex.codecogs.com/gif.latex?\dpi{300}&space;p&space;=&space;\sum&space;likelihood(word)" title="p = \sum likelihood(word)" />

**Note: The way our model was set up, $p < 0$ corresponds to Biden, while $p > 0$ corresponds to Kanye**
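Here is the sum worked out on a tiny, made-up likelihood dictionary. The helper name `score` and the values are assumptions for illustration. Words the model has never seen contribute 0, so they neither help nor hurt either side.

```python
# Hypothetical likelihoods: positive leans Kanye, negative leans Biden
toy_likelihoods = {"bless": 1.4, "vote": -1.8, "today": 0.0}

def score(tokens, likelihoods):
    """Sum the per-word likelihoods, treating unseen words as 0."""
    return sum(likelihoods.get(word, 0) for word in tokens)

print(score(["bless", "up"], toy_likelihoods))    # 1.4  -> Kanye
print(score(["vote", "today"], toy_likelihoods))  # -1.8 -> Biden
```

A tweet made up entirely of unseen words scores exactly 0, so it is worth deciding up front which side a tie should fall on.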

Complete the function `predict_tweet()` which does the following:
1. Increments the `likelihood_sum` using the `likelihoods` dictionary
2. Inserts the appropriate conditional to decide between `Kanye` and `Biden`

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 1</b></font>
</summary>
<p>
<ul>
    <li>Increment the value by <code>likelihoods.get()</code>, with parameters <code>word</code> and <code>0</code></li>
</ul>
</p>
</details>

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hint for Step 2</b></font>
</summary>
<p>
<ul>
    <li>In our case, we want to predict <code>"Kanye"</code> if <code>likelihood_sum > 0</code></li>
</ul>
</p>
</details>

In [23]:
def predict_tweet(tweet, likelihoods):
    # initialize the sum
    likelihood_sum = 0
    
    # iterate through each word in the cleaned tweet
    for word in process_tweet(tweet):
        # increment the sum
        likelihood_sum += likelihoods.get(word, 0)
    # get the prediction
    prediction = "Kanye" if likelihood_sum > 0 else "Biden"
    
    # return the prediction and likelihood in a tuple
    return (prediction, likelihood_sum)

In the next cell I made some very generic `Joe Biden` and `Kanye West` tweets which your model should hopefully be able to predict! 

In [24]:
biden_pred = predict_tweet("Joe Biden is my name, being the President is my game", likelihoods)
kanye_pred = predict_tweet("Kanye west jesus is my lord", likelihoods)

print(f"The model predicted the first tweet as {biden_pred[0]} with likelihood {biden_pred[1]}")
print(f"The model predicted the second tweet as {kanye_pred[0]} with likelihood {kanye_pred[1]}")

The model predicted the first tweet as Biden with likelihood -4.383901757385802
The model predicted the second tweet as Kanye with likelihood 8.5766997386725


I have created a function below that you can use to test your accuracy!

In [25]:
def test_accuracy(tweets, handles, likelihoods):
    handles[handles == "joebiden"] = "Biden"
    handles[handles == "kanyewest"] = "Kanye"
    
    predict_tweet_vectorized = np.vectorize(predict_tweet)
    predictions = predict_tweet_vectorized(tweets, likelihoods)
    
    return np.average(predictions[0] == handles)

In [26]:
train_accuracy = test_accuracy(train_data_x, train_data_y, likelihoods)
# use a different name so the test_accuracy() function is not overwritten
test_set_accuracy = test_accuracy(test_data_x, test_data_y, likelihoods)

print(f"The training accuracy is {train_accuracy}")
print(f"The testing accuracy is {test_set_accuracy}")

The training accuracy is 0.839572192513369
The testing accuracy is 0.7564102564102564
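A single accuracy number can hide the fact that the model does much better on one author than the other, especially if the two accounts contributed different numbers of tweets. A possible follow-up is to break accuracy down per author. The sketch below is one way to do that; the helper name `per_author_accuracy` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def per_author_accuracy(predictions, labels):
    """Fraction of correct predictions, computed separately for each true author."""
    correct = Counter()
    total = Counter()
    for predicted, actual in zip(predictions, labels):
        total[actual] += 1
        correct[actual] += (predicted == actual)  # True counts as 1
    return {author: correct[author] / total[author] for author in total}

toy_predictions = ["Kanye", "Biden", "Biden", "Biden"]
toy_labels = ["Kanye", "Kanye", "Biden", "Biden"]
print(per_author_accuracy(toy_predictions, toy_labels))
# {'Kanye': 0.5, 'Biden': 1.0}
```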


Feel free to try out your own tweets!

In [27]:
tweet = "Bless up"
print(f"The model thinks this was tweeted by {predict_tweet(tweet, likelihoods)[0]}")

The model thinks this was tweeted by Kanye


## Analysis

The predictions are decent but, if you recall from earlier in the semester, not as accurate as those of the `logistic regression` model. There are a couple of reasons for the mediocre performance. Some of the Twitter data I retrieved was difficult to parse because it contained links and images, so some tweets were probably not cleaned properly. That's on me

Additionally, this model makes some assumptions that are not always true. The main one is independence: each word is treated as independent of the others given who wrote the tweet, which real language rarely satisfies. On top of that, with something like Twitter data, it is entirely possible for tweets from the two accounts to look similar because both were reacting to the same events at the same time (like the election!)
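One concrete consequence of the independence assumption is that the model scores each word separately and just adds the scores, so word order never matters. The sketch below shows two made-up token lists with the same words in a different order receiving the same score. The likelihood values and the helper name `score` are assumptions for illustration.

```python
# Hypothetical likelihoods chosen so the arithmetic is exact
toy_likelihoods = {"jobs": -1.0, "not": 0.25, "good": 0.5}

def score(tokens, likelihoods):
    """Sum the per-word likelihoods, treating unseen words as 0."""
    return sum(likelihoods.get(word, 0) for word in tokens)

first = score(["jobs", "not", "good"], toy_likelihoods)
second = score(["good", "not", "jobs"], toy_likelihoods)
print(first, second)  # -0.25 -0.25
```

Any signal carried by phrasing or word order is invisible to this model, which is part of why it trails models that can weigh words jointly.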

One good thing about this model is that it is not as susceptible to outliers as the logistic model, so definitely keep it in mind if you want an easy and fast model to make decent predictions!