# Sentiment Analysis: Naive Bayes

We will implement Naive Bayes for sentiment analysis on tweets. Given a tweet, we will decide whether it has a positive or a negative sentiment.


Let's first download the necessary datasets.
- ``twitter_samples``: Check out the documentation for the [``twitter_samples`` dataset](http://www.nltk.org/howto/twitter.html).
- ``stopwords``

Uncomment the next cell if you have not downloaded these datasets.


In [1]:
# import nltk
# nltk.download('twitter_samples')
# nltk.download('stopwords')

In [2]:
import re
import string

import numpy as np

from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

## Prepare the data

The ``twitter_samples`` contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.
- If we used all three datasets, we would introduce duplicates of the positive tweets and negative tweets.
- We will select just the five thousand positive tweets and five thousand negative tweets.


In [3]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Train test split: 20% will be in the test set, and 80% in the training set.


In [4]:
# split the data into two pieces, one for training and one for testing (validation set) 
train_pos = all_positive_tweets[:4000]
test_pos = all_positive_tweets[4000:]
train_neg = all_negative_tweets[:4000]
test_neg = all_negative_tweets[4000:]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

Create the numpy array of positive labels and negative labels.


In [5]:
# combine positive and negative labels
train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

# Print the shapes of the train and test label arrays
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000,)
test_y.shape = (2000,)


## Preprocessing

Data preprocessing is a critical step in any machine learning project. It involves cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, preprocessing consists of the following tasks:
- Tokenizing the string
- Lowercasing
- Removing stop words and punctuation
- Stemming

Since we have a Twitter dataset, we'd like to remove some substrings commonly used on the platform like the hashtag, retweet marks, and hyperlinks.
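The Twitter-specific cleanup can be tried on its own before tokenizing. The following is a minimal sketch using only the standard library; the helper name `strip_twitter_markup` and the sample tweet are illustrative assumptions, while the patterns are the same ones used in `process_tweet`. Handles such as `@user` are left in place here because the full function removes them in the tokenizer (`strip_handles=True`).

```python
import re

def strip_twitter_markup(tweet):
    """Hypothetical helper: apply only the regex cleanup steps."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # drop stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # drop a leading old-style "RT"
    # drop hyperlinks (this pattern also removes any text after the link on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)                     # keep the hashtag word, drop only '#'
    return tweet.strip()

example = "RT @user: Loving the new #NLP course https://example.com $GE"
print(strip_twitter_markup(example))
# @user: Loving the new NLP course
```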


In [6]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and     # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)        # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [7]:
# test the function below
print('\033[0mAn example of a positive tweet: \n\033[34m', train_x[0])
print('\033[0m\nAn example of the processed version of the tweet: \n\033[32m', process_tweet(train_x[0]))
print('\033[0m')

An example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

An example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


Create the frequency dictionary ``freqs``. The key is the tuple ``(word, label)``, such as ``("happy",1)`` or ``("happy",0)``. The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.


In [8]:
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its frequency
    """
    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs

In [9]:
# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11340
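
To see the structure of the dictionary on a small scale, here is a minimal sketch that counts `(word, label)` pairs for three already-processed toy tweets. The function name `build_freqs_from_tokens` and the toy data are assumptions made for illustration; it skips `process_tweet` so it runs without NLTK.

```python
from collections import Counter

def build_freqs_from_tokens(token_lists, ys):
    """Hypothetical variant of build_freqs that takes already-processed tweets."""
    freqs = Counter()
    for tokens, y in zip(token_lists, ys):
        for word in tokens:
            freqs[(word, y)] += 1   # one count per occurrence of (word, label)
    return dict(freqs)

toy_tokens = [["happi", "day", ":)"], ["sad", "day", ":("], ["happi", "happi"]]
toy_labels = [1, 0, 1]
toy_freqs = build_freqs_from_tokens(toy_tokens, toy_labels)

print(toy_freqs[("happi", 1)])          # 3
print(toy_freqs[("day", 0)])            # 1
print(toy_freqs.get(("happi", 0), 0))   # 0 (never seen with a negative label)
```

In the notebook, `train_y` holds floats, so the stored keys look like `('happi', 1.0)`; lookups with the integer `1` still succeed because equal numbers compare equal and hash identically in Python.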


## Training Naive Bayes model

Naive Bayes is an algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.

#### How to train a Naive Bayes classifier?

- The first part of training a Naive Bayes classifier is to identify the number of classes that we have.
- We will create a probability for each class:
  - $P(D_{pos})$ is the probability that the document is positive.
  - $P(D_{neg})$ is the probability that the document is negative.

We use the following formulas and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$. So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
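
As a quick numeric illustration of equation (3), the sketch below computes the logprior for a balanced split and for an imbalanced one. The helper name `log_prior` and the 3000/1000 split are assumptions chosen for the example.

```python
import math

def log_prior(d_pos, d_neg):
    """Equation (3): log(D_pos) - log(D_neg)."""
    return math.log(d_pos) - math.log(d_neg)

print(log_prior(4000, 4000))   # 0.0, balanced classes
print(log_prior(3000, 1000))   # about 1.0986, i.e. log(3): positives are three times as common
```

A positive logprior shifts every prediction toward the positive class, and a negative one shifts it toward the negative class.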

#### Positive and Negative Probability of a Word

To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

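Notice that we add the "+1" in the numerator for additive smoothing; the matching "$+V$" in the denominator keeps the smoothed probabilities summing to 1 over the vocabulary. This [Wikipedia article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.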
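
The effect of smoothing is easiest to see with small numbers. Below is a minimal sketch of equations (4) and (5) applied to one class; the helper name `smoothed_prob` and the toy counts are assumptions for illustration only.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Additive (Laplace) smoothing as in equations (4) and (5)."""
    return (freq + 1) / (n_class + vocab_size)

# Toy positive-class counts for a 5-word vocabulary (illustrative numbers).
pos_counts = {"happi": 3, "day": 2, ":)": 5, "sad": 0, ":(": 0}
n_pos = sum(pos_counts.values())   # N_pos = 10
vocab_size = len(pos_counts)       # V = 5

probs = {w: smoothed_prob(c, n_pos, vocab_size) for w, c in pos_counts.items()}
print(probs["sad"])          # 1/15: a word unseen in this class still gets a non-zero probability
print(sum(probs.values()))   # approximately 1.0
```

Without the "+1", a word that never appears in one class would have probability 0, and the log of the ratio in the next step would be undefined.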

#### Log likelihood

To compute the loglikelihood of that same word, we can use the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
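
The sign of the loglikelihood tells us which class a word leans toward. The sketch below evaluates equation (6) for three toy words; the function name `word_loglikelihood` and the totals (10 positive words, 10 negative words, 5 unique words) are assumptions made for the example.

```python
import math

def word_loglikelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """Equation (6), using the smoothed probabilities from equations (4) and (5)."""
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos / p_w_neg)

print(word_loglikelihood(3, 0, 10, 10, 5) > 0)   # True: seen only in positive tweets
print(word_loglikelihood(0, 3, 10, 10, 5) < 0)   # True: seen only in negative tweets
print(word_loglikelihood(2, 2, 10, 10, 5))       # 0.0: equally common in both classes
```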


In [10]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of your Naive Bayes equation. (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1] > 0:
            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freqs[pair]
        # else, the label is negative
        else:
            # increment the number of negative words by the count for this (word, label) pair
            N_neg += freqs[pair]

    D = len(train_y)        # the number of documents
    D_pos = np.sum(train_y) # the number of positive documents
    D_neg = D - D_pos       # the number of negative documents

    # logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # the positive and negative frequency of the word
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)

        # the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1.) / (N_pos + V)
        p_w_neg = (freq_neg + 1.) / (N_neg + V)

        # the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    return logprior, loglikelihood

In [11]:
# train the model
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
# the `logprior` is expected to be `zero` since the number of positive and negative tweets are equal
print(logprior)
print(len(loglikelihood))

0.0
9085


## Testing Naive Bayes model

Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes function by making predictions on some tweets!

Let's implement the `naive_bayes_predict` function to make predictions on tweets.
- The function takes in the `tweet`, `logprior`, `loglikelihood`.
- It returns a log-odds score rather than a probability: a value greater than 0 indicates the positive class, and any other value the negative class.
- For each tweet, sum up the loglikelihoods of each word in the tweet.
- Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i $$

**Note** that we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets). This means that the ratio of positive to negative is 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.


In [12]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the loglikelihoods of each word in the tweet
           (if found in the dictionary) + logprior (a number)

    '''
    # process the tweet to get a list of words
    word_l = process_tweet(tweet)

    # initialize probability to the logprior
    p = logprior

    for word in word_l:
        # check if the word exists in the loglikelihood dictionary
        if word in loglikelihood:
            # add the log likelihood of that word to the probability
            p += loglikelihood[word]

    return p
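
To trace the sum by hand, the following sketch scores already-processed token lists against a made-up loglikelihood dictionary. The helper name `score_tokens` and the values in `toy_ll` are assumptions for illustration; words missing from the dictionary contribute nothing, as in `naive_bayes_predict`.

```python
def score_tokens(tokens, logprior, loglikelihood):
    """Hypothetical helper: logprior plus the loglikelihood of each known token."""
    return logprior + sum(loglikelihood.get(word, 0.0) for word in tokens)

toy_ll = {"happi": 2.0, "sad": -1.5, "day": 0.25}

print(score_tokens(["happi", "day", "unseen"], 0.0, toy_ll))   # 2.25 -> positive
print(score_tokens(["sad", "day"], 0.0, toy_ll))               # -1.25 -> negative
```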

Now we write a function, ``naive_bayes_evaluate``, that takes the test data,
the ``logprior``, and the ``loglikelihood``,
and calculates the accuracy of our Naive Bayes model.


In [13]:
def naive_bayes_evaluate(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    accuracy = 0  # return this properly
    y_hats = []   # the list for storing predictions
    
    for tweet in test_x:
        # if the prediction is > 0
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            # the predicted class is 1
            y_hat_i = 1
        else:
            # otherwise the predicted class is 0
            y_hat_i = 0

        # append the predicted class to the list y_hats
        y_hats.append(y_hat_i)

    # error is the average of the absolute values of the differences between y_hats and test_y
    error = np.sum(np.abs(np.array(y_hats) - test_y)) / float(len(test_y))
    # Accuracy is 1 minus the error
    accuracy = 1 - error

    return accuracy

In [14]:
accuracy = naive_bayes_evaluate(test_x, test_y, logprior, loglikelihood)
print(f"Naive Bayes model's accuracy = {accuracy:.4f}")

Naive Bayes model's accuracy = 0.9940
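
For 0/1 labels, "1 minus the mean absolute error" is the same quantity as the fraction of matching predictions. The sketch below checks this on a toy example in plain Python; both function names and the toy labels are assumptions for illustration.

```python
def accuracy_from_matches(y_hats, ys):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for y_hat, y in zip(y_hats, ys) if y_hat == y)
    return matches / len(ys)

def accuracy_from_error(y_hats, ys):
    """1 minus the mean absolute difference, as in naive_bayes_evaluate."""
    error = sum(abs(y_hat - y) for y_hat, y in zip(y_hats, ys)) / len(ys)
    return 1 - error

toy_preds = [1, 0, 1, 1]
toy_truth = [1, 0, 0, 1]
print(accuracy_from_matches(toy_preds, toy_truth))   # 0.75
print(accuracy_from_error(toy_preds, toy_truth))     # 0.75
```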


## Error Analysis

In this part we will look at some tweets that our model misclassified. What kinds of tweets does the model get wrong?
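
One detail helps when reading the listing: a tweet whose tokens are all removed by preprocessing, or are all unseen in training, contributes nothing to the sum, so its score equals the logprior (0.0 here). Because the decision rule is `score > 0`, such a tweet is assigned to the negative class. A minimal sketch of that rule, with a hypothetical helper name and made-up loglikelihood values:

```python
def predict_label(tokens, logprior, loglikelihood):
    """Hypothetical helper: return (predicted class, score) for a processed tweet."""
    score = logprior + sum(loglikelihood.get(word, 0.0) for word in tokens)
    return (1 if score > 0 else 0), score

toy_ll = {"happi": 2.0, "sad": -1.5}

print(predict_label([], 0.0, toy_ll))          # (0, 0.0): a tie at zero falls to the negative class
print(predict_label(["happi"], 0.0, toy_ll))   # (1, 2.0)
```

This likely accounts for the rows in the listing with an empty processed tweet (`b''`), a score of 0.00, and a true label of 1.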


In [15]:
# Some error analysis
print('Truth Predicted Tweet\n')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    
    if y != (np.sign(y_hat) > 0):
        # print('\033[0mTHE TWEET IS: \033[34m', x)
        # print('\033[0mTHE PROCESSED TWEET IS: \033[32m', process_tweet(x))
        print('\033[31m%d\t%0.2f\t\033[32m%s' % (y, y_hat, ' '.join(
            process_tweet(x)).encode('ascii', 'ignore')))

Truth Predicted Tweet

1	0.00	b''
1	-1.50	b'truli later move know queen bee upward bound movingonup'
1	-0.89	b'new report talk burn calori cold work harder warm feel better weather :p'
1	-0.42	b'harri niall 94 harri born ik stupid wanna chang :D'
1	0.00	b''
1	0.00	b''
1	-0.94	b'park get sunlight'
1	-0.40	b'uff itna miss karhi thi ap :p'
0	0.74	b'hello info possibl interest jonatha close join beti :( great'
0	1.58	b'u prob fun david'
0	1.39	b'pat jay'
0	0.02	b'whatev stil l young >:-('


## Filter words by Ratio of positive to negative counts

Some words have more positive counts than others, and can be considered "more positive".  Likewise, some words can be considered more negative than others.

One way for us to define the level of positiveness or negativeness, without calculating the log likelihood, is to compare the positive to negative frequency of the word.
- Note that we can also use the log likelihood calculations to compare relative positivity or negativity of words.

We can calculate the ratio of positive to negative frequencies of a word.
- Once we're able to calculate these ratios, we can also filter a subset of words that have a minimum ratio of positivity / negativity or higher.
- Similarly, we can also filter a subset of words that have a maximum ratio of positivity / negativity or lower (words that are at least as negative, or even more negative than a given threshold).

Given the `freqs` dictionary of words and a particular word, use ``freqs.get((word, 1), 0)`` to get the positive count of the word. Similarly, we can use ``freqs.get((word, 0), 0)`` to get the negative count of that word. Then, we can calculate the ratio of positive divided by negative counts:

$$ ratio = \frac{\text{pos_words} + 1}{\text{neg_words} + 1} $$

Where ``pos_words`` and ``neg_words`` correspond to the frequency of the words in their respective classes.


In [16]:
def get_ratio(freqs, word):
    '''
    Input:
        freqs: dictionary containing the words
        word: string to lookup

    Output: a dictionary with keys 'positive', 'negative', and 'ratio'.
            Example: {'positive': 10, 'negative': 20, 'ratio': 0.5}
    '''
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    
    # the positive counts for the word (denoted by the integer 1)
    pos_neg_ratio['positive'] = freqs.get((word, 1), 0)

    # the negative counts for the word (denoted by integer 0)
    pos_neg_ratio['negative'] = freqs.get((word, 0), 0)

    # the ratio of positive to negative counts for the word
    pos_neg_ratio['ratio'] = (1.0 + pos_neg_ratio['positive']) / (1.0 + pos_neg_ratio['negative'])
    
    return pos_neg_ratio

In [17]:
get_ratio(freqs, 'happi')

{'positive': 161, 'negative': 18, 'ratio': 8.526315789473685}
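
As a check on this output, plugging the counts for `'happi'` into the ratio formula gives:

$$ ratio = \frac{161 + 1}{18 + 1} = \frac{162}{19} \approx 8.526 $$

So the stem appears roughly eight and a half times as often with a positive label as with a negative one, after adding 1 to each count.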

In [18]:
def get_words_by_threshold(freqs, label, threshold):
    '''
    Input:
        freqs: dictionary of words
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including 
                   a word in the returned dictionary
    Output:
        word_list: dictionary containing the word and information on 
                  its positive count, negative count, and ratio of 
                  positive to negative counts.
                  example of a key value pair:
                  {'happi':
                      {'positive': 10, 'negative': 20, 'ratio': 0.5}
                  }
    '''
    word_list = {}

    for key in freqs.keys():
        word, _ = key
        # get the positive/negative ratio for a word
        pos_neg_ratio = get_ratio(freqs, word)
        # if the label is 1 and the ratio is greater than or equal to the threshold...
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:
            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio
        # If the label is 0 and the pos_neg_ratio is less than or equal to the threshold...
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:
            # Add the pos_neg_ratio to the dictionary
            word_list[word] = pos_neg_ratio
        # otherwise, do not include this word in the list (do nothing)

    return word_list

In [19]:
# negative words at or below a threshold
get_words_by_threshold(freqs, label=0, threshold=0.05)

{':(': {'positive': 1, 'negative': 3663, 'ratio': 0.0005458515283842794},
 ':-(': {'positive': 0, 'negative': 378, 'ratio': 0.002638522427440633},
 'zayniscomingbackonjuli': {'positive': 0, 'negative': 19, 'ratio': 0.05},
 '26': {'positive': 0, 'negative': 20, 'ratio': 0.047619047619047616},
 '>:(': {'positive': 0, 'negative': 43, 'ratio': 0.022727272727272728},
 'lost': {'positive': 0, 'negative': 19, 'ratio': 0.05},
 '♛': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
 '》': {'positive': 0, 'negative': 210, 'ratio': 0.004739336492890996},
 'beli̇ev': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'wi̇ll': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'justi̇n': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'ｓｅｅ': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776},
 'ｍｅ': {'positive': 0, 'negative': 35, 'ratio': 0.027777777777777776}}

In [20]:
# positive words at or above a threshold
get_words_by_threshold(freqs, label=1, threshold=10)

{'followfriday': {'positive': 23, 'negative': 0, 'ratio': 24.0},
 'commun': {'positive': 27, 'negative': 1, 'ratio': 14.0},
 ':)': {'positive': 2847, 'negative': 2, 'ratio': 949.3333333333334},
 'flipkartfashionfriday': {'positive': 16, 'negative': 0, 'ratio': 17.0},
 ':D': {'positive': 498, 'negative': 0, 'ratio': 499.0},
 ':p': {'positive': 103, 'negative': 0, 'ratio': 104.0},
 'influenc': {'positive': 16, 'negative': 0, 'ratio': 17.0},
 ':-)': {'positive': 543, 'negative': 0, 'ratio': 544.0},
 "here'": {'positive': 20, 'negative': 0, 'ratio': 21.0},
 'youth': {'positive': 14, 'negative': 0, 'ratio': 15.0},
 'bam': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 'warsaw': {'positive': 44, 'negative': 0, 'ratio': 45.0},
 'shout': {'positive': 11, 'negative': 0, 'ratio': 12.0},
 ';)': {'positive': 22, 'negative': 0, 'ratio': 23.0},
 'stat': {'positive': 51, 'negative': 0, 'ratio': 52.0},
 'arriv': {'positive': 57, 'negative': 4, 'ratio': 11.6},
 'via': {'positive': 60, 'negative': 1, 

Notice the difference between the positive and negative ratios. Emoticons like ':(' and tokens like 'ｍｅ' appear almost exclusively in the negative tweets, whereas stems like 'glad', 'commun' (community), and 'arriv' (arrives) tend to be found in the positive tweets.
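
The same ratio can also be used to rank words from most positive to most negative. The sketch below does this for three entries whose counts are taken from the outputs above; the helper name `ratio` and the choice of words are assumptions for illustration.

```python
def ratio(pos, neg):
    """Smoothed ratio of positive to negative counts."""
    return (pos + 1) / (neg + 1)

# (positive count, negative count) pairs copied from the outputs above
counts = {":(": (1, 3663), "happi": (161, 18), ":)": (2847, 2)}

ranked = sorted(counts, key=lambda word: ratio(*counts[word]), reverse=True)
print(ranked)   # [':)', 'happi', ':(']
```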
