# Assignment 2: Naive Bayes
Welcome to week two of this specialization. You will learn about Naive Bayes. Concretely, you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will: 

* Train a naive bayes model on a sentiment analysis task
* Test using your model
* Compute ratios of positive words to negative words
* Do some error analysis
* Predict on your own tweet

You may already be familiar with Naive Bayes and its justification in terms of conditional probabilities and independence.
* In this week's lectures and assignments we use the ratio of the probabilities of the positive and negative sentiments.
* This approach gives us simpler formulas for these 2-way classification tasks.

Run the cell below to import some packages.
You may want to browse the documentation of unfamiliar libraries and functions.

In [1]:
import numpy as np
import pandas as pd

import pdb
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords, twitter_samples

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

import re
import string

from my_nlp_utils import preprocess_tweet, build_freq, extract_features, predict_tweet, test_model

If you are running this notebook on your local computer,
don't forget to download the twitter samples and stopwords from nltk.

```
nltk.download('stopwords')
nltk.download('twitter_samples')
```

In [2]:
if 'C:/Users/pulki/OneDrive/Documents/Jupyter/NLP - Deeplearning.ai/nltk_data' not in nltk.data.path:
    # add path from our local workspace containing pre-downloaded corpora files to nltk's data path
    nltk.data.path.append('C:/Users/pulki/OneDrive/Documents/Jupyter/NLP - Deeplearning.ai/nltk_data')

In [3]:
# get the sets of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# combine positive and negative tweets
tweets = all_positive_tweets+all_negative_tweets

# Create a labels array
labels = np.append(np.ones(len(all_positive_tweets)), np.zeros(len(all_negative_tweets)))

# split the data into two pieces, one for training and one for testing (validation set) 
x_train = all_positive_tweets[1000:] + all_negative_tweets[:-1000]
y_train = labels[1000:9000]
x_test = all_positive_tweets[:1000] + all_negative_tweets[-1000:]
y_test = np.append(labels[:1000], labels[-1000:])

# Part 1: Process the Data

For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.
- **Remove noise**: You will first want to remove noise from your data, that is, words that tell you little about the content. These include common words such as 'I', 'you', 'are', and 'is', which carry almost no information about the sentiment.
- We'll also remove stock market tickers, retweet symbols, hyperlinks, and the hash sign (#) of hashtags because they carry little information about the sentiment.
- You also want to remove all the punctuation from a tweet, so that "happy", "happy?", "happy!", "happy," and "happy." are treated as the same word instead of as different words.
- Finally, you want to use stemming to keep track of only one variation of each word. In other words, we'll treat "motivation", "motivated", and "motivate" similarly by grouping them within the same stem of "motiv-".

We have given you the function `preprocess_tweet()` (imported above from `my_nlp_utils`) that does this for you.
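To make the cleaning steps above concrete, here is a minimal sketch that uses only the standard library. It is not the notebook's helper: the name `simple_clean` and the tiny stopword list are assumptions for illustration, stemming is left out, and emoticons are dropped along with other punctuation, whereas the provided function keeps them.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer).
STOPWORDS = {"i", "you", "are", "is", "a", "the", "have", "am", "to", "and"}

def simple_clean(tweet):
    """Return lowercase tokens with common tweet noise removed (no stemming)."""
    tweet = re.sub(r"\$\w+", "", tweet)          # stock tickers such as $GE
    tweet = re.sub(r"^RT\s+", "", tweet)         # leading retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)   # hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)           # user handles
    tweet = tweet.replace("#", "")               # keep the hashtag word, drop '#'
    cleaned = []
    for token in tweet.lower().split():
        token = token.strip(string.punctuation)  # "happy!" and "happy," become "happy"
        if token and token not in STOPWORDS:
            cleaned.append(token)
    return cleaned

print(simple_clean("RT @user I am happy! Happy, happy. #great http://example.com"))
# ['happy', 'happy', 'happy', 'great']
```

The three spellings of "happy" collapse into one token, which is the point of stripping case and punctuation before counting.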

In [4]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

# print cleaned tweet
print(preprocess_tweet(custom_tweet))

['hello', 'great', 'day', ':)', 'good', 'morn', 'httpchapagain.com.np']


## Part 1.1 Implementing your helper functions

To help train your naive bayes model, let's build a table where the keys are words and the values are the corresponding positive and negative frequencies.  Note that the labels we'll use here are 1 for positive and 0 for negative.
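As a rough illustration of such a table, the sketch below counts words with a plain dictionary instead of a dataframe. The function name `count_words` and the toy data are assumptions; the `build_freq` helper used in this notebook may be implemented differently.

```python
from collections import defaultdict

def count_words(tokenized_tweets, labels):
    """Map each word to a [positive_count, negative_count] pair."""
    table = defaultdict(lambda: [0, 0])
    for tokens, label in zip(tokenized_tweets, labels):
        column = 0 if label == 1 else 1   # column 0 = positive, column 1 = negative
        for token in tokens:
            table[token][column] += 1
    return dict(table)

# Toy data: three already-processed tweets with their labels.
toy_tweets = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
toy_labels = [1, 0, 1]
print(count_words(toy_tweets, toy_labels))
# {'happi': [3, 0], 'day': [1, 1], 'sad': [0, 1]}
```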

In [32]:
# test the function below
freq_df = build_freq([preprocess_tweet(tweet) for tweet in x_train], y_train)
freq_df.head()

Unnamed: 0,vocab,pos,neg
0,(-:,2,0
1,(:,0,6
2,):,7,6
3,);,1,0
4,--->,1,0


In [14]:
(np.log(freq_df.pos/freq_df.pos.sum()+1)*1000).max()

96.99921088992485

In [33]:
freq_df.pos = np.log(freq_df.pos/freq_df.pos.sum()+1.0)*1000
freq_df.neg = np.log(freq_df.neg/freq_df.neg.sum()+1.0)*1000
freq_df.head()

Unnamed: 0,vocab,pos,neg
0,(-:,0.068868,0.0
1,(:,0.0,0.216474
2,):,0.241018,0.216474
3,);,0.034435,0.0
4,--->,0.034435,0.0


In [34]:
# Calling the function on the training set
train_features = [extract_features(preprocess_tweet(tweet), freq_df) for tweet in x_train]
train_features[:5]

[[1, 97.37794838278371, 0.10824528385622484],
 [1, 19.259081232765254, 3.3892461594286543],
 [1, 129.5604900299557, 12.82805422340109],
 [1, 127.28114515739703, 5.947078757620799],
 [1, 21.333716268304208, 0.18039795838442155]]

In [35]:
# Training a Naive Bayes Model
gnb = GaussianNB()
gnb.fit(train_features, y_train)

GaussianNB()

In [36]:
# Test with a tweet
print(f'Tweet: {x_test[1]}. \nPrediction: {gnb.predict([extract_features(preprocess_tweet(x_test[1]), freq_df)])}')

Tweet: @Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!. 
Prediction: [1.]


# Part 2: Train your model using Naive Bayes

Naive Bayes is an algorithm that can be used for sentiment analysis. It is quick to train and quick to make predictions with.

#### So how do you train a Naive Bayes classifier?
- The first part of training a naive bayes classifier is to identify the number of classes that you have.
- You will create a probability for each class.
$P(D_{pos})$ is the probability that the document is positive.
$P(D_{neg})$ is the probability that the document is negative.
Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.
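Equations 1 and 2 can be checked on a toy label list. The labels and the dictionary keys below are made up for illustration.

```python
# Toy labels: three positive tweets and one negative tweet (illustrative only).
train_labels = [1, 1, 1, 0]

D = len(train_labels)                            # total number of documents
D_pos = sum(1 for y in train_labels if y == 1)   # positive documents
D_neg = D - D_pos                                # negative documents

# Store the class probabilities in a dictionary, as suggested above.
class_prob = {"pos": D_pos / D, "neg": D_neg / D}
print(class_prob)
# {'pos': 0.75, 'neg': 0.25}
```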

#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$.  So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
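A small sketch of equation 3, assuming a hypothetical helper named `log_prior` that takes the two document counts:

```python
import math

def log_prior(d_pos, d_neg):
    """Equation 3: log(D_pos) - log(D_neg)."""
    return math.log(d_pos) - math.log(d_neg)

print(log_prior(4000, 4000))   # balanced classes give 0.0
print(log_prior(3000, 1000))   # log(3), approximately 1.0986
```

A positive logprior means positive tweets are more common in the training set, and a negative one means the opposite.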

#### Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.
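To see what the "+1" does, the sketch below applies equation 4 to a made-up five-word vocabulary. The counts and the helper name `word_probability` are assumptions for illustration.

```python
def word_probability(freq, n_class, vocab_size):
    """Equations 4 and 5: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical positive-class counts for a vocabulary of V = 5 words.
pos_counts = [3, 0, 2, 4, 1]
N_pos = sum(pos_counts)        # 10 positive word occurrences in total
V = len(pos_counts)

probs = [word_probability(c, N_pos, V) for c in pos_counts]
print(probs[1] > 0)            # True: the unseen word still gets a non-zero probability
print(round(sum(probs), 10))   # 1.0: the smoothed values still form a distribution
```

Without smoothing, the second word would have probability 0, and the log of a ratio involving it would be undefined.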

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
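Equation 6 can be sketched by combining equations 4 and 5. The counts below ($N_{pos} = N_{neg} = 1000$, $V = 100$) are hypothetical, as is the helper name `log_likelihood`.

```python
import math

def log_likelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """Equation 6: log of the ratio of the smoothed word probabilities."""
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos / p_w_neg)

# Hypothetical corpus sizes.
N_POS, N_NEG, V = 1000, 1000, 100

print(round(log_likelihood(9, 1, N_POS, N_NEG, V), 3))   # 1.609, mostly positive word
print(round(log_likelihood(1, 9, N_POS, N_NEG, V), 3))   # -1.609, mostly negative word
print(log_likelihood(4, 4, N_POS, N_NEG, V))             # 0.0, neutral word
```

The sign tells you which class the word favors, and the magnitude tells you how strongly.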

##### Create `freqs` dictionary
- Given your `count_tweets()` function, you can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label)
- The value is the number of times it has appeared.

We will use this dictionary in several parts of this assignment.
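For reference, one possible way to build such a dictionary is sketched below. The name `build_freqs` is an assumption, and `lookup` is a stand-in written to match the `lookup(freqs, word, label)` call mentioned later in this notebook; neither is the course's own implementation.

```python
def build_freqs(tokenized_tweets, labels):
    """Count how often each (word, label) pair occurs."""
    freqs = {}
    for tokens, label in zip(tokenized_tweets, labels):
        for token in tokens:
            key = (token, int(label))
            freqs[key] = freqs.get(key, 0) + 1
    return freqs

def lookup(freqs, word, label):
    """Return the count for (word, label), or 0 if the pair was never seen."""
    return freqs.get((word, label), 0)

toy_freqs = build_freqs([["happi", "day"], ["sad", "day"]], [1, 0])
print(toy_freqs)
# {('happi', 1): 1, ('day', 1): 1, ('sad', 0): 1, ('day', 0): 1}
print(lookup(toy_freqs, "happi", 0))
# 0
```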

In [9]:
# Build the freqs dictionary for later uses


#### Instructions
Given a freqs dictionary, `train_x` (a list of tweets) and a `train_y` (a list of labels for each tweet), implement a naive bayes classifier.

##### Calculate $V$
- You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

##### Calculate $N_{pos}$, $N_{neg}$, $V_{pos}$, and $V_{neg}$
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.
- Similarly, use `freqs` dictionary to compute the total number of **unique** positive words, $V_{pos}$, and total **unique** negative words $V_{neg}$.

##### Calculate $D$, $D_{pos}$, $D_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$

##### Calculate the logprior
- The logprior is $\log(D_{pos}) - \log(D_{neg})$

##### Calculate log likelihood
- Finally, you can iterate over each word in the vocabulary, use your `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using equations 4 & 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** We'll use a dictionary to store the log likelihoods for each word.  The key is the word, and the value is the log likelihood of that word.

- You can then compute the loglikelihood: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.
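The steps above can be put together roughly as follows. This is a sketch under stated assumptions, not the graded solution: the name `train_naive_bayes_sketch` is made up, it reads counts directly from the `freqs` dictionary instead of calling `lookup`, and it omits the `train_x` argument because the counts already live in `freqs`.

```python
import math

def train_naive_bayes_sketch(freqs, train_y):
    """Return (logprior, loglikelihood) from a {(word, label): count} dictionary."""
    vocab = {word for (word, _label) in freqs}
    V = len(vocab)

    # Total word occurrences per class.
    N_pos = sum(count for (_w, label), count in freqs.items() if label == 1)
    N_neg = sum(count for (_w, label), count in freqs.items() if label == 0)

    # Document counts per class, then equation 3.
    D_pos = sum(1 for y in train_y if y == 1)
    D_neg = sum(1 for y in train_y if y == 0)
    logprior = math.log(D_pos) - math.log(D_neg)

    loglikelihood = {}
    for word in vocab:
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (N_pos + V)   # equation 4
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (N_neg + V)   # equation 5
        loglikelihood[word] = math.log(p_w_pos / p_w_neg)       # equation 6
    return logprior, loglikelihood

# Toy counts (illustrative only).
toy_freqs = {("happi", 1): 3, ("day", 1): 1, ("day", 0): 1, ("sad", 0): 2}
toy_logprior, toy_loglikelihood = train_naive_bayes_sketch(toy_freqs, [1, 1, 0, 0])
print(toy_logprior)                  # 0.0 because the toy labels are balanced
print(len(toy_loglikelihood))        # 3 words in the vocabulary
```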

In [10]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of your Naive Bayes equation. (equation 6 above)
    '''
    

In [None]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

**Expected Output**:

0.0

9089

# Part 3: Test your naive bayes

Now that we have the `logprior` and `loglikelihood`, we can test the naive bayes function by making predictions on some tweets!

#### Implement `naive_bayes_predict`
**Instructions**:
Implement the `naive_bayes_predict` function to make predictions on tweets.
* The function takes in the `tweet`, `logprior`, `loglikelihood`.
* It returns a score $p$ (a sum of log ratios, not a probability): a value above 0 indicates positive sentiment and a value below 0 indicates negative sentiment.
* For each tweet, sum up loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ p = logprior + \sum_i^N (loglikelihood_i)$$

#### Note
Note we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets).  This means that the ratio of positive to negative is 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.
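The scoring rule can be sketched as below. The name `naive_bayes_predict_sketch` and the toy log likelihoods are assumptions, and the sketch takes a list of already-processed tokens, whereas the function you implement takes the raw tweet string.

```python
def naive_bayes_predict_sketch(tokens, logprior, loglikelihood):
    """Return logprior plus the log likelihood of every known token."""
    score = logprior
    for token in tokens:
        # Words missing from the dictionary contribute nothing to the score.
        score += loglikelihood.get(token, 0.0)
    return score

# Toy log likelihoods (illustrative only).
toy_loglikelihood = {"great": 2.0, "bad": -1.5}

print(naive_bayes_predict_sketch(["great", "great"], 0.0, toy_loglikelihood))   # 4.0
print(naive_bayes_predict_sketch(["bad"], 0.0, toy_loglikelihood))              # -1.5
print(naive_bayes_predict_sketch(["unknown"], 0.0, toy_loglikelihood))          # 0.0
```

Repeating a word adds its log likelihood again, which is why a tweet that repeats a positive word scores higher than one that uses it once.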

In [None]:
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the loglikelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)

    '''
    


In [None]:
# from my_nlp_utils import predict_tweet
# import sys
# del sys.modules['my_nlp_utils']
# from importlib import reload
# reload()

In [37]:
# Experiment with your own tweet.
my_tweet = 'She smiled.'
p = predict_tweet(my_tweet, gnb, freq_df)
print('The expected output is', p)

The expected output is [1.]


In [38]:
gnb.predict_log_proba([extract_features(preprocess_tweet(my_tweet), freq_df)])

array([[-2.86055203, -0.05894052]])

**Expected Output**:
- The expected output is around 1.57
- The sentiment is positive.

#### Implement test_naive_bayes
**Instructions**:
* Implement `test_naive_bayes` to check the accuracy of your predictions.
* The function takes in your `test_x`, `test_y`, `logprior`, and `loglikelihood`.
* It returns the accuracy of your model.
* First, use the `naive_bayes_predict` function to make predictions for each tweet in `test_x`.
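A minimal sketch of the accuracy computation, assuming a score above 0 is read as the positive class. The name `accuracy_sketch` and the toy data are made up, and the scoring is inlined so that the block runs on its own.

```python
def accuracy_sketch(token_lists, labels, logprior, loglikelihood):
    """Fraction of tweets whose predicted label matches the true label."""
    correct = 0
    for tokens, label in zip(token_lists, labels):
        score = logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)
        predicted = 1 if score > 0 else 0   # assumption: score > 0 means positive
        correct += int(predicted == label)
    return correct / len(labels)

# Toy data (illustrative only): the third tweet is scored 0.5 and misclassified.
toy_loglikelihood = {"great": 2.0, "bad": -1.5}
toy_tweets = [["great"], ["bad"], ["bad", "great"], ["great", "bad", "bad"]]
toy_labels = [1, 0, 0, 0]

print(accuracy_sketch(toy_tweets, toy_labels, 0.0, toy_loglikelihood))
# 0.75
```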

In [None]:
# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Input:
        test_x: A list of tweets
        test_y: the corresponding labels for the list of tweets
        logprior: the logprior
        loglikelihood: a dictionary with the loglikelihoods for each word
    Output:
        accuracy: (# of tweets classified correctly)/(total # of tweets)
    """
    

In [39]:
print("Naive Bayes accuracy = %0.4f" %
      (test_model(x_test, y_test, gnb, freq_df)))

Confusion Matrix: 
 [[932  68]
 [  1 999]]
              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96      1000
         1.0       0.94      1.00      0.97      1000

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000

Naive Bayes accuracy = 0.9655


**Expected Accuracy**:

0.9940

In [40]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = predict_tweet(tweet, gnb, freq_df)
    print(f'{tweet} -> {p}')

I am happy -> [1.]
I am bad -> [1.]
this movie should have been great. -> [1.]
great -> [1.]
great great -> [1.]
great great great -> [1.]
great great great great -> [1.]


**Expected Output**:
- I am happy -> 2.15
- I am bad -> -1.29
- this movie should have been great. -> 2.14
- great -> 2.14
- great great -> 4.28
- great great great -> 6.41
- great great great great -> 8.55

In [41]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
predict_tweet(my_tweet, gnb, freq_df)

array([0.])

# Part 4: Filter words by Ratio of positive to negative counts

- Some words have more positive counts than others, and can be considered "more positive".  Likewise, some words can be considered more negative than others.
- One way for us to define the level of positiveness or negativeness, without calculating the log likelihood, is to compare the positive to negative frequency of the word.
    - Note that we can also use the log likelihood calculations to compare relative positivity or negativity of words.
- We can calculate the ratio of positive to negative frequencies of a word.
- Once we're able to calculate these ratios, we can also filter a subset of words that have a minimum ratio of positivity / negativity or higher.
- Similarly, we can also filter a subset of words that have a maximum ratio of positivity / negativity or lower (words that are at least as negative, or even more negative than a given threshold).

#### Implement `get_ratio()`
- Given the `freqs` dictionary of words and a particular word, use `lookup(freqs,word,1)` to get the positive count of the word.
- Similarly, use the `lookup()` function to get the negative count of that word.
- Calculate the ratio of positive divided by negative counts

$$ ratio = \frac{\text{pos_words} + 1}{\text{neg_words} + 1} $$

Where pos_words and neg_words correspond to the frequency of the words in their respective classes. 
<table>
    <tr>
        <td>
            <b>Words</b>
        </td>
        <td>
        Positive word count
        </td>
         <td>
        Negative Word Count
        </td>
  </tr>
    <tr>
        <td>
        glad
        </td>
         <td>
        41
        </td>
    <td>
        2
        </td>
  </tr>
    <tr>
        <td>
        arriv
        </td>
         <td>
        57
        </td>
    <td>
        4
        </td>
  </tr>
    <tr>
        <td>
        :(
        </td>
         <td>
        1
        </td>
    <td>
        3663
        </td>
  </tr>
    <tr>
        <td>
        :-(
        </td>
         <td>
        0
        </td>
    <td>
        378
        </td>
  </tr>
</table>
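Applying the ratio formula to the counts in the table above gives a feel for the scale of the values. The helper name `smoothed_ratio` is an assumption.

```python
def smoothed_ratio(pos_count, neg_count):
    """ratio = (pos_words + 1) / (neg_words + 1)"""
    return (pos_count + 1) / (neg_count + 1)

# (positive count, negative count) pairs copied from the table above.
table_counts = {"glad": (41, 2), "arriv": (57, 4), ":(": (1, 3663), ":-(": (0, 378)}

for word, (pos, neg) in table_counts.items():
    print(f"{word}: {smoothed_ratio(pos, neg):.4f}")
# glad: 14.0000
# arriv: 11.6000
# :(: 0.0005
# :-(: 0.0026
```

The "+1" in the denominator also keeps the ratio defined for words that never appear in negative tweets.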

In [42]:
def get_ratio(word, df):
    '''
    Input:
        word: the word to look up
        df: a dataframe mapping each word to its positive and negative sentiment counts

    Output: a dictionary with keys 'pos', 'neg', and 'ratio'.
        Example: {'pos': 10, 'neg': 20, 'ratio': 0.5}
    '''
    ratio = df.loc[df.vocab == word, ['pos', 'neg']].to_dict(orient='list')
    
    ratio['pos'] = ratio['pos'][0]
    ratio['neg'] = ratio['neg'][0]
    ratio['ratio'] = ratio['pos']/ratio['neg']
    
    return ratio

In [43]:
print(get_ratio('happi', freq_df))

{'pos': 6.179216365958118, 'neg': 0.6492804037206948, 'ratio': 9.517022738632155}


#### Implement `get_words_by_threshold(freqs,label,threshold)`

* If we set the label to 1, then we'll look for all words whose positive/negative ratio is at least as high as the threshold.
* If we set the label to 0, then we'll look for all words whose positive/negative ratio is at most as low as the threshold.
* Use the `get_ratio()` function to get a dictionary containing the positive count, negative count, and the ratio of positive to negative counts.
* Add an entry to a dictionary, where the key is the word and the value is the dictionary `pos_neg_ratio` that is returned by the `get_ratio()` function.
An example key-value pair would have this structure:
```
{'happi':
    {'positive': 10, 'negative': 20, 'ratio': 0.5}
}
```
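The dictionary-based filtering described above could look roughly like this. The name `words_by_threshold_sketch` is an assumption, the counts are taken from the table earlier in this part, and the ratio is computed inline with the "+1" smoothing rather than through `get_ratio()`.

```python
def words_by_threshold_sketch(counts, label, threshold):
    """Select words by smoothed ratio: >= threshold for label 1, <= threshold for label 0."""
    selected = {}
    for word, (pos, neg) in counts.items():
        ratio = (pos + 1) / (neg + 1)
        keep = ratio >= threshold if label == 1 else ratio <= threshold
        if keep:
            selected[word] = {"positive": pos, "negative": neg, "ratio": ratio}
    return selected

# (positive count, negative count) pairs from the table earlier in this part.
table_counts = {"glad": (41, 2), "arriv": (57, 4), ":(": (1, 3663), ":-(": (0, 378)}

print(list(words_by_threshold_sketch(table_counts, 1, 10)))     # ['glad', 'arriv']
print(list(words_by_threshold_sketch(table_counts, 0, 0.05)))   # [':(', ':-(']
```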

In [49]:
# UNQ_C9 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_words_by_threshold(label, threshold, df):
    '''
    Input:
        label: 1 for positive, 0 for negative
        threshold: ratio that will be used as the cutoff for including a word in the returned dataframe
        df: a dataframe with the columns vocab, pos, and neg
    Output:
        the rows of df whose smoothed ratio (pos + 1) / (neg + 1) is
        at or above the threshold (label 1) or at or below it (label 0)
    '''
    # smoothed ratio of positive to negative values, one entry per word
    ratio = (df.pos + 1) / (df.neg + 1)
    if label:
        return df[ratio >= threshold]
    else:
        return df[ratio <= threshold]

In [50]:
# Test your function: find negative words at or below a threshold 0.05
get_words_by_threshold(0, 0.05, freq_df)

Unnamed: 0,vocab,pos,neg
441,:(,0.034435,124.583527


In [46]:
# Test your function; find positive words at or above a threshold 10
get_words_by_threshold(1, 10, freq_df)

Unnamed: 0,vocab,pos,neg
0,(-:,0.068868,0.0
3,);,0.034435,0.0
4,--->,0.034435,0.0
5,-->,0.034435,0.0
6,->,0.034435,0.0
...,...,...,...
10870,🙆,0.034435,0.0
10871,🙌,0.034435,0.0
10875,🚮,0.034435,0.0
10876,🚲,0.068868,0.0


Notice the difference between the positive and negative ratios. Emoticons like :( and words like 'me' tend to have a negative connotation. Other words like 'glad', 'community', and 'arrives' tend to be found in the positive tweets.

# Part 5: Error Analysis

In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Were there any assumptions made by the naive bayes model?

In [47]:
# Some error analysis done for you
print('Label Predicted Tweet')
for x,y in zip(x_test, y_test):
    y_hat = predict_tweet(x, gnb, freq_df)

    if np.abs(y - (y_hat > 0.5)) > 0:
        print('THE TWEET IS:', x)
        print('THE PROCESSED TWEET IS:', preprocess_tweet(x))
        print('%d\t%0.2f\t%s' % (y, y_hat, ' '.join(preprocess_tweet(x)).encode('ascii', 'ignore')))

Label Predicted Tweet
THE TWEET IS: Remember that one time I didn't go to flume/kaytranada/alunageorge even though I had tickets? I still want to kms. : ) : )
THE PROCESSED TWEET IS: ['rememb', 'one', 'time', 'go', 'flume', 'kaytranada', 'alunageorg', 'even', 'though', 'ticket', 'still', 'want', 'km']
1	0.00	b'rememb one time go flume kaytranada alunageorg even though ticket still want km'
THE TWEET IS: It's really hot :-(
THE PROCESSED TWEET IS: ['realli', 'hot', ':-(']
0	1.00	b'realli hot :-('
THE TWEET IS: @Mjbulanhagui13 agh, sorry :-(
THE PROCESSED TWEET IS: ['agh', 'sorri', ':-(']
0	1.00	b'agh sorri :-('
THE TWEET IS: Alone :-( :'( :-\
THE PROCESSED TWEET IS: ['alon', ':-(', ":'(", ':-\\']
0	1.00	b"alon :-( :'( :-\\"
THE TWEET IS: @baexrv pcy mine :-(
THE PROCESSED TWEET IS: ['pci', 'mine', ':-(']
0	1.00	b'pci mine :-('
THE TWEET IS: Get in the bin, OSX/Chrome/Voiceover &gt;:( http://t.co/0bcvA6YjWu
THE PROCESSED TWEET IS: ['get', 'bin', 'osx', 'chrome', 'voiceov', '>:(', 'httpt.

0	1.00	b'cheer :-('
THE TWEET IS: worried :-(((((((((((
THE PROCESSED TWEET IS: ['worri', ':-(']
0	1.00	b'worri :-('
THE TWEET IS: When Jessica calls and quits on power abs at 5:15 :-(
THE PROCESSED TWEET IS: ['jessica', 'call', 'quit', 'power', 'ab', '5:15', ':-(']
0	1.00	b'jessica call quit power ab 5:15 :-('
THE TWEET IS: my beloved grandmother : ( https://t.co/wt4oXq5xCf
THE PROCESSED TWEET IS: ['belov', 'grandmoth', 'httpst.co/wt4oxq5xcf']
0	1.00	b'belov grandmoth httpst.co/wt4oxq5xcf'
THE TWEET IS: when you don't have enough time to listen to all your artists' music :-(
THE PROCESSED TWEET IS: ['enough', 'time', 'listen', 'artist', 'music', ':-(']
0	1.00	b'enough time listen artist music :-('
THE TWEET IS: @hyungwons_ tELL HIM TO PLS EAT MORE :-(((
THE PROCESSED TWEET IS: ['tell', 'pl', 'eat', ':-(']
0	1.00	b'tell pl eat :-('
THE TWEET IS: @CHEDA_KHAN Thats life. I get calls from people I havent seen in 20 years and its always favours : (
THE PROCESSED TWEET IS: ['that', 'life', 

# Part 6: Predict with your own tweet

In this part you can predict the sentiment of your own tweet.

In [48]:
# Test with your own tweet - feel free to modify `my_tweet`
my_tweet = 'I am happy because I am learning :)'

p = predict_tweet(my_tweet, gnb, freq_df)
print(p)

[1.]


Congratulations on completing this assignment. See you next week!