<a href="https://colab.research.google.com/github/martin-fabbri/advanced-react-components/blob/master/deeplearning.ai/nlp/c1_w2_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Naive Bayes
Welcome to week two of this specialization. You will learn about Naive Bayes. Concretely, you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will: 

* Train a naive bayes model on a sentiment analysis task
* Test using your model
* Compute ratios of positive words to negative words
* Do some error analysis
* Predict on your own tweet

You may already be familiar with Naive Bayes and its justification in terms of conditional probabilities and independence.
* In this week's lectures and assignments we used the ratio of probabilities between positive and negative sentiments.
* This approach gives us simpler formulas for these 2-way classification tasks.

Load the cell below to import some packages.
You  may want to browse the documentation of unfamiliar libraries and functions.

In [1]:
import pdb
import numpy as np
import pandas as pd
import nltk
import string
import re

from nltk.corpus import twitter_samples
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [2]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [3]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

In [4]:
train_x[:3]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!']

In [5]:
test_x[:3]

['Bro:U wan cut hair anot,ur hair long Liao bo\nMe:since ord liao,take it easy lor treat as save $ leave it longer :)\nBro:LOL Sibei xialan',
 "@heyclaireee is back! thnx God!!! i'm so happy :)",
 '@BBCRadio3 thought it was my ears which were malfunctioning, thank goodness you cleared that one up with an apology :-)']

## Part 1. Process the Data

For anu machine learning project, once you've gathered the data, the firts step is to make useful inputs to your model.

- **Remove noise (stop words?)**: You will first want to remove noise from your data --that is, *remove words* that don't tell you much about the content. These include all common words like "I, you, are, is, etc..." that would not give us enough information on the sentiment.

- **Remove symbols** such as stock market tickers, retweet symbols, hyperlink, and hastags because they cannot tell you a lot of information of the sentiment.

- **Remove punctuation** from a tweet. The reason for doing this is because we want to treat words with or without the punctuation as the same word, instead of treating "happy", "happy?", "happy!", "happy,", "happy." as different words.

- **Stemm the words**.


In [6]:
def process_tweet(tweet):
    '''
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    '''
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [7]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

process_tweet(custom_tweet)

['hello', 'great', 'day', ':)', 'good', 'morn']

### Part 1.1. Implementing your helper functions

To help train your naive bayes model, you will need to build a distionary hwre the keys are a (word, label) tuple and the values are the corresponding frequency. Note that the labels we'll use here are 1 for positive and 0 for negative.

You wil also implement a `lookup()` helper function that takes in the `freqs` dictionary, a word and a label to return the number of times that tuple appears in the collection of tweets.

For example: given a list of tweets `["I am rather excited", "you are rather happy]` and the label 1, the function will return a dictionary that contains the following key-value pairs:

{("raher", 1): 2, ("happi", 1): 1, ("excit", 1): 1}

- Notice how for each word in the given string, the same label 1 is assigned to each word.

- Notice how the words "i" and "am" are not saved, since they were removed as part of the cleaning process(stop words removal). 

In [8]:
def count_tweets(results, tweets, ys):
  '''
  Input:
    result: a dictionary that will be used to map each pair to its frequency 
    tweets: a list of tweets
    ys: a list correspoding mapping each pair to its frequency

  Output:
    result: a dictionary mapping each pair to its frequency 
  '''
  for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
      pair = (word, y)
      result[pair] = result.get(pair, 0) + 1
  return result

In [9]:
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('sad', 0): 1, ('tire', 0): 2, ('trick', 0): 1}

Expected Output: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

## Part 2. Train a Naive Bayes model

Naive bayes is an algorithm that could be used for sentiment analysis. It takes a short time to train and also a short prediction time.

How do you train a Naive Bayes classifier?

- The first step if to identify the number of classes.
- Create a probability for each class. $P(D_{pos})$ is the probability that the document is positive. $P(D_{neg})$ is the probability that the document is negative. Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$
$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where:

*   $D$ is the total number of documents, or tweets in this case.
*   $D_{pos}$ is the total number of positive tweets
*   $D_{neg}$ is the total number of negative tweets




#### Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior

$$\text{logprior} = log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = log \left( \frac{D_{pos}}{D_{neg}} \right)$$.

Note that $log(\frac{A}{B})$ is the same as $log(A) - log(B)$.  So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$

#### Positive and Negative Probability of a Word (Laplace Smoothing)
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.

#### Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$

##### Create `freqs` dictionary
- Given your `count_tweets()` function, you can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label)
- The value is the number of times it has appeared.

We will use this dictionary in several parts of this assignment.

In [10]:
# Build the freqs dictionary for later uses

freqs = count_tweets({}, train_x, train_y)
len(freqs)

11346

#### Instructions
Given a freqs dictionary, `train_x` (a list of tweets) and a `train_y` (a list of labels for each tweet), implement a naive bayes classifier.

##### Calculate $V$
- `V  is the number of unique words in the entire set of documents, for all classes, whether positive or negative.`

- You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).


In [11]:
vocab = set(word for word, _ in freqs) 
V= len(vocab)
V

9089

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

**Calculate $N_{pos}$ and $N_{neg}$**
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.


In [12]:
# calculate N_pos and N_neg
NEGATIVE = 0
POSITIVE = 1
N_pos = N_neg = 0
for word, sentiment in freqs:
    freq = freqs[(word, sentiment)]
    # if the label is positive (greater than zero)
    if sentiment == POSITIVE:
        # Increment the number of positive words by the count for this (word, label) pair
        N_pos += freq

    # else, the label is negative
    else:

        # increment the number of negative words by the count for this (word,label) pair
        N_neg += freq
N_neg, N_pos

(27044, 26846)

##### Calculate $D$, $D_{pos}$, $D_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$

In [13]:
# calculate D, the number of documents
D = len(train_y)
D

8000

In [14]:
# calculate D_pos, the number of possitive documents 
D_pos = sum(train_y)
D_pos

4000.0

In [15]:
# calculate D_neg, the number of negative documents 
D_neg = D - D_pos 
D_neg

4000.0

##### Calculate the logprior
- the logprior is $log(D_{pos}) - log(D_{neg})$

In [16]:
logprior = np.log(D_pos) - np.log(D_neg)
logprior

0.0

##### Calculate log likelihood
- Finally, you can iterate over each word in the vocabulary, use your `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using equations 4 & 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** We'll use a dictionary to store the log likelihoods for each word.  The key is the word, the value is the log likelihood of that word).

- You can then compute the loglikelihood: $log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$.

In [17]:
def lookup(freqs, word, label):
    '''
    Input:
        freqs: a dictionary with the frequency of each pair (or tuple)
        word: the word to look up
        label: the label corresponding to the word
    Output:
        n: the number of times the word with its corresponding label appears.
    ''' 
    return freqs.get((word, label), 0)     

In [24]:
test_freqs = {('sad', 0): 4,
          ('happy', 1): 12,
          ('oppressed', 0): 7}

assert lookup(test_freqs, 'happy', 1) == 12
assert lookup(test_freqs, 'happy', 0) == 0
assert lookup(test_freqs, 'does_not_exists', 1) == 0


$ P(W_{class}) = \frac{freq_{class} + 1}{N_{class} + V}$

In [21]:
loglikelihood = {}
# For each word in the vocabulary...
for word in vocab:
    # get the positive and negative frequency of the word
    freq_pos = lookup(freqs, word, POSITIVE)
    freq_neg = lookup(freqs, word, NEGATIVE)

    # calculate the probability that each word is positive, and negative
    p_w_pos = (freq_pos + 1) / (N_pos + V)
    p_w_neg = (freq_neg + 1) / (N_neg + V)

    # calculate the log likelihood of the word
    loglikelihood[word] = np.log(p_w_pos / p_w_neg)
loglikelihood

{'steroid': 0.005494824282227851,
 "tomorrow'": 0.005494824282227851,
 'waseem': 0.005494824282227851,
 'employ': 0.005494824282227851,
 'hop': 0.005494824282227851,
 'manifest': 0.005494824282227851,
 '. ..': 0.005494824282227851,
 '4am': 0.005494824282227851,
 'poootek': 0.005494824282227851,
 'drinkt': 0.005494824282227851,
 'donutsss': 0.005494824282227851,
 'anyth': 0.005494824282227851,
 'sin': 0.005494824282227851,
 'let': 0.005494824282227851,
 'jenni': 0.005494824282227851,
 'angle.nelson': 0.005494824282227851,
 'ja': 0.005494824282227851,
 'hahahahahaah': 0.005494824282227851,
 '737gold': 0.005494824282227851,
 'homegirl': 0.005494824282227851,
 'penacova': 0.005494824282227851,
 'subgam': 0.005494824282227851,
 'scroll': 0.005494824282227851,
 'mist': 0.005494824282227851,
 'cakehamp': 0.005494824282227851,
 'smug': 0.005494824282227851,
 '1tb': 0.005494824282227851,
 '334': 0.005494824282227851,
 'noseble': 0.005494824282227851,
 'umair': 0.005494824282227851,
 'hav': 0.00

In [22]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels correponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of you Naive bayes equation. (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # calculate V, the number of unique words in the vocabulary
    vocab = set(word for word, _ in freqs) 
    V= len(vocab)

    # calculate N_pos and N_neg
    NEGATIVE = 0
    POSITIVE = 1
    N_pos = N_neg = 0
    for word, sentiment in freqs:
        freq = freqs[(word, sentiment)]
        # if the label is positive (greater than zero)
        if sentiment == POSITIVE:
            # Increment the number of positive words by the count for this (word, label) pair
            N_pos += freq

        # else, the label is negative
        else:

            # increment the number of negative words by the count for this (word,label) pair
            N_neg += freq

    # Calculate D, the number of documents
    D = len(train_y)

    # Calculate D_pos, the number of positive documents (*hint: use sum(<np_array>))
    D_pos = sum(train_y)

    # Calculate D_neg, the number of negative documents (*hint: compute using D and D_pos)
    D_neg = D - D_pos

    # Calculate logprior
    logprior = np.log(D_pos) - np.log(D_neg)

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs, word, POSITIVE)
        freq_neg = lookup(freqs, word, NEGATIVE)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    ### END CODE HERE ###

    return logprior, loglikelihood

In [23]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# You do not have to input any code in this cell, but it is relevant to grading, so please do not change anything
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
3
