
# Assignment 2: Naive Bayes
Welcome to week two of this specialization. You will learn about Naive Bayes and use it for sentiment analysis on tweets: given a tweet, you will decide whether its sentiment is positive or negative. Specifically, you will:

* Train a Naive Bayes model on a sentiment analysis task
* Test your model
* Compute ratios of positive words to negative words
* Do some error analysis
* Predict on your own tweet

You may already be familiar with Naive Bayes and its justification in terms of conditional probabilities and independence.
* In this week's lectures and assignments, we use the ratio of the probabilities of positive and negative sentiment.
* This approach gives us simpler formulas for these 2-way classification tasks.
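
For reference, the ratio mentioned above is commonly written as shown below. The notation is an assumption made here, not taken from the notebook: $w_1, \dots, w_m$ are the words of a tweet, and $pos$ and $neg$ are the two sentiment classes.

$$
\frac{P(pos)}{P(neg)} \prod_{i=1}^{m} \frac{P(w_i \mid pos)}{P(w_i \mid neg)} > 1 \;\Longrightarrow\; \text{predict positive}
$$

A value below 1 points to negative sentiment. The product over words is where the "naive" independence assumption enters.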

Run the cell below to import some packages.
You may want to browse the documentation of unfamiliar libraries and functions.

In [1]:
import pdb
import numpy as np
import pandas as pd
import nltk
import string
import re

from nltk.corpus import twitter_samples
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [2]:
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [3]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

In [4]:
train_x[:3]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!']

In [5]:
test_x[:3]

['Bro:U wan cut hair anot,ur hair long Liao bo\nMe:since ord liao,take it easy lor treat as save $ leave it longer :)\nBro:LOL Sibei xialan',
 "@heyclaireee is back! thnx God!!! i'm so happy :)",
 '@BBCRadio3 thought it was my ears which were malfunctioning, thank goodness you cleared that one up with an apology :-)']

## Part 1: Process the Data

For any machine learning project, once you've gathered the data, the first step is to turn it into useful inputs for your model.

- **Remove noise (stop words)**: First remove words that tell you little about the content. These are common words such as "I", "you", "are", and "is", which carry little information about sentiment.

- **Remove symbols** such as stock market tickers, retweet markers, hyperlinks, and hashtag signs, because they say little about the sentiment.

- **Remove punctuation** from a tweet, so that a word is treated as the same word with or without punctuation, instead of treating "happy", "happy?", "happy!", "happy,", and "happy." as different words.

- **Stem the words**, so that each word is reduced to its stem (for example, "morning" becomes "morn").
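
The cleaning steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's implementation: `toy_clean` and `TOY_STOPWORDS` are hypothetical names, the stop-word list is deliberately tiny, whitespace splitting stands in for a real tweet tokenizer, and stemming is left out.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself relies on NLTK's English stop-word list.
TOY_STOPWORDS = {"i", "you", "are", "is", "am", "a", "the"}

def toy_clean(tweet):
    """Apply the cleaning steps with plain regular expressions (no stemming)."""
    tweet = re.sub(r'\$\w*', '', tweet)         # stock tickers such as $GE
    tweet = re.sub(r'^RT\s+', '', tweet)        # old-style retweet marker
    tweet = re.sub(r'https?://\S+', '', tweet)  # hyperlinks
    tweet = tweet.replace('#', '')              # keep the hashtag word, drop '#'
    # Strip punctuation from both ends of each whitespace-separated token.
    tokens = [tok.strip(string.punctuation).lower() for tok in tweet.split()]
    # Drop empty tokens and stop words.
    return [tok for tok in tokens if tok and tok not in TOY_STOPWORDS]

variants = ["happy", "happy?", "happy!", "happy,", "happy."]
print({v: toy_clean(v) for v in variants})  # every variant maps to ['happy']
print(toy_clean("RT I am happy! $GE #great https://example.com"))  # ['happy', 'great']
```

One limitation of this sketch: stripping punctuation from whitespace-split tokens also discards emoticons such as `:)`, which a tweet-aware tokenizer keeps as tokens.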


In [6]:
def process_tweet(tweet):
    '''
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    '''
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
            word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [7]:
custom_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"

process_tweet(custom_tweet)

['hello', 'great', 'day', ':)', 'good', 'morn']

## Part 1.1: Implementing your helper functions

To help train your Naive Bayes model, you will need to build a dictionary where the keys are (word, label) tuples and the values are the corresponding frequencies. Note that the labels we'll use here are 1 for positive and 0 for negative.

You will also implement a `lookup()` helper function that takes in the `freqs` dictionary, a word, and a label, and returns the number of times that tuple appears in the collection of tweets.

For example: given a list of tweets `["I am rather excited", "you are rather happy"]` and the label 1, the function will return a dictionary that contains the following key-value pairs:

{("rather", 1): 2, ("happi", 1): 1, ("excit", 1): 1}

- Notice how the same label 1 is assigned to every word in the given tweets.

- Notice that stop words such as "i", "am", "you", and "are" are not saved, since they were removed during cleaning (stop-word removal).

In [11]:
def count_tweets(result, tweets, ys):
  '''
  Input:
    result: a dictionary that will be used to map each pair to its frequency
    tweets: a list of tweets
    ys: a list of sentiment labels (0 or 1), one for each tweet

  Output:
    result: a dictionary mapping each (word, label) pair to its frequency
  '''
  for y, tweet in zip(ys, tweets):
    for word in process_tweet(tweet):
      pair = (word, y)
      result[pair] = result.get(pair, 0) + 1
  return result

In [12]:
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)

{('happi', 1): 1, ('sad', 0): 1, ('tire', 0): 2, ('trick', 0): 1}
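
The `lookup()` helper described earlier is not shown in this section. A minimal sketch, assuming only the signature given in the text (the `freqs` dictionary, a word, and a label), could look like this:

```python
def lookup(freqs, word, label):
    """Return how often (word, label) was counted, or 0 if the pair was never seen."""
    return freqs.get((word, label), 0)

# Frequencies copied from the count_tweets output above.
freqs = {('happi', 1): 1, ('sad', 0): 1, ('tire', 0): 2, ('trick', 0): 1}

print(lookup(freqs, 'tire', 0))   # 2
print(lookup(freqs, 'happi', 0))  # 0, because 'happi' only appears with label 1
```

Returning 0 for unseen pairs means the caller does not have to check first whether a word occurred with a given label.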