In this DataLab you will implement a Naive Bayes classifier as described in Chapter 4 of the book Speech and Language Processing.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import random
import re                                  
import string  

import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords          
from nltk.stem import PorterStemmer        
from nltk.tokenize import TweetTokenizer

In [3]:
nltk.download('stopwords')
stopwords_english = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [5]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [6]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>
The type of a tweet entry is:  <class 'str'>


In [7]:
# print positive in greeen
print('\033[92m' + all_positive_tweets[random.randint(0,5000)])

# print negative in red
print('\033[91m' + all_negative_tweets[random.randint(0,5000)])

[92m@Gurmeetramrahim #OurDaughtersOurPride dhan dhan satguru tera hi aasra...many congratulations Pita G...Keep them blessed as always :-)
[91m@GABRlEIIE not as much as my brother :(


**Tweet preprocessing**

Last week you learned how to use regular expressions to process tweets. Use the function `tweet_processor()` you created in the last DataLab here:

In [8]:
def tweet_processor(tweet):
    """
    Input:
        tweet: a string containing a tweet
    Output:
        processed_tweet: a list of token
        
    Processing steps:
    - Removes hyperlinks
    - Removes # sign
    - Tokenizes
    - Removes stopwords and punctuation
    - Stem tokens
        
    """
    # Remove URLs using regular expression
    processed_tweet = re.sub(r'http\S+|www\S+|https\S+', '', tweet)
    
    # Remove hashtags
    processed_tweet = re.sub(r'#([^\s]+)', r'\1', processed_tweet)
    
    # Tokenize the tweet using TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=False)
    processed_tweet = tokenizer.tokenize(processed_tweet)
    
    # Remove stopwords
    stopwords_english = set(stopwords.words('english'))
    processed_tweet = [word for word in processed_tweet if word not in stopwords_english]
    
    # Remove punctuation
    processed_tweet = [word for word in processed_tweet if word not in string.punctuation]
    
    # Stem the tokens
    stemmer = PorterStemmer()
    processed_tweet = [stemmer.stem(word) for word in processed_tweet]

    return processed_tweet

And sanity check if it works.
    
Example tweet:
    
`My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i`

Expected output:

`['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [9]:
example_tweet = ('My beautiful sunflowers on a sunny Friday morning off :)'
                 ' #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i')
print(example_tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


In [10]:
tweet_processor(example_tweet)

['beauti',
 'sunflow',
 'sunni',
 'friday',
 'morn',
 ':)',
 'sunflow',
 'favourit',
 'happi',
 'friday',
 '…']

Before going any further, let's split the dataset into training and test sets.

In [11]:
# 80% training 20% testing
positive_tweets_tr = all_positive_tweets[:4000]
positive_tweets_te = all_positive_tweets[4000:]

negative_tweets_tr = all_negative_tweets[:4000]
negative_tweets_te = all_negative_tweets[4000:]

**Task 1**

The function `tweet_processor()` expects a single tweet to process. But you have lists of tweets to process. Write a function called `tweet_processor_list()` that accept a list of strings (tweets) and returns a list of processed tweets. A processed tweet is a list of tokens. Therefore  `tweet_processor_list()` should return a list of lists.

The first two items in the `positive_tweets_tr` are:

```
['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!']
 ```
 
 the expected output of `tweet_processor_list()` is:
 
 ```
 [['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)'],
 ['hey',
  'jame',
  'odd',
  ':/',
  'pleas',
  'call',
  'contact',
  'centr',
  '02392441234',
  'abl',
  'assist',
  ':)',
  'mani',
  'thank']]
 
 ```

In [12]:
def tweet_processor_list(tweet_list):
    # YOUR CODE HERE #
    processed_tweet_list = []
    for tweet in tweet_list:
        processed_tweet = tweet_processor(tweet)
        processed_tweet_list.append(processed_tweet)
    return processed_tweet_list

In [13]:
positive_tweets_tr = tweet_processor_list(positive_tweets_tr)
positive_tweets_te = tweet_processor_list(positive_tweets_te)

negative_tweets_tr = tweet_processor_list(negative_tweets_tr)
negative_tweets_te = tweet_processor_list(negative_tweets_te)

**Task 2**

Now it is time to creative the _vocabulary_ as defined in Chapter 4, Section 4.2:

> vocabulary V consists of the union of all the word types in all classes

Combine all the tokens in `positive_tweets_tr` and `negative_tweets_tr` into one big list and get the unique tokens from this list.

Expected length of the vocabulary is `9085` unique tokens. Notice that if you use a different train/test split or different preprocessing this number will be different.

First 50 tokens in the vocabulary:

```
['(-:',
 '(:',
 '):',
 '--->',
 '-->',
 '->',
 '.\n.',
 '.\n.\n.',
 '. .',
 '. . .',
 '. ..',
 '. ...',
 '..',
 '...',
 '0',
 '0-100',
 '0-2',
 '0.001',
 '0.7',
 '00',
 '00128835',
 '009',
 '00962778381',
 '01282',
 '01482',
 '01:15',
 '01:16',
 '02079',
 '02392441234',
 '0272 3306',
 '0330 333 7234',
 '0345',
 '05.15',
 '07:02',
 '07:17',
 '07:24',
 '07:25',
 '07:32',
 '07:34',
 '08',
 '0878 0388',
 '08962464174',
 '0ne',
 '1',
 '1,300',
 '1,500',
 '1-0',
 '1.300',
 '1.8',
 '1/2']
```

In [14]:
# YOUR CODE HERE #
# Combine all tokens from positive_tweets_tr and negative_tweets_tr
all_tweets = positive_tweets_tr + negative_tweets_tr

# Flatten the list of lists into a single list
all_tokens = [token for tweet in all_tweets for token in tweet]

# Get unique tokens
vocabulary = list(set(all_tokens))

# Sort the vocabulary for consistency
vocabulary.sort()

# Print the length of the vocabulary
print("Length of vocabulary:", len(vocabulary))

Length of vocabulary: 14884


**Task 3**

In order to calculate the equation 4.12

$P(w_i|c)=count(w_i, c)/\Sigma_{w∈V} count(w, c)$

We first need to calculate $count(w_i, c)$ which is the number of times each token in the vocabulary occurs in tweets from class c. This is also called the word frequency table.

|$w_i$| count($w_i$, +) | count($w_i$, -) |
| ----------- | ----------- |----------- |
|(-:|1|0|
|(:|1|6|
|):|6|6|
|--->|1|0|
|happi|161|18|


First create a dictionary called `freq` where keys are tokens and values are lists containing positive and negative counts

```
{'(-:': [1, 0],
 '(:': [1, 6],
 ...
}
```

and convert it to a dataframe.

In [15]:
# Initialize an empty dictionary to store frequencies
freqs = {}

# Iterate over all tokens in the vocabulary
for token in vocabulary:
    # Initialize counts for positive and negative tweets
    positive_count = 0
    negative_count = 0
    
    # Iterate over all positive tweets and count occurrences of the token
    for tweet in positive_tweets_tr:
        if token in tweet:
            positive_count += 1
    
    # Iterate over all negative tweets and count occurrences of the token
    for tweet in negative_tweets_tr:
        if token in tweet:
            negative_count += 1
    
    # Add counts to the freqs dictionary
    freqs[token] = [positive_count, negative_count]

In [16]:
df = pd.DataFrame.from_dict(freqs, orient='index', columns=['count(w_i, +)', 'count(w_i, -)'])
df.head(10)

Unnamed: 0,"count(w_i, +)","count(w_i, -)"
#bbmme,1,0
#segalakatakata,1,0
(-:,1,0
(:,1,5
):,4,4
--->,1,0
-->,2,0
->,1,0
.\n.,0,1
.\n.\n.,1,0


**Task 4**

We can calculate the equation 4.12 now:

$P(w_i|c)=count(w_i, c)/\Sigma_{w∈V} count(w, c)$

The denominator $\Sigma_{w∈V} count(w, c)$ is simply sum of each column.

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |
| ----------- | ----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|
|(:|1|6|0.000037|0.000222|
|):|6|6|0.000224|0.000222|
|--->|1|0|0.000037|0.000000|
|happi|161|18|0.005998|0.000666|


In [17]:
# YOUR CODE HERE #
# Calculate the sum of each column
sum_positive = df['count(w_i, +)'].sum()
sum_negative = df['count(w_i, -)'].sum()

# Calculate the probabilities P(w_i|+) and P(w_i|-)
df['P(w_i|+)'] = df['count(w_i, +)'] / sum_positive
df['P(w_i|-)'] = df['count(w_i, -)'] / sum_negative

# Print the DataFrame
df

Unnamed: 0,"count(w_i, +)","count(w_i, -)",P(w_i|+),P(w_i|-)
#bbmme,1,0,0.000033,0.000000
#segalakatakata,1,0,0.000033,0.000000
(-:,1,0,0.000033,0.000000
(:,1,5,0.000033,0.000176
):,4,4,0.000131,0.000141
...,...,...,...,...
🚂,1,0,0.000033,0.000000
🚖,0,1,0.000000,0.000035
🚙,0,1,0.000000,0.000035
󾆖,0,1,0.000000,0.000035


**Task 5**

Apply Laplacian smoothing as described in equation 4.14

$P(w_i|c)=[count(w_i, c)+1]/[\Sigma_{w∈V} count(w, c)$+len(vocabulary)]

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |P(w_i\|+) smooth | P(w_i\|-) smooth |
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|0.000056|0.000028|
|(:|1|6|0.000037|0.000222|0.000056|0.000194|
|):|6|6|0.000224|0.000222|0.000195|0.000194|
|--->|1|0|0.000037|0.000000|0.000056|0.000028|
|happi|161|18|0.005998|0.000666|0.004509|0.000526|

In [18]:
# YOUR CODE HERE #
# Apply Laplacian smoothing
smooth_factor = len(vocabulary)  # length of vocabulary

# Add 1 to each count in the numerator
df['count(w_i, +)'] += 1
df['count(w_i, -)'] += 1

# Add the Laplacian smoothing factor to the denominator
denominator_smoothed_positive = sum_positive + smooth_factor
denominator_smoothed_negative = sum_negative + smooth_factor

# Recalculate the probabilities with Laplacian smoothing
df['P(w_i|+) smooth'] = df['count(w_i, +)'] / denominator_smoothed_positive
df['P(w_i|-) smooth'] = df['count(w_i, -)'] / denominator_smoothed_negative

# Print the DataFrame
df

Unnamed: 0,"count(w_i, +)","count(w_i, -)",P(w_i|+),P(w_i|-),P(w_i|+) smooth,P(w_i|-) smooth
#bbmme,2,1,0.000033,0.000000,0.000044,0.000023
#segalakatakata,2,1,0.000033,0.000000,0.000044,0.000023
(-:,2,1,0.000033,0.000000,0.000044,0.000023
(:,2,6,0.000033,0.000176,0.000044,0.000139
):,5,5,0.000131,0.000141,0.000110,0.000115
...,...,...,...,...,...,...
🚂,2,1,0.000033,0.000000,0.000044,0.000023
🚖,1,2,0.000000,0.000035,0.000022,0.000046
🚙,1,2,0.000000,0.000035,0.000022,0.000046
󾆖,1,2,0.000000,0.000035,0.000022,0.000046


**Task 6**

The final piece of the puzzle is equation 4.11

$P(c) = N_c/N_{doc}$

$N_c$: the number of tweet in our training data with class c
$N_{doc}$: the total number of tweets.

P(+) = number of positive tweets / number of tweets

P(-) = number of negative tweets / number of tweets

Calculate P(+) and P(-)

In [19]:
# YOUR CODE HERE #
# Calculate the number of positive and negative tweets
num_positive_tweets = len(positive_tweets_tr)
num_negative_tweets = len(negative_tweets_tr)

# Calculate the total number of tweets
total_tweets = num_positive_tweets + num_negative_tweets

# Calculate P(+) and P(-)
P_positive = num_positive_tweets / total_tweets
P_negative = num_negative_tweets / total_tweets

# Print the results
print("P(+) =", P_positive)
print("P(-) =", P_negative)

P(+) = 0.5
P(-) = 0.5


**Task 7**

Write the Naive Bayes algorithm by implementing equations 4.5/4.6

Say we have a tweet with 2 tokens `['damnit', ':(']`. Probability of these tweet being positive is proportional to:

`P(tweet|+)P(+)` = `P('damnit'|+) * P(':('|+) * P(+)`

and negative is proportional to:

`P(tweet|-)P(-)` = `P('damnit'|-) * P(':('|-) * P(-)`

If `P(tweet|+)P(+)` > `P(tweet|-)P(-)`, tweet is positive and else negative.

Predict whether this tweet is positive or negative using equations described above. Use the probabilities calculated using Laplacian smoothing.

Remember, section 4.2 page 62

> What do we do about words that occur in our test data but are not in our vocab- ulary at all because they did not occur in any training document in any class? The solution for such unknown words is to ignore them—remove them from the test document and not include any probability for them at all.

In [20]:
tw = negative_tweets_te[3]
print(tw)

['@luketothestar', 'damnit', ':(']


In [21]:
# Given tweet
tw = negative_tweets_te[3]
print(tw)

# Initialize probabilities for positive and negative classes
prob_pos = P_positive
prob_neg = P_negative

# Iterate over tokens in the tweet
for token in tw:
    # Check if the token is in the vocabulary
    if token in df.index:
        # Multiply the probabilities with P(token|+) and P(token|-)
        prob_pos *= df.loc[token, 'P(w_i|+) smooth']
        prob_neg *= df.loc[token, 'P(w_i|-) smooth']

['@luketothestar', 'damnit', ':(']


In [22]:
prob_pos, prob_neg

(2.1993973651219566e-05, 0.0409207456515211)

In [23]:
if prob_pos > prob_neg:
    print('Class positive')
else:
    print('Class negative')

Class negative


**Task 8**

As explained in section 4.1 page 61

> Naive Bayes calculations, like calculations for language modeling, are done in log space, to avoid underflow and increase speed.

In [24]:
# Numerical underflow
print(0.5**1000)
print(0.5**10000)

9.332636185032189e-302
0.0


Calcuate log likelihoods for P(w_i|+)\_smooth and P(w_i|-)\_smooth

|$w_i$| count($w_i$, +) | count($w_i$, -) | P(w_i\|+) | P(w_i\|-) |P(w_i\|+) smooth | P(w_i\|-) smooth |log(P(w_i\|+) smooth)|log(P(w_i\|-) smooth)|
| ----------- | ----------- |----------- |----------- |----------- |----------- |----------- |----------- |----------- |
|(-:|1|0|0.000037|0.000000|0.000056|0.000028|-9.796125|-10.494519|
|(:|1|6|0.000037|0.000222|0.000056|0.000194|-9.796125|-8.548609|
|):|6|6|0.000224|0.000222|0.000195|0.000194|-8.543362|-8.548609|
|--->|1|0|0.000037|0.000000|0.000056|0.000028|-9.796125|-10.494519|
|happi|161|18|0.005998|0.000666|0.004509|0.000526|-5.401676|-7.550080|

In [25]:
# YOUR CODE HERE 
import math
# Calculate log likelihoods for P(w_i|+)_smooth and P(w_i|-)_smooth
df['log(P(w_i|+) smooth)'] = df['P(w_i|+) smooth'].apply(lambda x: math.log(x))
df['log(P(w_i|-) smooth)'] = df['P(w_i|-) smooth'].apply(lambda x: math.log(x))
df

Unnamed: 0,"count(w_i, +)","count(w_i, -)",P(w_i|+),P(w_i|-),P(w_i|+) smooth,P(w_i|-) smooth,log(P(w_i|+) smooth),log(P(w_i|-) smooth)
#bbmme,2,1,0.000033,0.000000,0.000044,0.000023,-10.031595,-10.675700
#segalakatakata,2,1,0.000033,0.000000,0.000044,0.000023,-10.031595,-10.675700
(-:,2,1,0.000033,0.000000,0.000044,0.000023,-10.031595,-10.675700
(:,2,6,0.000033,0.000176,0.000044,0.000139,-10.031595,-8.883941
):,5,5,0.000131,0.000141,0.000110,0.000115,-9.115304,-9.066262
...,...,...,...,...,...,...,...,...
🚂,2,1,0.000033,0.000000,0.000044,0.000023,-10.031595,-10.675700
🚖,1,2,0.000000,0.000035,0.000022,0.000046,-10.724742,-9.982553
🚙,1,2,0.000000,0.000035,0.000022,0.000046,-10.724742,-9.982553
󾆖,1,2,0.000000,0.000035,0.000022,0.000046,-10.724742,-9.982553


**Task 9**

Repeat Task 7 but this time using log likelihoods.

In [26]:
tw = negative_tweets_te[3]
print(tw)

['@luketothestar', 'damnit', ':(']


In [27]:
# YOUR CODE HERE #
# Given tweet
tw = negative_tweets_te[3]
print(tw)

# Initialize log likelihoods for positive and negative classes
log_prob_pos = math.log(P_positive)
log_prob_neg = math.log(P_negative)

# Iterate over tokens in the tweet
for token in tw:
    # Check if the token is in the vocabulary
    if token in df.index:
        # Add the log likelihoods of the token being in each class
        log_prob_pos += df.loc[token, 'log(P(w_i|+) smooth)']
        log_prob_neg += df.loc[token, 'log(P(w_i|-) smooth)']

# Print the log likelihoods
print("Log likelihood for positive class:", log_prob_pos)
print("Log likelihood for negative class:", log_prob_neg)

['@luketothestar', 'damnit', ':(']
Log likelihood for positive class: -10.724742067074814
Log likelihood for negative class: -3.196118115886798


In [28]:
log_prob_pos, log_prob_neg

(-10.724742067074814, -3.196118115886798)

In [29]:
np.exp(log_prob_pos), np.exp(log_prob_neg)

(2.199397365121956e-05, 0.0409207456515211)

In [30]:
prob_pos, prob_neg

(2.1993973651219566e-05, 0.0409207456515211)

In [31]:
if log_prob_pos > log_prob_neg:
    print('Class positive')
else:
    print('Class negative')

Class negative


**Task 10**

Putting everything together, predict whether a tweet is positive or negative, for each tweet in the test set. Calculate accuracy.

In [32]:
y_test = []
y_preds = []

# YOUR CODE HERE #
# Iterate over each tweet in the test set
for tweet in positive_tweets_te + negative_tweets_te:
    # Initialize log likelihoods for positive and negative classes
    log_prob_pos = math.log(P_positive)
    log_prob_neg = math.log(P_negative)
    
    # Iterate over tokens in the tweet
    for token in tweet:
        # Check if the token is in the vocabulary
        if token in df.index:
            # Add the log likelihoods of the token being in each class
            log_prob_pos += df.loc[token, 'log(P(w_i|+) smooth)']
            log_prob_neg += df.loc[token, 'log(P(w_i|-) smooth)']
    
    # Predict the sentiment based on which log likelihood is higher
    if log_prob_pos > log_prob_neg:
        y_preds.append(1)  # positive
    else:
        y_preds.append(0)  # negative
    
    # Determine true label based on tweet's origin
    if tweet in positive_tweets_te:
        y_test.append(1)  # positive
    else:
        y_test.append(0)  # negative
    
y_preds = np.array(y_preds)
y_test = np.array(y_test)

In [33]:
sum(y_preds == y_test)/len(y_test)

0.994