
In [311]:
from utils import process_tweet, lookup
import pdb
from nltk.corpus import stopwords, twitter_samples
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import TweetTokenizer
from os import getcwd


nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [312]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [313]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8002 entries, 0 to 8001
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       8002 non-null   object 
 1   y       8000 non-null   float64
dtypes: float64(1), object(1)
memory usage: 125.2+ KB


In [314]:
train_df.head(20)

Unnamed: 0,x,y
0,#ClimateChange #CC California's powerful and i...,0.0
1,@ohkaibaeks I only have 1 though!! :(,0.0
2,@MsKristinKreuk Hugs ang Kisses from the phili...,1.0
3,@joohyunvrl definitely :D,1.0
4,I will fulfil all your fantasies :) 👉 http://t...,1.0
5,"Sometimes it be's like that, yo. Follow someon...",0.0
6,@Dat_NiggaCarlos :((( it's not like a fersuree...,0.0
7,saturday classes :( fuck,0.0
8,There's nothing as cool as being totally over ...,1.0
9,Bantime: -1 :) #fail2ban,1.0


In [315]:
train_x = list(train_df.x.values)
train_y = list(train_df.y.values)

# Part 1:  Implementing your helper functions

To help train your naive bayes model, you will need to build a dictionary where the keys are a (word, label) tuple and the values are the corresponding frequency.  Note that the labels we'll use here are 1 for positive and 0 for negative.

A `lookup()` helper function is provided that takes the `freqs` dictionary, a word, and a label (1 or 0), and returns the number of times that (word, label) tuple appears in the collection of tweets.
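The `lookup()` helper comes from `utils`, and its source is not shown in this notebook. Below is a minimal sketch of how such a helper might behave, assuming it returns 0 for pairs that were never counted. The names `lookup_sketch` and `toy_freqs` are illustrative and are not the actual implementation.

```python
def lookup_sketch(freqs, word, label):
    """Return how often (word, label) was counted, or 0 if it was never seen."""
    return freqs.get((word, label), 0)


# Hypothetical frequency dictionary keyed by (stemmed word, label)
toy_freqs = {("rather", 1): 2, ("happi", 1): 1, ("excit", 1): 1}

print(lookup_sketch(toy_freqs, "rather", 1))  # 2
print(lookup_sketch(toy_freqs, "rather", 0))  # 0, pair never counted
```

Returning 0 for missing pairs is what lets the smoothing step later treat unseen (word, label) combinations without a special case.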

For example, given the tweets `["i am rather excited", "you are rather happy"]`, both labeled 1, `count_tweets()` returns a dictionary containing the following key-value pairs (stop words are removed and the remaining words are stemmed):

{
    ("rather", 1): 2,
    ("happi", 1) : 1,
    ("excit", 1) : 1
}


#### Instructions
Create a function `count_tweets()` that takes a `result` dictionary, a list of tweets, and a list of their labels as input, cleans every tweet, and returns the updated dictionary.
- The key in the dictionary is a tuple containing the stemmed word and its class label, e.g. ("happi",1).
- The value is the number of times this word appears with that label in the given collection of tweets (an integer).

In [316]:
def count_tweets(result, tweets, ys):
    '''
    Input:
        result: a dictionary that is updated with the counts (pass {} to start fresh)
        tweets: a list of tweets
        ys: a list corresponding to the sentiment of each tweet (either 0 or 1)
    Output:
        result: a dictionary mapping each (word, label) pair to its frequency
    '''
    for tweet, y in zip(tweets, ys):
        for word in process_tweet(tweet):
            # define the key as a (stemmed word, label) tuple, e.g. ('happi', 1)
            pair = (word, y)
            # increment the count, starting from 0 if the key is new
            result[pair] = result.get(pair, 0) + 1
    return result

In [317]:
# Testing your function
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired', ' I am so happy', 'I am not happy']
ys = [1, 0, 0, 0, 0, 1, 0]
count_tweets(result, tweets, ys)

{('happi', 0): 1,
 ('happi', 1): 2,
 ('sad', 0): 1,
 ('tire', 0): 2,
 ('trick', 0): 1}

# Part 2: Training the Bayesian Classifier
## Prior and Logprior

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$.  So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
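As a quick numeric check of equation 3, the sketch below computes the logprior directly from document counts. The function name `logprior_from_counts` and the 6000/2000 split are made up for illustration. The 4000/4000 split corresponds to the balanced training set used in this notebook.

```python
import numpy as np


def logprior_from_counts(d_pos, d_neg):
    """Log prior as the difference of the log document counts (equation 3)."""
    return np.log(d_pos) - np.log(d_neg)


balanced = logprior_from_counts(4000, 4000)
skewed = logprior_from_counts(6000, 2000)  # hypothetical imbalanced split

print(balanced)  # 0.0, neither class is favored
print(skewed)    # log(3), roughly 1.0986, favors the positive class
```

The total number of documents $D$ never appears because it cancels in the ratio $P(D_{pos}) / P(D_{neg})$.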

## Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.
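To see what the smoothing does in practice, here is a small sketch that reuses the counts from the `count_tweets()` test earlier in the notebook (4 unique stems, 2 positive word occurrences, 5 negative word occurrences). The helper name `smoothed_prob` is hypothetical.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Additively smoothed probability of a word within one class (equations 4 and 5)."""
    return (freq + 1) / (n_class + vocab_size)


# Counts from the small count_tweets() test
V, N_pos, N_neg = 4, 2, 5
pos_freqs = {"happi": 2, "sad": 0, "tire": 0, "trick": 0}

pos_probs = {w: smoothed_prob(f, N_pos, V) for w, f in pos_freqs.items()}

print(pos_probs["happi"])       # 0.5
print(pos_probs["sad"])         # never seen as positive, yet not zero
print(sum(pos_probs.values()))  # the smoothed probabilities still sum to about 1
```

Adding $V$ to the denominator is what keeps the probabilities of a class summing to 1 after 1 has been added to each of the $V$ numerators.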

## Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equation:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
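The sign of the loglikelihood is what carries the sentiment. The sketch below illustrates this with the same toy counts as the `count_tweets()` test ($V = 4$, $N_{pos} = 2$, $N_{neg} = 5$). The function name `word_loglikelihood` is an assumption made for this example.

```python
import math


def word_loglikelihood(freq_pos, freq_neg, n_pos, n_neg, vocab_size):
    """Log ratio of the smoothed positive and negative word probabilities (equation 6)."""
    p_w_pos = (freq_pos + 1) / (n_pos + vocab_size)
    p_w_neg = (freq_neg + 1) / (n_neg + vocab_size)
    return math.log(p_w_pos / p_w_neg)


# 'happi' appears twice as positive and once as negative in the toy data
happi = word_loglikelihood(2, 1, n_pos=2, n_neg=5, vocab_size=4)
# 'sad' appears only once, as negative
sad = word_loglikelihood(0, 1, n_pos=2, n_neg=5, vocab_size=4)

print(happi > 0)  # True: the word leans positive
print(sad < 0)    # True: the word leans negative
```

A word that is equally likely in both classes gets a value of 0 and contributes nothing to the final score.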

## Create `freqs` dictionary
- Given your `count_tweets()` function, you can compute a dictionary called `freqs` that contains all the frequencies.
- In this `freqs` dictionary, the key is the tuple (word, label)
- The value is the number of times it has appeared.

We will use this dictionary in several parts of this assignment.

In [318]:
# Build the freqs dictionary for later uses
freqs = count_tweets({}, train_x, train_y)

## Steps to Train the Bayesian Classifier
Given a `freqs` dictionary, `train_x` (a list of tweets) and `train_y` (a list of labels for each tweet), implement a Naive Bayes classifier.

##### Calculate $V$
- You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).

##### Calculate $freq_{pos}$ and $freq_{neg}$
- Using your `freqs` dictionary, you can compute the positive and negative frequency of each word $freq_{pos}$ and $freq_{neg}$.

##### Calculate $N_{pos}$ and $N_{neg}$
- Using `freqs` dictionary, you can also compute the total number of positive words and total number of negative words $N_{pos}$ and $N_{neg}$.
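The counting steps so far can be sketched on the small dictionary returned by the `count_tweets()` test. This is only an illustration with a toy dictionary. The variable name `toy_freqs` is not part of the notebook.

```python
# (stemmed word, label) -> count, as returned by the count_tweets() test
toy_freqs = {
    ("happi", 0): 1,
    ("happi", 1): 2,
    ("sad", 0): 1,
    ("tire", 0): 2,
    ("trick", 0): 1,
}

# V: number of unique words, regardless of label
vocab = {word for word, _ in toy_freqs}
V = len(vocab)

# N_pos / N_neg: total word occurrences counted under each label
N_pos = sum(count for (_, label), count in toy_freqs.items() if label == 1)
N_neg = sum(count for (_, label), count in toy_freqs.items() if label == 0)

print(V, N_pos, N_neg)  # 4 2 5
```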

##### Calculate $D$, $D_{pos}$, $D_{neg}$
- Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
- Calculate the probability that a document (tweet) is positive $P(D_{pos})$, and the probability that a document (tweet) is negative $P(D_{neg})$
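`train_df.info()` reports 8000 non-null labels out of 8002 rows, so a few labels are missing. The sketch below, using a made-up label array `toy_y`, shows that counting with equality comparisons simply skips missing (NaN) labels, which is one reason to count each class explicitly instead of deriving one count from the other.

```python
import numpy as np

# Hypothetical labels, including one missing value
toy_y = np.array([1.0, 0.0, np.nan, 1.0, 0.0, 0.0])

D = len(toy_y)                   # all rows, labeled or not
D_pos = int(np.sum(toy_y == 1))  # NaN == 1 is False, so NaN is not counted
D_neg = int(np.sum(toy_y == 0))  # NaN == 0 is False as well

print(D, D_pos, D_neg)  # 6 2 3
print(D - D_pos)        # 4, which would overcount the negatives
```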

##### Calculate the logprior
- The logprior is $\log(D_{pos}) - \log(D_{neg})$.

##### Calculate log likelihood
- Finally, you can iterate over each word in the vocabulary, use your `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
- Compute the positive probability of each word $P(W_{pos})$, negative probability of each word $P(W_{neg})$ using equations 4 & 5.

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

**Note:** We'll use a dictionary to store the log likelihoods for each word.  The key is the word, and the value is the log likelihood of that word.

- You can then compute the loglikelihood using equation 6:

$$\text{loglikelihood} = \log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$

In [319]:
def train_naive_bayes(freqs, train_x, train_y):
    '''
    Input:
        freqs: dictionary from (word, label) to how often the word appears
        train_x: a list of tweets
        train_y: a list of labels corresponding to the tweets (0,1)
    Output:
        logprior: the log prior. (equation 3 above)
        loglikelihood: the log likelihood of your Naive Bayes equation. (equation 6 above)
    '''
    loglikelihood = {}
    logprior = 0

    # calculate V, the number of unique words in the vocabulary
    vocab = set([pair[0] for pair in freqs.keys()]) 
    V = len(vocab)

    # calculate N_pos and N_neg
    N_pos = N_neg = 0
    for pair in freqs.keys():
        # if the label is positive (greater than zero)
        if pair[1]:
            N_pos += freqs[pair] # Increment the number of positive words by the count for this (word, label) pair
        else:
            N_neg += freqs[pair]

    
    D = len(train_y) # Calculate D, the number of documents
    train_y_arr = np.asarray(train_y) # use the labels passed in rather than the global train_df
    D_pos = np.sum(train_y_arr == 1) # Calculate D_pos and D_neg, the number of positive and negative documents
    D_neg = np.sum(train_y_arr == 0)
    logprior = np.log(D_pos) - np.log(D_neg)     # Calculate logprior

    # For each word in the vocabulary...
    for word in vocab:
        # get the positive and negative frequency of the word
        freq_pos = lookup(freqs,word,1) 
        freq_neg = lookup(freqs,word,0)

        # calculate the probability that each word is positive, and negative
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)

        # calculate the log likelihood of the word
        loglikelihood[word] = np.log(p_w_pos/p_w_neg)

    return logprior, loglikelihood

In [320]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(logprior)
print(len(loglikelihood))

0.0
9089


In [321]:
loglikelihood

{'shadow': 0.6985591249960175,
 '43': 0.005411944436072202,
 'understand': -0.5335845562966148,
 'cooki': 0.005411944436072202,
 'matern': -0.6877352361238731,
 'help@veryhq.co.uk': -0.6877352361238731,
 'pea': -0.6877352361238731,
 'push': -0.4000531636720922,
 'emo': -1.0932003442320375,
 'birthdaygirl': 0.6985591249960175,
 'besteverdoctorwhoepisod': 0.6985591249960175,
 'hate': -1.520644359058977,
 'areadi': -0.6877352361238731,
 'zayniscomingback': -1.3808824166838185,
 'wheel': -0.6877352361238731,
 'biom': -0.6877352361238731,
 'aerial': 0.6985591249960175,
 'hike': -0.6877352361238731,
 'children': 0.6985591249960175,
 'crop': 0.41087705254423673,
 "school'": 0.6985591249960175,
 'nairobi': 0.6985591249960175,
 'congrat': 1.6793883780077437,
 'buzz': 1.104024233104182,
 'san': -0.4000531636720922,
 'hit': 1.104024233104182,
 'hour': -0.07154909670005613,
 'outfit': -0.6877352361238731,
 'chew': -0.6877352361238731,
 'puhon': 0.6985591249960175,
 'spray': -0.6877352361238731,
 ...

# Part 3: Testing the Classifier
Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes function by making predictions on some tweets!

#### Implement `naive_bayes_predict`
**Instructions**:
Implement the `naive_bayes_predict` function to make predictions on tweets.
* The function takes in the `tweet`, `logprior`, `loglikelihood`.
* It returns a score rather than a probability: a value above 0 suggests the positive class, and a value of 0 or below suggests the negative class.
* For each tweet, sum up loglikelihoods of each word in the tweet.
* Also add the logprior to this sum to get the predicted sentiment of that tweet.

$$ p = \text{logprior} + \sum_{i=1}^{N} \text{loglikelihood}_i $$

#### Note
Note that we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets).  This means that the ratio of positive to negative is 1, and the logprior is 0.

The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood.  However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.
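To illustrate why the logprior should stay in the sum, the sketch below scores the same tokens under a balanced prior and under an imbalanced one. Everything here is hypothetical: the `score` helper, the toy log likelihood values, and the 1000/3000 class split are chosen only to show a prediction flipping sign.

```python
import math


def score(tokens, logprior, loglikelihood):
    """Add the logprior to the log likelihood of every known token."""
    return logprior + sum(loglikelihood.get(tok, 0.0) for tok in tokens)


toy_loglikelihood = {"good": 0.4, "bad": -0.5}  # hypothetical values
tokens = ["good", "unseen"]                     # unknown tokens contribute 0

balanced = score(tokens, 0.0, toy_loglikelihood)
skewed = score(tokens, math.log(1000 / 3000), toy_loglikelihood)

print(balanced > 0)  # True: predicted positive with a balanced prior
print(skewed > 0)    # False: the negative-leaning prior outweighs one mild word
```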

In [322]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    '''
    Input:
        tweet: a string
        logprior: a number
        loglikelihood: a dictionary of words mapping to numbers
    Output:
        p: the sum of all the log likelihoods of each word in the tweet (if found in the dictionary) + logprior (a number)
    '''
    word_l = process_tweet(tweet) #tokenizing the tweet and processing

    p = 0  # initialize probability to zero
    p += logprior # add the logprior

    for word in word_l:
        if word in loglikelihood:   # check if the word exists in the loglikelihood dictionary
            p += loglikelihood[word] # add the log likelihood of that word to the probability
    return p

In [323]:
loglikelihood['smile'], loglikelihood['joy'], loglikelihood['fun']

(1.5740278623499175, 0.29309401688785314, 0.7895309032017441)

In [324]:
loglikelihood[':)'], loglikelihood[':(']

(6.860820743046, -7.507751600798003)

In [325]:
# Experiment with your own tweet.
my_tweet = 'Jannet Smiled with fear and cried, broke down :) '
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print('The expected output is', p)

The expected output is 5.614201365758956


In [326]:
words = process_tweet(my_tweet)
words

for word in words:
    if word in loglikelihood:
        print(f'{word} Likelihood : {loglikelihood[word]} ')
    else:
        print(f'{word} not found in training corpus!')


jannet not found in training corpus!
smile Likelihood : 1.5740278623499175 
fear Likelihood : 0.005411944436072202 
cri Likelihood : -1.5787081600137383 
broke Likelihood : -1.2473510240592958 
:) Likelihood : 6.860820743046 
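As a sanity check, the per-word values printed above should add up to the score reported for `my_tweet`. The sketch below copies those printed numbers by hand into a dictionary named `word_scores` (a name introduced here) and uses the logprior of 0.0 printed after training.

```python
# Values copied from the output above; 'jannet' is absent because it was not found
word_scores = {
    "smile": 1.5740278623499175,
    "fear": 0.005411944436072202,
    "cri": -1.5787081600137383,
    "broke": -1.2473510240592958,
    ":)": 6.860820743046,
}
logprior_reported = 0.0

total = logprior_reported + sum(word_scores.values())
print(total)  # close to the reported 5.614201365758956
```

The emoticon dominates the sum, which is why the tweet is scored as positive even though several of its words lean negative.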


In [327]:
test_df = pd.read_csv('test.csv')

In [328]:
test_df.head()

Unnamed: 0,x,y
0,@V4Violetta Or that. I guess I need to build m...,1.0
1,@sennicka don't over engineer it. :),1.0
2,@ellieharveyy it probably your fault he lost h...,0.0
3,Good morning Kimmy :) @KimberlyKWyatt,1.0
4,@RockMyWedding @SouthFarm1 @JohnHopePhoto @Mir...,1.0


Calculate the following performance parameters from test.csv data:
* TP, FP, TN, FN rates
* Accuracy
* Precision
* F1 score
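The metrics listed above can all be derived from the four confusion-matrix counts. Here is a minimal sketch, assuming the counts are already available. The function name `classification_metrics` and the toy counts are illustrative and are not the notebook's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for a 20-sample test set
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against division by zero, so it assumes at least one predicted positive and one actual positive.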

In [329]:
# test = test_df.to_numpy()
# print('test shape:', test.shape)
# print(test[0,0] ,  test[0,1])

In [330]:
y_true = test_df['y'].to_numpy()
y_pred = np.zeros(len(test_df))
print(y_true.shape, y_pred.shape)
print(y_true[0], y_pred[0])

(2001,) (2001,)
1.0 0.0


In [331]:
for j, tweet in enumerate(test_df['x']):
  p = naive_bayes_predict(tweet, logprior, loglikelihood)
  y_pred[j] = p  # store the raw score; it is thresholded in a later cell

In [332]:
print(y_pred)
print(y_pred.shape)

[  8.75118123   7.55937987 -11.90954051 ...   8.53493206   6.38072487
  -7.85737914]
(2001,)


In [333]:
# y_pred[y_pred>0.0]=1.0
# y_pred[y_pred<=0.0]=0.0

y_pred = np.where(y_pred>0.0, 1.0, 0.0)
print(y_pred)
print(y_pred.shape)


[1. 1. 0. ... 1. 1. 0.]
(2001,)


In [334]:
# 0.0 means correct detection = TP, TN
# 1.0 means FP
# -1.0 means FN
y_diff = y_pred - y_true 
print('y_diff shape:', y_diff.shape, 'y_pred shape:', y_pred.shape, 'y_true shape:', y_true.shape)

P_all = np.count_nonzero(y_true == 1.0)
N_all = np.count_nonzero(y_true == 0.0)

P_pred = np.count_nonzero(y_pred == 1.0)
N_pred = np.count_nonzero(y_pred == 0.0)

tp = np.count_nonzero((y_diff==0.0) & (y_pred==1.0))
fp = np.count_nonzero(y_diff > 0.0)
tn = np.count_nonzero((y_diff==0.0) & (y_pred==0.0))
fn = np.count_nonzero(y_diff < 0.0)

print('P:', P_all, 'N:', N_all)
print('P_pred:', P_pred, 'N_pred:', N_pred)
print('TP:', tp, 'FP:', fp, 'TN:', tn, 'FN:', fn)
print(tp+fp+tn+fn)

y_diff shape: (2001,) y_pred shape: (2001,) y_true shape: (2001,)
P: 1000 N: 1000
P_pred: 996 N_pred: 1005
TP: 991 FP: 4 TN: 996 FN: 9
2000
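The four counts add up to 2000 although `y_true` has 2001 entries, and the positive and negative label counts also total 2000. One plausible explanation, which the output shown does not confirm, is that one row of `test.csv` has a missing label. The sketch below uses made-up arrays (`toy_true`, `toy_pred`) to show how a NaN label falls outside all four cases and how such rows could be located.

```python
import numpy as np

toy_true = np.array([1.0, 0.0, np.nan])  # hypothetical labels, one missing
toy_pred = np.array([1.0, 1.0, 0.0])
toy_diff = toy_pred - toy_true           # NaN propagates into the difference

tp = np.count_nonzero((toy_diff == 0.0) & (toy_pred == 1.0))
fp = np.count_nonzero(toy_diff > 0.0)
tn = np.count_nonzero((toy_diff == 0.0) & (toy_pred == 0.0))
fn = np.count_nonzero(toy_diff < 0.0)

print(tp + fp + tn + fn, len(toy_true))  # 2 3

# Rows with a usable label; the remaining rows are the unlabeled ones
labeled = ~np.isnan(toy_true)
print(np.count_nonzero(labeled))  # 2
```

Every comparison against NaN evaluates to False, so an unlabeled row is silently left out of the counts rather than raising an error.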


In [335]:
tp_rate = tp/P_all
fp_rate = fp/N_all
tn_rate = tn/N_all
fn_rate = fn/P_all

precision = tp/(tp+fp)
recall = tp/P_all

accuracy = (tp+tn)/(P_all+N_all)
f_measure = 2/((1/precision) + (1/recall))


print('TP:', tp, ' FP:', fp, ' TN:', tn, ' FN:', fn)
print('TP rate:', tp_rate, ' FP rate:', fp_rate, ' TN rate:', tn_rate, ' FN rate:', fn_rate)
print('Precision:', precision, ' Recall:', recall)
print('Accuracy:', accuracy, ' F-measure:', f_measure)

TP: 991  FP: 4  TN: 996  FN: 9
TP rate: 0.991  FP rate: 0.004  TN rate: 0.996  FN rate: 0.009
Precision: 0.9959798994974874  Recall: 0.991
Accuracy: 0.9935  F-measure: 0.993483709273183
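The F-measure above is the harmonic mean of precision and recall, which can also be written directly in terms of the counts as $2TP / (2TP + FP + FN)$. The sketch below cross-checks the two forms using the counts printed above. The variable names are introduced here for the check only.

```python
# Counts copied from the output above
tp_reported, fp_reported, fn_reported = 991, 4, 9

precision_check = tp_reported / (tp_reported + fp_reported)
recall_check = tp_reported / (tp_reported + fn_reported)

f1_harmonic = 2 / ((1 / precision_check) + (1 / recall_check))
f1_counts = 2 * tp_reported / (2 * tp_reported + fp_reported + fn_reported)

print(f1_harmonic)
print(f1_counts)  # both should be close to the reported 0.993483709273183
```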
