<a href="https://colab.research.google.com/github/gupta24789/sentiment-analysis/blob/main/01_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
import pandas as pd
import numpy as np
import itertools
from collections import Counter

## Read Data

In [39]:
train_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/val.csv")

train_df.processed_tweet = train_df.processed_tweet.fillna('[]').apply(lambda x: eval(x) if x is not None else [])
val_df.processed_tweet = val_df.processed_tweet.fillna('[]').apply(lambda x: eval(x) if x is not None else [])

## Create Word Freq by label

In [40]:
pos_freq_dict = Counter(list(itertools.chain.from_iterable(train_df[train_df.label==1]['processed_tweet'].tolist())))
pos_freq_dict.most_common(10)

[(':)', 2859),
 (':-)', 540),
 (':d', 500),
 ('thank', 493),
 ('love', 303),
 ('follow', 298),
 ('...', 214),
 ('day', 199),
 ('good', 190),
 ('like', 176)]

In [41]:
neg_freq_dict = Counter(list(itertools.chain.from_iterable(train_df[train_df.label==0]['processed_tweet'].tolist())))
neg_freq_dict.most_common(10)

[(':(', 3657),
 (':-(', 395),
 ("i'm", 279),
 ('...', 262),
 ('miss', 249),
 ('pleas', 230),
 ('follow', 212),
 ('want', 209),
 ('go', 189),
 ('like', 187)]

## Create Features

- pos_freq : sum of positive freq of all unique words in tweet
- neg_freq : sum of negative freq of all unique words in the tweet
- bias : 1

In [42]:
train_df['pos_freq'] = train_df.processed_tweet.apply(lambda x: np.sum([pos_freq_dict.get(w,0) for w in set(x)]))
train_df['neg_freq'] = train_df.processed_tweet.apply(lambda x: np.sum([neg_freq_dict.get(w,0) for w in set(x)]))

val_df['pos_freq'] = val_df.processed_tweet.apply(lambda x: np.sum([pos_freq_dict.get(w,0) for w in set(x)]))
val_df['neg_freq'] = val_df.processed_tweet.apply(lambda x: np.sum([neg_freq_dict.get(w,0) for w in set(x)]))

In [43]:
train_df.sample(6)

Unnamed: 0,raw_tweet,processed_tweet,label,pos_freq,neg_freq
4955,anyone has the pic of taeyeon's derp in channe...,"[anyon, pic, taeyeon', derp, channel, snsd, ha...",0.0,37.0,3708.0
3985,Hello :) Get Youth Job Opportunities follow &g...,"[hello, :), get, youth, job, opportun, follow]",1.0,3443.0,414.0
625,@moonlight69 well you're in for a wild ride &g...,"[well, wild, ride, >:d]",1.0,72.0,52.0
5097,@BellsIsMine what happened? :(,"[happen, :(]",0.0,15.0,3696.0
238,"""I shouldn't b called a friend if I am not the...","[b, call, friend, need, :), ...]",1.0,3223.0,413.0
2275,this #mca money tells my story :) http://t.co/...,"[mca, money, tell, stori, :)]",1.0,2898.0,52.0


## **Naive Bayes**

Naive bayes is an algorithm that could be used for sentiment analysis. It takes a short time to train and also has a short prediction time.

#### **So how do you train a Naive Bayes classifier?**
- The first part of training a naive bayes classifier is to identify the number of classes that you have.
- You will create a probability for each class.
$P(D_{pos})$ is the probability that the document is positive.
$P(D_{neg})$ is the probability that the document is negative.
Use the formulas as follows and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.


#### **Prior and Logprior**

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior

$$\text{logprior} = log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = log \left( \frac{D_{pos}}{D_{neg}} \right)$$.

Note that $log(\frac{A}{B})$ is the same as $log(A) - log(B)$.  So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$


#### **Positive and Negative Probability of a Word**
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
- $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
- $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

Notice that we add the "+1" in the numerator for additive smoothing.  This [wiki article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.


#### **Log likelihood**
To compute the loglikelihood of that very same word, we can implement the following equations:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$

In [44]:
D = train_df.shape[0]
D_pos = train_df[train_df.label==1].shape[0]
D_neg = train_df[train_df.label==0].shape[0]
logprior = np.log(D_pos/D_neg)
print(f"Logprior : {logprior}")

Logprior : 0.0


In [49]:
V = len(set(pos_freq_dict.keys()).union(neg_freq_dict.keys()))  ## unique words from pos+neg

In [53]:
train_df['pos_prob'] = train_df.pos_freq.apply(lambda x: (x+1)/(len(pos_freq_dict) + V))
train_df['neg_prob'] = train_df.neg_freq.apply(lambda x: (x+1)/(len(neg_freq_dict) + V))
train_df['log_likelihood'] = train_df.pos_prob/train_df.neg_prob

In [57]:
val_df['pos_prob'] = val_df.pos_freq.apply(lambda x: (x+1)/(len(pos_freq_dict) + V))
val_df['neg_prob'] = val_df.neg_freq.apply(lambda x: (x+1)/(len(neg_freq_dict) + V))
val_df['log_likelihood'] = val_df.pos_prob/val_df.neg_prob

In [59]:
## Add logprior and loglikehood
train_df['log_likelihood'] = logprior + train_df['log_likelihood']
val_df['log_likelihood'] = logprior + val_df['log_likelihood']

## Prediction

- if log_likelihood>=0 => 1
- if log_likelihood<0 => 0

In [68]:
train_df['pred_label'] = np.where(train_df.log_likelihood>0,1,0)
val_df['pred_label'] = np.where(val_df.log_likelihood>0,1,0)

## Accuracy

In [72]:
print("Train Accuracy : ", np.sum(train_df.pred_label == train_df.label)/len(train_df))
print("Val Accuracy : ", np.sum(val_df.pred_label == val_df.label)/len(val_df))

Train Accuracy :  0.49987503124218946
Val Accuracy :  0.49975012493753124
