
In [1]:
import pandas as pd
import numpy as np
import itertools
from collections import Counter

## Read Data

In [2]:
train_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/train.csv")
val_df = pd.read_csv("https://raw.githubusercontent.com/gupta24789/sentiment-analysis/main/data/val.csv")

import ast  ## literal_eval parses the stringified token lists without executing arbitrary code

## missing values become empty token lists
train_df.processed_tweet = train_df.processed_tweet.fillna('[]').apply(ast.literal_eval)
val_df.processed_tweet = val_df.processed_tweet.fillna('[]').apply(ast.literal_eval)

In [13]:
train_df.label.value_counts()

1.0    4000
0.0    4000
Name: label, dtype: int64

In [14]:
val_df.label.value_counts()

1    1000
0    1000
Name: label, dtype: int64

## Create Word Frequencies by Label

In [15]:
## count every token that appears in positive tweets
pos_freq_dict = Counter(itertools.chain.from_iterable(train_df.loc[train_df.label == 1, 'processed_tweet']))
pos_freq_dict.most_common(10)

[(':)', 2866),
 (':-)', 530),
 ('thank', 507),
 (':d', 504),
 ('love', 322),
 ('follow', 306),
 ('...', 221),
 ('day', 193),
 ('good', 191),
 ('like', 186)]

In [16]:
## count every token that appears in negative tweets
neg_freq_dict = Counter(itertools.chain.from_iterable(train_df.loc[train_df.label == 0, 'processed_tweet']))
neg_freq_dict.most_common(10)

[(':(', 3636),
 (':-(', 404),
 ("i'm", 293),
 ('...', 268),
 ('miss', 242),
 ('pleas', 219),
 ('follow', 202),
 ('want', 192),
 ('like', 190),
 ('get', 189)]

## Create Features

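- pos_freq : for each unique word in the tweet, its frequency in positive training tweets, summed over those words
- neg_freq : for each unique word in the tweet, its frequency in negative training tweets, summed over those words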
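
The two features can be sketched on a toy corpus. The tweets, labels, and the helper name `freq_features` below are hypothetical and only illustrate the counting; they are not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy corpus: (token list, label) with 1 = positive, 0 = negative.
toy_tweets = [
    (["love", "day", ":)"], 1),
    (["thank", "love", ":)"], 1),
    (["miss", "day", ":("], 0),
]

# Per-class word frequencies, one Counter per label.
toy_pos = Counter(chain.from_iterable(t for t, y in toy_tweets if y == 1))
toy_neg = Counter(chain.from_iterable(t for t, y in toy_tweets if y == 0))

def freq_features(tokens, pos_counts, neg_counts):
    """Sum the class frequencies over the unique tokens of one tweet."""
    unique = set(tokens)  # a repeated word is counted once
    pos = sum(pos_counts.get(w, 0) for w in unique)
    neg = sum(neg_counts.get(w, 0) for w in unique)
    return pos, neg

features = freq_features(["love", "love", "day"], toy_pos, toy_neg)
print(features)  # (3, 1): love=2 and day=1 in positive tweets, day=1 in negative tweets
```

Because the sum runs over `set(tokens)`, repeating "love" in the tweet does not change the feature values.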

In [17]:
train_df['pos_freq'] = train_df.processed_tweet.apply(lambda x: np.sum([pos_freq_dict.get(w,0) for w in set(x)]))
train_df['neg_freq'] = train_df.processed_tweet.apply(lambda x: np.sum([neg_freq_dict.get(w,0) for w in set(x)]))

val_df['pos_freq'] = val_df.processed_tweet.apply(lambda x: np.sum([pos_freq_dict.get(w,0) for w in set(x)]))
val_df['neg_freq'] = val_df.processed_tweet.apply(lambda x: np.sum([neg_freq_dict.get(w,0) for w in set(x)]))

In [18]:
train_df.head(6)

| Unnamed: 0 | raw_tweet | processed_tweet | label | pos_freq | neg_freq |
|---|---|---|---|---|---|
| 0 | Want to say a huge thanks to @WarriorAssaultS ... | [want, say, huge, thank, ff, thank, support, :)] | 1.0 | 3575.0 | 358.0 |
| 1 | @jaynehh_ you just need a job and get a letter... | [need, job, get, letter, work, place, say, wor... | 1.0 | 958.0 | 464.0 |
| 2 | @knhillrocks HA yes, make it quick tho :D | [ha, ye, make, quick, tho, :d] | 1.0 | 690.0 | 144.0 |
| 3 | @shartyboy Thanks for texting me back :)) I'm ... | [thank, text, back, :), i'm, text, tomorrow, :)] | 1.0 | 3650.0 | 512.0 |
| 4 | Laying out a greetings card range for print to... | [lay, greet, card, rang, print, today, love, j... | 1.0 | 990.0 | 240.0 |
| 5 | #FollowFriday @CCIFCcanada @AdamEvnmnt @boxcal... | [followfriday, top, engag, member, commun, wee... | 1.0 | 3026.0 | 58.0 |


## **Naive Bayes**

Naive Bayes is a simple probabilistic classifier that can be used for sentiment analysis. It is fast to train and fast at prediction time, because training reduces to counting word occurrences per class.

#### **So how do you train a Naive Bayes classifier?**
- The first step in training a Naive Bayes classifier is to identify the classes. Here there are two: positive and negative.
- Next, estimate a probability for each class.
$P(D_{pos})$ is the probability that a document is positive.
$P(D_{neg})$ is the probability that a document is negative.
Use the following formulas and store the values in a dictionary:

$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$

$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$

Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.
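
Equations (1) and (2) amount to counting labels. A minimal sketch, assuming a hypothetical list of labels where 1 is positive and 0 is negative; the names `toy_labels` and `class_prob` are illustrative only.

```python
from collections import Counter

# Hypothetical labels for four tweets: three positive, one negative.
toy_labels = [1, 1, 1, 0]

label_counts = Counter(toy_labels)
D = len(toy_labels)  # total number of documents

# Store the class probabilities in a dictionary, as suggested above.
class_prob = {
    "pos": label_counts[1] / D,  # P(D_pos) = D_pos / D
    "neg": label_counts[0] / D,  # P(D_neg) = D_neg / D
}
print(class_prob)  # {'pos': 0.75, 'neg': 0.25}
```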


#### **Prior and Logprior**

The prior probability represents the underlying probability in the target population that a tweet is positive versus negative.  In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".

The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$.
We can take the log of the prior to rescale it, and we'll call this the logprior:

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$

Note that $\log(\frac{A}{B})$ is the same as $\log(A) - \log(B)$, so the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log (P(D_{pos})) - \log (P(D_{neg})) = \log (D_{pos}) - \log (D_{neg})\tag{3}$$
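
Equation (3) can be checked numerically. The sketch below uses the balanced class counts reported for the training set (4000 and 4000) and, for contrast, a hypothetical 3:1 split; the function name `log_prior` is an illustrative choice.

```python
import math

def log_prior(d_pos, d_neg):
    """Return log(D_pos / D_neg), computed as a difference of logs."""
    return math.log(d_pos) - math.log(d_neg)

balanced = log_prior(4000, 4000)  # equal class counts
skewed = log_prior(3000, 1000)    # hypothetical 3:1 imbalance toward positive

print(balanced)  # 0.0: the prior favors neither class
print(skewed)    # about 1.0986, i.e. log(3): the prior favors the positive class
```

With balanced classes the logprior is zero, so the prediction depends only on the word evidence.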


#### **Positive and Negative Probability of a Word**
To compute the positive and negative probability of a specific word in the vocabulary, we'll use the following inputs:

- $freq_{pos}$ and $freq_{neg}$ are the frequencies of that word in the positive and negative class. In other words, the positive frequency of a word is the number of times it occurs in tweets with label 1.
- $N_{pos}$ and $N_{neg}$ are the total numbers of words, counting repeats, across all positive and all negative tweets, respectively.
- $V$ is the number of unique words in the entire set of documents, across both classes.

We'll use these to compute the positive and negative probability for a specific word using this formula:

$$ P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4} $$
$$ P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5} $$

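Notice that we add 1 in the numerator and $V$ in the denominator. This is additive smoothing: it keeps the probability of a word that never occurs in a class from being zero. This [Wikipedia article](https://en.wikipedia.org/wiki/Additive_smoothing) explains more about additive smoothing.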
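
A small numeric sketch of equations (4) and (5), using hypothetical counts chosen so the arithmetic is easy to follow. The names `smoothed_prob`, `N_toy`, and `V_toy` are illustrative.

```python
def smoothed_prob(freq, n_class, vocab_size):
    """Additively smoothed word probability: (freq + 1) / (N_class + V)."""
    return (freq + 1) / (n_class + vocab_size)

# Hypothetical class with 10 word occurrences spread over a 6-word vocabulary.
toy_freqs = [3, 2, 2, 1, 1, 1]
N_toy = sum(toy_freqs)  # total word count in the class
V_toy = len(toy_freqs)  # number of unique words

p_seen = smoothed_prob(3, N_toy, V_toy)    # (3 + 1) / (10 + 6) = 0.25
p_unseen = smoothed_prob(0, N_toy, V_toy)  # (0 + 1) / (10 + 6) = 0.0625, not zero

# Adding V to the denominator keeps the smoothed probabilities summing to 1.
total = sum(smoothed_prob(f, N_toy, V_toy) for f in toy_freqs)
print(p_seen, p_unseen, total)
```

The last line illustrates why $V$ appears in the denominator: each of the $V$ words gains one extra count, so the normalizer must grow by $V$.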


#### **Log likelihood**
To compute the loglikelihood of that same word, we take the log of the ratio of its two probabilities:

$$\text{loglikelihood} = \log \left(\frac{P(W_{pos})}{P(W_{neg})} \right)\tag{6}$$
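
Equation (6) can be sketched per word on hypothetical counts. The block also shows the textbook way to score a tweet, which is the logprior plus the sum of the per-word loglikelihoods. This is an assumption about the standard formulation, shown for comparison; it is not the frequency-sum shortcut used in the notebook cells. All names and counts are illustrative.

```python
import math
from collections import Counter

# Hypothetical per-class word counts.
toy_pos_counts = Counter({"love": 3, "day": 2, ":)": 5})
toy_neg_counts = Counter({"miss": 4, "day": 1, ":(": 5})

toy_vocab = set(toy_pos_counts) | set(toy_neg_counts)
V_toy = len(toy_vocab)                     # unique words across both classes
N_pos_toy = sum(toy_pos_counts.values())   # total positive word count
N_neg_toy = sum(toy_neg_counts.values())   # total negative word count

def word_loglikelihood(word):
    """log(P(W_pos) / P(W_neg)) with additive smoothing, as in equation (6)."""
    p_pos = (toy_pos_counts[word] + 1) / (N_pos_toy + V_toy)
    p_neg = (toy_neg_counts[word] + 1) / (N_neg_toy + V_toy)
    return math.log(p_pos / p_neg)

word_scores = {w: word_loglikelihood(w) for w in toy_vocab}

def tweet_score(tokens, logprior=0.0):
    """Logprior plus summed word loglikelihoods; unknown words contribute 0."""
    return logprior + sum(word_scores.get(w, 0.0) for w in tokens)

print(word_scores["love"])           # positive value: "love" is evidence for class 1
print(word_scores["miss"])           # negative value: "miss" is evidence for class 0
print(tweet_score(["love", "day"]))  # above 0, so the tweet would be labelled positive
```

A word with a loglikelihood near zero, such as one that is equally common in both classes, carries almost no evidence either way.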

In [19]:
## D_pos: total number of positive tweets, D_neg : total number of negative tweets
D_pos = train_df[train_df.label==1].shape[0]
D_neg = train_df[train_df.label==0].shape[0]
logprior = np.log(D_pos/D_neg)
print(f"Logprior : {logprior}")

Logprior : 0.0


In [20]:
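V = len(set(pos_freq_dict) | set(neg_freq_dict))  ## unique words across both classes
## Note: len(...) counts the unique words in each class. The total word counts
## described for N_pos and N_neg would be sum(pos_freq_dict.values()) and sum(neg_freq_dict.values()).
N_pos = len(pos_freq_dict)
N_neg = len(neg_freq_dict)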

In [21]:
## smoothed probabilities, computed column-wise
train_df['pos_prob'] = (train_df.pos_freq + 1) / (N_pos + V)
train_df['neg_prob'] = (train_df.neg_freq + 1) / (N_neg + V)
train_df['log_likelihood'] = np.log(train_df.pos_prob / train_df.neg_prob)

In [22]:
## same transformation for the validation set, using the training counts
val_df['pos_prob'] = (val_df.pos_freq + 1) / (N_pos + V)
val_df['neg_prob'] = (val_df.neg_freq + 1) / (N_neg + V)
val_df['log_likelihood'] = np.log(val_df.pos_prob / val_df.neg_prob)

In [23]:
## Add logprior and loglikelihood
train_df['log_likelihood'] = logprior + train_df['log_likelihood']
val_df['log_likelihood'] = logprior + val_df['log_likelihood']

## Prediction

- if log_likelihood >= 0 => 1 (positive)
- if log_likelihood < 0 => 0 (negative)

In [24]:
train_df['pred_label'] = np.where(train_df.log_likelihood>=0,1,0)
val_df['pred_label'] = np.where(val_df.log_likelihood>=0,1,0)

## Accuracy

In [25]:
print("Train Accuracy : ", np.sum(train_df.pred_label == train_df.label)/len(train_df))
print("Val Accuracy : ", np.sum(val_df.pred_label == val_df.label)/len(val_df))

Train Accuracy :  0.9920029988754218
Val Accuracy :  0.99
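
A row whose label is missing can never equal a prediction, so a ratio taken over all rows counts it as an error. The sketch below, on hypothetical predictions and labels, compares accuracy over all rows with accuracy over labelled rows only. The function `accuracy` and its `skip_missing` flag are illustrative names, not part of the notebook.

```python
import math

def is_missing(label):
    """True when the label is a float NaN."""
    return isinstance(label, float) and math.isnan(label)

def accuracy(preds, labels, skip_missing=True):
    """Share of matching predictions, optionally ignoring rows without a label."""
    pairs = [(p, y) for p, y in zip(preds, labels)
             if not (skip_missing and is_missing(y))]
    return sum(p == y for p, y in pairs) / len(pairs)

# Hypothetical data: the third row has no label.
toy_preds = [1, 0, 1, 1]
toy_labels = [1.0, 0.0, float("nan"), 0.0]

acc_labelled = accuracy(toy_preds, toy_labels)                # 2 correct out of 3
acc_all = accuracy(toy_preds, toy_labels, skip_missing=False) # 2 correct out of 4
print(acc_labelled, acc_all)
```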


## Error Analysis

In [31]:
for i, row in train_df[train_df.pred_label != train_df.label].sample(6).iterrows():
  print(f"Raw Tweet : {row['raw_tweet']}")
  print(f"Processed Tweet : {row['processed_tweet']}")
  print(f"Label : {row['label']}")
  print("\n")

Raw Tweet : Get more at http://t.co/aady6CDfB2 :) http://t.co/aXu8Gidwv3
Processed Tweet : ['get']
Label : 1.0


Raw Tweet : http://t.co/FLVKMeaL1i TODAY IN TEE TOURNAMENT!! #EqualityAct GET YOURS FOR ONLY 11$!! :D http://t.co/7lwnbd1KvJ
Processed Tweet : []
Label : 1.0


Raw Tweet : https://t.co/85JGM6Oj6Q new video people, check it out :)
Processed Tweet : []
Label : 1.0


Raw Tweet : @Harbinger1973 you star!! Thank you very much. :)
Processed Tweet : []
Label : nan


Raw Tweet : A brief introduction 2 d earliest history of #Indian subcontinent even bfr Mauryas:
http://t.co/NEeqJTkQhy:) http://t.co/jKTIjQxYoW
Processed Tweet : ['brief', 'introduct', '2', 'earliest', 'histori', 'indian', 'subcontin', 'even', 'bfr', 'maurya']
Label : 1.0


Raw Tweet : @miabellasesso http://t.co/FtI5vLQJks @SBNation and a few select others.. will get to you :)
Processed Tweet : []
Label : 1.0
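
Several of the misclassified tweets above have an empty or nearly empty processed token list even though the raw tweet has an emoticon after a URL. One possible cause, assuming the training data was cleaned with a URL pattern that matches to the end of the line, is that everything after the first link was dropped. The sketch below compares such a pattern with one bounded at whitespace, using the first raw tweet from the output above.

```python
import re

raw = "Get more at http://t.co/aady6CDfB2 :) http://t.co/aXu8Gidwv3"

# Greedy pattern: '.*' runs to the end of the line and takes ':)' with it.
greedy = re.sub(r'https?://.*[\r\n]*', '', raw)

# Bounded pattern: '\S+' stops at the first whitespace, so ':)' survives.
bounded = re.sub(r'https?://\S+', '', raw)

print(repr(greedy))   # 'Get more at '
print(repr(bounded))  # 'Get more at  :) '
```

With the greedy pattern only "get" would remain after stopword removal, which matches the `['get']` token list shown for this tweet.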




## Predict

In [43]:
import re
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [44]:
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of stemmed tokens, with stopwords and punctuation removed
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
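    # remove stock market tickers like $GE (the dollar sign must be escaped,
    # otherwise '$' is an end-of-string anchor and nothing is removed)
    tweet = re.sub(r'\$\w*', '', tweet)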
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks (match only up to the next whitespace, so text
    # that follows a URL, such as an emoticon, is kept)
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove only the hash sign (#) and keep the hashtag word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [51]:
def predict(tweet):
  """Return 1 (positive) or 0 (negative) for a raw tweet string."""
  processed_tweet = process_tweet(tweet)
  ## sum frequencies over unique words, matching the training features
  unique_words = set(processed_tweet)
  pos_freq = np.sum([pos_freq_dict.get(w,0) for w in unique_words])
  neg_freq = np.sum([neg_freq_dict.get(w,0) for w in unique_words])

  pos_prob = (pos_freq + 1)/(N_pos + V)
  neg_prob = (neg_freq + 1)/(N_neg + V)
  log_likelihood = logprior + np.log(pos_prob/neg_prob)
  return 1 if log_likelihood>=0 else 0

In [53]:
tweet = "I hate this movies"
print(predict(tweet))

0


In [54]:
tweet = "I love this movies"
print(predict(tweet))

1
