<a href="https://colab.research.google.com/github/isuruK2003/SentimentAnalysis/blob/main/sentiment_analysis_with_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sentiment Analysis with Naive Bayes

#### 1) Importing Data

In [None]:
from os import getcwd
import numpy as np
import nltk
from nltk.corpus import twitter_samples

In [None]:
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [None]:
# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]

test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

In [None]:
# Combine the positive and negative tweets into one list per split

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Build the labels

# train_x holds the positive tweets first (indices 0 to len(train_pos) - 1),
# followed by the negative tweets (indices len(train_pos) to len(train_x) - 1).
# The code below creates an array of ones for the first portion and an array of zeros
# for the second portion, then joins them into one NumPy array with the same length as train_x.
# The same procedure is applied to test_x.

train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
train_y = np.squeeze(train_y.tolist())

test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
test_y = np.squeeze(test_y.tolist())  # flatten (n, 1) to (n,); labels must align with test_x

print(train_y, test_y)

[1. 1. 1. ... 0. 0. 0.] [1. 1. 1. ... 0. 0. 0.]
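
The label construction above can also be written with `np.concatenate`, which avoids the reshape-and-squeeze round trip. The following is a minimal sketch on made-up data (the lists `pos` and `neg` are illustrative stand-ins, not the notebook's tweets); it also checks that tweets and labels stay aligned.

```python
import numpy as np

# Hypothetical toy data standing in for train_pos / train_neg
pos = ["great :)", "love it"]
neg = ["awful :(", "so sad", "hate it"]

x = pos + neg
# 1.0 for every positive tweet, followed by 0.0 for every negative tweet
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

# Labels must line up one-to-one with the tweets
assert len(x) == len(y)
print(y)
```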


#### 2) Cleaning Data

In [None]:
import re
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [None]:
def tokenize_tweet(tweet:str):
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    return tokenizer.tokenize(tweet)

print("Quick Test:", tokenize_tweet(test_x[0]))

Quick Test: ['bro', ':', 'u', 'wan', 'cut', 'hair', 'anot', ',', 'ur', 'hair', 'long', 'liao', 'bo', 'me', ':', 'since', 'ord', 'liao', ',', 'take', 'it', 'easy', 'lor', 'treat', 'as', 'save', '$', 'leave', 'it', 'longer', ':)', 'bro', ':', 'lol', 'sibei', 'xialan']


In [None]:
def process_tweet(tweet):
    tweet = re.sub(r'\$\w*', '', tweet) # remove stock market tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet) # remove old style retweet text "RT"
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet) # remove hyperlinks
    tweet = re.sub(r'#', '', tweet) # remove only the hash sign (#), keeping the word

    tweet_cleaned = []
    stemmer = PorterStemmer()
    stopwords_english = set(stopwords.words('english')) # English stopwords, as a set for fast lookup

    for word in tokenize_tweet(tweet):
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            stem_word = stemmer.stem(word)  # stemming word
            tweet_cleaned.append(stem_word)
    return tweet_cleaned

print("Quick Test:\n", train_neg[1], "\n", process_tweet(train_neg[1]))

Quick Test:
 Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :( 
 ['everyth', 'kid', 'section', 'ikea', 'cute', 'shame', "i'm", 'nearli', '19', '2', 'month', ':(']
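
To see what the four regular expressions in `process_tweet` do on their own, here is a small sketch that applies only the substitutions, without tokenizing or stemming. The helper name `strip_noise` and the sample tweet are made up for illustration.

```python
import re

def strip_noise(tweet: str) -> str:
    """Apply the four regex substitutions in order (hypothetical helper)."""
    tweet = re.sub(r'\$\w*', '', tweet)                # stock tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)             # leading "RT" retweet marker
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                    # the '#' sign only; the word stays
    return tweet

sample = "RT $GE is up! #happy https://example.com/x :)"
print(repr(strip_noise(sample)))
```

The substitutions leave extra whitespace behind, which is harmless here because the tokenizer splits on whitespace afterwards.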


#### 3) Classifier using Naive Bayes

#### Bayes' Theorem

- Bayes' theorem describes the probability of an event, based on prior knowledge of conditions related to the event. It is widely used in statistics and machine learning to update the probability of a hypothesis given new evidence.

$$P(X \mid Y) = \frac{P(Y \mid X) \cdot P(X)}{P(Y)}$$
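
As a worked example of the formula, the sketch below computes the probability that a tweet is positive given that it contains a particular word. All three input probabilities are hypothetical numbers chosen for easy arithmetic, not values measured from the dataset.

```python
# Hypothetical inputs: X = "tweet is positive", Y = "tweet contains the word"
p_pos = 0.5               # P(X): prior probability of a positive tweet
p_word_given_pos = 0.02   # P(Y | X)
p_word_given_neg = 0.005  # P(Y | not X)

# P(Y) via the law of total probability
p_word = p_word_given_pos * p_pos + p_word_given_neg * (1 - p_pos)

# Bayes' theorem: P(X | Y) = P(Y | X) * P(X) / P(Y)
p_pos_given_word = p_word_given_pos * p_pos / p_word
print(round(p_pos_given_word, 3))
```

With these numbers the word shifts the probability of a positive tweet from the prior of 0.5 up to about 0.8.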

#### Application of Bayes' Theorem in Naive Bayes Classification

- Naive Bayes classification is a simple probabilistic classifier based on Bayes' Theorem. It assumes that the features are conditionally independent given the class label. This assumption greatly simplifies the computation and allows the model to handle large datasets efficiently.

- The Naive Bayes classifier calculates the posterior probability of each class given the observed features and chooses the class with the highest posterior probability. Specifically, for a dataset with features $X = (X_1, X_2, ..., X_n)$, the classification rule is based on Bayes' Theorem:

$$P(Y \mid X_1, X_2, ..., X_n) = \frac{P(X_1, X_2, ..., X_n \mid Y) \cdot P(Y)}{P(X_1, X_2, ..., X_n)}$$

- Using the **naive assumption** of conditional independence, we can simplify the likelihood term:

$$P(X_1, X_2, ..., X_n \mid Y) = P(X_1 \mid Y) \cdot P(X_2 \mid Y) \cdot \dots \cdot P(X_n \mid Y)$$

- Thus, the Naive Bayes classification rule becomes:

$$\hat{Y} = \arg \max_Y P(Y) \prod_{i=1}^n P(X_i \mid Y)$$

Where:
- $P(Y)$ is the prior probability of the class.
- $P(X_i \mid Y)$ is the likelihood of feature $X_i$ given class $Y$.

- This approach is particularly useful for text classification tasks, like spam detection or sentiment analysis, where features (e.g., words) are treated as independent given the class.

- Despite the strong assumption of independence, Naive Bayes often performs surprisingly well and is easy to implement and understand.
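
The classification rule can be sketched in a few lines. The example below is only an illustration under stated assumptions: the priors and per-class word likelihoods are invented, and the product is computed as a sum of logarithms, a common way to avoid numerical underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical P(word | class) tables; not taken from the notebook's data
likelihood = {
    "pos": {"happi": 0.6, "sad": 0.1, "day": 0.3},
    "neg": {"happi": 0.1, "sad": 0.6, "day": 0.3},
}
prior = {"pos": 0.5, "neg": 0.5}  # hypothetical P(class)

def classify(words):
    """Return the class maximizing log P(Y) + sum_i log P(X_i | Y)."""
    scores = {}
    for label in prior:
        score = math.log(prior[label])
        for w in words:
            score += math.log(likelihood[label][w])
        scores[label] = score
    return max(scores, key=scores.get)

print(classify(["happi", "day"]))
print(classify(["sad", "sad", "happi"]))
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest posterior probability.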


In [None]:
def compute_freqs(train_x:list[str], train_y:list[float]):
    freqs = {}
    for tweet, sentiment in zip(train_x, train_y):
        for word in process_tweet(tweet):
            pair = (word, sentiment)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
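
To make the shape of the resulting dictionary concrete, here is a sketch of the same counting idea using `collections.Counter`. It assumes the tweets are already tokenized (so `process_tweet` is skipped), and the toy tweets are made up.

```python
from collections import Counter

# Hypothetical, already-tokenized tweets with labels (1.0 = positive, 0.0 = negative)
tokenized = [
    (["happi", "day"], 1.0),
    (["sad", "day"], 0.0),
    (["happi"], 1.0),
]

freqs = Counter()
for words, sentiment in tokenized:
    for word in words:
        # Key is the (word, sentiment) pair, as in compute_freqs
        freqs[(word, sentiment)] += 1

print(freqs[("happi", 1.0)])
print(freqs[("day", 0.0)])
print(freqs[("sad", 1.0)])  # a Counter returns 0 for pairs it has never seen
```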

In [None]:
import pandas as pd

In [None]:
freqs = compute_freqs(train_x, train_y)

# Initialize the dictionary to store word statistics
df = {}

V = len(set([pair[0] for pair in freqs.keys()]))  # Total number of unique words
N_pos = 0  # Total number of positive occurrences
N_neg = 0  # Total number of negative occurrences

# Populate the dictionary with word statistics
for pair, freq in freqs.items():
    word = pair[0]
    sentiment = pair[1]
    if word not in df:
        df[word] = {
            "pos_freq": 0,
            "neg_freq": 0
        }
    if sentiment == 1:  # Positive sentiment
        df[word]["pos_freq"] += freq
        N_pos += freq
    elif sentiment == 0:  # Negative sentiment
        df[word]["neg_freq"] += freq
        N_neg += freq

# Convert the dictionary into a DataFrame
df = pd.DataFrame.from_dict(df, orient="index")

# Add probabilities and log ratio to the DataFrame
df["pos_prob"] = (df["pos_freq"] + 1) / (N_pos + V)  # Laplace smoothing
df["neg_prob"] = (df["neg_freq"] + 1) / (N_neg + V)  # Laplace smoothing
df["prob_log_ratio"] = np.log(df["pos_prob"] / df["neg_prob"])



In [None]:
df

| word | pos_freq | neg_freq | pos_prob | neg_prob | prob_log_ratio |
|---|---|---|---|---|---|
| followfriday | 23 | 0 | 0.000654 | 0.000028 | 3.166931 |
| top | 30 | 5 | 0.000845 | 0.000165 | 1.631105 |
| engag | 7 | 0 | 0.000218 | 0.000028 | 2.068318 |
| member | 14 | 6 | 0.000409 | 0.000193 | 0.751017 |
| commun | 27 | 1 | 0.000763 | 0.000055 | 2.627934 |
| ... | ... | ... | ... | ... | ... |
| dislik | 0 | 1 | 0.000027 | 0.000055 | -0.704270 |
| burdensom | 0 | 1 | 0.000027 | 0.000055 | -0.704270 |
| amelia | 0 | 1 | 0.000027 | 0.000055 | -0.704270 |
| melon | 0 | 1 | 0.000027 | 0.000055 | -0.704270 |
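
The smoothed probabilities and the log ratio can be reproduced by hand for a single word. The sketch below uses hypothetical totals (`n_pos`, `n_neg`, `vocab_size`) chosen for easy arithmetic rather than the notebook's actual counts, and the function name is made up.

```python
import math

def smoothed_log_ratio(pos_freq, neg_freq, n_pos, n_neg, vocab_size):
    """log(P(w | pos) / P(w | neg)) with add-one (Laplace) smoothing."""
    pos_prob = (pos_freq + 1) / (n_pos + vocab_size)
    neg_prob = (neg_freq + 1) / (n_neg + vocab_size)
    return math.log(pos_prob / neg_prob)

# Hypothetical totals, not the values computed in the notebook
n_pos, n_neg, vocab_size = 90, 90, 10

# A word never seen in negative tweets still gets a finite score: log(4 / 1)
print(smoothed_log_ratio(3, 0, n_pos, n_neg, vocab_size))
# A word equally frequent in both classes scores 0
print(smoothed_log_ratio(2, 2, n_pos, n_neg, vocab_size))
```

Without the `+ 1` in the numerator, the first call would divide by zero, which is the situation smoothing is meant to prevent.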


In [None]:
log_prior = np.log(len(train_pos)/len(train_neg))

In [None]:
def get_log_likelihood(tweet):
    log_likelihood = 0
    for word in process_tweet(tweet):
        if word not in df.index:
            continue
        log_likelihood += df.loc[word, "prob_log_ratio"]
    return log_likelihood

In [None]:
def predict(tweet:str):
    # Naive Bayes score: log prior plus the summed log-likelihood ratios
    score = float(log_prior + get_log_likelihood(tweet))
    if score == 0:
        raise ValueError("Ambiguous Prediction")
    return float(score > 0) # 1.0 = positive, 0.0 = negative

#### 4) Testing

In [None]:
true_sum = 0
false_sum = 0
ambiguous_count = 0

for tweet, y in zip(test_x, test_y):
    try:
        if predict(tweet) == float(y):
            true_sum += 1
        else:
            false_sum += 1
    except ValueError:
        ambiguous_count += 1

total_predictions = true_sum + false_sum
accuracy = (true_sum / total_predictions) * 100 if total_predictions > 0 else 0

print(f"Accuracy: {accuracy:.3f} %")
print(f"Ambiguous Count: {ambiguous_count}")

Accuracy: 49.975 %
Ambiguous Count: 1
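
Because `zip` silently stops at the shorter of its inputs, a length mismatch between tweets and labels goes unnoticed in a loop like the one above. The following is a minimal sketch of an accuracy helper that checks alignment first; the helper name, the toy predictor, and the sample data are all hypothetical.

```python
def accuracy(predict_fn, xs, ys):
    """Fraction of correct predictions; refuses misaligned inputs."""
    if len(xs) != len(ys):
        raise ValueError(f"{len(xs)} inputs but {len(ys)} labels")
    correct = sum(predict_fn(x) == y for x, y in zip(xs, ys))
    return correct / len(xs)

def toy_predict(tweet):
    # Toy stand-in for predict(): positive if the tweet contains ":)"
    return 1.0 if ":)" in tweet else 0.0

xs = ["nice :)", "bad :(", "ok :)", "meh"]
ys = [1.0, 0.0, 0.0, 0.0]
print(accuracy(toy_predict, xs, ys))
```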


#### 5) Conclusion

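Naive Bayes classifiers are simple to build and are expected to give reasonable accuracy. In the experiment above, however, the measured accuracy was just under 50%, which is about what random guessing achieves on a balanced test set. The Naive Bayes algorithm itself is unlikely to be the cause; this notebook attributes the result to the "naive" approach taken to data preprocessing, although a score this close to chance also warrants checking that `test_y` is aligned with `test_x`. The second part of this experiment will therefore use a more sophisticated approach to data preprocessing.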

In [None]:
# To manually test the model

def classify_tweet(tweet):
    class_map = ["Negative", "Positive", "Neutral"]
    try:
        return class_map[int(predict(tweet))]
    except ValueError:
        return class_map[-1]  # ambiguous score, reported as "Neutral"

In [None]:
# Examples
classify_tweet("i love oop :)")

'Positive'