# PA1.2 Naive Bayes for Text Classification

### Introduction

In this notebook, you will implement a Naive Bayes model that classifies sentences according to the emotion they express.

The Naive Bayes model is a probabilistic model that uses Bayes' Theorem to calculate the probability of a label given some observed features. In this case, we will be using the Naive Bayes model to calculate the probability of a sentence belonging to a certain emotion given the words in the sentence.

For reference and additional details, please go through [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf) of the SLP3 book.


### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions, Plagiarism Policy, and Late Days Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span>

- <span style="color: red;">You must attempt all parts.</span>

In [None]:
pip install numpy pandas matplotlib seaborn scikit-learn




In [None]:
# import all required libraries here
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder


## Loading and Preprocessing the Dataset

We will be working with the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) dataset. This contains 6 classes of emotions: `joy`, `sadness`, `anger`, `fear`, `love`, and `surprise`.

Instead of downloading the dataset manually, we will use the [`datasets`](https://huggingface.co/docs/datasets) library to download it for us. This library is part of the HuggingFace ecosystem and makes it easy to download and use datasets for NLP tasks. Beyond downloading, it provides a standard interface for accessing the data, which makes it easy to use with other libraries like Pandas and PyTorch. You can browse the full list of available datasets [here](https://huggingface.co/datasets).

In the following cells, you will:

1. Load in the dataset (It should already be split into train, validation, and test sets.)

2. Define a dictionary mapping the emotion labels to integers. You can find these on the dataset page linked above.

3. Format each split of the dataset into a Pandas DataFrame. The columns should be `text` and `label`, where `text` is the sentence and `label` is the emotion label.
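As a rough illustration of steps 2 and 3, the sketch below builds a DataFrame from a stand-in for one split and attaches readable label names. The names `id_to_label`, `toy_split`, and `toy_df` are hypothetical, and the two sentences are copied from the training preview shown later in this notebook; the real splits come from `load_dataset`.

```python
import pandas as pd

# Integer-to-name mapping, as listed on the dataset page.
id_to_label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Stand-in for one dataset split: a dict of equal-length columns (hypothetical).
toy_split = {
    "text": ["i didnt feel humiliated", "i am feeling grouchy"],
    "label": [0, 3],
}

toy_df = pd.DataFrame({"text": toy_split["text"], "label": toy_split["label"]})

# Add a readable column without overwriting the integer labels the model needs.
toy_df["label_name"] = toy_df["label"].map(id_to_label)
print(toy_df)
```

Keeping the integer `label` column intact means the same DataFrame can be passed straight to a classifier, while `label_name` is only there for inspection.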

In [None]:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset("dair-ai/emotion")

train_data = dataset["train"]
validation_data = dataset["validation"]
test_data = dataset["test"]







In [None]:
dataset_label_mapping = {
    0: "sadness",
    1: "joy",
    2 : "love",
    3: "anger",
    4: "fear",
    5: "surprise"

}
print(dataset_label_mapping)

{0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}


In [None]:
train_df = pd.DataFrame({"text": train_data["text"], "label": train_data["label"]})
validation_df = pd.DataFrame({"text": validation_data["text"], "label": validation_data["label"]})
test_df = pd.DataFrame({"text": test_data["text"], "label": test_data["label"]})
print("Test DataFrame:")
print(test_df.head())


Test DataFrame:
                                                text  label
0  im feeling rather rotten so im not very ambiti...      0
1          im updating my blog because i feel shitty      0
2  i never make her separate from me because i do...      0
3  i left with my bouquet of red and yellow tulip...      1
4    i was feeling a little vain when i did this one      0


In [None]:

train_shape = train_df.shape
validation_shape = validation_df.shape
test_shape = test_df.shape


print("Train DataFrame Shape:", train_shape)
print("Validation DataFrame Shape:", validation_shape)
print("Test DataFrame Shape:", test_shape)

Train DataFrame Shape: (16000, 2)
Validation DataFrame Shape: (2000, 2)
Test DataFrame Shape: (2000, 2)


Now that we've gotten a feel for the dataset, we might want to do some cleaning or preprocessing before continuing. For example, we might want to remove punctuation and other non-alphanumeric characters, lowercase all the text, strip away extra whitespace, and remove stopwords.

In the cell below, write a function that performs the steps described above. You can use the `re` library to help with the cleaning, and the `nltk` library to help with removing stopwords.

Once you are done, you can simply `apply` this function to the `text` column of the dataset to get the preprocessed text.
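To make the cleaning steps concrete, here is a minimal standard-library sketch. It assumes a tiny, hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's English list, and `toy_preprocess` is an illustrative name rather than the function the assignment expects.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"i", "a", "am", "the", "to", "so"}

def toy_preprocess(text):
    # Lowercase first so stopword matching is case-insensitive.
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop stopwords and rejoin the remaining tokens.
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

print(toy_preprocess("  I am SO   happy, today!! "))  # happy today
```

On a DataFrame, a function like this would be applied column-wise, for example `df["text"].apply(toy_preprocess)`.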

In [None]:

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, then strip punctuation and collapse repeated whitespace.
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize and drop stopwords and any remaining non-alphanumeric tokens.
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in STOP_WORDS]

    return ' '.join(filtered_tokens)


train_df["text_preprocessed"] = train_df["text"].apply(preprocess_text)
validation_df["text_preprocessed"] = validation_df["text"].apply(preprocess_text)
test_df["text_preprocessed"] = test_df["text"].apply(preprocess_text)


print("Training DataFrame after preprocessing:")
print(train_df[["text", "text_preprocessed", "label"]].head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Training DataFrame after preprocessing:
                                                text  \
0                            i didnt feel humiliated   
1  i can go from feeling so hopeless to so damned...   
2   im grabbing a minute to post i feel greedy wrong   
3  i am ever feeling nostalgic about the fireplac...   
4                               i am feeling grouchy   

                                   text_preprocessed  label  
0                              didnt feel humiliated      0  
1  go feeling hopeless damned hopeful around some...      0  
2          im grabbing minute post feel greedy wrong      3  
3  ever feeling nostalgic fireplace know still pr...      2  
4                                    feeling grouchy      3  


### Vectorizing sentences with Bag of Words

Now that we have loaded our data, we need to vectorize our sentences: the model operates on numbers, so each sentence must be converted into a numeric vector before it can be fed in.

We will be using a Bag of Words approach to vectorize our sentences. This is a simple approach that counts the number of times each word appears in a sentence.

The element at index $\text{i}$ of the vector will be the number of times the $\text{i}^{\text{th}}$ word in our vocabulary appears in the sentence. So, for example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, and our sentence is `"the cat sat on the mat"`, then our vector will be `[2, 1, 1, 1, 1]`.
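The worked example above can be checked with a few lines of plain Python. `toy_vectorize` is a hypothetical helper used only for this check; it is a sketch of the counting step, not the full `BagOfWords` class.

```python
from collections import Counter

vocabulary = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def toy_vectorize(sentence):
    # Start from an all-zero vector with one slot per vocabulary word.
    vector = [0] * len(vocabulary)
    for word, count in Counter(sentence.split()).items():
        if word in word_to_index:  # out-of-vocabulary words are skipped
            vector[word_to_index[word]] = count
    return vector

print(toy_vectorize("the cat sat on the mat"))  # [2, 1, 1, 1, 1]
print(toy_vectorize("the dog sat"))             # [1, 0, 1, 0, 0]
```

The second call shows what happens to a word that was never seen when the vocabulary was built: it contributes nothing to the vector.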

You will now create a `BagOfWords` class to vectorize our sentences. This will involve creating

1. A vocabulary from our corpus

2. A mapping from words to indices in our vocabulary

3. A function to vectorize a sentence in the fashion described above

It may help you to define something along the lines of a `fit` and a `vectorize` method.

In [None]:
from collections import Counter

class BagOfWords:
    def __init__(self):
        self.vocabulary = None
        self.word_to_index = None

    def fit(self, corpus):

        preprocessed_sentences = [self._preprocess_text(sentence) for sentence in corpus]
        flat_vocabulary = [word for sentence in preprocessed_sentences for word in sentence.split()]
        word_frequencies = Counter(flat_vocabulary)
        self.vocabulary = list(word_frequencies.keys())
        self.word_to_index = {word: index for index, word in enumerate(self.vocabulary)}

    def vectorize(self, sentence):

        preprocessed_sentence = self._preprocess_text(sentence)

        vector = [0] * len(self.vocabulary)


        word_frequencies = Counter(preprocessed_sentence.split())

        for word, frequency in word_frequencies.items():
            if word in self.word_to_index:
                vector[self.word_to_index[word]] = frequency

        return vector

    def _preprocess_text(self, text):
        return preprocess_text(text)



bow = BagOfWords()



For a sanity check, you can manually set the vocabulary of your `BagOfWords` object to the vocabulary of the example above, and check that the vectorization of the sentence is correct.

Once you have implemented the `BagOfWords` class, fit it to the training data, and vectorize the training, validation, and test data.

In [None]:

bow.fit(train_df["text_preprocessed"].tolist())

train_vectors = [bow.vectorize(sentence) for sentence in train_df["text_preprocessed"].tolist()]
validation_vectors = [bow.vectorize(sentence) for sentence in validation_df["text_preprocessed"].tolist()]
test_vectors = [bow.vectorize(sentence) for sentence in test_df["text_preprocessed"].tolist()]

# Display an example of the vectorized data
print("Example of a Vectorized Sentence from the Training Data:")

print(train_vectors[2])

Example of a Vectorized Sentence from the Training Data:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ... (output truncated)

## Naive Bayes

### From Scratch

Now that we have vectorized our sentences, we can implement our Naive Bayes model. Recall that the Naive Bayes model is based on Bayes' Theorem:

$$
P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)}
$$

What we really want is to find the class $c$ that maximizes $P(c \mid x)$. Since the denominator $P(x)$ is the same for every class, it does not affect the argmax and can be dropped:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(x \mid c)P(c)
$$

We can then use the Naive Bayes assumption, that the words are conditionally independent given the class, to simplify this:

$$
\hat{c} = \underset{c}{\text{argmax}} \ P(c \mid x) = \underset{c}{\text{argmax}} \ P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

where $x_i$ is the $i^{\text{th}}$ word in our sentence and $n$ is the number of words in the sentence.

All of these probabilities can be estimated from our training data. We can estimate $P(c)$ by counting the number of times each class appears in our training data, and dividing by the total number of training examples. We can estimate $P(x_i \mid c)$ by counting the number of times the word $x_i$ appears in sentences of class $c$, and dividing by the total number of words in sentences of class $c$.
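As a small sketch of these two estimates, the snippet below computes them with `numpy` on a made-up count matrix (4 sentences, 3 vocabulary words). The numbers and variable names are purely illustrative.

```python
import numpy as np

# Toy bag-of-words counts: one row per sentence, one column per vocabulary word.
X = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 1],
    [0, 1, 1],
])
y = np.array([0, 0, 1, 1])  # class label of each sentence

classes, class_counts = np.unique(y, return_counts=True)
priors = class_counts / len(y)  # P(c)

# Row c holds the total count of each vocabulary word across sentences of class c.
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
likelihoods = word_counts / word_counts.sum(axis=1, keepdims=True)  # P(x_i | c)

print(priors)       # [0.5 0.5]
print(likelihoods)  # rows: [0.6, 0.2, 0.2] and [0, 2/3, 1/3]
```

Note that the first word never occurs in class 1, so its unsmoothed estimate is exactly 0 and its logarithm is undefined. This is why add-one (Laplace) smoothing, discussed in the SLP3 chapter, is normally applied to the counts before dividing.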

It would help to apply logarithms to the above equation so that we translate the product into a sum, and avoid underflow errors. This will give us the following equation:

$$
\hat{c} = \underset{c}{\text{argmax}} \ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c)
$$
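The underflow problem is easy to reproduce. The sketch below multiplies 200 hypothetical word likelihoods of $10^{-5}$ each (made-up values) and compares the result with the sum of their logarithms.

```python
import math

# 200 hypothetical word likelihoods, each equal to 1e-5.
probs = [1e-5] * 200

# The true product is 1e-1000, far below the smallest positive float64.
product = math.prod(probs)

# The same quantity in log space stays comfortably representable.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -2302.6
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so nothing is lost by working in log space.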

You will now implement this algorithm. It would help to go through [this chapter from SLP3](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to get a better understanding of the model - **it is recommended that you base your implementation on the pseudocode provided on Page 6**. You can either make a `NaiveBayes` class, or just implement the algorithm across two functions.

<span style="color: red;"> For this part, the only external library you will need is `numpy`. You are not allowed to use anything else.</span>

In [None]:


class NaiveBayes:
    def __init__(self, alpha=1):
        self.alpha = alpha  # Laplace (add-alpha) smoothing parameter
        self.classes = None
        self.class_probabilities = None
        self.word_probabilities = None  # holds log P(x_i | c), one row per class
        self.vocabulary_size = None

    def fit(self, X, y):
        # X is a list of vectorized sentences
        # y is a list of corresponding integer labels (0 to 5 for the six emotions)

        X = np.array(X)
        y = np.array(y)

        # Calculating class probabilities P(c)
        self.classes, class_counts = np.unique(y, return_counts=True)
        self.class_probabilities = class_counts / len(y)

        # Calculating log word probabilities log P(x_i | c)
        self.vocabulary_size = X.shape[1]
        self.word_probabilities = np.zeros((len(self.classes), self.vocabulary_size))

        # Index rows by position so labels need not be exactly 0..K-1
        for row, c in enumerate(self.classes):
            word_counts = np.sum(X[y == c], axis=0)
            total_words_in_class = np.sum(word_counts)

            # Laplace smoothing
            smoothed_word_probs = (word_counts + self.alpha) / (total_words_in_class + self.alpha * self.vocabulary_size)
            self.word_probabilities[row, :] = np.log(smoothed_word_probs)

    def predict(self, X):
        X = np.array(X)
        # Per-class score: log P(c) + sum_i count_i * log P(x_i | c)
        log_likelihoods = np.dot(X, self.word_probabilities.T) + np.log(self.class_probabilities)

        # Map the best-scoring row back to its class label
        return self.classes[np.argmax(log_likelihoods, axis=1)]


Now use your implementation to train a Naive Bayes model on the training data, and generate predictions for the Validation Set.

Report the Accuracy, Precision, Recall, and F1 score of your model on the validation data. Also display the Confusion Matrix. You are allowed to use `sklearn.metrics` for this.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

naive_bayes_model = NaiveBayes()
naive_bayes_model.fit(train_vectors, train_df["label"])
validation_predictions = naive_bayes_model.predict(validation_vectors)

accuracy = accuracy_score(validation_df["label"], validation_predictions)
precision = precision_score(validation_df["label"], validation_predictions, average='weighted')
recall = recall_score(validation_df["label"], validation_predictions, average='weighted')
f1 = f1_score(validation_df["label"], validation_predictions, average='weighted')
conf_matrix = confusion_matrix(validation_df["label"], validation_predictions)


print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:")
print(conf_matrix)


Accuracy: 0.788
Precision: 0.8107206326104295
Recall: 0.788
F1 Score: 0.7649655106714165

Confusion Matrix:
[[519  20   1   5   5   0]
 [ 31 668   3   2   0   0]
 [ 38  76  61   2   1   0]
 [ 52  30   0 189   4   0]
 [ 49  24   0   8 129   2]
 [ 31  30   0   1   9  10]]
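The reported metrics can be recovered directly from the confusion matrix above, which also clarifies what `average='weighted'` does. The sketch below uses only `numpy` and assumes the `sklearn` convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Validation confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([
    [519,  20,   1,   5,   5,   0],
    [ 31, 668,   3,   2,   0,   0],
    [ 38,  76,  61,   2,   1,   0],
    [ 52,  30,   0, 189,   4,   0],
    [ 49,  24,   0,   8, 129,   2],
    [ 31,  30,   0,   1,   9,  10],
])

support = cm.sum(axis=1)       # number of true examples per class
true_positives = np.diag(cm)   # correctly classified examples per class

accuracy = true_positives.sum() / cm.sum()
recall_per_class = true_positives / support
precision_per_class = true_positives / cm.sum(axis=0)

# 'weighted' averaging weights each class's score by its share of the true examples.
weights = support / support.sum()
weighted_recall = (weights * recall_per_class).sum()
weighted_precision = (weights * precision_per_class).sum()

print(accuracy, weighted_recall, weighted_precision)
print(recall_per_class.round(3))
```

Weighted recall always equals accuracy, which is why both are reported as 0.788. The per-class recalls show what the single number hides: the rarer classes `love` (label 2) and `surprise` (label 5) are recalled far less often than `sadness` and `joy`.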


### Using `sklearn`

Now that you have implemented your own Naive Bayes model, you will use the `sklearn` library to train a Naive Bayes model on the same data. Alongside this, you will use their implementation of the Bag of Words model, the `CountVectorizer` class, to vectorize your sentences.

You can use the `MultinomialNB` class to train a Naive Bayes model. Go through the relevant documentation to figure out how to use it, and how it differs from the model you implemented.
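One difference worth knowing about lies in tokenization rather than in the classifier: `MultinomialNB` defaults to `alpha=1.0`, which is the same add-one smoothing, but `CountVectorizer` by default only keeps tokens of two or more word characters. The sketch below imitates that with the documented default `token_pattern` using `re` alone; the sentence is made up, and whether this accounts for any particular gap in metrics is an assumption, not something verified here.

```python
import re

# Default token_pattern documented for CountVectorizer: tokens of 2+ word characters.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

sentence = "u r my bff 4 life"  # made-up example text

# Whitespace splitting, as a hand-written bag of words might do.
whitespace_tokens = sentence.split()
# What the default pattern would keep (CountVectorizer also lowercases by default).
pattern_tokens = re.findall(SKLEARN_DEFAULT_PATTERN, sentence.lower())

print(whitespace_tokens)  # ['u', 'r', 'my', 'bff', '4', 'life']
print(pattern_tokens)     # ['my', 'bff', 'life']
```

Single-character tokens that survive a whitespace-based vocabulary are therefore absent from the `CountVectorizer` vocabulary, so the two models may see slightly different feature vectors for the same text.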

When you finish training your model, report the same metrics as above on the Validation Set.

In [None]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


vectorizer = CountVectorizer()
X_train_sklearn = vectorizer.fit_transform(train_df["text_preprocessed"])
X_validation_sklearn = vectorizer.transform(validation_df["text_preprocessed"])

nb_model_sklearn = MultinomialNB()
nb_model_sklearn.fit(X_train_sklearn, train_df["label"])

validation_predictions_sklearn = nb_model_sklearn.predict(X_validation_sklearn)

accuracy_sklearn = accuracy_score(validation_df["label"], validation_predictions_sklearn)
precision_sklearn = precision_score(validation_df["label"], validation_predictions_sklearn, average='weighted')
recall_sklearn = recall_score(validation_df["label"], validation_predictions_sklearn, average='weighted')
f1_sklearn = f1_score(validation_df["label"], validation_predictions_sklearn, average='weighted')
conf_matrix_sklearn = confusion_matrix(validation_df["label"], validation_predictions_sklearn)

print("Sklearn Naive Bayes Model Metrics:")
print("Accuracy:", accuracy_sklearn)
print("Precision:", precision_sklearn)
print("Recall:", recall_sklearn)
print("F1 Score:", f1_sklearn)
print("\nConfusion Matrix:")
print(conf_matrix_sklearn)

Sklearn Naive Bayes Model Metrics:
Accuracy: 0.7885
Precision: 0.8110843396828652
Recall: 0.7885
F1 Score: 0.7654008327670031

Confusion Matrix:
[[519  20   1   5   5   0]
 [ 30 669   3   2   0   0]
 [ 38  76  61   2   1   0]
 [ 52  30   0 189   4   0]
 [ 49  24   0   8 129   2]
 [ 31  30   0   1   9  10]]
