# Naive Bayes Classifier for Text Classification

### 1. **Introduction to Naive Bayes**

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem**. It is called "naive" because it assumes that the features (words in our case) are conditionally independent given the class label. Despite this strong assumption, Naive Bayes performs surprisingly well, especially in text classification tasks.

### 2. **Bayes' Theorem**

At the heart of the Naive Bayes classifier is **Bayes' Theorem**, which describes the probability of a class $C$ given a set of features $X$. The theorem is expressed as:

$$
P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}
$$

Where:
- $P(C|X)$ is the **posterior** probability of class $C$ given the features $X$.
- $P(X|C)$ is the **likelihood** of observing features $X$ given the class $C$.
- $P(C)$ is the **prior** probability of the class $C$.
- $P(X)$ is the **evidence** or the probability of observing the features $X$ (which remains constant for all classes).

### 3. **Simplification Using Naive Assumption**

In the case of Naive Bayes, we make the assumption that the features are conditionally independent, given the class. This means that the probability of a feature $x_1, x_2, ..., x_n$ occurring together is the product of the individual probabilities of each feature. This simplifies our likelihood term:

$$
P(X|C) = P(x_1, x_2, ..., x_n | C) = \prod_{i=1}^{n} P(x_i | C)
$$

Where:
- $x_1, x_2, ..., x_n$ are the features (in our case, the words in the document).
- $P(x_i | C)$ is the probability of feature $x_i$ occurring given class $C$.

Thus, the posterior probability becomes:

$$
P(C|X) = \frac{P(C) \cdot \prod_{i=1}^{n} P(x_i | C)}{P(X)}
$$

We can ignore $P(X)$ because it is constant for all classes and doesn’t affect the decision of which class is most likely. Therefore, we only need to compute:

$$
P(C|X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i | C)
$$



In [1]:
import numpy as np
import nltk
from datasets import load_dataset
from nltk.corpus import stopwords
import string
from collections import defaultdict, Counter



  from .autonotebook import tqdm as notebook_tqdm


In [2]:
from tqdm import tqdm

In [3]:
# Step 1: Download and load the IMDb dataset from Hugging Face
dataset = load_dataset("imdb")

In [4]:

# Step 2: Preprocess the data
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nmadali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Function to preprocess text (lowercasing, removing punctuation and stopwords)
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = ''.join([char for char in text if char not in string.punctuation])  # Remove punctuation
    tokens = text.split()  # Tokenize by splitting by whitespace
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

In [6]:
# Preprocess the training and test data
train_data = dataset['train']
test_data = dataset['test']

train_text = [preprocess_text(text) for text in train_data['text'] ]
test_text = [preprocess_text(text) for text in test_data['text'] ]

In [7]:
# Step 3: Convert text to Bag of Words
def build_vocab(corpus):
    vocab = defaultdict(int)
    for text in corpus:
        for word in text:
            vocab[word] += 1
    return vocab

def vectorize_data(text, vocab):
    
    vector = np.zeros(len(vocab))
    for word in text:
            if word in vocab:
                vector[list(vocab.keys()).index(word)] += 1
        
    return vector 

In [8]:
# Build vocabulary from training data
vocab = build_vocab(train_text)


In [None]:
### 4. **Class Probabilities**

For a given class $C$, the class probability $P(C)$ is simply the relative frequency of that class in the training data. If we have a dataset with $N$ total samples and $N_C$ samples of class $C$, the class probability is:

$$
P(C) = \frac{N_C}{N}
$$

### 5. **Feature Likelihoods (Word Probabilities)**

The likelihood $P(x_i | C)$ represents the probability of observing the word $x_i$ in class $C$. This is calculated by counting how often the word $x_i$ appears in documents of class $C$, and then dividing by the total number of words in class $C$.

To avoid the issue of zero probabilities (when a word doesn't appear in the training data for a given class), we use **Laplace smoothing**, which ensures that every word has a non-zero probability. The smoothed probability of a word $x_i$ given class $C$ is:

$$
P(x_i | C) = \frac{count(x_i, C) + 1}{|V| + count(C)}
$$

Where:
- $count(x_i, C)$ is the count of how many times the word $x_i$ appears in documents of class $C$.
- $|V|$ is the size of the vocabulary (the number of distinct words).
- $count(C)$ is the total number of words in class $C$.

This is the probability of a word $x_i$ occurring in class $C$ after smoothing.

### 6. **Prediction**

To classify a new document, we compute the posterior probability $P(C|X)$ for each class and choose the class with the highest posterior probability:

$$
\hat{C} = \arg\max_{C} P(C) \cdot \prod_{i=1}^{n} P(x_i | C)
$$

This means we calculate the posterior probabilities for each class and select the class with the highest value. The class with the highest score is the predicted label.

### 7. **Final Formula for Naive Bayes Classification**

Putting everything together, the Naive Bayes classifier predicts the class $C$ for a document $X = (x_1, x_2, ..., x_n)$ by maximizing the following expression:

$$
\hat{C} = \arg\max_{C} \left( P(C) \cdot \prod_{i=1}^{n} P(x_i | C) \right)
$$

Where:
- $P(C)$ is the class prior.
- $P(x_i | C)$ is the likelihood of the word $x_i$ given the class $C$, smoothed using Laplace smoothing.



In [48]:

# Step 4: Naive Bayes Classifier
class NaiveBayes:
    def __init__(self):
        self.class_probs = {}
        self.word_probs = defaultdict(lambda: defaultdict(float))
    
    def fit(self, X, y):
        # Step 4.1: Calculate class probabilities
        class_counts = Counter(y)
        total_count = len(y)
        for label, count in class_counts.items():
            self.class_probs[label] = count / total_count
        
        # Step 4.2: Calculate word probabilities for each class
        word_counts = defaultdict(lambda: defaultdict(int))
        for (tokens, label) in tqdm(zip(X, y)):
                    for word, count in Counter(tokens).items():
                            word_counts[label][word] += count

        
        # Step 4.3: Apply Laplace smoothing and calculate probabilities
        for label in word_counts:
            total_words_in_class = sum(word_counts[label].values()) + len(vocab)
            for word in word_counts[label]:
                self.word_probs[label][word] = (word_counts[label][word] + 1) / total_words_in_class

    def predict(self, X):
        predictions = []
        for tokens in tqdm(X):
            class_scores = {}
            for label, class_prob in self.class_probs.items():
                score = np.log(class_prob)
                for word, count in Counter(tokens).items():
                        
                        if word in self.word_probs[label]:
                            score += np.log(self.word_probs[label][word])
                        else:
                            score += np.log(1 / (sum(self.class_probs.values()) + len(vocab)))  # Smoothing
                class_scores[label] = score
            predictions.append(max(class_scores, key=class_scores.get))
        return predictions

### 8. **Example: Classification of IMDb Reviews**

For example, if we have a review with the words "great movie", we compute the posterior probabilities for both classes "positive" and "negative" based on:
- The prior probabilities $P(\text{positive})$ and $P(\text{negative})$,
- The likelihoods $P(\text{great} | \text{positive})$, $P(\text{movie} | \text{positive})$, $P(\text{great} | \text{negative})$, and $P(\text{movie} | \text{negative})$.

The class with the higher posterior probability is the predicted label for the review.



In [49]:
# Step 5: Train Naive Bayes classifier
nb = NaiveBayes()
y_train = train_data['label']
nb.fit(train_text, y_train)

25000it [00:00, 38177.90it/s]


In [50]:
# Step 6: Predict on the test data
y_test = test_data['label']
y_pred = nb.predict(test_text)

100%|███████████████████████████████████| 25000/25000 [00:07<00:00, 3349.79it/s]


In [51]:
# Step 7: Evaluate the model
def evaluate(y_true, y_pred):
    accuracy = np.mean(np.array(y_true) == np.array(y_pred))
    print(f"Accuracy: {accuracy:.4f}")
    
    # Classification report (Precision, Recall, F1)
    tp, fp, fn, tn = 0, 0, 0, 0
    for true, pred in zip(y_true, y_pred):
        if true == 1 and pred == 1:
            tp += 1
        elif true == 0 and pred == 1:
            fp += 1
        elif true == 1 and pred == 0:
            fn += 1
        elif true == 0 and pred == 0:
            tn += 1

    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")

### 9. **Evaluation**

To evaluate the performance of the Naive Bayes classifier, we use metrics such as **accuracy**, **precision**, **recall**, and **F1-score**:

- **Accuracy**: Measures the overall correctness of the model.
- **Precision**: The proportion of positive predictions that were actually positive.
- **Recall**: The proportion of actual positives that were correctly predicted.
- **F1-score**: The harmonic mean of precision and recall, balancing both.


In [52]:


evaluate(y_test, y_pred)


Accuracy: 0.8288
Precision: 0.8642
Recall: 0.7802
F1-Score: 0.8200
