# Naive Bayes Classifier for Text Classification

### 1. **Introduction to Naive Bayes**

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem**. It is called "naive" because it assumes that the features (words in our case) are conditionally independent given the class label. Despite this strong assumption, Naive Bayes performs surprisingly well, especially in text classification tasks.

### 2. **Bayes' Theorem**

At the heart of the Naive Bayes classifier is **Bayes' Theorem**, which describes the probability of a class $C$ given a set of features $X$. The theorem is expressed as:

$$
P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}
$$

Where:
- $P(C|X)$ is the **posterior** probability of class $C$ given the features $X$.
- $P(X|C)$ is the **likelihood** of observing features $X$ given the class $C$.
- $P(C)$ is the **prior** probability of the class $C$.
- $P(X)$ is the **evidence** or the probability of observing the features $X$ (which remains constant for all classes).
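
As a small numeric illustration of the theorem, the sketch below plugs made-up values into the formula for a two-class problem. The class names and all numbers are hypothetical and are not taken from the IMDB data used later.

```python
# Hypothetical numbers for a two-class problem (not taken from the IMDB data).
prior = {"pos": 0.5, "neg": 0.5}            # P(C)
likelihood = {"pos": 0.008, "neg": 0.002}   # P(X|C) for one fixed document X

# Evidence: P(X) = sum over classes of P(X|C) * P(C)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(C|X) = P(X|C) * P(C) / P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With these values the posterior comes out at roughly 0.8 for `pos` and 0.2 for `neg`, and the two posteriors sum to 1 because the evidence acts as a normalizing constant.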

### 3. **Simplification Using Naive Assumption**

In Naive Bayes, we assume that the features are conditionally independent given the class. The joint probability of the features $x_1, x_2, ..., x_n$ given the class is then the product of the individual conditional probabilities, which simplifies the likelihood term:

$$
P(X|C) = P(x_1, x_2, ..., x_n | C) = \prod_{i=1}^{n} P(x_i | C)
$$

Where:
- $x_1, x_2, ..., x_n$ are the features (in our case, the words in the document).
- $P(x_i | C)$ is the probability of feature $x_i$ occurring given class $C$.

Thus, the posterior probability becomes:

$$
P(C|X) = \frac{P(C) \cdot \prod_{i=1}^{n} P(x_i | C)}{P(X)}
$$

We can ignore $P(X)$ because it is constant for all classes and doesn’t affect the decision of which class is most likely. Therefore, we only need to compute:

$$
P(C|X) \propto P(C) \cdot \prod_{i=1}^{n} P(x_i | C)
$$
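
The following sketch illustrates, under made-up per-word likelihoods, that dropping $P(X)$ does not change which class wins. The dictionaries `word_likelihood` and `prior` are hypothetical and only serve the example.

```python
import math

# Hypothetical per-word likelihoods P(x_i | C) for a three-word document.
word_likelihood = {
    "pos": [0.05, 0.02, 0.01],
    "neg": [0.01, 0.02, 0.03],
}
prior = {"pos": 0.5, "neg": 0.5}

# Unnormalized score: P(C) * prod_i P(x_i | C)
score = {c: prior[c] * math.prod(word_likelihood[c]) for c in prior}

# Dividing by P(X) = sum of the scores rescales every class by the same
# factor, so the ranking of the classes is unchanged.
evidence = sum(score.values())
posterior = {c: s / evidence for c, s in score.items()}

best_unnormalized = max(score, key=score.get)
best_normalized = max(posterior, key=posterior.get)
print(best_unnormalized, best_normalized)
```

Both selections pick the same class, which is why the classifier only needs the unnormalized score.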



In [1]:
import datasets
import string
import numpy as np
import nltk

from tqdm import tqdm
from nltk.corpus import stopwords
from collections import defaultdict, Counter
nltk.download('stopwords')


train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nmadali/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
class Tokenizer:
    """Whitespace tokenizer that builds a vocabulary and IDF weights from training text."""

    def __init__(self, stop_words, puncts, truncation_size=256):
        # Sets give constant-time membership tests in format_string.
        self.stop_words = set(stop_words)
        self.puncts = set(puncts)
        self.df = {}  # document frequency of each token
        self.truncation_size = truncation_size

    def format_string(self, text):
        # Lowercase, split on whitespace, and drop stop words and
        # stand-alone punctuation tokens.
        return [
            token for token in text.lower().split()
            if token not in self.stop_words and token not in self.puncts
        ]

    def tokenize(self, text, truncation=False):
        # Map each token to its vocabulary index; unknown tokens map to '<unk>'.
        unk = self.w2i['<unk>']
        ids = [self.w2i.get(token, unk) for token in self.format_string(text)]
        if not truncation:
            return ids
        # Truncate to a fixed length, then right-pad with the '<pad>' index.
        ids = ids[:self.truncation_size]
        output = np.ones(self.truncation_size) * self.w2i['<pad>']
        output[:len(ids)] = ids
        return list(output)

    def detokenize(self, idxs):
        # Join the words for the given indices, each followed by a space.
        return ''.join(self.i2w[idx] + ' ' for idx in idxs)

    def fit(self, train_text):
        # Count, for each token, the number of documents it appears in.
        for text in train_text:
            for token in set(self.format_string(text)):
                self.df[token] = self.df.get(token, 0) + 1
        self.df['<unk>'] = 1
        self.df['<pad>'] = 1

        # Word-to-index and index-to-word lookup tables.
        self.w2i = {k: idx for idx, k in enumerate(self.df)}
        self.i2w = {v: k for k, v in self.w2i.items()}

        # Smoothed inverse document frequency: log((1 + N) / (1 + df)).
        self.idf = np.zeros(len(self.df))
        for k, v in self.w2i.items():
            self.idf[v] = np.log((1 + len(train_text)) / (1 + self.df[k]))
        

In [3]:
train_text=[sample['text'] for sample in train_data]
train_label=[sample['label'] for sample in train_data]

In [4]:
test_text=[sample['text'] for sample in test_data]
test_label=[sample['label'] for sample in test_data]

In [5]:
stop_words = stopwords.words('english')
puncts = list(string.punctuation)

In [6]:
tokenizer=Tokenizer(stop_words , puncts)

In [7]:
tokenizer.fit(train_text)

In [8]:
len(tokenizer.w2i)

251445
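
One detail of the tokenizer that probably contributes to this large vocabulary: `format_string` splits on whitespace only, so punctuation attached to a word stays part of the token, and only stand-alone punctuation characters are dropped. The sketch below reproduces that filtering logic outside the class; the tiny stop-word set is a hypothetical stand-in for NLTK's English list.

```python
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
stop_words = {"the", "was", "a", "and"}
puncts = set(string.punctuation)

def format_string(text):
    # Same logic as Tokenizer.format_string: lowercase, split on whitespace,
    # drop tokens that are stop words or single punctuation characters.
    return [t for t in text.lower().split()
            if t not in stop_words and t not in puncts]

tokens = format_string("The movie was great , and the ending was great!")
print(tokens)  # ['movie', 'great', 'ending', 'great!']
```

Here `great` and `great!` end up as two distinct vocabulary entries, while the stand-alone comma is removed.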

### 4. **Class Probabilities**

For a given class $C$, the class probability $P(C)$ is simply the relative frequency of that class in the training data. If we have a dataset with $N$ total samples and $N_C$ samples of class $C$, the class probability is:

$$
P(C) = \frac{N_C}{N}
$$


In [9]:
train_data['text'][0],train_data['label'][0]

('I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

In [10]:
np.mean(train_data['label']),1-np.mean(train_data['label'])

(0.5, 0.5)

In [11]:
label_prob={}
for (label, count) in Counter(train_data['label']).items():
    label_prob[label]=count/len(train_data['label'])

In [12]:
label_prob

{0: 0.5, 1: 0.5}

### 5. **Feature Likelihoods (Word Probabilities)**

The likelihood $P(x_i | C)$ represents the probability of observing the word $x_i$ in class $C$. This is calculated by counting how often the word $x_i$ appears in documents of class $C$, and then dividing by the total number of words in class $C$.

To avoid the issue of zero probabilities (when a word doesn't appear in the training data for a given class), we use **Laplace smoothing**, which ensures that every word has a non-zero probability. The smoothed probability of a word $x_i$ given class $C$ is:

$$
P(x_i | C) = \frac{count(x_i, C) + 1}{|V| + count(C)}
$$

Where:
- $count(x_i, C)$ is the count of how many times the word $x_i$ appears in documents of class $C$.
- $|V|$ is the size of the vocabulary (the number of distinct words).
- $count(C)$ is the total number of words in class $C$.

This is the probability of a word $x_i$ occurring in class $C$ after smoothing.
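
To see the smoothing formula at work on a small scale, here is a sketch with a toy class of four word occurrences and a four-word vocabulary. The corpus and vocabulary are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus for a single class C.
class_tokens = ["good", "good", "fun", "plot"]
vocab = ["good", "fun", "plot", "boring"]   # |V| = 4; "boring" never occurs in C

counts = Counter(class_tokens)
total = len(class_tokens)                   # count(C) = 4

# P(x | C) = (count(x, C) + 1) / (|V| + count(C))
smoothed = {w: (counts[w] + 1) / (len(vocab) + total) for w in vocab}
print(smoothed)
```

The unseen word `boring` receives probability 1/8 instead of zero, and the smoothed probabilities over the whole vocabulary still sum to 1.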

In [13]:
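# word_prob[label][idx] first accumulates raw counts and is then
# overwritten with Laplace-smoothed probabilities.
word_prob = defaultdict(lambda: defaultdict(float))

for text, label in zip(train_text, train_label):
    tokens = tokenizer.tokenize(text)
    for idx, count in Counter(tokens).items():
        word_prob[label][idx] += count

vocab_size = len(tokenizer.w2i)
for label in np.unique(train_label):
    # Total number of word occurrences in this class, count(C).
    total_count = np.sum(list(word_prob[label].values()))
    # Only words seen in this class get an entry here.
    for idx in word_prob[label]:
        word_prob[label][idx] = (word_prob[label][idx] + 1) / (total_count + vocab_size)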

### 6. **Prediction**

To classify a new document, we compute the posterior probability $P(C|X)$ for each class and choose the class with the highest posterior probability:

$$
\hat{C} = \arg\max_{C} P(C) \cdot \prod_{i=1}^{n} P(x_i | C)
$$

The class with the highest score is the predicted label. Because a product of many small probabilities underflows in floating-point arithmetic, the implementation compares sums of log-probabilities instead; the logarithm is monotonic, so the argmax is unchanged.
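
A brief sketch of why scores are usually accumulated in log space: with a few hundred words, the direct product falls below the smallest representable double and becomes zero, while the sum of logs stays finite. The document length and the per-word likelihood are hypothetical.

```python
import math

# Hypothetical document of 400 words, each with likelihood 1e-4 under some class.
word_probs = [1e-4] * 400
prior = 0.5

# The direct product underflows to 0.0 in double precision.
direct = prior * math.prod(word_probs)

# The sum of logs stays finite and preserves the ordering between classes.
log_score = math.log(prior) + sum(math.log(p) for p in word_probs)
print(direct, log_score)
```

If every class underflowed to zero, the argmax would be decided arbitrarily; comparing log scores avoids that.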

In [14]:
text=test_text[0]
tokens=tokenizer.tokenize(text)

In [15]:
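sent_prob = {}
for label in np.unique(train_label):
    # Start from the log prior, then add one log-likelihood per token.
    prob = np.log(label_prob[label])
    for token in tokens:
        # tokenize() maps every token to a valid index, so only the
        # "seen in this class" check is needed.
        if word_prob[label][token] > 0:
            prob += np.log(word_prob[label][token])
        else:
            # Word not seen in this class: fall back to 1/|V|. (The Laplace
            # formula above would give 1/(|V| + count(C)) here.)
            prob += np.log(1 / len(tokenizer.w2i))
    sent_prob[label] = prob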

In [16]:
np.log(word_prob[label][token])

-8.424001058228837

In [17]:
np.argmax(list(sent_prob.values()))

0

### 7. **Final Formula for Naive Bayes Classification**

Putting everything together, the Naive Bayes classifier predicts the class $C$ for a document $X = (x_1, x_2, ..., x_n)$ by maximizing the following expression:

$$
\hat{C} = \arg\max_{C} \left( P(C) \cdot \prod_{i=1}^{n} P(x_i | C) \right)
$$

Where:
- $P(C)$ is the class prior.
- $P(x_i | C)$ is the likelihood of the word $x_i$ given the class $C$, smoothed using Laplace smoothing.



In [18]:
predictions = []
labels = np.unique(train_label)
for text in test_text:
    tokens = tokenizer.tokenize(text)
    sent_prob = {}
    for label in labels:
        # Log prior plus the sum of per-token log-likelihoods.
        prob = np.log(label_prob[label])
        for token in tokens:
            if word_prob[label][token] > 0:
                prob += np.log(word_prob[label][token])
            else:
                # Word not seen in this class: fall back to 1/|V|.
                prob += np.log(1 / len(tokenizer.w2i))
        sent_prob[label] = prob
    # Labels are sorted (0, 1), so the argmax position is the predicted label.
    predictions.append(np.argmax(list(sent_prob.values())))

### 8. **Evaluation**

To evaluate the performance of the Naive Bayes classifier, we use metrics such as **accuracy**, **precision**, **recall**, and **F1-score**:

- **Accuracy**: Measures the overall correctness of the model.
- **Precision**: The proportion of positive predictions that were actually positive.
- **Recall**: The proportion of actual positives that were correctly predicted.
- **F1-score**: The harmonic mean of precision and recall, balancing both.
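
For a chosen positive class, these metrics can be written in terms of true positives ($TP$), false positives ($FP$), false negatives ($FN$), and true negatives ($TN$). This notation is introduced here for illustration only:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$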


In [19]:
print('Accuracy :', np.mean(np.array(predictions)==np.array(test_label)))

Accuracy : 0.834


In [24]:
from sklearn.metrics import confusion_matrix, classification_report

In [22]:
y_pred, y_true= np.array(predictions), np.array(test_label)

In [26]:
np.mean(y_true), 1-np.mean(y_true)

(0.5, 0.5)

In [28]:
from collections import Counter
for key, count in Counter(y_true).items():
    print(key, count/len(y_true))

0 0.5
1 0.5


In [23]:
confusion_matrix(y_true, y_pred)

array([[10937,  1563],
       [ 2587,  9913]])

In [25]:
target_names = ['Neg', 'Pos']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

         Neg       0.81      0.87      0.84     12500
         Pos       0.86      0.79      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.84      0.83      0.83     25000
weighted avg       0.84      0.83      0.83     25000
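
As a cross-check, the per-class figures in the report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order `[Neg, Pos]`; the variable names are illustrative.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = [[10937, 1563],
      [2587, 9913]]
labels = ["Neg", "Pos"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, name in enumerate(labels):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual = sum(cm[i])                                    # row sum
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (precision, recall, f1)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.3f}")
```

Rounded to two decimals, these values agree with the `classification_report` output, and the accuracy matches the 0.834 computed earlier.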

