<a href="https://colab.research.google.com/github/jnunez0319/personal_projects/blob/main/naivebayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data and Preprocessing

In [None]:
col_names = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/csvfiles/document_data/word_indices.txt', sep=" ", header=None)

col_names.shape

(5180, 1)

In [None]:
X_train = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/csvfiles/document_data/train.csv', header=None)
X_test = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/csvfiles/document_data/test.csv', header=None)

X_train.columns = col_names[0]
X_test.columns = col_names[0]

print(X_train.shape, X_test.shape)
X_test.head()

(4527, 5180) (1806, 5180)


Unnamed: 0,dlr,new,york,sale,time,cocoa,dec,smith,juli,sept,...,hkg,twelv,arden,sherman,basf,kaufman,charleston,fomc,butan,murdochvil
0,2,2,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
y_train = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/csvfiles/document_data/train_labels.txt', sep=" ", header=None)
y_test = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/csvfiles/document_data/test_labels.txt', sep=" ", header=None)

y_train.columns = ['label']
y_test.columns = ['label']

print(y_train.shape, y_test.shape)

(4527, 1) (1806, 1)


In [None]:
# Convert DataFrames to NumPy arrays
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

# Flatten labels into 1D arrays
y = y_train.to_numpy().flatten()
yt = y_test.to_numpy().flatten()
y.shape

(4527,)

# Naive-Bayes Implementation

## Scikit Learn Model

Here we use scikit-learn's `MultinomialNB` class as a baseline.

> This will be helpful later on when you compare it to our own class written from scratch!

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
nb = MultinomialNB()
nb.fit(X_train, y)

In [None]:
yp = nb.predict(X_test)
print(accuracy_score(yt, yp), precision_score(yt, yp), recall_score(yt, yp))

0.982281284606866 0.9635627530364372 0.9930458970792768
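
scikit-learn's metric functions take their arguments in the order `(y_true, y_pred)`. Accuracy is symmetric, so the order does not change it, but precision and recall trade places when the arguments are swapped. The sketch below illustrates this with hand-written helpers (`precision` and `recall` are hypothetical names for this example, not scikit-learn functions) on made-up labels:

```python
import numpy as np

def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tp / np.sum(y_pred == 1)

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tp / np.sum(y_true == 1)

# Made-up labels, purely for illustration
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1])

p = precision(y_true, y_pred)          # 2 of 3 predicted positives are correct
r = recall(y_true, y_pred)             # 2 of 4 true positives were found
p_swapped = precision(y_pred, y_true)  # swapped arguments give the recall instead
```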


## From Scratch

![Naive Bayes image](https://databasecamp.de/wp-content/uploads/naive-bayes-overview-1024x709.png)

### `NaiveBayes()`

The core of the `NaiveBayes` classifier is Bayes' theorem:
    
>$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$

In plain words: **"If I pass in this set of data, the probability that it belongs to this class is ___."**
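
As a quick worked example of the formula, here is a minimal sketch with two classes and hypothetical numbers (they are not taken from the dataset):

```python
# Hypothetical numbers for a two-class problem (not from the dataset)
prior = {0: 0.7, 1: 0.3}          # P(y)
likelihood = {0: 0.02, 1: 0.10}   # P(X | y) for one fixed document X

# P(X) = sum over classes of P(X | y) * P(y)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# P(y | X) for every class
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# The predicted class is the one with the largest posterior
predicted = max(posterior, key=posterior.get)
```

Because $P(X)$ is the same for every class, it does not affect which class wins; it only rescales the scores so they sum to 1.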

To compute this, we break the formula into three parts, which become the three main methods of the `NaiveBayes()` class: `calc_priors`, `calc_likelihoods`, and `calc_posteriors`.

Each instance also holds two dictionaries of probability data:

>`self.priors`: Stores the prior probabilities $P(y)$ for each class.

>`self.likelihoods`: Stores the likelihoods $P(X|y)$ for each class, for each feature.
--------------------------------------------------------------------------------
### `fit(X, y, alpha=1.0)`
This function is used to train the Naive Bayes classifier by calling two key methods:

>#### `calc_priors(y)`: $P(y)$
>This function computes the prior probabilities $P(y)$ and stores them in `self.priors`. The prior probability of a class is simply the fraction of training samples that belong to it.
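
A minimal sketch of the prior computation on hypothetical labels (`y_toy` is made up for this example):

```python
import numpy as np

y_toy = np.array([0, 0, 0, 1, 1])  # hypothetical labels

# Count how often each class occurs, then divide by the number of samples
classes, counts = np.unique(y_toy, return_counts=True)
priors = {int(c): float(n) / y_toy.size for c, n in zip(classes, counts)}
```

Here class 0 appears in 3 of 5 samples and class 1 in 2 of 5, so the priors are 0.6 and 0.4.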

>#### `calc_likelihoods(X, y, alpha=1.0)`: $P(X|y)$
>This function calculates the likelihood $P(X|y)$ using Laplace smoothing (`alpha`) to handle zero counts, which would otherwise give a word a probability of zero for that class. Output is stored in `self.likelihoods`.
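
To see what the smoothing does, here is a small sketch on a hypothetical 3-word vocabulary (`X_toy` and `y_toy` are made up). It uses the same formula as `calc_likelihoods`: `(count + alpha) / (total + alpha * num_features)`.

```python
import numpy as np

# Hypothetical word-count matrix: 3 documents, 3 vocabulary words
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1]])
y_toy = np.array([0, 0, 1])
alpha = 1.0

X_c = X_toy[y_toy == 0]        # documents belonging to class 0
word_counts = X_c.sum(axis=0)  # per-word counts in class 0: [3, 0, 1]
total = word_counts.sum()      # total words in class 0: 4

# Without smoothing, the second word gets probability 0 for class 0
unsmoothed = word_counts / total

# With Laplace smoothing, every word keeps a non-zero probability
smoothed = (word_counts + alpha) / (total + alpha * X_toy.shape[1])
```

The smoothed vector is `[4/7, 1/7, 2/7]`, which still sums to 1.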

### `predict(X_test)`
This function predicts the class labels for a given set of input samples by calling the `calc_posteriors` function.

>#### `calc_posteriors(X)`: $P(y|X)$
>This function scores each class by its log-posterior for each sample in `X_test`, leaving out $P(X)$ because it is the same for every class. Returns an array with the highest-scoring class label for each input sample.
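
The score for each class is $\log P(y) + \sum_i x_i \log P(w_i|y)$, where $x_i$ is the count of word $i$. That sum is a dot product, so the same scores can be computed for all samples at once with a matrix multiplication. The sketch below is one possible vectorized variant, not the class's own method; `predict_log_space` and the toy inputs are hypothetical.

```python
import numpy as np

def predict_log_space(X, priors, likelihoods):
    """Return the class with the highest log-posterior score for each row of X."""
    classes = np.array(sorted(priors))
    log_prior = np.log([priors[c] for c in classes])    # shape (n_classes,)
    log_lik = np.log([likelihoods[c] for c in classes])  # shape (n_classes, n_features)
    # scores[i, k] = log P(y_k) + sum_j X[i, j] * log P(w_j | y_k)
    scores = X @ log_lik.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Hypothetical priors and per-word likelihoods for two classes
priors = {0: 0.5, 1: 0.5}
likelihoods = {0: np.array([0.7, 0.2, 0.1]),
               1: np.array([0.1, 0.2, 0.7])}

X_demo = np.array([[3, 1, 0],   # mostly the first word
                   [0, 1, 4]])  # mostly the third word
pred = predict_log_space(X_demo, priors, likelihoods)
```

The first document is dominated by the word that class 0 favors and the second by the word that class 1 favors, so they are assigned to classes 0 and 1 respectively.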


In [None]:
class NaiveBayes():
  '''
  Naive Bayes Classifier
  We are looking to find essentially:

  P(y|X) = P(y)P(X|y)/P(X)

  '''
  def __init__(self):
    self.priors = {}
    self.likelihoods = {}

  def calc_priors(self, y):
    ''' Find P(y) '''
    classes = np.unique(y)
    for c in classes:
      self.priors[c] = np.mean(y == c)

  def calc_likelihoods(self, X, y, alpha=1.0):
    ''' Find P(X|y) '''
    likelihoods = {}
    classes = np.unique(y)
    num_features = X.shape[1]  # Number of words (vocabulary size)

    # For each unique label in the dataset
    for class_label in classes:
      class_indices = np.where(y == class_label)[0]
      X_class = X[class_indices]

      # Sum of word counts for all rows where Y = class_label
      word_count_class = np.sum(X_class, axis=0)

      # Total word count across all documents of this class
      total_word_count_class = np.sum(word_count_class)

      # Apply Laplace smoothing in case a word never appears in this class
      likelihoods[class_label] = (word_count_class + alpha) / (total_word_count_class + alpha * num_features)

    self.likelihoods = likelihoods


  def calc_posteriors(self, X):
    ''' Find P(y|X) '''
    posteriors = []

    # go through each row
    for x in X:
      class_posteriors = {}

      # For each class, calculate the posterior probability of the current row
      for class_label in self.priors:
        # Start with the log of the prior probability to avoid underflow issues
        log_posterior = np.log(self.priors[class_label])

        # Add the log of the likelihoods for each word in the document
        for idx, count in enumerate(x):
          if count > 0:
            # count * log P(word | class) is the log of the likelihood raised to the power of the count
            log_posterior += count * np.log(self.likelihoods[class_label][idx])

        class_posteriors[class_label] = log_posterior

      # At the end, take the class with the highest log-posterior; that's our prediction
      posteriors.append(max(class_posteriors, key=class_posteriors.get))

    return np.array(posteriors)


  def fit(self, X, y, alpha=1.0):
    self.calc_priors(y)
    self.calc_likelihoods(X, y, alpha)

  def predict(self, X):
    return self.calc_posteriors(X)



>We can now evaluate our model and compare it with scikit-learn's:

In [None]:
import time

nb = NaiveBayes()

# Start timer
start_time = time.time()

# Train model
nb.fit(X_train, y)  # use the flattened 1D labels, not the y_train DataFrame

# Stop the timer
end_time = time.time()

# Predict on training set
train_yp = nb.predict(X_train)

# Output training time and accuracy
training_time = end_time - start_time
print(f"Training Time: {training_time:.4f} seconds")
accuracy_score(y, train_yp)

Training Time: 0.6641 seconds


0.9692953390766512

In [None]:
# Predict on test set and output accuracy
yp = nb.predict(X_test)
accuracy_score(yt, yp)

0.982281284606866

Not bad!
> Accuracy is exactly the same as scikit-learn's `MultinomialNB` model, at about 98.2% on the test set. That makes sense, since both use the same multinomial likelihood with `alpha=1.0`.

    0.982281284606866

> Training is **very fast** too: `fit` consistently finishes in under a second (0.66 s in the run above).

# Conclusion
>And there we go! In my opinion, Naive Bayes is one of the simpler algorithms to implement from scratch, and it makes for good practice in statistics and probability.
