<a href="https://colab.research.google.com/github/ppokranguser/Artificial_Intelligence_study/blob/main/AI_5th_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Classifier Example with PyTorch

In this example, we will implement a Naive Bayes classifier using PyTorch for a spam detection task on the SMS Spam Collection dataset. PyTorch does not have a built-in Naive Bayes implementation, so we will manually construct the classifier by calculating the necessary probabilities.


## Step 0. Import Libraries and Prepare the Dataset

- Download the following dataset and upload it to your Google Drive; the code below mounts Drive and reads the file from `MyDrive/spam.csv`.
  SMS Spam Dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset


In [None]:
import pandas as pd
import torch

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Mount Google Drive so the dataset can be read from /content/drive/MyDrive
from google.colab import drive
drive.mount('/content/drive')


## Step 1. Load and Preprocess the Dataset

We need to convert the text messages into a bag-of-words representation using CountVectorizer and convert them into PyTorch tensors.

The SMS Spam dataset has two columns that we use, v1 and v2:
- v1: Label (ham or spam)
- v2: raw text
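
To make the bag-of-words idea concrete, here is a minimal, dependency-free sketch that builds a count matrix by hand for three made-up messages. The helper name `bag_of_words` and its tokenizer (lowercase, split on whitespace) are illustrative assumptions; CountVectorizer applies its own tokenization and stop-word removal, so its vocabulary will differ.

```python
from collections import Counter

def bag_of_words(messages):
    """Return (vocabulary, count_matrix) for a list of strings.

    Hypothetical helper for illustration only: it lowercases and splits on
    whitespace, which is simpler than what CountVectorizer does.
    """
    tokenized = [m.lower().split() for m in messages]
    # Sorted list of every distinct token seen in any message
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One row per message, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

messages = ["free prize now", "call me now", "free free call"]
vocabulary, X_toy = bag_of_words(messages)
print(vocabulary)  # ['call', 'free', 'me', 'now', 'prize']
for row in X_toy:
    print(row)
```

Each row is one message and each column is one word, so the third message has a 2 in the `free` column.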


In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/spam.csv', encoding='latin1')
print("The number of examples", len(df)) # 5,572

df = df[['v1', 'v2']] # v1: label, v2: raw text
print(df.head(10)) # Display the first 10 rows

# Initialize the CountVectorizer to transform text into a bag-of-words model
# Convert the messages into numeric form
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['v2']).toarray()

# Labels (spam/ham)
y = df['v1'].map({'ham': 0, 'spam': 1}).values  # Map ham to 0 and spam to 1

## Step 2. Split the Dataset

We will split the dataset into training and test sets and convert them to PyTorch tensors.
- Training set: used for training the Naive Bayes classifier
- Test set: used for evaluating the trained classifier
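
As a rough picture of what a holdout split does, the sketch below shuffles example indices with a fixed seed and reserves a fraction for testing. The helper `holdout_split` is hypothetical and simplified; `train_test_split` uses its own shuffling and rounding, so the actual indices it selects will not match.

```python
import random

def holdout_split(n_examples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle.

    Hypothetical helper for illustration; not how scikit-learn is implemented.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_examples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))  # 8 2
```

The two index lists never overlap, which is what keeps the test set unseen during training.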

In [None]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

## Step 3. Build the Naive Bayes Classifier using PyTorch

In PyTorch, we will manually implement the Naive Bayes classifier by calculating the prior probabilities and likelihoods for each feature given the class.
- P(Y): The prior probability of each class (spam/ham) is computed from the training data.
- P(X|Y): The likelihood of each feature (word) given the class is estimated from the word counts in the training messages of that class.
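
Under the naive assumption that words are conditionally independent given the class, these two quantities combine into the decision rule below. It is written in the multinomial (word-count) form, which is an assumption about the intended model based on how `predict` multiplies the count matrix by the log-likelihoods. Here $x_j$ is the count of word $w_j$ in the message and $V$ is the vocabulary size.

$$
\hat{y} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \left[ \log P(Y = c) + \sum_{j=1}^{V} x_j \log P(w_j \mid Y = c) \right]
$$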

(**Self-study**): Laplace smoothing is applied to handle zero probabilities.
- Zero-frequency problem: what if a word in a test message never appeared in the training messages of one class? Without smoothing, the estimated likelihood of that word is zero, which drives the whole product for that class to zero no matter what the other words say. How can we address this problem?
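
A minimal sketch of add-one (Laplace) smoothing on made-up counts may help with the self-study question. It assumes the multinomial (word-count) model, where the denominator grows by the vocabulary size; the three-word vocabulary and the counts are invented for illustration.

```python
# Toy word counts for one class over a 3-word vocabulary (made-up numbers).
# The third word never appeared in this class during training.
counts = [3, 1, 0]
total = sum(counts)        # 4 word occurrences in this class
vocab_size = len(counts)   # 3 distinct words

# Unsmoothed estimate: the unseen word gets probability exactly 0.
unsmoothed = [c / total for c in counts]

# Add-one smoothing: add 1 to every count and the vocabulary size to the
# denominator, so every word is positive and the probabilities sum to 1.
smoothed = [(c + 1) / (total + vocab_size) for c in counts]

print(unsmoothed)  # [0.75, 0.25, 0.0]
print(smoothed)
```

After smoothing, the unseen word has probability 1/7 instead of 0, so a single unfamiliar word can no longer veto a class.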


In [None]:
class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = None
        self.feature_probs_given_class = None

    def fit(self, X, y):
        # Calculate the prior probability P(Y) for each class
        class_counts = torch.bincount(y.long())
        self.class_probs = class_counts / torch.sum(class_counts)

        # Calculate the likelihood P(X|Y) for each feature given the class
        feature_counts_given_class = torch.zeros((len(class_counts), X.shape[1]))

        for c in range(len(class_counts)):
            feature_counts_given_class[c] = torch.sum(X[y == c], axis=0)

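        # Laplace (add-one) smoothing for the multinomial model:
        # add 1 to every word count and the vocabulary size to each class total,
        # so no word gets zero probability and each row still sums to 1.
        # (Self-study): how can this technique address the issue of zero probabilities?
        self.feature_probs_given_class = (feature_counts_given_class + 1) / (
            torch.sum(feature_counts_given_class, axis=1, keepdim=True) + X.shape[1]
        )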

    def predict(self, X):
        # Calculate log probabilities to avoid numerical underflow
        log_class_probs = torch.log(self.class_probs)
        log_feature_probs_given_class = torch.log(self.feature_probs_given_class)
        log_probs = torch.matmul(X, log_feature_probs_given_class.T) + log_class_probs
        return torch.argmax(log_probs, axis=1)
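
The comment about numerical underflow in `predict` can be checked with a small standalone sketch. The 100 likelihood values below are made up, and plain Python floats (double precision) stand in for tensors.

```python
import math

# 100 made-up per-word likelihoods, each equal to 1e-5.
probs = [1e-5] * 100

# Multiplying them directly underflows to exactly 0.0 in double precision.
direct_product = math.prod(probs)

# Summing logarithms carries the same information as a finite number.
log_sum = sum(math.log(p) for p in probs)

print(direct_product)  # 0.0
print(log_sum)
```

Because the logarithm is monotonic, comparing log-scores across classes picks the same class as comparing the raw products would, without the risk of every score collapsing to zero.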

## Step 4. Train the Model

- Now that we have defined the model, we will train it on the training data.
- Training a Naive Bayes classifier takes only a single pass over the dataset, because fitting amounts to counting class labels and per-class word occurrences.
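
To illustrate why one pass is enough, the following sketch accumulates every statistic the model needs in a single loop. The four training examples are made up, and plain Python lists are used in place of tensors.

```python
# Toy training set: (word-count vector, label) pairs with made-up values.
# Label 0 = ham, 1 = spam; the vocabulary has 3 words.
data = [([1, 0, 2], 1), ([0, 1, 0], 0), ([0, 2, 1], 0), ([2, 0, 1], 1)]

num_classes, vocab_size = 2, 3
class_counts = [0] * num_classes
word_counts = [[0] * vocab_size for _ in range(num_classes)]

# One pass over the data collects the class counts and per-class word counts.
for x, label in data:
    class_counts[label] += 1
    for j, count in enumerate(x):
        word_counts[label][j] += count

priors = [n / len(data) for n in class_counts]
print(priors)       # [0.5, 0.5]
print(word_counts)  # [[0, 3, 1], [3, 0, 3]]
```

The priors and the (smoothed) likelihoods are then simple ratios of these counts, so no iterative optimization is involved.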

In [None]:
# Initialize the model
nb_classifier = NaiveBayesClassifier()

# Train the model
nb_classifier.fit(X_train_tensor, y_train_tensor)

## Step 5. Make Predictions and Evaluate the Model
- After training the model, we can make predictions on the test data.
- Convert the predictions back to NumPy arrays for evaluation.


### Accuracy
- Accuracy is the fraction of test messages whose predicted label matches the true label. Because most messages in this dataset are ham, a high accuracy alone does not guarantee that spam is detected well, so read it together with the classification report.

### Classification Report
The classification report provides detailed insights into the model's performance:

- Precision: The ratio of true positive predictions (correctly predicted spam) to all predicted positives.
- Recall: The ratio of true positives to all actual positives (how well the model identifies spam).
- F1-Score: The harmonic mean of precision and recall, offering a balanced measure of the model's performance.
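
The definitions above can be worked through by hand on a tiny example. The eight labels below are made up, with spam (1) treated as the positive class; this is a sketch of the formulas, not a replacement for `classification_report`.

```python
# Made-up labels for eight messages: 1 = spam (positive class), 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.67 recall=0.67 f1=0.67 accuracy=0.75
```

Here accuracy is higher than the spam metrics because the five ham messages are mostly classified correctly, which is the effect to watch for on imbalanced data.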


In [None]:
# Make predictions on the test data
y_pred_tensor = nb_classifier.predict(X_test_tensor)

# Convert predictions to NumPy arrays
y_pred = y_pred_tensor.numpy()

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Detailed classification report
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

## Step 6. Summary

In this hands-on practice, we implemented a Naive Bayes classifier using PyTorch for text classification.

We covered:
- Preprocessing the SMS Spam Collection dataset.
- Building a Naive Bayes classifier from scratch in PyTorch.
- Training the model on the preprocessed data.
- Evaluating the model's performance using metrics like accuracy and F1-score.

Although PyTorch does not have built-in support for Naive Bayes, we constructed the classifier manually by computing the necessary probabilities. This practice illustrates the flexibility and power of PyTorch for custom machine learning models.