
# Document Classification with Naive Bayes

## Introduction

In this lab session, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset. The cells below load the file `car-eval.csv` from Google Drive; its `clazz` column holds the class label.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
import sys
os.chdir('/content/gdrive/MyDrive/IU Material/Machine Learning Platforms/lab1')
sys.path.append("/content/gdrive/MyDrive/IU Material/Machine Learning Platforms/lab1")
!pwd

/content/gdrive/MyDrive/IU Material/Machine Learning Platforms/lab1


In [None]:
df = pd.read_csv('./car-eval.csv')
df.head(10)

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,clazz
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
5,vhigh,vhigh,2,2,med,high,unacc
6,vhigh,vhigh,2,2,big,low,unacc
7,vhigh,vhigh,2,2,big,med,unacc
8,vhigh,vhigh,2,2,big,high,unacc
9,vhigh,vhigh,2,4,small,low,unacc


## Account for class imbalance

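To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class and sample an equal number of examples from the majority class. This dataset has four classes, so the cell below keeps only the rarest class (`vgood`) and the most frequent class (`unacc`); the `acc` and `good` rows are dropped.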

In [None]:
df.iloc[:,1:].describe()

Unnamed: 0,maint,doors,persons,lug_boot,safety,clazz
count,1728,1728,1728,1728,1728,1728
unique,4,4,3,3,3,4
top,vhigh,2,2,small,low,unacc
freq,432,432,576,576,576,1210


In [None]:
class_counts = df['clazz'].value_counts()
print("Class Counts:\n", class_counts)

minority_class = class_counts.idxmin()
majority_class = class_counts.idxmax()

# Get all instances of the minority class
minority_instances = df[df['clazz'] == minority_class]

# Get a subset of instances of the majority class to match the minority class
majority_subset = df[df['clazz'] == majority_class].sample(n=len(minority_instances), random_state=42)

# Combine minority instances with the subset of majority instances
balanced_df = pd.concat([minority_instances, majority_subset])

# Shuffle the balanced dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

balanced_class_counts = balanced_df['clazz'].value_counts()
print("\nBalanced Class Counts:\n", balanced_class_counts)

Class Counts:
 unacc    1210
acc       384
good       69
vgood      65
Name: clazz, dtype: int64

Balanced Class Counts:
 vgood    65
unacc    65
Name: clazz, dtype: int64
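
The cell above keeps two of the four classes. If every class should survive balancing, one option is to downsample each class to the size of the rarest one. The following is a minimal sketch on a small made-up frame; the `toy` data and the `downsample_all` helper are hypothetical, and only the `clazz` column name comes from this notebook.

```python
import pandas as pd

def downsample_all(frame, label_col="clazz", seed=42):
    """Downsample every class to the size of the rarest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Concatenate the per-class samples, then shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up data: 6 'unacc', 3 'acc' and 2 'vgood' rows
toy = pd.DataFrame({
    "safety": ["low"] * 6 + ["med"] * 3 + ["high"] * 2,
    "clazz": ["unacc"] * 6 + ["acc"] * 3 + ["vgood"] * 2,
})
balanced = downsample_all(toy)
print(balanced["clazz"].value_counts())
```

Each class ends up with as many rows as the smallest class, at the cost of discarding most of the majority rows.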


## Train-test split

Now implement a train-test split on the dataset:

In [None]:
from sklearn.model_selection import train_test_split
X = balanced_df.drop('clazz',axis=1)
y = balanced_df['clazz']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)
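
The split above is unseeded, so the rows in each part change from run to run. A possible variant, sketched here on made-up lists rather than on `balanced_df`, fixes the seed with `random_state` and keeps the class ratio equal in both parts with `stratify`.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for balanced_df: 10 rows per class (made-up values)
features = [[i] for i in range(20)]
labels = ["unacc"] * 10 + ["vgood"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.3,
    random_state=42,   # fixed seed: the same split on every run
    stratify=labels,   # keep the class ratio the same in both parts
)
print(len(y_tr), len(y_te))
print(y_te.count("unacc"), y_te.count("vgood"))
```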

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class:

In [None]:
import pandas as pd
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

# Assuming you have a DataFrame called train_df with a column named 'clazz'

# Calculate prior probabilities for each class
class_counts = train_df['clazz'].value_counts()
total_instances = len(train_df)
p_classes = {clazz: class_count / total_instances for clazz, class_count in class_counts.items()}

word_frequency_dictionaries = {}
for clazz in p_classes:
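    # Join the values of the 'clazz' column for this class into one string.
    # Note: this is the label column itself, not a text or feature column,
    # so each class's dictionary ends up holding only its own label.
    class_text = ' '.join(train_df[train_df['clazz'] == clazz]['clazz'])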
    # Preprocess the text
    preprocessed_text = preprocess_text(class_text)
    # Count word frequencies
    word_frequency = Counter(preprocessed_text)
    # Store word frequency dictionary for the class
    word_frequency_dictionaries[clazz] = word_frequency

for clazz, word_frequency in word_frequency_dictionaries.items():
    print(f"Class: {clazz}")
    print(word_frequency)
    print()


Class: unacc
Counter({'unacc': 46})

Class: vgood
Counter({'vgood': 45})



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
p_classes

{'unacc': 0.5054945054945055, 'vgood': 0.4945054945054945}
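
The dictionaries printed above contain only the class labels, because the text handed to `preprocess_text` comes from the `clazz` column. For a table of categorical attributes, one possible way to obtain "words" is to turn each cell into a `column=value` token. This is a sketch under that assumption: `row_to_tokens` is a hypothetical helper and the three rows are made up, with column names taken from the `df.head(10)` output.

```python
from collections import Counter

FEATURES = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]

def row_to_tokens(row):
    """Turn one record (a dict) into 'column=value' tokens, label excluded."""
    return [f"{col}={row[col]}" for col in FEATURES]

# Made-up training rows in the shape of car-eval.csv
rows = [
    {"buying": "vhigh", "maint": "vhigh", "doors": "2", "persons": "2",
     "lug_boot": "small", "safety": "low", "clazz": "unacc"},
    {"buying": "low", "maint": "low", "doors": "4", "persons": "4",
     "lug_boot": "big", "safety": "high", "clazz": "vgood"},
    {"buying": "low", "maint": "med", "doors": "4", "persons": "more",
     "lug_boot": "big", "safety": "high", "clazz": "vgood"},
]

# One Counter of token frequencies per class
class_word_freq = {}
for row in rows:
    class_word_freq.setdefault(row["clazz"], Counter()).update(row_to_tokens(row))

print(class_word_freq["vgood"]["safety=high"])  # 2
```

Prefixing the value with its column name keeps `doors=2` and `persons=2` apart, which plain value tokens would merge.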

## Count the total corpus words
Calculate V, the word count that goes into the smoothing denominator. The cell below sums every token occurrence across the class dictionaries:

In [None]:
V = 0

for word_frequency in word_frequency_dictionaries.values():
    V += sum(word_frequency.values())

print("Total number of words (V) in the corpus:", V)

Total number of words (V) in the corpus: 91
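
Add-one smoothing is usually stated with V as the vocabulary size, that is, the number of distinct words, whereas the cell above adds up token occurrences. The two can differ a lot. A small sketch of the difference, using the two counters copied from the output printed earlier:

```python
from collections import Counter

# Counters copied from the output printed earlier in this notebook
word_frequency_dictionaries = {
    "unacc": Counter({"unacc": 46}),
    "vgood": Counter({"vgood": 45}),
}

# Total token occurrences, as computed in the cell above
total_tokens = sum(sum(c.values()) for c in word_frequency_dictionaries.values())

# Distinct words across all classes (iterating a Counter yields its keys)
vocabulary = set().union(*word_frequency_dictionaries.values())

print(total_tokens)     # 91
print(len(vocabulary))  # 2
```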


## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [None]:
def bag_it(text):
    tokens = word_tokenize(text)

    stop_words = set(stopwords.words('english'))
    tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in stop_words]

    bag_of_words = Counter(tokens)

    return bag_of_words

In [None]:
# Apply bag_it to the 'clazz' column (the label) of train_df and test_df
train_df['bag_of_words'] = train_df['clazz'].apply(bag_it)
test_df['bag_of_words'] = test_df['clazz'].apply(bag_it)

## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Be sure to sum log probabilities rather than multiply raw probabilities, which avoids floating-point underflow on long documents.

In [None]:
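def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    """Return the most likely class for doc, or all log posteriors if requested."""
    log_posteriors = {}

    for clazz, word_freq in class_word_freq.items():
        # Start from the log prior of the class
        log_posterior = np.log(p_classes[clazz])
        # Total word count of the class, computed once per class
        class_total = sum(word_freq.values())

        for word in doc:
            word_count = word_freq.get(word, 0)
            # Add-one (Laplace) smoothing keeps unseen words from giving log(0)
            log_posterior += np.log((word_count + 1) / (class_total + V))

        log_posteriors[clazz] = log_posterior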

    if return_posteriors:
        return log_posteriors

    return max(log_posteriors, key=log_posteriors.get)
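
In equation form, the function above can be read as follows. The notation is an assumption made for this sketch and does not appear elsewhere in the notebook: $d$ is the document, $\operatorname{count}(w, c)$ is the frequency of word $w$ in class $c$, and $N_c$ is the total word count of class $c$.

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in d} \log \frac{\operatorname{count}(w, c) + 1}{N_c + V} \right]
```

The sum runs over whatever iterating `doc` yields, so a word contributes once per item in that iteration. Summing logs replaces a product of many small probabilities, which is what would otherwise underflow.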

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice. For reference, with two balanced classes an accuracy near 0.5 is what guessing would achieve.

In [None]:
def calculate_accuracy(predictions, actual_labels):
    correct = sum(1 for pred, actual in zip(predictions, actual_labels) if pred == actual)
    total = len(actual_labels)
    accuracy = correct / total
    return accuracy

# Predict class labels for test set
predictions = []
for _, row in test_df.iterrows():
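    # Note: this passes the label string itself, not row['bag_of_words'],
    # so classify_doc iterates over the characters of the label
    bag_of_words = row['clazz']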
    predicted_class = classify_doc(bag_of_words, word_frequency_dictionaries, p_classes, V)
    predictions.append(predicted_class)

# Calculate accuracy
accuracy = calculate_accuracy(predictions, test_df['clazz'])
print("Accuracy:", accuracy)


Accuracy: 0.5128205128205128
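
The accuracy above can be traced by hand. The prediction loop passes `row['clazz']`, a string, so `for word in doc` walks over its characters, and no single character is a key in either class dictionary. Every symbol is therefore scored as unseen. A rough check of that arithmetic, using only numbers printed in this notebook (the `score` helper is hypothetical):

```python
import math

# Values taken from the outputs printed in this notebook
priors = {"unacc": 46 / 91, "vgood": 45 / 91}
class_totals = {"unacc": 46, "vgood": 45}
V = 91

def score(label_text, clazz):
    """Log posterior when no symbol of label_text is in the class dictionary."""
    unseen = math.log(1 / (class_totals[clazz] + V))
    return math.log(priors[clazz]) + len(label_text) * unseen

predictions = {}
for text in ("unacc", "vgood"):
    scores = {c: score(text, c) for c in priors}
    predictions[text] = max(scores, key=scores.get)
    print(text, "->", predictions[text])
```

Both inputs come out as `vgood`: its slightly smaller class total outweighs its slightly smaller prior over five symbols. If every test row is predicted `vgood`, the accuracy equals the share of `vgood` rows in the test set, which is 20 of 39 here (65 minus the 45 in training, out of 130 minus 91), matching the printed 0.5128.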


## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
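
One possible shape for such a class is sketched below. It is not the lab's reference solution: the class name `NaiveBayesText`, its methods, and the toy documents are all hypothetical, and V is taken as the vocabulary size. Documents are plain lists of string tokens, so any tokenizer can be used upstream.

```python
import math
from collections import Counter

class NaiveBayesText:
    """Multinomial Naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_freq = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_freq[label].update(tokens)
        # V is taken as the vocabulary size (distinct tokens) in this sketch
        self.vocab = set().union(*self.word_freq.values())
        self.totals = {c: sum(f.values()) for c, f in self.word_freq.items()}
        return self

    def log_posteriors(self, tokens):
        n_docs = sum(self.class_counts.values())
        V = len(self.vocab)
        scores = {}
        for c, freq in self.word_freq.items():
            # Log prior plus one smoothed log likelihood per token
            score = math.log(self.class_counts[c] / n_docs)
            for tok in tokens:
                score += math.log((freq[tok] + 1) / (self.totals[c] + V))
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)

# Made-up token lists and labels
docs = [["safety=high", "buying=low"],
        ["safety=low", "buying=vhigh"],
        ["safety=high", "buying=med"]]
labels = ["vgood", "unacc", "vgood"]

model = NaiveBayesText().fit(docs, labels)
print(model.predict(["safety=high"]))
print(model.predict(["safety=low"]))
```

Keeping `fit` and `predict` separate means the counts are computed once and reused for every prediction.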

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!