# Bayes' Theorem
To understand **Naive Bayes classification**, one must first understand **Bayes' Theorem**.

## **Scenario**: 
### There are **5** marbles in a bag. **3** of these marbles are white and **2** of these marbles are black. Given that the first marble we pulled was a **black** marble, what is the probability that the next marble will also be **black**?


![](http://i.postimg.cc/Y2stPGqg/Bayetheorem.jpg)
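
The scenario above can be checked numerically. This is a minimal sketch that assumes the first marble is not returned to the bag (the scenario does not state this explicitly); the variable names are illustrative only. It computes the answer from the definition of conditional probability and then confirms that Bayes' Theorem, P(A|B) = P(B|A) P(A) / P(B), gives the same value.

```python
from fractions import Fraction

white, black = 3, 2
total = white + black

# P(first marble is black)
p_first_black = Fraction(black, total)
# P(first black AND second black), drawing without replacement (assumption)
p_both_black = p_first_black * Fraction(black - 1, total - 1)

# Definition of conditional probability: P(A | B) = P(A and B) / P(B)
p_second_given_first = p_both_black / p_first_black
print(p_second_given_first)  # 1/4

# Cross-check with Bayes' Theorem, where A = "second black", B = "first black".
# P(A) by the law of total probability over the colour of the first marble:
p_second_black = p_both_black + Fraction(white, total) * Fraction(black, total - 1)
p_first_given_second = p_both_black / p_second_black
bayes_result = p_first_given_second * p_second_black / p_first_black
print(bayes_result)  # 1/4
```

Once one black marble is removed, 1 of the remaining 4 marbles is black, which matches the computed 1/4.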

# Naive Bayes Classifier
The Naive Bayes classifier applies Bayes' Theorem together with the "naive" assumption that features are conditionally independent given the class. Common types of Naive Bayes classifiers include:

- **Multinomial**
- **Gaussian**
- **Bernoulli**

In this tutorial we will use the Multinomial and Bernoulli Naive Bayes classifiers to build a spam detector, which lets us demonstrate the difference between the two approaches (a Gaussian model is also fitted at the end for comparison). Key notes:

- Bernoulli: Models each word as a binary feature (present or absent in a message) and ignores how many times it occurs. Its likelihood includes a term for every vocabulary word that is absent, so it explicitly penalizes the non-occurrence of a feature.
- Multinomial: Uses word counts to estimate the relative frequency of each feature within a class. It does not penalize the non-occurrence of a feature, since absent words contribute nothing to the likelihood. Smoothing (Laplace or Lidstone) keeps words that were never seen in a class during training from receiving zero probability.
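
The difference between the two likelihoods can be sketched by hand. The example below is an illustration under stated assumptions, not scikit-learn's implementation: the toy messages, the vocabulary, and the function names are all hypothetical, Laplace smoothing (alpha = 1) is assumed, and the multinomial score omits the constant multinomial coefficient.

```python
import math

# Hypothetical toy "spam" training messages, already tokenized.
spam_docs = [["free", "prize", "free"], ["free", "call"], ["prize", "call", "now"]]
vocab = ["call", "free", "meeting", "now", "prize"]
ALPHA = 1.0  # Laplace smoothing (assumption)

def multinomial_log_likelihood(doc, docs=spam_docs):
    """log P(doc | class) from smoothed relative word frequencies."""
    counts = {w: sum(d.count(w) for d in docs) for w in vocab}
    total = sum(counts.values())
    # Only words that occur in the message contribute a term.
    return sum(
        math.log((counts[w] + ALPHA) / (total + ALPHA * len(vocab))) for w in doc
    )

def bernoulli_log_likelihood(doc, docs=spam_docs):
    """log P(doc | class) from smoothed per-word presence probabilities."""
    present = set(doc)
    log_lik = 0.0
    for w in vocab:
        doc_freq = sum(1 for d in docs if w in d)
        p = (doc_freq + ALPHA) / (len(docs) + 2 * ALPHA)
        # Every vocabulary word contributes: p if present, (1 - p) if absent.
        log_lik += math.log(p if w in present else 1 - p)
    return log_lik

once, twice = ["free"], ["free", "free"]
print(multinomial_log_likelihood(once), multinomial_log_likelihood(twice))
print(bernoulli_log_likelihood(once), bernoulli_log_likelihood(twice))
```

Repeating a word changes the multinomial score but leaves the Bernoulli score unchanged, while the Bernoulli score also carries a (1 - p) factor for each vocabulary word missing from the message.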




# Processing of Raw Email Data
![](https://i.postimg.cc/Hk0BPN1N/emailcleansing.jpg)

## We will use this method to clean our emails in preparation for analysis. The cleaning steps also lowercase the text, remove punctuation, and so on. Note that changing case can change what a word represents. For example, 'Mark' could be someone's name, but once lowercased to 'mark' it becomes indistinguishable from the verb 'mark'.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns




In [None]:
#load sms-spam-collection-dataset
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding="latin1")
df.head()

In [None]:
df.columns

In [None]:
# Clean the data: drop the empty columns and add a binary spam label.
# Drop the last three (empty) columns in one call
df = df.drop(columns=df.columns[-3:])

# Encode the label: 1 for spam, 0 for ham
df['spam'] = (df['v1'] == 'spam').astype(int)

# Rename column v2 to 'Text'
df = df.rename(columns={'v2': 'Text'})

# The original label column is no longer needed
df = df.drop('v1', axis=1)

df.head()

In [None]:
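# Load the English stopwords from NLTK
# (if the corpus is missing, run nltk.download('stopwords') once first).
# Use a new name so the imported `stopwords` module is not overwritten,
# which would make this cell fail when it is re-run.
stop_words = set(stopwords.words('english'))

# Instantiate the PorterStemmer
stemmer = PorterStemmer()


In [None]:
# Preprocessing steps
def preprocess(text):
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenization
    tokens = text.split()

    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]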

    # Stemming
    tokens = [stemmer.stem(token) for token in tokens]

    # Join the tokens back into a single string
    text = ' '.join(tokens)

    return text
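
The preprocessing steps above can be traced on a single message. This is a self-contained sketch, not the function used in this notebook: `demo_stop_words` is a tiny hypothetical stand-in for NLTK's English stopword list, `preprocess_demo` is an illustrative name, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical, tiny stopword set standing in for NLTK's English list.
demo_stop_words = {"you", "have", "a", "to", "the", "now"}

def preprocess_demo(text):
    """Apply the same steps as preprocess(), minus stemming, and return each stage."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)   # digits/punctuation -> spaces
    lowered = letters_only.lower()                  # 'Mark' becomes 'mark'
    tokens = lowered.split()                        # whitespace tokenization
    kept = [t for t in tokens if t not in demo_stop_words]
    return letters_only, lowered, tokens, kept

stages = preprocess_demo("You have WON a $1000 prize! Call Mark now.")
for stage in stages:
    print(stage)
```

The final stage keeps only `won`, `prize`, `call`, and `mark`, and the name 'Mark' is no longer distinguishable from the verb.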

In [None]:
# Apply preprocessing to the 'Text' column
df['processed_text'] = df['Text'].apply(preprocess)

In [None]:
# Create an instance of CountVectorizer
vectorizer = CountVectorizer()



The above code creates a CountVectorizer, which transforms text data into a numerical representation suitable for machine learning algorithms. It converts a collection of text documents into a matrix of token counts. Each document is represented as a vector, where each element corresponds to the count of a specific word or token in that document.
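
To make the document-term matrix concrete, here is a plain-Python sketch of the idea. It is not CountVectorizer's implementation (which applies its own tokenization rules and returns a sparse matrix); the three messages are hypothetical and assumed to be already preprocessed.

```python
from collections import Counter

# Hypothetical preprocessed messages
docs = ["free prize call", "call meet tomorrow", "free free prize"]

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)      # ['call', 'free', 'meet', 'prize', 'tomorrow']
for row in matrix:
    print(row)
# [1, 1, 0, 1, 0]
# [1, 0, 1, 0, 1]
# [0, 2, 0, 1, 0]
```

Each row is one message and each column is one word, so the `2` in the last row records that "free" appears twice in the third message.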

In [None]:
# Fit and transform the preprocessed text data
X = vectorizer.fit_transform(df['processed_text'])

# This transforms the processed text into a sparse matrix of token counts

In [None]:
df['processed_text'].dtype

In [None]:
df['spam'].unique()

In [None]:
# Print the document-term matrix
print(X.toarray())

In [None]:
df.head()

# Test effectiveness of Multinomial Model vs Bernoulli Model on spam detection


In [None]:
# Fix the random seed so the split (and the scores below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, df['spam'], test_size=0.3, random_state=42)

In [None]:
model = MultinomialNB()
model.fit(X_train, y_train)

In [None]:
# Predict on the training set
y_train_pred = model.predict(X_train)

# Predict on the testing set
y_test_pred = model.predict(X_test)

In [None]:
# Calculate and print accuracy score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("MultinomialNB Training Accuracy:", train_accuracy)
print("MultinomialNB Testing Accuracy:", test_accuracy)

In [None]:
# Test with Bernoulli
Bmodel = BernoulliNB()
Bmodel.fit(X_train, y_train)

# Predict on the training set
By_train_pred = Bmodel.predict(X_train)

# Predict on the testing set
By_test_pred = Bmodel.predict(X_test)

# Calculate and print accuracy score
train_accuracy = accuracy_score(y_train, By_train_pred)
test_accuracy = accuracy_score(y_test, By_test_pred)
print("BernoulliNB Training Accuracy:", train_accuracy)
print("BernoulliNB Testing Accuracy:", test_accuracy)

In [None]:
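# Calculate ROC curve and AUC for MultinomialNB.
# Use the predicted spam probabilities rather than hard 0/1 labels so the
# curve is traced over many thresholds instead of a single operating point.
y_test_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_score)
auc = roc_auc_score(y_test, y_test_score)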

# Plot ROC curve for MultinomialNB
plt.plot(fpr, tpr, label='MultinomialNB (AUC = {:.2f})'.format(auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Compute confusion matrix for MultinomialNB
multinomial_nb_cm = confusion_matrix(y_test, y_test_pred)

# Visualize confusion matrix for MultinomialNB
plt.figure(figsize=(8, 6))
sns.heatmap(multinomial_nb_cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - MultinomialNB")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
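# Calculate ROC curve and AUC for BernoulliNB.
# Use the predicted spam probabilities rather than hard 0/1 labels so the
# curve is traced over many thresholds instead of a single operating point.
By_test_score = Bmodel.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, By_test_score)
auc = roc_auc_score(y_test, By_test_score)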

# Plot ROC curve for BernoulliNB
plt.plot(fpr, tpr, label='BernoulliNB (AUC = {:.2f})'.format(auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

In [None]:
# Compute confusion matrix for BernoulliNB
bernoulli_nb_cm = confusion_matrix(y_test, By_test_pred)

# Visualize confusion matrix for BernoulliNB
plt.figure(figsize=(8, 6))
sns.heatmap(bernoulli_nb_cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - BernoulliNB")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
from sklearn.naive_bayes import GaussianNB

# Tick labels for the confusion matrix (0 = ham, 1 = spam)
label_names = ['ham', 'spam']

# Create an instance of GaussianNB
# (GaussianNB requires dense input, hence the .toarray() calls)
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train.toarray(), y_train)

# Predict on the training set
y_train_pred_gaussian = gaussian_nb.predict(X_train.toarray())

# Predict on the testing set
y_test_pred_gaussian = gaussian_nb.predict(X_test.toarray())

# Calculate and print accuracy score for training set
train_accuracy_gaussian = accuracy_score(y_train, y_train_pred_gaussian)
print("GaussianNB Training Accuracy:", train_accuracy_gaussian)

# Calculate and print accuracy score for testing set
test_accuracy_gaussian = accuracy_score(y_test, y_test_pred_gaussian)
print("GaussianNB Testing Accuracy:", test_accuracy_gaussian)

# Compute confusion matrix for GaussianNB
gaussian_nb_cm = confusion_matrix(y_test, y_test_pred_gaussian)

# Visualize confusion matrix for GaussianNB
plt.figure(figsize=(8, 6))
sns.heatmap(gaussian_nb_cm, annot=True, fmt="d", cmap="Blues", xticklabels=label_names, yticklabels=label_names)
plt.title("Confusion Matrix - GaussianNB")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()


# Conclusion
## The MultinomialNB model demonstrated the best performance among the three models for spam detection, achieving the highest accuracy on both the training and testing sets. MultinomialNB is well suited to text classification tasks such as spam detection, where the features are discrete and represent word occurrence counts.
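
Accuracy is the only score reported above. Precision, recall, and F1 (imported at the top of the notebook but not used) can be read off a confusion matrix as well. The sketch below uses hypothetical counts, not results from this notebook, laid out the way `confusion_matrix` returns them for labels 0 (ham) and 1 (spam).

```python
# Hypothetical confusion-matrix counts (NOT results from this notebook).
# Rows are true labels, columns are predicted labels.
tn, fp = 1400, 20   # true ham:  predicted ham, predicted spam
fn, tp = 30, 200    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With counts like these, accuracy comes out higher than recall, which is one reason to look at more than one metric when one class is much rarer than the other.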