# Multinomial Naive Bayes for Spam Email Classification
This notebook demonstrates how to use the Multinomial Naive Bayes algorithm to classify emails as spam or normal.

## 1. Import Required Libraries
We use `pandas` for data handling and `sklearn` (scikit-learn) for feature extraction, model building, and evaluation. `numpy` is imported later, in the cell where it is first needed.

In [3]:
# Import Required Libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

## 2. Load and Explore the Dataset
We use a small hand-written sample dataset of nine emails, each labeled as spam or normal. The dataset has two columns: `text` (email content) and `label` (`spam` or `normal`).

In [12]:
# Create a sample dataset
data = {
    'text': [
        "Congratulations! You've won a $1,000 gift card. Click here to claim your prize.",
        "Hi, can we schedule a meeting for tomorrow?",
        "Exclusive offer just for you. Buy now and save 50%.",
        "Don't miss out on this opportunity to earn money from home.",
        "Hello, I hope this email finds you well.",
        "Win a free vacation to the Bahamas! Click now to claim your prize.",
        "Limited time offer! Get 70% off on all products. Shop today!",
        "Hey, just wanted to check in and see how you're doing.",
        "Can you send me the report by the end of the day? Thanks!"
    ],
    'label': ['spam', 'normal', 'spam', 'spam', 'normal', 'spam', 'spam', 'normal', 'normal']
}

# Convert the sample data into a DataFrame
emails = pd.DataFrame(data)

# Display all rows of the dataset
pd.set_option('display.max_rows', None)

print(emails)

# Check for missing values
print(emails.isnull().sum())

                                                text   label
0  Congratulations! You've won a $1,000 gift card...    spam
1        Hi, can we schedule a meeting for tomorrow?  normal
2  Exclusive offer just for you. Buy now and save...    spam
3  Don't miss out on this opportunity to earn mon...    spam
4           Hello, I hope this email finds you well.  normal
5  Win a free vacation to the Bahamas! Click now ...    spam
6  Limited time offer! Get 70% off on all product...    spam
7  Hey, just wanted to check in and see how you'r...  normal
8  Can you send me the report by the end of the d...  normal
text     0
label    0
dtype: int64


## 3. Preprocess the Data
Convert the text data into numerical features using `CountVectorizer` and split the dataset into training and testing sets.

In [13]:
# Convert text data to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails['text'])

# Encode labels (spam = 1, normal = 0)
y = (emails['label'] == 'spam').astype(int)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Display the training emails and their labels.

In [15]:
# Extract the training set emails using the indices of y_train
training_emails = emails.loc[y_train.index]

# Display the contents of the training emails
print(training_emails)

                                                text   label
5  Win a free vacation to the Bahamas! Click now ...    spam
0  Congratulations! You've won a $1,000 gift card...    spam
8  Can you send me the report by the end of the d...  normal
2  Exclusive offer just for you. Buy now and save...    spam
4           Hello, I hope this email finds you well.  normal
3  Don't miss out on this opportunity to earn mon...    spam
6  Limited time offer! Get 70% off on all product...    spam


Display the testing emails and their labels.

In [16]:
# Extract the testing set emails using the indices of y_test
testing_emails = emails.loc[y_test.index]

# Display the contents of the testing emails
print(testing_emails)

                                                text   label
7  Hey, just wanted to check in and see how you'r...  normal
1        Hi, can we schedule a meeting for tomorrow?  normal
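
The test split above contains only normal emails, so it cannot show whether the model detects spam. One possible remedy, sketched here as an assumption rather than as part of the original notebook, is to pass `stratify` to `train_test_split` so that both classes appear in the test set. The label list below copies the order of the sample dataset; the `*_s` variable names are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Labels in the same order as the sample dataset (1 = spam, 0 = normal).
labels = [1, 0, 1, 1, 0, 1, 1, 0, 0]
indices = list(range(len(labels)))

# stratify keeps the spam/normal ratio roughly equal in both splits.
idx_train, idx_test, y_train_s, y_test_s = train_test_split(
    indices, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Test indices:", idx_test)
print("Test label counts:", Counter(y_test_s))
```

With nine emails and `test_size=0.2`, the test set still holds two emails, but a stratified split gives it one of each class.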


## 4. Train the Multinomial Naive Bayes Model
Fit the model to the training data.

In [17]:
# Train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)
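
Note that the vectorizer above was fitted on all nine emails before the split, so its vocabulary also includes words that occur only in the test emails. A common alternative, shown here as a sketch on a hypothetical toy corpus, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that the vocabulary is learned from the training texts only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = normal), for illustration only.
train_texts = [
    "Win a free prize now",
    "Claim your free gift card",
    "Can we meet tomorrow",
    "Please send the report today",
]
train_labels = [1, 1, 0, 0]
test_texts = ["Free prize, claim now", "Send me the meeting report"]

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting count matrix.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them with the training vocabulary.
pipeline_predictions = spam_pipeline.predict(test_texts)
print(pipeline_predictions)

# Words seen only at test time (here "me") never enter the vocabulary.
vocabulary = spam_pipeline.named_steps["countvectorizer"].vocabulary_
print("me" in vocabulary)
```

Words that appear only at prediction time are ignored by the pipeline, which mirrors what happens with unseen emails in practice.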

In [None]:
import numpy as np
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the log probabilities for each word in each class
log_probabilities = model.feature_log_prob_

# Convert log probabilities to conditional probabilities P(word | class)
conditional_probabilities = np.exp(log_probabilities)

# Print the conditional probability of each word for each class.
# The order follows model.classes_ (0 = normal, 1 = spam).
class_names = ['normal', 'spam']
for i, class_probabilities in enumerate(conditional_probabilities):
    print(f"Class {class_names[i]}:")
    for word, prob in zip(feature_names, class_probabilities):
        print(f"{word}: {prob:.4f}")
    print()

Class normal:
000: 0.0108
50: 0.0108
70: 0.0108
all: 0.0108
and: 0.0108
bahamas: 0.0108
buy: 0.0108
by: 0.0215
can: 0.0215
card: 0.0108
check: 0.0108
claim: 0.0108
click: 0.0108
congratulations: 0.0108
day: 0.0215
doing: 0.0108
don: 0.0108
earn: 0.0108
email: 0.0215
end: 0.0215
exclusive: 0.0108
finds: 0.0215
for: 0.0108
free: 0.0108
from: 0.0108
get: 0.0108
gift: 0.0108
hello: 0.0215
here: 0.0108
hey: 0.0108
hi: 0.0108
home: 0.0108
hope: 0.0215
how: 0.0108
in: 0.0108
just: 0.0108
limited: 0.0108
me: 0.0215
meeting: 0.0108
miss: 0.0108
money: 0.0108
now: 0.0108
of: 0.0215
off: 0.0108
offer: 0.0108
on: 0.0108
opportunity: 0.0108
out: 0.0108
prize: 0.0108
products: 0.0108
re: 0.0108
report: 0.0215
save: 0.0108
schedule: 0.0108
see: 0.0108
send: 0.0215
shop: 0.0108
thanks: 0.0215
the: 0.0430
this: 0.0215
time: 0.0108
to: 0.0108
today: 0.0108
tomorrow: 0.0108
vacation: 0.0108
ve: 0.0108
wanted: 0.0108
we: 0.0108
well: 0.0215
win: 0.0108
won: 0.0108
you: 0.0323
your: 0.0108

Class spam:
000
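
The probabilities printed for the normal class can be reproduced by hand. `MultinomialNB` uses additive (Laplace) smoothing with `alpha=1.0` by default: P(word | class) = (count + alpha) / (total word count in class + alpha * vocabulary size). Counting tokens in the two normal training emails gives 20 words, and the vocabulary has 73 entries, so an unseen word gets 1/93 ≈ 0.0108 and `the`, which occurs three times, gets 4/93 ≈ 0.0430. The sketch below checks the formula against scikit-learn on a hypothetical toy corpus and then repeats the arithmetic for the notebook's numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (1 = spam, 0 = normal), for illustration only.
toy_texts = ["win a prize now", "claim your prize", "meeting tomorrow at noon"]
toy_labels = np.array([1, 1, 0])

toy_X = CountVectorizer().fit_transform(toy_texts).toarray()
alpha = 1.0
toy_model = MultinomialNB(alpha=alpha).fit(toy_X, toy_labels)

# Word counts per class, in the order of toy_model.classes_.
class_counts = np.array(
    [toy_X[toy_labels == c].sum(axis=0) for c in toy_model.classes_]
)

# Additive smoothing: (count + alpha) / (class total + alpha * vocabulary size).
vocab_size = toy_X.shape[1]
manual_probs = (class_counts + alpha) / (
    class_counts.sum(axis=1, keepdims=True) + alpha * vocab_size
)
matches = np.allclose(manual_probs, np.exp(toy_model.feature_log_prob_))
print("Manual formula matches scikit-learn:", matches)

# Same arithmetic for the notebook's normal class:
# 20 tokens in the class, 73 vocabulary words, word counts 0 through 3.
notebook_check = [round((count + 1) / (20 + 73), 4) for count in (0, 1, 2, 3)]
print(notebook_check)
```

Smoothing is what keeps words that never occur in a class, such as `prize` in normal emails, from receiving a probability of zero.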

In [18]:
print(model)

MultinomialNB()


## 5. Evaluate the Model
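Evaluate the model on the test data by comparing each predicted label with the true label. The test set holds only two emails, so the result is a sanity check rather than a reliable performance estimate.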

In [25]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Print the text, true label, and predicted label for each test email
for i in range(len(y_test)):
    print(f"Original: {testing_emails.iloc[i]['text']}")
    print(f"Label: {testing_emails.iloc[i]['label']}")
    print(f"Predicted: {'spam' if y_pred[i] == 1 else 'normal'}")
    print()

Original: Hey, just wanted to check in and see how you're doing.
Label: normal
Predicted: normal

Original: Hi, can we schedule a meeting for tomorrow?
Label: normal
Predicted: normal
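
The `accuracy_score` and `classification_report` functions imported at the top can summarize these results. The sketch below hard-codes the two true and predicted labels from the output above so that it runs on its own; in the notebook you would pass `y_test` and `y_pred` instead. Passing `labels` and `zero_division=0` is a choice made here because the test set contains no spam, which would otherwise leave the spam row undefined.

```python
from sklearn.metrics import accuracy_score, classification_report

# True and predicted labels copied from the test output above
# (0 = normal, 1 = spam). In the notebook, use y_test and y_pred.
y_true_example = [0, 0]
y_pred_example = [0, 0]

accuracy = accuracy_score(y_true_example, y_pred_example)
print(f"Accuracy: {accuracy:.2f}")

# List both classes explicitly so the report includes a spam row
# even though no spam email is present in this test set.
report = classification_report(
    y_true_example,
    y_pred_example,
    labels=[0, 1],
    target_names=["normal", "spam"],
    zero_division=0,
)
print(report)
```

An accuracy of 1.00 on two normal emails says nothing about spam recall, so a larger and more balanced test set would be needed for a meaningful report.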



## 6. Test with New Emails
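Use the trained model to classify email samples passed in as raw strings. The two samples below are reused from the dataset above (the first one was part of the training set), so they demonstrate the prediction workflow rather than performance on unseen mail.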

In [26]:
# Test with new email samples
new_emails = [
    "Congratulations! You've won a $1,000 gift card. Click here to claim your prize.",
    "Hi, can we schedule a meeting for tomorrow?"
]

# Transform the new emails using the same vectorizer
new_emails_transformed = vectorizer.transform(new_emails)

# Predict labels for the new emails
predictions = model.predict(new_emails_transformed)

# Display predictions
for email, label in zip(new_emails, predictions):
    print(f"Email: {email}\nPrediction: {'Spam' if label == 1 else 'Normal'}\n")

Email: Congratulations! You've won a $1,000 gift card. Click here to claim your prize.
Prediction: Spam

Email: Hi, can we schedule a meeting for tomorrow?
Prediction: Normal
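
Beyond hard labels, `MultinomialNB.predict_proba` returns the estimated probability of each class, which can help judge how confident a prediction is. The following is a minimal sketch on a hypothetical toy corpus; the `toy_*` names and the example emails are assumptions for illustration, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (1 = spam, 0 = normal), for illustration only.
toy_texts = [
    "Win a free prize now",
    "Claim your free gift card",
    "Can we meet tomorrow",
    "Please send the report today",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

unseen = ["Claim your free prize", "Can you send the report"]
probabilities = toy_model.predict_proba(toy_vectorizer.transform(unseen))

# Columns follow toy_model.classes_, so column 0 is normal and column 1 is spam.
for text, (p_normal, p_spam) in zip(unseen, probabilities):
    print(f"{text!r}: P(normal)={p_normal:.3f}, P(spam)={p_spam:.3f}")
```

Naive Bayes probabilities tend to be pushed toward 0 or 1 because of the independence assumption, so they are better read as a ranking signal than as calibrated confidence.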

