# Multinomial Naive Bayes for Spam Email Classification
This notebook demonstrates how to use the Multinomial Naive Bayes algorithm to classify emails as spam or normal.

## Problem Setup

The goal of this notebook is to classify emails as either "spam" or "normal" using the Multinomial Naive Bayes algorithm. 

### Key Steps:
1. **Dataset Preparation**:
    - A dataset of 9 emails is created, where each email is labeled as "spam" or "normal".

2. **Feature Extraction**:
    - The text data is converted into numerical features using `CountVectorizer`.

3. **Model Training**:
    - The Multinomial Naive Bayes model is trained on the extracted features and labels.

4. **Evaluation**:
    - The model's performance is evaluated using accuracy and a classification report.

5. **Prediction**:
    - The trained model is used to classify new, unseen emails.

This setup demonstrates how text classification can be performed using machine learning techniques.


## 1. Import Required Libraries
We will use libraries such as `pandas`, `sklearn`, and `numpy` for data manipulation, model building, and evaluation.

In [19]:
# Import Required Libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

## 2. Load and Explore the Dataset
We will create a small sample dataset of emails labeled as spam or normal. The dataset has two columns: `text` (email content) and `label` (spam or normal).

In [20]:
# Create a sample dataset
data = {
    'text': [
        "Congratulations! You've won a $1,000 gift card. Click here to claim your prize.",
        "Hi, can we schedule a meeting for tomorrow?",
        "Exclusive offer just for you. Buy now and save 50%.",
        "Don't miss out on this opportunity to earn money from home.",
        "Hello, I hope this email finds you well.",
        "Win a free vacation to the Bahamas! Click now to claim your prize.",
        "Limited time offer! Get 70% off on all products. Shop today!",
        "Hey, just wanted to check in and see how you're doing.",
        "Can you send me the report by the end of the day? Thanks!"
    ],
    'label': ['spam', 'normal', 'spam', 'spam', 'normal', 'spam', 'spam', 'normal', 'normal']
}

# Convert the sample data into a DataFrame
emails = pd.DataFrame(data)

# Display all rows and all columns of the dataset
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

print(emails)

# Check for missing values
print(emails.isnull().sum())

                                                text   label
0  Congratulations! You've won a $1,000 gift card...    spam
1        Hi, can we schedule a meeting for tomorrow?  normal
2  Exclusive offer just for you. Buy now and save...    spam
3  Don't miss out on this opportunity to earn mon...    spam
4           Hello, I hope this email finds you well.  normal
5  Win a free vacation to the Bahamas! Click now ...    spam
6  Limited time offer! Get 70% off on all product...    spam
7  Hey, just wanted to check in and see how you'r...  normal
8  Can you send me the report by the end of the d...  normal
text     0
label    0
dtype: int64


## 3. Preprocess the Data
Convert the text data into numerical features using `CountVectorizer` and split the dataset into training and testing sets.

In [21]:
# Convert text data to numerical features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails['text'])

# Encode labels (spam = 1, normal = 0)
y = emails['label'].apply(lambda x: 1 if x == 'spam' else 0)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
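
To make the count matrix concrete, here is a minimal pure-Python sketch of the bag-of-words step. It approximates the default behaviour of `CountVectorizer` (lowercasing, keeping tokens of two or more word characters, sorting the vocabulary). The helper `tokenize` and the variable names are illustrative assumptions, not part of the notebook.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # This approximates CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())

# Two emails taken from the sample dataset
docs = [
    "Hi, can we schedule a meeting for tomorrow?",
    "Can you send me the report by the end of the day? Thanks!",
]

# Vocabulary: sorted unique tokens, one column per token
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count matrix: one row per email, one column per vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Single-character tokens such as "a" are dropped, which is why the vocabulary in this notebook contains "don" and "ve" but not "t" or "I".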

Display the contents of all the training emails and their labels.

In [None]:
# Extract the training set emails using the indices of y_train
training_emails = emails.loc[y_train.index]

# Display the contents of the training emails
print(training_emails)

                                                text   label
5  Win a free vacation to the Bahamas! Click now ...    spam
0  Congratulations! You've won a $1,000 gift card...    spam
8  Can you send me the report by the end of the d...  normal
2  Exclusive offer just for you. Buy now and save...    spam
4           Hello, I hope this email finds you well.  normal
3  Don't miss out on this opportunity to earn mon...    spam
6  Limited time offer! Get 70% off on all product...    spam


Display the contents of all the testing emails and their labels.

In [22]:
# Extract the testing set emails using the indices of y_test
testing_emails = emails.loc[y_test.index]

# Display the contents of the testing emails
print(testing_emails)

                                                text   label
7  Hey, just wanted to check in and see how you'r...  normal
1        Hi, can we schedule a meeting for tomorrow?  normal


## 4. Train the Multinomial Naive Bayes Model
Fit the model to the training data.

## How the Multinomial Naive Bayes Model is Trained

The Multinomial Naive Bayes model is trained using the following steps:

1. **Feature Extraction**:
    - The text data is converted into numerical features using the `CountVectorizer`. This creates a matrix where each row represents an email, and each column represents the count of a specific word in that email.

2. **Label Encoding**:
    - The labels (spam or normal) are encoded as binary values: `1` for spam and `0` for normal.

3. **Training the Model**:
    - The model calculates the probabilities required for classification:
        - **Prior Probabilities**: The probability of each class (spam or normal) in the training data.
        - **Conditional Probabilities**: The probability of each word given a class, computed using Laplace smoothing to handle words that may not appear in a particular class.



## Computing Prior Probabilities for Each Class

The prior probability for each class is the proportion of samples belonging to that class in the training dataset. It is calculated as:

$$
P(c) = \frac{\text{Number of Samples in Class } c}{\text{Total Number of Samples}}
$$

### Steps to Compute Prior Probabilities:
1. **Count the Samples in Each Class**:
    - Count the number of training samples labeled as "spam" and "normal".

2. **Divide by Total Samples**:
    - Divide the count of samples in each class by the total number of training samples.

### Example from the Notebook:
- Total training samples: 7
- Samples in "normal" class (label = 0): 2
- Samples in "spam" class (label = 1): 5

The prior probabilities are:
- For "normal" class:
  $$
  P(\text{normal}) = \frac{2}{7} \approx 0.2857
  $$

- For "spam" class:
  $$
  P(\text{spam}) = \frac{5}{7} \approx 0.7143
  $$

These values are stored in the variable `class_prior_probabilities` as:
```python
array([0.28571429, 0.71428571])
```
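
The same priors can be reproduced from the training labels alone. The sketch below is a minimal illustration; the list `y_train_labels` is typed in by hand from the training set printed earlier, and the variable names are assumptions for this example.

```python
from collections import Counter

# Labels of the seven training emails, in the order printed earlier
y_train_labels = ["spam", "spam", "normal", "spam", "normal", "spam", "spam"]

class_counts = Counter(y_train_labels)
n_samples = len(y_train_labels)

# P(c) = (number of samples in class c) / (total number of samples)
priors = {label: count / n_samples for label, count in class_counts.items()}

for label in sorted(priors):
    print(f"P({label}) = {priors[label]:.4f}")
```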

## Computing Class Conditional Probabilities

The class conditional probability for each word is computed using the Multinomial Naive Bayes formula. Here's how it works:

1. **Count the Occurrences of Each Word in Each Class**:
    - For each word in the vocabulary, count how many times it appears in emails labeled as "spam" and "normal".

2. **Add Smoothing**:
    - To avoid zero probabilities for words that do not appear in a particular class, Laplace smoothing is applied. This involves adding 1 to the count of each word and adding the total number of unique words (vocabulary size) to the denominator.

3. **Compute the Probability**:
    - Without smoothing, the conditional probability of a word $w$ given a class $c$ is the relative frequency:
      $$
      P(w|c) = \frac{\text{Count}(w, c)}{\text{Total Words in Class } c}
      $$
    - With Laplace smoothing, it becomes:
      $$
      P(w|c) = \frac{\text{Count}(w, c) + 1}{\text{Total Words in Class } c + \text{Vocabulary Size}}
      $$
    - Here:
      - $\text{Count}(w, c)$ is the number of times the word $w$ appears in emails of class $c$.
      - $\text{Total Words in Class } c$ is the total count of all words in emails of class $c$.
      - $\text{Vocabulary Size}$ is the total number of unique words in the dataset.

4. **Log Transformation**:
    - To prevent numerical underflow during multiplication of probabilities, the logarithm of the probabilities is often used. This is stored in `model.feature_log_prob_`.

5. **Convert Log Probabilities to Conditional Probabilities**:
    - The log probabilities can be exponentiated to get the actual conditional probabilities:
      $$
      P(w|c) = \exp(\log P(w|c))
      $$

This process ensures that the model can compute the likelihood of an email belonging to a class based on the words it contains.
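
As a worked check of the smoothed formula, the sketch below recomputes the conditional probabilities from the seven training emails in pure Python. It assumes the same tokenization as the `CountVectorizer` defaults and builds the vocabulary from all nine emails, since the vectorizer in this notebook is fitted before the split. The helper names (`tokenize`, `cond_prob`) are illustrative, not part of the notebook.

```python
import re
from collections import Counter

def tokenize(text):
    # Approximation of CountVectorizer's default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())

# The seven training emails with their labels
train = [
    ("Win a free vacation to the Bahamas! Click now to claim your prize.", "spam"),
    ("Congratulations! You've won a $1,000 gift card. Click here to claim your prize.", "spam"),
    ("Can you send me the report by the end of the day? Thanks!", "normal"),
    ("Exclusive offer just for you. Buy now and save 50%.", "spam"),
    ("Hello, I hope this email finds you well.", "normal"),
    ("Don't miss out on this opportunity to earn money from home.", "spam"),
    ("Limited time offer! Get 70% off on all products. Shop today!", "spam"),
]
# The two test emails, used here only to build the full vocabulary
held_out = [
    "Hey, just wanted to check in and see how you're doing.",
    "Hi, can we schedule a meeting for tomorrow?",
]

all_texts = [text for text, _ in train] + held_out
vocab = sorted({tok for text in all_texts for tok in tokenize(text)})
V = len(vocab)

# Word counts per class, from the training emails only
word_counts = {"normal": Counter(), "spam": Counter()}
for text, label in train:
    word_counts[label].update(tokenize(text))
totals = {c: sum(counter.values()) for c, counter in word_counts.items()}

def cond_prob(word, c, alpha=1):
    # Laplace-smoothed P(word | c)
    return (word_counts[c][word] + alpha) / (totals[c] + alpha * V)

print(f"Vocabulary size: {V}")
print(f"Total words per class: {totals}")
print(f"P(the | normal) = {cond_prob('the', 'normal'):.4f}")
print(f"P(the | spam)   = {cond_prob('the', 'spam'):.4f}")
```

For "the" in the normal class this gives (3 + 1) / (20 + 73) = 4/93 ≈ 0.0430, which matches the value the fitted model reports for that word.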



4. **Parameter Estimation**:
    - The parameters are not found by iterative optimization. Under the Naive Bayes assumption (features are conditionally independent given the class), the likelihood-maximizing estimates are the relative frequencies described above, so they are computed directly from the counts, with Laplace smoothing applied to the word probabilities.

5. **Model Storage**:
    - The trained model stores the log probabilities of each word for each class (`model.feature_log_prob_`) and the log prior probabilities of each class (`model.class_log_prior_`).

This process ensures that the model can efficiently classify new emails based on the learned probabilities.
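
The reason for storing logarithms can be seen in a small experiment. The sketch below multiplies 100 small probabilities directly and compares the result with the sum of their logarithms. The value `1e-5` and the count of 100 words are hypothetical numbers chosen only to trigger underflow.

```python
import math

p = 1e-5        # hypothetical per-word probability
n_words = 100   # hypothetical email length

# Direct product: the true value is 1e-500, too small for a 64-bit float
product = 1.0
for _ in range(n_words):
    product *= p

# Sum of logs: the same quantity in log space stays representable
log_sum = n_words * math.log(p)

print(f"Direct product: {product}")
print(f"Sum of logs:    {log_sum:.2f}")
```

The direct product collapses to `0.0`, so every class would tie, while the log-space sum remains a finite number that can still be compared across classes.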

In [25]:
import numpy as np
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Convert the word count matrix to an array for easier interpretation
word_count_matrix = X_train.toarray()

#compute and print class prior probabilities
class_prior_probabilities = np.bincount(y_train) / len(y_train)
print(f"Class prior probabilities: {class_prior_probabilities}")
print()
print(f"Number of training samples: {len(y_train)}")
print(f"Number of training samples in each class: {np.bincount(y_train)}")
print()

# Boolean masks that select the rows (emails) of each class
normal_mask = (y_train == 0).to_numpy()
spam_mask = (y_train == 1).to_numpy()

# Print the total number of words in normal mails and in spam mails
print(f"Total words in normal mails: {word_count_matrix[normal_mask].sum()}")
print(f"Total words in spams: {word_count_matrix[spam_mask].sum()}")

# Print the number of unique words (vocabulary size), which is used for Laplace smoothing
print(f"Vocabulary size: {len(feature_names)}")
print(f"Unique words: {len(np.unique(feature_names))}")
print()

# For each of the first 10 words, print its total count across the training set
# and its total count in normal mails and in spam mails
for i in range(10):
    word = feature_names[i]
    print(f"Word {i+1}: {word}")
    print(f"Count across all training mails: {word_count_matrix[:, i].sum()}")
    print(f"Total count in normal mails: {word_count_matrix[normal_mask, i].sum()}")
    print(f"Total count in spam mails: {word_count_matrix[spam_mask, i].sum()}")
    print()



Class prior probabilities: [0.28571429 0.71428571]

Number of training samples: 7
Number of training samples in each class: [2 5]

Total words in normal mails: 20
Total words in spams: 57
Vocabulary size: 73
Unique words: 73

Word 1: 000
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 2: 50
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 3: 70
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 4: all
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 5: and
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 6: bahamas
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 7: buy
Count across all training mails: 1
Total count in normal mails: 0
Total count in spam mails: 1

Word 8: by
Count across all training mails: 1
Total count in normal mails: 1
Total count in spam mails: 0

Word 9: can
C

In [28]:
# Train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

In [33]:
import numpy as np
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the log probabilities for each word in each class
log_probabilities = model.feature_log_prob_

#convert log probabilities to conditional probabilities
conditional_probabilities = [np.exp(log_prob) for log_prob in log_probabilities]

# Print the conditional probabilities for each word for each class
classname=['normal mail', 'spam mail']
for i, class_probabilities in enumerate(conditional_probabilities):
    print(f"Class {classname[i]}:")
    for word, prob in zip(feature_names, class_probabilities):
        print(f"{word}: {prob:.4f}")
    print()

Class normal mail:
000: 0.0108
50: 0.0108
70: 0.0108
all: 0.0108
and: 0.0108
bahamas: 0.0108
buy: 0.0108
by: 0.0215
can: 0.0215
card: 0.0108
check: 0.0108
claim: 0.0108
click: 0.0108
congratulations: 0.0108
day: 0.0215
doing: 0.0108
don: 0.0108
earn: 0.0108
email: 0.0215
end: 0.0215
exclusive: 0.0108
finds: 0.0215
for: 0.0108
free: 0.0108
from: 0.0108
get: 0.0108
gift: 0.0108
hello: 0.0215
here: 0.0108
hey: 0.0108
hi: 0.0108
home: 0.0108
hope: 0.0215
how: 0.0108
in: 0.0108
just: 0.0108
limited: 0.0108
me: 0.0215
meeting: 0.0108
miss: 0.0108
money: 0.0108
now: 0.0108
of: 0.0215
off: 0.0108
offer: 0.0108
on: 0.0108
opportunity: 0.0108
out: 0.0108
prize: 0.0108
products: 0.0108
re: 0.0108
report: 0.0215
save: 0.0108
schedule: 0.0108
see: 0.0108
send: 0.0215
shop: 0.0108
thanks: 0.0215
the: 0.0430
this: 0.0215
time: 0.0108
to: 0.0108
today: 0.0108
tomorrow: 0.0108
vacation: 0.0108
ve: 0.0108
wanted: 0.0108
we: 0.0108
well: 0.0215
win: 0.0108
won: 0.0108
you: 0.0323
your: 0.0108

Class spam

In [34]:
# Display the fitted model

print(model)

MultinomialNB()


## 5. Evaluate the Model
Evaluate the model's performance on the test data using accuracy and a classification report.

## How the Model Classifies a New Email



1. **Feature Extraction**:
    - The email text is transformed into numerical features using the same `CountVectorizer` that was used during training. This ensures consistency in the feature representation.

2. **Compute Class Posterior Probabilities**:
    - The model calculates the log probabilities for each class (spam or normal) based on the words present in the new test email.

3. **Apply Naive Bayes Formula**:
    - The model computes the likelihood of the email belonging to each class using the Naive Bayes formula, in log space:
      $$
      \text{Log Likelihood}(c) = \text{Log Prior}(c) + \sum_{w \in \text{Email}} \log P(w|c)
      $$
    - This is the logarithm of the product form:
      $$
      \text{Likelihood}(c) = \text{Prior}(c) \cdot \prod_{w \in \text{Email}} P(w|c)
      $$
    - Here, $c$ is the class (spam or normal), and $P(w|c)$ is the conditional probability of word $w$ given class $c$. A word that occurs several times in the email contributes one term per occurrence.

4. **Class Prediction**:
    - The class with the highest likelihood is selected as the predicted class for the email.

5. **Output**:
    - The model outputs the predicted label (`spam` or `normal`) for the email.

This process ensures that the model uses the learned probabilities to make accurate predictions for new, unseen emails.
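
The scoring steps above can be sketched end to end in pure Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: it uses the seven training emails, approximates the `CountVectorizer` default tokenization, builds the vocabulary from all nine emails (the vectorizer here is fitted before the split), and skips words outside the vocabulary. The names `tokenize`, `log_score`, and `predict` are hypothetical helpers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Approximation of CountVectorizer's default tokenization
    return re.findall(r"\b\w\w+\b", text.lower())

train = [
    ("Win a free vacation to the Bahamas! Click now to claim your prize.", "spam"),
    ("Congratulations! You've won a $1,000 gift card. Click here to claim your prize.", "spam"),
    ("Can you send me the report by the end of the day? Thanks!", "normal"),
    ("Exclusive offer just for you. Buy now and save 50%.", "spam"),
    ("Hello, I hope this email finds you well.", "normal"),
    ("Don't miss out on this opportunity to earn money from home.", "spam"),
    ("Limited time offer! Get 70% off on all products. Shop today!", "spam"),
]
# The two test emails, used here only to build the full vocabulary
held_out = [
    "Hey, just wanted to check in and see how you're doing.",
    "Hi, can we schedule a meeting for tomorrow?",
]

vocab = {tok for text, _ in train for tok in tokenize(text)}
vocab |= {tok for text in held_out for tok in tokenize(text)}

classes = ["normal", "spam"]
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for text, label in train:
    word_counts[label].update(tokenize(text))
totals = {c: sum(word_counts[c].values()) for c in classes}

def log_score(text, c):
    # Log prior plus the sum of smoothed log conditional probabilities
    score = math.log(doc_counts[c] / len(train))
    for tok in tokenize(text):
        if tok in vocab:  # words outside the vocabulary are ignored
            score += math.log((word_counts[c][tok] + 1) / (totals[c] + len(vocab)))
    return score

def predict(text):
    # Pick the class with the highest log score
    return max(classes, key=lambda c: log_score(text, c))

for email in [
    "Hi, can we schedule a meeting for tomorrow?",
    "Congratulations! You've won a $1,000 gift card. Click here to claim your prize.",
]:
    scores = {c: round(log_score(email, c), 3) for c in classes}
    print(f"{predict(email):>6}  {scores}")
```

Comparing log scores is enough for prediction; normalizing them into probabilities would not change which class ranks highest.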

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)



# print the original content and label for each mail in test set and its predicted labels
for i in range(len(y_test)):
    print(f"Original: {testing_emails.iloc[i]['text']}")
    print(f"Label: {testing_emails.iloc[i]['label']}")
    print(f"Predicted: {'spam' if y_pred[i] == 1 else 'normal'}")
    print()

Original: Hey, just wanted to check in and see how you're doing.
Label: normal
Predicted: normal

Original: Hi, can we schedule a meeting for tomorrow?
Label: normal
Predicted: normal
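
The cells above print the predictions but do not compute the metrics named in this section. In scikit-learn those would come from `accuracy_score(y_test, y_pred)` and `classification_report(y_test, y_pred)`. The sketch below computes accuracy and per-class precision and recall by hand to show what they measure. The lists `y_true` and `y_pred` are typed in from the output above, and returning `0.0` for an undefined ratio is a convention chosen for this sketch.

```python
# True and predicted labels for the two test emails (0 = normal, 1 = spam)
y_true = [0, 0]
y_pred = [0, 0]

def accuracy(y_true, y_pred):
    # Fraction of emails whose predicted label matches the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # An undefined ratio (no predictions or no true samples) is reported as 0.0 here
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")
for name, value in [("normal", 0), ("spam", 1)]:
    precision, recall = precision_recall(y_true, y_pred, value)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Because both test emails are normal, the perfect accuracy here says nothing about how well spam is detected; a larger test set with both classes would be needed for that.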



## 6. Test with New Emails
Use the trained model to classify new email samples.

In [None]:
# Test with new email samples
new_emails = [
    "Congratulations! You've won a $1,000 gift card. Click here to claim your prize.",
    "Hi, can we schedule a meeting for tomorrow?"
]

# Transform the new emails using the same vectorizer
new_emails_transformed = vectorizer.transform(new_emails)

# Predict labels for the new emails
predictions = model.predict(new_emails_transformed)

# Display predictions
for email, label in zip(new_emails, predictions):
    print(f"Email: {email}\nPrediction: {'Spam' if label == 1 else 'Normal'}\n")

Email: Congratulations! You've won a $1,000 gift card. Click here to claim your prize.
Prediction: Spam

Email: Hi, can we schedule a meeting for tomorrow?
Prediction: Normal

