Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Sanskriti Bajaad"
COLLABORATORS = "Individual"

---

# Naive Bayes Spam Detection 

In this assignment, you will build a Naive Bayes text classifier from scratch to detect spam messages. We will use a dataset of SMS messages labeled as "ham" (not spam) or "spam". The assignment guides you through loading data, preprocessing text, calculating probabilities for Naive Bayes, and evaluating the classifier's performance. Each step is posed as a question that you answer with code or a written response. Follow the instructions closely and fill in the required code where prompted.

Dataset: We will use the SMS Spam Collection dataset, a corpus of SMS messages classified as ham or spam. The dataset is provided as a CSV file (spam.csv) with two columns: one for the label (ham or spam) and one for the message text.

## Question 1: Loading and Exploring the Dataset
First, we need to load the spam dataset and get an understanding of its contents. This will involve reading the data from the file and checking basic statistics. Your tasks:
1. Load the dataset from the provided file (e.g., spam.csv) into a pandas DataFrame.
2. Ensure the data is read correctly (handle any encoding issues if necessary).
3. Display the first 5 rows of the DataFrame to see the format of the data.
4. Compute the number of messages that are ham vs. spam and print these counts.

This will give us an idea of the class distribution and the structure of the data. Hint: You can use pandas.read_csv. The file might be tab-separated; if so, use sep='\t'. You may need to specify an encoding (such as 'latin-1') if you encounter errors reading the file.

In [11]:
# Step 1: Loading the dataset and initial exploration

import pandas as pd

# 1. Load the dataset from spam.csv into a DataFrame
# (The dataset file "spam.csv" is assumed to be in the current directory.)
# TODO: Read the CSV file into a pandas DataFrame named df with columns ["Label", "Message"].
# Hint: If using pd.read_csv, consider specifying sep='\t' and encoding='latin-1'.
# If the CSV has no header row, use header=None and names=["Label", "Message"].
df = pd.read_csv("spam.csv", sep=',', encoding='latin-1')

# 2. Display the first 5 rows of the DataFrame to inspect the format
# YOUR CODE HERE
df = df.iloc[:, :2]
df.columns = ["Label", "Message"]
print(df.head())


# 3. Compute the number of messages that are ham vs. spam
# YOUR CODE HERE
print(df["Label"].value_counts())


  Label                                            Message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
Label
ham     4825
spam     747
Name: count, dtype: int64


## Question 2: Preprocessing - Tokenization and Normalization
Raw text data is messy. To use it in our classifier, we should preprocess the messages:
- Tokenization: split each message into individual words (tokens).
- Normalization: convert text to a standard form, e.g., lowercasing all words.
- Punctuation removal: remove or ignore punctuation so it doesn't count as part of words.

In this question, you'll implement a function to preprocess a single message. The preprocessing should:
1. Convert the text to lowercase.
2. Remove punctuation (you can remove any character that is not a letter or number).
3. Split the text into tokens (words).

We'll start without removing stop words (common words like "the", "and", etc.); we'll address that in the next question. Your tasks:

1. Implement the function preprocess_text(text, remove_stopwords=False):
   - Lowercase the input text.
   - Remove punctuation from the text.
   - Split the text into tokens (for example, using str.split() or regular expressions to split on whitespace).
   - For now, ignore the remove_stopwords parameter (we will use it in Question 3).
2. Use the function on the messages in the dataset to create a new column (e.g., "Tokens") in the DataFrame containing the list of tokens for each message.
3. Print a sample message and its token list to verify the preprocessing is working as expected.

Hints:
- You can use Python's re module (e.g., re.sub) to remove punctuation by replacing non-alphanumeric characters with a space or empty string.
- Alternatively, you can remove punctuation by checking each character (.isalnum()).
- After cleaning, split on whitespace to get tokens.
- Make sure not to remove spaces between words unintentionally when removing punctuation (replacing punctuation with a space can help separate words).
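
To illustrate the last hint, here is a minimal sketch comparing the two ways of removing punctuation. The helper names `strip_to_space` and `strip_to_empty` and the sample string are made up for this comparison and are not part of the assignment.

```python
import re

def strip_to_space(text):
    """Replace every non-alphanumeric character with a space, then split."""
    return re.sub(r"[^a-z0-9]", " ", text.lower()).split()

def strip_to_empty(text):
    """Delete punctuation (keeping whitespace), then split."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()

# Hypothetical message with punctuation but no spaces between some words
sample = "Free entry!!Call now...don't wait"

print(strip_to_space(sample))  # ['free', 'entry', 'call', 'now', 'don', 't', 'wait']
print(strip_to_empty(sample))  # ['free', 'entrycall', 'nowdont', 'wait']
```

Replacing with a space keeps neighbouring words apart, at the cost of splitting contractions such as "don't" into two tokens. Deleting punctuation keeps contractions together but can glue unrelated words into one token.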

In [12]:
import re

def preprocess_text(text, remove_stopwords=False):
    """
    Tokenize and normalize the given text.
    - Lowercase the text
    - Remove punctuation/non-alphanumeric characters
    - Split into tokens (words)
    If remove_stopwords=True, we'll also remove common stop words (we'll handle this in the next step).
    """
    # Lowercase the text
    # Remove punctuation (replace non-letter/number characters with space)
    # Split text into tokens
    text = text.lower()
    text = re.sub(r'[^a-z0-9]', ' ', text)
    tokens = text.split()
    
    return tokens

# Create a new column in the DataFrame for the tokenized messages
# Use the preprocess_text function on each message in df["Message"]
# YOUR CODE HERE
df['Tokens'] = df['Message'].apply(preprocess_text)

# Test the preprocessing on a sample message
sample_idx = 1  # index 1 is the second message in the DataFrame
print("Original message:", df.loc[sample_idx, "Message"])
print("Tokens:", df.loc[sample_idx, "Tokens"])

Original message: Ok lar... Joking wif u oni...
Tokens: ['ok', 'lar', 'joking', 'wif', 'u', 'oni']


## Question 3: Removing Stop Words

Stop words are common words (like "the", "and", "to", "is") that may not be useful for distinguishing between spam and ham. Removing stop words can sometimes improve a model's performance by focusing on more meaningful words. In this question, we'll add stop word removal to our preprocessing and consider its effect on the classifier. 

Your tasks:
1. Define a list or set of English stop words. You can use a predefined list (for example, use sklearn's list: from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS) or define a small list of your own (e.g., ["the", "a", "an", "to", "is", "in", ...]).
2. Update your preprocess_text function to remove stop words when remove_stopwords=True. This means filtering out tokens that are in your stop words list.
3. Test your updated preprocessing on a sample message by calling preprocess_text with remove_stopwords=True and verify that common words are removed.
4. Think: (No code required for this part) How might removing stop words affect the performance of the spam classifier? Will it increase, decrease, or have minimal effect on accuracy? We will later evaluate the model with and without stop words to see the difference.

Hint: Converting your stop word list to a set will make the membership check (word in stop_words) more efficient.
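
As a rough sketch of the filtering step, the example below uses a small hand-written stop word set. The names `small_stop_words` and `drop_stop_words` and the token list are hypothetical, chosen only for illustration.

```python
# Hypothetical hand-written stop word set, for illustration only
small_stop_words = {"the", "a", "an", "to", "is", "in", "and", "you"}

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["you", "have", "won", "a", "prize", "call", "to", "claim"]
filtered = drop_stop_words(tokens, small_stop_words)
print(filtered)  # ['have', 'won', 'prize', 'call', 'claim']
```

Because `small_stop_words` is a set, each `t not in stop_words` check takes roughly constant time, whereas a list would be scanned element by element.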

In [13]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = set(ENGLISH_STOP_WORDS)

def preprocess_text(text, remove_stopwords=False):
    # Lowercase the text
    text = text.lower()
    
    # Remove punctuation and non-alphanumeric characters
    text = re.sub(r'[^a-z0-9\s]', ' ', text)

    
    # Split into tokens
    tokens = text.split()
    
    # If remove_stopwords is True, filter out tokens that are in the stop_words set
    if remove_stopwords:
        tokens = [w for w in tokens if w not in stop_words]
    return tokens

# Let's test the function on the same sample message with and without stop word removal
sample_text = df.loc[sample_idx, "Message"]
print("Tokens without stopword removal:", preprocess_text(sample_text, remove_stopwords=False))
print("Tokens with stopword removal:", preprocess_text(sample_text, remove_stopwords=True))

Tokens without stopword removal: ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
Tokens with stopword removal: ['ok', 'lar', 'joking', 'wif', 'u', 'oni']


## Question 4: Creating Word Frequency Counts per Class

Now that we can tokenize our messages, let's prepare the data for the Naive Bayes classifier. Naive Bayes for text classification uses the frequency of each word in each class (spam or ham) to compute probabilities. In this step, we will:

1. Separate the training data into spam messages and ham messages.
2. Count how many times each word appears in spam messages and in ham messages.

These word frequency counts per class will form the basis of our likelihood estimates $P(\text{word}|\text{spam})$ and $P(\text{word}|\text{ham})$. Your tasks:
1. Split the dataset into a training set and a test set. Use 80% of the data for training and 20% for testing. (It's important to evaluate on unseen test data.)
   - You can use sklearn.model_selection.train_test_split with a fixed random_state (for reproducibility), or shuffle and split manually.
   - Make sure to separate both the messages and labels for train and test.
2. Using the training set only, create two dictionaries (or collections.Counter):
   - spam_word_counts: counts of each word in all spam messages in the training set.
   - ham_word_counts: counts of each word in all ham messages in the training set.
3. Also compute the total number of words in spam messages (N_spam_words) and in ham messages (N_ham_words) in the training set. (This is the sum of the counts in each dictionary, or equivalently the total length of all spam/ham tokens.)
4. Print the top 5 most frequent words in spam_word_counts and ham_word_counts to see some common words in each class (optional, for curiosity).

Hint:
- If you used a DataFrame, you can create train_df and test_df by splitting df. For example, using train_test_split from sklearn.
- To count words, iterating through each training message's tokens and updating counts is straightforward. You can use dict or Counter.
- Example with Counter: from collections import Counter; then for spam: spam_word_counts = Counter(), and update it for each spam message's token list.
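
The Counter pattern from the hint can be sketched on toy data as follows. The token lists and the names `toy_spam_tokens`, `toy_counts`, and `toy_total` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Toy token lists standing in for preprocessed spam messages (made up)
toy_spam_tokens = [
    ["free", "prize", "claim", "free"],
    ["win", "free", "cash"],
]

toy_counts = Counter()
for tokens in toy_spam_tokens:
    toy_counts.update(tokens)  # adds 1 for each occurrence of each token

toy_total = sum(toy_counts.values())  # total number of tokens in the class

print(toy_counts.most_common(1))  # [('free', 3)]
print(toy_total)                  # 7
```

`most_common(5)` would give the top 5 words asked for in task 4.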

In [14]:
from sklearn.model_selection import train_test_split
from collections import Counter

# 1. Split the full dataset into training and testing sets (e.g., 80% train, 20% test)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Reset indices for convenience
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# 2. Separate training data into spam and ham subsets
# YOUR CODE HERE: create train_spam and train_ham DataFrames
train_spam = train_df[train_df["Label"] == "spam"]
train_ham = train_df[train_df["Label"] == "ham"]

# 3. Tokenize all messages in each subset (using preprocess_text with remove_stopwords=True)
# Create a list of token lists for spam messages and another for ham messages.
# YOUR CODE HERE: generate spam_tokens_list and ham_tokens_list
spam_tokens_list = []
for msg in train_spam["Message"]:
    spam_tokens_list.append(preprocess_text(msg, remove_stopwords=True))

ham_tokens_list = []
for msg in train_ham["Message"]:
    ham_tokens_list.append(preprocess_text(msg, remove_stopwords=True))

# 4. Count word frequencies for spam and ham
spam_word_counts = Counter()
ham_word_counts = Counter()

# Update the counters with each message's tokens
for tokens in spam_tokens_list:
    spam_word_counts.update(tokens)
for tokens in ham_tokens_list:
    ham_word_counts.update(tokens)

# 5. Calculate total number of words in spam and ham messages
N_spam_words = sum(spam_word_counts.values())
N_ham_words = sum(ham_word_counts.values())


# 6. Print the total word counts in each class
print(f"Total words in spam messages: {N_spam_words}")
print(f"Total words in ham messages: {N_ham_words}")


Total words in spam messages: 10387
Total words in ham messages: 30103


## Question 5: Calculating Prior and Conditional Probabilities
Now we will compute the probabilities needed for the Naive Bayes classifier:
1. Prior probabilities: $P(\text{spam})$ and $P(\text{ham})$ — the probabilities that any given message is spam or ham, based on the training data.
2. Conditional probabilities (likelihoods): $P(w|\text{spam})$ and $P(w|\text{ham})$ for each word $w$ — the probability of a word appearing in a message given the message is spam or ham.

Using the training set:
1. $P(\text{spam}) = \frac{\text{count of spam messages in training}}{\text{total count of messages in training}}$.
2. $P(\text{ham}) = \frac{\text{count of ham messages in training}}{\text{total count of messages in training}}$.

For a given word $w$:
1. $P(w|\text{spam}) = \frac{\text{count of $w$ in spam messages}}{\text{total words in spam messages}}$ (using the frequencies you calculated).
2. $P(w|\text{ham}) = \frac{\text{count of $w$ in ham messages}}{\text{total words in ham messages}}$.

However, note that if a word did not appear in spam (count = 0), this probability will be 0, which can be problematic. We'll address that in the next question (Laplace smoothing). For now, we'll compute the probabilities without smoothing. 
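
To see why a zero is problematic, consider this small sketch with made-up numbers: a single unseen word forces the whole product for that class to zero, regardless of the other words.

```python
import math

# Made-up unsmoothed likelihoods P(w | spam) for three tokens of one message.
# The last token never appeared in spam, so its estimate is 0.
likelihoods = [0.02, 0.01, 0.0]
prior_spam = 0.13  # hypothetical prior, for illustration only

score = prior_spam * math.prod(likelihoods)
print(score)  # 0.0
```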

Your tasks:
1. Calculate the prior probabilities P_spam and P_ham from the training data. Store them in variables P_spam and P_ham.
2. Implement a function conditional_prob(word, class_label) that returns $P(\text{word} \mid \text{class\_label})$ using the frequency counts from Question 4 (no smoothing yet):
   - If class_label is "spam", use spam_word_counts and N_spam_words.
   - If class_label is "ham", use ham_word_counts and N_ham_words.
   - If the word is not found in the respective dictionary, the probability should be 0 (since count=0).
3. Test your function on a couple of words, for example:
   - A word you expect to be common in spam (like "free") for both spam and ham classes.
   - A word that is common in ham or appears in both.
4. Print the prior probabilities and a few example conditional probabilities.

Hint: The sum of spam_word_counts.values() we computed is the denominator for $P(w|\text{spam})$. Use integer counts for numerator.

In [15]:
# Number of spam and ham messages in the training set
# (We can get these from the lengths of train_spam and train_ham DataFrames)
# YOUR CODE HERE: compute num_spam_messages and num_ham_messages
num_spam_messages = len(train_spam)
num_ham_messages = len(train_ham)
num_messages = len(train_df)

# Prior probabilities P(spam) and P(ham)
P_spam = num_spam_messages / num_messages
P_ham = num_ham_messages / num_messages


# Print the prior probabilities
print(f"P(spam): {P_spam:.4f}")
print(f"P(ham): {P_ham:.4f}")

# Conditional probability function for a word given class
def conditional_prob(word, class_label):
    """Return P(word | class_label) based on the training data frequencies."""
    if class_label == "spam":
        count = spam_word_counts.get(word, 0)
        return count / N_spam_words if N_spam_words > 0 else 0.0
    elif class_label == "ham":
        count = ham_word_counts.get(word, 0)
        return count / N_ham_words if N_ham_words > 0 else 0.0
    else:
        raise ValueError("Invalid class_label. Choose 'spam' or 'ham'.")

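# Test the conditional probabilities for some example words
# Note: "call" and "to" are in ENGLISH_STOP_WORDS, so they were filtered out of the
# training counts in Question 4 and their unsmoothed probabilities come out as 0.
test_words = ["free", "call", "to"]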
for w in test_words:
    print(f"\nFor word '{w}':")
    print(f"P({w}|spam) = {conditional_prob(w, 'spam'):.6f}")
    print(f"P({w}|ham) = {conditional_prob(w, 'ham'):.6f}")

P(spam): 0.1339
P(ham): 0.8661

For word 'free':
P(free|spam) = 0.017714
P(free|ham) = 0.001628

For word 'call':
P(call|spam) = 0.000000
P(call|ham) = 0.000000

For word 'to':
P(to|spam) = 0.000000
P(to|ham) = 0.000000


## Question 6: Applying Laplace Smoothing

To handle zero probabilities for unseen words, we apply **Laplace smoothing** (also known as *add-one smoothing*).  
The idea is to pretend we saw each word at least once in each class, so that no probability is ever zero.

We compute smoothed conditional probabilities as follows:

$$
P_{\text{smooth}}(w \mid \text{spam}) = \frac{\text{count}(w, \text{spam}) + 1}{N_{\text{spam\_words}} + |V|}
$$

$$
P_{\text{smooth}}(w \mid \text{ham}) = \frac{\text{count}(w, \text{ham}) + 1}{N_{\text{ham\_words}} + |V|}
$$

Where:

- $|V|$ is the number of **unique words** in the training set vocabulary (the union of all words in spam and ham).
- $N_{\text{spam\_words}}$ is the total number of words in all spam messages.
- $N_{\text{ham\_words}}$ is the total number of words in all ham messages.

By adding 1 to all word counts, even words that weren’t seen (`count = 0`) will have a small, non-zero probability:

$$
\frac{1}{N + |V|}
$$
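
As a worked example of the formula, the sketch below applies add-one smoothing to a toy class. The counts and the names `laplace`, `toy_word_counts`, `N_toy`, and `V_toy` are made up for illustration.

```python
import math

def laplace(count, total_words, vocab_size):
    """Add-one smoothed estimate: (count + 1) / (total_words + vocab_size)."""
    return (count + 1) / (total_words + vocab_size)

# Made-up counts for a 6-word vocabulary in one class (the last word is unseen)
toy_word_counts = [3, 2, 1, 1, 1, 0]
N_toy = sum(toy_word_counts)  # 8 tokens in total
V_toy = len(toy_word_counts)  # 6 unique words

p_seen = laplace(3, N_toy, V_toy)    # (3 + 1) / (8 + 6) = 4/14
p_unseen = laplace(0, N_toy, V_toy)  # (0 + 1) / (8 + 6) = 1/14

# The smoothed estimates still form a probability distribution over the vocabulary
total_prob = sum(laplace(c, N_toy, V_toy) for c in toy_word_counts)
print(p_seen, p_unseen, math.isclose(total_prob, 1.0))
```

Adding $|V|$ to the denominator is what keeps the smoothed probabilities summing to 1 over the vocabulary.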

Your tasks:

1. Calculate the vocabulary size V (number of unique words in the training set). You can get this by taking the set union of keys from spam_word_counts and ham_word_counts, or by combining the lists of spam and ham tokens.

2. Implement a new function conditional_prob_smooth(word, class_label) that returns the Laplace-smoothed probability of a word given class:
   - Use the formulas above: numerator is count + 1, denominator is total words in class + V.
   - Use V you computed for the denominator.

3. Test this function on a word that was previously unseen in one class to ensure it's not zero. For example, pick a word that appears in ham but not in spam and check conditional_prob_smooth(word, "spam") (it should now be > 0).

4. Compare a couple of values from conditional_prob vs conditional_prob_smooth for words that have zero counts to see the difference.


In [16]:
# Step 1: Compute the vocabulary size (number of unique words across spam and ham)
# Hint: Combine keys from spam_word_counts and ham_word_counts into a set

# TODO: Create a set called `vocab` that contains all unique words in training
# YOUR CODE HERE
vocab = set(spam_word_counts.keys()).union(set(ham_word_counts.keys()))
V = len(vocab)
print(f"Vocabulary size (training): {V}")

# Step 2: Define a function to compute conditional probabilities with Laplace smoothing
# TODO: return conditional_prob_smooth for spam and ham

def conditional_prob_smooth(word, class_label):
    """Return P(word | class_label) with Laplace (add-one) smoothing."""
    # TODO: Complete the function for both 'spam' and 'ham'
    # Reuse the per-class totals from Question 4 instead of re-summing on every call
    if class_label == "spam":
        count = spam_word_counts.get(word, 0)
        return (count + 1) / (N_spam_words + V)
    elif class_label == "ham":
        count = ham_word_counts.get(word, 0)
        return (count + 1) / (N_ham_words + V)
    else:
        raise ValueError("class_label must be 'spam' or 'ham'")

# Step 3: Pick a word that is in ham messages but not in spam messages
# This will help us compare smoothed vs unsmoothed probabilities

# TODO: Loop over ham_word_counts to find a word not seen in spam_word_counts
unseen_in_spam = None

for word in ham_word_counts:
    if word not in spam_word_counts:
        unseen_in_spam = word
        break

# Print results using both smoothed and unsmoothed functions
if unseen_in_spam:
    print(f"Word '{unseen_in_spam}' is unseen in spam training data.")
    print(f"P({unseen_in_spam}|spam) without smoothing = {conditional_prob(unseen_in_spam, 'spam'):.6f}")
    print(f"P({unseen_in_spam}|spam) with smoothing = {conditional_prob_smooth(unseen_in_spam, 'spam'):.6f}")


Vocabulary size (training): 7459
Word 'boat' is unseen in spam training data.
P(boat|spam) without smoothing = 0.000000
P(boat|spam) with smoothing = 0.000056


## Question 7: Implementing the Naive Bayes Classifier

Using the probabilities we derived, we can now classify a new message as spam or ham.

According to Bayes’ theorem, we compute:

$$
P(\text{spam} \mid \text{message}) \propto P(\text{spam}) \prod_{w \in \text{message}} P(w \mid \text{spam})
$$

$$
P(\text{ham} \mid \text{message}) \propto P(\text{ham}) \prod_{w \in \text{message}} P(w \mid \text{ham})
$$

To avoid numerical underflow from multiplying many small probabilities, we work in **log space**. Taking logarithms turns each product into a sum and gives a **log-posterior score** for each class. The constant $C = -\log P(\text{message})$ is the same for both classes, so it can be ignored when comparing them:

$$
\log P(\text{spam} \mid \text{message}) = \log P(\text{spam}) + \sum_{w \in \text{message}} \log P(w \mid \text{spam}) + C
$$

$$
\log P(\text{ham} \mid \text{message}) = \log P(\text{ham}) + \sum_{w \in \text{message}} \log P(w \mid \text{ham}) + C
$$

We then predict the class with the higher log-probability.
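
The underflow problem can be seen in a short sketch with made-up likelihoods: the direct product collapses to zero in floating point, while the sum of logs remains an ordinary number that can still be compared between classes.

```python
import math

# 100 made-up word likelihoods, each 1e-5 (purely illustrative)
probs = [1e-5] * 100

product = math.prod(probs)                 # true value 1e-500 is below the float range
log_sum = sum(math.log(p) for p in probs)  # stays a finite, comparable number

print(product)  # 0.0
print(log_sum)
```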

---

Your tasks:

- Implement a function `predict_naive_bayes(message)` that:
  - Preprocesses the input message using your `preprocess_text` function.
  - Computes the **log-probability** of the message being spam and ham using Laplace-smoothed conditional probabilities.
  - Returns the class `"spam"` if the spam log-probability is higher, otherwise `"ham"`.

- Test your function with two example messages:
  - `"Congratulations! You've won a free lottery. Call now to claim $$$"` (likely spam)
  - `"Hey, are we still on for dinner tonight?"` (likely ham)

---

Hint:
Use `math.log()` for computing logarithms, and be sure your probabilities are nonzero (you should use the Laplace-smoothed ones).


In [None]:
import math

# Question 7: Naive Bayes prediction function

def predict_naive_bayes(message):
    """
    Predict whether a given message is "spam" or "ham" using the trained Naive Bayes model.
    Returns the predicted label as a string.
    """
    # Step 1: Preprocess the message
    # TODO: Tokenize and clean the message
    tokens = preprocess_text(message, remove_stopwords=True)  # same preprocessing as in training (Question 4)

    # Step 2: Initialize log probabilities with prior probabilities
    # TODO: Use math.log on the prior probabilities P_spam and P_ham
    log_prob_spam = math.log(P_spam)
    log_prob_ham = math.log(P_ham)

    # Step 3: Add log conditional probabilities for each word
    for word in tokens:
        if word in vocab:
            # TODO: Add log P(word | spam) and log P(word | ham)
            log_prob_spam += math.log(conditional_prob_smooth(word, "spam"))
            log_prob_ham += math.log(conditional_prob_smooth(word, "ham"))
        # If the word is not in the vocabulary, skip it

    # Step 4: Return the class with the higher log probability
    # TODO: Compare log_prob_spam and log_prob_ham
    if log_prob_spam > log_prob_ham:
        return "spam"
    else:
        return "ham"

# Quick tests for the predictor
test_messages = [
    "Congratulations! You've won a free lottery. Call now to claim $$$",  # likely spam
    "Hey, are we still on for dinner tonight?",  # likely ham
]
for msg in test_messages:
    print(f"Message: {msg}")
    print(f"Predicted: {predict_naive_bayes(msg)}")
    print("---")


Message: Congratulations! You've won a free lottery. Call now to claim $$$
Predicted: spam
---
Message: Hey, are we still on for dinner tonight?
Predicted: ham
---


## Question 8: Making Predictions on the Test Set

With our classifier function ready, we can now apply it to every message in the test set and see how well it performs on unseen data. In this step, you'll generate predictions for the test set and compare them to the actual labels. 

Your tasks:

1. Use the predict_naive_bayes function to predict labels for each message in test_df.
   - You can do this with a loop, or by using pandas apply on the "Message" column of test_df.
2. Store the predictions in a list or as a new column in test_df (e.g., test_df["Predicted"]).
3. Print the first 10 predictions alongside the true labels to get a sense of how the classifier is doing.
4. (Important for grading) Also create two lists or arrays:
   - y_true for the true labels in the test set,
   - y_pred for the predicted labels.

We'll use y_true and y_pred in the next step to compute evaluation metrics.

In [18]:
# Question 8: Predict on test set

# Generate predictions for each message in the test set
y_true = list(test_df["Label"])
y_pred = []  # this will hold our predicted labels

# Loop over each message in test set and predict
for msg in test_df["Message"]:
    pred_label = predict_naive_bayes(msg)
    y_pred.append(pred_label)

# Optionally, add predictions to the test DataFrame for convenience
test_df["Predicted"] = y_pred

# Print first 10 results: actual vs predicted
print("Sample predictions (Actual -> Predicted):")
for i in range(10):
    actual = y_true[i]
    predicted = y_pred[i]
    print(f"{actual} -> {predicted}")


Sample predictions (Actual -> Predicted):
ham -> ham
ham -> ham
spam -> spam
ham -> ham
spam -> spam
ham -> ham
ham -> ham
ham -> ham
ham -> ham
ham -> ham


## Question 9: Evaluating Model Performance

Now that we have predictions on the test set, let's evaluate how well our Naive Bayes classifier performed. We will calculate the following evaluation metrics:
1. Accuracy: the proportion of messages correctly classified.
2. Precision (for the "spam" class): among the messages predicted as spam, what fraction are actually spam.
3. Recall (for the "spam" class): among the actual spam messages, what fraction did the classifier correctly identify as spam.
4. F1-score: the harmonic mean of precision and recall, giving a single measure of classifier quality for the positive class.
For these metrics, we'll treat "spam" as the positive class. (It's common in binary classification to focus on the performance for the positive class of interest, here spam detection.) 

Your tasks:
1. Calculate the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) by comparing y_pred and y_true:
   - TP: cases where the true label is spam and the model predicted spam.
   - TN: cases where the true label is ham and the model predicted ham.
   - FP: cases where the true label is ham but the model predicted spam (a ham message incorrectly flagged as spam).
   - FN: cases where the true label is spam but the model predicted ham (a spam message missed by the classifier).
2. Using TP, TN, FP, FN, compute:
   - accuracy = (TP + TN) / total_test_messages
   - precision = TP / (TP + FP) (if TP+FP is 0, set precision to 0 to avoid division by zero).
   - recall = TP / (TP + FN) (if TP+FN is 0, set recall to 0).
   - f1 = 2 * precision * recall / (precision + recall) (if precision+recall is 0, then F1 can be set to 0).
3. Print out the four metrics.
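
A minimal sketch of these definitions on six made-up labels, with "spam" as the positive class, might look like this. The lists `toy_true` and `toy_pred` and the lowercase variable names are hypothetical and separate from the assignment's `y_true` and `y_pred`.

```python
# Made-up labels for six test messages (illustration only)
toy_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "spam", "ham", "ham"]

pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")  # 2
tn = sum(1 for t, p in pairs if t == "ham" and p == "ham")    # 2
fp = sum(1 for t, p in pairs if t == "ham" and p == "spam")   # 1
fn = sum(1 for t, p in pairs if t == "spam" and p == "ham")   # 1

toy_accuracy = (tp + tn) / len(pairs)
toy_precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
toy_recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
denom = toy_precision + toy_recall
toy_f1 = 2 * toy_precision * toy_recall / denom if denom > 0 else 0.0

print(tp, tn, fp, fn)  # 2 2 1 1
print(f"{toy_accuracy:.4f} {toy_precision:.4f} {toy_recall:.4f} {toy_f1:.4f}")
```

Here precision and recall are both 2/3, so their harmonic mean, the F1-score, is also 2/3.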

In [19]:
# Initialize counts
TP = FP = TN = FN = 0

for actual, pred in zip(y_true, y_pred):
    if actual == "spam" and pred == "spam":
        TP += 1
    elif actual == "ham" and pred == "ham":
        TN += 1
    elif actual == "ham" and pred == "spam":
        FP += 1
    elif actual == "spam" and pred == "ham":
        FN += 1


# Compute metrics
total = TP + TN + FP + FN

accuracy = (TP + TN) / total if total > 0 else 0
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")


Accuracy:  0.9848
Precision: 0.9716
Recall:    0.9133
F1-score:  0.9416


## Question 10: Interpreting Results and Discussing Limitations

Finally, let's interpret our results and reflect on the Naive Bayes classifier:
1. Performance Discussion: Look at the accuracy, precision, recall, and F1 you obtained. Are they satisfactory for a spam filter? Did the model perform better at identifying ham vs spam (check precision and recall values for spam)? For example, if precision is high but recall is low, the filter rarely flags ham as spam (good) but misses some spam (not ideal). Share your observations.
2. Effect of Stop Words Removal: If you removed stop words in preprocessing, do you think it helped the model? (You could compare with a run without stop word removal to see the difference, if time permits.) Typically, removing stop words might slightly improve or have minimal effect on spam detection, since stop words are common in both ham and spam and don't carry much discriminatory power. Briefly explain your reasoning or findings.
3. Limitations of Naive Bayes: Discuss some limitations of the Naive Bayes approach for text classification in this context:
   - Independence assumption: Naive Bayes assumes that words in a message occur independently given the class, which is not strictly true (e.g., phrases or word combinations aren't considered).
   - Misleading evidence: If a ham message contains a very "spammy" word or vice versa, Naive Bayes will weigh that word heavily, possibly ignoring context.
   - Data sparsity: If a spam message contains words never seen before (which we handled with smoothing), the model might still be unsure. Naive Bayes doesn't capture semantics: two different words are completely unrelated to it (e.g., "prize" and "award" are treated as distinct features with no relationship).
   - Other limitations: It's a simple model that might not catch more subtle patterns (like character obfuscation in spam "cl1ck here", or the overall structure of messages).
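
The obfuscation point can be illustrated with a short sketch. It assumes a tokenizer like the one from Question 2 and a tiny made-up vocabulary; the names `tokenize`, `known_vocab`, and `matches` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, and split."""
    return re.sub(r"[^a-z0-9]", " ", text.lower()).split()

known_vocab = {"click", "here", "prize"}  # hypothetical training vocabulary

matches = {}
for msg in ["Click here", "cl1ck h3re"]:
    tokens = tokenize(msg)
    # Only exact token matches contribute evidence to the classifier
    matches[msg] = [t for t in tokens if t in known_vocab]
    print(tokens, "->", matches[msg])
```

The obfuscated message produces the tokens "cl1ck" and "h3re", which match nothing in the vocabulary, so none of the evidence learned for "click" and "here" applies to it.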

Your tasks:

1. Write a short discussion (3-5 sentences for each point above) interpreting the metrics and discussing the effect of stop word removal and limitations of Naive Bayes for spam detection.
2. Please provide your answer in the markdown cell below (no code needed here, just explanation).

**Answer:**

*Performance:* The classifier achieved an accuracy of 98.5%, a precision of 97.2%, a recall of 91.3%, and an F1-score of 94.2% on the test set, which indicates good overall performance. The high precision means that when the model flags a message as spam it is almost always correct, so few ham messages are mislabeled. The lower recall means that it misses some spam messages (false negatives), which is the main area for improvement. Overall, the balance between precision and recall is good enough for the filter to be considered reliable.


*Effect of Stop Words Removal:* Removing stop words lets the model focus on words that are more useful for classification. This can reduce noise and may improve precision and recall, although a run without stop word removal would be needed to measure the effect.


*Limitations of Naive Bayes:* Naive Bayes assumes each word is independent of the others, which isn’t always true in real messages. It doesn’t take word order or meaning into account, whereas humans rely on context. It can also struggle with words it hasn’t seen before. It is an interesting model that works well here, but it is not 100% accurate.

