<a href="https://colab.research.google.com/github/plthomps/CIS-3902-Data-Mining/blob/main/naive_bayes_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Example: Spam Classification

This notebook demonstrates **how Naive Bayes classification works mathematically** using a simple spam email example.

We will classify an email as:

- **Spam**
- **Not Spam**

based on whether it contains the words:

- **"Free"**
- **"Win"**


## Step 1: Training Data Summary

From previous emails we observed:

| Email Type | Total | Contain "Free" | Contain "Win" |
|-----------|-------|---------------|--------------|
| Spam      | 40    | 32            | 28           |
| Not Spam  | 60    | 6             | 3            |


## Step 2: Prior Probabilities

The prior probability of a class is the fraction of all training emails that belong to it, before we look at any words.


In [1]:
spam_total = 40
notspam_total = 60
total = spam_total + notspam_total

P_spam = spam_total / total
P_notspam = notspam_total / total

P_spam, P_notspam

(0.4, 0.6)

## Step 3: Likelihood Probabilities

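The likelihood is the probability that a word appears in an email of a given class. We estimate it as the number of emails in the class that contain the word, divided by the total number of emails in that class.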


In [2]:
# Likelihoods for Spam
P_free_given_spam = 32 / 40
P_win_given_spam = 28 / 40

# Likelihoods for Not Spam
P_free_given_notspam = 6 / 60
P_win_given_notspam = 3 / 60

P_free_given_spam, P_win_given_spam, P_free_given_notspam, P_win_given_notspam

(0.8, 0.7, 0.1, 0.05)
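The priors and likelihoods from Steps 2 and 3 can also be derived from the Step 1 table in one pass. The sketch below is one possible layout, not part of the original notebook: the dictionary structure and the names `counts`, `priors`, and `likelihoods` are illustrative assumptions.

```python
# Counts copied from the Step 1 table. The dictionary layout and the
# names used here are illustrative choices, not the notebook's own.
counts = {
    "spam":    {"total": 40, "free": 32, "win": 28},
    "notspam": {"total": 60, "free": 6,  "win": 3},
}

n_emails = sum(c["total"] for c in counts.values())

# Prior: fraction of all emails that belong to the class
priors = {label: c["total"] / n_emails for label, c in counts.items()}

# Likelihood: fraction of the class's emails that contain the word
likelihoods = {
    label: {word: c[word] / c["total"] for word in ("free", "win")}
    for label, c in counts.items()
}

print(priors)
print(likelihoods)
```

Keeping the raw counts in one place means that adding a word or a class only requires a new table entry, not a new hand-written division.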

## Step 4: New Email to Classify

The new email contains:

✔ "Free"  
✔ "Win"

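We compute a score for each class: the prior multiplied by the likelihood of each word in the email. These scores are proportional to the class probabilities, so the larger score identifies the more probable class.

Note: This example uses only the words that are present as evidence. An absent word is left out of the calculation here, although some variants (such as Bernoulli Naive Bayes) also multiply in the probability that a word is absent.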
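Written out, the calculation for this email applies Bayes' theorem and drops the shared denominator, which is the same for both classes. The notation below is an added sketch using the values from Steps 2 and 3:

$$
P(\text{Spam} \mid \text{Free}, \text{Win}) \propto P(\text{Spam}) \, P(\text{Free} \mid \text{Spam}) \, P(\text{Win} \mid \text{Spam}) = 0.4 \times 0.8 \times 0.7 = 0.224
$$

$$
P(\text{Not Spam} \mid \text{Free}, \text{Win}) \propto P(\text{Not Spam}) \, P(\text{Free} \mid \text{Not Spam}) \, P(\text{Win} \mid \text{Not Spam}) = 0.6 \times 0.1 \times 0.05 = 0.003
$$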


In [3]:
# Naive Bayes score for each class: prior * P(Free | class) * P(Win | class)

spam_probability = (
    P_spam *
    P_free_given_spam *
    P_win_given_spam
)

notspam_probability = (
    P_notspam *
    P_free_given_notspam *
    P_win_given_notspam
)

spam_probability, notspam_probability

(0.22400000000000003, 0.003)
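The two values above do not sum to 1 because they are unnormalized scores, not final probabilities. As a minimal sketch (the variable names are illustrative, and the inputs are retyped so the cell runs on its own), dividing each score by their sum gives the posterior probability of each class:

```python
# Unnormalized scores from Step 4 (prior * likelihoods), recomputed here
# so this cell is self-contained.
spam_score = 0.4 * 0.8 * 0.7        # P(Spam) * P(Free|Spam) * P(Win|Spam)
notspam_score = 0.6 * 0.1 * 0.05    # P(Not Spam) * P(Free|Not Spam) * P(Win|Not Spam)

# The sum of the scores plays the role of the denominator in Bayes' theorem
evidence = spam_score + notspam_score

posterior_spam = spam_score / evidence
posterior_notspam = notspam_score / evidence

print(round(posterior_spam, 4), round(posterior_notspam, 4))
# prints: 0.9868 0.0132
```

Normalizing does not change which class wins, so the comparison in Step 5 can work directly on the scores.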

## Step 5: Compare Probabilities


In [4]:
if spam_probability > notspam_probability:
    prediction = "SPAM"
else:
    prediction = "NOT SPAM"

prediction

'SPAM'
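Steps 4 and 5 can be wrapped into one reusable function. This is a sketch under stated assumptions: the function name `classify` and the dictionary inputs are hypothetical helpers, not part of the original notebook, and only the words present in the email are used as evidence, as in the example above.

```python
def classify(words_present, priors, likelihoods):
    """Score each class as prior * product of likelihoods of the present words."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words_present:
            score *= likelihoods[label][word]
        scores[label] = score
    # Pick the label with the largest score. On an exact tie, max() returns
    # the first label in the dictionary, unlike the if/else above, which
    # falls back to "NOT SPAM".
    best = max(scores, key=scores.get)
    return best, scores


# Values from Steps 2 and 3
priors = {"SPAM": 0.4, "NOT SPAM": 0.6}
likelihoods = {
    "SPAM": {"free": 0.8, "win": 0.7},
    "NOT SPAM": {"free": 0.1, "win": 0.05},
}

label, scores = classify(["free", "win"], priors, likelihoods)
print(label)

# An email that contains only "Free"
label_free_only, scores_free_only = classify(["free"], priors, likelihoods)
print(label_free_only)
```

Both calls should return `SPAM`: with "Free" alone, the spam score (0.4 × 0.8 = 0.32) still exceeds the not-spam score (0.6 × 0.1 = 0.06).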

## ✅ Result

The email is classified as **SPAM** because the spam score (0.224) is about 75 times larger than the not-spam score (0.003).

Even though spam emails are less common overall, the words **"Free"** and **"Win"** are far more likely to appear in spam.

---

## Why is it called "Naive"?

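We assumed that, within each class, the words occur independently of one another. For example, knowing that a spam email contains "Free" is assumed to tell us nothing more about whether it also contains "Win".

In reality, such words often appear together, so the assumption rarely holds exactly. Even so, the method often classifies well in practice.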
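In symbols, the assumption used in this example is that the joint likelihood of the two words factors into a product within each class. The line below is an added illustration using the Step 3 values:

$$
P(\text{Free}, \text{Win} \mid \text{Spam}) = P(\text{Free} \mid \text{Spam}) \, P(\text{Win} \mid \text{Spam}) = 0.8 \times 0.7 = 0.56
$$

In other words, the model behaves as if 56% of spam emails contain both words. The training table does not report how many spam emails actually contain both, so this value is an assumption of the model, not an observed count.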
