# Notebook E-tivity 3 CE4021 Task 2

<hr style="border:2px solid gray">

## Imports

In [None]:
#None

If you believe required imports are missing, please contact your moderator.

<hr style="border:2px solid gray">

## Task 2

Use the information below to create a Naive Bayes spam filter. Test your filter using the messages in `new_emails`. You may add as many cells as you require to complete the task.

In [7]:
previous_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
previous_ham = ['Your activity report','benefits physical activity', 'the importance vows']
new_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

In [22]:
def calculate_prior_probabilities(ham_emails, spam_emails):
    """
    Calculate prior probabilities P(ham) and P(spam).
    
    :param ham_emails: List of ham emails.
    :param spam_emails: List of spam emails.
    :return: Tuple of prior probabilities for ham and spam.
    """
    total_emails = len(ham_emails) + len(spam_emails)
    prior_ham = len(ham_emails) / total_emails
    prior_spam = len(spam_emails) / total_emails
    return prior_ham, prior_spam

def calculate_conditional_probabilities(ham_emails, spam_emails):
    """
    Calculate conditional probabilities P(word|ham) and P(word|spam).
    
    :param ham_emails: List of ham emails.
    :param spam_emails: List of spam emails.
    :return: Tuple of dictionaries with conditional probabilities for each word in ham and spam.
    """
    word_counts_ham = {}
    word_counts_spam = {}
    
    for email in ham_emails:
        words = email.split()
        for word in words:
            word_counts_ham[word] = word_counts_ham.get(word, 0) + 1

    for email in spam_emails:
        words = email.split()
        for word in words:
            word_counts_spam[word] = word_counts_spam.get(word, 0) + 1
    
    # Adding Laplace smoothing and calculating conditional probabilities
    vocabulary = set(word_counts_ham.keys()) | set(word_counts_spam.keys())
    total_words_ham = sum(word_counts_ham.values())
    total_words_spam = sum(word_counts_spam.values())

    conditional_probabilities_ham = {word: (word_counts_ham.get(word, 0) + 1) / (total_words_ham + len(vocabulary)) for word in vocabulary}
    conditional_probabilities_spam = {word: (word_counts_spam.get(word, 0) + 1) / (total_words_spam + len(vocabulary)) for word in vocabulary}
    
    return conditional_probabilities_ham, conditional_probabilities_spam

def classify_email(email, prior_ham, prior_spam, conditional_probabilities_ham, conditional_probabilities_spam):
    """
    Classify an email as ham or spam using Bayes' Rule.
    
    :param email: The email text to be classified.
    :param prior_ham: Prior probability for ham.
    :param prior_spam: Prior probability for spam.
    :param conditional_probabilities_ham: Dictionary with conditional probabilities for each word in ham.
    :param conditional_probabilities_spam: Dictionary with conditional probabilities for each word in spam.
    :return: Classification result ('ham' or 'spam').
    """
    words = email.split()
    
    # Initialize probabilities
    prob_ham = prior_ham
    prob_spam = prior_spam
    
    # Multiply in P(word|class) for each word in the email.
    # Words absent from the training vocabulary get a factor of 1, i.e. they are ignored.
    for word in words:
        prob_ham *= conditional_probabilities_ham.get(word, 1)
        prob_spam *= conditional_probabilities_spam.get(word, 1)
    
    # Compare probabilities and classify
    if prob_spam > prob_ham:
        return 'spam'
    else:
        return 'ham'

# Build the model
prior_ham, prior_spam = calculate_prior_probabilities(previous_ham, previous_spam)
conditional_probabilities_ham, conditional_probabilities_spam = calculate_conditional_probabilities(previous_ham, previous_spam)

# Classify new emails
for label, emails in new_emails.items():
    for email in emails:
        classification = classify_email(email, prior_ham, prior_spam, conditional_probabilities_ham, conditional_probabilities_spam)
        print(f"Email: '{email}' is classified as '{classification}'")


Email: 'renew your password' is classified as 'spam'
Email: 'renew your vows' is classified as 'spam'
Email: 'benefits of our account' is classified as 'spam'
Email: 'the importance of physical activity' is classified as 'ham'
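
The printout above omits the true labels, so it is worth checking the predictions against `new_emails`. The sketch below hard-codes the four predictions from the output above (an assumption made to keep the block self-contained) and computes the accuracy. The names `predictions`, `misclassified` and `accuracy` are illustrative and not part of the original cell.

```python
# True labels, taken from the new_emails dictionary defined earlier.
new_emails = {
    'spam': ['renew your password', 'renew your vows'],
    'ham': ['benefits of our account', 'the importance of physical activity'],
}

# Predictions copied from the printed output above (hard-coded for illustration).
predictions = {
    'renew your password': 'spam',
    'renew your vows': 'spam',
    'benefits of our account': 'spam',
    'the importance of physical activity': 'ham',
}

correct = 0
total = 0
misclassified = []
for true_label, emails in new_emails.items():
    for email in emails:
        total += 1
        if predictions[email] == true_label:
            correct += 1
        else:
            misclassified.append((email, true_label, predictions[email]))

accuracy = correct / total
print(f"Accuracy: {correct}/{total} = {accuracy:.2f}")
for email, true_label, predicted in misclassified:
    print(f"Misclassified: '{email}' (true: {true_label}, predicted: {predicted})")
```

Three of the four messages are classified correctly. The ham message 'benefits of our account' is labelled spam because 'our' and 'account' occur only in the spam training messages.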


<hr style="border:2px solid gray">

## The Naive Bayes classifier is based on Bayes' theorem:



$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

<strong>In email classification:</strong>
- $P(A \mid B)$ is the posterior: the probability that an email is spam (or ham) given its content.
- $P(B \mid A)$ is the likelihood: the probability of observing the email content given that it is spam (or ham).
- $P(A)$ is the prior probability that an email is spam (or ham), i.e., without any knowledge of its content.
- $P(B)$ is the probability of observing the email content, regardless of class.
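
Because $P(B)$ is the same for both classes, it can be recovered by summing the two unnormalised scores, which turns them into posteriors that add up to 1. The following is a minimal sketch, assuming each score has the form prior times likelihood. `posterior_from_scores` is a hypothetical helper, and the example numbers are worked by hand for 'renew your vows' using the smoothed word probabilities defined further down.

```python
def posterior_from_scores(score_ham, score_spam):
    """Normalise prior * likelihood scores so that they sum to 1."""
    evidence = score_ham + score_spam  # plays the role of P(B)
    if evidence == 0:
        raise ValueError("both scores are zero; the posterior is undefined")
    return score_ham / evidence, score_spam / evidence

# Scores for 'renew your vows', worked by hand from the training data.
# 'renew' is not in the training vocabulary, so it is skipped.
score_ham = (3 / 7) * (1 / 25) * (2 / 25)    # P(ham) * P(your|ham) * P(vows|ham)
score_spam = (4 / 7) * (4 / 30) * (1 / 30)   # P(spam) * P(your|spam) * P(vows|spam)

p_ham, p_spam = posterior_from_scores(score_ham, score_spam)
print(f"P(ham | email)  = {p_ham:.3f}")
print(f"P(spam | email) = {p_spam:.3f}")
```

The classifier only needs to know which score is larger, so this normalisation is optional. It is useful when an actual probability is wanted, for example to apply a confidence threshold.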

<strong>To calculate prior probabilities:</strong>

$$ P(\text{ham}) = \frac{\text{Number of ham emails}}{\text{Total number of emails}} $$
$$ P(\text{spam}) = \frac{\text{Number of spam emails}}{\text{Total number of emails}} $$
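
As a quick worked check of the prior formulas, the sketch below applies them to the training lists from the task. It repeats the two lists so that it runs on its own; nothing beyond the formulas above is assumed.

```python
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

total = len(previous_ham) + len(previous_spam)
prior_ham = len(previous_ham) / total
prior_spam = len(previous_spam) / total

print(f"P(ham)  = {len(previous_ham)}/{total} = {prior_ham:.3f}")
print(f"P(spam) = {len(previous_spam)}/{total} = {prior_spam:.3f}")
```

With three ham and four spam training messages, the priors slightly favour spam before any word is considered.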

<strong>Conditional probabilities for each word $w$:</strong>

$$ P(w|\text{ham}) = \frac{\text{Count of } w \text{ in ham emails} + 1}{\text{Total words in ham emails} + \text{Size of vocabulary}} $$
$$ P(w|\text{spam}) = \frac{\text{Count of } w \text{ in spam emails} + 1}{\text{Total words in spam emails} + \text{Size of vocabulary}} $$

<strong>For email classification:</strong>

$$ P(\text{ham}|w_1, w_2, ..., w_n) \propto P(\text{ham}) \times \prod_{i=1}^{n} P(w_i|\text{ham}) $$
$$ P(\text{spam}|w_1, w_2, ..., w_n) \propto P(\text{spam}) \times \prod_{i=1}^{n} P(w_i|\text{spam}) $$
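
Multiplying many probabilities below 1 can underflow for long emails. A common alternative, sketched here as an assumption rather than as what the cell above does, is to add log-probabilities instead. The decision is unchanged because the logarithm is monotonic. `classify_email_log` is a hypothetical name, the block needs the standard-library `math` module, and like the original function it skips words that are missing from the probability tables.

```python
import math

def classify_email_log(email, prior_ham, prior_spam, cond_ham, cond_spam):
    """Classify by comparing log P(class) + sum of log P(word|class)."""
    log_ham = math.log(prior_ham)
    log_spam = math.log(prior_spam)
    for word in email.split():
        # Skip words outside the training vocabulary (same as a factor of 1).
        if word in cond_ham and word in cond_spam:
            log_ham += math.log(cond_ham[word])
            log_spam += math.log(cond_spam[word])
    return 'spam' if log_spam > log_ham else 'ham'

# Small hand-made probability tables, for illustration only.
cond_ham = {'your': 1 / 25, 'vows': 2 / 25}
cond_spam = {'your': 4 / 30, 'vows': 1 / 30}

result = classify_email_log('renew your vows', 3 / 7, 4 / 7, cond_ham, cond_spam)
print(result)
```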


The conditional probabilities (or likelihoods) for a word $w$ are:

Given:
- Total words in ham emails: $N_{\text{ham}}$
- Total words in spam emails: $N_{\text{spam}}$
- Count of word $w$ in ham emails: $C_{w,\text{ham}}$
- Count of word $w$ in spam emails: $C_{w,\text{spam}}$
- Vocabulary size (unique words across all emails): $V$

Using Laplace smoothing:

1. For ham:
$$ P(w|\text{ham}) = \frac{C_{w,\text{ham}} + 1}{N_{\text{ham}} + V} $$

2. For spam:
$$ P(w|\text{spam}) = \frac{C_{w,\text{spam}} + 1}{N_{\text{spam}} + V} $$
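
To make the smoothing formulas concrete, the sketch below evaluates them for the single word 'password' on the training lists from the task. `count_words` is a hypothetical helper introduced only for this illustration. It counts words the same way as `calculate_conditional_probabilities`, by splitting on whitespace without changing case.

```python
previous_spam = ['send us your password', 'review our website',
                 'send your password', 'send us your account']
previous_ham = ['Your activity report', 'benefits physical activity',
                'the importance vows']

def count_words(emails):
    """Return a dict mapping each whitespace-separated word to its count."""
    counts = {}
    for email in emails:
        for word in email.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

counts_ham = count_words(previous_ham)
counts_spam = count_words(previous_spam)

n_ham = sum(counts_ham.values())                        # N_ham
n_spam = sum(counts_spam.values())                      # N_spam
vocab_size = len(set(counts_ham) | set(counts_spam))    # V

word = 'password'
p_word_ham = (counts_ham.get(word, 0) + 1) / (n_ham + vocab_size)
p_word_spam = (counts_spam.get(word, 0) + 1) / (n_spam + vocab_size)

print(f"N_ham={n_ham}, N_spam={n_spam}, V={vocab_size}")
print(f"P({word}|ham)  = {p_word_ham:.3f}")
print(f"P({word}|spam) = {p_word_spam:.3f}")
```

Without the added 1, the ham probability for 'password' would be 0, and no email containing that word could ever be classified as ham. Note also that 'Your' and 'your' are counted as different words because the split is case-sensitive.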

<hr style="border:2px solid gray">

## Reflection

- The code employs Bayes' theorem to build a simple Naive Bayes classifier for email spam detection.
- Laplace smoothing ensures that a word seen in only one class does not reduce the other class's probability to zero. Words that never appeared in the training data are outside the vocabulary, so `classify_email` ignores them by multiplying by 1.
- The structure of the code is modular, with each function performing a specific task, making it readable and maintainable.
- A potential improvement could be to handle word variations (e.g., stemming or lemmatization) and consider removing common stopwords.
- The code assumes words are conditionally independent of each other given the class (the naive assumption), which is often not the case in real emails.
- For practical applications, the dataset's size, diversity, and preprocessing steps will play a significant role in the model's effectiveness.
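
The preprocessing idea mentioned in the reflection can be sketched as follows. This is an assumption about one possible approach, not part of the submitted filter: `normalise` is a hypothetical helper, and `STOPWORDS` is a tiny hand-picked set chosen for illustration rather than a standard list. Stemming and lemmatization are left out because they would need a third-party library.

```python
# Hand-picked stopwords for illustration only; not a standard list.
STOPWORDS = {'the', 'of', 'our', 'us'}

def normalise(email, stopwords=STOPWORDS):
    """Lower-case an email, split it on whitespace, and drop stopwords."""
    return [word for word in email.lower().split() if word not in stopwords]

print(normalise('Your activity report'))
print(normalise('the importance of physical activity'))
```

Lower-casing would merge 'Your' and 'your' into one vocabulary entry. Applying such a step to both the training and the new emails would change the word counts and therefore possibly the classifications shown earlier.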
