# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [2]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [3]:
# Simulate 1000 fair coin flips; each result is the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [4]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    510
heads    490
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [5]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.49
Probability of Tails: 0.51
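
The estimates above come from a single run of 1000 flips, so they land near 0.5 without hitting it exactly. As a minimal sketch of how the estimate tightens with more flips, the block below repeats the experiment at several sample sizes. The seed and the list of sample sizes are arbitrary choices for illustration and are not part of the original lab.

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)

# Estimate P(heads) from increasingly large samples of a fair coin
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = {}
for n in sample_sizes:
    flips = rng.choice(['heads', 'tails'], size=n)
    # Fraction of flips that came up heads
    estimates[n] = float(np.mean(flips == 'heads'))
    print(f"n = {n:>6}: estimated P(heads) = {estimates[n]:.4f}")
```

Larger samples should generally give estimates closer to 0.5, although any single run can still deviate.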


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [6]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Replace the random labels with labels that depend on the features
for index, row in df.iterrows():
    prob = min(1, 0.7 * row["contains_free"] + 0.7 * row["contains_winner"] + 0.1)
    df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1 - prob])

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)

In [7]:
# Load the dataset (replace the file name with your own path if you use a different dataset).
# `df_emails.head()` displays the first five rows.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head()

| | email_length | contains_free | contains_winner | time_of_day | label |
|---|---|---|---|---|---|
| 0 | 109 | 0 | 0 | morning | ham |
| 1 | 97 | 0 | 0 | morning | ham |
| 2 | 112 | 0 | 0 | morning | ham |
| 3 | 130 | 1 | 0 | afternoon | spam |
| 4 | 95 | 0 | 1 | afternoon | spam |


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [8]:
# Min-max normalize email_length to the [0, 1] range
email_length_min = df_emails['email_length'].min()
email_length_max = df_emails['email_length'].max()
df_emails['email_length'] = (df_emails['email_length'] - email_length_min) / (email_length_max - email_length_min)
df_emails.head()

| | email_length | contains_free | contains_winner | time_of_day | label |
|---|---|---|---|---|---|
| 0 | 0.521127 | 0 | 0 | morning | ham |
| 1 | 0.43662 | 0 | 0 | morning | ham |
| 2 | 0.542254 | 0 | 0 | morning | ham |
| 3 | 0.669014 | 1 | 0 | afternoon | spam |
| 4 | 0.422535 | 0 | 1 | afternoon | spam |
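
The task description also mentions encoding, which the cell above does not perform. As a hedged sketch of one option, the block below one-hot encodes a `time_of_day` column with `pd.get_dummies`. The `toy` frame is a hypothetical stand-in for `df_emails`, and whether encoding is needed depends on how the classifier consumes categorical features.

```python
import pandas as pd

# Toy frame that mirrors the lab's 'time_of_day' column (values are illustrative)
toy = pd.DataFrame({
    'contains_free': [0, 1, 0, 1],
    'time_of_day': ['morning', 'afternoon', 'evening', 'night'],
})

# One-hot encode the categorical column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=['time_of_day'], dtype=int)
print(encoded)
```

Each row ends up with exactly one of the new `time_of_day_*` columns set to 1.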


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [9]:
# Estimate the prior probability of each class from the label frequencies

total_emails = len(df_emails['label'])
total_ham_emails = df_emails['label'].value_counts()['ham']
total_spam_emails = df_emails['label'].value_counts()['spam']
prob_ham = total_ham_emails/total_emails
prob_spam = total_spam_emails/total_emails
print(f'Probability of ham emails: {prob_ham}')
print(f'Probability of spam emails: {prob_spam}')


Probability of ham emails: 0.41
Probability of spam emails: 0.59


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
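
Before writing the full classifier, it may help to apply Bayes' Theorem by hand to a single feature: P(spam | free) = P(free | spam) · P(spam) / P(free). The sketch below does this on a hypothetical ten-email dataset (the `toy` frame is invented for illustration) and cross-checks the result against a direct count.

```python
import pandas as pd

# Hypothetical toy data: 10 emails with one binary feature
toy = pd.DataFrame({
    'contains_free': [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
    'label': ['spam', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham'],
})

is_spam = toy['label'] == 'spam'

# Prior P(spam): fraction of emails labeled spam
p_spam = is_spam.mean()
# Likelihood P(free | spam): fraction of spam emails that contain "free"
p_free_given_spam = toy.loc[is_spam, 'contains_free'].mean()
# Evidence P(free): fraction of all emails that contain "free"
p_free = toy['contains_free'].mean()

# Bayes' Theorem
p_spam_given_free = p_free_given_spam * p_spam / p_free

# Cross-check: fraction of "free" emails that are spam, counted directly
direct = is_spam[toy['contains_free'] == 1].mean()

print(f"P(spam | contains_free = 1) = {p_spam_given_free:.2f}")
```

With these toy counts the prior is 0.4, the likelihood is 0.75, and the evidence is 0.5, so the posterior is 0.6, which matches the direct count of 3 spam emails among the 5 that contain "free".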

In [22]:
# Write a function using Bayes' Theorem for classification

def bayes_classify(email_features, df: pd.DataFrame, feature_columns=None):
    """
    Classify an email as spam or ham using Bayes' Theorem (naive Bayes).

    Parameters:
    - email_features (dict): A dictionary of feature values for the email to classify.
                             Example: {'email_length': 0.5, 'contains_free': 1, 'contains_winner': 0}
    - df (pd.DataFrame): Labeled training data with a 'label' column ('spam' or 'ham').
    - feature_columns (list): A list of feature column names to use for classification. Defaults to all columns except 'label'.

    Returns:
    - str: The predicted class ('spam' or 'ham').
    """
    if feature_columns is None:
        feature_columns = [col for col in df.columns if col != 'label']

    # Prior probabilities P(class), estimated from the label frequencies
    priors = df['label'].value_counts(normalize=True).to_dict()

    # Likelihood models P(feature | class), one per class and feature
    classes = ['spam', 'ham']
    likelihoods = {}
    for class_i in classes:
        class_subset = df[df['label'] == class_i]
        likelihoods[class_i] = {}
        for feature in feature_columns:
            column = class_subset[feature]
            if pd.api.types.is_numeric_dtype(df[feature]) and df[feature].nunique() > 2:
                # Continuous feature: model it as a Gaussian (mean, standard deviation)
                likelihoods[class_i][feature] = (column.mean(), column.std())
            else:
                # Binary or categorical feature: relative frequency of each value
                likelihoods[class_i][feature] = column.value_counts(normalize=True).to_dict()

    # Posterior scores: prior times the product of the per-feature likelihoods
    posteriors = {}
    for class_i in classes:
        posterior = priors[class_i]  # Start with the prior probability
        for feature, value in email_features.items():
            if feature not in likelihoods[class_i]:
                continue
            model = likelihoods[class_i][feature]
            if isinstance(model, dict):  # Binary or categorical feature
                posterior *= model.get(value, 1e-6)  # Use small value for unseen categories
            else:  # Continuous feature: Gaussian density at the observed value
                mean, std = model
                posterior *= np.exp(-0.5 * ((value - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
        posteriors[class_i] = posterior

    # Return the class with the highest posterior score
    return max(posteriors, key=posteriors.get)

### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [23]:
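# Hold out 20% of the emails for testing and use the rest as training data
test_df = df_emails.sample(frac=0.2, random_state=42)
train_df = df_emails.drop(test_df.index)

# Classify each held-out email using only the training rows
predictions = [bayes_classify(row.to_dict(), train_df) for _, row in test_df.iterrows()]

# Add predictions to the test dataframe and measure accuracy
test_df = test_df.assign(predicted_label=predictions)
accuracy = (test_df['predicted_label'] == test_df['label']).mean()
print(f"Accuracy: {accuracy:.3f}")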
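
Evaluating performance usually means more than checking one prediction. As a minimal sketch, the block below computes accuracy, a confusion matrix, and spam precision and recall from a hypothetical set of true and predicted labels. The `results` frame is invented for illustration; in the lab it would hold the test emails and the output of `bayes_classify`.

```python
import pandas as pd

# Hypothetical true and predicted labels for eight emails
results = pd.DataFrame({
    'label':           ['spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam'],
    'predicted_label': ['spam', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam'],
})

# Accuracy: fraction of emails whose prediction matches the true label
accuracy = (results['label'] == results['predicted_label']).mean()

# Confusion matrix: rows are true labels, columns are predicted labels
confusion = pd.crosstab(results['label'], results['predicted_label'])

# Precision and recall for the 'spam' class
true_positives = confusion.loc['spam', 'spam']
precision = true_positives / confusion['spam'].sum()    # of predicted spam, how many are spam
recall = true_positives / confusion.loc['spam'].sum()   # of true spam, how many were caught

print(confusion)
print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
```

For a spam filter, precision and recall are often more informative than accuracy alone, because the two kinds of error (a lost legitimate email versus a delivered spam email) have different costs.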

### Task 6: Discussion
1. Discuss how Bayesian updating improves the accuracy of the classifier.
2. What are the limitations of the model built in this lab?


### Answers
1. Bayesian updating allows us to continuously improve our model as new data becomes available. By updating the prior probabilities with new evidence, we can refine our predictions and make them more accurate. This iterative process helps in incorporating new information and adjusting the model to better reflect the underlying patterns in the data. In the context of our email classifier, Bayesian updating helps in dynamically adjusting the probabilities of an email being spam or ham based on the features observed in the new emails.

2.  - Simplistic Assumptions: The model assumes that the features are independent of each other, which might not be true in real-world scenarios. This is known as the "naive" assumption in Naive Bayes classifiers.
    - Feature Engineering: The model's performance heavily depends on the quality and relevance of the features used. If important features are missing or irrelevant features are included, the model's accuracy can be affected.
    - Static Model: The model does not adapt to changes in the data distribution over time unless explicitly updated with new data.
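
The Bayesian updating described in answer 1 can be sketched with a Beta prior over the spam rate, which is updated as new batches of labeled emails arrive. This is one standard way to illustrate updating, not something the classifier in this lab implements; the batch counts below are hypothetical.

```python
# Beta(alpha, beta) prior over the spam rate; Beta(1, 1) is a uniform prior
alpha, beta = 1.0, 1.0

# Hypothetical batches of newly labeled emails: (spam count, ham count)
batches = [(6, 4), (55, 45), (590, 410)]

posterior_means = []
for spam_count, ham_count in batches:
    # Conjugate update: add the observed counts to the Beta parameters
    alpha += spam_count
    beta += ham_count
    # Posterior mean of the spam rate after this batch
    posterior_means.append(alpha / (alpha + beta))
    print(f"after {int(alpha + beta - 2):>4} emails: estimated P(spam) = {posterior_means[-1]:.4f}")
```

Each posterior becomes the prior for the next batch, so the estimate moves with the evidence and becomes less sensitive to any single batch as the counts grow.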

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.