# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [73]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [74]:
# Simulate 1000 coin flips; each flip is recorded as the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [75]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
tails    507
heads    493
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [76]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.493
Probability of Tails: 0.507
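
The estimates above are close to 0.5 but not exactly 0.5, because 1000 flips is a finite sample. The sketch below repeats the experiment at several sample sizes to show the empirical probability settling toward 0.5 as the number of flips grows. The seed, the helper name `empirical_p_heads`, and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

def empirical_p_heads(n_flips, rng):
    """Fraction of heads in n_flips simulated fair-coin flips."""
    flips = rng.choice(['heads', 'tails'], size=n_flips)
    return np.mean(flips == 'heads')

sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = {n: empirical_p_heads(n, rng) for n in sample_sizes}
for n, p in estimates.items():
    print(f"{n:>7} flips: P(heads) ~ {p:.4f}")
```

Small samples can land far from 0.5, while the largest sample should be very close to it.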


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [77]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)

data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Replace the random labels with ones that depend on the features:
# P(spam) = min(1, 0.7 * contains_free + 0.7 * contains_winner + 0.1)
for index, row in df.iterrows():
    prob = min(1, .7 * row["contains_free"] + .7 * row["contains_winner"] + .1)
    df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1 - prob])

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)

In [78]:
# Load the simulated dataset saved above (change the file name if you use a different dataset).
# `df_emails.head()` displays the first five rows.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head()

| | email_length | contains_free | contains_winner | time_of_day | label |
|---|---|---|---|---|---|
| 0 | 109 | 0 | 0 | morning | ham |
| 1 | 97 | 0 | 0 | morning | ham |
| 2 | 112 | 0 | 0 | morning | ham |
| 3 | 130 | 1 | 0 | afternoon | spam |
| 4 | 95 | 0 | 1 | afternoon | spam |


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [79]:
# Min-max normalize email_length to the range [0, 1]
email_length_min = df_emails['email_length'].min()
email_length_max = df_emails['email_length'].max()
df_emails['email_length'] = (df_emails['email_length'] - email_length_min) / (email_length_max - email_length_min)

# Bin the normalized length into three equal-frequency categories
cutsOfEmailLength = ['small', 'medium', 'large']
df_emails['len_cuts'] = pd.qcut(df_emails['email_length'], len(cutsOfEmailLength), labels=cutsOfEmailLength)
df_emails.head()

| | email_length | contains_free | contains_winner | time_of_day | label | len_cuts |
|---|---|---|---|---|---|---|
| 0 | 0.521127 | 0 | 0 | morning | ham | large |
| 1 | 0.43662 | 0 | 0 | morning | ham | medium |
| 2 | 0.542254 | 0 | 0 | morning | ham | large |
| 3 | 0.669014 | 1 | 0 | afternoon | spam | large |
| 4 | 0.422535 | 0 | 1 | afternoon | spam | medium |
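
The preprocessing cell normalizes and bins `email_length` but leaves `time_of_day` as strings, which a counting-based classifier can use directly. If a numeric encoding is wanted, one common option is one-hot encoding. The sketch below shows min-max normalization and one-hot encoding side by side on a tiny hypothetical sample; the sample values and the `email_length_norm` column name are assumptions, not part of the lab dataset.

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as the lab dataset.
sample = pd.DataFrame({
    'email_length': [60, 100, 140],
    'time_of_day': ['morning', 'night', 'morning'],
})

# Min-max normalization: maps the smallest value to 0 and the largest to 1.
length = sample['email_length']
sample['email_length_norm'] = (length - length.min()) / (length.max() - length.min())

# One-hot encoding: one 0/1 indicator column per time_of_day category.
encoded = pd.get_dummies(sample, columns=['time_of_day'], dtype=int)
print(encoded)
```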


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [80]:
# Estimate the prior probabilities P(ham) and P(spam) from the label counts

total_emails = len(df_emails['label'])
label_counts = df_emails['label'].value_counts()
total_ham_emails = label_counts['ham']
total_spam_emails = label_counts['spam']
prob_ham = total_ham_emails / total_emails
prob_spam = total_spam_emails / total_emails
print(f'probability ham emails {prob_ham}')
print(f'probability spam emails {prob_spam}')

probability ham emails 0.41
probability spam emails 0.59
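
As a more compact alternative, pandas can return relative frequencies directly. This is a minimal sketch on hypothetical labels standing in for `df_emails['label']`; the variable names `labels` and `priors` are assumptions.

```python
import pandas as pd

# Hypothetical labels standing in for df_emails['label'].
labels = pd.Series(['spam', 'ham', 'spam', 'spam', 'ham'])

# normalize=True returns relative frequencies instead of raw counts.
priors = labels.value_counts(normalize=True)
print(priors['spam'], priors['ham'])
```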


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.
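
For reference, the posterior that this task computes can be written as follows, assuming the features $x_1, \dots, x_n$ (length category, `contains_free`, `contains_winner`, `time_of_day`) are conditionally independent given the label. The symbols $x_i$ are notation introduced here for illustration.

```latex
P(\text{spam} \mid x_1, \dots, x_n)
  = \frac{P(\text{spam}) \prod_{i=1}^{n} P(x_i \mid \text{spam})}
         {P(\text{spam}) \prod_{i=1}^{n} P(x_i \mid \text{spam})
          + P(\text{ham}) \prod_{i=1}^{n} P(x_i \mid \text{ham})}
```

The denominator is the total probability of the observed features, so the spam and ham posteriors sum to 1.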

In [83]:
# Estimate the likelihood P(feature == value | label) by counting matching rows
def calculate_probabilities(feature, df: pd.DataFrame, label, value):
    if feature == 'label':  # The label is the target, not a feature
        return None
    count_label = len(df[df['label'] == label])
    if count_label == 0:  # Avoid division by zero
        return 0
    return len(df[(df[feature] == value) & (df['label'] == label)]) / count_label


# Function to calculate conditional probabilities
def calc_conditional_probs(df, data, label, features):
    """Calculate all conditional probabilities for a given label (spam or ham) using a dictionary."""
    likelihoods = {}
    for feature in features:
        likelihoods[feature] = calculate_probabilities(feature, df, label, data[feature])
    return likelihoods


# Function to classify an email using Bayes' Theorem
def bayes_classify(data, df: pd.DataFrame):
    data = dict(data)  # Work on a copy so the caller's dictionary is not modified

    # Calculate prior probabilities
    total_emails = len(df['label'])
    prob_spam = len(df[df['label'] == 'spam']) / total_emails
    prob_ham = len(df[df['label'] == 'ham']) / total_emails

    # Rows taken from the preprocessed DataFrame already carry a normalized
    # length and a 'len_cuts' bin. Only raw emails are normalized and binned
    # here; normalizing twice would push the length outside [0, 1].
    if 'len_cuts' not in data:
        data['email_length'] = (data['email_length'] - email_length_min) / (email_length_max - email_length_min)

        # Recover the bin edges that pd.qcut used on the training data
        _, length_divisions = pd.qcut(df['email_length'], 3, retbins=True)

        # Clip to the observed range so unseen lengths still fall into a bin
        length = min(max(data['email_length'], length_divisions[0]), length_divisions[-1])
        for i in range(len(length_divisions) - 1):
            # pd.qcut bins are right-closed, with the lowest edge included
            if length <= length_divisions[i + 1]:
                data['len_cuts'] = cutsOfEmailLength[i]
                break


    # Calculate likelihoods for spam and ham
    likelihoods_spam = calc_conditional_probs(df, data, 'spam', ['len_cuts', 'contains_free', 'contains_winner', 'time_of_day'])
    likelihoods_ham = calc_conditional_probs(df, data, 'ham', ['len_cuts', 'contains_free', 'contains_winner', 'time_of_day'])

    # Extract probabilities
    P_Len_given_spam = likelihoods_spam['len_cuts']
    P_free_given_spam = likelihoods_spam['contains_free']
    P_winner_given_spam = likelihoods_spam['contains_winner']
    P_tod_given_spam = likelihoods_spam['time_of_day']

    P_Len_given_ham = likelihoods_ham['len_cuts']
    P_free_given_ham = likelihoods_ham['contains_free']
    P_winner_given_ham = likelihoods_ham['contains_winner']
    P_tod_given_ham = likelihoods_ham['time_of_day']

    # Calculate P(spam | data) using Bayes' Theorem
    numerator_spam = P_Len_given_spam * P_free_given_spam * P_winner_given_spam * P_tod_given_spam * prob_spam
    numerator_ham = P_Len_given_ham * P_free_given_ham * P_winner_given_ham * P_tod_given_ham * prob_ham

    P_spam_given_data = numerator_spam / (numerator_spam + numerator_ham)

    # Return the probability of the email being spam
    return P_spam_given_data
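
One caveat of count-based likelihoods is that a feature value never seen with a label gets probability 0, which zeroes out the whole product (and makes the final division undefined if it happens for both labels). A common remedy is add-alpha (Laplace) smoothing. The sketch below is one way to do it, not the method used in this lab; the helper name `smoothed_likelihood`, the `alpha` parameter, and the toy data are assumptions.

```python
import pandas as pd

def smoothed_likelihood(df, feature, value, label, alpha=1.0):
    """Estimate P(feature == value | label) with add-alpha (Laplace) smoothing."""
    in_class = df[df['label'] == label]
    matches = (in_class[feature] == value).sum()
    n_values = df[feature].nunique()  # number of distinct values the feature takes
    return (matches + alpha) / (len(in_class) + alpha * n_values)

# Toy data for illustration only.
toy = pd.DataFrame({
    'contains_free': [1, 1, 0, 0, 0],
    'label': ['spam', 'spam', 'spam', 'ham', 'ham'],
})

# contains_free == 1 never occurs among ham emails, so the unsmoothed
# estimate would be 0; smoothing gives (0 + 1) / (2 + 1 * 2) = 0.25.
print(smoothed_likelihood(toy, 'contains_free', 1, 'ham'))
```

With `alpha=1.0`, the smoothed likelihoods for each label still sum to 1 across the feature's values.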


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [84]:
# Compute P(spam | features) for every email in the dataset.
# Note: the same rows are used to estimate the probabilities, so this is not a held-out test.
predictions = []
for _, row in df_emails.iterrows():
    row_dict = row.to_dict()
    spam_probability = bayes_classify(row_dict, df=df_emails)
    predictions.append(spam_probability)

predictions

[0.2587605180978532,
 0.22836345053971946,
 0.2587605180978532,
 0.7715833971047873,
 0.807235884282403,
 0.9787465657569145,
 0.8404068990263758,
 0.24668404856416695,
 0.973436523293239,
 0.7469370469466098,
 0.6722947542092659,
 0.974197643187433,
 0.22314565232199796,
 0.7013036126226159,
 0.7854136670309722,
 0.973436523293239,
 0.19061268008564278,
 0.9781166161452991,
 0.7744380456419202,
 0.18540677432428695,
 0.2224739139197443,
 0.7144688323263281,
 0.22314565232199796,
 0.19061268008564278,
 0.7744380456419202,
 0.7853663553460483,
 0.973436523293239,
 0.1952168640392154,
 0.968697723547009,
 0.812486936020514,
 0.7744380456419202,
 0.7775380028663664,
 0.812486936020514,
 0.16588622382703366,
 0.25307597552275124,
 0.7013036126226159,
 0.812486936020514,
 0.9725403193735386,
 0.7013036126226159,
 0.1952168640392154,
 0.2587605180978532,
 0.22836345053971946,
 0.1952168640392154,
 0.22314565232199796,
 0.749999913547945,
 0.18540677432428695,
 0.7145260913282749,
 0.74693704
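
The list above contains spam probabilities, not class labels, so evaluating the model still requires a decision threshold and a comparison with the true labels. The sketch below shows one way to do that with plain NumPy. The function name `evaluate`, the 0.5 threshold, and the demo values are assumptions for illustration and are not results from the lab dataset.

```python
import numpy as np

def evaluate(spam_probabilities, true_labels, threshold=0.5):
    """Threshold P(spam) into labels and return accuracy plus confusion counts."""
    probs = np.asarray(spam_probabilities, dtype=float)
    truth = np.asarray(true_labels)
    predicted = np.where(probs >= threshold, 'spam', 'ham')
    counts = {
        'true_spam': int(np.sum((predicted == 'spam') & (truth == 'spam'))),
        'false_spam': int(np.sum((predicted == 'spam') & (truth == 'ham'))),
        'true_ham': int(np.sum((predicted == 'ham') & (truth == 'ham'))),
        'false_ham': int(np.sum((predicted == 'ham') & (truth == 'spam'))),
    }
    accuracy = float(np.mean(predicted == truth))
    return accuracy, counts

# Hypothetical values for illustration only, not results from the lab dataset.
demo_probs = [0.9, 0.2, 0.7, 0.4]
demo_labels = ['spam', 'ham', 'ham', 'spam']
accuracy, counts = evaluate(demo_probs, demo_labels)
print(accuracy, counts)
```

In the notebook, this could be called as `evaluate(predictions, df_emails['label'])`. Because the likelihoods were estimated from the same rows, that figure is likely optimistic; holding out a subset, for example with `df_emails.sample(frac=0.2, random_state=0)`, is one way to get a fairer estimate.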

### Task 6: Discussion
1. Discuss how Bayesian updating improves the accuracy of the classifier.
2. What are the limitations of the model built in this lab?


### Answers
1. Bayesian updating revises a prior belief in light of new evidence. In our email classifier, the prior is the overall share of spam in the dataset, and the evidence is the set of features observed in an email (its length category, whether it contains "free" or "winner", and the time of day). Bayes' Theorem combines the two into a posterior probability that the email is spam. As more labeled emails become available, the priors and likelihoods can be re-estimated, so the predictions track the underlying patterns in the data more closely.

2.  - Simplistic Assumptions: The model assumes that the features are independent of each other, which might not be true in real-world scenarios. This is known as the "naive" assumption in Naive Bayes classifiers.
    - Feature Engineering: The model's performance heavily depends on the quality and relevance of the features used. If important features are missing or irrelevant features are included, the model's accuracy can be affected.
    - Static Model: The model does not adapt to changes in the data distribution over time unless explicitly updated with new data.
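
To make the idea of Bayesian updating in answer 1 concrete, the sketch below tracks a belief about the overall spam rate with a Beta distribution and updates it as batches of labeled emails arrive. This Beta-Binomial model is an illustration, not part of the classifier built in this lab; the helper names and the batch counts are hypothetical.

```python
def update_beta(alpha, beta, n_spam, n_ham):
    """Posterior Beta parameters after observing n_spam spam and n_ham ham emails."""
    return alpha + n_spam, beta + n_ham

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Start from a uniform prior Beta(1, 1): no initial opinion on the spam rate.
alpha, beta = 1, 1

# Hypothetical batches of (spam count, ham count), for illustration only.
batches = [(6, 4), (55, 45), (590, 410)]
for n_spam, n_ham in batches:
    alpha, beta = update_beta(alpha, beta, n_spam, n_ham)
    print(f"after {n_spam + n_ham:>4} more emails: estimated P(spam) = {beta_mean(alpha, beta):.3f}")
```

Each posterior becomes the prior for the next batch, which is the sense in which the estimate is updated rather than recomputed from scratch.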

## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.