# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [232]:
from sklearn.model_selection import train_test_split
import os, json, random
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [233]:
# Simulate 1000 coin flips; each flip is the string 'heads' or 'tails'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})
df_coin

Unnamed: 0,flip_result
0,heads
1,tails
2,tails
3,tails
4,tails
...,...
995,tails
996,tails
997,tails
998,tails


### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [234]:
flip_counts = df_coin['flip_result'].value_counts()
flip_counts

flip_result
tails    521
heads    479
Name: count, dtype: int64

### Calculating Probabilities
Next, we will estimate the probability of heads and of tails as the relative frequency of each outcome in the simulated flips.

In [235]:
p_heads = flip_counts['heads'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")

Probability of Heads: 0.479


In [236]:
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Tails: {p_tails}")

Probability of Tails: 0.521
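
The values above are estimates from a single run of 1000 flips, so they land near 0.5 rather than exactly on it. As a minimal sketch (the seed and the list of sample sizes are arbitrary choices made for illustration, not part of the lab), the following shows how the estimate for heads changes as the number of flips grows:

```python
import numpy as np

# Seeded generator so the run is reproducible (the seed value 0 is arbitrary).
rng = np.random.default_rng(0)

# Sample sizes picked only for illustration.
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = {}
for n in sample_sizes:
    flips = rng.choice(['heads', 'tails'], size=n)
    # Relative frequency of 'heads' among the n simulated flips.
    estimates[n] = float(np.mean(flips == 'heads'))
    print(f"n = {n:>7}: estimated P(heads) = {estimates[n]:.4f}")
```

Larger samples tend to give estimates closer to the true value of 0.5, which is why the 1000-flip result is close but not exact.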


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [237]:
# Create a simulated email classification (spam / not spam) dataset with n_samples data points (1000 by default).
def random_data(n_samples = 1000):
    # Simulating data
    np.random.seed(42)
    data = {
        'email_length': np.random.normal(100, 20, n_samples).astype(int),
        'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
        'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.5, 0.5]),
        'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
        'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
    }
    return data

df = pd.DataFrame(random_data(1250))

# Replace the random labels with labels that depend on contains_free and contains_winner
for index, row in df.iterrows():
    prob = min(1, .7 *row["contains_free"] + .7*row["contains_winner"]+.1)
    df.at[index, 'label'] = np.random.choice(['spam', 'ham'], p=[prob, 1-prob])

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)
df


Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,1,1,morning,spam
1,97,0,0,evening,ham
2,112,0,1,night,ham
3,130,0,1,morning,spam
4,95,0,0,afternoon,ham
...,...,...,...,...,...
1245,75,0,1,evening,spam
1246,100,0,1,morning,spam
1247,84,0,0,evening,spam
1248,104,0,1,morning,spam


In [238]:
df_emails = pd.read_csv('simulated_email_dataset.csv') if os.path.exists('simulated_email_dataset.csv') else df
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,1,1,morning,spam
1,97,0,0,evening,ham
2,112,0,1,night,ham
3,130,0,1,morning,spam
4,95,0,0,afternoon,ham


### Task 2: Data Preprocessing
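You need to preprocess the data for analysis. This involves normalizing and encoding the features. The cells below check for missing values, bin `email_length` into three quantile-based categories, and split the data into training and test sets.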

In [239]:
rows_with_nulls = df_emails[df_emails.isnull().any(axis=1)]  # Rows with at least one missing value
print("No `NULL` values" if rows_with_nulls.empty else rows_with_nulls)

No `NULL` values


In [240]:
cuts_email_length = ['small', 'medium', 'large']
def get_email_length_series(df):
    return pd.qcut(df['email_length'], len(cuts_email_length), labels=cuts_email_length)
df_emails['len_cuts'] = get_email_length_series(df_emails)
df_emails, df_test = train_test_split(df_emails, test_size=250, train_size=1000, random_state=42)
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label,len_cuts
1194,68,0,1,evening,spam,small
911,104,0,1,afternoon,ham,medium
422,95,0,1,morning,spam,medium
670,124,0,0,evening,ham,large
931,137,1,0,afternoon,spam,large
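
`pd.qcut` chooses its bin edges from the data it is given, so calling it separately on another frame can produce different edges for the same labels. As a hedged sketch (the toy lengths below are hypothetical, chosen only so the bins are easy to follow), `retbins=True` returns the computed edges, which `pd.cut` can then apply to new data:

```python
import pandas as pd

labels = ['small', 'medium', 'large']

# Hypothetical training lengths, chosen only to make the bins easy to follow.
train_lengths = pd.Series([10, 20, 30, 40, 50, 60])

# retbins=True also returns the bin edges that qcut computed.
train_cuts, edges = pd.qcut(train_lengths, len(labels), labels=labels, retbins=True)

# Reuse the training edges for new data with pd.cut.
new_lengths = pd.Series([15, 35, 55])
new_cuts = pd.cut(new_lengths, bins=edges, labels=labels, include_lowest=True)

print(list(train_cuts))
print(list(new_cuts))
```

With this approach, values that fall outside the training range come back as `NaN` from `pd.cut`, so they would need separate handling.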


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [241]:
class Probablity_Calcualtor():  # Computes P(A|B) = P(A ∩ B) / P(B) from row counts
    def __init__(self, a, condition_a, b, condition_b, dfMain):
        self.dfMain = dfMain  # DataFrame containing the data
        self.a = a  # Name of event A (used when printing)
        self.condition_a = condition_a  # Boolean mask of rows where event A holds
        self.b = b  # Name of event B (used when printing)
        self.condition_b = condition_b  # Boolean mask of rows where event B holds
    def length(self, df):  # Number of rows in df
        return len(df)
    def calculate_intersection(self, condition_a, condition_b, dfMain):
        # P(A ∩ B): fraction of rows where both A and B are true
        probability_a_and_b = self.length(dfMain[condition_a & condition_b]) / self.length(dfMain)
        return probability_a_and_b
    def calculate_conditional_probability(self):
        # P(B): fraction of rows where B is true
        probability_b = self.length(self.dfMain[self.condition_b]) / self.length(self.dfMain)
        # P(A ∩ B)
        probability_a_and_b = self.calculate_intersection(self.condition_a, self.condition_b, self.dfMain)
        # P(A|B) = P(A ∩ B) / P(B); return 0 when P(B) is 0 to avoid division by zero
        if probability_b > 0:
            conditional_probability_a_given_b = probability_a_and_b / probability_b
        else:
            conditional_probability_a_given_b = 0
        return conditional_probability_a_given_b
    def print_conditional_probability(self):
        result = self.calculate_conditional_probability()
        print(f"P({self.a} | {self.b}): {result:.4f}")
        return result
def calculateIndivualProbability(column, value2count, df):  # P(X = value): relative frequency of one value in a single column
    return df[column].value_counts()[value2count] / len(df)
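
The conditional probability computed by the class above can be cross-checked with plain pandas. The following is a minimal sketch on a hypothetical five-row frame (the data is made up for illustration): it computes P(contains_free = 1 | spam) once by the definition P(A ∩ B) / P(B) and once with `pd.crosstab`, where `normalize='index'` turns each row of counts into conditional probabilities given the label.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand.
toy = pd.DataFrame({
    'label':         ['spam', 'spam', 'spam', 'ham', 'ham'],
    'contains_free': [1, 1, 0, 0, 1],
})

is_spam = toy['label'] == 'spam'
has_free = toy['contains_free'] == 1

# By definition: P(A | B) = P(A and B) / P(B)
p_spam = is_spam.mean()
p_free_and_spam = (has_free & is_spam).mean()
p_free_given_spam = p_free_and_spam / p_spam

# Same quantity from a row-normalized contingency table.
cond_table = pd.crosstab(toy['label'], toy['contains_free'], normalize='index')

print(f"P(contains_free=1 | spam) = {p_free_given_spam:.4f}")
print(cond_table)
```

Two of the three spam rows contain "free", so both routes give 2/3.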

In [242]:
def jsonPrint(data): # Pretty-print a dictionary as indented JSON.
    print(json.dumps(data, indent=4))
label_interval_dict = { # Map each email_length interval to its size label
    interval:str(label)
    for label, interval in zip(cuts_email_length, sorted(pd.qcut(df_emails['email_length'], len(cuts_email_length)).unique()))
}
jsonPrint({str(k):v for k,v in label_interval_dict.items()})

{
    "(34.999, 91.0]": "small",
    "(91.0, 108.0]": "medium",
    "(108.0, 154.0]": "large"
}


In [243]:
def classify_email(df_emails, email_features): # Returns P(spam | features) under the naive Bayes assumption
    def returnedProbablityFromObj(df_emails, email_features, column, check_label):
        pXspamObj = Probablity_Calcualtor(
            column, df_emails[column] == email_features[column],
            check_label.upper(), df_emails['label'] == check_label,
            df_emails
        )
        return pXspamObj.calculate_conditional_probability()
    # Map the raw email_length to its length category ('small', 'medium', 'large').
    # If no interval contains the value, any existing 'len_cuts' entry is left unchanged.
    for interval, size in label_interval_dict.items():
        if email_features['email_length'] in interval:
            email_features.update({'len_cuts': size})
            break
    # P(spam)
    P_spam = len(df_emails[df_emails['label'] == 'spam']) / len(df_emails)
    # P(L|spam)
    P_L_given_spam = returnedProbablityFromObj(df_emails, email_features, 'len_cuts', 'spam')
    # P(F|spam)
    P_F_given_spam = returnedProbablityFromObj(df_emails, email_features, 'contains_free', 'spam')
    # P(W|spam)
    P_W_given_spam = returnedProbablityFromObj(df_emails, email_features, 'contains_winner', 'spam')
    # P(TOD|spam)
    P_TOD_given_spam = returnedProbablityFromObj(df_emails, email_features, 'time_of_day', 'spam')
    # P(ham)
    P_ham = len(df_emails[df_emails['label'] == 'ham']) / len(df_emails)
    # P(L|ham)
    P_L_given_ham = returnedProbablityFromObj(df_emails, email_features, 'len_cuts', 'ham')
    # P(F|ham)
    P_F_given_ham = returnedProbablityFromObj(df_emails, email_features, 'contains_free', 'ham')
    # P(W|ham)
    P_W_given_ham = returnedProbablityFromObj(df_emails, email_features, 'contains_winner', 'ham')
    # P(TOD|ham)
    P_TOD_given_ham = returnedProbablityFromObj(df_emails, email_features, 'time_of_day', 'ham')
    # Posterior probability of spam given the features, P(spam | L, F, W, TOD).
    # Assuming the features are conditionally independent given the label (naive Bayes):
    #   numerator   = P(L|spam) * P(F|spam) * P(W|spam) * P(TOD|spam) * P(spam)
    #   denominator = numerator + P(L|ham) * P(F|ham) * P(W|ham) * P(TOD|ham) * P(ham)
    P_spam_given_features = (
        P_L_given_spam * P_F_given_spam * P_W_given_spam * P_TOD_given_spam * P_spam
    ) / (
        P_L_given_spam * P_F_given_spam * P_W_given_spam * P_TOD_given_spam * P_spam +
        P_L_given_ham * P_F_given_ham * P_W_given_ham * P_TOD_given_ham * P_ham
    )
    return P_spam_given_features

def isSpamBool(p_spam_probability): # True when the spam posterior exceeds 0.5
    return p_spam_probability > 0.5
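
The value returned by `classify_email` can be written out step by step. This is a sketch of the standard naive Bayes derivation, under the assumption that the four features $L$ (`len_cuts`), $F$ (`contains_free`), $W$ (`contains_winner`), and $T$ (`time_of_day`) are conditionally independent given the label. Bayes' theorem gives

$$
P(\text{spam} \mid L, F, W, T) = \frac{P(L, F, W, T \mid \text{spam})\, P(\text{spam})}{P(L, F, W, T)}
$$

The independence assumption factorizes the likelihood for each class $c \in \{\text{spam}, \text{ham}\}$:

$$
P(L, F, W, T \mid c) = P(L \mid c)\, P(F \mid c)\, P(W \mid c)\, P(T \mid c)
$$

The law of total probability expands the denominator over both classes:

$$
P(L, F, W, T) = P(L, F, W, T \mid \text{spam})\, P(\text{spam}) + P(L, F, W, T \mid \text{ham})\, P(\text{ham})
$$

Combining the three lines yields the expression the function evaluates:

$$
P(\text{spam} \mid L, F, W, T) = \frac{P(L \mid \text{spam})\, P(F \mid \text{spam})\, P(W \mid \text{spam})\, P(T \mid \text{spam})\, P(\text{spam})}{\sum_{c \in \{\text{spam}, \text{ham}\}} P(L \mid c)\, P(F \mid c)\, P(W \mid c)\, P(T \mid c)\, P(c)}
$$

An email is then labeled spam when this posterior exceeds 0.5.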

### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.

In [244]:
def random_email_features(df_emails): # Randomly select email features
    return {
        'email_length': random.randint(df_emails['email_length'].min(), df_emails['email_length'].max()),
        'contains_free': random.choice([0, 1]), # random.choice(df_emails['contains_free'].unique())
        'contains_winner': random.choice([0, 1]), # random.choice(df_emails['contains_winner'].unique())
        'time_of_day': random.choice(df_emails['time_of_day'].unique())
    }
email_features = random_email_features(df_emails)
jsonPrint(email_features)

{
    "email_length": 35,
    "contains_free": 1,
    "contains_winner": 0,
    "time_of_day": "evening"
}


In [245]:
p_spam_probability = classify_email(df_emails, email_features)
print(f"P(spam|email): {p_spam_probability:.4f}")

P(spam|email): 0.7430


### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [246]:
df_test = pd.DataFrame(random_data(300)).drop(columns=['label'])
df_test['len_cuts'] = get_email_length_series(df_test)
df_test

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,len_cuts
0,109,0,1,evening,large
1,97,0,0,night,medium
2,112,1,1,afternoon,large
3,130,1,0,morning,large
4,95,1,0,evening,medium
...,...,...,...,...,...
295,86,0,1,night,small
296,117,0,0,morning,large
297,106,0,0,night,medium
298,116,1,1,morning,large


In [247]:
# Classify each test row: 'spam' when P(spam | features) > 0.5, otherwise 'ham'
df_test['label'] = df_test.apply(lambda row: 'spam' if isSpamBool(classify_email(df_emails, dict(row))) else 'ham', axis=1)
df_test

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,len_cuts,label
0,109,0,1,evening,large,spam
1,97,0,0,night,medium,ham
2,112,1,1,afternoon,large,spam
3,130,1,0,morning,large,spam
4,95,1,0,evening,medium,spam
...,...,...,...,...,...,...
295,86,0,1,night,small,spam
296,117,0,0,morning,large,ham
297,106,0,0,night,medium,ham
298,116,1,1,morning,large,spam
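
The test frame generated above has its `label` column dropped, so the predictions cannot be scored against it directly. Scoring needs data whose true labels are kept, such as the held-out split produced by `train_test_split` in Task 2. A minimal sketch of how that scoring could look, assuming a frame with the true `label` next to a `predicted` column (the frame contents and the `predicted` name are hypothetical):

```python
import pandas as pd

# Hypothetical held-out results: true labels next to the classifier's predictions.
results = pd.DataFrame({
    'label':     ['spam', 'ham', 'spam', 'ham'],
    'predicted': ['spam', 'ham', 'ham',  'ham'],
})

# Accuracy: fraction of rows where the prediction matches the true label.
accuracy = (results['label'] == results['predicted']).mean()

# Confusion matrix: rows are true labels, columns are predicted labels.
confusion = pd.crosstab(results['label'], results['predicted'])

print(f"Accuracy: {accuracy:.2f}")
print(confusion)
```

The confusion matrix separates the two kinds of error (spam predicted as ham, and ham predicted as spam), which a single accuracy number hides.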


### Task 6: Discussion
1. Discuss how Bayesian updating improves the accuracy of the classifier.
2. What are the limitations of the model built in this lab?


## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.