# Lab 2 - Probability in Machine Learning

Welcome to the Probability in Machine Learning Lab! In this lab, we will explore how probability theory plays a crucial role in machine learning. We will start with a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment

First, let's import the necessary libraries.


In [1]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example

### Objective:
To understand basic probability and Python coding through a coin flip example.

### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [2]:
# Simulating 1000 coin flips, 0 for 'tails' and 1 for 'heads'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [3]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

flip_result
heads    519
tails    481
Name: count, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [4]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.519
Probability of Tails: 0.481


## Part 2: Bayesian Email Classifier

### Objective:
Now, you will build a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset
First, load and explore the dataset. You can either find and use a dataset or use the following code to simulate a sample dataset.

In [5]:
# The following code snippet creates a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)


In [6]:
# Load the dataset (Replace 'path_to_dataset' with the actual file path). You can uncomment the codes below. Notice what `df_emails.head()` is representing.
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,spam
2,112,0,0,morning,spam
3,130,1,0,afternoon,ham
4,95,0,1,afternoon,spam


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [7]:
df_emails = pd.read_csv('simulated_email_dataset.csv')
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,spam
2,112,0,0,morning,spam
3,130,1,0,afternoon,ham
4,95,0,1,afternoon,spam


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset.

In [9]:
# Your code for calculating the probability of spam and ham emails in the dataset goes here

total_counts=df_emails['label'].value_counts()
total_ham=total_counts['ham']
total_spam=total_counts['spam']

p_ham=total_ham/len(df_emails)
p_spam=total_spam/len(df_emails)

print(total_ham, total_spam)
print("Ham probability" , p_ham)
print("spam probability" , p_spam)



"""


"""


#probability of winner -> ham/spam
win_df=df[df_emails["contains_winner"]==1]
win_counts=win_df['label'].value_counts()
p_win_ham=win_counts['ham']/len(win_df)
p_win_spam=win_counts['spam']/len(win_df)

p_win=(len(win_df)/len(df_emails))
#prob of !winner
nwin_df=df[df_emails["contains_winner"]==0]
nwin_counts=nwin_df['label'].value_counts()
p_nwin_ham=nwin_counts['ham']/len(nwin_df)
p_nwin_spam=nwin_counts['spam']/len(nwin_df)
p_nwin=(len(nwin_df)/len(df_emails))


#p of ham/spam given sm

#df_emails["label"].value_counts()

#prob of morning Spam/ham
#prob of evening Spam/ham
#prob of afternoon Spam/ham


m_df=df_emails[df_emails["time_of_day"]=="morning"]
a_df=df_emails[df_emails["time_of_day"]=="afternoon"]
e_df=df_emails[df_emails["time_of_day"]=="evening"]
n_df=df_emails[df_emails["time_of_day"]=="night"]

m_counts=m_df['label'].value_counts()
a_counts=a_df['label'].value_counts()
e_counts=e_df['label'].value_counts()
n_counts=n_df['label'].value_counts()

p_m=(len(m_df)/len(df_emails))
p_a=(len(a_df)/len(df_emails))
p_e=(len(e_df)/len(df_emails))
p_n=(len(n_df)/len(df_emails))

#win_counts

#have probabilities of ham/spam given any single condition p(b given spam)
#have probabilities of any given condition: len(df)/1000
#have probability of spam/ham given a
#p(a|b)=p(a)*p(b|a)/p(b)



591 409
Ham probability 0.591
spam probability 0.409


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.

In [39]:
# Write a function using Bayes' Theorem for classification
#p(a) prob of spam or ham
#p(b) prob of the event
#prob of event given 

#calc p(a|b) 
#p (spam|free) given p(spam) p(free) p(free|ham) 
#replace "model" with the 

def bayes(condition, given, pba):
    return (pba*condition)/given

    
#calculate probabilities of spam/ham given constants
def model(df):
    #p(spam|free)
    free_df=df[df["contains_free"]==1]
    free_counts=free_df['label'].value_counts()
    p_free=(len(free_df)/len(df))
    p_free_s=(free_counts['spam']/total_spam) #p(free|spam)
    p_free_h=(free_counts['ham']/total_ham)
    p_s_free=bayes(p_spam, p_free, p_free_s)
    p_h_free=bayes(p_ham, p_free, p_free_h)
    
    #p(spam|win)
    win_df=df[df["contains_winner"]==1]
    win_counts=win_df['label'].value_counts()
    p_win=(len(win_df)/len(df))
    p_win_s=(win_counts['spam']/total_spam) #p(winner|spam)
    p_win_h=(win_counts['ham']/total_ham)

    p_s_win=bayes(p_spam, p_win, p_win_s)
    p_h_win=bayes(p_ham, p_win, p_win_h)
    
    #p spam|morning
    m_df=df[df["time_of_day"]=='morning']
    m_counts=m_df['label'].value_counts()
    p_m=(len(m_df)/len(df))
    p_m_s=(m_counts['spam']/total_spam) #p(morning|spam)
    p_m_h=(m_counts['ham']/total_ham)
    
    p_s_m=bayes(p_spam, p_m, p_m_s)
    p_h_m=bayes(p_ham, p_m, p_m_h)
    
    #p spam|afternoon
    a_df=df[df["time_of_day"]=='afternoon']
    a_counts=a_df['label'].value_counts()
    p_a=(len(a_df)/len(df))
    p_a_s=(a_counts['spam']/total_spam) #p(afternoon|spam)
    p_a_h=(a_counts['ham']/total_ham)
    
    p_s_a=bayes(p_spam, p_a, p_a_s) 
    p_h_a=bayes(p_ham, p_a, p_a_h)
    
    #p spam|evening
    e_df=df[df["time_of_day"]=='evening']
    e_counts=e_df['label'].value_counts()
    p_e=(len(e_df)/len(df))
    p_e_s=(e_counts['spam']/total_spam) #p(evening|spam)
    p_e_h=(e_counts['ham']/total_ham)
    
    p_s_e=bayes(p_spam, p_e, p_e_s)
    p_h_e=bayes(p_ham, p_e, p_e_h)
    
    #p spam|night
    n_df=df[df["time_of_day"]=='night']
    n_counts=n_df['label'].value_counts()
    p_n=(len(n_df)/len(df))
    p_n_s=(n_counts['spam']/total_spam) #p(night|spam)
    p_n_h=(m_counts['ham']/total_ham)
    
    
    p_s_n=bayes(p_spam, p_n, p_n_s)
    p_h_n=bayes(p_ham, p_n, p_n_h)
    
    n_df=df[df["time_of_day"]=='night']
    n_counts=n_df['label'].value_counts()
    p_n=(len(n_df)/len(df))
    p_n_s=(n_counts['spam']/total_spam) #p(night|spam)
    p_n_h=(m_counts['ham']/total_ham)
    
    #calculate size intervals
    max_s=df["email_length"].max()
    min_s=df["email_length"].min()
    total_range=max_s-min_s
    increment=int(total_range/3)
    s_lim=min_s+increment
    m_lim=s_lim+increment
    
    #calculate size intervals

    #probability of small Spam/ham 
    sm_df=df_emails[df["email_length"]<s_lim]
    sm_counts=sm_df['label'].value_counts()
    p_sm=(len(sm_df)/len(df))
    p_sm_s=(sm_counts['spam']/total_spam)
    p_sm_h=(sm_counts['ham']/total_spam)
    
    p_s_sm=bayes (p_spam, p_sm, p_sm_s)
    p_h_sm=bayes (p_ham, p_sm, p_sm_h)
    #prob of medium Spam/ham

    med_df=df[df["email_length"]<m_lim]
    med_df=med_df[med_df["email_length"]>s_lim]
    med_counts=med_df['label'].value_counts()
    
    p_med=(len(med_df)/len(df))
    #print (med_counts)
    p_med_h=med_counts['ham']/total_ham
    p_med_s=med_counts['spam']/total_spam
    p_s_med=bayes(p_spam, p_med, p_med_s)
    p_h_med=bayes(p_ham, p_med, p_med_h)
    #prob of large Spam/ham
    
    l_df=df[df["email_length"]>m_lim]
    l_counts=l_df['label'].value_counts()
    p_l_h=l_counts['ham']/total_ham
    p_l_s=l_counts['spam']/total_spam
    p_l=(len(l_df)/len(df))
    p_s_l=bayes(p_spam, p_l, p_l_s)
    p_h_l=bayes(p_ham, p_l, p_l_h)
    
    """
    print ("free and win")
    print(p_s_free, p_h_free)
    print(p_s_win, p_h_win)
    
    print ("time")
    print(p_s_m, p_h_m)
    print (p_s_a, p_h_a)
    print (p_s_e, p_h_e)
    print (p_s_n, p_h_n)
    print ("size")
    print (p_s_sm, p_h_sm)
    print (p_s_med, p_h_med)
    print (p_s_l, p_h_l)
    """
    return p_s_free, p_h_free, p_s_win, p_h_win, p_s_m, p_h_m, p_s_a, p_h_a, p_s_e, p_h_e, p_s_n, p_h_n, p_s_sm, p_h_sm, p_s_med, p_h_med, p_s_l, p_h_l, s_lim, m_lim

p_s_free, p_h_free, p_s_win, p_h_win, p_s_m, p_h_m, p_s_a, p_h_a, p_s_e, p_h_e, p_s_n, p_h_n, p_s_sm, p_h_sm, p_s_med, p_h_med, p_s_l, p_h_l, s_lim, m_lim=model(df_emails)

#
#m_lim

### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [50]:
# Your code goes here

np.random.seed(1234567)
data_2 = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

newdf = pd.DataFrame(data_2)
print (newdf.head())
p_s_free, p_h_free, p_s_win, p_h_win, p_s_m, p_h_m, p_s_a, p_h_a, p_s_e, p_h_e, p_s_n, p_h_n, p_s_sm, p_h_sm, p_s_med, p_h_med, p_s_l, p_h_l, s_lim, m_lim=model(newdf)
print ('testing model')

#calculate whats needed for bayes theorem, which is p(a|b)->p(a)*p(b|a)/p(b)
#p(spam|free)->p(spam)*p(free|spam)/p(free)
#need to fix calculations for p_free_spam and so on
#calculate p(x|spam/ham) and replace p_free_spam/ham
#then edit it into model

def classify(email, s_lim, m_lim):   
    
    if email["email_length"]<s_lim:
        #print("small")
        ps_size=p_s_sm 
        ph_size=p_h_sm
    elif email["email_length"]>s_lim and email["email_length"]<m_lim:
        #print ("medium")
        ps_size=p_s_med
        ph_size=p_h_med
    else:
        #print ("large")
        ps_size=p_s_l
        ph_size=p_h_l
    
    if email["contains_free"]==1:
        ps_free=p_s_free
        ph_free=p_h_free
    elif email["contains_free"]==0:
        ps_free=1-p_s_free
        ph_free=1-p_h_free
    else:
        print ("check contains_free")
        ps_free, ph_free=1
    
    
    if email["contains_winner"]==1:
        ps_win=p_win_spam
        ph_win=p_win_ham
    elif email["contains_winner"]==0:
        ps_win=1-p_win_spam
        ph_win=1-p_win_ham
    else:
        print ("check contains_win")
        ps_win, ph_win=1
    
    if email["time_of_day"]=="morning":
        ps_t=p_s_m
        ph_t=p_h_m
    elif email["time_of_day"]=="afternoon":
        ps_t=p_s_a
        ph_t=p_h_a
    elif email["time_of_day"]=="evening":
        ps_t=p_s_e
        ph_t=p_h_e
    elif email["time_of_day"]=="night":
        ps_t=p_s_n
        ph_t=p_h_n
    else:
        print ("something is wrong with time")
        ps_t, ph_t=1

    final_s=ps_size*ps_free*ps_win*ps_t
    final_h=ph_size*ph_free*ph_win*ph_t
    print (final_s, final_h)
    if (final_s>final_h):

        return 'spam'
    elif (final_s<final_h):
        return 'ham'
    else:
        print ("Something is weird, could be spam or ham.")
        
    return "yolo"
#df_emails.head()    

for i in range (20):
    print (classify(newdf.loc[i], s_lim, m_lim))

#for small 13, 0 med, 3 large

   email_length  contains_free  contains_winner time_of_day label
0            89              0                0     evening   ham
1            91              0                0       night   ham
2           134              1                0     morning   ham
3           112              0                0       night  spam
4           100              0                1     morning   ham
testing model
0.06027148274786828 0.05766015240283422
spam
0.06470280339882242 0.06818098616018291
ham
0.032500770056851654 0.09666016895427644
ham
0.06470280339882242 0.06818098616018291
ham
0.03850287719277937 0.08784667788770689
ham
0.06027148274786828 0.05766015240283422
spam
0.045930840328277954 0.06839696165770438
ham
0.05507373573144391 0.061414934098485356
ham
0.0510985743190279 0.06428653659499667
ham
0.04264824217831615 0.08148666166020374
ham
0.05507373573144391 0.061414934098485356
ham
0.06512156025405262 0.09808404031453533
ham
0.03850287719277937 0.08784667788770689
ham
0.02724472596

### Task 6: Discussion
1. Which probability distribution would you choose for an email classifier? Explain your answer.
I would say a bernoulli distribution - either the email is spam or ham, nothing in between.

2. Discuss how Bayesian updating improves the accuracy of the classifier.
Bayesian updating should be able to account for different combinations of attributes; the more it updates the more accurate you should be. 

3. What are the limitations of the model built in this lab?
-Because you are multiply probabilities, the fact that every feature is more likely to be ham than spam means that the classifier will be heavily biased towards ham. Also, having to manually write out code for each feature makes it impractical to weigh many different words, or combinations of words, or account for word frequency...
it also requires that we have labels and pre-determined knowledge to set up.



## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.