# Probability in Machine Learning

Exploring how probability theory plays a crucial role in machine learning. Using a simple coin flip example to grasp the basics and then move on to build a Bayesian email classifier. Let's dive in!

## Setting Up the Environment


In [1]:
import pandas as pd
import numpy as np

## Part 1: Coin Flip Probability Example


### Simulating Coin Flips
We will simulate flipping a coin 1000 times.


In [3]:
# Simulating 1000 coin flips, 0 for 'tails' and 1 for 'heads'
coin_flips = np.random.choice(['heads', 'tails'], size=1000)
df_coin = pd.DataFrame({'flip_result': coin_flips})
print(coin_flips)
print(df_coin)

['tails' 'tails' 'heads' 'tails' 'heads' 'tails' 'tails' 'tails' 'heads'
 'heads' 'tails' 'tails' 'heads' 'tails' 'heads' 'heads' 'tails' 'heads'
 'heads' 'tails' 'tails' 'heads' 'tails' 'tails' 'tails' 'heads' 'heads'
 'heads' 'tails' 'heads' 'tails' 'heads' 'tails' 'tails' 'tails' 'heads'
 'heads' 'heads' 'heads' 'tails' 'heads' 'tails' 'heads' 'heads' 'heads'
 'tails' 'tails' 'tails' 'tails' 'heads' 'tails' 'tails' 'heads' 'tails'
 'tails' 'heads' 'heads' 'tails' 'tails' 'tails' 'tails' 'heads' 'tails'
 'heads' 'tails' 'heads' 'tails' 'heads' 'heads' 'tails' 'heads' 'heads'
 'heads' 'heads' 'heads' 'heads' 'heads' 'tails' 'tails' 'heads' 'heads'
 'heads' 'tails' 'heads' 'tails' 'heads' 'heads' 'heads' 'tails' 'tails'
 'tails' 'heads' 'tails' 'heads' 'heads' 'tails' 'heads' 'tails' 'heads'
 'tails' 'tails' 'heads' 'heads' 'tails' 'tails' 'heads' 'heads' 'tails'
 'tails' 'heads' 'tails' 'heads' 'tails' 'tails' 'tails' 'tails' 'heads'
 'tails' 'tails' 'tails' 'heads' 'tails' 'tails' 't

### Analyzing Flip Results
Now, let's count how many heads and tails we got.

In [4]:
flip_counts = df_coin['flip_result'].value_counts()
print(flip_counts)

tails    511
heads    489
Name: flip_result, dtype: int64


### Calculating Probabilities
Next, we will calculate the probability of getting heads or tails.

In [7]:
p_heads = flip_counts['heads'] / len(df_coin)
p_tails = flip_counts['tails'] / len(df_coin)
print(f"Probability of Heads: {p_heads}")
print(f"Probability of Tails: {p_tails}")

Probability of Heads: 0.502
Probability of Tails: 0.498


## Part 2: Bayesian Email Classifier

### Objective:
Building a Bayesian email classifier to differentiate between 'spam' and 'ham' (not spam) emails.

### Task 1: Exploring the Dataset

In [4]:
# Create a simulated email classification (spam and not spam) dataset with 1000 data points.

import pandas as pd
import numpy as np

# Sample size
n_samples = 1000

# Simulating data
np.random.seed(42)
data = {
    'email_length': np.random.normal(100, 20, n_samples).astype(int),
    'contains_free': np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3]),
    'contains_winner': np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2]),
    'time_of_day': np.random.choice(['morning', 'afternoon', 'evening', 'night'], n_samples),
    'label': np.random.choice(['spam', 'ham'], n_samples, p=[0.4, 0.6])
}

df = pd.DataFrame(data)
print(df)

# Saving the dataset
df.to_csv('simulated_email_dataset.csv', index=False)


     email_length  contains_free  contains_winner time_of_day label
0             109              0                0     morning   ham
1              97              0                0     morning  spam
2             112              0                0     morning  spam
3             130              1                0   afternoon   ham
4              95              0                1   afternoon  spam
..            ...            ...              ...         ...   ...
995            94              0                1       night   ham
996           135              0                0       night  spam
997           112              0                0     evening  spam
998            88              0                1   afternoon  spam
999           111              0                0     evening  spam

[1000 rows x 5 columns]


In [63]:
# Load the dataset (Replace 'path_to_dataset' with the actual file path). You can uncomment the codes below. Notice what `df_emails.head()` is representing.
df_emails = pd.read_csv('simulated_email_dataset.csv')

# This command prints the first 5 rows of the dataset
df_emails.head()

Unnamed: 0,email_length,contains_free,contains_winner,time_of_day,label
0,109,0,0,morning,ham
1,97,0,0,morning,spam
2,112,0,0,morning,spam
3,130,1,0,afternoon,ham
4,95,0,1,afternoon,spam


### Task 2: Data Preprocessing
You need to preprocess the data for analysis. This involves normalizing and encoding the features.

In [64]:
# Checking for missing values in the df_emails dataframe
print("Check for missing values:")
missing_values = np.where(pd.isnull(df_emails))
print(missing_values)

Check for missing values:
(array([], dtype=int64), array([], dtype=int64))


The command returns an empty array which means there are no empty values

In [65]:
# Checking for duplicated rows
print("Checking for duplicated rows:")
duplicated_rows = df_emails.duplicated().any()
print(duplicated_rows)

Checking for duplicated rows:
True


In [66]:
# Droping duplicated rows
df_no_duplicates = df_emails.drop_duplicates()

print(df_no_duplicates.duplicated().any())

False


df_no_duplicates does not have any duplicated values anymore as the above command returns false

In [67]:
# Checking the data types
print("Checking the data types")
print(df_no_duplicates.dtypes)

Checking the data types
email_length        int64
contains_free       int64
contains_winner     int64
time_of_day        object
label              object
dtype: object


In [68]:
# Converting categorical data to numerical data in time_of_day using label encoding
from sklearn.preprocessing import LabelEncoder

df_converted = df_no_duplicates.copy()

# Define a mapping for label encoding
time_of_day_mapping = {'morning': 0, 'afternoon': 1, 'evening': 2, 'night': 3}

# Apply label encoding using map
df_converted.loc[:, 'time_of_day_encoded'] = df_converted['time_of_day'].map(time_of_day_mapping)

# Convert the label into 0s and 1s
label_mapping = {'ham': 0, 'spam': 1}
df_converted.loc[:, 'label_encoded'] = df_converted['label'].map(label_mapping)

# Add transformed data into a csv file
df_converted.to_csv('cleaned_dataset.csv')

print("Rows and Columns left for unique rows")
print(df_converted.shape)

print("Transformed data")
print(df_converted.head())

Rows and Columns left for unique rows
(724, 7)
Transformed data
   email_length  contains_free  contains_winner time_of_day label  \
0           109              0                0     morning   ham   
1            97              0                0     morning  spam   
2           112              0                0     morning  spam   
3           130              1                0   afternoon   ham   
4            95              0                1   afternoon  spam   

   time_of_day_encoded  label_encoded  
0                    0              0  
1                    0              1  
2                    0              1  
3                    1              0  
4                    1              1  


### Task 3: Probability Calculation
Calculate the probability of spam and ham emails in the dataset. We will be using the cleaned the dataset

In [69]:
df_cleaned = pd.read_csv('cleaned_dataset.csv')

email_counts = df_cleaned['label'].value_counts()
print("Counts")
print(email_counts)

# prior knowledge
p_spam = email_counts['spam'] / len(df_cleaned)
p_ham = email_counts['ham'] / len(df_cleaned)
print(f"Probability of ham: {p_ham}")
print(f"Probability of spam: {p_spam}")

Counts
ham     411
spam    313
Name: label, dtype: int64
Probability of ham: 0.5676795580110497
Probability of spam: 0.43232044198895025


### Task 4: Implementing Bayes' Theorem
Implement Bayes' Theorem to classify emails as spam or ham.

In [70]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Feature selection (Use contains_free, contains_winner, time_of_day_encoded)
# The email length was not considered relevant to the model
X = df_cleaned.drop(['label', 'label_encoded', 'time_of_day', 'email_length'], axis=1)

# Use the label encoded in 1s and 0s
y = df_cleaned['label_encoded']


# Spliting the dataset into training (80%) and testing (20%) dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The classifier
clf = MultinomialNB()

clf.fit(X_train, y_train)

### Task 5: Model Testing
Test the model on a new dataset and evaluate its performance. You can use a subset of the dataset that you created or create a new one.

In [71]:
# Testing the model with the 20% testing data
y_pred = clf.predict(X_test)

# Getting the accuracy of the Naive Bayes classifier
acc = accuracy_score(y_test, y_pred)

print("Accuracy:")
print(f"{accuracy * 100}%")

Accuracy:
49.6551724137931%
