##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2025 Semester 1

## Assignment 1: Scam detection with naive Bayes


**Student ID(s):**     `1462474`


This iPython notebook is a template which you will use for your Assignment 1 submission.

**NOTE: YOU SHOULD ADD YOUR RESULTS, GRAPHS, AND FIGURES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).** Results, figures, etc. which appear in this file but are NOT included in your report will not be marked.

**Adding proper comments to your code is MANDATORY.**

## 1. Supervised model training


In [28]:
## Import necessary libraries
import numpy as np
import pandas as pd

#### Read in supervised training dataset

In [29]:
sms_df = pd.read_csv('sms_supervised_train.csv')

#### Reformat text to help with tokenising

In [30]:
# Drop rows with a missing message before casting to str; casting first
# can turn NaN into the literal string 'nan', which dropna would not catch
n_rows_before = sms_df.shape[0]
print("Number of entries before dropping null values: ", n_rows_before)
sms_df = sms_df.dropna(subset=['textPreprocessed']).reset_index(drop=True)

# Recount the rows that remain (n_rows is reused for the priors below)
n_rows = sms_df.shape[0]
print("Number of entries after dropping null values: ", n_rows)

# Ensure all values in the column are strings
sms_df['textPreprocessed'] = sms_df['textPreprocessed'].astype(str)


Number of entries before dropping null values:  2000
Number of entries after dropping null values:  2000


Since there are 2000 entries both before and after dropping, we can conclude that the column contains no null entries.

In [31]:
# Ensure class data is of the same data type
sms_df['class'] = sms_df['class'].astype(int)

#### Build vocabulary list

In [32]:
# Define vocab list (set for build efficiency)
vocab_set = set()

# Add each word to the vocabulary set
for text in sms_df['textPreprocessed']:
    # Split text into words
    words = text.split()
    vocab_set.update(words)

# Convert our set into a list as required
# (sorted so the column order of the count matrix is reproducible)
vocab_list = sorted(vocab_set)
# Free the set to save memory
del vocab_set
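
As a quick sanity check of the vocabulary step, the sketch below runs the same set-based approach on a tiny made-up corpus. The messages and the `toy_` names are hypothetical and are not taken from the dataset. The vocabulary is sorted here only so that the printed order is reproducible.

```python
# Hypothetical toy messages (not from sms_supervised_train.csv)
toy_texts = ["free prize call now", "call me later", "free free entry"]

# Collect every distinct whitespace-separated token
toy_vocab_set = set()
for text in toy_texts:
    toy_vocab_set.update(text.split())

# Sorting gives a stable order; a plain list(set) may vary between runs
toy_vocab_list = sorted(toy_vocab_set)
print(toy_vocab_list)
```

Repeated words such as "free" appear only once in the result, which is the property the count matrix relies on when each word is assigned a single column.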

#### Build count matrix

In [33]:
# Initialise empty count matrix
count_matrix = np.zeros((sms_df.shape[0], len(vocab_list)))


# Make a vocab dictionary for faster lookups
vocab_dict = {word: i for i, word in enumerate(vocab_list)}


# Fill count matrix by looking at every word in each row
# and counting how many times it appears
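# enumerate gives the row position, which stays valid even if the
# DataFrame index has gaps (e.g. after dropping rows)
for row, text in enumerate(sms_df['textPreprocessed']):
    for word in text.split():
        # Every word is in vocab_dict because the vocabulary was built from this column
        count_matrix[row, vocab_dict[word]] += 1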
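
To illustrate what the count matrix should look like, here is a minimal sketch on made-up data. The toy messages and all `toy_` names are assumptions for illustration only. A useful check is that each row sums to the number of tokens in the corresponding message.

```python
import numpy as np

# Hypothetical toy messages and vocabulary (not from the real dataset)
toy_texts = ["free prize call now", "call me later", "free free entry"]
toy_vocab = sorted({word for text in toy_texts for word in text.split()})
toy_index = {word: i for i, word in enumerate(toy_vocab)}

# One row per message, one column per vocabulary word
toy_counts = np.zeros((len(toy_texts), len(toy_vocab)), dtype=int)
for row, text in enumerate(toy_texts):
    for word in text.split():
        toy_counts[row, toy_index[word]] += 1

# Each row should sum to the number of tokens in its message
row_totals = toy_counts.sum(axis=1)
print(row_totals)
```

The same row-sum check can be run against the real `count_matrix` to confirm that no tokens were skipped or written to the wrong row.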



#### Compute the prior probability of each class:


In [39]:
# The dataset has two classes, so we have two priors

# Class 0 (non-malicious)
n_c0 = sms_df[sms_df['class'] == 0].shape[0]

# Calculate our prior
p_c0 = n_c0/n_rows


# Class 1 (malicious)
n_c1 = sms_df[sms_df['class'] == 1].shape[0]

# Calculate our prior
p_c1 = n_c1/n_rows


print(f"Our two priors are p_c0 = {p_c0}\nand p_c1 = {p_c1} ")


Our two priors are p_c0 = 0.8
and p_c1 = 0.2 
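
The priors are simply class frequencies, so they can be cross-checked with a couple of NumPy calls. The sketch below uses a hypothetical label array with the same 80/20 split; `toy_labels` and `toy_priors` are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical labels: 8 non-malicious (0) and 2 malicious (1) messages
toy_labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# Count each class, then divide by the number of messages
class_counts = np.bincount(toy_labels, minlength=2)
toy_priors = class_counts / toy_labels.size

print(toy_priors)
```

Because every message belongs to exactly one class, the priors must sum to 1, which is a cheap check to run on the real `p_c0` and `p_c1` as well.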


#### Find the probability of each word appearing in a message from each class

In [None]:
# We will use Laplace smoothing to ensure every event has a non-zero probability,
# since the data is sparse (more on this in the report)
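
One way the smoothed likelihoods could be computed is sketched below. It assumes a multinomial naive Bayes model, where the smoothed estimate is (count of the word in the class + alpha) / (total word count in the class + alpha × vocabulary size). The helper name `smoothed_likelihoods`, the `alpha` parameter, and the toy arrays are all assumptions for illustration, not the notebook's final implementation.

```python
import numpy as np

def smoothed_likelihoods(counts, labels, alpha=1.0):
    """Return a (n_classes, n_words) array of Laplace-smoothed P(word | class).

    Hypothetical helper: assumes a multinomial model where `counts` is a
    (n_messages, n_words) count matrix and `labels` holds integer classes.
    """
    classes = np.unique(labels)
    n_words = counts.shape[1]
    probs = np.zeros((classes.size, n_words))
    for i, c in enumerate(classes):
        # Total occurrences of each word across messages of class c
        word_totals = counts[labels == c].sum(axis=0)
        # Add alpha to every word count; the denominator grows by alpha * |V|
        probs[i] = (word_totals + alpha) / (word_totals.sum() + alpha * n_words)
    return probs

# Toy data: 3 messages over a 4-word vocabulary, classes 0, 0, 1
toy_counts = np.array([[1, 0, 2, 0],
                       [0, 1, 1, 0],
                       [0, 0, 0, 3]])
toy_labels = np.array([0, 0, 1])
toy_probs = smoothed_likelihoods(toy_counts, toy_labels)
print(toy_probs)
```

Each row of the result sums to 1, and a word that never occurs in a class (such as the last word for class 0 here) still receives a small non-zero probability instead of zeroing out the whole product.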



## 2. Supervised model evaluation

## 3. Extending the model with semi-supervised training

## 4. Semi-supervised model evaluation