##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2025 Semester 1

## Assignment 1: Scam detection with naive Bayes


**Student ID(s):**     `1462474`


This iPython notebook is a template which you will use for your Assignment 1 submission.

**NOTE: YOU SHOULD ADD YOUR RESULTS, GRAPHS, AND FIGURES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).** Results, figures, etc. which appear in this file but are NOT included in your report will not be marked.

**Adding proper comments to your code is MANDATORY.**

## 1. Supervised model training


In [1]:
## Import necessary libraries
import numpy as np
import math
import pandas as pd

#### Read in supervised training dataset

In [2]:
sms_df = pd.read_csv('sms_supervised_train.csv')

#### Reformat text to help with tokenising

In [3]:
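# Drops rows with missing text, then ensures data types are as intended
def preprocess_data(df):
    # Drop nulls first: astype(str) would otherwise turn NaN into the string 'nan'
    # Resetting the index keeps row labels aligned with row positions (0..N-1)
    df = df.dropna(subset=['textPreprocessed']).reset_index(drop=True)
    df['textPreprocessed'] = df['textPreprocessed'].astype(str)
    df['class'] = df['class'].astype(int)
    return df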

sms_df = preprocess_data(sms_df)

#### Build vocabulary list

In [4]:
# For each row in the dataset, split the text into words and add them to a set
# (a set de-duplicates words efficiently while building)
# Convert to a list at the end
def create_vocab_list(df):
    vocab_set = set()
    for text in df['textPreprocessed']:
        words = text.split()
        vocab_set.update(words)
    return list(vocab_set)

# Create vocab list
vocab_list = create_vocab_list(sms_df)


#### Build count matrix

In [5]:
# Creates a count matrix where each row represents a text instance
# Columns represent the words in the vocab list.
# Returns the matrix and a dictionary mapping words to their index in the vocab list
def build_count_matrix(df, vocab_list):
    # Initialise empty count matrix
    count_matrix = np.zeros((df.shape[0], len(vocab_list)))
    # Create dictionary mapping words to their index in the vocab list
    vocab_dict = {word: i for i, word in enumerate(vocab_list)}
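    # Use the row position rather than the DataFrame index label, so this also
    # works when the index is not 0..N-1 (e.g. after dropping or sampling rows)
    for row_pos, text in enumerate(df['textPreprocessed']):
        for word in text.split():
            # If the word is in the vocab list, increment its count in the matrix
            word_index = vocab_dict.get(word)
            if word_index is not None:
                count_matrix[row_pos, word_index] += 1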
    return count_matrix, vocab_dict

# Build count matrix
count_matrix,vocab_dict = build_count_matrix(sms_df, vocab_list)
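
As a quick illustration of what the count matrix contains, here is a minimal sketch on a made-up three-message corpus. The names `toy_texts`, `toy_vocab`, `toy_index` and `toy_counts` are hypothetical and are not part of the assignment data or the functions above.

```python
import numpy as np

# Hypothetical toy corpus (not from the assignment data)
toy_texts = ["call now free", "ok see u", "free free prize"]

# Sorted vocabulary so that the column order is reproducible
toy_vocab = sorted({word for text in toy_texts for word in text.split()})
toy_index = {word: i for i, word in enumerate(toy_vocab)}

# One row per message, one column per vocabulary word
toy_counts = np.zeros((len(toy_texts), len(toy_vocab)), dtype=int)
for row, text in enumerate(toy_texts):
    for word in text.split():
        toy_counts[row, toy_index[word]] += 1

print(toy_vocab)
print(toy_counts)
```

Each row sums to the number of tokens in that message, which is a cheap sanity check to run on the real matrix as well.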

#### Compute the prior probability of each class:


In [6]:
# Calculate the prior probability of each class given the dataframe
def calculate_priors(df):
    n_rows = df.shape[0]
    # Count the number of instances in each class
    n_c0 = df[df['class'] == 0].shape[0]
    n_c1 = df[df['class'] == 1].shape[0]
    # Calculate the prior probability of each class
    p_c0 = n_c0 / n_rows
    p_c1 = n_c1 / n_rows
    return [p_c0, p_c1]

# Calculate priors
priors = calculate_priors(sms_df)
p_c0 = priors[0]
p_c1 = priors[1]

# Answer to question 1 in our report
print(f"Our two priors are p_c0 = {p_c0}\nand p_c1 = {p_c1} ")


Our two priors are p_c0 = 0.8
and p_c1 = 0.2 
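
The prior of each class is simply its share of the training instances, P(c) = N_c / N. A minimal sketch of the same calculation using pandas on made-up labels is below; `toy_labels` and `toy_priors` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical labels: four non-malicious (0) and one scam (1)
toy_labels = pd.Series([0, 0, 0, 0, 1])

# value_counts(normalize=True) returns the relative frequency of each class
toy_priors = toy_labels.value_counts(normalize=True).sort_index()

print(toy_priors)
```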


#### Find the probability of each word appearing in a message from each class (likelihood)

In [7]:
# First we need to find the number of times each word appears in each class

# Counts the number of times each word appears in a given class.
# Uses the global sms_df and count_matrix; returns an array of counts.
def count_words_in_class(class_num):
    # Find the indexes of the rows in the count matrix that belong to the given class
    index_list = np.where(sms_df['class'] == class_num)[0]
    # Get the count matrix for the given class (only shows rows where we classify as class_num)
    class_count_matrix = count_matrix[index_list]
    # Sums the word counts across all text instances for the class
    count_list = class_count_matrix.sum(axis=0)
    return count_list

# Class 0 count
count_c0 = count_words_in_class(0)

# Class 1 count
count_c1 = count_words_in_class(1)


In [85]:
# We use Laplace smoothing to ensure every word has a non-zero probability,
# since our data is sparse (discussed further in the report)

# Laplace smoothing value
alpha = 1

# Laplace function returns the conditional probability of a word given a class
# with laplace smoothing
def laplace(count, alpha, total, v):
    return (count+alpha) / (total + v*alpha)

# Calculates the likelihoods p_(c,i) of word i appearing in a given class c
# returns the probability lists for each word in the vocab list and each class
def calculate_likelihoods(count_c0, count_c1, vocab_dict, alpha=1):
    # Initialise empty lists to store the likelihoods
    class_0_c = []
    class_1_c = []
    # Calculate total count of words in each class
    total_c0 = count_c0.sum()
    total_c1 = count_c1.sum()
    # Find the length of the vocab list
    V = len(vocab_dict)
    # For each class calculate the likelihood list
    for word, index in vocab_dict.items():
        count = count_c0[index]
        # Find prob using laplace smoothing
        p_ci = laplace(count, alpha, total_c0, V)
        class_0_c.append(p_ci)
    for word, index in vocab_dict.items():
        count = count_c1[index]
        # Find prob using laplace smoothing
        p_ci = laplace(count, alpha, total_c1, V)
        class_1_c.append(p_ci)
    return class_0_c, class_1_c

# Find the likelihoods for each class
class_0_c, class_1_c = calculate_likelihoods(count_c0, count_c1, vocab_dict, alpha)


# Check probabilities sum to roughly one
print(sum(class_1_c))
print(sum(class_0_c))


0.9999999999999745
1.0000000000000047
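
The smoothed likelihoods for one class can also be computed in a single vectorised step. The sketch below is an illustrative alternative on made-up counts, not a replacement for `calculate_likelihoods`; `smoothed_likelihoods` and `toy_class_counts` are hypothetical names.

```python
import numpy as np

def smoothed_likelihoods(word_counts, alpha=1.0):
    """Laplace-smoothed P(word | class) for every word in the vocabulary."""
    word_counts = np.asarray(word_counts, dtype=float)
    vocab_size = word_counts.size
    return (word_counts + alpha) / (word_counts.sum() + alpha * vocab_size)

# Hypothetical per-word counts for one class over a 3-word vocabulary
toy_class_counts = np.array([3, 0, 1])
toy_probs = smoothed_likelihoods(toy_class_counts)

# (3+1)/(4+3), (0+1)/(4+3), (1+1)/(4+3)
print(toy_probs, toy_probs.sum())
```

The word with a zero count still receives probability 1/7, and the values sum to one up to floating-point error.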


### Further supervised model training questions for report

In [86]:
# Question 2
# Find the most probable words in each class

# We find the 10 most probable words by sorting the probabilities list
# and finding their indexes in the original vocab list

# Finds the n most probable words in a given class.
# Returns the words in descending order of probability,
# together with a list containing their probability values.
def find_n_most_probable(n, class_num):
    # Select the likelihood list for the given class
    if class_num == 0:
        prob_list = class_0_c
    elif class_num == 1:
        prob_list = class_1_c
    else:
        raise ValueError("class_num must be 0 or 1")

    # Sort the list and get the indexes of the n most probable words
    sorted_indexes = np.argsort(prob_list)[-n:][::-1]

    sorted_probs = np.sort(prob_list)[-n:][::-1]

    # Get the words from our vocab list using the indexes
    most_probable_words = [vocab_list[i] for i in sorted_indexes]

    return most_probable_words, sorted_probs


# Find the 10 most probable words in class 0
print("Most probable words in class 0:", find_n_most_probable(10, 0)[0])

# List their probability values
print("Probability values for class 0:", find_n_most_probable(10, 0)[1])

# Find the 10 most probable words in class 1
print("Most probable words in class 1:", find_n_most_probable(10, 1)[0])
print("Probability values for class 1:", find_n_most_probable(10, 1)[1])

Most probable words in class 0: ['.', ',', '?', 'u', '...', '!', '..', ';', '&', 'go']
Probability values for class 0: [0.07930378 0.02602418 0.02557645 0.0189165  0.0187486  0.01718155
 0.01494291 0.013152   0.01309604 0.01113723]
Most probable words in class 1: ['.', '!', ',', 'call', '£', 'free', '/', '2', '&', '?']
Probability values for class 1: [0.05652174 0.02434783 0.02347826 0.02054348 0.01391304 0.01054348
 0.00913043 0.00880435 0.00869565 0.00847826]


#### Question 3: Finding the most predictive words


In [87]:
# Calculate the probability ratio of a word occurring in class 0 vs class 1
class_0_c = np.array(class_0_c)
class_1_c = np.array(class_1_c)
p_ratio_non_mal = class_0_c/class_1_c

# We don't have to worry about dividing by zero as we have already applied laplace smoothing

# Find the 10 most strongly predictive words of the non-malicious class
sorted_c0_div_c1 = np.argsort(p_ratio_non_mal)[-10:][::-1]
most_predictive_words_non_mal = [vocab_list[i] for i in sorted_c0_div_c1]
non_scam_ratios = np.sort(p_ratio_non_mal)[-10:][::-1]

# Find the 10 most strongly predictive words of the malicious class
p_ratio_mal = class_1_c/class_0_c
sorted_c1_div_c0 = np.argsort(p_ratio_mal)[-10:][::-1]
most_predictive_words_mal = [vocab_list[i] for i in sorted_c1_div_c0]
scam_ratios = np.sort(p_ratio_mal)[-10:][::-1]

# Print the most predictive words of the non-malicious class
print("The most predictive words of the non-malicious class are:", most_predictive_words_non_mal)
print("The associated probability ratios are:", non_scam_ratios)



# Print the most predictive words of the malicious class
print("The most predictive words of the malicious class are:", most_predictive_words_mal)
print("The associated probability ratios are:", scam_ratios)




The most predictive words of the non-malicious class are: [';', '...', 'gt', 'lt', ':)', 'ü', 'lor', 'ok', 'hope', 'd']
The associated probability ratios are: [60.49921648 57.49570928 54.06312962 53.54824267 47.88448623 31.92299082
 28.83366913 24.71457354 24.71457354 21.1103649 ]
The most predictive words of the malicious class are: ['prize', 'tone', '£', 'select', 'claim', 'paytm', 'code', 'award', 'won', '18']
The associated probability ratios are: [99.05086957 64.09173913 49.71965217 46.61217391 45.96478261 36.90130435
 34.95913043 32.04586957 31.07478261 29.1326087 ]


## 2. Supervised model evaluation

### Predicting the labels of our test set

In [88]:
# Read in our test dataset
test_df = pd.read_csv('sms_test.csv')


In [89]:
# Apply the same preprocessing steps as before
# Find the number of rows before dropping nulls
test_rows = test_df.shape[0]
print("Number of entries before dropping null values: ", test_rows)

# Drop nulls first (astype(str) would turn NaN into the string 'nan'),
# then ensure all values in the column are strings
test_df = test_df.dropna(subset=['textPreprocessed']).reset_index(drop=True)
test_df['textPreprocessed'] = test_df['textPreprocessed'].astype(str)
print("Number of entries after dropping null values: ", test_df.shape[0])


Number of entries before dropping null values:  1000
Number of entries after dropping null values:  1000


In [90]:
# For each test instance, compute a count vector
# which represents the number of times each word in the vocab list appears
# This will be in the form of an N_test x V matrix
def build_test_count_matrix(df, vocab_list, vocab_dict):
    # Initialise empty count matrix
    test_count_matrix = np.zeros((df.shape[0], len(vocab_list)))
    # Counters and sets for question 2
    unique_words_in_vocab = set()
    unique_words_not_in_vocab = set()
    num_test_words_in_vocab = 0
    num_test_words_not_in_vocab = 0

    # Add words that exist in our vocab to the new count matrix for each row in the data
    # (the row position is used rather than the index label, in case the index is not 0..N-1)
    for row_pos, text in enumerate(df['textPreprocessed']):
        for word in text.split():
            if word in vocab_dict:
                word_index = vocab_dict.get(word)
                test_count_matrix[row_pos][word_index] += 1
                unique_words_in_vocab.add(word)
                num_test_words_in_vocab += 1
            else:
                unique_words_not_in_vocab.add(word)
                num_test_words_not_in_vocab += 1

    return test_count_matrix, num_test_words_in_vocab, num_test_words_not_in_vocab, unique_words_in_vocab, unique_words_not_in_vocab



# Build the test count matrix
test_count_matrix, num_test_words_in_vocab, num_test_words_not_in_vocab, unique_words_in_vocab, unique_words_not_in_vocab = build_test_count_matrix(test_df, vocab_list, vocab_dict)



# Count the unique test words that are in, and not in, the vocab list (for question 2)
num_unique_words_in_vocab = len(unique_words_in_vocab)
num_unique_words_not_in_vocab = len(unique_words_not_in_vocab)
num_test_words_total = num_test_words_in_vocab + num_test_words_not_in_vocab




In [91]:
# Check if all test instances contain words from the vocab list
# If an instance contains no words from the vocab list, we won't classify it

for row in test_count_matrix:
    if row.sum() == 0:
        print("Found a row with no words from the vocab list")
        print(row)



No test instances were skipped: the check above printed no rows, so every test message contains at least one word from the vocab list (refer to Q2 below).

#### Compute the posterior probability of each class given the observed word count

In [92]:
# For each test instance, we compute the unnormalised posterior probability
# P(c) * P(word counts | c) of the class label given the observed word counts
def calc_posterior(count_matrix, likelihoods_list, label_index):
    # Initialise empty list to store posterior probabilities
    posterior_probs = []

    prior_value = priors[label_index]

    # For each test instance
    for row in count_matrix:
        # Initialise posterior probability
        posterior = np.log(prior_value)

        # For each word in the vocab list
        for i, word_freq in enumerate(row):
            if word_freq > 0:
                # Get the likelihood of the word given the class label
                likelihood_value = likelihoods_list[i]

                # Take the log of this value to combat underflow
                log_likelihood_value = np.log(likelihood_value)


                # Update the posterior probability
                posterior += (log_likelihood_value * word_freq)


        # Find the log multinomial coefficient
        # Using log gamma function to avoid overflow
        log_multinomial_coefficient = (
            math.lgamma(row.sum() + 1) - np.sum([math.lgamma(c + 1) for c in row])
        )

        posterior += log_multinomial_coefficient

        # Exponentiate to return to probability space
        # (this can underflow to 0 for very long messages)
        posterior = np.exp(posterior)


        # Add the posterior probability to our list
        posterior_probs.append(posterior)
    return posterior_probs

# Class 0 posterior
c0_posterior = calc_posterior(test_count_matrix, class_0_c, 0)

# Class 1 posterior

c1_posterior = calc_posterior(test_count_matrix, class_1_c, 1)


print(max(c1_posterior))


0.011304347826086959
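
Because exponentiating a very negative log value can underflow to 0, one option is to compare the classes directly in log space. The sketch below is an illustrative alternative on made-up numbers, not the method used above; `log_joint` and all `toy_*` values are hypothetical. The multinomial coefficient depends only on the word counts, so it is identical for both classes and is omitted here.

```python
import numpy as np

def log_joint(counts, log_prior, log_likelihoods):
    """log P(c) + sum_i count_i * log P(word_i | c), without the multinomial term."""
    return log_prior + counts @ log_likelihoods

# Hypothetical word counts for one message over a 3-word vocabulary
toy_counts = np.array([2, 0, 1])

# Hypothetical likelihoods and priors for the two classes
toy_log_lik_c0 = np.log([0.5, 0.3, 0.2])
toy_log_lik_c1 = np.log([0.2, 0.3, 0.5])
lj0 = log_joint(toy_counts, np.log(0.8), toy_log_lik_c0)
lj1 = log_joint(toy_counts, np.log(0.2), toy_log_lik_c1)

# Normalise in log space: log(exp(lj0) + exp(lj1)) computed stably
log_norm = np.logaddexp(lj0, lj1)
post_c0 = np.exp(lj0 - log_norm)
post_c1 = np.exp(lj1 - log_norm)

# The log of the posterior ratio is just a difference of log joints
log_ratio = lj0 - lj1

print(post_c0, post_c1, log_ratio)
```

The predicted label is unchanged by working in log space, since the logarithm is monotonic, and the log ratio can be used in place of the raw ratio when ranking confidence.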


#### Predict the labels of each test instance

In [93]:
# Predicts the label of each test instance using the posterior probabilities
def predict_nb(c0_posterior, c1_posterior):

    # Using our previously calculated posterior probabilities:

    # Find argmax for each instance
    argmax_labels = []
    associated_probs = []

    for post_prob_0, post_prob_1 in zip(c0_posterior, c1_posterior):
        # Find the maximum posterior probability
        if post_prob_0 >= post_prob_1:
            associated_probs.append(post_prob_0)
            argmax_labels.append(0)

        # Otherwise we classify as 1 (ties go to class 0 via >= above, so we don't misclassify important messages)
        else:
            associated_probs.append(post_prob_1)
            argmax_labels.append(1)

    return argmax_labels, associated_probs


# Run on our dataset
pred_labels, probs = predict_nb(c0_posterior, c1_posterior)

print(pred_labels[1:20])



[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]


#### Find the accuracy of our model (Q1)

In [97]:
# Overall accuracy of our classifier

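def get_accuracy(labels, true_labels):
    # Convert to arrays so the element-wise comparison also works for plain lists
    labels = np.asarray(labels)
    true_labels = np.asarray(true_labels)

    # Find the number of correct predictions
    num_correct = np.sum(labels == true_labels)

    # Calculate the accuracy
    accuracy = num_correct / len(true_labels)

    return accuracy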

# Get the true labels
true_labels = test_df['class'].values

# Prints the accuracy, confusion matrix, precision and recall
# of our model's predictions
def model_eval(true_labels, pred_labels):
    accuracy = get_accuracy(pred_labels, true_labels)
    print("Accuracy:", accuracy)

    # Confusion matrix
    confusion_matrix = np.zeros((2, 2), dtype=int)
    # Fill the confusion matrix
    for true, predicted in zip(true_labels, pred_labels):
        confusion_matrix[true][predicted] += 1
    print("Confusion Matrix:\n", confusion_matrix)

    TN = confusion_matrix[0, 0]
    FP = confusion_matrix[0, 1]
    FN = confusion_matrix[1, 0]
    TP = confusion_matrix[1, 1]

    # Calculate precision and recall
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")

# Evaluate our model
model_eval(true_labels, pred_labels)

Accuracy: 0.975
Confusion Matrix:
 [[785  15]
 [ 10 190]]
Precision: 0.93
Recall: 0.95
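
The reported metrics can be checked by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The snippet below is only a worked arithmetic check; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted class
confusion = np.array([[785, 15],
                      [10, 190]])

tn, fp = confusion[0]
fn, tp = confusion[1]

accuracy = (tp + tn) / confusion.sum()   # (190 + 785) / 1000
precision = tp / (tp + fp)               # 190 / 205
recall = tp / (tp + fn)                  # 190 / 200

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```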


#### How often we encountered out of vocab words (Q2)

In [106]:
def find_vocab_proportions(num_unique_words_not_in_vocab, num_unique_words_in_vocab, num_test_words_in_vocab, num_test_words_total):
    # Proportion of unique words in the test set that were not in the vocab list
    prop_unique_not_in_vocab = num_unique_words_not_in_vocab / (num_unique_words_not_in_vocab + num_unique_words_in_vocab)
    prop_unique_in_vocab = 1 - prop_unique_not_in_vocab

    print(f"Proportion of unique words in vocab: {prop_unique_in_vocab}")
    print(f"Proportion of unique words not in vocab: {prop_unique_not_in_vocab}")

    # Proportion of words in the test set that were not in the vocab list
    prop_in_vocab = num_test_words_in_vocab / num_test_words_total
    prop_not_in_vocab = 1 - prop_in_vocab

    print(f"Proportion of test words in vocab: {prop_in_vocab}")
    print(f"Proportion of test words not in vocab: {prop_not_in_vocab}")

find_vocab_proportions(num_unique_words_not_in_vocab, num_unique_words_in_vocab, num_test_words_in_vocab, num_test_words_total)


Proportion of unique words in vocab: 0.9427178549664839
Proportion of unique words not in vocab: 0.05728214503351615
Proportion of test words in vocab: 0.9836797957695113
Proportion of test words not in vocab: 0.016320204230488744


#### Confidence of Classification (Q3)

In [36]:
# For all test instances, compute the ratio of the posterior probabilities of the two classes

likelihood_list_c0 = []
likelihood_list_c1 =[]

# For each test instance, we compute the ratio of the posterior probabilities
for post_prob_0, post_prob_1 in zip(c0_posterior, c1_posterior):
    likelihood_list_c0.append(post_prob_0 / post_prob_1)
    likelihood_list_c1.append(post_prob_1 / post_prob_0)

# Find the 3 most confident classifications for c0 (non-malicious)
c0_indexes = np.argsort(likelihood_list_c0)[-3:][::-1]
c0_ratios = np.sort(likelihood_list_c0)[-3:][::-1]
instances_c0 = test_df['textPreprocessed'].iloc[c0_indexes]

print(c0_ratios)
print(instances_c0)


[9.13499445e+37 2.69038186e+29 3.18292454e+25]
341    time : rs. transaction number & & & & & & & & ...
223    ? ? ? ? .. .. u u u u , , ... ... ... ... say ...
969    . every & & & & & & ; ; ; ; ; ; lt lt lt # # #...
Name: textPreprocessed, dtype: object


In [37]:
# Find the 3 most confident classifications for c1 (malicious)
c1_indexes = np.argsort(likelihood_list_c1)[-3:][::-1]
c1_ratios = np.sort(likelihood_list_c1)[-3:][::-1]
instances_c1 = test_df['textPreprocessed'].iloc[c1_indexes]

print(c1_ratios)
print(instances_c1)

[1.35388960e+20 1.28709054e+20 1.14912397e+20]
844    . 4 + call £ - * holiday & urgent 18 t landlin...
985    . 3 4 + ! call : £ offer * holiday & urgent 18...
460    . . . , please order text call / : customer to...
Name: textPreprocessed, dtype: object


In [39]:
# On the boundary between the two classes
# Find the 3 text instances with R values closest to 1
close_to_1_indexes = sorted(range(len(likelihood_list_c0)), key=lambda i: abs(likelihood_list_c0[i] - 1))[:3]
close_to_1_ratios = [likelihood_list_c0[i] for i in close_to_1_indexes]

instances_close_to_1 = test_df['textPreprocessed'].iloc[close_to_1_indexes]

print(close_to_1_ratios)
print(instances_close_to_1)


[1.0170981352199624, 1.0441124738285466, 0.9297658433847452]
90                  . call dear
455                . reply glad
767    . . tell return re order
Name: textPreprocessed, dtype: object
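
Note that the distance `abs(R - 1)` is not symmetric: a ratio of 0.5 and a ratio of 2 favour their respective classes equally strongly, yet they sit at different distances from 1. One possible alternative, sketched here on made-up ratios, is to rank by `abs(log R)` instead. This is an assumption about what "closest to the boundary" should mean, not the measure used above.

```python
import numpy as np

# Hypothetical posterior ratios R = P(c0 | x) / P(c1 | x)
toy_ratios = np.array([0.5, 2.0, 1.1])

# Distance from 1 in raw space treats 0.5 and 2.0 differently
abs_dist = np.abs(toy_ratios - 1)

# Distance from 0 in log space treats them identically
log_dist = np.abs(np.log(toy_ratios))

print(abs_dist)
print(log_dist)

# Indexes of instances ordered from least to most confident
print(np.argsort(log_dist))
```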


## 3. Extending the model with semi-supervised training

### Active Learning

Split the labelled data into a training and validation set

In [110]:
sms_df_train = sms_df.sample(frac=0.8, random_state=1462474)
sms_df_validation = sms_df.drop(sms_df_train.index)

We must read in the unlabelled data

In [111]:
sms_unlabelled = pd.read_csv('sms_unlabelled.csv')

In [112]:
# First we select 200 instances at random
sms_ul_sample = sms_unlabelled.sample(200, random_state=1462474)

# Append the sampled instances to the training set
sms_df_train = pd.concat([sms_df_train, sms_ul_sample], ignore_index=True)

In [113]:
# Preprocess the data
sms_df_train = preprocess_data(sms_df_train)
sms_df_validation = preprocess_data(sms_df_validation)


Retrain our model on the new training set

In [114]:
# Retrain our model

# Rebuild the vocab list from the new training set
new_vocab_list = create_vocab_list(sms_df_train)

# Build count matrix
new_count_matrix, new_vocab_dict = build_count_matrix(sms_df_train, new_vocab_list)


# Calculate priors
priors = calculate_priors(sms_df_train)
p_c0_new = priors[0]
p_c1_new = priors[1]

print(priors)


[0.7966666666666666, 0.20333333333333334]


In [115]:
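# Calculate likelihoods using laplace smoothing
# count_words_in_class reads the original sms_df and count_matrix, so the
# per-class counts are taken directly from the new training set and matrix
train_classes = sms_df_train['class'].to_numpy()
count_c0_new = new_count_matrix[train_classes == 0].sum(axis=0)
count_c1_new = new_count_matrix[train_classes == 1].sum(axis=0)
class_0_c_new, class_1_c_new = calculate_likelihoods(count_c0_new, count_c1_new, new_vocab_dict, alpha)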

#### The model is now trained; next we must evaluate it on our validation set


## 4. Supervised model evaluation