# Unsupervised Machine Learning Final Project
Aug2023

Data Source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

# Project Goals
For this project we'll be looking at text messages and classifying them as spam or legitimate (ham) text messages. The data is labeled, and we'll start with using an unsupervised model and compare it against a supervised model. Stepwise, we'll be doing the following:
1. Import the data
2. Clean the data
3. Explore the data
4. Build the unsupervised model
5. Explore improvements to the unsupervised model
6. Build a supervised model
7. Explore improvements to the supervised model
8. Compare supervised vs. unsupervised models
9. Conclusions

# The Data
This data contains text messages collected from a various number of sources (also described in more detail on the Kaggle source url):
1. 425 spam messages from Grumbletext website
2. 3,375 ham messages from NUS SMS Corpus
3. 450 ham messages from Caroline Tag's PhD thesis
4. 1000 ham messages and 322 spam messages from SMS Spam Corpus v.0.1 Big

In total there are 5572 messages with a breakdown of 87% ham and 13% spam. 

## NOTE
The data doesn't use the default utf-8 encoding, we need to use ISO-8859-1 to read the raw data correctly.

In [35]:
# Import modules
import polars as pl
import altair as alt
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import enchant

from sklearn.cluster import AgglomerativeClustering as Agg
from sklearn.neighbors import KNeighborsClassifier as KNN

In [36]:
# Import the data
url = "https://raw.githubusercontent.com/nsxydis/final-project/main/spam.csv"
df = pl.read_csv(url, encoding = 'ISO-8859-1')

In [37]:
# Clean the data
df.to_pandas().info()

# Review '' column
print("\n'' column")
print(df[''].unique(), '\n')

print('_duplicated_0 column')
print(df['_duplicated_0'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   v1             5572 non-null   object
 1   v2             5572 non-null   object
 2                  50 non-null     object
 3   _duplicated_0  12 non-null     object
dtypes: object(4)
memory usage: 174.3+ KB

'' column
shape: (44,)
Series: '' [str]
[
	"\" not \"what …
	" like you are …
	" b'coz nobody …
	"JUST REALLYNEE…
	" don't miss ur…
	"GN"
	" that's the ti…
	" SHE SHUDVETOL…
	" HOWU DOIN? FO…
	" I don't mind"
	" HOPE UR OK...…
	".;-):-D""
	…
	" wanted to say…
	"u hav2hear it!…
	" smoke hella w…
	" just as a sho…
	" GOD said"
	" PO Box 1146 M…
	" ENJOYIN INDIA…
	" Dont Come Nea…
	"DEVIOUSBITCH.A…
	" bt not his gi…
	" PO Box 5249"
	" we made you h…
	" the person is…
] 

_duplicated_0 column
shape: (11,)
Series: '_duplicated_0' [str]
[
	"whoever is the…
	" why to miss t…
	"U NO THECD ISV…


# Data Cleaning
It appears some of the data was split into separate columns because there were commas in the text fields. We'll combine these columns together and clean up the data. Additionally we'll rename the columns and drop unnecessary fields.

NOTE: Upon further testing the additional text fields severely impact model performance and will be ignored. The columns are not noted in the source data page, so it is unclear what this data is.

In [38]:
# def append(x):
#     a = x['v2']
    
#     # Get the blank column and add a comma if theres data
#     b = x[''] 
#     b = f", {b}" if b != '' else ""

#     c = x['_duplicated_0']
#     c = f", {c}" if c != '' else ""

#     return a + b + c

# struct = ['v2', '', '_duplicated_0']
# df = df.with_columns(pl.struct(struct).apply(lambda x: append(x)).alias('text'))
# df = df.rename({'v1' : 'category'})
# df = df.drop(['v2', '', '_duplicated_0'])

In [39]:
# Rename the dataset and drop the extra columns
df = df.rename({'v1' : 'category', 'v2' : 'text'})
df = df.drop(['', '_duplicated_0'])

# Exploring the data

In [40]:
# Add the word count to the data
def wordCount(text):
    return len(text.split(' '))

df = df.with_columns(pl.col('text').apply(lambda x: wordCount(x)).alias('words'))

In [41]:
# Explore the data
alt.data_transformers.disable_max_rows()

# Spam vs Ham
title = ['Click the columns to separate the distributions', "Count of spam vs. ham messages"]
selection = alt.selection_point(encodings = ['x'])
breakdown = alt.Chart(df.to_pandas(), title = title).mark_bar().encode(
    x = 'category',
    y = 'count()',
    color = 'category'
).add_params(selection)

# Distribution of message length
title = "Distribution of word counts"
distribution = alt.Chart(df.to_pandas(), title = title).mark_bar().encode(
    x = alt.X('words', bin = True),
    y = 'count()',
    color = 'category'
).transform_filter(selection)

breakdown | distribution

# Distribution notes
We can see that the normal (ham) messages are mostly under 20 words and skewed whereas the spam messages are typically between 20 to 40 words with a more normal (but still skewed) distribution.

# Model Fitting
Next we'll try fitting a NMF model based on the text field. To do that we'll have to vectorize the words and then fit our model.

In [42]:
# Make a class that can be used for different models
class spamhamNMF:
    def __init__(self, data, testSize = 0.3):
        # Vectorization
        self.vectorizer = TfidfVectorizer()
        self.embed = self.vectorizer.fit(df['text'].to_list())

        # Separate the data
        self.x = data['text'].to_list()
        self.y = data['category'].to_list()

        # Train test split
        self.x_train, self.x_test, self.y_train, self.y_test = \
        train_test_split(self.x, self.y, test_size = testSize, random_state=42)

        # Initialize the model
        self.model = NMF(n_components=2, random_state=42)

        ###############

    def train(self, report = True):
        '''Predict and score the training data'''
        self.embedTrain = self.vectorizer.transform(self.x_train)
        self.model.fit(self.embedTrain)
        matrix = self.model.transform(self.embedTrain)
        self.ypTrain = [np.argmax(row) for row in matrix]

        # Score
        self.trainScore, self.ypTrain = self.score(self.y_train, self.ypTrain)
        if report:
            print(f"Train Accuracy: {self.trainScore}")


    def test(self, report = True):
        '''Predict and score the test data'''
        self.embedTest = self.vectorizer.transform(self.x_test)
        matrix = self.model.transform(self.embedTest)
        self.ypTest = [np.argmax(row) for row in matrix]

        # Score
        self.testScore, self.ypTest = self.score(self.y_test, self.ypTest)
        if report:
            print(f"Test Accuracy: {self.testScore}")

    def score(self, y, yp):
        '''Categories yp and scores it against y'''
        # Determine which number aligns with each group
        df = pl.DataFrame({'y' : y, 'yp' : yp})
        ham = df.filter(pl.col('y') == 'ham')
        spam = df.filter(pl.col('y') == 'spam')

        ypHam = ham['yp'].mode()[0]
        ypSpam = spam['yp'].mode()[0]

        # Check that both groups weren't assigned together
        if ypHam == ypSpam:
            alternate = 1 if ypHam == 0 else 1
            
            if len(ham) > len(spam):
                ypSpam = alternate    
            else:
                ypHam = alternate

        # Create a dictionary
        if ypHam == 0:
            pred = {0 : 'ham', 1 : 'spam'}
        else:
            pred = {0 : 'spam', 1 : 'ham'}

        # Convert the predictions
        ypConverted = []
        for item in yp:
            ypConverted.append(pred[item])

        # Score the accuracy
        count = 0
        for n in range(len(y)):
            if y[n] == ypConverted[n]:
                count += 1

        # Return the accuracy and converted yp
        return count / len(y), ypConverted

In [43]:
nmf = spamhamNMF(df)
nmf.train()
nmf.test()
confusion_matrix(nmf.y_test, nmf.ypTest)

Train Accuracy: 0.823076923076923
Test Accuracy: 0.8265550239234449


array([[1340,  113],
       [ 177,   42]], dtype=int64)

# NMF Model improvements Part 1
Let's try removing stop words and seeing the impact.

In [44]:
def stopWords(text):
    '''Remove stopwords from the text'''
    stop = set(stopwords.words('english'))
    words = word_tokenize(text)
    new = ""
    for word in words:
        if word not in stop and (word.replace('-', "").isalnum()):
            new += f" {word}"
    return new

dfImproved = df.with_columns(pl.col('text').apply(lambda x: stopWords(x)))

In [45]:
nmf = spamhamNMF(dfImproved)
nmf.train()
nmf.test()
confusion_matrix(nmf.y_test, nmf.ypTest)

Train Accuracy: 0.8615384615384616
Test Accuracy: 0.8672248803827751


array([[1333,  120],
       [ 102,  117]], dtype=int64)

# NMF Model Improvements Part 1 Results

# NMF Model improvements Part 2
Let's try converting slang words to plain english.

In [46]:
dfSlang = {}
d = enchant.Dict('en_US')
def slang(text):
    '''Creates a list of slang words if there are any, otherwise returns []'''
    slangWords = []
    words = word_tokenize(text)
    for word in words:
        if d.check(word) == False:
            slangWords.append(word)
            if word.lower() not in dfSlang:
                dfSlang[word.lower()] = 1
            else:
                dfSlang[word.lower()] += 1
    return slangWords

dfImproved2 = dfImproved.with_columns(pl.col('text').apply(lambda x: slang(x)).alias('slang'))

In [47]:
# Identify the most used slang words
temp = {
    'word' : [],
    'count' : []
}
for word in dfSlang:
    temp['word'].append(word)
    temp['count'].append(dfSlang[word])

dfSlang = pl.DataFrame(temp)
del temp
dfSlang = dfSlang.sort(by = 'count', descending = True)
dfSlang = dfSlang.filter(pl.col('count') > 20)

In [48]:
# Manually adjusted slang translation
slangTranslation = {'lt': 'love that',
 'ur': 'your',
 'ok': 'okay',
 'txt': 'text',
 'lor': '?',
 'da': 'dad',
 'dont': "do not",
 'pls': 'please',
 'na': 'no',
 'im': 'I am',
 'msg': 'message',
 'lol': 'laughing out loud',
 'gon': 'gone',
 'gud': 'good',
 'ìï': '?',
 'haha': 'hah',
 'thk': 'thanks',
 'mins': 'minutes',
 'sms': 'message',
 'thats': 'that is',
 'liao': 'already',
 'luv': 'love',
 '150ppm': '150 parts per million',
 'aight': 'alright',
 'thanx': 'thanks',
 'leh': 'you',
 'didnt': 'did not',
 'tmr': 'tomorrow',
 'wif': 'wife',
 '150p': '150 parts per million',
 'hav': 'have',
 'abt': 'about',
 'juz': 'just',
 'tv': 'television',
 'plz': 'please',
 'havent': 'have not',
 'haf': 'have',
 'oso': 'significant other',
 'goin': 'going',
 'colour': 'color',
 'todays': "today's",
 'wid': 'with',
 'sae': '?'}

In [49]:
# Function to add slang words to our translator
def translate(slangWord):
    if word not in slangTranslation:
        slangTranslation[word] = input(f"{word}: ")

for word in dfSlang['word'].to_list():
    translate(word)


In [50]:
# Convert the slang words to normal words
def convertSlang(text):
    words = word_tokenize(text)
    new = ""
    for word in words:
        if word.lower() in slangTranslation:
            word = slangTranslation[word.lower()]
            if word == "?":
                word = ""
        new += f" {word}"
    return new

dfImproved2 = dfImproved2.with_columns(pl.col('text').apply(lambda x: convertSlang(x)))    

In [51]:
nmf = spamhamNMF(dfImproved2)
nmf.train()
nmf.test()
confusion_matrix(nmf.y_test, nmf.ypTest)

Train Accuracy: 0.8007692307692308
Test Accuracy: 0.8086124401913876


array([[1350,  103],
       [ 217,    2]], dtype=int64)

# Unsupervised Model Improvements Part 2 Results
From our second round we see that the performance has declined greatly from the previous models. That would suggest that the use of slang is quite important in separating spam from non spam messages. The model now heavily over classifies messages as spam and is very inaccurate.

In [52]:
class spamhamKNN:
    def __init__(self, data, neighbors = 3, testSize = 0.3):
        # Vectorization
        self.vectorizer = TfidfVectorizer()
        self.embed = self.vectorizer.fit(df['text'].to_list())

        # Separate the data
        self.x = data['text'].to_list()
        self.y = data['category'].to_list()

        # Train test split
        self.x_train, self.x_test, self.y_train, self.y_test = \
        train_test_split(self.x, self.y, test_size = testSize, random_state=42)

        # Initialize the model
        self.model = KNN(n_neighbors=neighbors)

    def train(self, report = True):
        '''Predict and score the training data'''
        self.embedTrain = self.vectorizer.transform(self.x_train)
        self.model.fit(self.embedTrain, self.y_train)
        self.ypTrain = self.model.predict(self.embedTrain)
        self.trainAccuracy = self.model.score(self.embedTrain, self.y_train)
        if report:
            print(f"Train Accuracy: {self.trainAccuracy}")
    
    def test(self, report = True):
        '''Predict and score the test data'''
        self.embedTest = self.vectorizer.transform(self.x_test)
        self.ypTest = self.model.predict(self.embedTest)
        self.testAccuracy = self.model.score(self.embedTest, self.y_test)
        if report:
            print(f"Test Accuracy: {self.testAccuracy}")


In [53]:
for n in range(1, 11, 2):
    print(f"{n} neighbors")
    knn = spamhamKNN(df, neighbors=n)
    knn.train()
    knn.test()

1 neighbors


Train Accuracy: 1.0
Test Accuracy: 0.9473684210526315
3 neighbors
Train Accuracy: 0.9446153846153846
Test Accuracy: 0.9198564593301436
5 neighbors
Train Accuracy: 0.9182051282051282
Test Accuracy: 0.9102870813397129
7 neighbors
Train Accuracy: 0.9023076923076923
Test Accuracy: 0.8965311004784688
9 neighbors
Train Accuracy: 0.9515384615384616
Test Accuracy: 0.9449760765550239


In [54]:
knn = spamhamKNN(df, neighbors=9)
knn.train()
knn.test()
confusion_matrix(knn.y_test, knn.ypTest)

Train Accuracy: 0.9515384615384616
Test Accuracy: 0.9449760765550239


array([[1452,    1],
       [  91,  128]], dtype=int64)

In [55]:
class spamhamAgg:
    def __init__(self, data, testSize = 0.3):
        # Vectorization
        self.vectorizer = TfidfVectorizer()
        self.embed = self.vectorizer.fit(df['text'].to_list())

        # Separate the data
        self.x = data['text'].to_list()
        self.y = data['category'].to_list()

        # Train test split
        self.x_train, self.x_test, self.y_train, self.y_test = \
        train_test_split(self.x, self.y, test_size = testSize, random_state=42)

        # Initialize the model
        self.model = Agg(n_clusters=2)

    def train(self, report = True):
        '''Predict and score the training data'''
        self.embedTrain = self.vectorizer.transform(self.x_train)
        self.ypTrain = self.model.fit_predict(self.embedTrain.toarray())

        # Score
        self.trainScore, self.ypTrain = self.score(self.y_train, self.ypTrain)
        if report:
            print(f"Train Accuracy: {self.trainScore}")

    def test(self, report = True):
        '''Predict and score the test data'''
        self.embedTest = self.vectorizer.transform(self.x_test)
        self.ypTest = self.model.fit_predict(self.embedTest.toarray())

        # Score
        self.testScore, self.ypTest = self.score(self.y_test, self.ypTest)
        if report:
            print(f"Test Accuracy: {self.testScore}")

    def score(self, y, yp):
        '''Categories yp and scores it against y'''
        # Determine which cluster aligns with each group
        df = pl.DataFrame({'y' : y, 'yp' : yp})
        ham = df.filter(pl.col('y') == 'ham')
        spam = df.filter(pl.col('y') == 'spam')

        ypHam = ham['yp'].mode()[0]
        ypSpam = spam['yp'].mode()[0]

        # Check that both groups weren't assigned together
        if ypHam == ypSpam:
            alternate = 1 if ypHam == 0 else 1
            
            if len(ham) > len(spam):
                ypSpam = alternate    
            else:
                ypHam = alternate

        # Create a dictionary
        if ypHam == 0:
            pred = {0 : 'ham', 1 : 'spam'}
        else:
            pred = {0 : 'spam', 1 : 'ham'}

        # Convert the predictions
        ypConverted = []
        for item in yp:
            ypConverted.append(pred[item])

        # Score the accuracy
        count = 0
        for n in range(len(y)):
            if y[n] == ypConverted[n]:
                count += 1

        # Return the accuracy and converted yp
        return count / len(y), ypConverted

In [56]:
agg = spamhamAgg(df)
agg.train()
agg.test()
confusion_matrix(agg.y_test, agg.ypTest)

agg = spamhamAgg(dfImproved)
agg.train()
agg.test()
confusion_matrix(agg.y_test, agg.ypTest)

agg = spamhamAgg(dfImproved2)
agg.train()
agg.test()
confusion_matrix(agg.y_test, agg.ypTest)

Train Accuracy: 0.8594871794871795
Test Accuracy: 0.8588516746411483
Train Accuracy: 0.8594871794871795
Test Accuracy: 0.8594497607655502
Train Accuracy: 0.8274358974358974
Test Accuracy: 0.8522727272727273


array([[1425,   28],
       [ 219,    0]], dtype=int64)