# Assignment

## Instructions

### Text Classification for Spam Detection

In this assignment, you will build a text classification model using Naive Bayes to classify SMS messages as spam or ham (non-spam). You will implement text preprocessing techniques and use the Vector Space Model (TF-IDF) to represent the text data.

#### Dataset

You will be using the SMS Spam Collection dataset, which contains a set of SMS messages that have been labeled as either spam or ham (legitimate). This dataset is available through several Python libraries or can be downloaded directly.

#### Tasks

1. **Text Preprocessing**:

   - Load the dataset
   - Implement tokenization
   - Apply stemming or lemmatization
   - Remove stopwords

2. **Feature Extraction**:

   - Use TF-IDF vectorization to convert the text data into numerical features
   - Explore the most important features for spam and ham categories

3. **Classification**:

   - Split the data into training and testing sets
   - Train a Multinomial Naive Bayes classifier
   - Evaluate the model using appropriate metrics (accuracy, precision, recall, F1-score)
   - Create a confusion matrix to visualize the results

4. **Analysis**:
   - Analyze false positives and false negatives
   - Identify characteristics of messages that are frequently misclassified
   - Suggest improvements to your model

#### Starter Code

In [1]:
# Import necessary libraries
import os
from pathlib import Path
from urllib.request import urlretrieve
import pandas as pd
import numpy
import nltk
from nltk.tokenize import word_tokenize

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report


In [2]:
# Download the data
data_path = Path('./notebooks/data')
if not data_path.exists():
    data_path.mkdir(parents=True, exist_ok=True)
    print('Data folder does not exist. Creating folder.')

url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

file_path = data_path / 'sms.tsv'

if file_path.exists():
    print('File already exists, skipping download.')
else:
    try:
        print('Downloading data file')
        urlretrieve(url, file_path)
        print('Download completed')
    except Exception as e:
        # Report the underlying error instead of silently discarding it
        print("Error downloading file. Please check that the file exists at:", url)
        print("Please also check your Internet connection.")
        print("Details:", e)


File already exists, skipping download.


In [3]:
# Convert data to Pandas dataframe
# Read from the same path the download cell saved the file to
sms_data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'message'])
print(sms_data.head())


  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [4]:

# Check data distribution
print(sms_data['label'].value_counts())


label
ham     4825
spam     747
Name: count, dtype: int64


**The target labels are highly skewed: 4,825 ham messages versus 747 spam messages (about 13% spam).**
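
Because of this skew, accuracy on its own can look better than the model really is. The short sketch below uses only the counts printed above to work out the spam share and the accuracy of a trivial classifier that always predicts ham; any real model should be judged against that baseline.

```python
# Class counts taken from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

spam_share = counts["spam"] / total
# Accuracy of a classifier that always predicts the majority class ("ham").
majority_baseline_accuracy = max(counts.values()) / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.3f}")
```

An always-ham classifier already reaches roughly 87% accuracy while catching no spam at all, which is why precision and recall for the spam class matter more here.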

In [5]:
sms_data['target'] = sms_data['label'].map({'ham':0,'spam':1})

In [6]:
sms_data.head()

  label                                            message  target
0   ham  Go until jurong point, crazy.. Available only ...       0
1   ham                      Ok lar... Joking wif u oni...       0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...       1
3   ham  U dun say so early hor... U c already then say...       0
4   ham  Nah I don't think he goes to usf, he lives aro...       0


### Tokenization / Stemming / Lemmatization and Stopwords Removal

In [7]:
# TODO: Implement text preprocessing
# - Tokenization
# - Stemming/Lemmatization
# - Stopwords removal

In [8]:
# Download the required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/aiml/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/aiml/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/aiml/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/aiml/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/aiml/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aiml/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [9]:
# Initialize the stemmer, lemmatizer, and stopword set
snowball = SnowballStemmer(language='english')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [10]:
# Define the preprocessing function
def tokenize_filter_stem_lemma(text):
    """Return filtered tokens, their stems, and their lemmas for one message."""
    # Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    # Remove stopwords and non-alphabetic tokens (digits and punctuation are dropped)
    filtered = [tok for tok in tokens
                if tok.isalpha() and tok not in stop_words]
    # Stem and lemmatize the filtered tokens.
    # Note: lemmatize() treats every token as a noun by default (e.g. 'lives' -> 'life').
    stems = [snowball.stem(tok) for tok in filtered]
    lemmas = [lemmatizer.lemmatize(tok) for tok in filtered]
    # Return all three versions of the processed text
    return {
        'tokens': filtered,
        'stems': stems,
        'lemmas': lemmas
    }

# Apply to the message column
processed = sms_data['message'].apply(tokenize_filter_stem_lemma)


In [11]:
processed

0       {'tokens': ['go', 'jurong', 'point', 'crazy', ...
1       {'tokens': ['ok', 'lar', 'joking', 'wif', 'u',...
2       {'tokens': ['free', 'entry', 'wkly', 'comp', '...
3       {'tokens': ['u', 'dun', 'say', 'early', 'hor',...
4       {'tokens': ['nah', 'think', 'goes', 'usf', 'li...
                              ...                        
5567    {'tokens': ['time', 'tried', 'contact', 'u', '...
5568    {'tokens': ['ü', 'b', 'going', 'esplanade', 'f...
5569    {'tokens': ['pity', 'mood', 'suggestions'], 's...
5570    {'tokens': ['guy', 'bitching', 'acted', 'like'...
5571    {'tokens': ['rofl', 'true', 'name'], 'stems': ...
Name: message, Length: 5572, dtype: object
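
One side effect of the `isalpha()` filter is worth seeing in isolation: every token containing a digit or punctuation is discarded, including numbers that may themselves hint at spam. The sketch below is only an illustration. It uses a hypothetical regex tokenizer (`simple_tokens`) as a stand-in for `word_tokenize` so that it runs without NLTK data, and the message is invented.

```python
import re

def simple_tokens(text):
    # Hypothetical stand-in for nltk.word_tokenize: runs of letters, runs of
    # digits, and single punctuation marks each become a separate token.
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text.lower())

# Invented example message, not taken from the dataset.
message = "Free entry in 2 a wkly comp to win FA Cup! Call 0800 now"
tokens = simple_tokens(message)

kept = [tok for tok in tokens if tok.isalpha()]
dropped = [tok for tok in tokens if not tok.isalpha()]

print("kept:   ", kept)
print("dropped:", dropped)
```

If numbers turn out to be informative, one option to experiment with is mapping each digit run to a placeholder token instead of dropping it.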

In [12]:

# Expand into separate columns
sms_data = pd.concat(
    [sms_data,
     pd.DataFrame(list(processed))],
    axis=1
)
sms_data['clean_text'] = sms_data['lemmas'].apply(lambda toks: ' '.join(toks))


In [13]:
sms_data

Unnamed: 0,label,message,target,tokens,stems,lemmas,clean_text
0,ham,"Go until jurong point, crazy.. Available only ...",0,"[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazi, avail, bugi, n, gre...","[go, jurong, point, crazy, available, bugis, n...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,0,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1,"[free, entry, wkly, comp, win, fa, cup, final,...","[free, entri, wkli, comp, win, fa, cup, final,...","[free, entry, wkly, comp, win, fa, cup, final,...",free entry wkly comp win fa cup final tkts may...
3,ham,U dun say so early hor... U c already then say...,0,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,"[nah, think, goes, usf, lives, around, though]","[nah, think, goe, usf, live, around, though]","[nah, think, go, usf, life, around, though]",nah think go usf life around though
...,...,...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,1,"[time, tried, contact, u, pound, prize, claim,...","[time, tri, contact, u, pound, prize, claim, e...","[time, tried, contact, u, pound, prize, claim,...",time tried contact u pound prize claim easy ca...
5568,ham,Will ü b going to esplanade fr home?,0,"[ü, b, going, esplanade, fr, home]","[ü, b, go, esplanad, fr, home]","[ü, b, going, esplanade, fr, home]",ü b going esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other s...",0,"[pity, mood, suggestions]","[piti, mood, suggest]","[pity, mood, suggestion]",pity mood suggestion
5570,ham,The guy did some bitching but I acted like i'd...,0,"[guy, bitching, acted, like, interested, buyin...","[guy, bitch, act, like, interest, buy, someth,...","[guy, bitching, acted, like, interested, buyin...",guy bitching acted like interested buying some...


### TF-IDF and Train Test Split

In [14]:
# TODO: Apply TF-IDF vectorization
# TODO: Split data into training and testing sets

In [15]:
X = sms_data["clean_text"]
y = sms_data["target"]

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
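
Since the labels are skewed, a plain random split can leave the test set with a somewhat different spam share than the training set. A possible refinement, sketched here on invented toy labels rather than on the notebook's data, is to pass `stratify` to `train_test_split` so both splits keep the same class proportions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a similar kind of imbalance: 20% spam (1), 80% ham (0).
y_toy = [1] * 40 + [0] * 160
X_toy = [f"message {i}" for i in range(len(y_toy))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

In the notebook this would amount to adding `stratify=y` to the existing call; the reported metrics would then change, so the outputs below would need to be regenerated.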

In [17]:
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)
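
To make the TF-IDF weights less of a black box, the sketch below computes them by hand for a tiny invented corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization) and ignores `max_df`, `min_df`, and bigrams, so it is a simplified illustration rather than a replacement for the cell above.

```python
import math
from collections import Counter

# Tiny invented corpus, already "cleaned" into space-separated tokens.
docs = [
    "free prize call now",
    "call me later",
    "free free entry",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
    # L2-normalize so that every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}  {weight:.3f}")
```

Terms that appear in fewer documents ("prize", "now") end up with a higher weight than terms shared across documents ("free", "call").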

### Multinomial Naive Bayes Classification

In [18]:
# TODO: Train a Multinomial Naive Bayes classifier

In [19]:
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

In [20]:
results = pd.DataFrame({
    'message': X_test,
    'true_label': y_test,
    'pred_label': y_pred
}).reset_index(drop=True)

### Model Evaluation

In [21]:
# TODO: Evaluate the model

In [22]:
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
cm_df = pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_)
print("Confusion Matrix:")
print(cm_df, "\n")

Confusion Matrix:
     0    1
0  953    2
1   24  136 



In [23]:
# Compute and display the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       0.99      0.85      0.91       160

    accuracy                           0.98      1115
   macro avg       0.98      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
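
As a sanity check, the spam-class figures in the report can be reproduced by hand from the four cells of the confusion matrix above. The sketch uses only those printed counts.

```python
# Counts read from the confusion matrix above (rows = true, columns = predicted).
tn, fp = 953, 2    # true ham:  predicted ham / predicted spam
fn, tp = 24, 136   # true spam: predicted ham / predicted spam

precision = tp / (tp + fp)           # of messages flagged as spam, share that is spam
recall = tp / (tp + fn)              # of actual spam, share that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The values round to 0.99, 0.85, 0.91, and 0.98, matching the spam row and the accuracy line of the report.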



### Result Analysis

In [24]:
# TODO: Analyze misclassifications

**Two ham messages were predicted as spam (false positives), and 24 spam messages were predicted as ham (false negatives).**

In [25]:
# False positives: model predicted spam (1) but it was ham (0)
false_positives = results[
    (results['pred_label'] == 1) & 
    (results['true_label'] == 0)
]

# False negatives: model predicted ham (0) but it was spam (1)
false_negatives = results[
    (results['pred_label'] == 0) & 
    (results['true_label'] == 1)
]
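
Note that `results` holds the cleaned text, and `reset_index(drop=True)` discards the original row labels, so the raw messages cannot be looked up from it. When inspecting errors it may help to keep the original index instead. The sketch below shows the idea on an invented toy frame; `toy`, `test_idx`, and `y_pred_toy` are hypothetical names, not objects from the notebook.

```python
import pandas as pd

# Toy frame standing in for sms_data (raw text, cleaned text, target).
toy = pd.DataFrame({
    "message": ["Call me at 5?", "FREE prize!! Txt WIN",
                "See you soon", "Win cash now!!!"],
    "clean_text": ["call", "free prize txt win", "see soon", "win cash"],
    "target": [0, 1, 0, 1],
})
test_idx = [1, 2, 3]     # pretend these rows form the test split
y_pred_toy = [0, 0, 1]   # pretend predictions, in the same order as test_idx

# Select the test rows by their original index so the raw text stays attached.
review = toy.loc[test_idx, ["message", "clean_text", "target"]].copy()
review["pred_label"] = y_pred_toy

false_neg = review[(review["target"] == 1) & (review["pred_label"] == 0)]
print(false_neg[["message", "clean_text"]])
```

In the notebook, the equivalent lookup would be along the lines of `sms_data.loc[X_test.index]`, since `train_test_split` preserves the pandas index of `X`.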

In [26]:
false_positives

                message  true_label  pred_label
407  currently scotland           0           1
978           free call           0           1


In [27]:
false_negatives

                                               message  true_label  pred_label
83   hi sue year old work lapdancer love sex text l...           1           0
181  yes place town meet exciting adult single uk t...           1           0
222                                   new message call           1           0
307  email alertfrom jeri stewartsize prescripiton ...           1           0
367  send logo ur lover name joined heart txt love ...           1           0
425  next amazing xxx video sent enjoy one vid enou...           1           0
458  people dogging area call join like minded guy ...           1           0
521  dear voucher holder next meal u use following ...           1           0
538  realize year thousand old lady running around ...           1           0
549       missed call alert number called left message           1           0


**The following spam messages were wrongly classified as ham (shown after preprocessing).**

In [28]:
for i in false_negatives['message']:
    print(i)

hi sue year old work lapdancer love sex text live bedroom text sue textoperator
yes place town meet exciting adult single uk txt chat
new message call
email alertfrom jeri stewartsize prescripiton drvgsto listen email call
send logo ur lover name joined heart txt love mobno eg love adam eve yahoo txtno ad
next amazing xxx video sent enjoy one vid enough text back keyword get next video
people dogging area call join like minded guy arrange evening minapn
dear voucher holder next meal u use following link pc enjoy dining experiencehttp
realize year thousand old lady running around tattoo
missed call alert number called left message
mila blonde new uk look sex uk guy u like fun text mtalk increment
let send free anonymous masked message im sending message see potential abuse
wml c
yes place town meet exciting adult single uk txt chat
oh god found number glad text back xafter msg cst std ntwk chg
fantasy football back tv go sky gamestar sky active play dream team scoring start saturday reg

### Conclusion / Further Actions

**Which error to focus on depends on the relative cost of letting spam through versus losing a legitimate SMS. If wrongly flagging an important SMS as spam (a false positive) is the costlier error, we should prioritize precision for the spam class. If letting spam through (a false negative) is the costlier error, we should prioritize recall.**
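
A concrete way to see this trade-off is to vary the probability cutoff used to call a message spam. The sketch below uses invented probabilities and labels (in the notebook they could come from `clf.predict_proba(X_test_tfidf)[:, 1]` and `y_test`) and a hypothetical helper, `precision_recall`.

```python
# Hypothetical spam probabilities and true labels (1 = spam, 0 = ham).
probs = [0.95, 0.60, 0.45, 0.35, 0.55, 0.10, 0.05, 0.02]
truth = [1,    1,    1,    1,    0,    0,    0,    0]

def precision_recall(threshold):
    # Predict spam whenever the probability reaches the threshold.
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.7, 0.5, 0.3):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the cutoff from 0.7 to 0.3 raises recall from 0.25 to 1.00 while precision falls from 1.00 to 0.80.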

**There are several ways to improve the model:**

- **We can use grid search to find the optimal parameters, using recall or precision as the scoring metric.**
- **We can lower the classification probability threshold below 0.5 to capture more spam, or raise it above 0.5 to avoid flagging non-spam messages.**
- **We can resample the training data to give more weight to spam and less to non-spam messages.**
- **We can experiment with different n-gram ranges to get the best performance.**
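
As a rough illustration of the first and last suggestions, the sketch below runs a small grid search over the n-gram range and the Naive Bayes smoothing parameter `alpha`, scored by recall on the spam class. The corpus, the grid values, and the number of folds are all invented for the example; in the notebook the search would be fitted on `X_train` and `y_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus, only for illustration.
spam = ["free prize call now", "win cash prize txt", "claim free entry now",
        "txt win to claim cash", "free call prize claim", "urgent win free cash"]
ham = ["see you at lunch", "call me when home", "ok see you later",
       "are you home now", "lunch later today", "thanks see you soon"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Keeping the vectorizer inside the pipeline refits it on each training fold.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}
search = GridSearchCV(
    pipe, param_grid, scoring="recall",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(texts, labels)

print("best params:", search.best_params_)
print("best CV recall:", round(search.best_score_, 3))
```

Swapping `scoring="recall"` for `"precision"` or `"f1"` changes which error the search favors, in line with the cost discussion above.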

## End