# Assignment 2 — SMS Spam Detection (Text)
**Student:** Justus Izuchukwu Onuh  
**Institution:** Ho Chi Minh City University of Technology  
**Course:** Programming Platform for Data Analysis and Visualization (CO5177)  
**Lecturer:** LE THANH SACH  

**Dataset:** SMS Spam Collection (UCI Machine Learning Repository)  
**Filename on UCI:** `SMSSpamCollection`  
**Donated / Published on UCI:** 21 June 2012  
**Instances:** 5,574 (as listed on UCI; the mirror file loaded below yields 5,572 rows)  
**License:** CC BY 4.0  

---

## Objectives
1. Load and explore the SMS Spam dataset.  
2. Perform thorough exploratory data analysis (EDA) with visualizations.  
3. Preprocess text messages and extract features for machine learning.  
4. Train and evaluate at least two classification models (e.g., Naive Bayes, Logistic Regression).  
5. Compare model performance using relevant metrics (accuracy, precision, recall, F1, confusion matrix).  
6. Present results, discussion, and conclusions in markdown (suitable for Colab markdown report).

---


The following cells implement the full workflow step-by-step. Comments and markdown are written as if *I* (the student) am performing the assignment.

## Plan / Tasks
1. Load dataset from the UCI repository (raw text file).  
2. Inspect structure, labels (ham/spam), and basic stats.  
3. Clean and preprocess text (lowercase, remove punctuation, optional stopwords, tokenization).  
4. Feature extraction: TF-IDF vectorization (and optionally count vectors or simple text features like message length).  
5. Train/Test split and baseline model (Multinomial Naive Bayes).  
6. Additional model: Logistic Regression (with class weighting or parameter tuning).  
7. Evaluate with confusion matrix and classification report.  
8. Conclude and list references.


In [None]:
# 1) Imports and helper functions
# This cell imports the necessary libraries and defines the text-cleaning helper.

# --- Importing Python standard libraries ---
import os
import io
import sys
from pathlib import Path
import re
import string

# --- Data processing and analysis libraries ---
import numpy as np
import pandas as pd

# --- Visualization library ---
import matplotlib.pyplot as plt

# --- Machine learning libraries ---
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# For reproducibility and consistent results
RANDOM_STATE = 42

print("✅ Libraries successfully imported!")

def clean_text(text, verbose=False):
    """
    Clean input text by:
    1. Converting to lowercase
    2. Removing URLs and emails
    3. Removing punctuation
    4. Collapsing extra whitespace

    Set verbose=True to print the text before and after cleaning
    (off by default, since this function runs once per message).
    """
    raw = str(text)  # convert first so slicing is safe for non-string input
    if verbose:
        print(f" Cleaning text: {raw[:50]}...")  # show first 50 characters of text

    text = raw.lower()  # convert to lowercase

    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|\S+@\S+', ' ', text)

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove multiple spaces and trim
    text = re.sub(r'\s+', ' ', text).strip()

    if verbose:
        print(f"✨ Cleaned text: {text[:50]}...")  # show first 50 characters of cleaned text
    return text


✅ Libraries successfully imported!


In [9]:
# 2) Download & Load SMS Spam Dataset 
# - Downloads from public mirror (GitHub UCI mirror)
# - Saves dataset to data/spam.csv
# - Loads into pandas and prints dataset info

import os
from pathlib import Path
import pandas as pd
import requests

# Public mirror of SMS Spam Collection dataset
DATA_URL = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"

DATA_DIR = Path("data")
DATA_FILE = DATA_DIR / "spam.csv"

def download_dataset(save_dir=DATA_DIR, save_path=DATA_FILE, url=DATA_URL):
    """Download SMS Spam dataset without Kaggle API, save as spam.csv"""
    print(f"📁 Ensuring data directory exists: {save_dir}")
    save_dir.mkdir(parents=True, exist_ok=True)

    # Check if dataset already exists
    if save_path.exists():
        print(f"✅ Dataset already exists at: {save_path.resolve()}")
        return

    print("⬇️ Downloading SMS Spam dataset (no Kaggle API)...")
    try:
        response = requests.get(url, timeout=30)  # avoid hanging indefinitely
        response.raise_for_status()

        # Save the raw tab-separated content (only the filename uses .csv)
        with open(save_path, "wb") as f:
            f.write(response.content)

        print(f"✅ Download complete! File saved to: {save_path.resolve()}")
    except Exception as e:
        print("❌ ERROR: Download failed. Please check your internet connection.")
        print("Error details:", e)

def load_sms_spam(file_path=DATA_FILE):
    """Load dataset into pandas and format columns"""
    print(f"📄 Loading dataset from: {file_path.resolve()}")

    df = pd.read_csv(file_path, sep="\t", header=None, names=["label", "message"], encoding="latin-1")

    print("✅ Dataset loaded successfully!")
    print(f"Shape: {df.shape}")

    # Display class distribution
    print("\nLabel counts:")
    print(df["label"].value_counts())

    # Preview
    print("\n🔍 Preview (first 8 rows):")
    print(df.head(8).to_string(index=False))

    return df

# ---- Execute Steps ----
download_dataset()
df = load_sms_spam()


📁 Ensuring data directory exists: data
⬇️ Downloading SMS Spam dataset (no Kaggle API)...
✅ Download complete! File saved to: /Users/izunwaonu/Desktop/CO5177_Assingment /data/spam.csv
📄 Loading dataset from: /Users/izunwaonu/Desktop/CO5177_Assingment /data/spam.csv
✅ Dataset loaded successfully!
Shape: (5572, 2)

Label counts:
label
ham     4825
spam     747
Name: count, dtype: int64

🔍 Preview (first 8 rows):
label                                                                                                                                                          message
  ham                                                  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
  ham                                                                                                                                    Ok lar... Joking wif u oni...
 spam      Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text F

### Dataset Download & Loading (Local Save Method)

In this project, we use the **SMS Spam Collection Dataset** for building an SMS spam detection system.  
The dataset contains labeled SMS messages as either:

- `ham` — legitimate messages  
- `spam` — unwanted/promotional/scam messages  

#### What We Did

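1. Created a `data/` folder in our project directory  
2. Downloaded the dataset from a public GitHub mirror with `requests` **(no Kaggle API)**  
3. Saved the file as `data/spam.csv` (the content is tab-separated despite the `.csv` extension)  
4. Loaded the file into pandas with `sep="\t"`  
5. Assigned the column names at load time, since the file has no header row:  
   - first column → `label`  
   - second column → `message`  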
6. Printed:
   - Dataset shape  
   - Class distribution  
   - First 8 rows  

#### Dataset Summary

| Item | Value |
|------|------|
| Total rows | 5,572 |
| Columns | 2 (`label`, `message`) |
| Ham messages | 4,825 |
| Spam messages | 747 |

**Class Imbalance Notice:**  
The dataset is imbalanced: only about **13.4% of messages are spam** (747 of 5,572), so **precision, recall, and F1-score** matter more than accuracy alone.
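
To make the imbalance concrete, the sketch below computes what a trivial classifier that always predicts `ham` would score, using the class counts from the summary table. The function name `majority_baseline` is hypothetical and used only for illustration.

```python
def majority_baseline(n_ham, n_spam):
    """Scores for a classifier that labels every message as ham.

    Spam is treated as the positive class, matching the notebook's encoding.
    """
    total = n_ham + n_spam
    accuracy = n_ham / total   # every ham message is counted as correct
    spam_recall = 0 / n_spam   # no spam message is ever flagged
    return accuracy, spam_recall


# Class counts taken from the dataset summary above
accuracy, spam_recall = majority_baseline(n_ham=4825, n_spam=747)
print(f"Accuracy: {accuracy:.4f}, spam recall: {spam_recall:.1f}")
```

Such a baseline reaches roughly 87% accuracy while catching no spam at all, which is why any trained model should be judged against it on recall and F1 as well.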

#### 🔍 Preview of Dataset

Example messages appear in the preview output of the loading cell above.



## EDA (Exploratory Data Analysis)
Perform the following checks and visualizations:
- Class distribution (ham vs spam)
- Message length distribution (characters & words)
- Most common tokens in ham vs spam (top-n words)
- Example messages for each class
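
The top-n token check listed above could be sketched as follows. This is a minimal illustration, not part of the original cells: the helper name `top_tokens`, its regex tokenizer, and the toy messages are all assumptions.

```python
import re
from collections import Counter


def top_tokens(messages, n=10):
    """Return the n most frequent lowercase tokens across an iterable of messages."""
    counts = Counter()
    for message in messages:
        # Simple tokenizer (an assumption): runs of letters, digits, or apostrophes
        counts.update(re.findall(r"[a-z0-9']+", str(message).lower()))
    return counts.most_common(n)


# Toy messages for illustration only; in the notebook, pass
# df[df['label'] == 'spam']['message'] (or the ham subset) instead.
toy_spam = ["Free prize! Claim your free prize", "Free entry, call now"]
print(top_tokens(toy_spam, n=2))
```

Calling it once per class and comparing the two lists gives a quick view of which words separate spam from ham. Common stopwords will dominate unless they are filtered out first.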


In [7]:

# 3) EDA - run after you successfully load `df`
try:
    print('Dataset shape:', df.shape)
    print('\nLabel distribution:')
    print(df['label'].value_counts())

    # Add basic features
    df['msg_len_chars'] = df['message'].apply(len)
    df['msg_len_words'] = df['message'].apply(lambda s: len(str(s).split()))

    display(df.groupby('label')[['msg_len_chars','msg_len_words']].describe().T)

    # Plot distributions (matplotlib, no custom colors)
    plt.figure(figsize=(8,4))
    df['msg_len_chars'].hist(bins=50)
    plt.title('Message length (chars) - overall')
    plt.xlabel('Chars')
    plt.ylabel('Count')
    plt.show()

    plt.figure(figsize=(8,4))
    df[df['label']=='ham']['msg_len_chars'].hist(bins=50)
    plt.title('Message length (chars) - ham')
    plt.xlabel('Chars')
    plt.ylabel('Count')
    plt.show()

    plt.figure(figsize=(8,4))
    df[df['label']=='spam']['msg_len_chars'].hist(bins=50)
    plt.title('Message length (chars) - spam')
    plt.xlabel('Chars')
    plt.ylabel('Count')
    plt.show()

    # Show a few examples
    print('\nExamples of ham messages:')
    display(df[df['label']=='ham'].sample(6, random_state=RANDOM_STATE)['message'].tolist()[:6])

    print('\nExamples of spam messages:')
    display(df[df['label']=='spam'].sample(6, random_state=RANDOM_STATE)['message'].tolist()[:6])

except NameError:
    print('`df` not found. Run the Load dataset cell first.')


`df` not found. Run the Load dataset cell first.


## Preprocessing & Modeling
1. Clean text with `clean_text()` defined earlier.  
2. Train/Test split (80/20, stratified by label).  
3. Vectorize using `TfidfVectorizer` (limit max_features e.g., 5000), fitted on the training split only so that test-set vocabulary and IDF statistics do not leak into training.  
4. Train Multinomial Naive Bayes and Logistic Regression.  
5. Evaluate and compare metrics.


In [8]:

# 4) Preprocessing, Train/Test split, Vectorization, Modeling
try:
    # Basic cleaning
    df['clean_msg'] = df['message'].astype(str).apply(clean_text)

    y = (df['label'] == 'spam').astype(int)  # spam=1, ham=0

    # Split the text first so the vectorizer never sees test messages
    msg_train, msg_test, y_train, y_test = train_test_split(
        df['clean_msg'], y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )

    # Vectorize: fit on the training split only, then transform the test split
    tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2), stop_words='english')
    X_train = tfidf.fit_transform(msg_train)
    X_test = tfidf.transform(msg_test)

    print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

    # Model 1: Multinomial Naive Bayes
    mnb = MultinomialNB()
    mnb.fit(X_train, y_train)
    y_pred_mnb = mnb.predict(X_test)

    # Model 2: Logistic Regression
    lr = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE)
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)

    # Evaluation helper
    def eval_model(y_true, y_pred, model_name='model'):
        print(f'==== {model_name} ====\n')
        print('Accuracy:', accuracy_score(y_true, y_pred))
        print('Precision:', precision_score(y_true, y_pred))
        print('Recall:', recall_score(y_true, y_pred))
        print('F1:', f1_score(y_true, y_pred))
        print('\nClassification report:\n')
        print(classification_report(y_true, y_pred, target_names=['ham','spam']))
        cm = confusion_matrix(y_true, y_pred)
        print('\nConfusion matrix:\n', cm)

    eval_model(y_test, y_pred_mnb, 'MultinomialNB')
    print('\n-----------------------------------\n')
    eval_model(y_test, y_pred_lr, 'Logistic Regression')

except NameError:
    print('`df` not found or dataset not loaded. Run the Load dataset cell first.')


`df` not found or dataset not loaded. Run the Load dataset cell first.


## Model improvement suggestions
- Hyperparameter tuning (GridSearchCV) for Logistic Regression (C) and TfidfVectorizer (max_features, ngram_range).  
- Try additional models: LinearSVC, Random Forest (on TF-IDF reduced with SelectKBest or TruncatedSVD), or simple neural nets.  
- Use cross-validation and report mean ± std metrics.  
- Analyze feature importances / top tokens for spam class (inspect coefficients from LR or log-count ratios for NB).  
- Add more engineered features: presence of URLs, phone numbers, all-caps tokens, number of digits, punctuation counts.
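
As a starting point for the tuning and cross-validation suggestions above, the sketch below wraps the vectorizer and classifier in a scikit-learn `Pipeline` and searches a small grid with `GridSearchCV`. The toy messages, the grid values, and the two-fold split are assumptions chosen to keep the example small. On the real dataset, a larger grid and more folds would be more appropriate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy data for illustration only; in the notebook, use df['clean_msg'] and y.
texts = [
    "free prize claim now", "win cash free entry", "urgent call to claim prize",
    "free tickets text now", "see you at lunch", "ok talk later",
    "are you coming home", "call me when you are free",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # spam=1, ham=0

# Putting the vectorizer inside the pipeline means it is re-fitted on each
# training fold, so no validation text leaks into the TF-IDF statistics.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best mean cross-validated F1: {search.best_score_:.3f}")
```

The per-fold scores in `search.cv_results_` (for example `mean_test_score` and `std_test_score`) can be used to report mean ± std metrics.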


## Results, Discussion & Conclusion
(Write your results here after running the models in Colab. Include figures and tables.)

- Briefly summarize which model performed better and why.  
- Discuss limitations (dataset bias, short messages, domain differences).  
- Conclude and propose future work.


## References
- Almeida, T. & Hidalgo, J. (2011). SMS Spam Collection [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5CC84. (UCI dataset page: SMSSpamCollection).  
- UCI Machine Learning Repository — SMS Spam Collection. Dataset page and file: `SMSSpamCollection`.  

(When you submit on Colab, make sure the notebook's markdown cells are visible and that you produce narrative explanations as required by the lecturer.)