# SMS Spam Classification - Data Preparation

This notebook prepares the SMS spam dataset for model training. We will:
1. Load the raw data
2. Preprocess and clean it
3. Split into train/validation/test sets
4. Save the splits to CSV files

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## 1. Load Data

The SMS Spam Collection dataset is tab-separated with two columns: label (ham/spam) and the message text.
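As a minimal sketch of the layout described above, the snippet below parses two made-up rows from an in-memory string instead of the real file. The sample text and the `sample_df` name are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Two invented rows in the same layout: label, a tab, then the message text.
sample = "ham\tSee you at 5 then\nspam\tWIN a free prize now\n"

# The file has no header row, so the column names are supplied explicitly.
sample_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["label", "message"],
)

print(sample_df.shape)
print(sample_df["label"].tolist())
```

Passing `header=None` matters here: without it, pandas would treat the first message as the column header.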

In [2]:
def load_data(file_path):
    """
    Load SMS spam data from the given file path.
    Returns a dataframe with 'label' and 'message' columns.
    """
    df = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'message'], encoding='latin-1')
    return df

In [3]:
# load the data
df = load_data('SMSSpamCollection')
print(f"Loaded {len(df)} messages")
df.head()

Loaded 5572 messages


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# check class distribution
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

About 86.6% of the messages are ham (4825) and 13.4% are spam (747). The dataset is imbalanced, which is expected in real-world spam detection.
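To make the imbalance concrete, here is a small sketch that recomputes the class shares from the counts printed above. It also shows the accuracy a trivial "always predict ham" rule would reach, which is one reason plain accuracy can mislead on this data. The `counts` dictionary is typed in by hand from the output, and the baseline is an illustration rather than a step of this notebook.

```python
# Counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a rule that always predicts the majority class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {share * 100:.1f}%")
print(f"Majority-class baseline accuracy: {majority_baseline * 100:.1f}%")
```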

## 2. Preprocess Data

Let's do some basic preprocessing:
- Check for missing values and drop any rows that have them
- Convert labels to binary (0 for ham, 1 for spam)
- Basic text cleaning
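The sketch below applies the same three steps to a toy frame, so the effect of each one is visible. The rows are invented, and the third label is deliberately mis-cased to show that `Series.map` returns `NaN` for any value missing from the mapping. A check like `unmapped` is an optional safeguard assumed here, not something the notebook performs.

```python
import pandas as pd

# Invented rows; "Spam" is mis-cased on purpose.
toy = pd.DataFrame({
    "label": ["ham", "spam", "Spam"],
    "message": ["  Hello THERE ", "FREE entry!!  ", "Call now"],
})

# Labels that are not keys of the mapping become NaN.
toy["label_bin"] = toy["label"].map({"ham": 0, "spam": 1})

# Lowercase the text and strip leading/trailing whitespace.
toy["message"] = toy["message"].str.lower().str.strip()

# Count labels that failed to map, so they can be inspected.
unmapped = int(toy["label_bin"].isna().sum())
print(toy)
print(f"Unmapped labels: {unmapped}")
```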

In [5]:
def preprocess_data(df):
    """
    Preprocess the SMS data:
    - Remove missing values
    - Convert labels to binary (ham=0, spam=1)
    - Clean text (lowercase, strip whitespace)
    """
    # make a copy
    df = df.copy()
    
    # remove any missing values
    df = df.dropna()
    
    # convert labels to binary
    df['label'] = df['label'].map({'ham': 0, 'spam': 1})
    
    # basic text cleaning - lowercase and strip whitespace
    df['message'] = df['message'].str.lower().str.strip()
    
    return df

In [6]:
# check for missing values before preprocessing
print("Missing values:")
print(df.isnull().sum())

Missing values:
label      0
message    0
dtype: int64


In [7]:
# preprocess the data
df_clean = preprocess_data(df)
print(f"After preprocessing: {len(df_clean)} messages")
df_clean.head()

After preprocessing: 5572 messages


| | label | message |
|---|---|---|
| 0 | 0 | go until jurong point, crazy.. available only ... |
| 1 | 0 | ok lar... joking wif u oni... |
| 2 | 1 | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | 0 | u dun say so early hor... u c already then say... |
| 4 | 0 | nah i don't think he goes to usf, he lives aro... |


In [8]:
# verify label conversion
print("Label distribution after preprocessing:")
print(df_clean['label'].value_counts())
print(f"\nSpam percentage: {df_clean['label'].mean()*100:.1f}%")

Label distribution after preprocessing:
label
0    4825
1     747
Name: count, dtype: int64

Spam percentage: 13.4%


## 3. Split Data

We'll split the data into:
- Train: 70%
- Validation: 15%
- Test: 15%

We use stratified splitting so that each split keeps the same ham/spam ratio as the full dataset.
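As a rough illustration of what stratification does, the sketch below splits synthetic labels with `stratify` and compares the positive rate in the held-out part with the overall rate. The label counts, the 20% test fraction, and `random_state=0` are arbitrary choices for the example and are not the notebook's data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 800 negatives and 200 positives (20% positive).
labels = np.array([0] * 800 + [1] * 200)
indices = np.arange(len(labels))

# Stratifying on the labels keeps the class ratio in both parts.
rest_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,
    random_state=0,
    stratify=labels,
)

overall_rate = labels.mean()
test_rate = labels[test_idx].mean()
print(f"Overall positive rate: {overall_rate:.3f}")
print(f"Test positive rate:    {test_rate:.3f}")
```

Without `stratify`, the test rate would vary from seed to seed, which matters more the rarer the positive class is.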

In [9]:
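def split_data(df, train_size=0.7, val_size=0.15, test_size=0.15, random_state=42):
    """
    Split data into train, validation, and test sets.
    Uses stratified splitting to maintain class distribution.
    The three sizes are fractions of the full dataset and must sum to 1.
    """
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")
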
    # first split: separate test set
    train_val, test = train_test_split(
        df, 
        test_size=test_size, 
        random_state=random_state,
        stratify=df['label']
    )
    
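    # second split: separate train and validation
    # val_size is a fraction of the full data, so rescale it to a
    # fraction of train_val (0.15 / 0.85 with the defaults)
    val_ratio = val_size / (train_size + val_size)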
    train, val = train_test_split(
        train_val,
        test_size=val_ratio,
        random_state=random_state,
        stratify=train_val['label']
    )
    
    return train, val, test

In [10]:
# split the data
train_df, val_df, test_df = split_data(df_clean)

print(f"Train size: {len(train_df)} ({len(train_df)/len(df_clean)*100:.1f}%)")
print(f"Validation size: {len(val_df)} ({len(val_df)/len(df_clean)*100:.1f}%)")
print(f"Test size: {len(test_df)} ({len(test_df)/len(df_clean)*100:.1f}%)")

Train size: 3900 (70.0%)
Validation size: 836 (15.0%)
Test size: 836 (15.0%)


In [11]:
# verify class distribution is maintained in all splits
print("Spam percentage in each split:")
print(f"  Train: {train_df['label'].mean()*100:.1f}%")
print(f"  Validation: {val_df['label'].mean()*100:.1f}%")
print(f"  Test: {test_df['label'].mean()*100:.1f}%")

Spam percentage in each split:
  Train: 13.4%
  Validation: 13.4%
  Test: 13.4%


The spam percentage is 13.4% in every split, matching the full dataset, so the stratified split preserved the class distribution.
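A further check that may be worth running is that the three splits do not share rows and together cover the whole dataset. The sketch below does this on a toy frame using the same two-step split pattern. The toy data and the names `no_overlap` and `covers_all` are assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned data: 80 ham (0) and 20 spam (1).
toy = pd.DataFrame({
    "label": [0] * 80 + [1] * 20,
    "message": [f"msg {i}" for i in range(100)],
})

# Same two-step pattern: hold out the test set, then carve out validation.
rest, test = train_test_split(
    toy, test_size=0.15, random_state=42, stratify=toy["label"]
)
train, val = train_test_split(
    rest, test_size=0.15 / 0.85, random_state=42, stratify=rest["label"]
)

# train_test_split keeps the original row index, so the index can be
# used to confirm the splits are disjoint and complete.
all_idx = pd.concat([train, val, test]).index
no_overlap = not all_idx.duplicated().any()
covers_all = set(all_idx) == set(toy.index)
print(f"No overlap: {no_overlap}, covers all rows: {covers_all}")
```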

## 4. Save Splits

Save the train, validation, and test sets to CSV files.
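The sketch below shows a CSV round trip on a toy frame, written to a temporary directory so nothing in the working directory is touched. It illustrates why `index=False` is used: the file then holds only the `label` and `message` columns, and reading it back does not produce an extra index column. The toy rows and the temporary path are assumptions for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"label": [0, 1], "message": ["ok lar", "free entry"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    # index=False leaves the row index out of the file.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(list(restored.columns))
print(len(restored))
```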

In [12]:
def save_splits(train_df, val_df, test_df):
    """
    Save the data splits to CSV files.
    """
    train_df.to_csv('train.csv', index=False)
    val_df.to_csv('validation.csv', index=False)
    test_df.to_csv('test.csv', index=False)
    print("Saved: train.csv, validation.csv, test.csv")

In [13]:
# save the splits
save_splits(train_df, val_df, test_df)

Saved: train.csv, validation.csv, test.csv


In [14]:
# quick verification - load and check saved files
print("Verification - loading saved files:")
print(f"train.csv: {len(pd.read_csv('train.csv'))} rows")
print(f"validation.csv: {len(pd.read_csv('validation.csv'))} rows")
print(f"test.csv: {len(pd.read_csv('test.csv'))} rows")

Verification - loading saved files:
train.csv: 3900 rows
validation.csv: 836 rows
test.csv: 836 rows


## Summary

Data preparation complete:
- Loaded 5572 SMS messages
- Preprocessed (cleaned text, converted labels to binary)
- Split into train (70%), validation (15%), test (15%)
- Saved to CSV files

The data is now ready for model training in train.ipynb.