# Week 3 – Task 1: SMS Spam Detection using Semi-Supervised Learning

In this task, we apply semi-supervised learning to the SMS Spam Collection dataset. We keep the true labels for a random 20% of the messages, mark the remaining 80% as unlabeled, and use scikit-learn's `LabelSpreading` to infer labels for the unlabeled messages.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
import re


In [None]:
# Load dataset from GitHub
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Preview dataset
df.head()
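
One thing worth checking after loading is how `read_csv` handles double quotes inside messages, since SMS text often contains them. The sketch below uses a hypothetical three-line sample (not rows from the dataset) to show that, with default quoting, a quoted span can swallow a line break and merge two rows, while `quoting=csv.QUOTE_NONE` keeps one row per line. Whether the real file is affected is not verified here; comparing `len(df)` under both settings would tell.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as sms.tsv.
# The first message opens a double quote that only closes on the next line.
sample = 'ham\t"hello there\nspam\twin cash" now\nham\tok\n'
cols = ['label', 'message']

# Default quoting: the quoted span absorbs the newline and the tab.
df_default = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols)

# QUOTE_NONE: quote characters are treated as ordinary text.
df_raw = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols,
                     quoting=csv.QUOTE_NONE)

print("rows with default quoting:", len(df_default))
print("rows with QUOTE_NONE:", len(df_raw))
print(df_raw['label'].value_counts())
```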


In [None]:
# Clean text: lowercase and remove non-alphabetic characters
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text

df['clean_message'] = df['message'].apply(clean_text)

# Encode labels
le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])  # spam = 1, ham = 0
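
To make the effect of the cleaning step concrete, here is a small sketch that applies the same rule to a hypothetical message (not taken from the dataset) and shows how `LabelEncoder` assigns codes. Note that digits are removed along with punctuation, so phone numbers and prices no longer reach the vectorizer.

```python
import re

from sklearn.preprocessing import LabelEncoder


def clean_text(text):
    # Same rule as in the cell above: lowercase, keep only a-z and whitespace.
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)


# Hypothetical message used only for illustration.
raw = "WIN a FREE prize!!! Call 0800-123 now"
cleaned = clean_text(raw)
print(repr(cleaned))

# LabelEncoder sorts class names, so 'ham' becomes 0 and 'spam' becomes 1.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(['ham', 'spam', 'ham'])
print(list(le_demo.classes_), codes.tolist())
```

Removing the phone number leaves two consecutive spaces in the cleaned string; this is harmless here because `TfidfVectorizer` tokenizes on word boundaries.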


In [None]:
# Vectorize text
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['clean_message'])
y = df['label_num'].values
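
The following sketch shows what the vectorizer produces on a hypothetical three-message corpus: a sparse matrix with one row per message and one column per retained term, where stop words are dropped and each non-empty row is scaled to unit L2 norm (the `TfidfVectorizer` default).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages, already cleaned.
docs = [
    "free prize call now",
    "are you free for lunch",
    "call me after lunch",
]

vec_demo = TfidfVectorizer(stop_words='english')
X_demo = vec_demo.fit_transform(docs)

print("shape (messages, terms):", X_demo.shape)
print("vocabulary:", sorted(vec_demo.vocabulary_))

# Each row is L2-normalized by default.
row_norms = np.linalg.norm(X_demo.toarray(), axis=1)
print("row norms:", row_norms)
```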


In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Make 20% of labels available (the rest will be -1 for unlabeled)
y_semi = np.copy(y)
n_labeled = int(0.2 * len(y))

# Randomly choose indices to keep labels
labeled_indices = np.random.choice(len(y), size=n_labeled, replace=False)
unlabeled_indices = np.setdiff1d(np.arange(len(y)), labeled_indices)

# Mask 80% as unlabeled
y_semi[unlabeled_indices] = -1
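
A quick sanity check on the masking is useful, because `LabelSpreading` can only predict classes that appear among the labeled points. The sketch below repeats the masking on a hypothetical label vector and counts what was kept. It uses a local `np.random.default_rng` generator, which is a substitution for the global `np.random.seed` call above, so the sampled indices differ.

```python
import numpy as np

rng = np.random.default_rng(42)  # local generator (assumption; the notebook seeds the global one)

# Hypothetical labels: 100 messages, 15 of them spam (1), the rest ham (0).
y_toy = np.array([1] * 15 + [0] * 85)

n_labeled_toy = int(0.2 * len(y_toy))
labeled_idx = rng.choice(len(y_toy), size=n_labeled_toy, replace=False)

y_semi_toy = np.full_like(y_toy, -1)          # start fully unlabeled
y_semi_toy[labeled_idx] = y_toy[labeled_idx]  # reveal the chosen 20%

n_kept = int((y_semi_toy != -1).sum())
n_masked = int((y_semi_toy == -1).sum())
classes_kept = np.unique(y_semi_toy[labeled_idx])

print("labeled:", n_kept, "unlabeled:", n_masked)
print("classes present among labeled points:", classes_kept)
```

If a class were missing from `classes_kept`, a stratified choice of labeled indices would be one way to guarantee both classes are represented.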


In [None]:
# Train semi-supervised model
model = LabelSpreading(kernel='rbf', alpha=0.2)
model.fit(X, y_semi)

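# Predict labels for every message
y_pred = model.predict(X)

# Evaluate on all data (this includes the 20% whose labels the model saw)
print("Evaluation on all data:")
print(classification_report(y, y_pred, target_names=le.classes_))

# Evaluate only on the 80% that were masked as unlabeled
print("Evaluation on unlabeled data only:")
print(classification_report(y[unlabeled_indices], y_pred[unlabeled_indices], target_names=le.classes_))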
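
After fitting, `LabelSpreading` also exposes `transduction_` (the label inferred for each training point) and `label_distributions_` (per-class scores for each point). The sketch below shows both on synthetic 2-D clusters, which stand in for the TF-IDF features purely for illustration, and scores only the points whose labels were hidden. The data, variable names, and cluster settings are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for the real features: two well-separated clusters.
X_toy, y_toy = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                          cluster_std=0.3, random_state=0)

rng = np.random.default_rng(42)
labeled_idx = rng.choice(len(y_toy), size=40, replace=False)  # 20% labeled
unlabeled_idx = np.setdiff1d(np.arange(len(y_toy)), labeled_idx)

y_semi_toy = np.copy(y_toy)
y_semi_toy[unlabeled_idx] = -1  # -1 marks a point as unlabeled

model_toy = LabelSpreading(kernel='rbf', alpha=0.2)
model_toy.fit(X_toy, y_semi_toy)

# Score only the points whose labels were hidden from the model.
acc_unlabeled = accuracy_score(y_toy[unlabeled_idx],
                               model_toy.transduction_[unlabeled_idx])
print(f"accuracy on unlabeled points: {acc_unlabeled:.3f}")

# Per-class scores; the largest value in a row is a rough confidence.
dist = model_toy.label_distributions_
confidence = dist[unlabeled_idx].max(axis=1)
print("least confident unlabeled point:", confidence.min())
```

Low-confidence rows are natural candidates for manual labeling if more annotation budget becomes available.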


## Summary

In this task, we applied semi-supervised learning to spam detection on the SMS Spam Collection dataset.
We kept the labels for a random 20% of the messages, marked the rest as `-1`, and used `LabelSpreading` to infer the missing labels.
Text preprocessing consisted of lowercasing and removing non-alphabetic characters, followed by TF-IDF vectorization with English stop words removed.
Performance should be judged from the classification report, and in particular from the 80% of messages that were masked, because scores that include the labeled 20% are optimistic. Good results on the masked messages would indicate that semi-supervised learning can be effective when labeled data is scarce.
