# 01 - Project Overview & Data Preprocessing

**Project title:** Discovering Behavioral Patterns in Phishing Emails Using Association Rule Mining

**Problem statement (short):**
Phishing emails exploit recurring tactics (urgency, impersonation, deceptive links). This project uses Association Rule Mining (Apriori + rule evaluation) on a cleaned, combined phishing email dataset to surface frequent co-occurring tokens and interpretable rules (support, confidence, lift) that characterize phishing behavior.
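
To make the three rule metrics concrete, here is a minimal sketch that computes them by hand. The transactions are invented for illustration and are not drawn from the dataset, and the helper names `support`, `confidence`, and `lift` are hypothetical rather than part of any library. For a rule A -> B, support is the fraction of emails containing both A and B, confidence is support(A and B) / support(A), and lift is confidence / support(B).

```python
# Toy transactions: each set holds the stemmed tokens of one email (made up for illustration).
transactions = [
    {"urgent", "account", "verifi"},
    {"urgent", "account", "click"},
    {"account", "verifi"},
    {"meet", "schedul"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support; > 1 suggests positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule = ({"urgent"}, {"account"})
print("support:", support(rule[0] | rule[1], transactions))   # 0.5
print("confidence:", confidence(*rule, transactions))          # 1.0
print("lift:", round(lift(*rule, transactions), 3))            # 1.333
```

In this toy data every email containing `urgent` also contains `account`, so confidence is 1.0, and the lift of about 1.33 says the pair co-occurs more often than independence would predict.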

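This notebook loads the combined dataset from `data/raw/final/phishing_email.csv`, cleans the text (lowercasing, URL/email/HTML removal, stopword filtering, stemming), and writes the result to `data/processed/cleaned_phishing.csv`. Paths are given relative to the project root; the code below uses `../data/...` relative to the notebook's own location.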

### Imports and Dataset Load

In [3]:
%pip install nltk
# Imports
import pandas as pd
from pathlib import Path
import re
import nltk

# NLTK resources (download once)
nltk.download('stopwords', quiet=True)

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Paths
RAW = Path("../data/raw/final/phishing_email.csv")
OUT = Path("../data/processed/cleaned_phishing.csv")

# Load dataset (robust to common column names)
df = pd.read_csv(RAW)
print("Raw shape:", df.shape)
print("Columns:", df.columns.tolist())

# Identify text column
text_col = None
for c in ['text_combined','text','message','body','content']:
    if c in df.columns:
        text_col = c
        break
if text_col is None:
    raise KeyError("Couldn't find a text column. Please ensure CSV has one of: text_combined, text, message, body, content")
print("Using text column:", text_col)

# Identify label column; if none of the common names is present, create a placeholder
label_col = None
for c in ['label','spam','class','target']:
    if c in df.columns:
        label_col = c
        break
if label_col is None:
    print("No label column found; creating 'label' with default 1 (phishing) for all rows")
    df['label'] = 1
    label_col = 'label'
print("Using label column:", label_col)


Raw shape: (82486, 2)
Columns: ['text_combined', 'label']
Using text column: text_combined
Using label column: label

### Cleaning Function and Apply

In [4]:
# Cleaning function (robust and documented)
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    """Lowercase, strip URLs/emails/HTML/non-letters, drop stopwords, then stem."""
    if pd.isna(text):
        return ''
    s = str(text).lower()
    # remove URLs (http\S+ also matches https links) and email addresses
    s = re.sub(r'http\S+|www\.\S+', ' ', s)
    s = re.sub(r'\S+@\S+', ' ', s)
    # remove HTML tags
    s = re.sub(r'<.*?>', ' ', s)
    # replace every non-letter character with a space
    s = re.sub(r'[^a-z\s]', ' ', s)
    # collapse whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    # tokenize, drop stopwords and single-character tokens, then stem
    tokens = [stemmer.stem(w) for w in s.split() if w not in stop_words and len(w) > 1]
    return ' '.join(tokens)

# Apply cleaning (this may take some time depending on dataset size)
df['clean_text'] = df[text_col].apply(clean_text)
df['word_count'] = df['clean_text'].apply(lambda s: len(s.split()))
print("After cleaning — sample:")
display(df[[text_col, 'clean_text', 'word_count']].head(5))

After cleaning — sample:


| | text_combined | clean_text | word_count |
|---|---|---|---|
| 0 | hpl nom may 25 2001 see attached file hplno 52... | hpl nom may see attach file hplno xl hplno xl | 10 |
| 1 | nom actual vols 24 th forwarded sabrae zajac h... | nom actual vol th forward sabra zajac hou ect ... | 152 |
| 2 | enron actuals march 30 april 1 201 estimated a... | enron actual march april estim actual march fl... | 17 |
| 3 | hpl nom may 30 2001 see attached file hplno 53... | hpl nom may see attach file hplno xl hplno xl | 10 |
| 4 | hpl nom june 1 2001 see attached file hplno 60... | hpl nom june see attach file hplno xl hplno xl | 10 |


### Basic QC and Save

In [5]:
# Quick QC
print("Null clean_text rows:", df['clean_text'].isna().sum())
print("Min/Max word_count:", df['word_count'].min(), df['word_count'].max())
print("Label distribution:")
print(df[label_col].value_counts(dropna=False))

# Ensure processed dir exists and save
OUT.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT, index=False)
print("Saved cleaned data to", OUT)

Null clean_text rows: 0
Min/Max word_count: 0 463175
Label distribution:
label
1    42891
0    39595
Name: count, dtype: int64
Saved cleaned data to ..\data\processed\cleaned_phishing.csv
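
The QC output reports a minimum `word_count` of 0, so some rows have an empty `clean_text`. One thing to watch when the saved CSV is reloaded: pandas writes an empty string as an empty field and, by default, parses it back as NaN. The sketch below illustrates this on a hypothetical two-row frame (`df_demo` is a stand-in, not the real dataset) and shows one possible way to restore and filter such rows.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset.
df_demo = pd.DataFrame({"clean_text": ["hpl nom may see attach file", ""], "label": [0, 1]})

# Round-trip through CSV in memory instead of touching data/processed/.
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

# The empty string was written as an empty field and is parsed as a missing value.
print("Missing clean_text after reload:", reloaded["clean_text"].isna().sum())

# Restore empty strings so later .split() calls do not fail on missing values.
reloaded["clean_text"] = reloaded["clean_text"].fillna("")

# Optionally keep only rows that still have at least one token.
nonempty = reloaded[reloaded["clean_text"].str.len() > 0]
print("Rows with tokens:", len(nonempty))
```

Whether to drop empty rows or keep them is a design choice for the mining step; the filter here is only one option.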


Notes:
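- Words are stemmed with the Porter stemmer, so rule items are stems (for example `attach`, `actual`) rather than full words. If you prefer lemmatization for interpretability, replace `PorterStemmer` with `WordNetLemmatizer` and call `lemmatize()` instead of `stem()`; this requires `nltk.download('wordnet')`.
- `clean_text` removes URLs and email addresses, so rules are built from word tokens rather than raw links.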
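
As an illustration of the second note, the sketch below applies only the regex steps to a made-up message, using patterns equivalent to those in `clean_text` and leaving out stopword removal and stemming so that it runs without NLTK. The helper name `strip_links` and the sample text are hypothetical.

```python
import re

# Patterns equivalent to those in clean_text (http\S+ already covers https links).
URL_RE = re.compile(r'http\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+')

def strip_links(text):
    """Lowercase, then drop URLs, email addresses, and non-letter characters."""
    s = text.lower()
    s = URL_RE.sub(' ', s)
    s = EMAIL_RE.sub(' ', s)
    s = re.sub(r'[^a-z\s]', ' ', s)
    return re.sub(r'\s+', ' ', s).strip()

sample = "Verify your account at http://example.com/login or mail help@example.com NOW!"
cleaned = strip_links(sample)
print(cleaned)  # verify your account at or mail now

# Each cleaned email can then be treated as a set of unique tokens (one transaction).
transaction = set(cleaned.split())
print(sorted(transaction))
```

The link and the address vanish entirely, so any rule that later mentions `account` or `verify` reflects wording in the message body, not fragments of a URL.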