# Modeling Baseline Pipeline for Phishing URL Detection

In this notebook, we build a full modeling pipeline for detecting phishing URLs.  
We start with a dataset that contains two columns: a **URL** and a **label** indicating whether the URL is **benign** or **phishing**.

This example uses a well-known research dataset (PhishStorm), but **any dataset with labeled URLs** will work, because **all features in this pipeline are created from scratch** using custom Python functions.
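As a minimal sketch of what creating features from scratch can look like, the function below derives a few lexical features from a raw URL string using only the standard library. The name `extract_basic_url_features` and the particular features chosen are illustrative assumptions, not the feature set this pipeline actually builds.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_basic_url_features(url: str) -> dict:
    """Compute a few simple lexical features from a raw URL string.

    Hypothetical helper for illustration only.
    """
    # urlsplit only recognizes the host when a scheme or '//' is present,
    # so prepend '//' for scheme-less URLs such as 'example.com/path'.
    parts = urlsplit(url if _SCHEME_RE.match(url) else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "uses_https": int(parts.scheme == "https"),
    }


features = extract_basic_url_features("www.dghjdgf.com/paypal.co.uk/login")
print(features)
```

Applied to a DataFrame with a `url` column, the same function could be mapped over the column and the resulting dictionaries expanded into feature columns, for example with `pd.DataFrame(df["url"].map(extract_basic_url_features).tolist())`.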

PhishTank is our additional dataset. It contributes roughly 50K more phishing links.

📘 **Dataset citations**:  

> [1] S. Marchal, J. Francois, R. State, and T. Engel.  
> *PhishStorm: Detecting Phishing with Streaming Analytics*.  
> IEEE Transactions on Network and Service Management (TNSM), 11(4):458–471, 2014.

> PhishTank: https://phishtank.org/developer_info.php (downloaded on October 4, 2025; the dataset is updated daily, but we use the October 4 snapshot)

> Alexa Top 1M most popular domains: https://www.kaggle.com/datasets/nayjest/alexa-domains-1m/data?select=alexa_domains_1M.txt

## Set up

In [23]:
# Confirm we are in the correct venv
import sys
print(sys.executable)

/Users/polinacsv/Documents/github_clones/phishing_URL_detection/.venv/bin/python


In [1]:
# Load libraries
from phishing_URL_detection.load_data import load_phishing_data, load_alexa_domains
import pandas as pd 

# Load Data

In [25]:
# Load the PhishStorm dataset

df_storm = load_phishing_data(
    data_dir='../data',
    filename='urlset.csv', 
    url_col='domain',        
    label_col='label'     
)

# Quick preview
df_storm.head()

Unnamed: 0,url,label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,1
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,1
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,1
3,mail.printakid.com/www.online.americanexpress....,1
4,thewhiskeydregs.com/wp-content/themes/widescre...,1


In [26]:
df_storm.shape

(95913, 2)

In [27]:
# Load the PhishTank dataset

df_tank = load_phishing_data(
    data_dir='../data',
    filename='verified_online.csv', 
    url_col='url',        
    label_col='verified'     
)

# Quick preview
df_tank.head()

Unnamed: 0,url,label
0,http://allegrolokalnie.pl-37968.cfd,1
1,https://japan-aotucheck.index-sign13.ftzldk.to...,1
2,https://allegrolokalnie.kategorie7451825902527...,1
3,https://clinkft.wixsite.com/my-site-1,1
4,https://2024.amda.ug/plugins/content/,1


In [28]:
df_tank.shape

(50646, 2)

In [29]:
# Take small random samples (fixed seed for reproducibility)
sample_storm = df_storm.sample(n=200, random_state=42)   # 200 rows from PhishStorm
sample_tank  = df_tank.sample(n=200, random_state=42)    # 200 rows from PhishTank

# Save the samples as CSV files for upload
sample_storm.to_csv("../data/sample_phishstorm.csv", index=False)
sample_tank.to_csv("../data/sample_phishtank.csv", index=False)

Next, let's load a list of the most visited domains globally from the Alexa Top 1M dataset. This list serves as a proxy for trusted or popular websites.

We'll use it later to engineer features that indicate whether a domain in a URL appears in this trusted set, which may help differentiate between phishing and legitimate URLs.
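One possible shape for such a feature is sketched below, assuming the trusted domains are held in a Python set. The helper names `extract_host` and `in_trusted_set` are hypothetical, and the small `trusted` set stands in for the full list, which could be built with something like `set(alexa_df["alexa_domain"])`. The check matches the host itself and each of its parent domains, so `www.google.com` matches an entry for `google.com`.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_host(url: str) -> str:
    """Return the lowercase host of a URL, with or without a scheme."""
    parts = urlsplit(url if _SCHEME_RE.match(url) else "//" + url)
    return (parts.hostname or "").lower()


def in_trusted_set(url: str, trusted: set) -> int:
    """Return 1 if the URL's host or any parent domain is in `trusted`."""
    labels = extract_host(url).split(".")
    # Try 'a.b.c', then 'b.c'; stop before the bare top-level label.
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in trusted:
            return 1
    return 0


# Toy stand-in for the Alexa list (assumption for illustration).
trusted = {"google.com", "facebook.com", "youtube.com"}

print(in_trusted_set("www.google.com/search", trusted))
print(in_trusted_set("https://google.com.evil.example/login", trusted))
```

A caveat with this design: suffix matching treats every subdomain of a listed domain as trusted, so phishing pages hosted on a popular site-building or hosting platform would also be flagged as trusted if that platform appears in the list. Whether to match exact hosts only or parent domains as well is a modeling choice worth testing.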

In [2]:
# Load the Alexa domain list
alexa_df = load_alexa_domains(
    data_dir='../data',
    filename='alexa_domains_1M.txt'
)

alexa_df.head()

Unnamed: 0,alexa_domain,rank
0,google.com,1
1,facebook.com,2
2,youtube.com,3
3,baidu.com,4
4,yahoo.com,5


In [3]:
alexa_df.shape

(1000000, 2)

In [4]:
# Take a random sample
sample_alexa = alexa_df.sample(n=200, random_state=42)   # 200 rows from the Alexa list

# Save the sample for upload
sample_alexa.to_csv("../data/sample_alexa.csv", index=False)

# PhishStorm EDA

In [30]:
# Basic inspection
shape = df_storm.shape
head = df_storm.head(10)
label_counts = df_storm['label'].value_counts(normalize=False)
label_distribution = df_storm['label'].value_counts(normalize=True)

shape, head, label_counts, label_distribution

((95913, 2),
                                                  url  label
 0  nobell.it/70ffb52d079109dca5664cce6f317373782/...      1
 1  www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...      1
 2  serviciosbys.com/paypal.cgi.bin.get-into.herf....      1
 3  mail.printakid.com/www.online.americanexpress....      1
 4  thewhiskeydregs.com/wp-content/themes/widescre...      1
 5               smilesvoegol.servebbs.org/voegol.php      1
 6  premierpaymentprocessing.com/includes/boleto-2...      1
 7  myxxxcollection.com/v1/js/jih321/bpd.com.do/do...      1
 8                                super1000.info/docs      1
 9  horizonsgallery.com/js/bin/ssl1/_id/www.paypal...      1,
 label
 0    48009
 1    47904
 Name: count, dtype: int64,
 label
 0    0.500547
 1    0.499453
 Name: proportion, dtype: float64)