# Modeling Pipeline for Phishing URL Detection

In this notebook, we build a full modeling pipeline for detecting phishing URLs.  
We start with a dataset that contains two columns: a **URL** and a **label** indicating whether the URL is **benign** or **phishing**.

Although we use a well-known research dataset here, **any dataset of labeled URLs** will work, because **every feature in this pipeline is created from scratch** with custom Python functions.
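To make "created from scratch" concrete, here is a minimal sketch of the kind of custom function this implies. The function name `lexical_features` and the particular features are illustrative assumptions, not the features this pipeline actually builds, and the sketch assumes URLs are stored without a scheme (no `http://` prefix).

```python
def lexical_features(url: str) -> dict:
    """Compute a few simple string-based features for one URL.

    Hypothetical helper for illustration; assumes a scheme-less URL
    such as 'example.com/login'.
    """
    # Everything before the first '/' is treated as the host,
    # everything after it as the path (including any query string).
    host, _, path = url.partition("/")
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "host_length": len(host),
        # Number of '/'-separated path segments; 0 when there is no path.
        "path_depth": path.count("/") + 1 if path else 0,
    }


# Toy inputs, not taken from the dataset.
demo_urls = ["example.com/login", "a-b.example.org/x/y/z.php?id=123"]
demo_features = [lexical_features(u) for u in demo_urls]
```

Because the function only needs the URL string, it applies to any labeled URL dataset; with a pandas DataFrame `df` that has a `url` column, something like `pd.DataFrame(df['url'].map(lexical_features).tolist())` would produce one feature row per URL.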

📘 **Dataset citation**:  

> [1] S. Marchal, J. Francois, R. State, and T. Engel.  
> *PhishStorm: Detecting Phishing with Streaming Analytics*.  
> IEEE Transactions on Network and Service Management (TNSM), 11(4):458–471, 2014.


## Set up

In [1]:
# Load packages
from detecting_phishing_urls.load_data import load_phishing_data, load_alexa_domains


First, let's load our main dataset with URLs and labels.

In [2]:
# Load the PhishStorm dataset
df = load_phishing_data(
    data_dir='../data',
    filename='urlset.csv',
    url_col='domain',   # source column holding the URL (appears as 'url' in the output)
    label_col='label'
)

# Quick preview
df.head()

| | url | label |
|---|---|---|
| 0 | nobell.it/70ffb52d079109dca5664cce6f317373782/... | 1 |
| 1 | www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc... | 1 |
| 2 | serviciosbys.com/paypal.cgi.bin.get-into.herf.... | 1 |
| 3 | mail.printakid.com/www.online.americanexpress.... | 1 |
| 4 | thewhiskeydregs.com/wp-content/themes/widescre... | 1 |


In [3]:
df.shape

(95913, 2)

Next, let's load a list of the most visited domains globally from the Alexa Top 1M dataset. This list serves as a proxy for trusted or popular websites.

We'll use it later to engineer features that indicate whether a domain in a URL appears in this trusted set, which may help differentiate between phishing and legitimate URLs.


In [5]:
# Load the Alexa domain list
alexa_df = load_alexa_domains(
    data_dir='../data',
    filename='alexa_domains_1M.txt'
)

alexa_df.head()


| | alexa_domain |
|---|---|
| 0 | google.com |
| 1 | facebook.com |
| 2 | youtube.com |
| 3 | baidu.com |
| 4 | yahoo.com |
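As a rough sketch of the trusted-domain feature described above, the snippet below flags a URL when its host, or any parent domain of its host, appears in a set of popular domains. The helper names `extract_host` and `in_trusted_set` are hypothetical, and matching on parent domains is an assumption about how the feature could be defined rather than the pipeline's actual implementation. The small `trusted` set stands in for the full Alexa list.

```python
import re


def extract_host(url: str) -> str:
    """Return the lowercase host of a URL, with or without a scheme."""
    url = url.strip().lower()
    # Strip a scheme only at the very start, so 'http://' inside a
    # query string is not mistaken for the scheme.
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url)
    host = re.split(r"[/?#]", url, maxsplit=1)[0]  # drop path, query, fragment
    host = host.rsplit("@", 1)[-1]                 # drop any user-info part
    return host.split(":", 1)[0]                   # drop any port


def in_trusted_set(url: str, trusted: set) -> bool:
    """True if the host or one of its parent domains is in `trusted`."""
    labels = extract_host(url).split(".")
    # For 'mail.yahoo.com' this checks 'mail.yahoo.com' and 'yahoo.com',
    # but never the bare top-level label 'com'.
    return any(".".join(labels[i:]) in trusted for i in range(len(labels) - 1))


# Stand-in for the Alexa list; with the real data this could be
# set(alexa_df['alexa_domain']).
trusted = {"google.com", "facebook.com", "youtube.com", "baidu.com", "yahoo.com"}

demo_urls = [
    "www.google.com/search?q=test",
    "google.com.account-check.example/login",  # trusted name embedded as a prefix
    "http://mail.yahoo.com:443/inbox",
]
flags = [in_trusted_set(u, trusted) for u in demo_urls]
```

Checking suffixes of the host, rather than searching for the domain anywhere in the string, means a trusted name embedded in a subdomain prefix or in the path does not count as a match. Building the lookup as a Python `set` keeps each membership test cheap even with a million domains.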
