# Phishing Dataset - Download, Preprocessing, and Model Pipeline

In this notebook, we set up the **Phishing URL Detection** dataset for machine learning models. It walks through the entire workflow: acquiring the dataset, preparing the data for model training, building the models, and training them on the dataset.

Let’s dive in!

## Step 1: Downloading the Phishing Dataset

We start by downloading the **Phishing URL Detection** dataset with the Kaggle API. The files are saved to `data/raw_data/phishing`, resolved relative to the parent of the notebook's working directory.

Note: Make sure you already have the Kaggle API set up, as outlined in the project's README file.
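
Before running the download cell, it can help to confirm that credentials are actually in place. The following is a minimal sketch, not part of the project's scripts: the function name `kaggle_credentials_available` is hypothetical, and it only checks the two common setups (the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` token file in `~/.kaggle` or in the directory named by `KAGGLE_CONFIG_DIR`). Follow the README if your setup differs.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper for illustration; `env` and `home` can be
    overridden, which makes the check easy to try out in isolation.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file, by default in ~/.kaggle,
    # or in the directory named by KAGGLE_CONFIG_DIR if that is set.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("Kaggle credentials not found; see the project's README for setup.")
```

The check is deliberately conservative: it confirms only that credentials exist, not that they are valid.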


In [None]:
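import os
import sys

# The project root is taken to be one level above the notebook's working directory.
BASE_DIR = os.path.abspath('..')
# Make the helper scripts importable from this notebook.
sys.path.append(os.path.join(BASE_DIR, 'scripts'))

from download_data import download_dataset

RAW_DATA_DIR = os.path.join(BASE_DIR, 'data', 'raw_data', 'phishing')
phishing_dataset = 'sergioagudelo/phishing-url-detection'  # Kaggle dataset slug (owner/name)
download_dataset(phishing_dataset, RAW_DATA_DIR)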
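
The `download_dataset` helper is imported from `scripts/download_data.py`, and its source is not shown in this notebook. For readers curious about what such a helper might do, here is a rough sketch under stated assumptions: the name `download_dataset_sketch` and the optional `api` parameter are hypothetical, and the real script may work differently. It relies on `KaggleApi.authenticate` and `KaggleApi.dataset_download_files` from the official `kaggle` package.

```python
import os


def download_dataset_sketch(dataset, dest_dir, api=None):
    """Download and unzip a Kaggle dataset into dest_dir.

    Hypothetical stand-in for the project's download_dataset helper.
    `api` lets a caller pass in an already-authenticated client.
    """
    # Create the target directory if it does not exist yet.
    os.makedirs(dest_dir, exist_ok=True)

    if api is None:
        # Imported lazily because the kaggle package looks for
        # credentials as soon as it is imported.
        from kaggle.api.kaggle_api_extended import KaggleApi

        api = KaggleApi()
        api.authenticate()

    # Fetch the dataset archive and extract it in place.
    api.dataset_download_files(dataset, path=dest_dir, unzip=True)

    # Report what ended up in the directory.
    return sorted(os.listdir(dest_dir))
```

Under these assumptions, calling `download_dataset_sketch(phishing_dataset, RAW_DATA_DIR)` would return the list of file names extracted into `RAW_DATA_DIR`.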



## Step 2: Inspect the Dataset

Now that we have downloaded the dataset, let’s load it into a Pandas DataFrame and inspect the first few rows to understand the data structure. This may take a few seconds...


In [3]:
import os

import pandas as pd

# Recompute the paths so this cell can run on its own.
BASE_DIR = os.path.abspath('..')
RAW_DATA_DIR = os.path.join(BASE_DIR, 'data', 'raw_data', 'phishing')
dataset = os.path.join(RAW_DATA_DIR, 'out.csv')
df = pd.read_csv(dataset)

# Preview the first five rows.
df.head()

Unnamed: 0,url,source,label,url_length,starts_with_ip,url_entropy,has_punycode,digit_letter_ratio,dot_count,at_count,dash_count,tld_count,domain_has_digits,subdomain_count,nan_char_entropy,has_internal_links,whois_data,domain_age_days
0,apaceast.cloudguest.central.arubanetworks.com,Cisco-Umbrella,legitimate,45,False,3.924535,False,0.0,4,0,0,0,False,3,0.310387,False,"{'domain_name': ['ARUBANETWORKS.COM', 'arubane...",8250.0
1,quintadonoval.com,Majestic,legitimate,17,False,3.572469,False,0.0,1,0,0,0,False,0,0.240439,False,"{'domain_name': ['QUINTADONOVAL.COM', 'quintad...",10106.0
2,nomadfactory.com,Majestic,legitimate,16,False,3.32782,False,0.0,1,0,0,0,False,0,0.25,False,"{'domain_name': ['NOMADFACTORY.COM', 'nomadfac...",8111.0
3,tvarenasport.com,Majestic,legitimate,16,False,3.5,False,0.0,1,0,0,0,False,0,0.25,False,"{'domain_name': ['TVARENASPORT.COM', 'tvarenas...",5542.0
4,widget.cluster.groovehq.com,Cisco-Umbrella,legitimate,27,False,3.93027,False,0.0,3,0,0,0,False,2,0.352214,False,"{'domain_name': 'GROOVEHQ.COM', 'registrar': '...",5098.0
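
The `Unnamed: 0` column in the preview is most likely a row index that was written into the CSV when it was saved. The sketch below shows one way to absorb it and run a few basic checks. To stay self-contained, it uses a small in-memory sample built from four of the columns in the five rows shown above; the names `sample_csv` and `sample_df` are illustrative only. On the real file, the same calls would be applied to `dataset` and `df`.

```python
import io

import pandas as pd

# A few columns from the five preview rows, in the same CSV layout
# (the leading empty header is the saved index column).
sample_csv = """,url,label,url_length,domain_age_days
0,apaceast.cloudguest.central.arubanetworks.com,legitimate,45,8250.0
1,quintadonoval.com,legitimate,17,10106.0
2,nomadfactory.com,legitimate,16,8111.0
3,tvarenasport.com,legitimate,16,5542.0
4,widget.cluster.groovehq.com,legitimate,27,5098.0
"""

# index_col=0 turns the unnamed first column into the DataFrame index.
sample_df = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Column types, class balance, and missing values per column.
print(sample_df.dtypes)
label_counts = sample_df["label"].value_counts()
print(label_counts)
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

The first five rows happen to be all `legitimate`, so the class balance of the full dataset should be checked on `df` itself rather than inferred from the preview.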
