# Phishing Dataset - Download, Preprocessing, and Model Pipeline

In this notebook, we set up the **Phishing URL Detection** dataset for machine learning models. It walks through the entire workflow: acquiring the dataset, preparing the data for model training, creating the models, and training them on the dataset.

Let’s dive in!

## Downloading the Phishing Dataset

We will start by downloading the **Phishing URL Detection** dataset using the `Kaggle API`. The dataset will be downloaded into the `data/raw_data/phishing` directory.

Note: Make sure you already have the Kaggle API set up, as outlined in the project's README file.


In [3]:
import sys
import os

BASE_DIR = os.path.abspath('..')
RAW_DATA_DIR = os.path.join(BASE_DIR, 'data/raw_data/phishing')
sys.path.append(os.path.join(BASE_DIR, 'scripts'))

from download_data import download_dataset
phishing_dataset = 'sergioagudelo/phishing-url-detection'
download_dataset(phishing_dataset, RAW_DATA_DIR)

Current working directory: /home/rishupishu/Documents/HWUD/YEAR 4/SEM 1/F20DL/CW/Dubai_UG-6/notebooks
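The project's `download_dataset` helper is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built on the official `kaggle` package. The name `download_dataset_sketch` and the optional `api` parameter are assumptions made for illustration and are not taken from the project's `scripts` folder.

```python
import os


def download_dataset_sketch(dataset_slug, dest_dir, api=None):
    """Download and unzip a Kaggle dataset into dest_dir (illustrative only)."""
    os.makedirs(dest_dir, exist_ok=True)
    if api is None:
        # Imported lazily: importing `kaggle` authenticates right away and
        # fails if no Kaggle credentials are configured.
        import kaggle
        api = kaggle.api
    # `dataset_slug` has the form 'owner/dataset-name'.
    api.dataset_download_files(dataset_slug, path=dest_dir, unzip=True)
    return dest_dir
```

Accepting the client as a parameter keeps the function testable without network access or credentials; in normal use it would be called as `download_dataset_sketch(phishing_dataset, RAW_DATA_DIR)`.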


## Inspect the Dataset

Now that we have downloaded the dataset, let’s load it into a Pandas DataFrame and inspect the first few rows to understand the data structure. This may take a few seconds...


In [None]:
import pandas as pd

dataset = os.path.join(RAW_DATA_DIR, 'out.csv')
df = pd.read_csv(dataset)

df.head()  # show the first five rows
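Beyond looking at the first rows, a per-column summary can help spot missing values and categorical columns before preprocessing. The sketch below uses a small made-up frame in place of the real `df`; the `summarize` helper and the column names are hypothetical.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'n_unique': df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy frame with invented columns, standing in for the real dataset.
toy = pd.DataFrame({
    'url_length': [23, 87, None, 41],
    'tld': ['com', 'xyz', 'com', None],
    'label': [0, 1, 0, 1],
})
print(toy.shape)
print(summarize(toy))
```

On the real data, `summarize(df)` would be called the same way once `df` is loaded.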

## Preprocessing the Data

Next, we will preprocess the dataset to prepare it for machine learning models. The steps are:

1. Loading the data.
2. Handling missing values.
3. Encoding categorical features.
4. Detecting and removing outliers.
5. Normalization/standardization.
6. Flattening the data.
7. Principal component analysis (PCA).
8. Splitting the data into training and testing sets.
9. Saving the data into flattened and unflattened sets.
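To make several of these steps concrete, here is a minimal sketch of encoding, outlier removal, splitting, scaling, and PCA using pandas and scikit-learn. It is not the project's `preprocessing_pipeline`: the function name, the 1.5 × IQR outlier rule, one-hot encoding, and the toy columns are all assumptions, and the input is assumed to have no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, target, n_components=2, test_size=0.25, seed=0):
    """Encode, drop outliers, split, scale, and project onto principal components."""
    features = df.drop(columns=[target])
    num_cols = features.select_dtypes(include='number').columns
    cat_cols = [c for c in features.columns if c not in num_cols]

    # Encode categorical features as one-hot indicator columns.
    X = pd.get_dummies(features, columns=cat_cols, dtype=float)

    # Outlier removal: keep rows within 1.5 * IQR on every numeric column.
    q1, q3 = X[num_cols].quantile(0.25), X[num_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = ((X[num_cols] >= lower) & (X[num_cols] <= upper)).all(axis=1)
    X, y = X[inside], df.loc[inside, target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)

    # Fit the scaler and PCA on the training split only, then apply to both.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=n_components).fit(scaler.transform(X_train))
    return (pca.transform(scaler.transform(X_train)),
            pca.transform(scaler.transform(X_test)),
            y_train, y_test)


# Synthetic stand-in data; column names are invented for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'url_length': rng.normal(60, 15, 200),
    'n_dots': rng.normal(3, 1, 200),
    'tld': rng.choice(['com', 'net', 'xyz'], 200),
    'label': rng.integers(0, 2, 200),
})
X_train, X_test, y_train, y_test = preprocess_sketch(toy, target='label')
print(X_train.shape, X_test.shape)
```

The sketch fits the scaler and PCA after the split so that statistics from the test set do not influence the transformation; the project's pipeline may order these steps differently.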


In [5]:
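import sys
import os

BASE_DIR = os.path.abspath('..')

# Make the project's scripts importable even if the download cell was skipped
# (assumes preprocess_phishing_data lives in scripts/ alongside download_data).
SCRIPTS_DIR = os.path.join(BASE_DIR, 'scripts')
if SCRIPTS_DIR not in sys.path:
    sys.path.append(SCRIPTS_DIR)

RAW_DATA_PATH = os.path.join(BASE_DIR, 'data/raw_data/phishing/out.csv')
# Directories for the processed train/test splits; not passed to the call below.
PROCESSED_TRAIN_DIR = os.path.join(BASE_DIR, 'data/processed_data/phishing/train')
PROCESSED_TEST_DIR = os.path.join(BASE_DIR, 'data/processed_data/phishing/test')

from preprocess_phishing_data import preprocessing_pipeline
# Run the preprocessing steps with PCA enabled, keeping 50 components.
preprocessing_pipeline(RAW_DATA_PATH, use_pca=True, n_components=50)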


Step 1: Loading data...
Data loaded successfully!
Step 2: Handling missing data...


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[column].fillna(df[column].mode()[0], inplace=True)  # Impute mode for categorical
  df[column].fillna(df[column].median(), inplace=True)  # Impute median for numerical

Missing data handled!
Step 3: Encoding categorical features...
Categorical features encoded!
Step 4: Removing outliers...
Outliers removed!
Step 5: Scaling features...
Features scaled!
Step 6: Splitting data into training and testing sets...
Data split into training and testing sets!
Step 7: Applying PCA with 50 components...
PCA applied!
Step 8: Saving processed data...
Data preprocessing complete! Data saved as flattened.
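The pandas warning in the output above is triggered by calling `fillna(..., inplace=True)` on a selected column (`df[column]`). A minimal sketch of the same mode/median imputation that assigns the result back to the column instead, as the warning message suggests, is shown below. The function name `impute_missing` is hypothetical and this is not the project's actual code.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the median (numeric columns) or the mode (all others)."""
    df = df.copy()  # leave the caller's frame untouched
    for column in df.columns:
        n_missing = df[column].isna().sum()
        # Nothing to fill, or nothing to compute a fill value from.
        if n_missing == 0 or n_missing == len(df):
            continue
        if pd.api.types.is_numeric_dtype(df[column]):
            fill_value = df[column].median()
        else:
            fill_value = df[column].mode()[0]
        # Assign back rather than using fillna(..., inplace=True) on df[column],
        # which is the chained pattern the warning flags.
        df[column] = df[column].fillna(fill_value)
    return df
```

For example, `impute_missing(pd.DataFrame({'n': [1.0, None, 3.0]}))` fills the gap with the median of the remaining values.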
