# Phishing Dataset - Download, Preprocessing, and Model Pipeline

In this notebook, we set up the **Phishing URL Detection** dataset for machine learning models. The notebook walks through the entire workflow: acquiring the dataset, preparing the data for training, and then creating and training models on it.

Let’s dive in!

## Downloading the Phishing Dataset

We will start by downloading the **Phishing URL Detection** dataset using the `Kaggle API`. The dataset will be downloaded into the `raw_data` directory.

Note: Make sure you already have the Kaggle API set up, as outlined in the project's README file.
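
As a quick sanity check before downloading, the sketch below looks for Kaggle credentials in the places the Kaggle API client normally reads them: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The function name `kaggle_credentials_found` is ours and is not part of the project or the Kaggle package; the README remains the reference for the actual setup.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None) -> bool:
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper for illustration only. `env` and `home` can be
    overridden for testing; by default the real environment is used.
    """
    env = os.environ if env is None else env

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json file in the config directory.
    base = env.get("KAGGLE_CONFIG_DIR")
    config_dir = Path(base) if base else Path(home or Path.home()) / ".kaggle"
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_found():
    print("No Kaggle credentials found; see the project's README.")
```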


In [27]:
import sys
import os
BASE_DIR = os.path.abspath('..')
RAW_DATA_DIR = os.path.join(BASE_DIR, 'data/raw_data/phishing')
sys.path.append(os.path.join(BASE_DIR, 'scripts'))

In [17]:
from download_data import download_dataset
phishing_dataset = 'sergioagudelo/phishing-url-detection'
download_dataset(phishing_dataset, RAW_DATA_DIR)

Downloading sergioagudelo/phishing-url-detection dataset...
Dataset URL: https://www.kaggle.com/datasets/sergioagudelo/phishing-url-detection
Downloaded and extracted to /home/rishupishu/Documents/HWUD/YEAR 4/SEM 1/F20DL/CW/Dubai_UG-6/data/raw_data/phishing
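
The `download_dataset` helper is imported from the project's `scripts` directory and its source is not shown in this notebook. For readers who want a rough idea of what such a helper might do, here is a minimal sketch built on the Kaggle package's `KaggleApi.dataset_download_files` call. The name `download_dataset_sketch` and the optional `api` parameter are assumptions made for illustration; the real helper may differ.

```python
import os


def download_dataset_sketch(dataset: str, dest_dir: str, api=None) -> str:
    """Download and unzip a Kaggle dataset into dest_dir.

    Hypothetical stand-in, not the project's download_dataset.
    """
    if api is None:
        # Imported lazily so this module loads even when the kaggle
        # package is not installed or configured.
        from kaggle.api.kaggle_api_extended import KaggleApi

        api = KaggleApi()
        api.authenticate()

    os.makedirs(dest_dir, exist_ok=True)
    print(f"Downloading {dataset} dataset...")
    api.dataset_download_files(dataset, path=dest_dir, unzip=True)
    print(f"Downloaded and extracted to {dest_dir}")
    return dest_dir


# Example call (requires Kaggle credentials):
# download_dataset_sketch('sergioagudelo/phishing-url-detection', RAW_DATA_DIR)
```

Passing the API client in as an argument keeps the function easy to exercise with a stub object, without touching the network.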


## Inspect the Dataset

Now that we have downloaded the dataset, let’s load it into a Pandas DataFrame and inspect the first few rows to understand the data structure. This may take a few seconds...


In [23]:
import pandas as pd

dataset = os.path.join(RAW_DATA_DIR, 'out.csv')
df = pd.read_csv(dataset)

df.head()
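
Beyond the first few rows, it often helps to look at column types and missing values before preprocessing. The sketch below shows one way to do that on a tiny stand-in DataFrame. The column names `url` and `label` are placeholders, since the real columns of `out.csv` are not listed here, and `summarize` is a hypothetical helper rather than part of the project.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })


# Toy data standing in for the real dataset (placeholder columns).
toy = pd.DataFrame({
    "url": ["http://a.example", "http://b.example", None],
    "label": [0, 1, 1],
})

print(toy.shape)
summary = summarize(toy)
print(summary)
```

The same call, `summarize(df)`, can be applied to the DataFrame loaded above.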

## Preprocessing the Data

Next, we preprocess the dataset to prepare it for machine learning models. The steps are:

1. Loading the data.
2. Handling missing values.
3. Encoding categorical features.
4. Detecting and removing outliers.
5. Normalization/standardization.
6. Flattening the data.
7. Principal Component Analysis (PCA).
8. Splitting the data into training and testing sets.
9. Saving the data as flattened and unflattened sets.
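
To make several of these steps concrete, here is a minimal sketch on toy data using pandas and scikit-learn. It is not the project's `preprocessing_pipeline`, whose internals are not shown in this notebook. The function name `preprocess_sketch`, the toy columns (`url_length`, `num_dots`, `tld`, `label`), and the specific choices (median/mode imputation, one-hot encoding, a 1.5 × IQR outlier rule) are all assumptions for illustration. Flattening and saving are left out because the notebook does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, label_col, n_components=2, test_size=0.25, seed=0):
    """Illustrative only; not the project's preprocessing_pipeline."""
    df = df.copy()

    # Missing values: median for numeric columns, most frequent value otherwise.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    y = df.pop(label_col)
    numeric_cols = df.select_dtypes(include="number").columns

    # Outliers: keep rows within 1.5 * IQR on every numeric feature.
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    inside = (df[numeric_cols] >= q1 - 1.5 * iqr) & (df[numeric_cols] <= q3 + 1.5 * iqr)
    keep = inside.all(axis=1)
    df, y = df[keep], y[keep]

    # Encoding: one-hot encode the non-numeric feature columns.
    X = pd.get_dummies(df, dtype=float)

    # Train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    # Standardization and PCA, both fitted on the training rows only.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=n_components, random_state=seed)
    X_train_p = pca.fit_transform(scaler.transform(X_train))
    X_test_p = pca.transform(scaler.transform(X_test))
    return X_train_p, X_test_p, y_train, y_test


# Toy data with one extreme value and one missing value (placeholder columns).
rng = np.random.default_rng(0)
n_rows = 40
toy = pd.DataFrame({
    "url_length": rng.normal(50, 10, n_rows),
    "num_dots": rng.integers(1, 6, n_rows).astype(float),
    "tld": rng.choice(["com", "net", "org"], n_rows),
    "label": rng.integers(0, 2, n_rows),
})
toy.loc[0, "url_length"] = 10_000.0
toy.loc[1, "num_dots"] = np.nan

X_tr, X_te, y_tr, y_te = preprocess_sketch(toy, "label", n_components=2)
print(X_tr.shape, X_te.shape)
```

One design choice in this sketch: the split happens before the scaler and PCA are fitted, so the test rows do not influence the fitted statistics. The order in the list above places the split later; either ordering can be implemented, but fitting on the training rows only is the more conservative option.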


### Load the data

In this step, we will load the phishing dataset from the specified file path using pandas. The dataset contains information on URLs and various features that will be used for phishing detection.

In [26]:
from preprocess_phishing_data import preprocessing_pipeline
# Reuse RAW_DATA_DIR so the path does not depend on the notebook's working directory
RAW_DATA_PATH = os.path.join(RAW_DATA_DIR, 'out.csv')
preprocessing_pipeline(RAW_DATA_PATH, use_pca=True, n_components=50)


NameError: name 'preprocessing_pipeline' is not defined