<a href="https://colab.research.google.com/github/neal-logan/dsba6211-summer2024/blob/main/nophishing/02_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#No Phishing: Detecting Malicious URLs

#### DSBA 5122 - Summer 2024 - Neal Logan

## Introduction

#### Problem

Malicious URLs are a common component of phishing attacks.  They are sometimes used to exploit technical vulnerabilities, executing malicious code  automatically when the message is presented to the target, when the target interacts with the message, or when the target follows the link to the malicious URL.  However, attacks relying primarily on social engineering present a more difficult challenge, for example by leading phishing targets to sites that appear entirely legitimate even to relatively vigilant and sophisticated internet users.  Detecting these malicious URLs provides us with several opportunities for defense, and is an important part of engineering secure systems.

#### Model Applications

Malicious URL detection can be used in several ways.  It can be used to provide suspected-phishing warnings to users, particularly in web browsers or email/messaging systems.  It can also be used by organizations, which can warn or block users from visiting suspected phishing sites.  And finally, malicious URL detection can be used by to identify or refer malicious sites for takedowns or to help direct law enforcement efforts against the threat actors behind the malicious sites.

#### Dataset

I will use a dataset provided by HuggingFace user pirocheto:

https://huggingface.co/datasets/pirocheto/phishing-url


## Literature Review

I asked ChatGPT to find relevant projects which included code.  After reviewing them and summarizing their contents, I included the most relevant three projects below.


#### Phishing URL Detection by Pirocheto
https://github.com/pirocheto/phishing-url-detection

This repository contains a complete project for phishing URL detection using machine learning and MLOps practices. It uses a TF-IDF vectorizer using both character and word n-grams) with a linear SVM model. The code is designed to be lightweight and fast, suitable for embedding in applications, and can work offline, without an internet connection. The repository also includes instructions for reproducing the model and running the pipeline.

This project is relevant both for the subject matter and because of its relation to the dataset I'm using.

#### PhishShield by Praneeth Katuri
https://github.com/praneeth-katuri/PhishShield

This GitHub repository provides a comprehensive solution for detecting phishing websites using analytical models and custom transformers for preprocessing. It includes feature-based and text-based models, including random forest, LGBM, SVC, logistic regression, and Multinomial Naive Bayes, and takes advantage of grid-search with cross-validation. The repository also offers Flask deployment for real-time URL prediction and caching for performance improvement.

This project is relevant because it explores and compares a variety of techniques for detecting malicious URLs.

#### Phishing Link Detection by Sayan Maity
https://github.com/Sayan-Maity-Code/Phishing-link-detection

This project uses Multinomial Naive Bayes and Logistic Regression to detect malicious URLs. The model's preprocessing involves tokenization and TF-IDF vectorization. The project includes scripts for training and evaluating the model.

This project is relevant mainly in that it provides an additoinal perspective on the topic.



## Data Preparation

The dataset as posted is already largely cleaned and partly preprocessed. In addition to the raw URL and binary phishing label, it contains 87 features, including:

* 56 from URL syntax and structure,
* 24 from page content, and
* 7 from external services.

#### Cleaning

Little to no cleaning is necessary.  However, some variables do include apparently-invalid values, for example TODO


#### Rejected Features

I will drop the feature containing the raw URL.  The classification modeling techniques I will use can't make use of the raw URLs directly, and I don't intend to generate new features from it yet.


#### Feature Engineering

TODO



#### Preprocessing Pipeline

TODO Expound


I will perform some limited additional preprocessing in my model pipelines including scaling, bucketing, and dimensionality reduction, on a feature-by-feature basis following a more thorough analysis.

The scaling/bucketing steps for many features will consist only of the application of standard scaling and ordinal bucketing. However, features like domain_registration_length or web_traffic likely need to be log-scaled or percentile-bucketed. Other features, like page_rank, may not benefit from scaling or bucketing, and won't be transformed.

 I will also transform the target variable to binary.   





### Data Preparation

In [1]:
# Load and prepare training data
import pandas as pd

train_url = 'https://raw.githubusercontent.com/neal-logan/dsba6211-summer2024/main/nophishing/data/phishing-url-pirochet-train.csv'
df = pd.read_csv(train_url)

#Create numeric target variable column
df['y'] = df['status'].replace('legitimate', 0).replace('phishing', 1)

#Drop unnecessary columns
df = df.drop(columns=['status','url'])

#X/y split
X = df.drop(columns=['y'])
y = df['y']

In [2]:
#Split training set into training and validation set (test set not yet loaded)

from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 42)

Xy_train = X_train.copy()
Xy_train['y'] = y_train

### Data Cleaning

### Preliminary Modeling

#### Baseline Model Performance

TODO quick overview/table of results establishing baseline for model performance

Note that performance of RF and GBT on the validation set was strong enough that further development may not make sense from the perspective of precision/recall/AUC metrics.


In [4]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
import pandas as pd


# Define model evaluation function

def print_model_evaluation(
    title: str,
    pipe : Pipeline,
    X : pd.DataFrame,
    y : pd.DataFrame):

    print(title)
    pred_y = pipe.predict(X)
    print(confusion_matrix(pred_y, y))
    print("\nArea under ROC curve: " + str(roc_auc_score(pred_y, y)))
    print("\nPrecision: " + str(precision_score(pred_y, y)))
    print("\nRecall: " + str(recall_score(pred_y, y)))


#### Logistic Regression

In [5]:
#Set up pipeline

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(
      StandardScaler(),
      LogisticRegression(random_state=42)
)

pipe_lr.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [6]:
print_model_evaluation("Logistic Regression\nPerformance on Training Set",
                       pipe_lr, X_train, y_train)

print_model_evaluation("Logistic Regression\nPerformance on Validation Set",
                       pipe_lr, X_validation, y_validation)

Logistic Regression
Performance on Training Set
[[2934  158]
 [ 142 2892]]

Area under ROC curve: 0.9510487438184406

Precision: 0.9481967213114754

Recall: 0.9531970995385629
Logistic Regression
Performance on Validation Set
[[711  44]
 [ 42 735]]

Area under ROC curve: 0.9438339001252909

Precision: 0.9435173299101413

Recall: 0.9459459459459459


#### Random Forest

In [7]:
#Set up & run pipeline - random forest

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipe_rf = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(random_state=42)
)

pipe_rf.fit(X_train, y_train)


In [8]:
print_model_evaluation("Random Forest\nPerformance on Training Set",
                       pipe_rf, X_train, y_train)

print_model_evaluation("Random Forest\nPerformance on Validation Set",
                       pipe_rf, X_validation, y_validation)

Random Forest
Performance on Training Set
[[3076    0]
 [   0 3050]]

Area under ROC curve: 1.0

Precision: 1.0

Recall: 1.0
Random Forest
Performance on Validation Set
[[729  31]
 [ 24 748]]

Area under ROC curve: 0.9640612217071176

Precision: 0.9602053915275995

Recall: 0.9689119170984456


#### Gradient-boosted Trees

In [9]:
# Set up and run pipeline - gradient boosted trees

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline

pipe_gbt = make_pipeline(
      StandardScaler(),
      HistGradientBoostingClassifier(random_state=42)
)

pipe_gbt.fit(X_train, y_train)

In [10]:
print_model_evaluation("Gradient-boosted Trees\nPerformance on Training Set",
                       pipe_gbt, X_train, y_train)

print_model_evaluation("Gradient-boosted Trees\nPerformance on Validation Set",
                       pipe_gbt, X_validation, y_validation)

Gradient-boosted Trees
Performance on Training Set
[[3076    0]
 [   0 3050]]

Area under ROC curve: 1.0

Precision: 1.0

Recall: 1.0
Gradient-boosted Trees
Performance on Validation Set
[[728  26]
 [ 25 753]]

Area under ROC curve: 0.9666917826433827

Precision: 0.9666238767650834

Recall: 0.967866323907455
