## Imports

In [1]:
!pip install ucimlrepo --quiet

In [2]:
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np




## Load Data

Dataset *PhiUSIIL Phishing URL (Website)* from UCI Machine Learning Repository, available under **Creative Commons Attribution 4.0 International**.

* Link: [PhiUSIIL Phishing URL](https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset)
* Credits: Prasad, A. & Chandra, S. (2024). PhiUSIIL Phishing URL (Website). UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.

In [3]:
# fetch dataset
phishing_url = fetch_ucirepo(id=967)

# data (as pandas dataframes)
X = phishing_url.data.features
y = phishing_url.data.targets

Looking at one observation of the data, we can see a few categorical variables, such as `TLD`, which we cast to the `category` dtype.

In [4]:
# View one observation
df = pd.concat([X, y], axis=1)
df['TLD'] = df.TLD.astype('category')
df.sample(1).T

| | 112239 |
|---|---|
| URL | http://www.mediadll.tk |
| URLLength | 21 |
| Domain | www.mediadll.tk |
| DomainLength | 15 |
| IsDomainIP | 0 |
| TLD | tk |
| URLSimilarityIndex | 73.553719 |
| CharContinuationRate | 1.0 |
| TLDLegitimateProb | 0.000386 |
| URLCharProb | 0.05524 |


Let's look at the dataset information to check for data types and missing values.

In [5]:
# info check for missing values and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235795 entries, 0 to 235794
Data columns (total 55 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   URL                         235795 non-null  object  
 1   URLLength                   235795 non-null  int64   
 2   Domain                      235795 non-null  object  
 3   DomainLength                235795 non-null  int64   
 4   IsDomainIP                  235795 non-null  int64   
 5   TLD                         235795 non-null  category
 6   URLSimilarityIndex          235795 non-null  float64 
 7   CharContinuationRate        235795 non-null  float64 
 8   TLDLegitimateProb           235795 non-null  float64 
 9   URLCharProb                 235795 non-null  float64 
 10  TLDLength                   235795 non-null  int64   
 11  NoOfSubDomain               235795 non-null  int64   
 12  HasObfuscation              235795 non-null  int64   
 13 
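
Because `df.info()` prints one row per column, a compact summary can be easier to scan on a wide frame. The sketch below is only an illustration: it uses a small made-up frame (`toy`, with hypothetical values) rather than the real dataset, and the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "URLLength": [21, 34, 18],
    "TLD": pd.Categorical(["tk", "com", "org"]),
    "URLCharProb": [0.055, np.nan, 0.061],
})

# Number of missing values per column, keeping only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)

# How many columns there are of each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```

An empty `missing` series means no column has missing values, which is what the `Non-Null Count` column of `df.info()` reports for the rows shown above.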

Next, we will train a LightGBM model. According to its documentation, LightGBM handles categorical features natively, so one-hot encoding is not needed.
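
To see what skipping one-hot encoding saves, the sketch below compares the two representations on a made-up `TLD` column (the values are illustrative, not taken from the dataset). One-hot encoding creates one column per distinct value, while the `category` dtype keeps a single column backed by integer codes.

```python
import pandas as pd

# Hypothetical TLD values, for illustration only
toy = pd.DataFrame({"TLD": ["com", "tk", "org", "com", "net", "tk"]})

# One-hot encoding: one indicator column per distinct TLD
one_hot = pd.get_dummies(toy["TLD"], prefix="TLD")
print(one_hot.shape)  # (6, 4)

# Category dtype: still one column, stored as integer codes plus a lookup table
as_cat = toy["TLD"].astype("category")
print(as_cat.cat.categories.tolist())  # ['com', 'net', 'org', 'tk']
print(as_cat.cat.codes.tolist())       # [0, 3, 2, 0, 1, 3]
```

On a high-cardinality column, the one-hot frame grows by one column per distinct value, which is what native categorical handling avoids.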

In [25]:
df['label'].value_counts(normalize=True)

| label | proportion |
|---|---|
| 1 | 0.571895 |
| 0 | 0.428105 |


In [6]:
# Features (X) and target (y)
X = df.drop(['URL', 'Domain', 'Title', 'label'], axis=1)
y = df['label']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Train LightGBM (classes are roughly balanced, so no reweighting is applied)
# TLD is declared categorical in the Dataset constructor, as LightGBM recommends
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=['TLD'])
params = {
    'force_col_wise': True,
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'is_unbalance': False  # Set to True to reweight classes when they are imbalanced
}

# Fit model
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))


[LightGBM] [Info] Number of positive: 107815, number of negative: 80821
[LightGBM] [Info] Total Bins 5135
[LightGBM] [Info] Number of data points in the train set: 188636, number of used features: 51
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.571550 -> initscore=0.288180
[LightGBM] [Info] Start training from score 0.288180
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     20124
           1       1.00      1.00      1.00     27035

    accuracy                           1.00     47159
   macro avg       1.00      1.00      1.00     47159
weighted avg       1.00      1.00      1.00     47159



In [34]:
from sklearn.feature_selection import mutual_info_classif

# Calculate mutual information on all features except the categorical TLD column
X_num = X.drop('TLD', axis=1)
mi_scores = mutual_info_classif(X_num, y)  # Use mutual_info_regression for regression tasks
mi_df = pd.DataFrame({'Feature': X_num.columns, 'Mutual_Information': mi_scores})
mi_df = mi_df.sort_values(by='Mutual_Information', ascending=False)

print(mi_df)


                       Feature  Mutual_Information
3           URLSimilarityIndex            0.680634
22                  LineOfCode            0.601134
49             NoOfExternalRef            0.561811
44                   NoOfImage            0.543306
47                 NoOfSelfRef            0.527752
46                      NoOfJS            0.500222
23           LargestLineLength            0.488082
45                     NoOfCSS            0.446420
36                HasSocialNet            0.417297
13            LetterRatioInURL            0.381839
43            HasCopyrightInfo            0.351967
32              HasDescription            0.307276
21                     IsHTTPS            0.256180
19  NoOfOtherSpecialCharsInURL            0.242558
25       DomainTitleMatchScore            0.218950
37             HasSubmitButton            0.208576
20       SpacialCharRatioInURL            0.204529
5            TLDLegitimateProb            0.196610
26          URLTitleMatchScore 

In [10]:
# Selected columns
cols = ['TLD','LineOfCode','Pay', 'Robots', 'Bank', 'IsDomainIP']

# X & Y
X = df[cols]
y = df['label']

# Split Train and Validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Train LightGBM (classes are roughly balanced, so no reweighting is applied)
# TLD is declared categorical in the Dataset constructor, as LightGBM recommends
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=['TLD'])
params = {
    'force_col_wise': True,
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 1,
    'is_unbalance': False  # Set to True to reweight classes when they are imbalanced
}

# Fit model
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))


[LightGBM] [Info] Number of positive: 107815, number of negative: 80821
[LightGBM] [Info] Total Bins 518
[LightGBM] [Info] Number of data points in the train set: 188636, number of used features: 6
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.571550 -> initscore=0.288180
[LightGBM] [Info] Start training from score 0.288180
              precision    recall  f1-score   support

           0       0.97      0.96      0.97     20124
           1       0.97      0.98      0.97     27035

    accuracy                           0.97     47159
   macro avg       0.97      0.97      0.97     47159
weighted avg       0.97      0.97      0.97     47159



In [58]:
pd.DataFrame({
    'URL': df.loc[y_test.index, 'URL'],
    'Prediction': y_pred,
    'Label': y_test
}).query('Prediction != Label')

| | URL | Prediction | Label |
|---|---|---|---|
| 199191 | https://novoitaufatura.com/consulte-sua-fatura... | 1 | 0 |
| 170784 | https://www.goldreserveinc.com | 0 | 1 |
| 73382 | https://www.innovativestate.com | 0 | 1 |
| 105859 | https://www.mobile-gigs.com | 0 | 1 |
| 101385 | https://www.entretantos.org | 0 | 1 |
| ... | ... | ... | ... |
| 156465 | https://www.jquery-plugins.net | 0 | 1 |
| 33675 | https://www.canoesmarshallislands.com | 0 | 1 |
| 32031 | https://www.alexisdiack.com | 0 | 1 |
| 19690 | https://bakry-gala.com/bakery/bakev2/gala | 1 | 0 |


## Comparison

In [60]:
%%timeit

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM with imbalance handling
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'force_col_wise': True,
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'is_unbalance': True  # Handle class imbalance
}
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

[LightGBM] [Info] Number of positive: 399986, number of negative: 400014
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 800000, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499982 -> initscore=-0.000070
[LightGBM] [Info] Start training from score -0.000070
              precision    recall  f1-score   support

           0       0.98      0.98      0.98     99942
           1       0.98      0.98      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000


In [22]:
import datetime

# Import gradient boosting from sklearn
from sklearn.ensemble import GradientBoostingClassifier

In [23]:
# initial time
begin = datetime.datetime.now()

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42)
model2.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model2.predict(X_test)
print(classification_report(y_test, y_pred))
print('-----\n')

# end time
end = datetime.datetime.now()
print(end-begin)

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     99942
           1       0.99      0.97      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000

-----

0:15:18.421318


## Kolmogorov-Smirnov Test (KS2)
Another way to choose the variables that best separate the classes in a classification problem is the two-sample Kolmogorov-Smirnov (KS) test.

* **The function written next** uses this test to compare the empirical cumulative distributions of a variable in the two classes. The higher the KS statistic, the better the variable separates the classes.
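
As a small illustration of what the statistic measures, the sketch below draws synthetic samples (hypothetical data, not from the phishing dataset), computes the largest gap between two empirical CDFs by hand, and compares it with `scipy.stats.ks_2samp`. The helper name `max_ecdf_gap` is made up for this example.

```python
import numpy as np
from scipy.stats import ks_2samp

def max_ecdf_gap(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed values
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(42)
class_0 = rng.normal(loc=0.0, scale=1.0, size=500)
similar = rng.normal(loc=0.1, scale=1.0, size=500)    # nearly the same distribution
separated = rng.normal(loc=3.0, scale=1.0, size=500)  # clearly shifted

ks_similar = ks_2samp(class_0, similar).statistic
ks_separated = ks_2samp(class_0, separated).statistic

print(f"similar:   manual={max_ecdf_gap(class_0, similar):.3f}  scipy={ks_similar:.3f}")
print(f"separated: manual={max_ecdf_gap(class_0, separated):.3f}  scipy={ks_separated:.3f}")
```

The statistic lies between 0 and 1: values near 0 mean the two class distributions overlap almost completely, and values near 1 mean they barely overlap.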

In [19]:
from scipy.stats import ks_2samp # KS2

def ks_test(df, target):
  '''Function to compare the distributions of a variable grouped by the target class'''

  # Get numerical Explanatory column names from the training dataset
  cols = (df
          .drop(target, axis=1)
          .select_dtypes(include=['number'])
          .columns
          .tolist()
          )

  # Creating lists to hold the values
  ks_stats = []
  ks_p = []

  # Loop through columns to test the separability of classes
  for col in cols:
    group_0 = df.loc[df[target] == 0, col].dropna()
    group_1 = df.loc[df[target] == 1, col].dropna()
    ks_stat, p_value = ks_2samp(group_0, group_1)
    ks_stats.append(ks_stat)
    ks_p.append(p_value)

  # Creating a dataframe
  df_ks = pd.DataFrame({
      "Variable": cols,
      "KS_Value": ks_stats,
      "P_Value": np.round(ks_p,4)
      })

  df_ks = df_ks.sort_values(by='KS_Value', ascending=False)

  return df_ks

In [22]:
ks_test(df, 'label')

| | Variable | KS_Value | P_Value |
|---|---|---|---|
| 3 | URLSimilarityIndex | 0.992214 | 0.0 |
| 49 | NoOfExternalRef | 0.925315 | 0.0 |
| 22 | LineOfCode | 0.910212 | 0.0 |
| 47 | NoOfSelfRef | 0.906914 | 0.0 |
| 44 | NoOfImage | 0.880099 | 0.0 |
| 46 | NoOfJS | 0.858282 | 0.0 |
| 36 | HasSocialNet | 0.789495 | 0.0 |
| 45 | NoOfCSS | 0.783541 | 0.0 |
| 43 | HasCopyrightInfo | 0.750901 | 0.0 |
| 32 | HasDescription | 0.692471 | 0.0 |
