# Detect Phishing URLs
### Capstone 3 - Preprocessing and Modeling
Michael Garber

#### High-Level Steps
1. Preprocessing
    1. Create dummy/indicator features for categorical variables
    2. Standardize/scale numeric features
    3. Train/Test Split 
2. Modeling
    1. Fit your models with a training dataset
    2. Review model outcomes — Iterate over additional models as needed.
    3. Identify the final model that you think is the best model for this project

In [4]:
# Import Libraries
import pandas as pd
import os
import numpy as np
from sklearn.preprocessing import MinMaxScaler, TargetEncoder     # requires scikit-learn 1.3 or greater
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, precision_recall_curve, auc
import keras
from keras import layers
from scikeras.wrappers import KerasClassifier
import seaborn as sns
import matplotlib.pyplot as plt

#### Preprocessing

In [6]:
# Import Data set
dataDir = os.path.join('../data/interim/urlData_raw.csv')
urlData = pd.read_csv(dataDir)

  urlData = pd.read_csv(dataDir)


In [7]:
# add new useful feature: URL length
urlData['url_Length'] = urlData['url'].apply(len)

In [8]:
# Data Info
urlData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450175 entries, 0 to 450174
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        450175 non-null  int64 
 1   key_0             445854 non-null  object
 2   url               450175 non-null  object
 3   type              450175 non-null  object
 4   parsedUrl         450175 non-null  object
 5   urlPart_scheme    450175 non-null  object
 6   subDomain         379885 non-null  object
 7   domain            450167 non-null  object
 8   tld               445854 non-null  object
 9   urlPart_path      444917 non-null  object
 10  urlPart_query     65541 non-null   object
 11  urlPart_fragment  359 non-null     object
 12  tld_join          445854 non-null  object
 13  Domain            445451 non-null  object
 14  Type              445451 non-null  object
 15  TLD Manager       445451 non-null  object
 16  isIPaddress       450175 non-null  boo

**Additional data cleaning** - identify features that have **missing values** (that we ares till planning on keeping)
- subDomain
- domain
- tld
- urlPart_path
- urlPart_query
- urlPart_fragment
- Type
- TLD Manager    

In [10]:
# Additional data cleaning - set missing values to zero
values = {
    'subDomain': '0',
    'domain': '0',
    'tld': '0',
    'urlPart_path': '0',
    'urlPart_query': '0',
    'urlPart_fragment': '0',
    'Type': '0',
    'TLD Manager': '0'}

urlData = urlData.fillna(value=values)

In [11]:
# Data Info - checking NULLs again (key and tld will be dropped)
urlData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450175 entries, 0 to 450174
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        450175 non-null  int64 
 1   key_0             445854 non-null  object
 2   url               450175 non-null  object
 3   type              450175 non-null  object
 4   parsedUrl         450175 non-null  object
 5   urlPart_scheme    450175 non-null  object
 6   subDomain         450175 non-null  object
 7   domain            450175 non-null  object
 8   tld               450175 non-null  object
 9   urlPart_path      450175 non-null  object
 10  urlPart_query     450175 non-null  object
 11  urlPart_fragment  450175 non-null  object
 12  tld_join          445854 non-null  object
 13  Domain            445451 non-null  object
 14  Type              450175 non-null  object
 15  TLD Manager       450175 non-null  object
 16  isIPaddress       450175 non-null  boo

In [12]:
urlData.head()

Unnamed: 0.1,Unnamed: 0,key_0,url,type,parsedUrl,urlPart_scheme,subDomain,domain,tld,urlPart_path,urlPart_query,urlPart_fragment,tld_join,Domain,Type,TLD Manager,isIPaddress,isPhish_bool,url_Length
0,0,com,https://www.google.com,legitimate,"ParseResult(scheme='https', netloc='www.google...",https,www,google,com,0,0,0,com,.com,generic,VeriSign Global Registry Services,False,False,22
1,1,com,https://www.youtube.com,legitimate,"ParseResult(scheme='https', netloc='www.youtub...",https,www,youtube,com,0,0,0,com,.com,generic,VeriSign Global Registry Services,False,False,23
2,2,com,https://www.facebook.com,legitimate,"ParseResult(scheme='https', netloc='www.facebo...",https,www,facebook,com,0,0,0,com,.com,generic,VeriSign Global Registry Services,False,False,24
3,3,com,https://www.baidu.com,legitimate,"ParseResult(scheme='https', netloc='www.baidu....",https,www,baidu,com,0,0,0,com,.com,generic,VeriSign Global Registry Services,False,False,21
4,4,org,https://www.wikipedia.org,legitimate,"ParseResult(scheme='https', netloc='www.wikipe...",https,www,wikipedia,org,0,0,0,org,.org,generic,Public Interest Registry (PIR),False,False,25


In [13]:
urlData.columns[:]

Index(['Unnamed: 0', 'key_0', 'url', 'type', 'parsedUrl', 'urlPart_scheme',
       'subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query',
       'urlPart_fragment', 'tld_join', 'Domain', 'Type', 'TLD Manager',
       'isIPaddress', 'isPhish_bool', 'url_Length'],
      dtype='object')

**Value Counts - urlPart_scheme**

In [15]:
#urlData[['url', 'urlPart_scheme', 'subDomain', 'tld', 'domain', 'type', 'TLD Manager', 'isIPaddress', 'isPhish_bool']]
pd.DataFrame(urlData['urlPart_scheme'].value_counts())

Unnamed: 0_level_0,count
urlPart_scheme,Unnamed: 1_level_1
https,352185
http,97947
httpss,35
ftp,8


**Value Counts - subDomain**

In [17]:
pd.DataFrame(urlData['subDomain'].value_counts())

Unnamed: 0_level_0,count
subDomain,Unnamed: 1_level_1
www,276100
0,70290
www.en,13626
www.music,1289
www.people,1228
...,...
www.ohv.parks,1
www.ohtheplaceswewillgo-books,1
www.ohr,1
www.ohomen171s-journey-through-life,1


**Value Counts - TLD Manager**

In [19]:
urlData['TLD Manager'].value_counts()

TLD Manager
VeriSign Global Registry Services                                                                               333004
Public Interest Registry (PIR)                                                                                   38393
Canadian Internet Registration Authority (CIRA) Autorité Canadienne pour les enregistrements Internet (ACEI)     10086
EDUCAUSE                                                                                                          6976
Nominet UK                                                                                                        5997
                                                                                                                 ...  
AS Domain Registry                                                                                                   1
University of Swaziland Department of Computer Science                                                               1
Dot London Domains Limited          

**Determine how to handle categorical features**

In [21]:
# Let's see cardinality / # of uniques for each feature - use to determine categorical fields to dummy and how to encode - one-hot vs label
urlData[['key_0', 'url', 'type', 'parsedUrl', 'urlPart_scheme',
       'subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query',
       'urlPart_fragment', 'tld_join', 'Domain', 'Type', 'TLD Manager',
       'isIPaddress', 'isPhish_bool']].describe()

Unnamed: 0,key_0,url,type,parsedUrl,urlPart_scheme,subDomain,domain,tld,urlPart_path,urlPart_query,urlPart_fragment,tld_join,Domain,Type,TLD Manager,isIPaddress,isPhish_bool
count,445854,450175,450175,450175,450175,450175,450175,450175,450175,450175,450175,445854,445451,450175,450175,450175,450175
unique,415,450175,2,450132,4,32041,130747,832,317144,55325,72,415,360,5,260,2,2
top,com,https://www.google.com,legitimate,"ParseResult(scheme='http', netloc='new.sosnovs...",https,www,wikipedia,com,/,0,0,com,.com,generic,VeriSign Global Registry Services,False,False
freq,316414,1,345738,2,352185,276100,12895,316414,55253,384636,449816,316414,316414,376803,333004,447309,345738


> - Will use **mean encoding** for the **high cardinality columns** (e.x. domain) as using one-hot would create too many columns and with sparse data
> - Will **mean encode after train/test splitting** to **avoid data leakage**

In [23]:
# Select/Drop features - create new dataframe for this major data change
urlDataV2 = urlData.drop(['Unnamed: 0', 'key_0', 'url', 'type', 'parsedUrl', 'tld_join', 'Domain'], axis=1)

In [24]:
# View new DF
urlDataV2.head()

Unnamed: 0,urlPart_scheme,subDomain,domain,tld,urlPart_path,urlPart_query,urlPart_fragment,Type,TLD Manager,isIPaddress,isPhish_bool,url_Length
0,https,www,google,com,0,0,0,generic,VeriSign Global Registry Services,False,False,22
1,https,www,youtube,com,0,0,0,generic,VeriSign Global Registry Services,False,False,23
2,https,www,facebook,com,0,0,0,generic,VeriSign Global Registry Services,False,False,24
3,https,www,baidu,com,0,0,0,generic,VeriSign Global Registry Services,False,False,21
4,https,www,wikipedia,org,0,0,0,generic,Public Interest Registry (PIR),False,False,25


**Features to One-Hot encode**
- 'urlPart_scheme'
- 'Type'
- 'isIPaddress'
- 'isPhish_bool'

*...because they are lower cardinality*

In [26]:
# Let's rename "type" to a more descriptive name before encoding
urlDataV2 = urlDataV2.rename(columns={'Type':'TLD_type'})

##### Create dummies \ one-hot encode

In [28]:
# One-Hot encode features
urlDataV2 = pd.get_dummies(urlDataV2, columns=['urlPart_scheme', 'TLD_type'])

In [29]:
# view new DF
urlDataV2.head()

Unnamed: 0,subDomain,domain,tld,urlPart_path,urlPart_query,urlPart_fragment,TLD Manager,isIPaddress,isPhish_bool,url_Length,urlPart_scheme_ftp,urlPart_scheme_http,urlPart_scheme_https,urlPart_scheme_httpss,TLD_type_0,TLD_type_country-code,TLD_type_generic,TLD_type_generic-restricted,TLD_type_sponsored
0,www,google,com,0,0,0,VeriSign Global Registry Services,False,False,22,False,False,True,False,False,False,True,False,False
1,www,youtube,com,0,0,0,VeriSign Global Registry Services,False,False,23,False,False,True,False,False,False,True,False,False
2,www,facebook,com,0,0,0,VeriSign Global Registry Services,False,False,24,False,False,True,False,False,False,True,False,False
3,www,baidu,com,0,0,0,VeriSign Global Registry Services,False,False,21,False,False,True,False,False,False,True,False,False
4,www,wikipedia,org,0,0,0,Public Interest Registry (PIR),False,False,25,False,False,True,False,False,False,True,False,False


*Note: Due to use of **Target encoding**, we will **preprocess** data in a **modified order**...

    1. train/test split data
    2. target encode data
    3. Scale data

In [31]:
# one more peek at columns and object types
urlDataV2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450175 entries, 0 to 450174
Data columns (total 19 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   subDomain                    450175 non-null  object
 1   domain                       450175 non-null  object
 2   tld                          450175 non-null  object
 3   urlPart_path                 450175 non-null  object
 4   urlPart_query                450175 non-null  object
 5   urlPart_fragment             450175 non-null  object
 6   TLD Manager                  450175 non-null  object
 7   isIPaddress                  450175 non-null  bool  
 8   isPhish_bool                 450175 non-null  bool  
 9   url_Length                   450175 non-null  int64 
 10  urlPart_scheme_ftp           450175 non-null  bool  
 11  urlPart_scheme_http          450175 non-null  bool  
 12  urlPart_scheme_https         450175 non-null  bool  
 13  urlPart_scheme

##### EDA: Feature correlation

In [33]:
# feature correlations (excluding non-numerics)
pd.DataFrame(urlDataV2.corrwith(other=urlDataV2['isPhish_bool'], numeric_only=True).sort_values(ascending=False))

Unnamed: 0,0
isPhish_bool,1.0
urlPart_scheme_http,0.959467
TLD_type_country-code,0.338015
TLD_type_0,0.175337
isIPaddress,0.14564
url_Length,0.085058
TLD_type_generic-restricted,0.070401
urlPart_scheme_ftp,0.006422
urlPart_scheme_httpss,-0.004846
TLD_type_sponsored,-0.077846


> **Top feature correlations to target**
> - urlPart_scheme_http**s** &emsp;&emsp;(~-96%)
> - urlPart_scheme_http &emsp;&emsp; (~+96%) 

Woah! The **url scheme** (specifically the use of **'http'** or **'https'**) is *highly* predictive of the target in this data set.


In [35]:
pd.DataFrame(urlDataV2[['urlPart_scheme_ftp', 'urlPart_scheme_http', 'urlPart_scheme_https', 'urlPart_scheme_httpss', 'isPhish_bool']].value_counts())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,count
urlPart_scheme_ftp,urlPart_scheme_http,urlPart_scheme_https,urlPart_scheme_httpss,isPhish_bool,Unnamed: 5_level_1
False,False,True,False,False,345702
False,True,False,False,True,97947
False,False,True,False,True,6483
False,False,False,True,False,35
True,False,False,False,True,7
True,False,False,False,False,1


> When the scheme is 'https', the URL **is** phishing     - 345702 times \
> When the scheme is 'http',  the URL is **not** phishing - 97947  times \
> URL scheme of 'http' or 'https' predicts the target value in over **98%** of rows in this data set!

In [37]:
# Calculate percentage of URLs that can be accurately classified solely via their urlPart_scheme*
#      https    http     total record count
print((345702 + 97947) / 450175.0)

0.9855034153384795


##### Train/Test Split

In [39]:
# assign X & y
X = urlDataV2.drop(columns=['isPhish_bool'], axis=1).copy()
y = urlDataV2['isPhish_bool'].copy()

# perform train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

##### Target/Mean Encode
> https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder

In [41]:
# target encode (using auto-smoothing and cross-fitting)
targetEncoderAuto = TargetEncoder(smooth='auto')
colsToEncode = ['subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query', 'urlPart_fragment', 'TLD Manager']

X_train_encoded = X_train.copy()
X_test_encoded  = X_test.copy()
X_train_encoded[colsToEncode] = targetEncoderAuto.fit_transform(X_train[colsToEncode], y_train)
X_test_encoded[colsToEncode]  = targetEncoderAuto.transform(X_test[colsToEncode])

In [42]:
# Check encoding results
pd.DataFrame(X_train_encoded).head()

Unnamed: 0,subDomain,domain,tld,urlPart_path,urlPart_query,urlPart_fragment,TLD Manager,isIPaddress,url_Length,urlPart_scheme_ftp,urlPart_scheme_http,urlPart_scheme_https,urlPart_scheme_httpss,TLD_type_0,TLD_type_country-code,TLD_type_generic,TLD_type_generic-restricted,TLD_type_sponsored
109475,0.047042,0.0,0.091356,0.231991,0.22056,0.231375,0.091356,False,62,False,False,True,False,False,False,True,False,False
360352,0.995309,1.0,1.0,0.231991,0.22056,0.231375,0.953371,False,109,False,True,False,False,False,True,False,False,False
133254,0.047303,0.0,0.161406,0.231995,0.220784,0.23137,0.169295,False,68,False,False,True,False,False,False,True,False,False
175455,0.047303,0.0,0.008247,0.109994,0.0,0.23137,0.069466,False,36,False,False,True,False,False,True,False,False,False
392895,0.231995,0.0,0.798733,0.231995,0.220707,0.231336,0.560678,False,46,False,True,False,False,False,False,False,True,False


> Encoding on columns 1-7 ('subDomain' to 'TLD Manager') looks good

##### Standardize and Scale

In [45]:
# MinMax Scale the data
scaler = MinMaxScaler()
scaler.fit(X_train_encoded)

X_train_preprocessed = scaler.transform(X_train_encoded)
X_test_preprocessed  = scaler.transform(X_test_encoded)

In [46]:
# Check scaling results
X_test_pp_DF = pd.DataFrame(X_test_preprocessed)  # make DF version X_test_preprocessed
X_test_pp_DF.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.0,0.0,0.003428,0.231993,0.220613,2.5e-05,0.003428,0.0,0.014129,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.047212,0.0,0.091958,0.231993,0.220613,2.5e-05,0.091958,0.0,0.028781,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.047212,0.231993,0.161421,0.109466,0.220613,2.5e-05,0.16926,0.0,0.008373,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.9952,0.231993,0.609738,0.231993,0.220613,2.5e-05,0.626788,0.0,0.010466,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.047212,0.231993,0.319154,0.231993,0.220613,2.5e-05,0.16926,0.0,0.060701,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [47]:
X_test_pp_DF.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
count,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0,112544.0
mean,0.225627,0.213279,0.232026,0.266458,0.235319,0.000558,0.232376,0.00654,0.02731,1.8e-05,0.217533,0.782361,8.9e-05,0.010698,0.127523,0.836117,0.003536,0.022125
std,0.368164,0.350663,0.227334,0.238814,0.108339,0.023083,0.222231,0.080604,0.019964,0.004216,0.41257,0.412643,0.009426,0.102877,0.33356,0.370171,0.059363,0.14709
min,0.0,0.0,0.0,0.0,0.0,2.5e-05,0.0,0.0,0.002093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.047212,0.0,0.161421,0.231993,0.220613,2.5e-05,0.16926,0.0,0.016745,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.047212,0.0,0.161421,0.231993,0.220613,2.5e-05,0.16926,0.0,0.023025,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.047212,0.231993,0.161421,0.231993,0.220613,2.5e-05,0.16926,0.0,0.032967,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.206698,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


#### Modeling

In [49]:
def createModel(meta, optimizer='adam', hidden_layer_sizes=(0,)):
    from keras.models import Sequential
    from keras.layers import Dense
    
    model = Sequential()
    model.add(keras.Input(shape=(meta["n_features_in_"],)))
    for num_nodes in hidden_layer_sizes:
        model.add(layers.Dense(num_nodes, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['Recall'])
    return model

In [50]:
# create keras model
kModel = KerasClassifier(model=createModel, verbose=0)

In [51]:
# initialize a stratified KFold as the data set is unbalanced
cv_stratified = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)

##### Test 3 Models

In [53]:
# create parameter frid
param_grid = {
    'model__optimizer': ['adam'],
    'model__hidden_layer_sizes': [
        [1],                # model 1: 1 hidden layer with 1 node
        [18],               # model 2: 1 hidden layers with 10 nodes
        [5, 5, 5, 5, 5]     # model 3: 5 hidden layers with 2 node each
    ],
}

##### Hyperparameter tune via grid search

In [55]:
# Grid search CV
grid = GridSearchCV(estimator=kModel, param_grid=param_grid, scoring='recall', n_jobs=-1, cv=cv_stratified, verbose=True)
grid_result = grid.fit(X_train_preprocessed, y_train)

Fitting 3 folds for each of 3 candidates, totalling 9 fits


In [56]:
# show detailed results of 
grid_result.cv_results_

{'mean_fit_time': array([20.54781906, 21.63885768, 37.54701932]),
 'std_fit_time': array([2.11126402, 4.81825184, 1.07044993]),
 'mean_score_time': array([15.36713958, 15.08696659, 13.05680601]),
 'std_score_time': array([4.13671715, 1.05922224, 3.34535187]),
 'param_model__hidden_layer_sizes': masked_array(data=[list([1]), list([18]), list([5, 5, 5, 5, 5])],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'param_model__optimizer': masked_array(data=['adam', 'adam', 'adam'],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'model__hidden_layer_sizes': [1], 'model__optimizer': 'adam'},
  {'model__hidden_layer_sizes': [18], 'model__optimizer': 'adam'},
  {'model__hidden_layer_sizes': [5, 5, 5, 5, 5], 'model__optimizer': 'adam'}],
 'split0_test_score': array([0.99073117, 0.9919951 , 0.9921483 ]),
 'split1_test_score': array([0.98946764, 0.99176561, 0.99165071]),
 'split2_test_score': a

In [57]:
# Summarize Gridsearch results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best: 0.992212 using {'model__hidden_layer_sizes': [18], 'model__optimizer': 'adam'}


##### Evaluate model on test set

In [59]:
# num of features
X_shape = X_test_preprocessed.shape[1]

In [60]:
# create the tuned model
finalModel = keras.Sequential()
finalModel.add(keras.Input(shape=(X_shape,)))
finalModel.add(layers.Dense(5, activation='relu'))
finalModel.add(layers.Dense(5, activation='relu'))
finalModel.add(layers.Dense(5, activation='relu'))
finalModel.add(layers.Dense(5, activation='relu'))
finalModel.add(layers.Dense(5, activation='relu'))
finalModel.add(layers.Dense(1, activation='sigmoid'))
finalModel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Recall'])

In [61]:
# View Model Structure
finalModel.summary()

In [62]:
# Fit Final Model
finalModel.fit(X_train_preprocessed, y_train)

[1m10551/10551[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 3ms/step - Recall: 0.9889 - loss: 0.0858


<keras.src.callbacks.history.History at 0x2064dd1a350>

In [63]:
# predict with Final model
predictions = finalModel.predict(X_test_preprocessed)

[1m3517/3517[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 2ms/step


In [64]:
# function : convert continuous value to boolean
def makeBool(predictions):
    '''make predictions boolean (0.5 threshold)'''
    return predictions >= 0.5

In [65]:
# make the results boolean
predictions_bool = makeBool(predictions)

In [66]:
# Classification Report
classificationRpt = classification_report(y_test, predictions_bool, digits=5, target_names=['Legitimate URL', 'Phish URL'])
print(classificationRpt)

                precision    recall  f1-score   support

Legitimate URL    0.99788   0.99782   0.99785     86435
     Phish URL    0.99280   0.99299   0.99290     26109

      accuracy                        0.99670    112544
     macro avg    0.99534   0.99541   0.99537    112544
  weighted avg    0.99670   0.99670   0.99670    112544



> The recall of phish URLs with this model is ~99%