# Detect Phishing URLs
### Capstone 3 - Preprocessing and Modeling
Michael Garber

#### High-Level Steps
1. Preprocessing
    1. Create dummy/indicator features for categorical variables
    2. Standardize/scale numeric features
    3. Train/Test Split 
2. Modeling
    1. Fit your models with a training dataset
    2. Review model outcomes; iterate over additional models as needed
    3. Identify the final model that you think is the best model for this project

In [4]:
# Import Libraries
import pandas as pd
import os
# import numpy as np
# from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# from matplotlib import pyplot as plt

#### Preprocessing

In [6]:
# Import data set
dataPath = os.path.join('..', 'data', 'interim', 'urlData_raw.csv')
urlData = pd.read_csv(dataPath)


In [7]:
# Data Info
urlData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450175 entries, 0 to 450174
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Unnamed: 0        450175 non-null  int64 
 1   key_0             445854 non-null  object
 2   url               450175 non-null  object
 3   type              450175 non-null  object
 4   parsedUrl         450175 non-null  object
 5   urlPart_scheme    450175 non-null  object
 6   subDomain         379885 non-null  object
 7   domain            450167 non-null  object
 8   tld               445854 non-null  object
 9   urlPart_path      444917 non-null  object
 10  urlPart_query     65541 non-null   object
 11  urlPart_fragment  359 non-null     object
 12  tld_join          445854 non-null  object
 13  Domain            445451 non-null  object
 14  Type              445451 non-null  object
 15  TLD Manager       445451 non-null  object
 16  isIPaddress       450175 non-null  boo

In [8]:
urlData.head()

| | Unnamed: 0 | key_0 | url | type | parsedUrl | urlPart_scheme | subDomain | domain | tld | urlPart_path | urlPart_query | urlPart_fragment | tld_join | Domain | Type | TLD Manager | isIPaddress | isPhish_bool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | com | https://www.google.com | legitimate | ParseResult(scheme='https', netloc='www.google... | https | www | google | com | | | | com | .com | generic | VeriSign Global Registry Services | False | False |
| 1 | 1 | com | https://www.youtube.com | legitimate | ParseResult(scheme='https', netloc='www.youtub... | https | www | youtube | com | | | | com | .com | generic | VeriSign Global Registry Services | False | False |
| 2 | 2 | com | https://www.facebook.com | legitimate | ParseResult(scheme='https', netloc='www.facebo... | https | www | facebook | com | | | | com | .com | generic | VeriSign Global Registry Services | False | False |
| 3 | 3 | com | https://www.baidu.com | legitimate | ParseResult(scheme='https', netloc='www.baidu.... | https | www | baidu | com | | | | com | .com | generic | VeriSign Global Registry Services | False | False |
| 4 | 4 | org | https://www.wikipedia.org | legitimate | ParseResult(scheme='https', netloc='www.wikipe... | https | www | wikipedia | org | | | | org | .org | generic | Public Interest Registry (PIR) | False | False |


In [9]:
urlData.columns

Index(['Unnamed: 0', 'key_0', 'url', 'type', 'parsedUrl', 'urlPart_scheme',
       'subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query',
       'urlPart_fragment', 'tld_join', 'Domain', 'Type', 'TLD Manager',
       'isIPaddress', 'isPhish_bool'],
      dtype='object')

In [10]:
#urlData[['url', 'urlPart_scheme', 'subDomain', 'tld', 'domain', 'type', 'TLD Manager', 'isIPaddress', 'isPhish_bool']]
pd.DataFrame(urlData['urlPart_scheme'].value_counts())

| urlPart_scheme | count |
|---|---|
| https | 352185 |
| http | 97947 |
| httpss | 35 |
| ftp | 8 |


In [11]:
pd.DataFrame(urlData['subDomain'].value_counts())

| subDomain | count |
|---|---|
| www | 276100 |
| www.en | 13626 |
| www.music | 1289 |
| www.people | 1228 |
| www.genforum | 1072 |
| ... | ... |
| www.ohv.parks | 1 |
| www.ohtheplaceswewillgo-books | 1 |
| www.ohr | 1 |
| www.ohomen171s-journey-through-life | 1 |


In [12]:
# Let's see # of uniques for each feature - use to determine categorical fields to dummy
urlData[['url', 'type', 'parsedUrl', 'urlPart_scheme',
       'subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query',
       'urlPart_fragment', 'tld_join', 'Domain', 'Type', 'TLD Manager',
       'isIPaddress', 'isPhish_bool']].describe()

| | url | type | parsedUrl | urlPart_scheme | subDomain | domain | tld | urlPart_path | urlPart_query | urlPart_fragment | tld_join | Domain | Type | TLD Manager | isIPaddress | isPhish_bool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 450175 | 450175 | 450175 | 450175 | 379885 | 450167 | 445854 | 444917 | 65541 | 359 | 445854 | 445451 | 445451 | 445451 | 450175 | 450175 |
| unique | 450175 | 2 | 450132 | 4 | 32040 | 130746 | 831 | 317143 | 55325 | 71 | 415 | 360 | 4 | 259 | 2 | 2 |
| top | https://www.google.com | legitimate | ParseResult(scheme='http', netloc='new.sosnovs... | https | www | wikipedia | com | / | m=login | n=1252899642&fid=1&fav=1 | com | .com | generic | VeriSign Global Registry Services | False | False |
| freq | 1 | 345738 | 2 | 352185 | 276100 | 12895 | 316414 | 55253 | 579 | 204 | 316414 | 316414 | 376803 | 333004 | 447309 | 345738 |


In [13]:
# features to keep
urlData[['url', 'urlPart_scheme', 'subDomain', 'domain', 'tld', 'urlPart_path', 'urlPart_query', 'urlPart_fragment','Type', 'TLD Manager',
       'isIPaddress', 'isPhish_bool']].describe()

| | url | urlPart_scheme | subDomain | domain | tld | urlPart_path | urlPart_query | urlPart_fragment | Type | TLD Manager | isIPaddress | isPhish_bool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 450175 | 450175 | 379885 | 450167 | 445854 | 444917 | 65541 | 359 | 445451 | 445451 | 450175 | 450175 |
| unique | 450175 | 4 | 32040 | 130746 | 831 | 317143 | 55325 | 71 | 4 | 259 | 2 | 2 |
| top | https://www.google.com | https | www | wikipedia | com | / | m=login | n=1252899642&fid=1&fav=1 | generic | VeriSign Global Registry Services | False | False |
| freq | 1 | 352185 | 276100 | 12895 | 316414 | 55253 | 579 | 204 | 376803 | 333004 | 447309 | 345738 |


In [125]:
# Do the existing parsedUrl parts capture the entire URL (is any section of the URL lost)?
#urlData[['url', 'parsedUrl', 'isPhish_bool']]
urlData['parsedUrl']

0         ParseResult(scheme='https', netloc='www.google...
1         ParseResult(scheme='https', netloc='www.youtub...
2         ParseResult(scheme='https', netloc='www.facebo...
3         ParseResult(scheme='https', netloc='www.baidu....
4         ParseResult(scheme='https', netloc='www.wikipe...
                                ...                        
450170    ParseResult(scheme='http', netloc='ecct-it.com...
450171    ParseResult(scheme='http', netloc='faboleena.c...
450172    ParseResult(scheme='http', netloc='faboleena.c...
450173    ParseResult(scheme='http', netloc='atualizapj....
450174    ParseResult(scheme='http', netloc='writeassoci...
Name: parsedUrl, Length: 450175, dtype: object
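
One way to answer the question in the cell above is a round-trip check: parse each URL, reassemble it, and compare the result with the original string. The sketch below uses only the standard library's `urlparse` and `urlunparse`. The sample rows and the `roundtrips` helper are hypothetical, and it assumes the `parsedUrl` column came from `urllib.parse.urlparse`, which the `ParseResult(...)` strings suggest but the notebook does not confirm.

```python
from urllib.parse import urlparse, urlunparse

import pandas as pd

# Hypothetical sample; the real notebook would use urlData['url'] instead.
sample = pd.DataFrame({
    'url': [
        'https://www.google.com',
        'http://example.com/path/page.html?m=login#top',
        'http://example.com/path?',  # bare '?' with an empty query
    ]
})


def roundtrips(url: str) -> bool:
    """Return True if parsing and reassembling the URL gives back the same string."""
    return urlunparse(urlparse(url)) == url


sample['roundtrips'] = sample['url'].apply(roundtrips)
print(sample)
```

Rows that fail the check are not necessarily missing information: reassembly can normalize a URL, for example by dropping an empty query or fragment delimiter, so failures are worth inspecting by hand before drawing conclusions.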

In [38]:
# Fields that should be treated as categories

##### Create dummies

In [40]:
urlData['TLD Manager'].value_counts()

TLD Manager
VeriSign Global Registry Services                                                                               333004
Public Interest Registry (PIR)                                                                                   38393
Canadian Internet Registration Authority (CIRA) Autorité Canadienne pour les enregistrements Internet (ACEI)     10086
EDUCAUSE                                                                                                          6976
Nominet UK                                                                                                        5997
                                                                                                                 ...  
Dot London Domains Limited                                                                                           1
Macao Post and Telecommunications Bureau (CTT)                                                                       1
Premier Registry Limited            
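
The `describe()` output earlier reports 4 unique values for `urlPart_scheme` and for `Type`, against 259 for `TLD Manager`, many of which appear only once. A minimal sketch of one possible approach: lump rare categories into a single label, then create indicator columns with `pd.get_dummies`. The `lump_rare` helper, the `min_count` threshold, and the sample values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of urlData; column names match the notebook, values are made up.
df = pd.DataFrame({
    'urlPart_scheme': ['https', 'http', 'https', 'ftp', 'https', 'http'],
    'Type': ['generic', 'generic', 'country-code', 'generic', None, 'generic'],
})


def lump_rare(series: pd.Series, min_count: int, other: str = 'other') -> pd.Series:
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)


df['urlPart_scheme'] = lump_rare(df['urlPart_scheme'], min_count=2)

# dummy_na=True keeps missing values as their own indicator column.
dummies = pd.get_dummies(df, columns=['urlPart_scheme', 'Type'], dummy_na=True, dtype=int)
print(dummies.columns.tolist())
```

To avoid leakage, the set of rare categories would ideally be decided from the training rows only and then applied unchanged to the test rows.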

##### Standardize and Scale
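
A minimal sketch of this step, assuming a numeric feature such as `urlLength` exists (it is listed in the to-do notes but not yet computed, so the values here are made up). The scaler is fitted on the training rows only and then reused for the test rows, so test-set statistics do not leak into preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature; the values are illustrative only.
train = pd.DataFrame({'urlLength': [22.0, 23.0, 24.0, 21.0, 60.0]})
test = pd.DataFrame({'urlLength': [25.0, 110.0]})

scaler = StandardScaler()
# Fit on the training rows only, then reuse the same mean and std for the test rows.
train_scaled = scaler.fit_transform(train[['urlLength']])
test_scaled = scaler.transform(test[['urlLength']])

print(train_scaled.mean(), train_scaled.std())
```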

##### Train/Test Split
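
The `describe()` output earlier shows 345,738 of 450,175 rows with `isPhish_bool` equal to False, so roughly 23% of URLs are phishing. A stratified split keeps that share about the same in both partitions. The sketch below uses a made-up frame; `test_size` and `random_state` are arbitrary choices, not values taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with the notebook's target column; values are made up.
df = pd.DataFrame({
    'urlLength': range(20),
    'isPhish_bool': [False] * 15 + [True] * 5,
})

X = df.drop(columns='isPhish_bool')
y = df['isPhish_bool']

# stratify=y keeps the phishing share roughly equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```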

#### Modeling

##### Fit Model

##### Evaluate/compare models
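
One possible way to compare candidates is stratified k-fold cross-validation against a trivial baseline. The sketch below is an illustration under assumptions: the data is synthetic (generated to mimic a roughly 77/23 class split), logistic regression stands in for whatever models the project ends up using, and balanced accuracy is just one reasonable metric for imbalanced classes.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded URL features.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_clusters_per_class=1,
    weights=[0.77, 0.23], random_state=0,
)

models = {
    'baseline (most frequent)': DummyClassifier(strategy='most_frequent'),
    'logistic regression': LogisticRegression(max_iter=1000),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # Balanced accuracy averages recall over both classes, so always
    # predicting "legitimate" does not look good by default.
    scores[name] = cross_val_score(model, X, y, cv=cv, scoring='balanced_accuracy').mean()
    print(f'{name}: mean balanced accuracy = {scores[name]:.3f}')
```

A model that cannot clearly beat the most-frequent baseline is not learning anything useful from the features.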

##### Select Best Model

###### To do

- features to engineer: urlLength, riskRank per URL part (give each a 1 to 5 rank based on the mean target value)
- feature selection - some can be discarded/dropped
- problem: ANNs/MLPs cannot take raw text features. All inputs must be numeric, but the network will accept numeric representations/encodings of text data
    - label encode into categories; calculate this field after the train/test split to avoid data leakage
- additional data cleaning
    - handle missing values
    
- Create dummies
    - identify 'category' features
        - check value counts to see which has reasonable # of categories
    - select minimum set of features with unique information
    - create new DF from that
    - dummy them
    
- make sure to use cross-validation for 
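
Two of the to-do items above, the `urlLength` feature and missing-value handling, can be sketched as follows. The `has_*` column names and the `'none'` placeholder are hypothetical choices, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical rows shaped like urlData; values are made up.
df = pd.DataFrame({
    'url': ['https://www.google.com', 'http://example.com/login?m=login'],
    'subDomain': ['www', None],
    'urlPart_query': [None, 'm=login'],
})

# Character length of the full URL.
df['urlLength'] = df['url'].str.len()

# Record whether each optional URL part is present before filling the gaps,
# so "no query string" stays visible to the model as information.
for col in ['subDomain', 'urlPart_query']:
    df[f'has_{col}'] = df[col].notna()
    df[col] = df[col].fillna('none')

print(df)
```

Keeping an explicit presence flag may matter here because most rows have no query or fragment at all (65,541 and 359 non-null values respectively in the `info()` output).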


In [24]:
# https://machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras/

# TARGET/MEAN ENCODING
# Concept: Replaces a categorical value with the mean of the target variable for that category.
# How it works for Domain (e.g., for malicious URL detection):
# For each domain, calculate the proportion of malicious URLs associated with it in your training data.
# google.com (if all benign) -> 0.0
# phishing.ru (if mostly malicious) -> 0.95
# Pros: Captures some predictive power directly, reduces dimensionality.
# Cons: Can lead to data leakage if not done carefully (e.g., using the entire dataset's target mean, rather than cross-validation folds). Sensitive to noise for categories with few samples.
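
The notes above can be sketched as an out-of-fold encoder: each row's encoding is computed from the other folds only, which addresses the leakage concern. The additive smoothing toward the prior is an extra assumption, not something the notes specify; it is one common way to damp noise for categories with few samples. The function name, parameters, and sample rows are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training rows; column names follow the notebook, values are made up.
train = pd.DataFrame({
    'domain': ['google', 'google', 'google', 'phishy', 'phishy', 'phishy', 'rare', 'google'],
    'isPhish_bool': [0, 0, 0, 1, 1, 0, 1, 0],
})


def oof_target_encode(df, col, target, n_splits=4, smoothing=5.0, seed=0):
    """Out-of-fold mean encoding with additive smoothing toward the fold's prior."""
    encoded = pd.Series(index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit = df.iloc[fit_idx]
        prior = fit[target].mean()
        stats = fit.groupby(col)[target].agg(['sum', 'count'])
        # Shrink small categories toward the prior to damp noise.
        smoothed = (stats['sum'] + smoothing * prior) / (stats['count'] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = df.iloc[enc_idx][col].map(smoothed).fillna(prior).to_numpy()
    return encoded


train['domain_te'] = oof_target_encode(train, 'domain', 'isPhish_bool')
print(train)
```

For the held-out test set, the usual pattern is to compute the smoothed means once on the full training set and map them onto the test rows, again falling back to the prior for unseen domains.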