# Detect Phishing URLs
### Capstone 3 - Data Wrangling and EDA
Michael Garber

#### High-Level Steps
1. Data Wrangling
    1. Data Collection
    2. Data Organization
    3. Data Definition
    4. Data Cleaning
1. Exploratory Data Analysis


#### Data Wrangling

##### Data Collection
Info/Data
- https://www.kaggle.com/datasets/hassaanmustafavi/phishing-urls-dataset

In [6]:
# Please Install KaggleHub if you don't already have it
%pip install kagglehub

Note: you may need to restart the kernel to use updated packages.


In [7]:
# Import Packages
import pandas as pd
import kagglehub
import os
import shutil
from urllib.parse import urlparse

In [8]:
# Download Data
kPath = kagglehub.dataset_download("hassaanmustafavi/phishing-urls-dataset")

In [9]:
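# Build source and destination file paths
fileName = 'url_dataset.csv'
srcFilePath = os.path.join(kPath, fileName)
dataDir = os.path.join(os.pardir, 'data', 'raw', fileName)  # destination file path

# Make sure the destination folder exists, then copy the file into the project folder
os.makedirs(os.path.dirname(dataDir), exist_ok=True)
shutil.copy(srcFilePath, dataDir)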

'..\\data\\raw\\url_dataset.csv'

In [10]:
# Create pandas dataframe - urlData
urlData = pd.read_csv(dataDir)

In [11]:
# Create pandas dataframe - TLDs (Top-level Domains)
tldDir = '../data/raw/TLDs.csv'
tldNames = pd.read_csv(tldDir)

# TLDs source...
# https://www.iana.org/domains/root/db

In [12]:
# check dataframe urlData - (# of rows, # of columns)
print(urlData.shape)

(450176, 2)


In [13]:
# check dataframe tldNames - (# of rows, # of columns)
print(tldNames.shape)

(1591, 3)


##### Data Organization
The project file structure is based on the Cookiecutter Data Science template. \
[https://drivendata.github.io/cookiecutter-data-science/](https://drivendata.github.io/cookiecutter-data-science/)

Folder structure tree (GitHub) \
[https://github.com/mdgarber/DetectPhishURL/blob/7dd7d38c001590b4629f8810906a3724ab107fd5/DetectPhishURL/README.md](https://github.com/mdgarber/DetectPhishURL/blob/7dd7d38c001590b4629f8810906a3724ab107fd5/DetectPhishURL/README.md)

##### Data Definition

- Column names
- Data types
- Description of the columns
- Counts and unique values
- Ranges of values
- Summary statistics

In [16]:
# Data types, non-null counts, range of index
urlData.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     450176 non-null  object
 1   type    450176 non-null  object
dtypes: object(2)
memory usage: 6.9+ MB


Description of the columns
- url - The web address to be analyzed.
- type - The classification of the URL (phishing or legitimate).

In [18]:
# let pandas auto-detect the display width (per-column truncation is controlled separately by 'display.max_colwidth')
pd.set_option('display.width', None)

In [19]:
# Value counts for URLs
pd.DataFrame(urlData['url'].value_counts())

url,count
https://www.google.com,1
https://www.tabheaven.com/gales-eric-tabs.html,1
https://www.tabs-database.com/justin-mcroberts-chords.html,1
https://www.tabpower.com/a806.html,1
https://www.tabor.edu/alumni/directory?decade=1960,1
...,...
https://www.billboard.com/artist/anita-pointer/discography/songs/23169,1
https://www.billboard.com/artist/andy-kim/chart-history/21881,1
https://www.billboard.com/artist/alvino-rey/discography/compilations/9479?sort=alphabet,1
https://www.billboard.com/artist/3x-krazy/discography/albums/142659,1


- Every URL in the dataset is unique (the highest count is 1).

In [21]:
pd.DataFrame(urlData['type'].value_counts()).head()

type,count
legitimate,345738
phishing,104438


In [22]:
# describe dataframe
urlData.describe()

,url,type
count,450176,450176
unique,450176,2
top,https://www.google.com,legitimate
freq,1,345738


Data Definition Summary

- Column names
    - **url**
        - Data type
            - string (stored as pandas `object` dtype)
        - Description of the column
            - The web address to be analyzed.
        - Counts and unique values
            - 450176 rows, all unique
        - Range of values
            - free-text URLs; every value is unique
    - **type**
        - Data type
            - string (stored as pandas `object` dtype)
        - Description of the column
            - The classification of the URL (phishing or legitimate).
        - Counts and unique values
            - legitimate: 345738
            - phishing: 104438
        - Range of values
            - phishing
            - legitimate

- Summary statistics
    - Not applicable; both columns are text, so only count, unique, top, and freq are reported.
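
The counts in this summary can be checked, and turned into class proportions, with a short sketch like the one below. The `sample` frame is a made-up stand-in for `urlData` (same column names, invented rows), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for urlData: same columns, invented rows
sample = pd.DataFrame({
    'url': ['https://a.example', 'https://b.example',
            'http://c.example', 'http://d.example'],
    'type': ['legitimate', 'legitimate', 'legitimate', 'phishing'],
})

# Number of distinct values in each column
uniqueCounts = sample.nunique()

# Share of each class; the proportions sum to 1
classShare = sample['type'].value_counts(normalize=True)

print(uniqueCounts)
print(classShare)
```

Applied to the counts reported above, the split is roughly 76.8% legitimate and 23.2% phishing, so the classes are imbalanced.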

##### Data Cleaning

- This dataset is mostly clean.
- Identify invalid URLs (those that urllib cannot parse) and remove them.
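
One possible way to flag unparseable URLs is sketched below. It is a variation on the idea, not the notebook's own cell: the helper name `is_parseable`, the `isValid` column, and the two sample URLs are all made up for illustration. The second URL has an unmatched `[` in its host, which `urlparse` rejects with a `ValueError`.

```python
from urllib.parse import urlparse
import pandas as pd

def is_parseable(url):
    """Return True if urlparse accepts the URL, False if it raises ValueError."""
    try:
        urlparse(url)
        return True
    except ValueError:
        return False

# Hypothetical sample: the second host has an unmatched '[' and cannot be parsed
sample = pd.DataFrame({
    'url': ['https://www.example.com/path', 'http://[broken-host/login'],
})
sample['isValid'] = sample['url'].apply(is_parseable)

# Keep only the rows that parsed, then drop the helper column
cleaned = sample[sample['isValid']].drop(columns='isValid')
print(cleaned)
```

Which URLs count as invalid can differ between Python versions, because `urlparse` has become stricter about bracketed hosts over time.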

#### Exploratory Data Analysis

- create new features to parse URLs
- do value counts on the new url parts
- check correlations of the url parts to the target variable
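
As a rough sketch of the first two steps, the block below parses a few made-up URLs into separate columns and counts the values of one of them. The column names (`url_scheme`, `url_hostname`, and so on) and the sample rows are assumptions for illustration, not the final feature set.

```python
from urllib.parse import urlparse
import pandas as pd

# Hypothetical sample with the same columns as urlData
sample = pd.DataFrame({
    'url': ['https://www.example.com/a?x=1',
            'http://login.example.net/b',
            'http://example.org'],
    'type': ['legitimate', 'phishing', 'phishing'],
})

# Parse once, then pull individual parts into their own columns
parsed = sample['url'].apply(urlparse)
sample['url_scheme'] = parsed.apply(lambda p: p.scheme)
sample['url_hostname'] = parsed.apply(lambda p: p.hostname)
sample['url_path'] = parsed.apply(lambda p: p.path)
sample['url_query'] = parsed.apply(lambda p: p.query)

# Value counts on one of the new URL parts
schemeCounts = sample['url_scheme'].value_counts()
print(schemeCounts)
```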

In [26]:
# view the data head - URLs
urlData.head(10)

,url,type
0,https://www.google.com,legitimate
1,https://www.youtube.com,legitimate
2,https://www.facebook.com,legitimate
3,https://www.baidu.com,legitimate
4,https://www.wikipedia.org,legitimate
5,https://www.reddit.com,legitimate
6,https://www.yahoo.com,legitimate
7,https://www.google.co.in,legitimate
8,https://www.qq.com,legitimate
9,https://www.amazon.com,legitimate


In [27]:
# view the data head - only phish
urlData[urlData['type'] == 'phishing'].head(10)

,url,type
345738,http://atualizacaodedados.online,phishing
345739,http://webmasteradmin.ukit.me/,phishing
345740,http://stcdxmt.bigperl.in/klxtv/apps/uk/,phishing
345741,https://tubuh-syarikat.com/plugins/fields/files/,phishing
345742,http://rolyborgesmd.com/exceword/excel.php?.ra...,phishing
345743,http://ongelezen-voda.000webhostapp.com/inlogg...,phishing
345744,http://www.valenzaceramic.com/home/webapps/e52...,phishing
345745,http://membership-issue.forteimpex.com/dk2mmm=...,phishing
345746,http://membership-issue.forteimpex.com/dk2mmm=...,phishing
345747,http://chronopost-service-enligne.net/56123s/r...,phishing


In [28]:
# view the tail of the TLD table (to be used as metadata for URLs)
tldNames.tail(300)

,Domain,Type,TLD Manager
1291,.travelersinsurance,generic,"Travelers TLD, LLC"
1292,.trust,generic,Internet Naming Co.
1293,.trv,generic,"Travelers TLD, LLC"
1294,.tt,country-code,University of the West Indies Faculty of Engin...
1295,.tube,generic,Latin American Telecom LLC
...,...,...,...
1586,.zippo,generic,Not assigned
1587,.zm,country-code,Zambia Information and Communications Technolo...
1588,.zone,generic,"Binky Moon, LLC"
1589,.zuerich,generic,Kanton Zürich (Canton of Zurich)


In [29]:
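# create function urlparse_try
#    without it, .apply(urlparse) would stop at the first invalid URL
def urlparse_try(url):
    """Parse a URL, returning the sentinel 'invalid URL' if urlparse rejects it."""
    try:
        return urlparse(url)
    except ValueError:
        # urlparse raises ValueError for malformed URLs (e.g. a bad bracketed host)
        return 'invalid URL'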

In [30]:
# engineer new feature for EDA - parsedUrl
urlData['parsedUrl'] = urlData['url'].apply(urlparse_try)

In [31]:
# example of a parsed url
print('Example of a parsed url \n')
print(urlparse(urlData['url'][1]))

Example of a parsed url 

ParseResult(scheme='https', netloc='www.youtube.com', path='', params='', query='', fragment='')


Parsed URL parts
> scheme://netloc/path;parameters?query#fragment
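
To make the template concrete, here is a small sketch that parses a made-up URL containing every component. The URL itself is invented for illustration; the attribute names are the standard ones on the result of `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse

# Made-up URL that exercises every component of the template
parts = urlparse('https://www.example.com:8443/docs/page;v=2?q=phish&lang=en#top')

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com:8443
print(parts.hostname)  # www.example.com
print(parts.port)      # 8443
print(parts.path)      # /docs/page
print(parts.params)    # v=2
print(parts.query)     # q=phish&lang=en
print(parts.fragment)  # top
```

Note that `hostname` and `port` are derived from `netloc`; they are not separate slots in the six-field result.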

In [33]:
# URL parts to extract as features:
#   scheme
#   hostname
#   subdomain
#   domain
#   top level domain
#   port
#   path
#   query
#   fragment

In [175]:
# extract the scheme from the first 300,000 parsed URLs
urlData['parsedUrl'].head(300000).apply(lambda urlObj: urlObj.scheme)
#urlData['parsedUrl'][8].scheme

0         https
1         https
2         https
3         https
4         https
          ...  
299995    https
299996    https
299997    https
299998    https
299999    https
Name: parsedUrl, Length: 300000, dtype: object

In [125]:
# engineer new URL part features for EDA
#urlData['url_scheme']
urlData['parsedUrl'][0].scheme

'https'

In [185]:
# find invalid URLs (urls that failed urllib parsing...those will cause errors)
invalidUrlRows = urlData[urlData['parsedUrl'] == 'invalid URL']
pd.DataFrame(invalidUrlRows).head()

,url,type,parsedUrl
397556,http://ladiesfirst-privileges[.]com/656465/d56...,phishing,invalid URL


In [193]:
# remove invalid URLs (they will cause errors)
urlData = urlData.drop(invalidUrlRows.index)

In [195]:
urlData.shape

(450175, 3)

> **URL Part Reference**

![Image](../references/URL_part_diagram.png)

- image from https://www.geeksforgeeks.org/components-of-a-url/


In [38]:
# public attributes and methods available on a urlparse result (ParseResult)
'''
'count',
 'encode',
 'fragment',
 'geturl',
 'hostname',
 'index',
 'netloc',
 'params',
 'password',
 'path',
 'port',
 'query',
 'scheme',
 'username'


to use...
scheme
hostname
subdomain (derived from hostname)
domain (derived from hostname)
top level domain (derived from hostname)
port
path
query
fragment
'''

In [39]:
print(urlData['parsedUrl'][1])

ParseResult(scheme='https', netloc='www.youtube.com', path='', params='', query='', fragment='')


In [40]:
# split hostname into its dot-separated parts (str.split is only referenced here, not called)
urlData['parsedUrl'][0].hostname.split

<function str.split(sep=None, maxsplit=-1)>

In [41]:
#print(urlData['parsedUrl'][1].port)
dir(urlData['parsedUrl'][1])

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__match_args__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '_asdict',
 '_encoded_counterpart',
 '_field_defaults',
 '_fields',
 '_hostinfo',
 '_make',
 '_replace',
 '_userinfo',
 'count',
 'encode',
 'fragment',
 'geturl',
 'hostname',
 'index',
 'netloc',
 'params',
 'password',
 'path',
 'port',
 'query',
 'scheme',
 'username']

In [42]:
urlData['parsedUrl'][0].hostname.split(sep='.')

#apply(hostname.split(sep='.'))

['www', 'google', 'com']
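
Building on this split, a deliberately naive sketch for turning a hostname into subdomain, domain, and top-level domain might look like the block below. The helper name `split_hostname` is hypothetical, and it assumes the last label is the TLD and the label before it is the domain. That assumption breaks for multi-label suffixes such as `co.in`, which need a public suffix list to handle correctly.

```python
def split_hostname(hostname):
    """Naively split a hostname into (subdomain, domain, tld).

    Assumption: the last label is the TLD and the one before it is the domain.
    """
    labels = hostname.split('.')
    if len(labels) == 1:
        # Single-label host such as 'localhost': no TLD, no subdomain
        return '', labels[0], ''
    subdomain = '.'.join(labels[:-2])
    return subdomain, labels[-2], labels[-1]

print(split_hostname('www.google.com'))    # ('www', 'google', 'com')
print(split_hostname('www.google.co.in'))  # ('www.google', 'co', 'in')
```

The second call shows the limitation: the registered domain is really `google` under the `co.in` suffix, but the naive rule reports `co` as the domain.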

In [43]:
urlData['parsedUrl'].head(8)

0       (https, www.google.com, , , , )
1      (https, www.youtube.com, , , , )
2     (https, www.facebook.com, , , , )
3        (https, www.baidu.com, , , , )
4    (https, www.wikipedia.org, , , , )
5       (https, www.reddit.com, , , , )
6        (https, www.yahoo.com, , , , )
7     (https, www.google.co.in, , , , )
Name: parsedUrl, dtype: object

In [44]:
# relationship between http vs https (secure) and type (phish or not)
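
A minimal sketch of how this comparison could be run is a row-normalized contingency table. The `url_scheme` column is an assumed feature name (it does not exist in `urlData` yet) and the rows are invented, so the proportions are illustrative only.

```python
import pandas as pd

# Hypothetical sample; 'url_scheme' is an assumed, not-yet-engineered feature
sample = pd.DataFrame({
    'url_scheme': ['https', 'https', 'https', 'http', 'http', 'http'],
    'type': ['legitimate', 'legitimate', 'phishing',
             'phishing', 'phishing', 'legitimate'],
})

# Share of each class within each scheme (every row sums to 1)
schemeVsType = pd.crosstab(sample['url_scheme'], sample['type'], normalize='index')
print(schemeVsType)
```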

In [45]:
# relationship between top level domain and type (phish or not)
# 
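
One way this could be approached, once a TLD column exists, is a per-TLD phishing rate. In the sketch below the `tld` and `isPhish` columns and all rows are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample; 'tld' is an assumed, not-yet-engineered feature
sample = pd.DataFrame({
    'tld': ['.com', '.com', '.com', '.in', '.in', '.me'],
    'type': ['legitimate', 'legitimate', 'phishing',
             'legitimate', 'phishing', 'phishing'],
})

# 1 for phishing, 0 for legitimate, so the mean is the phishing rate
sample['isPhish'] = (sample['type'] == 'phishing').astype(int)

tldSummary = (
    sample.groupby('tld')['isPhish']
    .agg(urlCount='count', phishRate='mean')
    .sort_values('urlCount', ascending=False)
)
print(tldSummary)
```

Reporting the URL count next to the rate helps avoid over-reading TLDs that appear only a handful of times.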

In [46]:
# Export df to file
urlData.to_csv('../data/interim/urlData_raw.csv')

- The dataset is very simple and not well suited for statistical analysis in its current form.
- The URLs will be parsed and features engineered in the next notebook (pre-processing).

- Explore the data a bit more for EDA
- Look at the relationship between the target and
    - https vs. http URLs
    - .com vs. .com.in (country-specific domains)
    - any other initial analysis that can be done without extensive feature engineering (that will be done in the next notebook)
- Resources
    - https://www.geeksforgeeks.org/components-of-a-url/
    - https://www.techtarget.com/whatis/definition/named-entity-recognition-NER   (from Kevin)

In [48]:
'''
TO DO
- split the hostname into URL parts: subdomain, domain, top level domain
- join the TLD columns onto urlData, e.g.
    merged_left = pd.merge(df_left, df_right, on='key', how='left')
- use hostname to check whether the host is an IP address or a domain name
- do additional EDA
'''
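
Two of the items on this list, the IP-address check and the TLD join, can be sketched together as shown below. Everything here is illustrative: `urls` and `tlds` are small made-up stand-ins for `urlData` and `tldNames`, the helper `is_ip_address` and the `hostname`, `isIp`, and `tld` columns are hypothetical names, and the TLD is taken naively as the text after the last dot.

```python
import ipaddress
import pandas as pd

def is_ip_address(hostname):
    """Return True if the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False

# Hypothetical stand-ins for urlData (with a hostname column) and tldNames
urls = pd.DataFrame({
    'hostname': ['www.google.com', 'webmasteradmin.ukit.me', '192.168.0.1'],
})
tlds = pd.DataFrame({
    'Domain': ['.com', '.me'],
    'Type': ['generic', 'country-code'],
})

urls['isIp'] = urls['hostname'].apply(is_ip_address)

# Naive TLD: the text after the last dot, prefixed with '.' to match 'Domain'
urls['tld'] = '.' + urls['hostname'].str.rsplit('.', n=1).str[-1]
urls.loc[urls['isIp'], 'tld'] = None  # an IP address has no TLD

# Left join keeps every URL row, even when no TLD metadata matches
merged = urls.merge(tlds, left_on='tld', right_on='Domain', how='left')
print(merged)
```

A left join is used so that hosts without a matching TLD (such as IP addresses) stay in the result with missing metadata instead of being dropped.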