# **Identifying Phishing Websites**

The goal of this document is to review how best to evaluate whether a website is part of a phishing scam. The evaluation uses the Phishing Website Data Set from the UCI Machine Learning Repository. The data set has 31 columns (30 features plus a final label, Result), which are detailed in _Phishing Website Features_, a paper written by Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey. These three researchers developed a classification scheme that can be broken down into the following thirty-one items:

- Having IP Address
- URL Length
- Shortining Service
- Having @ Symbol
- Double Slash Redirecting
- Prefix Suffix
- Having Sub-Domain
- SSL Final State
- Domain Registration Length
- Favicon
- port
- HTTPS Token
- Request URL
- URL of Anchor
- Link in Tags
- SFH
- Submitting to Email
- Abnormal URL
- Redirect
- On MouseOver
- Right Click
- Pop-up Window
- Iframe
- Age of Domain
- DNS Record
- Web Traffic
- Page Rank
- Google Index
- Links Pointing to Page
- Statistical Report
- Result

Although the data set contains a value for each of these items, the columns themselves are not labeled, as the output below shows.

In [2]:
# Execute plot() inline without calling show()
%matplotlib inline
import warnings
warnings.simplefilter('ignore')

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from mlxtend.plotting import plot_decision_regions
from sklearn.metrics import accuracy_score

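# The CSV has no header row, so read_csv takes the first data row as column names
# (visible in the output below as -1, 1, 1.1, ...).
df = pd.read_csv(os.path.join('Data', 'phishing_dataset.csv'))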

print(df.head())

   -1  1  1.1  1.2  -1.1  -1.2  -1.3  -1.4  -1.5  1.3  ...  1.9  1.10  -1.11  \
0   1  1    1    1     1    -1     0     1    -1    1  ...    1     1     -1   
1   1  0    1    1     1    -1    -1    -1    -1    1  ...    1     1      1   
2   1  0    1    1     1    -1    -1    -1     1    1  ...    1     1     -1   
3   1  0   -1    1     1    -1     1     1    -1    1  ...   -1     1     -1   
4  -1  0   -1    1    -1    -1     1     1    -1    1  ...    1     1      1   

   -1.12  -1.13  -1.14  1.11  1.12  -1.15  -1.16  
0     -1      0     -1     1     1      1     -1  
1     -1      1     -1     1     0     -1     -1  
2     -1      1     -1     1    -1      1     -1  
3     -1      0     -1     1     1      1      1  
4      1      1     -1     1    -1     -1      1  

[5 rows x 31 columns]


As the output shows, the imported data has no column labels. The coding scheme also varies by feature: most features are binary (-1 or 1), some are ternary (-1, 0, or 1), and Redirect is coded 0 or 1. The following table shows the values each feature takes in the data set:

>Feature | Coded values
>--- | ---
> Having IP Address | -1, 1
> URL Length | -1, 0, 1
> Shortining Service | -1, 1
> Having @ Symbol | -1, 1
> Double Slash Redirecting | -1, 1
> Prefix Suffix | -1, 1
> Having Sub-Domain | -1, 0, 1
> SSL Final State | -1, 0, 1
> Domain Registration Length | -1, 1
> Favicon | -1, 1
> port | -1, 1
> HTTPS Token | -1, 1
> Request URL | -1, 1
> URL of Anchor | -1, 0, 1
> Link in Tags | -1, 0, 1
> SFH | -1, 0, 1
> Submitting to Email | -1, 1
> Abnormal URL | -1, 1
> Redirect | 0, 1
> On MouseOver | -1, 1
> Right Click | -1, 1
> Pop-up Window | -1, 1
> Iframe | -1, 1
> Age of Domain | -1, 1
> DNS Record | -1, 1
> Web Traffic | -1, 0, 1
> Page Rank | -1, 1
> Google Index | -1, 1
> Links Pointing to Page | -1, 0, 1
> Statistical Report | -1, 1
> Result | -1, 1
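
Because the file has no header row, one option is to attach names while loading. The block below is a minimal sketch, assuming the 31 columns appear in the same order as the table above; the snake_case spellings in `COLUMN_NAMES` and the helper name `load_labeled` are introduced here for illustration and are not part of the data set. Two made-up rows stand in for the real file so the example runs on its own. Reading with `header=None` keeps the first row as data, so row counts would differ by one from the runs shown in this document.

```python
import io
import pandas as pd

# Names follow the order of the feature table; the spellings are an assumption.
COLUMN_NAMES = [
    "having_ip_address", "url_length", "shortening_service", "having_at_symbol",
    "double_slash_redirecting", "prefix_suffix", "having_sub_domain",
    "ssl_final_state", "domain_registration_length", "favicon", "port",
    "https_token", "request_url", "url_of_anchor", "links_in_tags", "sfh",
    "submitting_to_email", "abnormal_url", "redirect", "on_mouseover",
    "right_click", "popup_window", "iframe", "age_of_domain", "dns_record",
    "web_traffic", "page_rank", "google_index", "links_pointing_to_page",
    "statistical_report", "result",
]


def load_labeled(source):
    """Read a header-less CSV and attach the 31 column names."""
    return pd.read_csv(source, header=None, names=COLUMN_NAMES)


# Two made-up rows stand in for Data/phishing_dataset.csv.
sample = io.StringIO(
    ",".join(["1"] * 31) + "\n" + ",".join(["-1"] * 31) + "\n"
)
labeled = load_labeled(sample)
print(labeled.shape)
print(labeled[["ssl_final_state", "url_of_anchor", "result"]])
```

With names attached, a column can be selected by name instead of position, for example `labeled["ssl_final_state"]` rather than `df.iloc[:, 7]`.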

With this in mind, we can make an initial evaluation of the data. The focus here is a perceptron, though other algorithms could be used in an attempt to be more accurate.

There are 30 features to evaluate (not counting the final column 'Result', which is the label rather than a feature). Each feature $x_{i}$ is paired with a weight $w_{i}$, for $i = 1, \dots, n$, and a threshold function $f$ is applied to their weighted sum:

>$ y = f(w_{1}x_{1} + w_{2}x_{2} + \dots + w_{n}x_{n})$

If $y \in \{-1, 1\}$, we assume that -1 means the URL is not phishing and 1 means the URL is associated with phishing.
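
To make the weighted sum concrete, here is a minimal from-scratch sketch of the classic perceptron rule in NumPy. It is not the scikit-learn implementation used in the rest of this document. The function names, the toy data, and the bias term `b` are assumptions added for illustration (the equation above omits a bias).

```python
import numpy as np


def predict(X, w, b):
    """Threshold the weighted sum: +1 if w.x + b >= 0, else -1."""
    return np.where(X @ w + b >= 0.0, 1, -1)


def fit_perceptron(X, y, eta=0.1, epochs=40):
    """Classic perceptron rule: adjust w and b after each misclassified sample."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # update is zero when the sample is already classified correctly
            update = eta * (target - predict(xi.reshape(1, -1), w, b)[0])
            w += update * xi
            b += update
    return w, b


# Toy data (made up): two features coded -1/1, label follows the first feature.
X_toy = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y_toy = np.array([1, 1, -1, -1])

w, b = fit_perceptron(X_toy, y_toy)
print("weights:", w, "bias:", b)
print("predictions:", predict(X_toy, w, b))
```

On this toy set the learned weight for the first feature ends up larger than the weight for the second, which mirrors the idea of looking for the features that carry the most signal.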

Finding the best weights requires going through the features and identifying which ones are most useful on their own.

In [60]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

high_acc = []

for i in range(30):
    X = df.iloc[:, [i]].values  # Train on a single feature column at a time

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Perceptron learns the weights during fit(); they are never set explicitly.
    p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    p.fit(X_train, y_train)

    y_pred = p.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    if acc >= 0.60:
        high_acc.append(i)

    print('Misclassified samples: %d' % (y_test != y_pred).sum())
    print('Accuracy: %.2f' % acc, 'for feature index: ', i, '\n')

print('Feature indexes with at least 60% accuracy are: ', high_acc)

Misclassified samples: 1905
Accuracy: 0.43 for index:  0 

Misclassified samples: 1762
Accuracy: 0.47 for index:  1 

Misclassified samples: 1777
Accuracy: 0.46 for index:  2 

Misclassified samples: 1910
Accuracy: 0.42 for index:  3 

Misclassified samples: 1795
Accuracy: 0.46 for index:  4 

Misclassified samples: 1454
Accuracy: 0.56 for index:  5 

Misclassified samples: 1338
Accuracy: 0.60 for index:  6 

Misclassified samples: 388
Accuracy: 0.88 for index:  7 

Misclassified samples: 2087
Accuracy: 0.37 for index:  8 

Misclassified samples: 1823
Accuracy: 0.45 for index:  9 

Misclassified samples: 1880
Accuracy: 0.43 for index:  10 

Misclassified samples: 1773
Accuracy: 0.47 for index:  11 

Misclassified samples: 1905
Accuracy: 0.43 for index:  12 

Misclassified samples: 497
Accuracy: 0.85 for index:  13 

Misclassified samples: 1905
Accuracy: 0.43 for index:  14 

Misclassified samples: 1412
Accuracy: 0.57 for index:  15 

Misclassified samples: 1859
Accuracy: 0.44 for index

Based on the results above, the features with accuracy above 0.60 are at indexes 7, 13, and 25. Let's separate these out from the others and then review whether evaluating more than one feature at a time gives better accuracy.

In [40]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 13]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 13')

Misclassified samples: 231
Accuracy: 0.90 for indexes: 7 & 13


In [39]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 25')

Misclassified samples: 353
Accuracy: 0.84 for indexes: 7 & 25


In [38]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [13, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 13 & 25')

Misclassified samples: 417
Accuracy: 0.81 for indexes: 13 & 25


In [41]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 13, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7, 13 & 25')

Misclassified samples: 300
Accuracy: 0.86 for indexes: 7, 13 & 25


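Based on the above, accuracy for these combinations ranges from 0.81 to 0.90, with indexes 7 & 13 scoring highest. The misclassification counts printed for the three pairs (231, 353, and 417) are identical to those printed by the corresponding 80/20 cells further down, so these figures should be rerun with `test_size=0.3` before being treated as 70/30 results.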

At this point we've been training on 70% of the data and testing against 30% of the data.  What happens if we do this with 80% of the data used for training and 20% used for testing?

In [61]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

high_acc = []

for i in range(30):
    X = df.iloc[:, [i]].values  # Train on a single feature column at a time

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Perceptron learns the weights during fit(); they are never set explicitly.
    p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    p.fit(X_train, y_train)

    y_pred = p.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    if acc >= 0.60:
        high_acc.append(i)

    print('Misclassified samples: %d' % (y_test != y_pred).sum())
    print('Accuracy: %.2f' % acc, 'for feature index: ', i, '\n')

print('Feature indexes with at least 60% accuracy are: ', high_acc)

Misclassified samples: 929
Accuracy: 0.58 for feature index:  0 

Misclassified samples: 1269
Accuracy: 0.43 for feature index:  1 

Misclassified samples: 1182
Accuracy: 0.47 for feature index:  2 

Misclassified samples: 1269
Accuracy: 0.43 for feature index:  3 

Misclassified samples: 1199
Accuracy: 0.46 for feature index:  4 

Misclassified samples: 964
Accuracy: 0.56 for feature index:  5 

Misclassified samples: 942
Accuracy: 0.57 for feature index:  6 

Misclassified samples: 253
Accuracy: 0.89 for feature index:  7 

Misclassified samples: 824
Accuracy: 0.63 for feature index:  8 

Misclassified samples: 994
Accuracy: 0.55 for feature index:  9 

Misclassified samples: 1269
Accuracy: 0.43 for feature index:  10 

Misclassified samples: 1026
Accuracy: 0.54 for feature index:  11 

Misclassified samples: 761
Accuracy: 0.66 for feature index:  12 

Misclassified samples: 330
Accuracy: 0.85 for feature index:  13 

Misclassified samples: 1025
Accuracy: 0.54 for feature index:  14 

With this change, more indexes clear the 0.60 threshold: 8 and 12 now join 7, 13, and 25. We'll repeat the process above to determine whether a better evaluation can be done on this data, this time with every pairing of these indexes:

In [44]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 8]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 8')

Misclassified samples: 466
Accuracy: 0.79 for indexes: 7 & 8


In [45]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 12]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 12')

Misclassified samples: 372
Accuracy: 0.83 for indexes: 7 & 12


In [46]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 13]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 13')

Misclassified samples: 231
Accuracy: 0.90 for indexes: 7 & 13


In [47]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7 & 25')

Misclassified samples: 353
Accuracy: 0.84 for indexes: 7 & 25


In [49]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [8, 12]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 8 & 12')

Misclassified samples: 824
Accuracy: 0.63 for indexes: 8 & 12


In [50]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [8, 13]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 8 & 13')

Misclassified samples: 391
Accuracy: 0.82 for indexes: 8 & 13


In [51]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [8, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 8 & 25')

Misclassified samples: 682
Accuracy: 0.69 for indexes: 8 & 25


In [52]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [12, 13]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 12 & 13')

Misclassified samples: 776
Accuracy: 0.65 for indexes: 12 & 13


In [53]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [12, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 12 & 25')

Misclassified samples: 756
Accuracy: 0.66 for indexes: 12 & 25


In [54]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [13, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 13 & 25')

Misclassified samples: 417
Accuracy: 0.81 for indexes: 13 & 25


In [55]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, [7, 8, 12, 13, 25]].values

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for feature indexes: 7, 8, 12, 13, & 25')

Misclassified samples: 315
Accuracy: 0.86 for indexes: 7, 8, 12, 13, & 25
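
The pairwise cells above repeat the same steps with different column indexes. As a sketch of how that could be condensed, the block below wraps those steps in a helper and loops over `itertools.combinations`. The helper name `score_subset` and the synthetic data are assumptions made so the example is self-contained. With the real data one would pass `df.iloc[:, :30].values`, the label array `y`, and the candidate list `[7, 8, 12, 13, 25]`.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def score_subset(X, y, columns, test_size=0.2):
    """Train a Perceptron on the given column indexes and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, list(columns)], y, test_size=test_size, random_state=0)
    model = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in for the data set (made up): 4 ternary features,
# with a label that depends only on feature 0.
rng = np.random.default_rng(0)
X_demo = rng.choice([-1, 0, 1], size=(500, 4))
y_demo = np.where(X_demo[:, 0] >= 0, 1, -1)

candidates = [0, 1, 2, 3]  # with the real data: [7, 8, 12, 13, 25]
results = {pair: score_subset(X_demo, y_demo, pair)
           for pair in combinations(candidates, 2)}

# Print the pairs from most to least accurate.
for pair, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(acc, 2))
```

Passing `3` instead of `2` to `combinations` would cover triples in the same way, which avoids copying a cell for every subset.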


Based on this evaluation, indexes 7 & 13 with 80% training data and 20% testing data give the best result among the subsets tried: **90% of the test samples are classified correctly.**

Further analyzing the entire set of features with both the 70/30 and 80/20 splits shows the accuracy of using the entire data set. The **80/20 split results in the higher accuracy, 91%** (versus 89% for 70/30), which is barely higher than using just two indexes, 7 & 13, with the same 80/20 split. This may indicate that only a few features are necessary to assess phishing URLs accurately.

In [57]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, :30].values  # All 30 feature columns (indexes 0 through 29)

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.3, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for ALL indexes with 70/30 training/testing split')

Misclassified samples: 369
Accuracy: 0.89 for ALL indexes with 70/30 training/testing split


In [59]:
y = df.iloc[:, 30].values  # Label column: index 30 is the 31st column (zero-based)

X = df.iloc[:, :30].values  # All 30 feature columns (indexes 0 through 29)

X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=0)

#Perceptron function handles the creation of weights.  Never explicitly stated.
p = Perceptron(max_iter=40, eta0=0.1, random_state=0)
p.fit(X_train, y_train)

y_pred = p.predict(X_test)

print('Misclassified samples: %d' % (y_test != y_pred).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred), 'for ALL indexes with 80/20 training/testing split')

Misclassified samples: 203
Accuracy: 0.91 for ALL indexes with 80/20 training/testing split
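
The two cells above differ only in `test_size`, and each uses a single split with `random_state=0`. One way to gauge whether a two-point gap between splits is meaningful is to repeat each split with several seeds and compare the spread. The block below is a sketch of that idea under stated assumptions: the helper name `split_accuracy`, the seed range, and the synthetic data are all introduced here and are not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def split_accuracy(X, y, test_size, seed):
    """Accuracy of a Perceptron for one train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = Perceptron(max_iter=40, eta0=0.1, random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in (made up): 30 ternary features, noisy label from three of them.
rng = np.random.default_rng(0)
X_demo = rng.choice([-1, 0, 1], size=(1000, 30))
signal = X_demo[:, 7] + X_demo[:, 13] + X_demo[:, 25] + rng.normal(0, 0.5, 1000)
y_demo = np.where(signal >= 0, 1, -1)

summary = {}
for test_size in (0.3, 0.2):
    scores = [split_accuracy(X_demo, y_demo, test_size, seed) for seed in range(5)]
    summary[test_size] = (np.mean(scores), np.std(scores))
    print(f"test_size={test_size}: mean={summary[test_size][0]:.2f} "
          f"std={summary[test_size][1]:.2f}")
```

If the standard deviation across seeds is comparable to the difference between the two means, the choice of split ratio is probably not what drives the result.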
