<a href="https://colab.research.google.com/github/khaefner/M3AAWG_AI_Training_Phishing/blob/main/Phish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contents

1. [Loading Data](#loading_data)
2. Exploring the Data
3. Pre-process Data
4. [K-Nearest Neighbors](#knn)
5. Decision Tree
6. Random Forest
7. [Deep Neural Networks](#dnn)


First we need a dataset to work on. The one we'll be using was published on Kaggle in 2021 at the URL below.


https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset

Next we'll import a library that can load data from a comma-delimited file and a library used for matrix calculations.
The data file is hosted on the project's GitHub site.


In [None]:
#This hides some of the warnings we get in MLP
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd  # pandas is a data manipulation library
import numpy as np   # NumPy is a numerical computing library backed by compiled C code
from sklearn.model_selection import StratifiedKFold   # This gives us nice 'slices' of examples for training and testing

from sklearn.neighbors import KNeighborsClassifier  # K Nearest Neighbors
from sklearn.tree import DecisionTreeClassifier  # Decision Trees
from sklearn.ensemble import RandomForestClassifier  # Random Forest Classifier
from sklearn.neural_network import MLPClassifier   # Neural Network Classifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score  # Functions for calculating scores.


<a id='loading_data'></a>
# Loading Data
The data we are going to use is from a dataset hosted on Kaggle.

Here: [Phishing Dataset](https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset)

Original source of the data:

[Web Page Phishing Detection](https://data.mendeley.com/datasets/c2gw7fy2j4/3)

In [None]:
phishing_data = pd.read_csv("https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/dataset_phishing.csv")

# Exploring Data

In [None]:
print(phishing_data)

                                                     url  length_url  \
0                  http://www.crestonwood.com/router.php          37   
1      http://shadetreetechnology.com/V4/validation/a...          77   
2      https://support-appleld.com.secureupdate.duila...         126   
3                                     http://rgipt.ac.in          18   
4      http://www.iracing.com/tracks/gateway-motorspo...          55   
...                                                  ...         ...   
11425      http://www.fontspace.com/category/blackletter          45   
11426  http://www.budgetbots.com/server.php/Server%20...          84   
11427  https://www.facebook.com/Interactive-Televisio...         105   
11428             http://www.mypublicdomainpictures.com/          38   
11429  http://174.139.46.123/ap/signin?openid.pape.ma...         477   

       length_hostname  ip  nb_dots  nb_hyphens  nb_at  nb_qm  nb_and  nb_or  \
0                   19   0        3           0      0 

We see that this data has 89 columns. The columns that describe each URL, such as the length of the URL and the length of the hostname, are called *features*. Each row is a data point corresponding to one URL in the dataset. Rows are also called *examples*.

Note: one column has special meaning. The last column of the table, *status*, is the label for the website. We are going to do **supervised** learning, which means the algorithm learns from both the data and the label.

# Data Pre-Process

Next we need to clean up the data a bit and get it ready to analyze.  


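1.   Remove the URL column.  The raw URL text is not a numeric feature, and the features already extracted from it are what the model uses.
2.   Convert the label from text to a number (status=legitimate becomes 0, status=phishing becomes 1).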
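
As a minimal sketch of the two steps above, the same clean-up can be written by column name rather than by position. The tiny `toy` table below is made up for illustration and only borrows the column names `url`, `length_url`, and `status` from the dataset.

```python
import pandas as pd

# Toy stand-in for the phishing table (values are invented for illustration).
toy = pd.DataFrame({
    "url": ["http://a.example/login", "http://b.example/"],
    "length_url": [22, 17],
    "status": ["phishing", "legitimate"],
})

# Step 1: drop the raw URL column by name.
cleaned = toy.drop(columns=["url"])

# Step 2: map the text label to a number (legitimate -> 0, phishing -> 1).
cleaned["status"] = cleaned["status"].map({"legitimate": 0, "phishing": 1})

print(cleaned)
```

Dropping by name keeps working if the column order ever changes, which is the main reason to prefer it over a positional slice.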



In [None]:
#Get rid of the first column:
phishing_data = phishing_data.iloc[:, 1:]
#Print the result
print(phishing_data)

       length_url  length_hostname  ip  nb_dots  nb_hyphens  nb_at  nb_qm  \
0              37               19   0        3           0      0      0   
1              77               23   1        1           0      0      0   
2             126               50   1        4           1      0      1   
3              18               11   0        2           0      0      0   
4              55               15   0        2           2      0      0   
...           ...              ...  ..      ...         ...    ...    ...   
11425          45               17   0        2           0      0      0   
11426          84               18   0        5           0      1      1   
11427         105               16   1        2           6      0      1   
11428          38               30   0        2           0      0      0   
11429         477               14   1       24           0      1      1   

       nb_and  nb_or  nb_eq  ...  domain_in_title  domain_with_copyright  \

In [None]:
#Change the label classes to a one or a zero
phishing_data['status'] = phishing_data['status'].replace({'legitimate': 0, 'phishing': 1})
#Print the result
phishing_data

Unnamed: 0,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,nb_eq,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,37,19,0,3,0,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,0
1,77,23,1,1,0,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,1
2,126,50,1,4,1,0,1,2,0,3,...,1,0,0,14,4004,5828815,0,1,0,1
3,18,11,0,2,0,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,0
4,55,15,0,2,2,0,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11425,45,17,0,2,0,0,0,0,0,0,...,0,0,0,448,5396,3980,0,0,6,0
11426,84,18,0,5,0,1,1,0,0,1,...,1,0,0,211,6728,0,0,1,0,1
11427,105,16,1,2,6,0,1,0,0,1,...,0,0,0,2809,8515,8,0,1,10,0
11428,38,30,0,2,0,0,0,0,0,0,...,1,0,0,85,2836,2455493,0,0,4,0


The next thing we need to do is separate the dataset into two parts: the labels and the examples.

Labels are typically numbers.  Here we have two classes of data:

1 = Phishing site

0 = Not a phishing site

In [None]:
#First the Labels
y = phishing_data["status"].values
print(y)

[0 1 1 ... 0 0 1]


In [None]:
#Second the example data
X = phishing_data.drop("status", axis=1).values
print(X)

[[ 37.  19.   0. ...   1.   1.   4.]
 [ 77.  23.   1. ...   0.   1.   2.]
 [126.  50.   1. ...   0.   1.   0.]
 ...
 [105.  16.   1. ...   0.   1.  10.]
 [ 38.  30.   0. ...   0.   0.   4.]
 [477.  14.   1. ...   1.   1.   0.]]


In [None]:
#Let's see how many of the data are phishing and not phishing
print(phishing_data['status'].value_counts())

0    5715
1    5715
Name: status, dtype: int64


Great, we have a balanced dataset: equal representation of each label, phish and not phish. Now let's look at how the features relate to each other. There are two things we can look at: **Covariance** and **Correlation**.

---

Covariance: measures how two variables (features) vary with respect to each other. For example, an increase in a person's height tends to correspond to an increase in that person's weight. This would be a positive covariance.

---
Correlation: Correlation is a normalized covariance value, which means it is not affected by changes in scale. Correlation always falls between -1 and 1. A value of +1 indicates that the features have a strong direct relationship: as one increases, so does the other. A value of -1 indicates a strong inverse relationship: as one increases, the other decreases. A value near 0 means there is little or no linear relationship between the two.
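
To make the difference concrete, here is a small sketch using the height and weight example. The numbers are invented for illustration. Changing the unit of height from centimetres to metres changes the covariance but leaves the correlation untouched.

```python
import numpy as np

# Invented measurements for five people.
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([52.0, 58.0, 69.0, 77.0, 90.0])

# np.cov and np.corrcoef return 2x2 matrices; [0, 1] is the height/weight entry.
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

# Same people, height expressed in metres instead.
height_m = height_cm / 100.0
cov_m = np.cov(height_m, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

# Flipping the sign of one variable flips the sign of the correlation.
corr_inverse = np.corrcoef(height_cm, -weight_kg)[0, 1]

print(f"covariance  (cm): {cov_cm:.3f}   (m): {cov_m:.3f}")
print(f"correlation (cm): {corr_cm:.3f}   (m): {corr_m:.3f}")
print(f"correlation with inverted weight: {corr_inverse:.3f}")
```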


In [None]:
correlation_matrix = phishing_data.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)

print(sorted_corr['status'].head(50))

status                     1.000000
google_index               0.731171
ratio_digits_url           0.356395
domain_in_title            0.342807
phish_hints                0.335393
ip                         0.321698
nb_qm                      0.294319
length_url                 0.248580
nb_slash                   0.242270
length_hostname            0.238322
nb_eq                      0.233386
ratio_digits_host          0.224335
shortest_word_host         0.223084
prefix_suffix              0.214681
longest_word_path          0.212709
tld_in_subdomain           0.208884
empty_title                0.207043
nb_dots                    0.207029
longest_words_raw          0.200147
avg_word_path              0.197256
avg_word_host              0.193502
length_words_raw           0.192010
nb_and                     0.170546
avg_words_raw              0.167564
nb_com                     0.156284
statistical_report         0.143944
nb_at                      0.142915
abnormal_subdomain         0

As we can see above, *status* (the label) has a correlation of 1.0 with itself, which is what we would expect. The other features are ranked by how strongly they correlate with the label.
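
Sorting in descending order puts negatively correlated features at the bottom, even though a strong negative correlation is just as informative as a strong positive one. A possible variation, sketched here on a made-up table (the `feature_*` names are hypothetical), ranks features by the absolute value of their correlation with the label.

```python
import pandas as pd

# Invented data: one feature rises with the label, one falls, one is mostly noise.
toy = pd.DataFrame({
    "feature_up": [1, 2, 3, 4, 5, 6],
    "feature_down": [6, 5, 4, 3, 2, 1],
    "feature_noise": [3, 1, 3, 1, 3, 1],
    "status": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the label, excluding the label itself.
corr_with_label = toy.corr(numeric_only=True)["status"].drop("status")

# Rank by strength of the relationship, ignoring its direction.
ranked = corr_with_label.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```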

<a name="knn"></a>
# K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple machine learning algorithm that helps us make predictions based on similarity. Imagine you have a bunch of points on a graph, each with a label (like red or blue). KNN works by finding the K nearest points to a new, unlabeled point you want to classify. It then looks at the labels of those nearest points and decides the label for the new point based on majority rule. For example, if most of the nearest points are red, the new point would be classified as red. K is a number you choose, and it determines how many neighbors to consider. KNN is like asking your closest friends for advice – if most of them agree, you'll probably follow their suggestion.

![image.png](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/330px-KnnClassification.svg.png)
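
The majority-vote idea described above can be sketched in a few lines of NumPy. This is an illustration only, not how scikit-learn implements it; the helper name `knn_predict` and the red/blue points are made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest points.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two invented clusters of labeled points.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])

near_red = knn_predict(X_train, y_train, np.array([2.0, 2.0]))
near_blue = knn_predict(X_train, y_train, np.array([6.0, 7.0]))
print(near_red, near_blue)
```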

In [None]:

def KNN(X,y):
  skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
  accuracy = []
  precision = []
  recall = []
  f1 = []


  # StratifiedKFold splits the data into 10 folds ('n_splits=10') that preserve the class balance.
  # Each fold is used once for testing while the other nine are used for training,
  # so every score is measured on data the model has not seen.
  for train, test in skf.split(X,y):
     X_train, y_train = X[train], y[train] #training
     X_test, y_test = X[test], y[test] #testing

     knn = KNeighborsClassifier(n_neighbors=3)
     knn.fit(X_train, y_train)

     y_pred = knn.predict(X_test)

     accuracy.append(accuracy_score(y_test, y_pred))
     recall.append(recall_score(y_test, y_pred, average='macro'))
     precision.append(precision_score(y_test, y_pred, average='macro'))
     f1.append(f1_score(y_test, y_pred, average='macro'))


  average_accuracy = np.mean(accuracy)
  average_recall = np.mean(recall)
  average_precision = np.mean(precision)
  average_f1 = np.mean(f1)

  print(f"Accuracy: {average_accuracy}")
  print(f"Recall: {average_recall}")
  print(f"Precision: {average_precision}")
  print(f"F1 Score: {average_f1}")



In [None]:
KNN(X,y)

Accuracy: 0.8432195975503063
Recall: 0.8432190182846926
Precision: 0.8440841687189466
F1 Score: 0.8431202845241101
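
The fold loop and the four metric lists follow the same pattern for every model in this notebook. As an alternative sketch, scikit-learn's `cross_validate` can run the same stratified folds and compute all four scores in one call. The helper name `evaluate` is hypothetical, and the synthetic data from `make_classification` only stands in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the phishing features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)


def evaluate(model, X, y, n_splits=10):
    """Return the mean of each metric over stratified cross-validation folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scoring = ["accuracy", "recall_macro", "precision_macro", "f1_macro"]
    results = cross_validate(model, X, y, cv=skf, scoring=scoring)
    # cross_validate stores each metric under the key "test_<name>".
    return {name: results[f"test_{name}"].mean() for name in scoring}


scores = evaluate(KNeighborsClassifier(n_neighbors=3), X_toy, y_toy)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Passing a different estimator to `evaluate` would reuse the same folds and metrics, which keeps the comparison between models consistent.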


# Decision Tree

A [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree) is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into subsets based on the features that provide the best separation between classes (for classification) or the best predictive power (for regression). These splits are determined by evaluating criteria like Gini impurity or information gain for classification and mean squared error for regression. The process continues until a stopping criterion is met, such as reaching a maximum depth or having too few samples in a node. Once the tree is built, it can be used to make predictions by traversing the tree from the root node to a leaf node, which corresponds to the predicted class (in classification) or the predicted value (in regression) for the input data. Decision Trees are interpretable, which means you can easily understand the reasoning behind their predictions.

<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Decision-Tree-Elements.png" />
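
To illustrate the Gini criterion mentioned above, here is a small sketch that scores candidate splits on a single feature. The helper names `gini_impurity` and `split_impurity` and the data are made up for the example; a lower weighted impurity means a cleaner split.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two groups produced by a threshold split."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Invented feature values and labels: the classes separate cleanly at 30.
feature = np.array([10, 20, 30, 40, 50, 60])
labels = np.array([0, 0, 0, 1, 1, 1])

print("before any split:", gini_impurity(labels))
print("split at 30:", split_impurity(feature, labels, 30))
print("split at 10:", split_impurity(feature, labels, 10))
```

A tree-building algorithm would try many thresholds like these and keep the one with the lowest weighted impurity.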

In [None]:
def DT(X,y):
  skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
  accuracy = []
  precision = []
  recall = []
  f1 = []


  for train, test in skf.split(X,y):
     X_train, y_train = X[train], y[train] #training
     X_test, y_test = X[test], y[test] #testing

     dt = DecisionTreeClassifier(criterion='gini')  # Gini impurity measures how mixed the class labels are within a node
     dt.fit(X_train, y_train)

     y_pred = dt.predict(X_test)

     accuracy.append(accuracy_score(y_test, y_pred))
     recall.append(recall_score(y_test, y_pred, average='macro'))
     precision.append(precision_score(y_test, y_pred, average='macro'))
     f1.append(f1_score(y_test, y_pred, average='macro'))


  average_accuracy = np.mean(accuracy)
  average_recall = np.mean(recall)
  average_precision = np.mean(precision)
  average_f1 = np.mean(f1)

  print(f"Accuracy: {average_accuracy}")
  print(f"Recall: {average_recall}")
  print(f"Precision: {average_precision}")
  print(f"F1 Score: {average_f1}")

In [None]:
DT(X,y)

Accuracy: 0.9340332458442695
Recall: 0.9340324299168431
Precision: 0.934087831397672
F1 Score: 0.934030822378465


# Random Forest
Random Forest is an ensemble learning technique in machine learning that leverages a collection of decision trees to improve predictive accuracy and reduce overfitting. It works by creating multiple decision trees during training, where each tree is constructed using a random subset of the training data and a random subset of the features. When making predictions, each tree provides its individual prediction, and the final prediction is determined by taking a majority vote (classification) or averaging (regression) across all the individual tree predictions. This ensemble approach helps enhance the robustness and generalization of the model, as it combines the strengths of multiple decision trees while mitigating their individual weaknesses and biases.

  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Random_forest_diagram_complete.png/330px-Random_forest_diagram_complete.png" alt="drawing" width="50%"/>
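
The description above can be sketched by hand: train several decision trees on bootstrap samples, let each consider a random subset of features at every split, and take a majority vote. This is a simplified illustration on synthetic data, not scikit-learn's actual `RandomForestClassifier`, which, as far as its documentation describes, averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 or 1).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw rows with replacement.
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_toy[rows], y_toy[rows])
    trees.append(tree)

# One row of predictions per tree, shape (n_trees, n_samples).
all_votes = np.array([tree.predict(X_toy) for tree in trees])

# Majority vote: predict 1 when at least half of the trees say 1.
forest_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)

print("ensemble accuracy on the training data:", (forest_pred == y_toy).mean())
```

The accuracy printed here is measured on the training data, so it only shows that the voting mechanism works; it says nothing about generalization.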




In [None]:
# Random Forest
def RF(X,y):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accuracy = []
    precision = []
    recall = []
    f1 = []


    # StratifiedKFold splits the data into 10 folds ('n_splits=10') that preserve the class balance.
    # Each fold is used once for testing while the other nine are used for training,
    # so every score is measured on data the model has not seen.
    for train, test in skf.split(X,y):
      X_train, y_train = X[train], y[train] #training
      X_test, y_test = X[test], y[test] #testing


      # Create a Random Forest classifier
      rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

      # Fit the classifier to the training data
      rf_classifier.fit(X_train, y_train)

      # Make predictions on the test data
      y_pred = rf_classifier.predict(X_test)

      accuracy.append(accuracy_score(y_test, y_pred))
      recall.append(recall_score(y_test, y_pred, average='macro'))
      precision.append(precision_score(y_test, y_pred, average='macro'))
      f1.append(f1_score(y_test, y_pred, average='macro'))

    average_accuracy = np.mean(accuracy)
    average_recall = np.mean(recall)
    average_precision = np.mean(precision)
    average_f1 = np.mean(f1)

    print(f"Accuracy: {average_accuracy}")
    print(f"Recall: {average_recall}")
    print(f"Precision: {average_precision}")
    print(f"F1 Score: {average_f1}")


In [None]:
RF(X,y)

Accuracy: 0.9656167979002624
Recall: 0.9656165419519185
Precision: 0.9656625180250511
F1 Score: 0.9656160207555683


<a name="dnn"></a>
# Deep Neural Networks (DNNs)
DNNs take the basic perceptron and build networks of interconnected neurons, sometimes with many hidden layers.

<table width="100%">
<tr>
<td>
  <img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/Perceptron_work.png" alt="drawing"/>
</td>
<td>
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/MultiLayerPerceptron.svg/2560px-MultiLayerPerceptron.svg.png" alt="drawing" />
</td>
</tr>
</table>
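
As a rough sketch of the two ideas pictured above, the code below implements a single perceptron and then a forward pass through a network with one hidden layer. The weights are hand-picked or random for illustration, and the helper names `perceptron` and `mlp_forward` are made up for this example; no training takes place here.

```python
import numpy as np


def perceptron(x, w, b):
    """Single perceptron: weighted sum followed by a step activation."""
    return 1 if np.dot(w, x) + b > 0 else 0


# Hand-picked weights that make the perceptron behave like a logical AND.
w_and = np.array([1.0, 1.0])
b_and = -1.5
and_outputs = [perceptron(np.array(x), w_and, b_and)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print("AND outputs:", and_outputs)


def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass: input -> tanh hidden layer -> sigmoid output."""
    hidden = np.tanh(W1 @ x + b1)         # hidden layer activations
    logit = W2 @ hidden + b2              # weighted sum at the output neuron
    return 1.0 / (1.0 + np.exp(-logit))   # squash to a value between 0 and 1


# Random, untrained weights: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=3), rng.normal()

output = mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2)
print("MLP output:", output)
```

Training a network means adjusting `W1`, `b1`, `W2`, and `b2` so that outputs like this one match the labels.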




In [None]:
# Neural nets: we'll use the Multi-Layer Perceptron (MLP).
# An important hyperparameter is the learning rate, which affects how fast the algorithm converges (minimizes error).
# A learning rate that is too high will lead to sub-optimal solutions; too low and training will take forever.
def MLP(X,y):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accuracy = []
    precision = []
    recall = []
    f1 = []


    # StratifiedKFold splits the data into 10 folds ('n_splits=10') that preserve the class balance.
    # Each fold is used once for testing while the other nine are used for training,
    # so every score is measured on data the model has not seen.
    for train, test in skf.split(X,y):
      X_train, y_train = X[train], y[train] #training
      X_test, y_test = X[test], y[test] #testing

      mlp = MLPClassifier(
            solver='adam',            # 'adam' is a stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba
            hidden_layer_sizes=(20,), # one hidden layer with 20 neurons
            activation='tanh',
            max_iter=20,              # maximum number of training epochs
            validation_fraction=0.2,  # only used when early_stopping=True
            learning_rate_init=0.01,  # initial step size for weight updates; a small positive value
        )

      # Fit (train) the classifier
      mlp.fit(X_train,y_train)

      #Predict the results using the set-aside test data.
      y_pred = mlp.predict(X_test)

      accuracy.append(accuracy_score(y_test, y_pred))
      recall.append(recall_score(y_test, y_pred, average='macro'))
      precision.append(precision_score(y_test, y_pred, average='macro'))
      f1.append(f1_score(y_test, y_pred, average='macro'))

    average_accuracy = np.mean(accuracy)
    average_recall = np.mean(recall)
    average_precision = np.mean(precision)
    average_f1 = np.mean(f1)

    print(f"Accuracy: {average_accuracy}")
    print(f"Recall: {average_recall}")
    print(f"Precision: {average_precision}")
    print(f"F1 Score: {average_f1}")

In [None]:
MLP(X,y)

Accuracy: 0.7050743657042869
Recall: 0.7050457423487196
Precision: 0.7168306231658895
F1 Score: 0.7010568065235776
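
The MLP scores noticeably lower than the tree-based models. One possible contributor is that the features live on very different scales (for example, `ip` is 0 or 1 while `web_traffic` reaches into the millions), and gradient-based training is usually sensitive to that. The sketch below shows one common remedy, standardizing features inside a pipeline, on synthetic data. It is an assumption that this would help on the phishing data; the larger `max_iter` is also a change from the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, then stretch the columns so their scales differ widely.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_toy = X_toy * np.logspace(0, 5, num=10)

mlp = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh",
                    max_iter=500, random_state=0)

# The pipeline fits the scaler on each training fold only, then applies it
# to the matching test fold, so no test information leaks into training.
scaled_mlp = make_pipeline(StandardScaler(), mlp)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(scaled_mlp, X_toy, y_toy, cv=cv, scoring="accuracy")

print("mean accuracy with scaling:", fold_scores.mean())
```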


In [5]:
!pip install Levenshtein
from joblib.numpy_pickle import load
import requests
import hashlib
import os

timeout_seconds = 10

#create the scratch directory for our imported python files
if not os.path.exists("data"):
    os.makedirs("data")

#This is to load files from the data directory
init_file_path = os.path.join("data", "__init__.py")

# Create an empty __init__.py file
with open(init_file_path, "w") as init_file:
    pass

#Data extraction scripts are hosted here: https://data.mendeley.com/datasets/c2gw7fy2j4/3
url_features_url ="https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/url_features.py"
url_features_sha="66e2cbcd2760dd8b08600fdc80edd1f788a7c2a4ca1b03ae4de02d4fc635ad4b"

all_brands_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/allbrands.txt"
all_brands_sha = "58fc066042181abbb1b42dd9ebf046dd0826347f93a6c8a6c129a4b8fb252efe"

content_features_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/content_features.py"
content_features_sha = ""

external_features_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/external_features.py"
external_features_sha = ""

feature_extractor_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/feature_extractor.py"
feature_extractor_sha = ""



def calculate_hash(file_path,expected_hash):
  # Create a SHA-256 hash object
  sha256 = hashlib.sha256()

  # Read the file in binary mode and update the hash object
  with open(file_path, "rb") as file:
      while True:
          data = file.read(65536)  # You can adjust the buffer size as needed
          if not data:
              break
          sha256.update(data)
  calculated_hash = sha256.hexdigest()
  # True when the file's SHA-256 digest matches the expected value
  return calculated_hash == expected_hash


def get_lib(url, destination_path, expected_hash):
  local_path = os.path.join("data", destination_path)
  try:
    response = requests.get(url, timeout=timeout_seconds)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Write the response body to the local file in binary mode
        with open(local_path, "wb") as file:
          file.write(response.content)
        # Verify the hash only after the file is closed, so every byte has been flushed to disk
        if calculate_hash(local_path, expected_hash):
            print(f"File integrity passed: {local_path}")
        else:
            print(f"File integrity failed: {local_path}")
            return
        print(f"File downloaded to {local_path}")

    else:
        print(f"Failed to download file. Status code: {response.status_code}")
  except requests.exceptions.Timeout:
    print(f"Request timed out after {timeout_seconds} seconds.")
  except requests.exceptions.RequestException as e:
    print(f"An error occurred during the request: {str(e)}")

get_lib(url_features_url,"url_features.py",url_features_sha)
get_lib(all_brands_url,"allbrands.txt",all_brands_sha)







Collecting Levenshtein
  Downloading Levenshtein-0.21.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (172 kB)
Collecting rapidfuzz<4.0.0,>=2.3.0 (from Levenshtein)
  Downloading rapidfuzz-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.21.1 rapidfuzz-3.3.0
File integrity failed: data/url_features.py
File integrity failed: data/allbrands.txt
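
An integrity check can also be done on the downloaded bytes in memory, before anything is written to disk. The sketch below uses only `hashlib`. The helper names `sha256_hex` and `verify_bytes` are hypothetical, and the payload is a placeholder rather than one of the real files.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of `payload` as a lowercase hex string."""
    return hashlib.sha256(payload).hexdigest()


def verify_bytes(payload: bytes, expected_hash: str) -> bool:
    """True when the payload's digest matches the expected hex digest."""
    return sha256_hex(payload) == expected_hash.strip().lower()


# Placeholder content standing in for a downloaded file.
payload = b"example file contents"
expected = sha256_hex(payload)

matches = verify_bytes(payload, expected)            # unchanged content
tampered = verify_bytes(payload + b"!", expected)    # content was modified
missing = verify_bytes(payload, "")                  # no expected hash supplied

print(matches, tampered, missing)
```

An empty expected hash never matches, so files whose `*_sha` value is left blank would always be reported as failing this check.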


In [None]:
import data.url_features