<a href="https://colab.research.google.com/github/khaefner/M3AAWG_AI_Training_Phishing/blob/main/Phish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contents

1. [Loading Data](#loading_data)
2. Exploring the Data
3. Pre-process Data
4. K-Nearest Neighbors
5. [Deep Neural Networks](#dnn)
6. Feature Selection (are all features created equal?)



First we need a dataset to work on.  The one we'll be using was published in 2021 on Kaggle at the URL below.


https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset

Next we'll import a library that can load data from a comma-delimited file (pandas) and a library used for matrix calculations (NumPy).
The data file is hosted on the GitHub site.


In [None]:
#This hides some of the warnings we get in MLP
def warn(*args, **kwargs):
    pass
from termcolor import colored
import warnings
warnings.warn = warn

import pandas as pd  # pandas is a data manipulation library
import numpy as np   # NumPy is a numerical computing library backed by compiled C code
from sklearn.model_selection import StratifiedKFold   #This gives us nice 'slices' of examples for training and testing

from sklearn.neighbors import KNeighborsClassifier  # K Nearest Neighbors
from sklearn.tree import DecisionTreeClassifier  # Decision Trees
from sklearn.ensemble import RandomForestClassifier  # Random Forest Classifier
from sklearn.neural_network import MLPClassifier   #Neural Network Classifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score  #Libraries for calculating scores.


<a id='loading_data'></a>
# Loading Data
The data we are going to use is from a dataset hosted on Kaggle:

[Phishing Dataset](https://www.kaggle.com/datasets/shashwatwork/web-page-phishing-detection-dataset)

Original source of the data:

[Web Page Phishing Detection](https://data.mendeley.com/datasets/c2gw7fy2j4/3)

In [None]:
phishing_data = pd.read_csv("https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/dataset_phishing.csv")

# Exploring Data

In [None]:
print(phishing_data)

We see that this data has 88 columns.  These are called *features*.  In this data they are things like the length of the URL, the length of the hostname, etc.  Rows are data points, each corresponding to one of the domains.  These are also called *examples*.

Note:  There is one column that has special meaning.  This is the last column in the table above, called *status*.  This is the label for the website.  We are going to do **supervised** learning, which means the algorithm is going to learn from both the data and the label.

# Data Pre-Process

Next we need to clean up the data a bit and get it ready to analyze.  


1.   Remove the URL column.  The actual URL is not useful to the model.
2.   Convert the label from text to a number (status=legitimate becomes 0, status=phishing becomes 1).



In [None]:
#Get rid of the first column:
phishing_data = phishing_data.iloc[:, 1:]
#Print the result
print(phishing_data)

In [None]:
#Change the label classes to a one or a zero
phishing_data['status'] = phishing_data['status'].replace({'legitimate': 0, 'phishing': 1})
#Print the result
phishing_data

The next thing we need to do is separate the dataset into two parts: the labels and the examples.

Typically Labels are numbers.  Here we have two classes of data:

1 = Phishing site

0 = Not Phishing Site

In [None]:
#First the Labels
y = phishing_data["status"].values
print(y)

In [None]:
#Second the example data
X = phishing_data.drop("status", axis=1).values
print(X)

In [None]:
#Let's see how many of the data are phishing and not phishing
print(phishing_data['status'].value_counts())

Great, we have a balanced dataset: equal representation of each label, phish and not phish.  Now let's look at how the features relate to each other.  There are two things we can look at: **Covariance** and **Correlation**.

---

Covariance:  measures how two variables (features) vary with respect to each other.  For example, an increase in a person's height tends to correspond to an increase in that person's weight.  This would be a positive covariance.
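
As a small illustration of covariance, the sketch below uses made-up height and weight values (they are not part of the phishing dataset) and computes the sample covariance both by hand and with `np.cov`.

```python
import numpy as np

# Made-up example values, not taken from the phishing dataset
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 58.0, 69.0, 77.0, 91.0])

# Sample covariance by hand: sum of products of deviations from the mean, divided by n - 1
deviation_products = (height_cm - height_cm.mean()) * (weight_kg - weight_kg.mean())
manual_cov = deviation_products.sum() / (len(height_cm) - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(height, weight)
numpy_cov = np.cov(height_cm, weight_kg)[0, 1]

print(manual_cov, numpy_cov)
```

Both numbers agree and are positive, because taller people in this toy sample are also heavier.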

---
Correlation: Correlation is a normalized covariance value, which means it is not affected by changes in scale.  Correlation makes the comparison measure fall between -1 and 1.  A value of +1 indicates that the features have a direct and strong (positive linear) relationship.  Conversely, a value of -1 means that the features have a strong inverse relationship: as one increases, the other decreases.  A value near 0 means there is little or no linear relationship between them.
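
To make the "normalized" part concrete, here is a minimal sketch using the same kind of made-up height and weight values. It shows that changing units rescales the covariance but leaves the correlation untouched, and that a perfectly inverse linear relationship yields a correlation of -1.

```python
import numpy as np

# Made-up example values, not taken from the phishing dataset
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 58.0, 69.0, 77.0, 91.0])

# Changing units (cm -> m) rescales the covariance ...
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_cm / 100.0, weight_kg)[0, 1]

# ... but leaves the correlation unchanged
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_cm / 100.0, weight_kg)[0, 1]

# A perfectly inverse linear relationship gives a correlation of -1
corr_inverse = np.corrcoef(height_cm, -2.0 * height_cm + 5.0)[0, 1]

print(cov_cm, cov_m)
print(corr_cm, corr_m)
print(corr_inverse)
```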


In [None]:
correlation_matrix = phishing_data.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)

print(sorted_corr['status'].head(50))

As we can see above, *status* (the label) has a correlation of 1.0 with itself, which is what we would expect.  The other features are ranked by their correlation with the label.

# K Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple machine learning algorithm that helps us make predictions based on similarity. Imagine you have a bunch of points on a graph, each with a label (like red or blue). KNN works by finding the K nearest points to a new, unlabeled point you want to classify. It then looks at the labels of those nearest points and decides the label for the new point based on majority rule. For example, if most of the nearest points are red, the new point would be classified as red. K is a number you choose, and it determines how many neighbors to consider. KNN is like asking your closest friends for advice – if most of them agree, you'll probably follow their suggestion.

![image.png](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/330px-KnnClassification.svg.png)
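
The majority-vote idea can be sketched from scratch in a few lines. This is only an illustration on made-up 2-D points; the helper name `knn_predict` is hypothetical, and the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy points (made up): label 0 clusters near the origin, label 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                  [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.8, 0.8])))
print(knn_predict(X_toy, y_toy, np.array([5.2, 5.1])))
```

Using an odd `k` with two classes avoids tied votes.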

In [None]:

def KNN(X,y):
  skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
  accuracy = []
  precision = []
  recall = []
  f1 = []


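  # We train and evaluate the model 10 times ('n_splits=10'). The data is shuffled once and split into 10 folds;
  # each fold is used as the test set exactly once. Averaging the scores gives a more reliable estimate and helps reveal overfitting.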
  for train, test in skf.split(X,y):
     X_train, y_train = X[train], y[train] #training
     X_test, y_test = X[test], y[test] #testing

     knn = KNeighborsClassifier(n_neighbors=3)
     knn.fit(X_train, y_train)

     y_pred = knn.predict(X_test)

     accuracy.append(accuracy_score(y_test, y_pred))
     recall.append(recall_score(y_test, y_pred, average='macro'))
     precision.append(precision_score(y_test, y_pred, average='macro'))
     f1.append(f1_score(y_test, y_pred, average='macro'))


  average_accuracy = np.mean(accuracy)
  average_recall = np.mean(recall)
  average_precision = np.mean(precision)
  average_f1 = np.mean(f1)

  print(f"Accuracy: {average_accuracy}")
  print(f"Recall: {average_recall}")
  print(f"Precision: {average_precision}")
  print(f"F1 Score: {average_f1}")

  return knn


In [None]:
knn = KNN(X,y)
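
To see what `StratifiedKFold` is doing inside the function above, here is a small sketch on made-up labels (20 examples, half of each class, split into 5 folds rather than 10). Each example lands in the test set exactly once, and every test fold keeps the same class balance as the full data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (made up): 20 examples, 50% of each class
y_toy = np.array([0] * 10 + [1] * 10)
X_toy = np.arange(20).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_sizes = []
test_positive_fractions = []
seen_in_test = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    test_sizes.append(len(test_idx))
    test_positive_fractions.append(y_toy[test_idx].mean())
    seen_in_test.extend(test_idx.tolist())
    print(f"train={len(train_idx)} test={len(test_idx)} "
          f"fraction of class 1 in test={y_toy[test_idx].mean():.2f}")
```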

# Decision Tree

A [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree) is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into subsets based on the features that provide the best separation between classes (for classification) or the best predictive power (for regression). These splits are determined by evaluating criteria like Gini impurity or information gain for classification and mean squared error for regression. The process continues until a stopping criterion is met, such as reaching a maximum depth or having too few samples in a node. Once the tree is built, it can be used to make predictions by traversing the tree from the root node to a leaf node, which corresponds to the predicted class (in classification) or the predicted value (in regression) for the input data. Decision Trees are interpretable, which means you can easily understand the reasoning behind their predictions.

<img src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Decision-Tree-Elements.png" />
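
As a rough sketch of how a split is scored, the code below computes Gini impurity on a made-up feature and compares two candidate thresholds. The helper names `gini` and `split_impurity` are hypothetical; scikit-learn's `DecisionTreeClassifier` does this search internally over every feature.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy feature (made up): a URL-length-like value, with 1 = phishing
url_length = np.array([20, 25, 30, 35, 80, 90, 95, 120])
status = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(gini(status))                            # impurity before any split
print(split_impurity(url_length, status, 35))  # threshold that separates the classes
print(split_impurity(url_length, status, 25))  # threshold that leaves one side mixed
```

The tree picks the threshold with the lowest weighted impurity, then repeats the process on each side.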

In [None]:
def DT(X,y):
  skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
  accuracy = []
  precision = []
  recall = []
  f1 = []


  for train, test in skf.split(X,y):
     X_train, y_train = X[train], y[train] #training
     X_test, y_test = X[test], y[test] #testing

     dt = DecisionTreeClassifier(criterion='gini')  # Gini impurity measures how mixed the class labels are within a node
     dt.fit(X_train, y_train)

     y_pred = dt.predict(X_test)

     accuracy.append(accuracy_score(y_test, y_pred))
     recall.append(recall_score(y_test, y_pred, average='macro'))
     precision.append(precision_score(y_test, y_pred, average='macro'))
     f1.append(f1_score(y_test, y_pred, average='macro'))


  average_accuracy = np.mean(accuracy)
  average_recall = np.mean(recall)
  average_precision = np.mean(precision)
  average_f1 = np.mean(f1)

  print(f"Accuracy: {average_accuracy}")
  print(f"Recall: {average_recall}")
  print(f"Precision: {average_precision}")
  print(f"F1 Score: {average_f1}")

  return dt

In [None]:
dt = DT(X,y)

# Random Forest
Random Forest is an ensemble learning technique in machine learning that leverages a collection of decision trees to improve predictive accuracy and reduce overfitting. It works by creating multiple decision trees during training, where each tree is constructed using a random subset of the training data and a random subset of the features. When making predictions, each tree provides its individual prediction, and the final prediction is determined by taking a majority vote (classification) or averaging (regression) across all the individual tree predictions. This ensemble approach helps enhance the robustness and generalization of the model, as it combines the strengths of multiple decision trees while mitigating their individual weaknesses and biases.

  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Random_forest_diagram_complete.png/330px-Random_forest_diagram_complete.png" alt="drawing" width="50%"/>
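
The ensemble idea can be sketched by hand: train several trees on bootstrap samples and let them vote. This is an illustration on synthetic data from `make_classification`, not the notebook's method. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the phishing features
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(15):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    # max_features="sqrt" considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_toy[rows], y_toy[rows])
    trees.append(tree)

# Each tree votes; the ensemble prediction is the majority class
all_votes = np.array([tree.predict(X_toy) for tree in trees])  # shape (15, 300)
majority = (all_votes.mean(axis=0) > 0.5).astype(int)

forest_accuracy = (majority == y_toy).mean()
print("ensemble accuracy on the training data:", forest_accuracy)
```

An odd number of trees is used here so that a two-class vote can never tie.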




In [None]:
#Random Forest
def RF(X,y):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accuracy = []
    precision = []
    recall = []
    f1 = []


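    # We train and evaluate the model 10 times ('n_splits=10'). The data is shuffled once and split into 10 folds;
    # each fold is used as the test set exactly once. Averaging the scores gives a more reliable estimate and helps reveal overfitting.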
    for train, test in skf.split(X,y):
      X_train, y_train = X[train], y[train] #training
      X_test, y_test = X[test], y[test] #testing


      # Create a Random Forest classifier
      rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

      # Fit the classifier to the training data
      rf_classifier.fit(X_train, y_train)

      # Make predictions on the test data
      y_pred = rf_classifier.predict(X_test)

      accuracy.append(accuracy_score(y_test, y_pred))
      recall.append(recall_score(y_test, y_pred, average='macro'))
      precision.append(precision_score(y_test, y_pred, average='macro'))
      f1.append(f1_score(y_test, y_pred, average='macro'))

    average_accuracy = np.mean(accuracy)
    average_recall = np.mean(recall)
    average_precision = np.mean(precision)
    average_f1 = np.mean(f1)

    print(f"Accuracy: {average_accuracy}")
    print(f"Recall: {average_recall}")
    print(f"Precision: {average_precision}")
    print(f"F1 Score: {average_f1}")

    return rf_classifier

In [None]:
rf = RF(X,y)

<a name="dnn"></a>
# Deep Neural Networks (DNNs)
DNNs take the primitive perceptron and build complex networks of interconnected neurons, sometimes with many hidden layers.

<table width="100%">
<tr>
<td>
  <img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/Perceptron_work.png" alt="drawing"/>
</td>
<td>
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/MultiLayerPerceptron.svg/2560px-MultiLayerPerceptron.svg.png" alt="drawing" />
</td>
</tr>
</table>
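
To connect the two pictures, here is a minimal sketch of a single perceptron and of a forward pass through a network with one hidden layer. The weights are hand-picked or random (untrained), and the function names are hypothetical; this only shows the computation that a trained `MLPClassifier` performs at prediction time.

```python
import numpy as np

def perceptron(x, weights, bias):
    """Single perceptron: weighted sum followed by a step activation."""
    return 1 if np.dot(weights, x) + bias > 0 else 0

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer with tanh, then a sigmoid output in (0, 1)."""
    hidden = np.tanh(W1 @ x + b1)
    logit = W2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))

# A perceptron with hand-picked weights that computes logical AND
and_weights, and_bias = np.array([1.0, 1.0]), -1.5
and_outputs = [perceptron(np.array(p, dtype=float), and_weights, and_bias)
               for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(and_outputs)

# A tiny network with random (untrained) weights: 3 inputs -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=4), 0.0
output = mlp_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2)
print(output)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that the output matches the labels.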




In [None]:
#Neural Nets: we'll use the Multi-Layer Perceptron (MLP).
#An important hyper-parameter is the learning rate. It affects how fast the algorithm converges (minimizes errors).
#A learning rate that is too high will lead to sub-optimal solutions; too low and it will take forever.
def MLP(X,y):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    accuracy = []
    precision = []
    recall = []
    f1 = []


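    # We train and evaluate the model 10 times ('n_splits=10'). The data is shuffled once and split into 10 folds;
    # each fold is used as the test set exactly once. Averaging the scores gives a more reliable estimate and helps reveal overfitting.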
    for train, test in skf.split(X,y):
      X_train, y_train = X[train], y[train] #training
      X_test, y_test = X[test], y[test] #testing

      mlp = MLPClassifier(
            solver='adam',            #'adam' refers to a stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba
            hidden_layer_sizes=(20,),
            activation='tanh',
            max_iter=20,
            validation_fraction=0.2,
            learning_rate_init=0.01,   #The step size used when updating the weights: a small positive value, often in the range between 0.0 and 1.0.
        )

      # Fit (train) the classifier
      mlp.fit(X_train,y_train)

      #Predict the results using the set-aside test data.
      y_pred = mlp.predict(X_test)

      accuracy.append(accuracy_score(y_test, y_pred))
      recall.append(recall_score(y_test, y_pred, average='macro'))
      precision.append(precision_score(y_test, y_pred, average='macro'))
      f1.append(f1_score(y_test, y_pred, average='macro'))

    average_accuracy = np.mean(accuracy)
    average_recall = np.mean(recall)
    average_precision = np.mean(precision)
    average_f1 = np.mean(f1)

    print(f"Accuracy: {average_accuracy}")
    print(f"Recall: {average_recall}")
    print(f"Precision: {average_precision}")
    print(f"F1 Score: {average_f1}")

    return mlp

In [None]:
mlp = MLP(X,y)
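
The effect of the learning rate mentioned in the comments above can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3. This is only an illustration: `MLPClassifier` uses the Adam optimizer on a far more complex loss, and the function name is hypothetical.

```python
def gradient_descent(learning_rate, steps=50, start=0.0):
    """Minimize f(w) = (w - 3)^2 with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)   # derivative of (w - 3)^2
        w -= learning_rate * gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With the smallest rate the result has barely moved from the starting point, the middle rate reaches the minimum, and the largest rate overshoots further on every step and diverges.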

# Feature Selection
Not all features are equal.  Remember that some of them have a higher correlation with the label than others.
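
One point worth keeping in mind: a strongly *negative* correlation can be as informative as a strongly positive one. The sketch below is a variation (not the selection used in this notebook) that ranks features by the absolute value of their correlation with the label. It uses a synthetic DataFrame whose column names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)

# Synthetic columns (names are made up): two track the label, one is pure noise
toy = pd.DataFrame({
    "strong_positive": label + rng.normal(scale=0.3, size=n),
    "strong_negative": -label + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
    "status": label,
})

# Correlation of every feature with the label, excluding the label itself
corr_with_label = toy.corr()["status"].drop("status")

# Rank by absolute value so strong negative correlations are not discarded
ranked = corr_with_label.abs().sort_values(ascending=False)
top_two = ranked.head(2).index.tolist()
print(ranked)
print(top_two)
```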





In [None]:
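def feature_selection(num_top_correlated=10):
    # sorted_corr is sorted by correlation with 'status' (descending); row 0 is 'status' itself
    top_list = sorted_corr['status'].head(num_top_correlated)
    print(top_list)
    X_orig = phishing_data.drop("status", axis=1)
    # Skip row 0 ('status') and keep the names of the next num_top_correlated features
    top_features=sorted_corr[1:num_top_correlated+1].index
    print(top_features)
    return X_orig[top_features].values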


In [None]:
X_prime = feature_selection()

In [None]:

print("----------------")
print("KNN All Features")
print("________________")
KNN(X,y)
print("----------------")
print("KNN Selected Features")
print("________________")
KNN(X_prime,y)
print("\n")

print("----------------")
print("DT All Features")
print("________________")
DT(X,y)
print("----------------")
print("DT Selected Features")
print("________________")
DT(X_prime,y)
print("\n")

print("----------------")
print("RF All Features")
print("________________")
RF(X,y)
print("----------------")
print("RF Selected Features")
print("________________")
RF(X_prime,y)
print("\n")

print("----------------")
print("MLP All Features")
print("________________")
MLP(X,y)
print("----------------")
print("MLP Selected Features")
print("________________")
MLP(X_prime,y)


Surprised?
A lower correlation among ensemble model members will increase the error-correcting capability of the model. So it is preferred to use models with low correlations when creating ensembles.
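
The error-correcting effect can be simulated. In this sketch (made-up numbers, not results from the models above) three classifiers are each right 70% of the time on a two-class problem. When their errors are independent, the majority vote does better than any single member; when they always err together, voting gains nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, accuracy = 100_000, 0.7

# Three classifiers with independent errors: each row marks where one member is right
independent_correct = rng.random((3, n_examples)) < accuracy

# Three perfectly correlated classifiers: right or wrong on the same examples
shared = rng.random(n_examples) < accuracy
correlated_correct = np.tile(shared, (3, 1))

def majority_vote_accuracy(correct):
    """Fraction of examples where at least 2 of the 3 members are right."""
    return (correct.sum(axis=0) >= 2).mean()

independent_acc = majority_vote_accuracy(independent_correct)
correlated_acc = majority_vote_accuracy(correlated_correct)
print(independent_acc, correlated_acc)
```

For independent members the expected vote accuracy is 3(0.7)^2(0.3) + (0.7)^3 = 0.784, compared with 0.7 for the correlated case.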

# Classify

Now let's use our trained models to make some predictions!

Caveat:  The data set is not exactly the same.  We are not computing the randomness in the domain or the domain age.

In [None]:

from joblib.numpy_pickle import load
import requests
import hashlib
import os

timeout_seconds = 10

#create the scratch directory for our imported python files
if not os.path.exists("data"):
    os.makedirs("data")

#This is to load files from the data directory
init_file_path = os.path.join("data", "__init__.py")

# Create an empty __init__.py file
with open(init_file_path, "w") as init_file:
    pass

#Data extraction scripts are hosted here: https://data.mendeley.com/datasets/c2gw7fy2j4/3
# I have them on a GitHub page as I needed to remove some NLP dependencies
all_brands_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/allbrands.txt"
all_brands_sha = "58fc066042181abbb1b42dd9ebf046dd0826347f93a6c8a6c129a4b8fb252efe"

url_features_url ="https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/url_features.py"
url_features_sha="6c36b9db9518f8e2bf12d1cc2b5eae3ef88fe7f517280792d9895af73028c78b"

content_features_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/content_features.py"
content_features_sha = "3165d2aa24322bb8db79f59070b4b4930661ad4afc54d56cf915261e06bc9d28"

external_features_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/external_features.py"
external_features_sha = "b4a0b2147163cf0c12d66e2392d443f4b1b131de3b84a3b9403d2f7ed00171cb"

feature_extractor_url = "https://raw.githubusercontent.com/khaefner/M3AAWG_AI_Training_Phishing/main/feature_extractor.py"
feature_extractor_sha = "4638d1d578fd80be7e2adae1917280e39d0c2906d8db1e3a44ae571bb5e8a317"



def calculate_hash(file_path,expected_hash):
  # Create a SHA-256 hash object
  sha256 = hashlib.sha256()

  # Read the file in binary mode and update the hash object
  with open(file_path, "rb") as file:
      while True:
          data = file.read(65536)  # You can adjust the buffer size as needed
          if not data:
              break
          sha256.update(data)
  calculated_hash = sha256.hexdigest()
  return calculated_hash == expected_hash


def get_lib(url, destination_path, expected_hash):
  try:
    response = requests.get(url, timeout=timeout_seconds)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Open the local file in binary write mode and write the content of the response to it
        with open("data/"+destination_path, "wb") as file:
          file.write(response.content)
        # Verify the hash only after the file is closed, so all bytes have been flushed to disk
        if calculate_hash("data/"+destination_path,expected_hash):
            print(f"File integrity passed: {'data/'+destination_path}")
        else:
            print(f"File integrity failed: {'data/'+destination_path}")
            return
        print(f"File downloaded to {'data/'+destination_path}")

    else:
        print(f"Failed to download file. Status code: {response.status_code}")
  except requests.exceptions.Timeout:
    print(f"Request timed out after {timeout_seconds} seconds.")
  except requests.exceptions.RequestException as e:
    print(f"An error occurred during the request: {str(e)}")

get_lib(url_features_url,"url_features.py",url_features_sha)
get_lib(all_brands_url,"allbrands.txt",all_brands_sha)
get_lib(content_features_url,"content_features.py",content_features_sha)
get_lib(external_features_url,"external_features.py",external_features_sha)
get_lib(feature_extractor_url,"feature_extractor.py",feature_extractor_sha)







In [None]:
#Now let's import some of the code to analyze new websites.
!pip install Levenshtein # Levenshtein distance is a string metric for measuring the difference between two sequences.
!pip install whois
!pip install dnspython
!pip install tldextract

In [None]:
import data.url_features
import data.content_features
import data.external_features
import data.feature_extractor as fextract2

In [None]:
url="https://cnn.com"
site_data = fextract2.extract_features(url)
print(site_data)

In [None]:
label = site_data.pop() #we don't need the label...this is what we want to predict
print(f"label: {label}")
site_data = site_data[1:]   #slice off the url
print(site_data)
site_data_array = np.array(site_data) # convert to numpy array
reshaped_site_data_array = site_data_array.reshape(1,-1)
#print(reshaped_site_data_array)
result = knn.predict(reshaped_site_data_array)
print(f"KNN: {result}")
print("Phishing" if result == 1 else "Not Phishing")
result = dt.predict(reshaped_site_data_array)
print(f"DT: {result}")
print("Phishing" if result == 1 else "Not Phishing")
result =rf.predict(reshaped_site_data_array)
print(f"RF: {result}")
print("Phishing" if result == 1 else "Not Phishing")
result = mlp.predict(reshaped_site_data_array)
print(f"MLP: {result}")
print("Phishing" if result == 1 else "Not Phishing")
