# Classifying Phishing Websites with an Attribute-Level RNN

To detect whether a given URL has a high probability of being a Phishing Website, we make use of a Recurrent Neural Network (RNN) as an attribute-level classifier for Website URLs.

The RNN reads each URL as a sequence of attributes. At each step it outputs a prediction and a "hidden state", and it feeds that hidden state into the next step. We take the final prediction as the output, i.e. which class the URL belongs to: Phishing Website or Non-Phishing Website.

# Data Preparation

For this network, we used a CSV file in which each URL has been split into attributes. The data can be downloaded separately at https://phishtank.com/. Specifically, we used 10,988 URLs from the PhishTank database (as of March 31, 2022), each confirmed to be either a Phishing Website URL or a Non-Phishing Website URL, to train our network.

First, we split the data into domains, attributes, and labels. Afterwards, to get a more reliable estimate of how well our network generalises, we split our data using Stratified K-Folds Cross-Validation with 12 folds.

Each URL has a label of either 1 (Phishing Website) or 0 (Non-Phishing Website).

In [6]:
import numpy as np
import pandas as pd
import torch
import random
from sklearn.model_selection import StratifiedKFold

# Use Pandas to read the CSV data file
data_file = pd.read_csv('../data/data.csv')
data = data_file.values.tolist()

# Set class labels
labels = ["Non-Phishing Website", "Phishing Website"]
labels_tensor = torch.LongTensor([0,1])

# Set metadata for our RNN
num_of_attributes = 16
hidden_size = 128
num_of_labels = 2
data_size = len(data)

# Extract domains, attributes, and labels from our data
# We will also create numpy array versions to perform Stratified K-Folds
data_domain = []
data_attribute = torch.Tensor(data_size, 16, 1, 16)
data_label = torch.LongTensor(data_size)
data_attribute_np = []
data_label_np = []

for i in range(data_size):
    data_domain.append(data[i][0])
    
    new_data = data[i][1:-1]
    new_tensor = torch.zeros(16, 1, 16)
    for idx, att in enumerate(new_data):
        new_tensor[idx][0][idx] = att
    data_attribute[i] = new_tensor
    data_attribute_np.append(new_data)
    
    data_label[i] = data[i][-1]
    data_label_np.append(data[i][-1])

data_attribute_np = np.array(data_attribute_np)
data_label_np = np.array(data_label_np)

# Perform Stratified K-Folds
skf = StratifiedKFold(n_splits = 12, random_state = None, shuffle = True)

for train_index, test_index in skf.split(data_attribute_np, data_label_np):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))    

TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10072 TEST: 916
TRAIN: 10073 TEST: 915
TRAIN: 10073 TEST: 915
TRAIN: 10073 TEST: 915
TRAIN: 10073 TEST: 915
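
The fold sizes printed above follow directly from dividing 10,988 examples into 12 folds. A minimal sketch of that arithmetic is shown here. The helper name `fold_sizes` is hypothetical, and it only reproduces the sizes, not scikit-learn's stratified assignment of examples to folds:

```python
def fold_sizes(n_samples, n_splits):
    """Return the test-fold sizes when n_samples are spread as evenly as possible."""
    base, remainder = divmod(n_samples, n_splits)
    # The first `remainder` folds receive one extra example
    return [base + 1 if i < remainder else base for i in range(n_splits)]

sizes = fold_sizes(10988, 12)
for test_size in sizes:
    print("TRAIN:", 10988 - test_size, "TEST:", test_size)
```

Eight folds hold 916 test examples and four hold 915, so every example is used for testing exactly once.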


# Creating the Network

This implementation of the RNN has been largely taken from [the PyTorch for Torch users tutorial](https://github.com/pytorch/tutorials/blob/master/Introduction%20to%20PyTorch%20for%20former%20Torchies.ipynb). It consists of two linear layers that operate on the concatenated input and hidden state, with a sigmoid applied to the new hidden state.
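
Written out, one step of this network can be summarised as follows, where $x_t$ is the input at step $t$, $h_{t-1}$ is the previous hidden state, and $[x_t; h_{t-1}]$ is their concatenation. The weight and bias symbols are our own notation for the two linear layers, not names taken from the tutorial:

$$
h_t = \sigma\left(W_h [x_t; h_{t-1}] + b_h\right), \qquad o_t = W_o [x_t; h_{t-1}] + b_o
$$

Note that the output $o_t$ is computed from the previous hidden state $h_{t-1}$ rather than from $h_t$, and that no activation is applied to it.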

In [7]:
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        # Define the input, hidden, and output sizes of the network
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # Define the two linear layers
        self.input_to_hidden_layer = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_to_output_layer = nn.Linear(input_size + hidden_size, output_size)
    
    def forward(self, input, hidden):

        # Concatenate the input and the previous hidden state along the feature dimension
        combined = torch.cat((input, hidden), 1)
        
        # Run the Tensor through the network
        hidden = self.input_to_hidden_layer(combined)
        hidden = torch.sigmoid(hidden)
        output = self.input_to_output_layer(combined)
        
        # Return the results of the network
        return output, hidden

    def init_hidden(self):
        # Plain tensors track gradients in current PyTorch, so no Variable wrapper is needed
        return torch.zeros(1, self.hidden_size)

## Manually testing the network

With our custom `RNN` class defined, we can create a new instance, `net`.

To run a step of this network, we pass in an input (in this case, the first attribute vector of the first URL in `data_attribute`) and a previous hidden state (initialized as zeros). We get back the output (an unnormalized score for each label) and the next hidden state (which we keep for the next step).

As you can see, the output is a `<1 x num_of_labels>` Tensor, where every item is the score of a label (higher is more likely).

In [8]:
net = RNN(num_of_attributes, hidden_size, num_of_labels)

ex_input = data_attribute[0]
ex_hidden = net.init_hidden()

ex_output, ex_next_hidden = net(ex_input[0], ex_hidden)
print('output size =', ex_output.size())
print(ex_output)

output size = torch.Size([1, 2])
tensor([[0.0529, 0.0257]], grad_fn=<AddmmBackward>)
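
The two numbers in the output are raw scores, not probabilities. If probabilities are wanted, a softmax can be applied to them. The following is a minimal pure-Python sketch with a hypothetical helper `softmax`, using the scores copied from the example output above:

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores copied from the example output above
probabilities = softmax([0.0529, 0.0257])
print(probabilities)  # roughly [0.507, 0.493]
```

Since the network is untrained at this point, the two probabilities are both close to 0.5.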


# Preparing for Training

Before going into training, we should make a few helper functions. The first interprets the output of the network, which we know to be a score for each label. We can use `Tensor.topk` to get the index of the greatest value:

In [9]:
def category(output):
    
    # Tensor out of Variable with .data
    top_n, top_i = output.data.topk(1)
    print(top_i)
    category_i = top_i[0][0]
    return "Prediction: " + labels[category_i]

print(category(ex_output))

tensor([[0]])
Prediction: Non-Phishing Website


We will also want a quick way to get a training example (a set of attributes and its true label) from a given index.

In [10]:
def get_example(index):
    
    # Use the index given to select the particular example
    attribute_tensor = data_attribute[index]
    label_tensor = torch.Tensor([data_label[index]]).long()
    
    # Return the attribute tensor and the label tensor
    return attribute_tensor, label_tensor
    
attribute_tensor, label_tensor = get_example(0)
print('\nAttribute =', attribute_tensor, '\nLabel =', labels[label_tensor.item()], '\n')


Attribute = tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0

We may also define a handy function to get the error of the network:

In [11]:
def get_error(scores, labels):

    batch_size = scores.size(0)
    predicted_labels = scores.argmax(dim = 1)
    indicator = (predicted_labels == labels)
    num_of_matches = indicator.sum()
    
    return 1 - num_of_matches.float() / batch_size

We also define a function to get the accuracy of the network, as a percentage:

In [12]:
def get_accuracy(scores, labels):
    
    batch_size = scores.size(0)
    predicted_labels = scores.argmax(dim = 1)
    indicator = (predicted_labels == labels)
    num_of_matches = indicator.sum()
    
    return 100 * num_of_matches.float() / batch_size  

Finally, we may define a function to evaluate our network on the test set:

In [13]:
def evaluate_on_test_data(test_index):

    running_accuracy = 0
    num_batches = 0
    
    label_true = torch.Tensor()
    label_prediction = torch.Tensor()
    
    for count in range(0, len(test_index)):

        hidden = net.init_hidden()
        
        attribute_tensor, label_tensor = get_example(test_index[count].item())
        
        for i in range(num_of_attributes):
            output, hidden = net(attribute_tensor[i], hidden)
        
        _, predictions = torch.max(output, 1)
        
        # Compute the accuracy of the model on the test data
        accuracy = get_accuracy(output.detach(), label_tensor)
        running_accuracy += accuracy.item()
        
        num_batches += 1
        
        label_true = torch.cat((label_true, label_tensor.data), 0)
        label_prediction = torch.cat((label_prediction, predictions), 0)
    
    f1 = f1_score(label_true, label_prediction)
    TN, FP, FN, TP = confusion_matrix(label_true, label_prediction).ravel()
    TPR = TP / (TP + FN)
    FNR = FN / (TP + FN)
    
    total_accuracy = running_accuracy / num_batches
    print('Test Accuracy =', total_accuracy, 'percent', '\n\nF1 Score = ', f1, '\nTPR Score = ', TPR, '\nFNR Score = ', FNR, '\n\n\n')
    return f1, TPR, FNR

# Training the Network

Now all it takes to train this network is show it a bunch of examples, have it make guesses, and tell it if it's wrong.

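For our loss function, [`nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) is appropriate, since the network outputs raw, unnormalized scores for each label and `nn.CrossEntropyLoss` applies the log-softmax itself. (The sigmoid in our network is applied only to the hidden state, not to the output.) We will also use a learning rate of 0.05.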
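
For a single example with score vector $o$ and true label $y$, the loss computed by `nn.CrossEntropyLoss` with its default settings is the negative log of the softmax probability assigned to the true label:

$$
\ell(o, y) = -\log \frac{\exp(o_y)}{\sum_{k} \exp(o_k)}
$$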

In [14]:
criterion = nn.CrossEntropyLoss()
learning_rate = 0.05

We will train our network for 20 epochs on each fold. Each epoch trains the network on all of that fold's training data, in a randomised order.

Within each epoch, we pass in the individual attribute tensors of a single training example (16 in total) one at a time. This is done for every training example in the fold (roughly 10,000 per epoch).
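
The 16 per-step inputs are built in the data preparation cell by placing attribute `idx` at position `idx` of the vector for step `idx`, leaving all other entries at zero. A minimal pure-Python sketch of that layout follows, using a hypothetical helper `encode_attributes` and made-up attribute values (the notebook itself stores the result as a `16 x 1 x 16` tensor):

```python
def encode_attributes(values):
    """Return one vector per step; step i carries only attribute i, at position i."""
    n = len(values)
    steps = []
    for idx, value in enumerate(values):
        vector = [0.0] * n
        vector[idx] = float(value)
        steps.append(vector)
    return steps

# Hypothetical attribute values for a 4-attribute example
steps = encode_attributes([1, 0, 1, 1])
for vector in steps:
    print(vector)
```

Each step therefore shows the network exactly one attribute, and an attribute whose value is 0 produces an all-zero input for its step.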

After every 5 epochs, we will test our partially-trained network on the test data and have it output the accuracy rate on the test data.

On a local machine, the model's accuracy on the test dataset is typically around 88% to 91%. Results may vary with different hyperparameters and datasets. Apart from the overall accuracy, we also use the F1 score, the True Positive Rate (TPR), and the False Negative Rate (FNR) to evaluate the model's performance on the test dataset.
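
These metrics can all be derived from the four confusion-matrix counts. The sketch below uses a hypothetical helper `summarise`. The counts are ones we inferred from the figures reported further down for K-Fold 1, epoch 5; the notebook does not print them, so treat them as an assumption:

```python
def summarise(tn, fp, fn, tp):
    """Compute accuracy, F1, TPR and FNR from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)   # True Positive Rate (recall)
    fnr = fn / (tp + fn)   # False Negative Rate, always 1 - TPR
    f1 = 2 * precision * tpr / (precision + tpr)
    return accuracy, f1, tpr, fnr

# Counts inferred (assumption) from the reported K-Fold 1, epoch 5 figures
accuracy, f1, tpr, fnr = summarise(tn=415, fp=2, fn=119, tp=380)
print('Accuracy =', 100 * accuracy, '\nF1 =', f1, '\nTPR =', tpr, '\nFNR =', fnr)
```

Because TPR and FNR always sum to 1, they carry the same information, and their standard deviations across folds are identical.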

In [15]:
import time
import math
from sklearn.metrics import f1_score, confusion_matrix

# Initialise the net
net = RNN(num_of_attributes, hidden_size, num_of_labels)
num_of_epochs = 20

# Keep track of losses for plotting
current_loss = 0
all_losses = []

start = time.time()

kfold_count = 1
f1_list = []
TPR_list = []
FNR_list = []

# Start the training process
# We will train each K-Fold for a total of num_of_epochs (20) epochs
for train_index, test_index in skf.split(data_attribute_np, data_label_np):

    # We then pass each K-Fold through num_of_epochs epochs of training
    for epoch in range(1, num_of_epochs + 1):

        # Create a new optimizer at the beginning of each epoch, and give the current learning rate
        # Relevant if we change learning rate every epoch (currently not being done) 
        optimizer = torch.optim.SGD(net.parameters(), lr = learning_rate)

        running_loss = 0
        running_error = 0
        num_batches = 0
        
        np.random.shuffle(train_index)
        np.random.shuffle(test_index)

        for count in range(0, len(train_index)):

            # Forward and Backward Passes    
            optimizer.zero_grad()
            hidden = net.init_hidden()
            attribute_tensor, label_tensor = get_example(train_index[count].item())

            for i in range(num_of_attributes):
                output, hidden = net(attribute_tensor[i], hidden)

            loss = criterion(output, label_tensor)
            loss.backward()
            optimizer.step()

            # Compute some stats
            running_loss += loss.detach().item()
            error = get_error(output.detach(), label_tensor)
            running_error += error.item()
            num_batches += 1

        # Once the epoch is finished, we divide the "running quantities" by the number of batches
        total_loss = running_loss / num_batches
        total_error = running_error / num_batches
        elapsed_time = time.time() - start
        
        # Every 5 epochs, we display the stats and compute the error rate on the test set  
        if epoch % 5 == 0:
            print('\nK-Fold', kfold_count, '\nEpoch =', epoch, '\nElapsed Time =', elapsed_time, '\nLoss =', total_loss, '\nError =', total_error * 100,'\nLearning Rate =', learning_rate, '\n')
            f1, TPR, FNR = evaluate_on_test_data(test_index)
            
            # If this is the final epoch for the current K-Fold, pass back the scores
            if epoch == num_of_epochs:
                f1_list.append(f1)
                TPR_list.append(TPR)
                FNR_list.append(FNR)
    
    # Increment the K-Fold count
    kfold_count = kfold_count + 1


K-Fold 1 
Epoch = 5 
Elapsed Time = 58.054107427597046 
Loss = 0.30963848209178296 
Error = 12.926926131850674 
Learning Rate = 0.05 

Test Accuracy = 86.79039301310044 percent 

F1 Score =  0.8626560726447219 
TPR Score =  0.7615230460921844 
FNR Score =  0.23847695390781562 




K-Fold 1 
Epoch = 10 
Elapsed Time = 117.44369077682495 
Loss = 0.2863015365072393 
Error = 11.397934868943606 
Learning Rate = 0.05 

Test Accuracy = 87.77292576419214 percent 

F1 Score =  0.874439461883408 
TPR Score =  0.781563126252505 
FNR Score =  0.218436873747495 




K-Fold 1 
Epoch = 15 
Elapsed Time = 175.9408655166626 
Loss = 0.2830614506004731 
Error = 11.397934868943606 
Learning Rate = 0.05 

Test Accuracy = 87.882096069869 percent 

F1 Score =  0.8756998880179172 
TPR Score =  0.7835671342685371 
FNR Score =  0.21643286573146292 




K-Fold 1 
Epoch = 20 
Elapsed Time = 235.62602019309998 
Loss = 0.28139692250600046 
Error = 11.268864177918983 
Learning Rate = 0.05 

Test Accuracy = 87.22707

Test Accuracy = 89.30131004366812 percent 

F1 Score =  0.8941684665226782 
TPR Score =  0.8296593186372746 
FNR Score =  0.17034068136272545 




K-Fold 8 
Epoch = 15 
Elapsed Time = 1832.7578749656677 
Loss = 0.26550389608052144 
Error = 11.02065131056394 
Learning Rate = 0.05 

Test Accuracy = 89.62882096069869 percent 

F1 Score =  0.8968512486427795 
TPR Score =  0.8276553106212425 
FNR Score =  0.17234468937875752 




K-Fold 8 
Epoch = 20 
Elapsed Time = 1891.7468116283417 
Loss = 0.263717800703596 
Error = 10.901509134233518 
Learning Rate = 0.05 

Test Accuracy = 88.8646288209607 percent 

F1 Score =  0.8903225806451612 
TPR Score =  0.8296593186372746 
FNR Score =  0.17034068136272545 




K-Fold 9 
Epoch = 5 
Elapsed Time = 1951.6749532222748 
Loss = 0.26319361714776995 
Error = 10.771369006254343 
Learning Rate = 0.05 

Test Accuracy = 87.75956284153006 percent 

F1 Score =  0.873589164785553 
TPR Score =  0.7755511022044088 
FNR Score =  0.22444889779559118 




K-Fold 9 


# Calculating the Results

With the trained model and the aggregated F1, TPR, and FNR scores, we can calculate the mean and the standard deviation for each metric, giving us a better idea of the model's performance.

In [16]:
print('F1 Mean:', np.mean(np.array(f1_list)), '\n')
print('F1 STD:', np.std(np.array(f1_list)), '\n')
print('TPR Mean:', np.mean(np.array(TPR_list)), '\n')
print('TPR STD:', np.std(np.array(TPR_list)), '\n')
print('FNR Mean:', np.mean(np.array(FNR_list)), '\n')
print('FNR STD:', np.std(np.array(FNR_list)), '\n')

F1 Mean: 0.8895662727920074 

F1 STD: 0.010267565929834278 

TPR Mean: 0.8131262525050101 

TPR STD: 0.017434581804915143 

FNR Mean: 0.18687374749498997 

FNR STD: 0.01743458180491514 
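
Note that `np.std` computes the population standard deviation by default (`ddof=0`). With only 12 folds, the sample standard deviation (`ddof=1`) is slightly larger. Here is a small illustration using Python's `statistics` module and made-up scores (not the values from this run):

```python
import statistics

# Hypothetical per-fold F1 scores, for illustration only
scores = [0.88, 0.89, 0.90, 0.87, 0.91, 0.89]

population_std = statistics.pstdev(scores)  # divides by n, like np.std(scores)
sample_std = statistics.stdev(scores)       # divides by n - 1, like np.std(scores, ddof=1)

print('Mean:', statistics.mean(scores))
print('Population STD:', population_std)
print('Sample STD:', sample_std)
```

Which of the two to report is a judgment call; the important part is to state which one was used.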

