# PyTorch: Classifying Phishing Websites with an Attribute-Level RNN

To detect whether a given URL has a high probability of being a Phishing Website, we make use of a Recurrent Neural Network (RNN) as an attribute-level classifier for Website URLs.

The RNN reads each URL as a series of attributes. At each step it outputs a prediction and a "hidden state", and it feeds that hidden state into the next step. We take the final prediction as the output, i.e. which class the URL belongs to: Phishing Website or Non-Phishing Website.

# Preparing the Data

For this network, we used a CSV file containing website URLs that had been split into attributes. The data can be downloaded separately at https://phishtank.com/.

Specifically, we used 10,988 URLs from the PhishTank Database (as of March 31, 2022), each confirmed to be either a Phishing Website URL or a Non-Phishing Website URL, to train our network.

In [1]:
import numpy as np
import pandas as pd
import torch
import random

# Use Pandas to read the CSV data file
data_file = pd.read_csv('../data/data.csv')
data = data_file.values.tolist()

# Randomly shuffle the data
random.shuffle(data)

# Set class labels
labels = ["Non-Phishing Website", "Phishing Website"]
labels_tensor = torch.LongTensor([0,1])

# Converting Data into Tensors

We first split the data into Training Data and Test Data. We allocate 2000 of the examples to be Test Data.

Then, we turn the data and the corresponding labels into tensors. Each URL becomes a 16 x 1 x 16 Tensor: a sequence of 16 steps, one per attribute, where each step is a 1 x 16 Tensor that holds that attribute's value at its own position and zeros elsewhere.

Each URL has a label of either 1 (Phishing Website) or 0 (Non-Phishing Website).
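
The per-URL encoding described above can also be built in a single step. The sketch below is an illustration rather than the notebook's own code: it assumes a hypothetical row of 16 numeric attribute values and uses `torch.diag` to place each value at its own position, which gives the same Tensor as filling it in one element at a time.

```python
import torch

# Hypothetical attribute values for one URL (16 numbers; not taken from the data set)
attributes = [0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0]

# Loop-based encoding: step idx holds attribute idx at position idx
looped = torch.zeros(16, 1, 16)
for idx, att in enumerate(attributes):
    looped[idx][0][idx] = att

# Equivalent vectorised encoding: a diagonal matrix with a singleton middle dimension
encoded = torch.diag(torch.tensor(attributes, dtype=torch.float)).unsqueeze(1)

print(encoded.size())                # torch.Size([16, 1, 16])
print(torch.equal(looped, encoded))  # True
```

Indexing `encoded[k]` then yields the 1 x 16 input for step `k` of the RNN.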

In [2]:
# Set training data and test data sizes
data_size = len(data)
test_data_size = 2000
training_data_size = data_size - test_data_size

# Set metadata for our RNN
num_of_attributes = 16
hidden_size = 128
num_of_labels = 2

# Extract out training data and test data
training_data = torch.Tensor(training_data_size, 16, 1, 16)
test_data = torch.Tensor(test_data_size, 16, 1, 16)

training_data_domain = []
test_data_domain = []
training_label = torch.LongTensor(training_data_size)
test_label = torch.LongTensor(test_data_size)

for i in range(training_data_size):
    training_data_domain.append(data[i][0])
    
    new_data = data[i][1:-1]
    new_tensor = torch.zeros(16, 1, 16)
    for idx, att in enumerate(new_data):
        new_tensor[idx][0][idx] = att
    training_data[i] = new_tensor
    training_label[i] = data[i][-1]

for j in range(test_data_size):
    test_data_domain.append(data[j+training_data_size][0])
    
    new_data = data[j+training_data_size][1:-1]
    new_tensor = torch.zeros(16, 1, 16)
    for idx, att in enumerate(new_data):
        new_tensor[idx][0][idx] = att
    test_data[j] = new_tensor    
    test_label[j] = data[j+training_data_size][-1]
    
# Check the shapes of the resulting Tensors and inspect the training data
print(training_data.size())
print(training_label.size())
print(test_data.size())
print(test_label.size())
print(training_data)

torch.Size([8988, 16, 1, 16])
torch.Size([8988])
torch.Size([2000, 16, 1, 16])
torch.Size([2000])
tensor([[[[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 1.,  ..., 0., 0., 0.]],

         ...,

         [[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 1., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 1.,  ..., 0., 0., 0.]],

         ...,

         [[0., 0., 0.,  ..., 1., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 1., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 1.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 1.,  ..., 0., 0., 0.]],

         ...,

         [[0., 0., 0.,  ..., 0., 0., 0.]],

         [[0., 0., 0.,  ..., 0., 1., 0.]],

         [[0., 0., 0.,  ..., 0., 0., 0.]]],


        ...,


        [[[0., 0., 0.,  ..., 0., 0., 0.]],



# Creating the Network

This RNN has been largely taken from [the PyTorch for Torch users tutorial](https://github.com/pytorch/tutorials/blob/master/Introduction%20to%20PyTorch%20for%20former%20Torchies.ipynb). It is just 2 linear layers which operate on an input and hidden state, with a LogSoftmax layer after the output.

In [3]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        # Define the input, hidden, and output sizes of the network
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Define the two linear layers, followed by a LogSoftmax layer on the output
        self.input_to_hidden_layer = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_to_output_layer = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim = 1)

    def forward(self, input, hidden):

        # Combine the input and the previous hidden state into a single Tensor
        combined = torch.cat((input, hidden), 1)

        # Run the combined Tensor through both linear layers
        hidden = self.input_to_hidden_layer(combined)
        output = self.input_to_output_layer(combined)
        output = self.softmax(output)

        # Return the log-probabilities and the next hidden state
        return output, hidden

    def init_hidden(self):
        # Variable has been deprecated since PyTorch 0.4; a plain Tensor works with autograd
        return torch.zeros(1, self.hidden_size)

## Manually testing the network

With our custom `RNN` class defined, we can create a new instance, `net`.

To run a step of this network, we pass in an input (in this case, the first attribute Tensor of the first URL in `training_data`) and a previous hidden state (initialized as zeros at first). We get back the output (the log-probability of each label) and the next hidden state (which we keep for the next step).

As you can see, the output is a `<1 x num_of_labels>` Tensor, where every item is the log-probability of the corresponding label (higher is more likely).

In [4]:
net = RNN(num_of_attributes, hidden_size, num_of_labels)

ex_input = training_data[0]
ex_hidden = net.init_hidden()

ex_output, ex_next_hidden = net(ex_input[0], ex_hidden)
print('output size =', ex_output.size())
print(ex_output)

output size = torch.Size([1, 2])
tensor([[-0.7410, -0.6475]], grad_fn=<LogSoftmaxBackward>)
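
Because the last layer is `nn.LogSoftmax`, the printed values are log-probabilities rather than probabilities. As a small check, using only the two values printed above, exponentiating them should recover probabilities that sum to approximately 1:

```python
import torch

# Log-probabilities copied from the example output above
log_probs = torch.tensor([[-0.7410, -0.6475]])

# exp() turns log-probabilities back into probabilities
probs = log_probs.exp()

print(probs)
print(probs.sum(dim=1))
```

The second label has the larger probability, which is why an untrained network happens to lean towards "Phishing Website" for this input.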


# Preparing for Training

Before going into training, we should make a few helper functions. The first interprets the output of the network, which we know to be a log-probability for each label. We can use `Tensor.topk` to get the index of the greatest value:

In [5]:
def category(output):
    
    # Take the largest value and its index; detach() drops gradient tracking
    top_n, top_i = output.detach().topk(1)
    print(top_i)
    category_i = top_i[0][0]
    return "Prediction: " + labels[category_i]

print(category(ex_output))

tensor([[1]])
Prediction: Phishing Website
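
To make the behaviour of `Tensor.topk` concrete, here is a minimal sketch on hypothetical scores (the numbers are made up for illustration). With `k = 1` it returns the largest value in each row together with its column index, which matches `argmax` along the same dimension:

```python
import torch

# Hypothetical log-probabilities for three URLs (rows) and two labels (columns)
scores = torch.tensor([[-0.2, -1.7],
                       [-1.2, -0.4],
                       [-0.9, -0.5]])

# Largest entry per row, and the column it came from
top_values, top_indices = scores.topk(1)

print(top_indices.squeeze(1))  # tensor([0, 1, 1])
print(scores.argmax(dim=1))    # tensor([0, 1, 1])
```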


We will also want a quick way to get a training example (A set of attributes and its true label) from a given index.

In [6]:
def get_training_example(index, train=True):
    
    if (train):
        # Select the training example at the given index
        attribute_tensor = training_data[index]
        label_tensor = torch.Tensor([training_label[index]]).long()

    else:
        # Select the test example at the given index
        attribute_tensor = test_data[index]
        label_tensor = torch.Tensor([test_label[index]]).long()

    # Return the attribute Tensor and the label Tensor
    return attribute_tensor, label_tensor
    
for i in range(10):
    attribute_tensor, label_tensor = get_training_example(i)
    print('\nAttribute =', attribute_tensor, '\nLabel =', labels[label_tensor.item()], '\n')


Attribute = tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0

We may also define a handy function to get the error of the network:

In [7]:
def get_error(scores, labels):

    batch_size = scores.size(0)
    predicted_labels = scores.argmax(dim = 1)
    indicator = (predicted_labels == labels)
    num_of_matches = indicator.sum()
    
    return 1 - num_of_matches.float() / batch_size

And a function to get the accuracy of the network:

In [8]:
def get_accuracy(scores, labels):
    
    batch_size = scores.size(0)
    predicted_labels = scores.argmax(dim=1)
    indicator = (predicted_labels == labels)
    num_of_matches = indicator.sum()
    return 100 * num_of_matches.float() / batch_size  
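
As a quick sanity check of the two helpers, the sketch below restates them in compact form and applies them to a hypothetical batch of four score rows (the scores and labels are made up). The error is a fraction and the accuracy is a percentage, so the two should satisfy accuracy = 100 * (1 - error).

```python
import torch

def get_error(scores, labels):
    # Fraction of rows whose highest-scoring label differs from the true label
    predicted_labels = scores.argmax(dim=1)
    return 1 - (predicted_labels == labels).sum().float() / scores.size(0)

def get_accuracy(scores, labels):
    # Percentage of rows whose highest-scoring label matches the true label
    predicted_labels = scores.argmax(dim=1)
    return 100 * (predicted_labels == labels).sum().float() / scores.size(0)

# Hypothetical scores for four URLs; the predicted labels are [0, 1, 1, 0]
scores = torch.tensor([[-0.1, -2.3], [-1.6, -0.2], [-0.9, -0.5], [-0.3, -1.4]])
true_labels = torch.tensor([0, 1, 0, 0])

print(get_error(scores, true_labels).item())     # 0.25
print(get_accuracy(scores, true_labels).item())  # 75.0
```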

Finally, we may define a function to evaluate our network on the test set:

In [9]:
def evaluate_on_test_data():

    running_accuracy = 0
    num_batches = 0

    # Gradients are not needed for evaluation
    with torch.no_grad():
        for i in range(test_data_size):

            hidden = net.init_hidden()

            # The order of the test examples does not affect the accuracy, so we index them directly
            attribute_tensor, label_tensor = get_training_example(i, False)

            # Feed the attributes one step at a time (a separate loop variable avoids shadowing i)
            for step in range(num_of_attributes):
                output, hidden = net(attribute_tensor[step], hidden)

            # Compute some stats
            accuracy = get_accuracy(output, label_tensor)
            running_accuracy += accuracy.item()

            num_batches += 1

    total_accuracy = running_accuracy / num_batches
    print('Test Accuracy =', total_accuracy, 'percent')

# Training the Network

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it if it's wrong.

For our loss function, [`nn.NLLLoss`](http://pytorch.org/docs/nn.html#nllloss) is appropriate, since the last layer of the RNN is `nn.LogSoftmax`. We will also use a Learning Rate of 0.005.

In [10]:
criterion = nn.NLLLoss()
learning_rate = 0.005
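
The pairing of `nn.LogSoftmax` with `nn.NLLLoss` is equivalent to applying `nn.CrossEntropyLoss` to the raw scores. The sketch below checks this on hypothetical raw scores and labels (both made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical raw scores (before LogSoftmax) for three URLs and two labels
raw_scores = torch.tensor([[1.5, -0.3], [0.2, 0.9], [-1.0, 2.0]])
true_labels = torch.tensor([0, 1, 1])

# Route 1: LogSoftmax followed by NLLLoss, as in this notebook
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(raw_scores), true_labels)

# Route 2: CrossEntropyLoss applied directly to the raw scores
ce = nn.CrossEntropyLoss()(raw_scores, true_labels)

print(torch.allclose(nll, ce))  # True
```

Keeping the LogSoftmax inside the network, as done here, means the network's output can be read directly as log-probabilities.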

We will train our network for a total of 100 epochs. Each epoch will train our network on all of the training data, in a randomised order.

Within each epoch, we pass in the individual attribute tensors of a single training example (16 in total) one at a time. This is done for every training example (8,988 per epoch).

Every 10 epochs, we test our partially trained network on the test data and print its accuracy.

The average accuracy of this architecture, with the set that was initially used to train and test, is around 87 to 88%. Results may vary with different parameters and data sets.
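
A single training step from the procedure above can be tried in isolation. The sketch below is a stand-in, not the notebook's network: it assumes a smaller hidden size of 8, a synthetic example with random attribute values, and a made-up label. It shows that only the final step's output is scored, while the backward pass still reaches the hidden-state layer through the earlier steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the notebook's RNN: two linear layers over [input, hidden]
to_hidden = nn.Linear(16 + 8, 8)
to_output = nn.Linear(16 + 8, 2)
params = list(to_hidden.parameters()) + list(to_output.parameters())
optimizer = torch.optim.SGD(params, lr=0.005)
criterion = nn.NLLLoss()

# Synthetic example: 16 steps of shape 1 x 16, with a made-up label
example = torch.diag(torch.rand(16)).unsqueeze(1)
label = torch.tensor([1])

bias_before = to_output.bias.detach().clone()

optimizer.zero_grad()
hidden = torch.zeros(1, 8)
for step in range(16):
    combined = torch.cat((example[step], hidden), 1)
    hidden = to_hidden(combined)
    output = torch.log_softmax(to_output(combined), dim=1)

loss = criterion(output, label)  # only the final step's output is scored
loss.backward()                  # gradients flow back through the earlier steps
optimizer.step()

print(loss.item())
print(to_hidden.weight.grad is not None)                  # True
print(torch.equal(bias_before, to_output.bias.detach()))  # False: the bias was updated
```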

In [12]:
import time
import math

# Initialise the net
net = RNN(num_of_attributes, hidden_size, num_of_labels)
num_of_epochs = 100

# Keep track of losses for plotting
current_loss = 0
all_losses = []

start = time.time()

# Start the training process
for epoch in range(num_of_epochs):
    
    # create a new optimizer at the beginning of each epoch: give the current learning rate.  
    optimizer = torch.optim.SGD(net.parameters(), lr = learning_rate)
    
    running_loss = 0
    running_error = 0
    num_batches = 0
    
    shuffled_indices_training = torch.randperm(training_data_size)
    shuffled_indices_test = torch.randperm(test_data_size)
    
    for count in range(0, training_data_size):
        
        # Forward and Backward Passes    
        optimizer.zero_grad()
        hidden = net.init_hidden()
        attribute_tensor, label_tensor = get_training_example(shuffled_indices_training[count].item(), True)

        for i in range(num_of_attributes):
            output, hidden = net(attribute_tensor[i], hidden)
            
        loss = criterion(output, label_tensor)
        loss.backward()
        optimizer.step()
        
        # Compute some stats
        running_loss += loss.detach().item()
        error = get_error(output.detach(), label_tensor)
        running_error += error.item()
        num_batches += 1
    
    # Once the epoch is finished, we divide the "running quantities" by the number of batches
    total_loss = running_loss / num_batches
    total_error = running_error / num_batches
    elapsed_time = time.time() - start
    
    # Every 10 epochs, we display the stats and compute the accuracy on the test set
    if epoch % 10 == 0 : 
        print('\nEpoch =', epoch, '\nElapsed Time =', elapsed_time, '\nLoss =', total_loss, '\nError =', total_error * 100,'\nLearning Rate =', learning_rate, '\n')
        evaluate_on_test_data()


Epoch = 0 
Elapsed Time = 9.648565530776978 
Loss = 0.6841134036056826 
Error = 43.958611481975964 
Learning Rate = 0.005 

Test Accuracy = 59.8 percent

Epoch = 10 
Elapsed Time = 106.57540845870972 
Loss = 0.2980557216001938 
Error = 11.504227859368047 
Learning Rate = 0.005 

Test Accuracy = 86.95 percent

Epoch = 20 
Elapsed Time = 203.94273114204407 
Loss = 0.29878498092486505 
Error = 11.426346239430352 
Learning Rate = 0.005 

Test Accuracy = 86.95 percent

Epoch = 30 
Elapsed Time = 302.186320066452 
Loss = 0.2957939164415235 
Error = 11.481975967957277 
Learning Rate = 0.005 

Test Accuracy = 87.1 percent

Epoch = 40 
Elapsed Time = 400.8748769760132 
Loss = 0.2956726012785223 
Error = 11.470850022251891 
Learning Rate = 0.005 

Test Accuracy = 87.1 percent

Epoch = 50 
Elapsed Time = 500.94018268585205 
Loss = 0.2945947323390529 
Error = 11.359590565198042 
Learning Rate = 0.005 

Test Accuracy = 87.1 percent

Epoch = 60 
Elapsed Time = 599.52343583107 
Loss = 0.295238206082