This was made as part of a project for CS3244: Machine Learning, a module I took at the National University of Singapore (NUS).
This repository contains two Jupyter Notebooks, each containing a Recurrent Neural Network (RNN) model that takes a URL as input and predicts, with relatively high accuracy, whether it is likely to be a phishing or a non-phishing website. Each URL is broken down into 16 attributes, which the network then processes to produce a probability for each class.
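To illustrate what breaking a URL into numeric attributes can look like, here is a minimal sketch in Python. The eight features below and the function name `extract_features` are hypothetical and chosen for illustration only; they are not the 16 attributes used in the notebooks.

```python
import re
from typing import List
from urllib.parse import urlparse


def extract_features(url: str) -> List[float]:
    """Turn a URL into a small numeric feature vector.

    Hypothetical feature set for illustration; the notebooks use
    their own 16 attributes, which are not reproduced here.
    """
    # urlparse only fills in the host when a scheme is present,
    # so assume http:// for bare inputs such as "example.com/login".
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""

    # True when the host is a dotted-quad IPv4 address rather than a domain name.
    host_is_ip = bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))

    return [
        float(len(url)),                        # total URL length
        float(len(host)),                       # host length
        float(host.count(".")),                 # number of dots in the host
        float(url.count("-")),                  # number of hyphens
        float("@" in url),                      # presence of an '@' symbol
        float(parsed.scheme == "https"),        # uses HTTPS
        float(host_is_ip),                      # host is a raw IP address
        float(sum(ch.isdigit() for ch in url)), # digit count
    ]
```

Calling `extract_features("http://192.168.0.1/login")` returns a list of eight floats that could be fed to a model as one input vector.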
The first model is a standard RNN trained and tested on a set of 10,988 URLs taken from the PhishTank database (recent as of March 31, 2022). The second is an identical model, this time trained and tested on a combination of two datasets: the initial PhishTank dataset and an additional set of 11,000 URLs generated by a Generative Adversarial Network (GAN). The first model reached an average accuracy of 88% to 91% on the test data, though the possibility of overfitting should be kept in mind. The second model showed a lower average accuracy of 72% to 78%, as expected, possibly because the GAN-generated URLs introduce noise.
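As a rough sketch of how two labelled datasets might be merged, split, and scored, consider the following Python helpers. The names `combine_and_split` and `accuracy`, the 20% test fraction, the fixed seed, and the label encoding (1 for phishing, 0 for non-phishing) are all assumptions made for illustration and may differ from what the notebooks do.

```python
import random
from typing import List, Sequence, Tuple

# (feature vector, label); assumed encoding: 1 = phishing, 0 = non-phishing
Sample = Tuple[List[float], int]


def combine_and_split(
    real: Sequence[Sample],
    generated: Sequence[Sample],
    test_fraction: float = 0.2,  # assumed split, not taken from the notebooks
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Merge two datasets, shuffle them together, and return (train, test)."""
    combined = list(real) + list(generated)
    # A seeded generator keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(combined)
    n_test = int(len(combined) * test_fraction)
    return combined[n_test:], combined[:n_test]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Shuffling before splitting means both the training and test portions contain a mix of original and generated URLs, so the reported accuracy reflects the combined distribution rather than either source alone.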
The implementation is planned to be updated in the future to incorporate state-of-the-art extensions and improvements.