Genetic algorithms in classifying a website as Phishing or Not

prabhnoor0212/malicious-website-detection

Malicious-Website-Detection

IMPORTANT: In progress (I will be adding more results and the web extension code as I get time :)

OBJECTIVE

Build a system to classify a URL as malicious or safe using machine learning.

DATASET

malicious-dataset

FLOW

[Image: flow diagram]

Pre-Processing

  • Cleaning: missing values handled by imputation, domain knowledge, mean/mode substitution, and other techniques
  • Merging and joining of datasets
  • Standardization of numerical features
  • Analysis: graphs (1-D, 2-D, 3-D, n-D), available as an IPython notebook / PDF
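
As a rough illustration of the cleaning and standardization steps above, here is a minimal sketch using only the Python standard library. The `url_length` column and its values are hypothetical and are not taken from the project's dataset; the repository's notebook may use different tools and imputation rules.

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# Hypothetical numerical feature with one missing value.
url_length = [54, None, 33, 120, 78]
cleaned = impute_mean(url_length)
scaled = standardize(cleaned)
```

Mean substitution suits numerical columns; for categorical columns the same pattern would use the mode instead.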

Dimensionality reduction for visualisation using t-SNE (selected over PCA because it produced a clearer plot)

[Image: t-SNE plot]

ML

K-Nearest Neighbors

  • Training data + cross-validation data: 70% of the dataset
  • kd-tree neighbor search used while searching for the optimal k (for better time performance)
  • Accuracy on unseen test data: 94.86%
  • Confusion matrix (without optimization):

[Image: confusion matrix without genetic-algorithm optimization]
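
The K-NN workflow above (split the data, pick k on validation data, then report test accuracy and a confusion matrix) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the data is synthetic, the helper names are hypothetical, and a brute-force neighbor search stands in for the kd-tree so that the example needs only the standard library.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority label among the k nearest training points (Euclidean, brute force)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, X, y, k):
    hits = sum(knn_predict(train_X, train_y, x, k) == label for x, label in zip(X, y))
    return hits / len(X)

def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 (safe) and 1 (malicious)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Synthetic two-class data (assumption): two Gaussian blobs in 2-D.
rng = random.Random(0)
def make_points(center, label, n):
    return [([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label) for _ in range(n)]

data = make_points(0.0, 0, 100) + make_points(3.0, 1, 100)
rng.shuffle(data)
X = [point for point, _ in data]
y = [label for _, label in data]

# 70% for training + validation, the rest held out as unseen test data.
cut = int(0.7 * len(X))
dev_X, dev_y, test_X, test_y = X[:cut], y[:cut], X[cut:], y[cut:]
val_cut = int(0.8 * len(dev_X))
tr_X, tr_y = dev_X[:val_cut], dev_y[:val_cut]
va_X, va_y = dev_X[val_cut:], dev_y[val_cut:]

# Choose the odd k with the best validation accuracy.
best_k = max(range(1, 16, 2), key=lambda k: accuracy(tr_X, tr_y, va_X, va_y, k))

test_pred = [knn_predict(dev_X, dev_y, x, best_k) for x in test_X]
test_acc = sum(p == t for p, t in zip(test_pred, test_y)) / len(test_y)
cm = confusion_matrix(test_y, test_pred)
```

With a library such as scikit-learn, the neighbor search would typically be delegated to a kd-tree backed implementation rather than the full sort used here.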

Observations

  • Kd-tree query time complexity

-> When d (number of features) is small: O(log(n))

-> When d is not small: O(2^d log(n))

  • K-NN interpretability decreases as dimensionality increases

  • K-NN uses Minkowski distance (esp. Euclidean distance), which becomes less meaningful in higher dimensions

  • Hughes phenomenon

    -> As d increases, overfitting increases (generally)

    -> Distance functions (Euclidean)

[Image: distance formula]
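
One way to see why Euclidean distance loses usefulness as d grows is to measure how the gap between the nearest and farthest neighbor shrinks relative to the nearest distance. The sketch below is an illustration only: the function name `relative_contrast` and the uniform random data are assumptions, not part of the project.

```python
import math
import random

def relative_contrast(d, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from a random query
    to n_points uniform random points in the unit cube [0, 1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# The contrast drops sharply as the number of features grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

When this ratio approaches zero, the "nearest" neighbor is barely closer than any other point, so a K-NN vote carries little signal.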

Genetic Algorithms and Bio-inspired algos

  • As t-SNE and PCA do not account for the output label during dimensionality reduction, other techniques are more suitable
  • The genetic algorithm shortlisted only 7 features (a major reduction)
  • This reduces the dimensionality and lowers the computational cost of productionizing the model
  • Accuracy: 96.89%
  • Code & Implementation
  • Confusion Matrix:

[Image: confusion matrix with genetic-algorithm feature selection]

como
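
For readers unfamiliar with label-aware feature selection by genetic algorithm, here is a minimal sketch of the general idea. It is not the repository's implementation: the synthetic data, the nearest-centroid classifier used as the fitness model, the size penalty of 0.01 per feature, and all parameter values are assumptions chosen to keep the example small and standard-library only.

```python
import math
import random

rng = random.Random(42)
N_FEATURES = 10
INFORMATIVE = {0, 1, 2}  # assumption: only these synthetic features depend on the label

def make_set(n):
    labels = [i % 2 for i in range(n)]
    rows = [
        [rng.gauss(2.0 * label if j in INFORMATIVE else 0.0, 1.0) for j in range(N_FEATURES)]
        for label in labels
    ]
    return rows, labels

train_X, train_y = make_set(200)
val_X, val_y = make_set(100)

def fitness(mask):
    """Validation accuracy of a nearest-centroid classifier restricted to the
    selected features, minus a small penalty per selected feature."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in idx]
    hits = 0
    for x, y in zip(val_X, val_y):
        point = [x[j] for j in idx]
        pred = min((0, 1), key=lambda c: math.dist(point, centroids[c]))
        hits += pred == y
    return hits / len(val_X) - 0.01 * len(idx)

def tournament(pop, scores, size=3):
    """Pick the fittest of `size` randomly chosen individuals."""
    contenders = rng.sample(range(len(pop)), size)
    return pop[max(contenders, key=lambda i: scores[i])]

def evolve(pop_size=20, generations=15, mutation_rate=0.1):
    # Each individual is a bit mask: 1 keeps the feature, 0 drops it.
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(m) for m in pop]
        elite = pop[max(range(pop_size), key=lambda i: scores[i])]
        children = [elite[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            a, b = tournament(pop, scores), tournament(pop, scores)
            cut = rng.randint(1, N_FEATURES - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best_mask = evolve()
selected = [j for j, bit in enumerate(best_mask) if bit]
print("selected features:", selected)
```

Because the fitness is computed against the labels, the search favors features that help classification, which is the property t-SNE and PCA lack.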

Bayes

Coming soon. I will add interesting results from other algorithms as soon as I find some time :)
