Malicious-Website-Detection

IMPORTANT >>>> In progress (I will be adding more results and the web extension code as I get time :)

OBJECTIVE

Build a system to classify a URL as malicious or safe using machine learning.

DATASET

Initially, the URLs along with their phishing tags were downloaded from: https://www.kaggle.com/simsek/openphishcom-phishing-urls-on-oct-2-2017#dataset.csv
Many useful parameters (URL-based, domain-based, page-based and content-based features) were then extracted using web scraping and some basic computations: makeDataset.py (the code), dataset.csv (the basic dataset)
Dataset after preparation and pre-processing: [final_cleaned_dataset.csv](https://github.com/prabhnoor0212/malicious-website-detection/blob/master/final_cleaned_dataset.csv)
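
As an illustration of what "URL-based features" can look like, here is a minimal sketch using only the Python standard library. The function name `extract_url_features` and the particular features are assumptions made for this example; they are not taken from makeDataset.py, which may compute a different set.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative URL-based features (hypothetical feature set)."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_hyphens": host.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Raw IPv4 hosts are a common phishing signal (simple pattern check only).
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }


features = extract_url_features("http://192.168.0.1/login/verify-account.php")
print(features)
```

Domain-based, page-based and content-based features need network access (WHOIS lookups, page downloads), so they are left out of this sketch.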

FLOW

Pre-Processing

Cleaning: missing values handled by imputation, domain knowledge, mean/mode substitution and other techniques
Merging and Joining of datasets
Standardization of numerical features
Analysis: Graphs(1-D,2-D,3-D,n-D) iPython notebook / PDF
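
The cleaning and standardization steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the column names (`url_length`, `uses_https`) and values are made up, and the notebooks in this repository may use library routines instead.

```python
from statistics import mean, mode, pstdev


def impute_mean(values):
    """Replace None entries in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries in a categorical column with the most frequent value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical columns with missing entries marked as None.
url_length = [20, None, 40, 60]
uses_https = ["yes", "no", None, "yes"]

filled_length = impute_mean(url_length)
filled_https = impute_mode(uses_https)
scaled_length = standardize(filled_length)
print(filled_length, filled_https, scaled_length)
```

When a model is evaluated on held-out data, the fill values and the mean and standard deviation should be computed on the training split only and then reused for the test split.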

Dimensionality reduction for visualisation using t-SNE (selected over PCA because it gave a better plot)

ML

K-Nearest Neighbors

Train data + cross-validation data: 70% of the dataset
kd-tree algorithm used for the neighbour search while finding the optimal k (for better time performance)
Accuracy on unseen test data: 94.86%
Confusion Matrix (Without optimization):
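
The workflow above (hold out a test set, pick k on validation data, report accuracy and a confusion matrix) can be sketched as below. This is a simplified stand-in, not the code from ML_WORKING.ipynb: it uses synthetic 2-D data, a brute-force neighbour search instead of a kd-tree, and a single validation split instead of full cross-validation. All names and numbers are assumptions for the example, and the result will not match the figures reported here.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Brute-force k-NN: majority vote among the k nearest points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, X, y, k):
    hits = sum(knn_predict(train_X, train_y, x, k) == label for x, label in zip(X, y))
    return hits / len(y)


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 (safe) and 1 (malicious)."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


# Synthetic two-class data (assumption: stands in for the real feature vectors).
rng = random.Random(0)
data = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 0) for _ in range(100)]
data += [([rng.gauss(4.0, 1.0), rng.gauss(4.0, 1.0)], 1) for _ in range(100)]
rng.shuffle(data)
X = [point for point, _ in data]
y = [label for _, label in data]

# 70% for training + validation, 30% kept as unseen test data.
split = int(0.7 * len(X))
X_dev, y_dev, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

# Inside the 70%, keep a validation slice for choosing k.
cut = int(0.8 * split)
X_train, y_train = X_dev[:cut], y_dev[:cut]
X_val, y_val = X_dev[cut:], y_dev[cut:]

candidate_ks = range(1, 16, 2)  # odd values avoid ties between two classes
best_k = max(candidate_ks, key=lambda k: accuracy(X_train, y_train, X_val, y_val, k))

y_pred = [knn_predict(X_dev, y_dev, x, best_k) for x in X_test]
matrix = confusion_matrix(y_test, y_pred)
test_accuracy = (matrix[0][0] + matrix[1][1]) / len(y_test)
print(best_k, matrix, test_accuracy)
```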

Observations

Kd-tree time complexity (per nearest-neighbour query)

-> When d (number of features) is small: O(log n)

-> When d is not small: O(2^d log n)
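
To make the complexity figures concrete, here is a minimal kd-tree sketch with a single nearest-neighbour query, checked against a brute-force scan. It is a teaching example under assumptions (dictionary nodes, random 2-D points), not the implementation used for the results in this README.

```python
import math
import random


def build_kdtree(points, depth=0):
    """Recursively split the points on the median of one axis per tree level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    dist = math.dist(node["point"], target)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer to the target than the best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(500)]
query = (0.5, 0.5)

tree = build_kdtree(points)
found = nearest(tree, query)
brute = min(points, key=lambda p: math.dist(p, query))
print(found, brute)
```

With few features the pruning check skips most of the tree, which gives the O(log n) behaviour. As d grows, the splitting plane is almost always closer than the current best match, so both sides get searched far more often and the cost grows towards the O(2^d log n) figure.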

K-NN interpretability decreases as dimensionality increases
K-NN uses Minkowski distance (especially Euclidean distance), which becomes less meaningful in higher dimensions
Curse of Dimensionality [https://en.wikipedia.org/wiki/Curse_of_dimensionality]

-> Hughes Phenomenon

-> As d increases Overfitting increases (generally)

-> Distance functions (Euclidean) lose contrast: the nearest and farthest points end up almost equally far away
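
The effect on Euclidean distance can be observed with a small experiment on uniformly random points. The function name `relative_contrast` and the sample sizes are assumptions chosen for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrasts.items():
    print(d, round(value, 3))
```

A large value means the nearest neighbour is much closer than the farthest point, so "nearest" carries information. As d grows the value shrinks towards zero, which is one reason K-NN degrades on high-dimensional data.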

Genetic Algorithms and Bio-inspired algos

As t-SNE and PCA do not take the output label into account during dimensionality reduction, other techniques are more suitable here
This algorithm short-listed only 7 features (a major reduction)
This reduces the dimensionality and also lowers the time complexity when productionizing the model
Accuracy: 96.89%
Code & Implementation
Confusion Matrix:
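
The idea of selecting features with a genetic algorithm can be sketched as below. This is a generic illustration, not the code behind the results above: the operators (keep the fitter half, single-point crossover, bit-flip mutation) and all parameters are assumptions, and `toy_fitness` with its `INFORMATIVE` set is a made-up stand-in. In a real pipeline the fitness of a mask would be something like the validation accuracy of a classifier trained on the selected columns.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=30, generations=40,
                              mutation_rate=0.05, seed=0):
    """Evolve boolean feature masks; `fitness` maps a mask to a score (higher is better)."""
    rng = random.Random(seed)
    population = [
        [rng.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation keeps some diversity in the population.
            child = [(not gene) if rng.random() < mutation_rate else gene for gene in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical toy problem: 12 features, of which these 4 indices are useful.
INFORMATIVE = {1, 4, 6, 9}


def toy_fitness(mask):
    """Reward useful features and penalise every extra feature that is kept."""
    useful = sum(1 for i, keep in enumerate(mask) if keep and i in INFORMATIVE)
    noise = sum(1 for i, keep in enumerate(mask) if keep and i not in INFORMATIVE)
    return useful - 0.5 * noise


best_mask = genetic_feature_selection(12, toy_fitness)
selected = [i for i, keep in enumerate(best_mask) if keep]
print(selected, toy_fitness(best_mask))
```

Because the fitness function sees the labels (through the classifier score in a real setup), this kind of selection is supervised, unlike t-SNE or PCA.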

Bayes

Coming soon. I will be adding interesting results from other algorithms as soon as I find some time :)

