# Real Time Filtering of Malicious URLs

The main aim of this project is to build a lightweight system that filters out
harmful URLs using machine learning algorithms and methodologies.

The project has three main modules:


1.   Pre-Processing
2.   Building Machine Learning Models and Testing
3.   Real Time Filtering



# Loading the dataset
The dataset used for this project is taken from Kaggle: https://www.kaggle.com/xwolf12/malicious-and-benign-websites

In [None]:
#importing the pandas and numpy libraries

#pandas is used for data processing
import pandas as pd

#numpy is used for mathematical operations
import numpy as np

In [None]:
#loading the dataset
df = pd.read_csv("/content/dataset.csv")

In [None]:
#viewing the data
df.head(10)

# Pre-Processing
The pre-processing steps are:

1.   Feature Selection: Only 6 features were selected for training the models. They were chosen based on their covariance with the output label.
2.   Tokenization: String values are converted to numeric tokens.
3.   Normalization: The numeric features span very different ranges, which can slow down or destabilize training for scale-sensitive models. Standardization rescales each feature to zero mean and unit variance.
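
The notebook does not show how the covariance-based ranking in step 1 was computed. A minimal sketch of one way to do it follows, assuming Pearson correlation (covariance rescaled to the range [-1, 1]) as the score. The helper name `rank_by_label_correlation` and the toy data are hypothetical and are not part of the original project.

```python
import pandas as pd

def rank_by_label_correlation(frame: pd.DataFrame, label: str) -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with `label`."""
    # keep only numeric columns; string columns would need tokenizing first
    numeric = frame.select_dtypes(include="number")
    # correlation of every numeric column with the label, excluding the label itself
    corr = numeric.corr()[label].drop(label)
    return corr.abs().sort_values(ascending=False)

# toy data for illustration only, not taken from the Kaggle dataset
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [5, 1, 4, 2, 6, 3],
    "Type": [0, 0, 0, 1, 1, 1],
})
ranking = rank_by_label_correlation(toy, "Type")
print(ranking)
```

On the real data this would be called as `rank_by_label_correlation(df, "Type")`, and the top-scoring columns would be candidates for `df_part`. String columns such as `SERVER` are skipped by this sketch unless they are tokenized beforehand.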



## Feature Selection

In [None]:
#getting all the column names
df.columns

In [None]:
#total number of columns in the dataframe
len(df.columns)

In [None]:
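#selecting particular columns
#(.copy() gives an independent dataframe, so later column assignments do not raise SettingWithCopyWarning)
df_part = df[['NUMBER_SPECIAL_CHARACTERS','SERVER','CONTENT_LENGTH','WHOIS_STATEPRO','DIST_REMOTE_TCP_PORT','REMOTE_IPS','Type']].copy()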

In [None]:
#viewing the modified dataset
df_part.head(5)

## Tokenization

In [None]:
#mapping the server name strings to integers or tokens

#getting the unique server names
server_names = df.SERVER.unique()

#creating a map between the server name and token
server_names_map = {k:v for v,k in enumerate(server_names)}

In [None]:
#getting the unique state names
state_names = df.WHOIS_STATEPRO.unique()

#creating a map between the state name and token
state_names_map = {k:v for v,k in enumerate(state_names)}

In [None]:
#applying the server map to the dataframe
df_part['SERVER'] = df_part['SERVER'].apply(lambda x: server_names_map[x])

In [None]:
#applying the state map to the dataframe
df_part['WHOIS_STATEPRO'] = df_part['WHOIS_STATEPRO'].apply(lambda x: state_names_map[x])

In [None]:
#making the NaN values as 0
df_part['CONTENT_LENGTH'] = df_part['CONTENT_LENGTH'].fillna(0)

#making the content_length column as type int
df_part['CONTENT_LENGTH'] = df_part['CONTENT_LENGTH'].astype('int')

In [None]:
#viewing the updated dataset
df_part.head(5)

In [None]:
#Splitting into X and y

X = df_part.iloc[:, :-1]
y = df_part.iloc[:, -1]

In [None]:
from sklearn.model_selection import train_test_split

#splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Normalization

In [None]:
from sklearn.preprocessing import StandardScaler
#scaling the values in X_train
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
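
To make the scaling step concrete, the sketch below reproduces with NumPy what `StandardScaler` computes: each column is shifted by the training mean and divided by the training standard deviation, and the same training statistics are reused for the test rows. The arrays are made-up values for illustration only.

```python
import numpy as np

# toy feature matrices: two columns with very different ranges
train = np.array([[10.0, 1000.0], [20.0, 3000.0], [30.0, 5000.0]])
test = np.array([[20.0, 1000.0]])

# statistics are estimated on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# the same mean and std are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
print(test_scaled)
```

Fitting the scaler on the training set alone, as the cell above does, keeps information about the test set out of the training process.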

# Building Machine Learning models and Testing
Three machine learning models were developed so that their results can be compared:


1.   Multi Layer Perceptron
2.   Random Forest Classifier
3.   Support Vector Machine



## Random Forest Classifier

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
from sklearn.ensemble import RandomForestClassifier

#defining the Random Forest model with 30 decision trees
classifier = RandomForestClassifier(n_estimators=30, random_state=42)

#training the model with the training set
classifier.fit(X_train, y_train)

#predicting the outputs on the test set
y_pred = classifier.predict(X_test)

### Evaluation Metrics

In [None]:
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
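
`confusion_matrix` is imported above but not used. The sketch below shows how it could complement the accuracy score by separating false alarms from missed detections. The labels are toy values, and treating label 1 as the malicious class is an assumption based on the dataset description.

```python
from sklearn.metrics import confusion_matrix

# toy labels for illustration only (1 is assumed to mean malicious)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# for binary labels the flattened matrix is ordered tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of flagged links that are really malicious
recall = tp / (tp + fn)     # share of malicious links that were flagged

print(tn, fp, fn, tp)
print(precision, recall)
```

In the notebook the equivalent call would be `confusion_matrix(y_test, y_pred)`. For a URL filter, the false negatives (malicious links that pass through) are usually the costlier error, which accuracy alone does not reveal.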

In [None]:
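from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt


In [None]:
#plotting the PR curve
#(plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay replaces it)
disp_rf = PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)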


## Multi Layer Perceptron

In [None]:
#importing the libraries
import tensorflow as tf
import keras

In [None]:
#defining the ML model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(10, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [None]:
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=adam_optimizer, loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
history = model.fit(X_train, y_train, batch_size=32, epochs=30, verbose=1, validation_split=0.1, shuffle=True)

In [None]:
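#predicting on the test set
#(predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead)
y_pred_mlp = (model.predict(X_test) > 0.5).astype("int32").ravel()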

### Evaluation Metrics

In [None]:
print(accuracy_score(y_test, y_pred_mlp))

In [None]:
print(classification_report(y_test, y_pred_mlp))

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
#predicted probabilities of the positive class, flattened to 1-D for scikit-learn
mlp_pred_proba = model.predict(X_test).ravel()

In [None]:
prec, recall , _ = precision_recall_curve(y_test, mlp_pred_proba)

In [None]:
import matplotlib.pyplot as plt
plt.plot(recall, prec)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("MLP Precision vs Recall")
plt.show()
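
A precision-recall curve is hard to compare across models by eye. One option, sketched below on toy values, is to summarize each curve with scikit-learn's `average_precision_score`. The labels and scores here are made up for illustration and are not results from this project.

```python
from sklearn.metrics import average_precision_score

# toy labels and predicted scores for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# average precision: precision at each threshold, weighted by the gain in recall
ap = average_precision_score(y_true, scores)
print(round(ap, 3))
```

In the notebook the corresponding call for the MLP would be `average_precision_score(y_test, mlp_pred_proba)`, which gives a single number that can be set against the same score for the other two models.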

## Support Vector Machine

In [None]:
from sklearn import svm

In [None]:
#defining the SVM with the rbf kernel
clf = svm.SVC(kernel='rbf')

#training
clf.fit(X_train, y_train)

In [None]:
#predicting on the test set
y_pred_svm = clf.predict(X_test)

### Evaluation Metrics

In [None]:
print(accuracy_score(y_test, y_pred_svm))

In [None]:
print(classification_report(y_test, y_pred_svm))

In [None]:
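from sklearn.metrics import PrecisionRecallDisplay

#plotting the PR curve
#(plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay replaces it)
disp_svm = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)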

# Real Time Filtering

This short example shows how the system would be used in real time. The Random Forest classifier had the highest accuracy of the three models, so it is used for the demo.

In [None]:
def pred_link(server="None", whois_state="None",
              nos_spl_chars=0, content_len=0,
              remote_ips=0, remote_tcp=0):
  #tokenizing the server name
  server_token = server_names_map[server]

  #tokenizing the state name
  whois_state_token = state_names_map[whois_state]

  #constructing the input array in the same column order used for training
  inp_arr = [[nos_spl_chars, server_token, content_len, whois_state_token, remote_tcp, remote_ips]]

  #scaling the input with the scaler fitted on the training set
  inp_arr = sc.transform(inp_arr)

  #final result: in this dataset, Type 1 marks a malicious site and 0 a benign one
  result = classifier.predict(inp_arr)
  if result[0] == 1:
    return "The link is not safe"

  return "The link is safe"
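
In live traffic, `pred_link` will meet server or state strings that never appeared in the training data, and a plain dictionary lookup then raises `KeyError`. One possible way to handle this, sketched below, is to fall back to the token of a placeholder value. The helper `lookup_token`, the fallback key `"None"`, and the toy map are assumptions for illustration; the original notebook does not define them.

```python
def lookup_token(token_map: dict, value: str, fallback: str = "None") -> int:
    """Return the token for `value`, or the fallback token if `value` is unseen."""
    if value in token_map:
        return token_map[value]
    # unseen value: reuse the token of the placeholder entry
    return token_map[fallback]

# toy map for illustration; in the notebook this would be server_names_map
toy_server_map = {"nginx": 0, "None": 1, "Apache": 2}

known = lookup_token(toy_server_map, "nginx")
unseen = lookup_token(toy_server_map, "some-new-server")
print(known, unseen)
```

Reusing the `"None"` token treats unknown values the same as missing ones. A dedicated unknown token would keep the two apart, but it would have to be added to the maps before the models are trained.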



In [None]:
#actual label 0
pred_link("Microsoft-HTTPAPI/2.0","Arizona", 10, 324, 2,13)

In [None]:
#actual label 0
pred_link("nginx","PANAMA", 11, 0, 14,46)

In [None]:
#actual label 1
pred_link("Apache/2.2.14 (FreeBSD) mod_ssl/2.2.14 OpenSSL/0.9.8y DAV/2 PHP/5.2.12 with Suhosin-Patch",\
          "Utah", 10, 2516, 2,0)

In [None]:
#actual label 1
pred_link("nginx", "Novosibirskaya obl.",7,686,2,0 )

In [None]:
#actual label 1
pred_link("nginx/1.10.1", "None", 5, 0, 0,0)

In [None]:
#actual label 0
pred_link("None", "None", 7, 13716, 8, 6)