
<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

## Deep Learning with RNNs

This notebook shows three commonly used neural network architectures to detect malicious URLs using **RNNs**. 

The task is to build a model that will be able to classify a URL as *malicious* or *benign*. 

Libraries:
- [Keras](https://keras.io/) is used as a high-level API on top of the [TensorFlow](https://www.tensorflow.org/) backend
- [string.printable](https://docs.python.org/3/library/string.html#string.printable) is a string constant containing all ASCII characters considered printable. It is the concatenation of string.digits, string.ascii_letters, string.punctuation and string.whitespace.
- pandas
- numpy
- json
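
As a quick check of the description of string.printable, the following sketch confirms how the constant is composed. It uses only the standard library, and the variable names are chosen for this illustration.

```python
import string

# string.printable is the concatenation of these four constants, in this order.
parts = string.digits + string.ascii_letters + string.punctuation + string.whitespace
is_concatenation = (string.printable == parts)

# Number of distinct symbols available for encoding URLs.
n_symbols = len(string.printable)

print(is_concatenation, n_symbols)
```

Because the notebook adds 1 to each index, the largest possible code is 100. It belongs to the form-feed character at the end of string.whitespace, which is unlikely to appear in a URL. An embedding that has to cover every code plus the padding value 0 would need 101 entries.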

In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
import numpy as np
import re, os
from string import printable
from sklearn import model_selection

#import gensim
import tensorflow as tf
import keras

import warnings
warnings.filterwarnings("ignore")

## Load raw URL data
Extract the csv file from 
```
../data/url_data_mega_deep_learning.csv.zip
```
Then you can load the csv using the cell below. 

In [3]:
## Load data URL

DATA_HOME = '../data/'
df = pd.read_csv(DATA_HOME + 'url_data_mega_deep_learning.csv')
df.sample(n=25).head(5) 

Unnamed: 0,url,isMalicious
12000,niketrainersuk.com.co/air-force-1-low-trainers...,0
109569,rtdesigns.ca/lord123/home,1
158719,xpxtupcje.pl/linuxsucks.php,1
70012,facebook.com/pages/WE-HATE-ASTON-VILLA/1686557...,0
20007,chinahr.com/beijing/jobs/35343,0


## Pre-processing URL data
**Step 1**: Convert each raw URL string to integers. 

Each character of a URL that appears in **printable** is encoded as its position in that string, obtained with **printable.index()**, plus 1 so that 0 stays free for padding. Characters that are not in **printable** are dropped.


In [4]:
url_int_tokens = [[printable.index(x) + 1 for x in url if x in printable] for url in df.url]

# print out a few of these encoded characters
url_int_tokens[0][0:10]

[29, 25, 24, 17, 22, 35, 28, 19, 13, 29]
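
To make the encoding concrete, here is a minimal sketch of Step 1 for a single string. The helper names `encode_url` and `decode_tokens` are hypothetical and not part of the notebook; the logic mirrors the list comprehension above.

```python
from string import printable

def encode_url(url):
    """Map each printable character to its 1-based position in `printable`.

    Characters outside `printable` are silently dropped; 0 is left unused
    so that it can serve as the padding value later.
    """
    return [printable.index(ch) + 1 for ch in url if ch in printable]

def decode_tokens(tokens):
    """Invert encode_url, skipping padding zeros."""
    return "".join(printable[t - 1] for t in tokens if t > 0)

tokens = encode_url("songlyrics")
print(tokens)                 # [29, 25, 24, 17, 22, 35, 28, 19, 13, 29]
print(decode_tokens(tokens))  # songlyrics
```

The ten integers printed by the cell above decode to `songlyrics`, which is a convenient sanity check on the mapping.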

**Step 2:** Cut each URL at max_len characters, or pad it with zeros if it is shorter.

This is needed because all inputs to a neural network must have the same length.

Use the keras.preprocessing.sequence.pad_sequences function for this task.

In [5]:
max_len=75
X = keras.preprocessing.sequence.pad_sequences(url_int_tokens, maxlen=max_len)
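
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The sketch below imitates that default behaviour in plain NumPy so the effect is easy to see. `pad_pre` is a hypothetical stand-in, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the LAST maxlen items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]              # truncating='pre': drop leading items
        out[row, -len(kept):] = kept      # padding='pre': zeros go in front
    return out

padded = pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 1 2 3]
#  [3 4 5 6]]
```

One consequence is that a URL longer than `max_len` characters loses its beginning (typically the domain) rather than its tail. If the domain turns out to matter more for the task, passing `truncating='post'` is an option worth testing.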

**Step 3:** Extract the labels from the pandas dataframe and convert them to a numpy array.

In [6]:
targets = np.array(df.isMalicious)

print('Dimensions of Features: ', X.shape,'\nDimensions of Targets: ', targets.shape)

Dimensions of Features:  (194798, 75) 
Dimensions of Targets:  (194798,)


## Train/Validation/Test Split

In [7]:
split_ratios = (0.7, 0.15, 0.15)  # Training, Validation, Test


X_train, X_temp, y_train, y_temp = model_selection.train_test_split(X, targets, test_size=(1 - split_ratios[0]), stratify=targets, random_state=42)
X_val, X_test, y_val, y_test = model_selection.train_test_split(X_temp, y_temp, stratify=y_temp, test_size=split_ratios[2] / (split_ratios[1] + split_ratios[2]), random_state=42)

y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
y_val = keras.utils.to_categorical(y_val)
y_val.shape

(29220, 2)
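
`keras.utils.to_categorical` turns the 0/1 labels into one-hot rows, which is why `y_val` has two columns. A minimal NumPy equivalent for the binary case (an illustration only, not the Keras code, with toy labels made up for the example) looks like this:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])   # toy labels: 0 = benign, 1 = malicious

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype="float32")[labels]

print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]

# Column 1 therefore carries the "malicious" indicator.
malicious_column = one_hot[:, 1]
```

This column layout is the reason the prediction code in this notebook reads index 1 of the model output.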

In [8]:
X_val.shape

(29220, 75)
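
The shapes above can be checked by hand. The sketch below reproduces the split sizes from the 194,798 rows reported earlier, assuming scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the other side of the split.

```python
import math

n_samples = 194798
split_ratios = (0.7, 0.15, 0.15)  # training, validation, test

# First split: hold out roughly 30% as a temporary set.
n_temp = math.ceil((1 - split_ratios[0]) * n_samples)
n_train = n_samples - n_temp

# Second split: divide the temporary set between validation and test.
test_share = split_ratios[2] / (split_ratios[1] + split_ratios[2])  # 0.5
n_test = math.ceil(test_share * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 136358 29220 29220
```

The validation count matches the 29,220 rows shown in `X_val.shape`.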

## Architecture for an LSTM

In [9]:
model_name = 'lstm_URL_Classifier'

In [10]:
final_dense_units = 2
max_len=75
emb_dim=32
max_vocab_len=100
lstm_output_size=32
#W_regularizer=keras.regularizers.l2(1e-4)

lstm_model = keras.Sequential(name=model_name)

lstm_model.add(keras.layers.Input(shape=(max_len,), dtype='int32'))
lstm_model.add(keras.layers.Embedding(input_dim=max_vocab_len, output_dim=emb_dim))
lstm_model.add(keras.layers.LSTM(lstm_output_size))
lstm_model.add(keras.layers.Dropout(0.5))
lstm_model.add(keras.layers.Dense(final_dense_units, activation='sigmoid'))

# The legacy `decay` argument is omitted: recent Keras versions ignore it with a warning.
lstm_optimizer = keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

lstm_model.compile(optimizer=lstm_optimizer,  
                   loss='binary_crossentropy', 
                   metrics=['accuracy'] 
                  # metrics=keras.metrics.F1Score(threshold=0.5)]
                  )
lstm_model.summary()
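
`lstm_model.summary()` lists the trainable parameters per layer. As a cross-check, the counts can be derived by hand from the hyperparameters above using the standard formulas for embedding, LSTM and dense layers. The `*_params` variable names are only for this illustration.

```python
max_vocab_len = 100
emb_dim = 32
lstm_output_size = 32
final_dense_units = 2

# Embedding: one emb_dim-sized vector per vocabulary entry.
embedding_params = max_vocab_len * emb_dim

# LSTM: four weight blocks (input, forget and output gates plus the candidate
# cell state), each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_output_size * (emb_dim + lstm_output_size) + lstm_output_size)

# Dense: weight matrix plus one bias per output unit.
dense_params = lstm_output_size * final_dense_units + final_dense_units

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 3200 8320 66 11586
```

The Dropout layer contributes no parameters, so the total comes from the three layers listed.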

## Train Model

In [12]:
nb_epoch = 3
batch_size = 32

CallBack = [
        keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5, mode='max', verbose=1),
]
lstm_model.fit(X_train, 
               y_train, 
               epochs=nb_epoch,
               batch_size=batch_size,
              validation_data = (X_val, y_val),
               verbose = 1, 
               callbacks= CallBack
              )

Epoch 1/3
4262/4262 ━━━━━━━━━━━━━━━━━━━━ 233s 53ms/step - accuracy: 0.7337 - loss: 0.5374 - val_accuracy: 0.8271 - val_loss: 0.3938
Epoch 2/3
4262/4262 ━━━━━━━━━━━━━━━━━━━━ 230s 54ms/step - accuracy: 0.8283 - loss: 0.3991 - val_accuracy: 0.8359 - val_loss: 0.3753
Epoch 3/3
4262/4262 ━━━━━━━━━━━━━━━━━━━━ 225s 53ms/step - accuracy: 0.8374 - loss: 0.3829 - val_accuracy: 0.8414 - val_loss: 0.3612


<keras.src.callbacks.history.History at 0x286bd58e000>
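
Two details of this run can be verified with a few lines of plain Python: the number of batches per epoch, and whether the `EarlyStopping` callback could have fired. The function `stopped_epoch` is a simplified, hypothetical model of patience-based stopping on a metric to maximise, not the Keras implementation. The training-set size of 136,358 is an assumption derived from the 70% share of the 194,798 rows.

```python
import math

# Batches per epoch: training rows divided by batch size, rounded up.
n_train, batch_size = 136358, 32
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)  # 4262

def stopped_epoch(metric_history, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(metric_history, start=1):
        if value > best:
            best, wait = value, 0     # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

val_accuracy = [0.8271, 0.8359, 0.8414]   # values from the log above
print(stopped_epoch(val_accuracy, patience=5))  # None
```

With only 3 epochs and a patience of 5, the callback cannot trigger in this configuration. It only becomes relevant if `nb_epoch` is raised above the patience value.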

In [19]:
loss, accuracy = lstm_model.evaluate(X_test, y_test, verbose=1)

print('\nFinal Test Accuracy', accuracy, '\n')

914/914 ━━━━━━━━━━━━━━━━━━━━ 17s 19ms/step - accuracy: 0.8401 - loss: 0.3636

Final Test Accuracy 0.8402806520462036 



In [124]:
lstm_model.save("lstm_URL_Classifier.keras")

## Making a prediction

In [20]:
test_url_mal = "naureen.net/etisalat.ae/index2.php"
test_url_benign = "sixt.com/php/reservation?language=en_US"

url = test_url_mal

In [21]:
# Step 1: Encode the raw URL as a list of lists, keeping only characters found in "printable" and mapping each one to an integer
url_int_tokens = [[printable.index(x) + 1 for x in url if x in printable]]

# Step 2: Cut URL string at max_len or pad with zeros if shorter
max_len=75
processed_url = keras.preprocessing.sequence.pad_sequences(url_int_tokens, maxlen=max_len)

In [22]:
target_proba = lstm_model.predict(processed_url, batch_size=5)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step


In [23]:
def threshold_result(proba):
    if proba > 0.5:
        return "MALICIOUS!"
    else:
        return "benign"

In [24]:
print("Test URL:\n", url, "\nis", threshold_result(target_proba[0][1]))

Test URL:
 naureen.net/etisalat.ae/index2.php 
is MALICIOUS!
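
The prediction steps above can be gathered into one helper. This is a sketch under assumptions: `classify_url` and its `predict_fn` parameter are hypothetical names, `predict_fn` stands for any callable that behaves like `lstm_model.predict` (a batch of padded integer rows in, a batch of two-column scores out), and `dummy_predict` exists only so the example runs without a trained model.

```python
import numpy as np
from string import printable

def classify_url(url, predict_fn, max_len=75, threshold=0.5):
    """Encode, pad, score and threshold a single URL."""
    tokens = [printable.index(ch) + 1 for ch in url if ch in printable]
    tokens = tokens[-max_len:]                      # keep the last max_len codes
    row = np.zeros((1, max_len), dtype="int32")
    if tokens:
        row[0, -len(tokens):] = tokens              # zeros are padded in front
    malicious_score = float(predict_fn(row)[0][1])  # column 1 = malicious
    label = "MALICIOUS!" if malicious_score > threshold else "benign"
    return label, malicious_score

def dummy_predict(batch):
    """Placeholder scorer (assumption): returns a fixed score for every row."""
    return np.tile([0.2, 0.8], (len(batch), 1))

print(classify_url("naureen.net/etisalat.ae/index2.php", dummy_predict))
```

In the notebook, passing `lstm_model.predict` as `predict_fn` would score the URL with the trained network instead of the placeholder.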
