<a href="https://colab.research.google.com/github/huanyanwei/ai-projects/blob/main/Web_Logs_Classifier_Overall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Logs Classifier

**Objective: To distinguish web server logs generated by normal web surfing from logs generated by an attacker's malicious actions.**





Input: 

The classifier should detect whether a log entry is potentially malicious and, if so, which kind of attack it indicates.


# Key Info

Source of web server logs: GET requests logged by the Apache web server hosting Damn Vulnerable Web App (DVWA).

## Log Generation

Logs are generated mainly through fuzzing with the Turbo Intruder extension in Burp Suite.

1.   **Normal Logs (normal)**

> * *Injection path:  `/DVWA/vulnerabilities/fi/?page=xxx`*

>> `/usr/share/wordlists/wfuzz/general/medium.txt`

>> `/usr/share/wordlists/wfuzz/others/names.txt`

> * By manually clicking around the web app

2.  **Directory Traversal (dir)** 

> * *Injection path:  `/DVWA/vulnerabilities/fi/?page=xxx`*

>> `/usr/share/wordlists/wfuzz/Injections/Traversal.txt` from Kali

>> `/Directory Traversal/Intruder/deep_traversal.txt` from https://github.com/swisskyrepo/PayloadsAllTheThings/tree/master/Directory%20Traversal

>> (Not Tested) Other files from https://github.com/swisskyrepo/PayloadsAllTheThings/tree/master/Directory%20Traversal

>> (Not Tested) Files from `SecLists/Fuzzing/LFI/` from https://github.com/danielmiessler/SecLists

3.   **Cross Site Scripting (xss)**

> * *Injection path: `/DVWA/vulnerabilities/xss_r/?name=xxx`*

>> `/xss-payload-list-master/Intruder/xss-payload.txt` from https://github.com/payloadbox/xss-payload-list

>> `README.md` from https://github.com/payloadbox/xss-payload-list/blob/master/README.md

>> `/usr/share/wordlists/wfuzz/Injections/XSS.txt` from Kali
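
As a rough illustration of how each wordlist entry turns into one logged request, the sketch below builds request paths for the injection points listed above. It is not the Turbo Intruder script used for this project: `INJECTION_PATHS`, `build_request_paths`, and the sample payload are hypothetical names and data, and percent-encoding every payload is an assumption (the fuzzing tool may encode payloads differently).

```python
from urllib.parse import quote

# Injection points taken from the list above.
INJECTION_PATHS = {
    "normal": "/DVWA/vulnerabilities/fi/?page=",
    "dir": "/DVWA/vulnerabilities/fi/?page=",
    "xss": "/DVWA/vulnerabilities/xss_r/?name=",
}


def build_request_paths(category, payloads):
    """Return one request path per non-empty payload line."""
    prefix = INJECTION_PATHS[category]
    # Assumption: every payload is percent-encoded before it is sent.
    return [prefix + quote(p.strip(), safe="") for p in payloads if p.strip()]


# Hypothetical payload lines standing in for a wordlist file.
sample_payloads = ["../../etc/passwd\n", "\n"]
paths = build_request_paths("dir", sample_payloads)
print(paths)
```

Each returned path corresponds to one GET request, and therefore to one line in the Apache access log.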

## Logs Labelling

* After the logs have been generated, they are extracted from the web server and labelled according to the category they fall into.

* **0 means that the log entry IS NOT in the category.**
* **1 means that the log entry IS in the category.**

* Log entries are then shuffled and saved into `train.csv` and `test.csv` files.
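
The labelling and splitting steps above can be sketched as follows. This is a minimal sketch using only the standard library, not the script actually used for this project: the function names, the 0.2 test fraction, and the fixed seed are assumptions, while the `url`, `normal`, `xss`, and `dir` column names follow the CSV files used later in this notebook.

```python
import csv
import random

CATEGORIES = ["normal", "xss", "dir"]


def label_entries(entries_by_category):
    """Attach one 0/1 column per category to every log entry."""
    rows = []
    for category, entries in entries_by_category.items():
        for entry in entries:
            row = {"url": entry}
            for name in CATEGORIES:
                row[name] = 1 if name == category else 0
            rows.append(row)
    return rows


def shuffle_and_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, rows):
    """Write labelled rows with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url"] + CATEGORIES)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical placeholder entries; real entries are full access-log lines.
rows = label_entries({
    "normal": ["log line A", "log line B", "log line C"],
    "xss": ["log line D"],
    "dir": ["log line E"],
})
train_rows, test_rows = shuffle_and_split(rows)
# write_csv("train.csv", train_rows)
# write_csv("test.csv", test_rows)
```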

# Sample Code 

**Using TensorFlow to classify labelled web logs (normal vs XSS vs dir traversal)**

## Import libraries

In [None]:
%tensorflow_version 2.x

In [None]:
import pandas as pd
import numpy as np

import tensorflow as tf

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence

from tensorflow.keras.optimizers import Adam

In [None]:
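import tensorflow as tf

# Enable memory growth on every visible GPU.
# Looping avoids an IndexError when the runtime has no GPU.
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)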

## Data preparation

The training data is used to fit the model.
The test data is held out and used to check the model once training is complete.

In [None]:
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")

train_df.head()

| Unnamed: 0.1 | Unnamed: 0 | url | normal | xss | dir |
|---|---|---|---|---|---|
| 0 | 2631 | 127.0.0.1 - - [03/May/2021:10:05:27 -0400] "GE... | 0 | 0 | 1 |
| 1 | 16052 | 127.0.0.1 - - [03/May/2021:03:22:18 -0400] "GE... | 1 | 0 | 0 |
| 2 | 17721 | 127.0.0.1 - - [03/May/2021:03:17:50 -0400] "GE... | 1 | 0 | 0 |
| 3 | 4232 | 127.0.0.1 - - [03/May/2021:03:11:22 -0400] "GE... | 1 | 0 | 0 |
| 4 | 20011 | 127.0.0.1 - - [03/May/2021:03:14:42 -0400] "GE... | 1 | 0 | 0 |


In [None]:
# To extract all url entries
# X_train will be the input of the training later

X_train = train_df["url"].values
X_test = test_df["url"].values

X_train[0]

'127.0.0.1 - - [03/May/2021:10:05:27 -0400] "GET /DVWA/vulnerabilities/fi/?page=..%25c1%259c..%25c1%259c..%25c1%259c..%25c1%259c..%25c1%259c..%25c1%259cetc%25c1%259passwd HTTP/1.1" 200 1350 "http://127.0.0.1/DVWA/vulnerabilities/fi/?page=include.php" "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"'
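
The entry above appears to follow Apache's combined log format (host, identity, user, timestamp, request line, status, response size, referer, user agent). As a sketch of what each entry contains, the snippet below splits that sample line into named fields and decodes the `page` parameter. The regular expression and the name `LOG_PATTERN` are illustrative assumptions, not part of the notebook's pipeline, which feeds the raw line straight to the tokenizer.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: host ident user [time] "method target protocol" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The sample entry printed by X_train[0] above.
line = (
    '127.0.0.1 - - [03/May/2021:10:05:27 -0400] '
    '"GET /DVWA/vulnerabilities/fi/?page=..%25c1%259c..%25c1%259c..%25c1%259c'
    '..%25c1%259c..%25c1%259c..%25c1%259cetc%25c1%259passwd HTTP/1.1" 200 1350 '
    '"http://127.0.0.1/DVWA/vulnerabilities/fi/?page=include.php" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"'
)

fields = LOG_PATTERN.match(line).groupdict()

# Decode the query string of the request target once.
query = parse_qs(urlsplit(fields["target"]).query)
page = query["page"][0]

print(fields["method"], fields["status"], fields["size"])
print(page)
```

Only the request target differs meaningfully between a normal entry and an attack entry here; the host, referer prefix, and user agent are the same across the samples shown in this notebook.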

In [None]:
# Extract the labels of the respective url entries and store them in y_train

y_train = train_df[['normal','xss','dir']].values

y_train[0]

array([0, 0, 1])

Each distinct word in the log entries must be mapped to an integer token before the entries can be fed to the model.

In [None]:
# create the tokenizer
t = Tokenizer()

# Combine the train and test entries so the tokenizer sees every word in the data
all_comments = list(X_train) + list(X_test)
print("There are a total of", len(all_comments), "logs in all of the data")

# fit the tokenizer on the documents
t.fit_on_texts(all_comments)

# summarize what was learned
total_num_of_words = len(t.word_counts)
print("There are a total of", total_num_of_words, "distinct words in all of the data")

There are a total of 20526 logs in all of the data
There are a total of 11702 distinct words in all of the data
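
To make the tokenisation step more concrete, here is a pure-Python approximation of what the Keras `Tokenizer` does with its default settings: lowercase the text, replace the default filter characters with spaces, split on whitespace, and number words from 1 in order of decreasing frequency. The function names are hypothetical and the sample string is a shortened, illustrative request rather than a real log entry; the real class has more options and edge cases than this sketch covers.

```python
from collections import Counter

# Default `filters` argument of the Keras Tokenizer (punctuation, tab, newline).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in KERAS_DEFAULT_FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(text, word_index):
    """Replace each known word by its integer token."""
    return [word_index[w] for w in split_words(text) if w in word_index]


sample = "GET /DVWA/vulnerabilities/fi/?page=include.php HTTP/1.1"
word_index = build_word_index([sample])
print(split_words(sample))
print(encode(sample, word_index))
```

One consequence worth keeping in mind: with the default filters, characters such as `.`, `/`, `<`, `>` and `%` act as separators, so the punctuation of a payload never becomes a token itself; only the fragments between those characters do.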


In [None]:
# Encode all the strings in X_train to tokens
X_train_encoded = t.texts_to_sequences(X_train)

# Pad the sequences so that every entry has the same length (50 in this case)
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 50

# Zeros are added at the start of any entry shorter than max_length
X_train_encoded_padded = pad_sequences(X_train_encoded, maxlen=max_length, padding='pre')
X_train_encoded_padded[1]

array([   0,    0,    0,    5,    1,    1,    2,    3,   14,   11,    3,
         56,   73,   12,   13,    4,    6,    7,    8,   15, 2234,    9,
          2,    2,   26,   32,    9,    5,    1,    1,    2,    4,    6,
          7,    8,   17,   16,    1,   18,   21,   22,   19,   23,   10,
          1,   20,   24,   25,   10,    1], dtype=int32)
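
The effect of `pad_sequences` with `padding='pre'` can be sketched in plain Python. The helper below (`pad_pre` is a hypothetical name) also mirrors the default `truncating='pre'` behaviour, under which an entry longer than `maxlen` keeps its last `maxlen` tokens and loses the ones at the start.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` tokens."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


short = [5, 1, 2]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short, 5))  # zeros are added in front
print(pad_pre(long, 5))   # the first two tokens are dropped
```

For log entries this means that, with `max_length = 50`, long entries lose their leading tokens (IP address, timestamp) first, while the tail of the line (referer and user agent) is always kept.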

In [None]:
y_train.shape

(16420, 3)

## TensorFlow Model Building

**The model below is copied from an NLP text-classification example and has not been tuned; how best to improve it is still an open question.**

For guidance on choosing loss and activation functions, see https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8
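
The model in this notebook pairs a sigmoid output layer with `binary_crossentropy`, which scores each of the three label columns independently. A commonly used alternative for mutually exclusive classes is softmax with categorical cross-entropy. The sketch below compares the two on made-up numbers using only the standard library; the logits are hypothetical, and the functions are simplified stand-ins for the Keras implementations, not the Keras code itself.

```python
import math


def sigmoid(logits):
    """Independent probability per output."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Probabilities that compete and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of the per-output binary cross-entropy terms."""
    terms = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        terms.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(terms) / len(terms)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability of the true class."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


logits = [2.0, -1.0, -3.0]   # hypothetical raw scores for normal, xss, dir
label = [1, 0, 0]            # a "normal" entry

sig = sigmoid(logits)
soft = softmax(logits)
print(sig, sum(sig), binary_crossentropy(label, sig))
print(soft, sum(soft), categorical_crossentropy(label, soft))
```

Sigmoid outputs do not have to sum to 1, so the same entry could in principle be flagged under several categories or under none; softmax always picks exactly one. Which behaviour is preferable depends on whether a log entry should be allowed to belong to more than one attack category.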

In [None]:
# Need to change the Dropout rate, activation function?

from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, Dropout, Input, Embedding, GlobalMaxPooling1D
from tensorflow.keras.optimizers import RMSprop

Inp = Input(name='inputs',shape=[max_length])
x = Embedding(total_num_of_words + 1, 50, input_length=max_length)(Inp)
x = GlobalMaxPooling1D()(x)
x = Dropout(0.5,name='Dropout')(x)

# The number of output units must match the number of label columns.
out = Dense(3,activation='sigmoid', name='output')(x)

In [None]:
model2 = Model(inputs=Inp,outputs=out)
# Need to change loss and optimiser?
model2.compile(loss='binary_crossentropy',optimizer=Adam(0.01),metrics=['accuracy'])

model2.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 50)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 50, 50)            585150    
_________________________________________________________________
global_max_pooling1d (Global (None, 50)                0         
_________________________________________________________________
Dropout (Dropout)            (None, 50)                0         
_________________________________________________________________
output (Dense)               (None, 3)                 153       
Total params: 585,303
Trainable params: 585,303
Non-trainable params: 0
_________________________________________________________________
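
The parameter counts in the summary can be checked by hand. The arithmetic below uses the vocabulary size printed earlier (11702 distinct words) and the layer sizes from the model definition; the variable names are only for illustration.

```python
vocab_size = 11702 + 1      # total_num_of_words + 1, as passed to Embedding
embedding_dim = 50          # second argument of Embedding
num_classes = 3             # units in the output Dense layer

# One 50-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# GlobalMaxPooling1D and Dropout have no trainable weights.

# Dense layer: one weight per (input, unit) pair plus one bias per unit.
dense_params = embedding_dim * num_classes + num_classes

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the model's capacity sits in the embedding table, so the vocabulary size is the main driver of model size.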


## Training of model

The encoded and padded X_train entries are the inputs, and y_train holds the labels.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Need to change the delta value?
early_stop = EarlyStopping(monitor='val_loss',min_delta=0.001)

# Need to change the number of epochs?
# Hold out 20% of X_train and y_train for validation
model2.fit(X_train_encoded_padded,y_train,
          batch_size=128,
          epochs=10,
          validation_split=0.2,
          callbacks=[early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<tensorflow.python.keras.callbacks.History at 0x7f1c0605b4d0>
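
Training stopped after 4 of the 10 requested epochs, which is consistent with the `EarlyStopping` callback firing. The callback is created with the default `patience=0`, so training ends at the first epoch whose `val_loss` fails to improve on the best value by at least `min_delta`. The simulation below is a simplified sketch of that rule; `epochs_run` is a hypothetical helper and the loss values are made up, since the actual per-epoch losses are not shown in the output above.

```python
def epochs_run(val_losses, min_delta=0.001, patience=0):
    """Return the number of epochs completed before early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement of at least min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses for illustration only.
losses = [0.20, 0.05, 0.02, 0.0198, 0.0197, 0.0196, 0.0195]

print(epochs_run(losses))              # stops at the first small improvement
print(epochs_run(losses, patience=3))  # tolerates a few flat epochs
```

Raising `patience` or lowering `min_delta` lets training continue through epochs with small gains, at the cost of a longer run.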

## Verification 

The trained model is checked against X_test.

In [None]:
# Prepare X_test in the same way as X_train.

X_test_encoded = t.texts_to_sequences(X_test)
X_test_encoded_padded = pad_sequences(X_test_encoded, maxlen=max_length, padding='pre')
print(X_test_encoded_padded[0])

[   0    0    0    5    1    1    2    3   14   11    3   63  102   12
   13    4    6    7    8   15 9782    9    2    2   26   32    9    5
    1    1    2    4    6    7    8   17   16    1   18   21   22   19
   23   10    1   20   24   25   10    1]


In [None]:
print(X_test_encoded_padded[23])

[  6  34  33  35  35  35  35  35  35  35  35  35  35  35  35  35  35  91
  88   9   2   2  26 244   9   5   1   1   2   4   6  34  33  46  40  17
  16   1  18  21  22  19  23  10   1  20  24  25  10   1]


In [None]:
# Apply trained model on the inputs from X_test

prediction = model2.predict(X_test_encoded_padded)
prediction[0]

array([9.9775320e-01, 1.4258723e-03, 1.1625122e-05], dtype=float32)
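
Each prediction is a vector of three scores in the column order `normal`, `xss`, `dir`. One way to turn such a vector into a decision is sketched below, using the scores printed above. `decode_prediction` is a hypothetical helper, and the 0.5 threshold is an assumption rather than a tuned value.

```python
CLASS_NAMES = ["normal", "xss", "dir"]


def decode_prediction(scores, threshold=0.5):
    """Return the top-scoring class and every class at or above the threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    flagged = [name for name, s in zip(CLASS_NAMES, scores) if s >= threshold]
    return CLASS_NAMES[best], flagged


# Scores taken from prediction[0] above.
scores = [9.9775320e-01, 1.4258723e-03, 1.1625122e-05]
label, flagged = decode_prediction(scores)
print(label, flagged)
```

Because the output layer uses sigmoid, the list of flagged classes can be empty or contain more than one name for other inputs, which is why both values are returned.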

In [None]:
# Round the predicted scores to one decimal place
# Show them next to the true labels from test_df

round_predictions = np.around(prediction, decimals=1)
results_df = pd.concat([test_df, pd.DataFrame(round_predictions, columns=['normal_pred', 'xss_pred', 'dir_pred'])], axis=1)

results_df.head(50)

| Unnamed: 0.1 | Unnamed: 0 | url | normal | xss | dir | normal_pred | xss_pred | dir_pred |
|---|---|---|---|---|---|---|---|---|
| 0 | 124 | 127.0.0.1 - - [03/May/2021:03:16:43 -0400] "GE... | 1 | 0 | 0 | 1.0 | 0.0 | 0.0 |
| 1 | 14876 | 127.0.0.1 - - [03/May/2021:03:14:02 -0400] "GE... | 1 | 0 | 0 | 0.9 | 0.0 | 0.0 |
| 2 | 13615 | 127.0.0.1 - - [03/May/2021:03:11:22 -0400] "GE... | 1 | 0 | 0 | 1.0 | 0.0 | 0.0 |
| 3 | 9841 | 127.0.0.1 - - [03/May/2021:03:18:58 -0400] "GE... | 1 | 0 | 0 | 1.0 | 0.0 | 0.0 |
| 4 | 1987 | 127.0.0.1 - - [03/May/2021:03:41:37 -0400] "GE... | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 5 | 2534 | 127.0.0.1 - - [03/May/2021:03:40:57 -0400] "GE... | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 6 | 10977 | 127.0.0.1 - - [03/May/2021:03:41:31 -0400] "GE... | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 7 | 3309 | 127.0.0.1 - - [03/May/2021:03:07:36 -0400] "GE... | 1 | 0 | 0 | 1.0 | 0.0 | 0.0 |
| 8 | 14353 | 127.0.0.1 - - [03/May/2021:03:41:15 -0400] "GE... | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 |
| 9 | 5710 | 127.0.0.1 - - [03/May/2021:03:40:49 -0400] "GE... | 0 | 1 | 0 | 0.0 | 1.0 | 0.0 |


# Future Work

1.   Calculate the ROC curve, the model's false positive rate, etc.
2.   Improve the model (e.g. accuracy, speed).
3.   Increase the log sources:


*   Include more kinds of web app attacks (e.g. SQLi, RCE) to generate more varied logs. Refer to OWASP.
*   Collect logs from different kinds of web servers and from different websites.
*   Collect logs from POST requests?
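
As a starting point for the first item, the sketch below computes the true positive rate and false positive rate for one class at a given threshold; sweeping the threshold yields the points of an ROC curve. `tpr_fpr` is a hypothetical helper, and the data is just the `xss` and `xss_pred` columns of the ten rows displayed in the results table, which is far too small a sample to say anything about the model's real performance.

```python
def tpr_fpr(y_true, y_score, threshold=0.5):
    """Return (true positive rate, false positive rate); None where undefined."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = score >= threshold
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return tpr, fpr


# xss labels and rounded xss predictions from the ten displayed rows.
xss_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
xss_pred = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0]

print(tpr_fpr(xss_true, xss_pred))

# One (TPR, FPR) point per threshold gives a coarse ROC curve.
roc_points = [tpr_fpr(xss_true, xss_pred, t) for t in (0.1, 0.5, 0.9)]
print(roc_points)
```

For a full evaluation the same function would be applied to every row of the test set and to each of the three classes, using the unrounded scores.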




