#Keras implementation
This Google Colab notebook implements the deep learning classification model that combines stacked non-symmetric deep autoencoders (NDAEs) with the Random Forest algorithm, applied to the KDD Cup'99 dataset. The model was proposed in the article "A Deep Learning Approach to Network Intrusion Detection" (https://ieeexplore.ieee.org/document/8264962).



In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [2]:
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 1.x selected.
1.15.2


#Prepare data
The KDD Cup’99 dataset is loaded from .csv files, which contain 494,021 training records and 311,029 testing records.

The dataset must be pre-processed before it can be used with the stacked NDAE model. The model operates only on numeric values, but each record in the dataset mixes numeric and symbolic values, so the symbolic values have to be converted. In addition, the integer values need normalisation because they are mixed with floating-point values between 0 and 1, which would make learning difficult.



Load the data from the .csv files

In [0]:
import pandas
data_train = pandas.read_csv('/content/drive/My Drive/NDAE_IEEE2018/data/kddcup.data_10_percent_train_5_class.csv')
data_test = pandas.read_csv('/content/drive/My Drive/NDAE_IEEE2018/data/kddcup.data_10_percent_test_5_class.csv')
# All columns except the last are features; the last column (index 41) is the label
x_train = data_train.iloc[:, :-1].values
y_train = data_train.iloc[:, 41].values
x_test = data_test.iloc[:, :-1].values
y_test = data_test.iloc[:, 41].values

First, the symbolic features (Protocol type, Service and Flag) are converted to integer codes, and these codes are then one-hot encoded.

In [0]:

import numpy as np
# Concatenate training and testing data so that both go through the same preprocessing steps
x = np.concatenate((x_train, x_test), axis = 0)

#Transform to numeric features
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Encode Protocol type feature (3 different values)
labelencoder_x_1 = LabelEncoder()
x[:, 1] = labelencoder_x_1.fit_transform(x[:, 1])
# Encode Service feature (67 different values)
labelencoder_x_2 = LabelEncoder()
x[:, 2] = labelencoder_x_2.fit_transform(x[:, 2])

# Encode Flag feature (11 different values)
labelencoder_x_3 = LabelEncoder()
x[:, 3] = labelencoder_x_3.fit_transform(x[:, 3])


# One-hot encode the three integer-coded features.
# ColumnTransformer places the encoded columns first and the passthrough
# columns after them, so the positions of the remaining features shift each time.
from sklearn.compose import ColumnTransformer
# Protocol type (index 1) becomes a 3-dim one-hot vector
ct = ColumnTransformer([("ProtocolType", OneHotEncoder(), [1])], remainder = 'passthrough')
x = ct.fit_transform(x)
# Service (now at index 4) becomes a 67-dim one-hot vector
ct = ColumnTransformer([("Service", OneHotEncoder(), [4])], remainder = 'passthrough')
x = ct.fit_transform(x)
# Flag (now at index 71) becomes an 11-dim one-hot vector
ct = ColumnTransformer([("Flag", OneHotEncoder(), [71])], remainder = 'passthrough')
x = ct.fit_transform(x)
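
The column indices `[1]`, `[4]` and `[71]` used above follow from how `ColumnTransformer` orders its output with `remainder='passthrough'`: the encoded columns come first and the untouched columns follow in their original order. The sketch below only tracks column names to make that shift visible. The helper `one_hot_prepend` and the `feature_N` names are hypothetical, introduced here for illustration.

```python
def one_hot_prepend(columns, index, width):
    """Mimic the output column order of
    ColumnTransformer([...], remainder='passthrough'): the one-hot columns
    come first, the untouched columns follow in their original order."""
    name = columns[index]
    encoded = ["{}_{}".format(name, i) for i in range(width)]
    rest = columns[:index] + columns[index + 1:]
    return encoded + rest


# 41 input features; the names after the first four are placeholders.
columns = ["duration", "protocol_type", "service", "flag"]
columns += ["feature_{}".format(i) for i in range(4, 41)]

columns = one_hot_prepend(columns, 1, 3)               # Protocol type at index 1
service_index = columns.index("service")               # position after the first shift
columns = one_hot_prepend(columns, service_index, 67)
flag_index = columns.index("flag")                     # position after the second shift
columns = one_hot_prepend(columns, flag_index, 11)

print(service_index, flag_index, len(columns))  # prints: 4 71 119
```

The final width of 119 matches the input size used for the first autoencoder.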

Labels are also converted to numeric values

In [0]:
# Encode label (5 labels)
y = np.concatenate((y_train, y_test), axis = 0)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
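
As a rough illustration of what the label encoding produces, `LabelEncoder` sorts the distinct class names and replaces each label with its position in that sorted list. The sketch below reproduces this with `np.unique`. The label strings are hypothetical; the real ones come from the last column of the .csv files.

```python
import numpy as np

# Hypothetical label strings, used only for illustration.
labels = np.array(["normal", "dos", "probe", "dos", "u2r", "r2l"])

# np.unique returns the sorted classes and, with return_inverse=True,
# the index of every label within that sorted array.
classes, codes = np.unique(labels, return_inverse=True)

print(classes.tolist())  # ['dos', 'normal', 'probe', 'r2l', 'u2r']
print(codes.tolist())    # [1, 0, 2, 0, 4, 3]

# Indexing the classes with the codes recovers the original labels.
decoded = classes[codes]
```

Because the training and testing labels are concatenated before encoding, both sets share one mapping from class name to integer.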

Data is normalized to floating-point values between 0 and 1

In [0]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x = scaler.fit_transform(x)

# Split back into training data (first 494,021 rows) and testing data
x_train = x[:494021,:]
x_test = x[494021:,:]

y_train = y[:494021]
y_test = y[494021:]
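
Min-max scaling maps each feature column to [0, 1] with x' = (x - min) / (max - min). In the cell above the minimum and maximum are taken over the concatenated training and testing records. The NumPy sketch below shows the arithmetic on a toy matrix. The function `min_max_scale` is a hypothetical helper, not part of scikit-learn, and it assumes constant columns should map to 0.

```python
import numpy as np


def min_max_scale(a):
    """Scale every column of `a` to the range [0, 1]."""
    a = np.asarray(a, dtype=float)
    lo = a.min(axis=0)
    hi = a.max(axis=0)
    # Avoid division by zero for constant columns: they end up as all zeros.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (a - lo) / span


toy = np.array([[0.0, 10.0, 5.0],
                [5.0, 20.0, 5.0],
                [10.0, 40.0, 5.0]])
scaled = min_max_scale(toy)
print(scaled)
```

The first column becomes 0, 0.5, 1; the second becomes 0, 1/3, 1; the constant third column becomes all zeros.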


#Construct classifier
The model stacks two NDAEs and combines them with the RF algorithm.

![Stacked NDAE Classification Model](https://drive.google.com/uc?id=1paKfuUYEVtBmOfkrMdzzfElMwGyMd2eC)








In [7]:
%tensorflow_version 1.x
import tensorflow as tf
import keras
from keras.layers import Input, Dense
from keras.models import Model
from keras import backend as K
print(tf.__version__)

1.15.2


Using TensorFlow backend.


In [0]:
class DenseTranspose(keras.layers.Layer):
  """Decoder layer that reuses the transposed kernel of a given Dense layer (tied weights)."""
  def __init__(self, dense, activation=None, **kwargs):
    self.dense = dense
    self.activation = keras.activations.get(activation)
    super().__init__(**kwargs)
  def build(self, batch_input_shape):
    # Only the bias is a new trainable weight; the kernel is shared with self.dense
    self.biases = self.add_weight(name="bias", initializer="zeros",shape=[self.dense.input_shape[-1]])
    self.W = tf.transpose(self.dense.weights[0])
    super().build(batch_input_shape)
  def compute_output_shape(self, input_shape):
    # The output width equals the input width of the mirrored Dense layer
    return (input_shape[0], self.dense.input_shape[-1])
  def call(self, inputs):
    z = tf.matmul(inputs, self.W)
    return self.activation(z + self.biases)
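
The `DenseTranspose` layer ties the decoder to the encoder: it multiplies by the transpose of the `Dense` layer's kernel and adds its own bias. The NumPy sketch below shows the same arithmetic outside Keras. The sizes, random values and variable names are toy assumptions chosen only to show the shapes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n_in, n_hidden = 6, 3                      # toy sizes, not the notebook's

W = rng.normal(size=(n_in, n_hidden))      # kernel of the Dense (encoder) layer
b_enc = np.zeros(n_hidden)                 # encoder bias
b_dec = np.zeros(n_in)                     # the decoder's own bias

x = rng.random(size=(4, n_in))             # batch of 4 records
h = sigmoid(x @ W + b_enc)                 # Dense:          (4, 6) -> (4, 3)
x_rec = sigmoid(h @ W.T + b_dec)           # DenseTranspose: (4, 3) -> (4, 6)

print(h.shape, x_rec.shape)                # prints: (4, 3) (4, 6)
```

Sharing `W` this way means the decoder adds only a bias vector to the trainable parameters.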

Construct the first autoencoder and train it.


In [9]:
K.clear_session()
num_hidden = (119, 14, 28, 28)

Dense_11 = Dense(units=num_hidden[1], activation='sigmoid')
Dense_12 = Dense(units=num_hidden[2], activation='sigmoid')
Dense_13 = Dense(units=num_hidden[3], activation='sigmoid')

inputs_1 = Input(shape=(num_hidden[0],))

#Encoder
encoded_11 = Dense_11(inputs_1)
encoded_12 = Dense_12(encoded_11)
encoded_13 = Dense_13(encoded_12)

#Decoder
decoded_11 = DenseTranspose(Dense_13, activation='sigmoid')(encoded_13)
decoded_12 = DenseTranspose(Dense_12, activation='sigmoid')(decoded_11)
outputs_1 = DenseTranspose(Dense_11, activation='sigmoid')(decoded_12)

AE_1=Model(inputs_1, outputs_1)
Encoder_1=Model(inputs_1, decoded_12)
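
To see which widths flow through this network, the sketch below traces the layer sizes. It assumes, as in `DenseTranspose`, that each decoder layer outputs the input width of the `Dense` layer it mirrors. The helper `tied_autoencoder_widths` is hypothetical and only does bookkeeping; it builds no model.

```python
def tied_autoencoder_widths(input_dim, hidden_units):
    """Return the widths after each encoder layer (including the input)
    and after each tied decoder layer."""
    encoder = [input_dim] + list(hidden_units)
    # The decoder walks the encoder widths backwards, skipping the innermost one.
    decoder = encoder[-2::-1]
    return encoder, decoder


enc1, dec1 = tied_autoencoder_widths(119, (14, 28, 28))
print(enc1, dec1)   # prints: [119, 14, 28, 28] [28, 14, 119]
print(dec1[-2])     # prints: 14  (width of decoded_12, the output of Encoder_1)

# The same bookkeeping for the second autoencoder's sizes (14, 28, 28).
enc2, dec2 = tied_autoencoder_widths(14, (28, 28))
print(enc2, dec2)   # prints: [14, 28, 28] [28, 14]
```

Under this reading, `Encoder_1` emits 14 values per record, which is the input width of the second autoencoder.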



In [10]:
AE_1.compile(optimizer='rmsprop', loss= 'mse')
AE_1.fit(x_train,x_train,epochs=100,batch_size=256,shuffle=True)


Epoch 1/100
...

<keras.callbacks.callbacks.History at 0x7f3982accd68>

Take the input to the last layer of the first autoencoder (the 14-dimensional output of `decoded_12`) and feed it to the second autoencoder.

In [0]:
AE_1_encoded_train = Encoder_1.predict(x_train)
AE_1_encoded_test = Encoder_1.predict(x_test)

Construct the second autoencoder and train it.

In [0]:
K.clear_session()
num_hidden = (14, 28, 28)

Dense_21 = Dense(units=num_hidden[1], activation='sigmoid')
Dense_22 = Dense(units=num_hidden[2], activation='sigmoid')


inputs_2 = Input(shape=(num_hidden[0],))

#Encoder
encoded_21 = Dense_21(inputs_2)
encoded_22 = Dense_22(encoded_21)


#Decoder
decoded_21 = DenseTranspose(Dense_22, activation='sigmoid')(encoded_22)
outputs_2 = DenseTranspose(Dense_21, activation='sigmoid')(decoded_21)

AE_2=Model(inputs_2, outputs_2)
Encoder_2=Model(inputs_2, decoded_21)

In [13]:
AE_2.compile(optimizer='rmsprop', loss= 'mse')
AE_2.fit(AE_1_encoded_train,AE_1_encoded_train,epochs=100,batch_size=256,shuffle=True)

Epoch 1/100
...

<keras.callbacks.callbacks.History at 0x7f3983f3cb00>

Take the input to the last layer of the second autoencoder (the 28-dimensional output of `decoded_21`) and feed it to the RF algorithm.

In [0]:
AE_2_encoded_train = Encoder_2.predict(AE_1_encoded_train)
AE_2_encoded_test = Encoder_2.predict(AE_1_encoded_test)

In [15]:
from sklearn.ensemble import RandomForestClassifier
rfc1 = RandomForestClassifier(n_jobs=-1, n_estimators=10)
rfc1.fit(AE_2_encoded_train, y_train)
rfc1.score(AE_2_encoded_test, y_test)

0.9784775915155662
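
The value returned by `score` is the mean accuracy, that is, the fraction of test records whose predicted class equals the true class. The sketch below computes that figure from a confusion matrix, together with per-class recall, on made-up toy predictions. The arrays and the helper `confusion_matrix` are hypothetical and unrelated to the result above.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # count every (true, predicted) pair
    return cm


# Toy labels with 4 classes, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_pred, 4)
accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)    # per-class fraction of records recovered

print(accuracy)          # prints: 0.75
print(recall.tolist())   # prints: [1.0, 0.5, 1.0, 0.0]
```

In the toy data the last class is never predicted correctly even though the overall accuracy is 0.75, which is why per-class figures can be worth checking alongside a single accuracy score.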