<a href="https://colab.research.google.com/github/kniemi641/UC-MScA/blob/master/ML%20Homework%207%20-%20RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 7 - RNN

The purpose of this exercise is to train a recurrent neural network model to identify malicous networking activity. A networking log file has been provided which contains attributes such as timestamp, access method, and an indicator which denotes if the observed activity was a breach.

In [1]:
# Generic Packages
import pandas as pd
import numpy as np
import math
import os
from scipy import stats
import sys
import json
from scipy import stats
import optparse

#Neural Networks
from keras.callbacks import TensorBoard
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, GaussianDropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils
from collections import OrderedDict
from keras import optimizers
import h5py

# Sklearn Packages
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder

#Plotting Packages
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns

#Utilitie warnings
import pickle
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive, files
drive.mount('/content/gdrive')
#uploaded = files.upload()
np.random.seed(235)

Using TensorFlow backend.


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
#GLOBAL & CONSTANTS
NUM = 7
INPUT_DATA_FILE = 'dev-access.csv'

GD_CODE_DIR = '/content/gdrive/My Drive/Code/uchicago/'
SUBJECT_DIR = 'Machine Learning & Predictive Analytics/'
DATA_DIR = 'data/'
MODEL_DIR = 'models/'
LOGS_DIR = 'logs/'
HOMEWORK_DIR = 'Homework {}/'.format(NUM)
NOTEBOOK_NAME = 'Homework {}.ipynb'.format(NUM)

MAIN_PATH = os.path.join(GD_CODE_DIR
                        ,SUBJECT_DIR
                        ,HOMEWORK_DIR)

INPUT_FILE = os.path.join(MAIN_PATH
                          ,DATA_DIR
                          ,INPUT_DATA_FILE)

NOTEBOOK_FILE = os.path.join(MAIN_PATH
                            ,NOTEBOOK_NAME)

MODEL_EXPORT_PATH = os.path.join(MAIN_PATH
                                ,MODEL_DIR)

LOG_PATH = os.path.join(MAIN_PATH
                       ,LOGS_DIR)


## Data Processing
The dataset is imported in .JSON format and flattened into a X and y vectors.


In [3]:
log_data = pd.read_csv(INPUT_FILE, engine='python', quotechar='|', header=None)
dataset = log_data.values

X = dataset[:,0]
y = dataset[:,1]

for index, item in enumerate(X):
# Quick hack to space out json elements
  reqJson = json.loads(item, object_pairs_hook=OrderedDict)
  del reqJson['timestamp']
  del reqJson['headers']
  del reqJson['source']
  del reqJson['route']
  del reqJson['responsePayload']
X[index] = json.dumps(reqJson, separators=(',', ':'))
print('X data: {}'.format(X[:5]))

X data: ['{"timestamp":1502738402847,"method":"post","query":{},"path":"/login","statusCode":401,"source":{"remoteAddress":"88.141.113.237","referer":"http://localhost:8002/enter"},"route":"/login","headers":{"host":"localhost:8002","accept-language":"en-us","accept-encoding":"gzip, deflate","connection":"keep-alive","accept":"*/*","referer":"http://localhost:8002/enter","cache-control":"no-cache","x-requested-with":"XMLHttpRequest","content-type":"application/json","content-length":"36"},"requestPayload":{"username":"Carl2","password":"bo"},"responsePayload":{"statusCode":401,"error":"Unauthorized","message":"Invalid Login"}}'
 '{"timestamp":1502738402849,"method":"post","query":{},"path":"/login","statusCode":401,"source":{"remoteAddress":"88.141.113.237"},"route":"/login","headers":{"host":"localhost:8002","connection":"keep-alive","cache-control":"no-cache","accept":"*/*","accept-encoding":"gzip, deflate, br","accept-language":"en-US,en;q=0.8,es;q=0.6","content-type":"application/j

Next the X vector is tokenized at the character level, using tabs and line carriage returns as filters. The observations are padded to the max_log_length.

In [4]:
#Tokenize X by tabs and cariage returns
tokenizer = Tokenizer(filters='\t\n', char_level=True)
tokenizer.fit_on_texts(X)

num_words = len(tokenizer.word_index)+1
X = tokenizer.texts_to_sequences(X)

max_log_length = 1024
X_processed = sequence.pad_sequences(X, maxlen=max_log_length)
X_processed[:5]

array([[ 0,  0,  0, ...,  1, 27, 27],
       [ 0,  0,  0, ...,  1, 27, 27],
       [ 0,  0,  0, ...,  1, 27, 27],
       [ 0,  0,  0, ...,  1, 27, 27],
       [ 0,  0,  0, ...,  1, 27, 27]], dtype=int32)

The code is split into train and test using a ratio of 3:1.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.25, random_state=235)

## Model 1

The first model is constructed using an embedding layer, LSTM layer, and a dense layer. The embedding layer as input dimensions equal to the processed X vector, and an output dimension of 32. The LSTM layer uses 64 units, with a dropout rate of 0.5. Finally the dense layer is single unit with a 'Rectified Linear Unit' activation function. 

In [0]:
#Sequential Model
model_1 = Sequential()

#Embedding Input Layer
model_1.add(Embedding(input_dim=num_words
                      , output_dim = 32
                      , input_length = max_log_length)
            )

#LSTM Lyaer
model_1.add(LSTM(units = 64
                , dropout= 0.5)
            )

#Output Layer
model_1.add(Dense(units = 1
                  , activation= 'relu',)
            )

#Compile
model_1.compile(optimizer='Adam'
                , loss='binary_crossentropy'
                , metrics=['accuracy']
                )

print(model_1.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1024, 32)          2048      
_________________________________________________________________
lstm_6 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 26,945
Trainable params: 26,945
Non-trainable params: 0
_________________________________________________________________
None


The model summary above shows the 3 layers, their dimensions, and the total number of trainable parameters. The model is then trained using 3 epochs, with a validation split of 3:1 and a batch size of 128.

In [0]:
#Histrial training
model_1_history = model_1.fit( X_train
                              ,y_train
                              ,batch_size=128
                              ,epochs=3
                              ,validation_split=0.25
                              ,verbose=1
                             )

model_1.save(MODEL_EXPORT_PATH+'model_1.h5')

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [0]:
#model_1 = load_model(MODEL_EXPORT_PATH+'model_1.h5')
score_1 = model_1.evaluate(X_test, y_test, verbose=0)
print('Test loss:{}'.format(round(score_1[0], 4)))
print('Test accuracy{}'.format(round(score_1[1],4)))

Test loss:0.4422
Test accuracy0.6421


The results above show the accuracy and loss of the model.

## Model 2

The second model has the same embedding and LSTM layers as the first.Two additional dropout layers with a values of 0.5 are added, and the output layer's activation function is changed to a 'sigmoid'

In [0]:
#Sequential Model
model_2 = Sequential()

#Embedding Input Layer
model_2.add(Embedding(input_dim=num_words
                      , output_dim = 32
                      , input_length = max_log_length)
           )

#Dropout Layer
model_2.add(Dropout(rate=0.5))

#LSTM Lyaer
model_2.add(LSTM(units = 64
                 , dropout= 0.5)
           )

#Dropout Layer
model_2.add(Dropout(rate=0.5))

#Dense Sigmoid to target
model_2.add(Dense(units = 1
                  , activation= 'sigmoid')
           )

#Compile
model_2.compile(optimizer='Adam'
                , loss='binary_crossentropy'
                , metrics=['accuracy']
               )

#summarize
print(model_2.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 1024, 32)          2048      
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024, 32)          0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 65        
Total params: 26,945
Trainable params: 26,945
Non-trainable params: 0
_________________________________________________________________
None


The summary of the model above the additional layers. Note these layers have no trainable parameters (weights) and therefore the second model has the same number of trainable parameters as the first model.

In [0]:
#Fit Model
model_2_history = model_2.fit(X_train
                              , y_train
                              , batch_size=128                              
                              , epochs=3
                              , validation_split=0.25
                              , verbose=1
                             )

model_2.save(MODEL_EXPORT_PATH+'model_2.h5')

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [6]:
#model_2 = load_model(MODEL_EXPORT_PATH+'model_2.h5')
score_2 = model_2.evaluate(X_test, y_test, verbose=0)
print('Test loss:{}'.format(round(score_2[0], 4)))
print('Test accuracy{}'.format(round(score_2[1],4)))

Test loss:0.4539
Test accuracy0.7399


## Model 3

The third model is created using the same embedding and output (dense) layer parameters. However the dropout layers have been changed to a Guassian layer and additional LSTM layer has been added. The optimizer has been changed to Stochatic Gradient Descent with a learning rate of 0.1 and momentum of 0.9 

In [7]:
#Sequential Model
model_3 = Sequential()

#Embedding Input Layer
model_3.add(Embedding(input_dim=num_words
                      , output_dim = 32
                      , input_length = max_log_length)
           )

#Dropout Layer
model_3.add(Dropout(rate=0.5))

#LSTM Lyaer
model_3.add(LSTM(units = 64
                 , recurrent_dropout=0.5
                 , activation='tanh'
                 , return_sequences=True)
           )

#Dropout Layer
model_3.add(Dropout(rate=0.5))

model_3.add(LSTM(units = 64
                 , recurrent_dropout=0.5
                 , activation='tanh')
           )

#Dense Sigmoid to target
model_3.add(Dense(units = 1
                  , activation= 'sigmoid'
                 )
           )

#SGD Optimizer
SGD = optimizers.SGD(lr=0.01
                     , decay=1e-6
                     , momentum=0.9
                     , nesterov=True)


# Compiler           
model_3.compile(optimizer=SGD
              , loss='binary_crossentropy'
              , metrics=['accuracy']
               )
            
print(model_3.summary())   

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1024, 32)          2048      
_________________________________________________________________
dropout_1 (Dropout)          (None, 1024, 32)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 1024, 64)          24832     
_________________________________________________________________
dropout_2 (Dropout)          (None, 1024, 64)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 59,969
Trainable params: 59,969
Non-trainable params: 0
_________________________________________________________________
None


The summary from the model above shows the large increase in the number of trainable parameters due to to the second LSTM layer. 

In [8]:
#Fit Model
model_3_history = model_3.fit(X_train
            , y_train
            , validation_split=0.25
            , epochs=3
            , batch_size=128
            , verbose=1
           )

model_3.save(MODEL_EXPORT_PATH+'model_3.h5')

Train on 15059 samples, validate on 5020 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [10]:
#model_3 = load_model(MODEL_EXPORT_PATH+'model_3.h5')

score_3 = model_3.evaluate(X_test, y_test, verbose=0)
print('Test loss: {}'.format(round(score_3[0], 4)))
print('Test accuracy: {}'.format(round(score_3[1],4)))

Test loss:0.6645
Test accuracy0.706


## Analysis



5.   The relu activation function has no upper bound, and the lower bound set to 0. The sigmoid function can only achieve values between 0 and 1. 
6.   The sigmoid is more appropriate for binary classification.
7.   Dropout layers set a fraction of random inputs to 0 at each training update. This helps prevent overfitting by forcing the network to not become reliant on any one individual pathway. These inputs are NOT set to 0 during test, as testing should involve looking at all possible trained parameters. 
8.   Recurrent neural networks allow for 'memory' to help train the weights when input data is being trained. CNNs do not have this property, they are designed more to features more related to 'spacial' aspects, such as images.
9.   LSTMs use neural network units which consist of a cell, input gate, output gate and forget gate. These gates allow for the storage, usage, discarding of information in a LSTM unit. They allowed for incorporation of training data where an unknown time lag exist between input observations.

In [0]:
!jupyter nbconvert --to html '/content/gdrive/My Drive/Code/Machine Learning & Predictive Analytics/Homework 7/Homework 7.ipynb'

[NbConvertApp] Converting notebook /content/gdrive/My Drive/Code/Machine Learning & Predictive Analytics/Homework 7/Homework 7.ipynb to html
[NbConvertApp] Writing 319663 bytes to /content/gdrive/My Drive/Code/Machine Learning & Predictive Analytics/Homework 7/Homework 7.html
