# Tying it all together

As François Chollet recommends in his excellent book, [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python), it is a good idea to follow a general workflow for every machine learning problem.  
This is why I adapt the workflow that is introduced in chapter 6.3.  
Instead of loading a static dataset from files, as one normally would, I query the data from Splunk.  

**On a side note**: I decided against using a generator for querying the static CICIDS2017 dataset, because Keras does not allow advanced collection of statistics for TensorBoard with generators.  
Generators will most likely be necessary for live analysis of data - a scenario where advanced insights are not of critical importance.
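
For the live-analysis case, a generator could group incoming records into fixed-size batches before they reach the model. The following is only a minimal sketch of that idea and is not used anywhere in this notebook; `batch_records` and the toy stream are hypothetical names chosen for illustration.

```python
from itertools import islice


def batch_records(records, batch_size):
    """Yield lists of at most `batch_size` items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-in for a stream of netflow records (hypothetical data).
toy_stream = ({'flow': i} for i in range(5))
batches = list(batch_records(toy_stream, batch_size=2))
print([len(b) for b in batches])
```

Because the generator only pulls `batch_size` items at a time, it also works on sources that never end, such as a live search.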

## Getting Data Out Of Splunk

In [None]:
# Have a look at the IPython notebook "Splunk Python API - CICIDS17" for a detailed Splunk howto
import splunklib.client as client
import splunklib.results as results

%run -i splunk_credentials

service = client.connect(
    host=HOST,
    port=PORT,
    username=USERNAME,
    password=PASSWORD
)
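
The `%run -i splunk_credentials` line executes a separate script that has to define `HOST`, `PORT`, `USERNAME` and `PASSWORD`. That script is not shown here, so the following is a guess at what it could look like. The environment variable names and fallback values are assumptions made for this sketch.

```python
import os

# Hypothetical contents of splunk_credentials.py.
# The environment variable names below are assumptions, not part of Splunk.
HOST = os.environ.get('SPLUNK_HOST', 'localhost')
# 8089 is the default Splunk management port
PORT = int(os.environ.get('SPLUNK_PORT', '8089'))
USERNAME = os.environ.get('SPLUNK_USERNAME', 'admin')
PASSWORD = os.environ.get('SPLUNK_PASSWORD', '')
```

Reading the values from the environment keeps the password out of the notebook and out of version control.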

We continue by setting up the search parameters as well as the search itself.  
For this first example, I am only interested in the subset *Friday Working Hours - Afternoon DDoS* to keep initial training durations low.  
Have a look at the IPython notebook *Data Sanitization* for more information on how I transformed the dataset into its current representation.

The data has been imported into the index *cicids17*, and the CSV data is split by days and attacks.  
This is why I can simply query for the source file name that yields the specific DDoS dataset part. 
In a live data scenario one would narrow down the search time range to drastically speed up search times, but as this is static data, I'm willfully ignoring this.
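
As a sketch of the narrowing mentioned above, the time range arguments could be derived from an end time and a window length. `time_window` is a hypothetical helper, and the timestamp format simply mirrors the one used in this notebook.

```python
from datetime import datetime, timedelta


def time_window(end, hours):
    """Return oneshot keyword arguments covering the `hours` before `end`."""
    start = end - timedelta(hours=hours)
    fmt = '%Y-%m-%dT%H:%M:%S.000'
    return {'earliest_time': start.strftime(fmt),
            'latest_time': end.strftime(fmt),
            'count': 0}  # same value as in the notebook's search arguments


window = time_window(datetime(2017, 7, 7, 17, 10, 1), hours=1)
print(window)
```

The resulting dict can be passed to the search call with `**window`, in the same way as `kwargs_oneshot`.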

In [None]:
kwargs_oneshot = {'earliest_time': '2017-07-07T16:10:01.000', # Subset start just before the DDoS
                  'latest_time':   '2017-07-07T23:59:59.000',
                  'count': 0}

# This yields the full dataset in its (mostly) raw form
#searchquery_oneshot = 'search index=cicids17 source="Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv" | sort _time | head 20'

# An example of how to prefilter and work with data on the server side
searchquery_oneshot = 'search index=cicids17 source="Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv" |  table * |  sort 0 timestamp | head 10000'

oneshotsearch_results = service.jobs.oneshot(searchquery_oneshot, **kwargs_oneshot) 
reader = results.ResultsReader(oneshotsearch_results)

Next up, iterate over all returned entries and convert them into a usable, in-memory data structure.  
Splunk returns an [ordered dict](https://docs.python.org/3/library/collections.html#collections.OrderedDict) per entry, which is very convenient for further usage.  
Unfortunately, the dict also contains meta info from Splunk, so the entries have to be scrubbed of it.  
Additionally, the flow_id and timestamp fields are removed, as they change constantly and could possibly throw off the neural network.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

data = []

# Splunk-specific metadata, plus the flow_id and timestamp of the original data
drop_keys = (
    'flow_id', 'timestamp',
    'date_mday', 'source', 'index', 'sourcetype', '_subsecond', 'linecount', '_bkt', '_raw', 'date_month', 'date_year',
    '_time', 'timeendpos', 'timestartpos', 'date_hour', 'date_minute', '_cd', 'date_zone', 'host', '_serial',
    'date_second', '_sourcetype', 'date_wday', 'splunk_server', 'punct', '_indextime', '_si',
)

for item in reader:
    # The reader can also yield diagnostic messages; keep only result rows
    if not isinstance(item, dict):
        continue
    # pop() with a default removes each key independently, so a missing key
    # does not stop the remaining keys from being removed
    for key in drop_keys:
        item.pop(key, None)
    data.append(item)
    
netflows = pd.DataFrame(data)
print('Processed {} netflows'.format(len(netflows)))

In [None]:
netflows.head(5)
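
The values handed back by the results reader are most likely strings, so the numeric columns may need converting before they can be fed to a network. Below is a minimal sketch of that conversion on a toy frame. The column names and values are hypothetical, and whether this step is needed depends on what the reader actually returns.

```python
import pandas as pd

# Toy frame mimicking string-valued results (hypothetical columns and values).
toy = pd.DataFrame({'flow_duration': ['3', '109'],
                    'source_ip': ['192.168.10.5', '192.168.10.8']})

# Convert every column except the IP address; unparseable cells become NaN.
numeric_cols = toy.columns.difference(['source_ip'])
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors='coerce')
print(toy.dtypes)
```

Using `errors='coerce'` makes bad cells visible as NaN instead of raising, so they can be counted and handled explicitly.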

Splunk returns the newest entries at the top, which normally is a great idea. We, however, need the flows in chronological order.  
The Splunk search language has a *sort* command; however, it seems to act up sometimes when returning the full dataset. This is a clear **FIXME**.
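
One possible client-side fallback is to sort in pandas instead of relying on the server. This sketch assumes the timestamp column is kept until after sorting, which differs from the loop above where it is removed right away. The toy data is hypothetical.

```python
import pandas as pd

# Toy flows arriving out of order (hypothetical values).
toy = pd.DataFrame({'timestamp': ['2017-07-07T16:10:03',
                                  '2017-07-07T16:10:01',
                                  '2017-07-07T16:10:02'],
                    'flow_duration': [7, 3, 5]})

# Parse the timestamps, sort chronologically, then drop the helper column.
toy['timestamp'] = pd.to_datetime(toy['timestamp'])
toy = toy.sort_values('timestamp', kind='mergesort').reset_index(drop=True)
toy = toy.drop(columns=['timestamp'])
print(toy['flow_duration'].tolist())
```

A stable sort such as `mergesort` keeps flows with identical timestamps in their original relative order.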

## One Hot Encoding

Pandas has a nice built-in function for One Hot Encoding: [get_dummies](http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example).  
The encoded labels are not added to the dataset; they are kept as target values in a standalone DataFrame.
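
To illustrate what `get_dummies` produces, here is a small example on made-up data. The IP addresses and label names are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Toy data with one categorical feature and one label column (hypothetical).
toy = pd.DataFrame({'source_ip': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
                    'label': ['BENIGN', 'DDoS', 'BENIGN']})

# Each distinct value becomes its own indicator column.
ip_columns = pd.get_dummies(toy['source_ip'], prefix='source_ip')
targets = pd.get_dummies(toy['label'], prefix='label')

print(list(ip_columns.columns))
print(targets.astype(int).values.tolist())
```

Every row has exactly one active indicator per encoded column group, which is the shape a categorical loss expects for the targets.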

In [None]:
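# One-hot encode the labels into a standalone DataFrame of target values.
# get_dummies handles string labels and returns a DataFrame, as needed by .head() and .values below.
df_label = pd.get_dummies(netflows['label'], prefix='label')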

In [None]:
# One-hot encode the source and destination IP addresses
df_source_ip = pd.get_dummies(netflows['source_ip'], prefix='source_ip')
df_destination_ip = pd.get_dummies(netflows['destination_ip'], prefix='destination_ip')

# Append the encoded columns to the dataset
netflows = pd.concat([netflows, df_source_ip, df_destination_ip], axis=1)

# Drop the original columns; the labels already live in df_label
netflows = netflows.drop(['source_ip', 'destination_ip', 'label'], axis=1)

In [None]:
netflows.head(5)

In [None]:
df_label.head(5)

## Building the Keras model

Prepare the TensorBoard callback and its log directory.

In [None]:
from time import time
from keras.callbacks import TensorBoard
tensorboard = TensorBoard(log_dir="logs/rms-{}".format(time()))
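
TensorBoard treats each subdirectory of the log directory as a separate run, so a readable run name makes comparisons easier than a raw epoch timestamp. A minimal sketch follows; `run_log_dir` is a hypothetical helper and the naming scheme is just one option.

```python
from datetime import datetime
from pathlib import Path


def run_log_dir(base='logs', prefix='rms', now=None):
    """Build a per-run log directory name such as logs/rms-20170707-161001."""
    now = now or datetime.now()
    name = '{}-{}'.format(prefix, now.strftime('%Y%m%d-%H%M%S'))
    return (Path(base) / name).as_posix()


print(run_log_dir(now=datetime(2017, 7, 7, 16, 10, 1)))
```

The returned string can be passed as `log_dir` to the callback, and `tensorboard --logdir logs` then lists all runs side by side.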

Split the data into training and validation parts
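
With time-sorted data, taking the first rows as the validation part can leave the classes unevenly represented in the two parts. Whether that happens here depends on where the attack starts in the queried rows. The toy example below only demonstrates the effect and how to check for it; `split_head` and the arrays are hypothetical.

```python
import numpy as np


def split_head(x, y, n_val):
    """Use the first `n_val` rows for validation and the rest for training."""
    return (x[n_val:], y[n_val:]), (x[:n_val], y[:n_val])


# Toy data: six samples, two features, one-hot labels sorted by class.
x_toy = np.arange(12, dtype='float32').reshape(6, 2)
y_toy = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype='float32')

(x_train_toy, y_train_toy), (x_val_toy, y_val_toy) = split_head(x_toy, y_toy, n_val=2)

# Per-class sample counts in each part
print(y_train_toy.sum(axis=0), y_val_toy.sum(axis=0))
```

Summing the one-hot label columns gives the per-class counts, which is a cheap sanity check to run on the real split as well.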

In [None]:
x_val = netflows.values[:5000]
x_part_train = netflows.values[5000:]

y_val = df_label.values[:5000]
y_part_train = df_label.values[5000:]

print(x_part_train.shape)

import numpy as np
# Reshape all inputs to a 3D tensor: https://stackoverflow.com/questions/44704435/error-when-checking-model-input-expected-lstm-1-input-to-have-3-dimensions-but
#x_val = np.reshape(x_val, (x_val.shape[0], 1, x_val.shape[1]))
#x_part_train = np.reshape(x_part_train, (x_part_train.shape[0], 1, x_part_train.shape[1]))
#y_val = np.reshape(y_val, (y_val.shape[0], 1, y_val.shape[1]))
#y_part_train = np.reshape(y_part_train, (y_part_train.shape[0], 1, y_part_train.shape[1]))
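
The commented-out reshape follows from how Keras recurrent layers read their input: they expect a tensor of shape (samples, timesteps, features). A minimal NumPy sketch of that reshape on toy data, treating every flow as a sequence of length 1:

```python
import numpy as np

# Toy input: 4 samples with 3 features each.
toy = np.arange(12, dtype='float32').reshape(4, 3)

# Insert a time axis of length 1 between samples and features.
toy_3d = toy.reshape(toy.shape[0], 1, toy.shape[1])
print(toy_3d.shape)
```

When the recurrent layer only returns its last output, which is the Keras default, the targets can probably stay two-dimensional and only the inputs need the extra axis.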


### A first simple model trained with RMSprop

In [None]:
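from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM

# Cast to float32 and add a time axis of length 1, since the LSTM expects
# input of shape (samples, timesteps, features)
x_train_seq = x_part_train.astype('float32').reshape(len(x_part_train), 1, -1)
x_val_seq = x_val.astype('float32').reshape(len(x_val), 1, -1)
y_train = y_part_train.astype('float32')
y_validation = y_val.astype('float32')

model = Sequential()

# No Embedding layer: the flows are continuous features, not integer token
# indices, so the LSTM reads them directly
model.add(LSTM(128, input_shape=x_train_seq.shape[1:]))
model.add(Dropout(0.5))
# One softmax unit per class, matching the one-hot labels and categorical_crossentropy
model.add(Dense(y_train.shape[-1], activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Log to TensorBoard and validate on the held-out part after each epoch
model.fit(x_train_seq, y_train, batch_size=16, epochs=2,
          validation_data=(x_val_seq, y_validation), callbacks=[tensorboard])
# No separate test set exists yet, so score against the validation part
score = model.evaluate(x_val_seq, y_validation, batch_size=16)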
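
The `categorical_crossentropy` loss works on one predicted probability per class and a one-hot target row. The following NumPy sketch shows the underlying formula on made-up predictions. It is not Keras' implementation, and the function name and `eps` clipping value are choices made for this illustration.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


# Two samples, two classes (hypothetical predictions).
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Only the probability assigned to the true class contributes to each sample's loss, so confident correct predictions drive the value toward zero.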