# Tying it all together

As François Chollet recommends in his excellent book, [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python), it is always a good idea to follow a general workflow.  
This is why I adapt the workflow that is introduced in chapter 6.3.  
Additionally, I am querying Splunk instead of using a static dataset that would be provided normally.  

**On a side note**: I decided to pass on usung a generator for querying the static CICIDS2017 dataset, because Keras does not allow advanced collection of statistics for Tensorboard with generators.  
Generators will be most likely neccessary for live analysis of data - a scenario, where advanced insights are not of critical importance.

## Getting Data Out Of Splunk

In [1]:
# Have a look at the IPython notebook "Splunk Python API - CICIDS17" for a detailed Splunk howto
import splunklib.client as client
import splunklib.results as results

%run -i splunk_credentials

service = client.connect(
    host=HOST,
    port=PORT,
    username=USERNAME,
    password=PASSWORD
)

We continue with setting up the search params as well as the search itself.  
For this first example, I am just interested in the part set *Friday Working Hours - Afternoon DDoS* to keep initial training durations low.  
Have a look at the IPython notebook *Data Sanitization* for more information on how I transformed the dataset into its current representation.

The data has been imported in the index *cicids17*, the CSV data is split by days and attacks.  
This is why I can simply query for the source file name that yields the specific DDoS dataset part. 
In a live data scenario one would narrow down the search time range to drastically speed up search times, but as this is static data, I'm willfully ignoring this.

In [2]:
kwargs_oneshot = {'earliest_time': '2017-07-07T16:11:00.000', # Subset start just befor the ddos
                  'latest_time':   '2017-07-07T16:11:05.000',
                  'count': 0}

# This yields the full dataset in its (mostly) raw form
#searchquery_oneshot = 'search index=cicids17 source="Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv" | sort _time | head 20'

# An example of how to prefilter and work with data on the server side
search_benign = 'search index=cicids17 source="Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv" label="BENIGN" |  table * |  sort 0 timestamp | head 500'
search_malicious = 'search index=cicids17 source="Friday-WorkingHours-Afternoon-DDos.pcap_ISCX_clean.csv" label!="BENIGN" |  table * |  sort 0 timestamp | head 500'

# This is just for testing to make sure the encoder works right
benign_results = service.jobs.oneshot(search_benign, **kwargs_oneshot) 
benign_reader = results.ResultsReader(benign_results)

malicious_results = service.jobs.oneshot(search_malicious, **kwargs_oneshot) 
malicious_reader = results.ResultsReader(malicious_results)

Next up, iterate over all returned entries and convert them into a useable, in-memory datastructure.  
Splunk returns an [ordered dict](https://docs.python.org/3/library/collections.html#collections.OrderedDict) as datastructure, which is very convenient for further usage.  
Unfortunately, the dict also contains meta info from Splunk, so the entries has to be scrubbed from these.  
Additionally, the flow_id and timestamp fields are removed, as they change constantly and could possibly throw off the neural network.

In [3]:
import pandas as pd
pd.set_option('display.max_columns', None)

# delete all splunk-specific metadata as well as the flow_id and timestamp of the original data

data = []

# FIXME: Just don't concatenate at all! just use a sane search query pls.
for idx,item in enumerate(benign_reader):  
    try:
        del item['flow_id'],item['timestamp']
        del item['date_mday'],item['source'],item['index'],item['sourcetype'],item['_subsecond'],item['linecount'],item['_bkt'],item['_raw'],item['date_month'],item['date_year']
        del item['_time'],item['timeendpos'],item['timestartpos'],item['date_hour'],item['date_minute'],item['_cd'],item['date_zone'],item['host'],item['_serial']
        del item['date_second'],item['_sourcetype'],item['date_wday'],item['splunk_server'],item['punct'],item['_indextime'],item['_si']
    except KeyError as ke:
        pass # FIXME: Just ignore any key errors for now.
    data.append(item)
    
for idx,item in enumerate(malicious_reader):  
    try:
        del item['flow_id'],item['timestamp']
        del item['date_mday'],item['source'],item['index'],item['sourcetype'],item['_subsecond'],item['linecount'],item['_bkt'],item['_raw'],item['date_month'],item['date_year']
        del item['_time'],item['timeendpos'],item['timestartpos'],item['date_hour'],item['date_minute'],item['_cd'],item['date_zone'],item['host'],item['_serial']
        del item['date_second'],item['_sourcetype'],item['date_wday'],item['splunk_server'],item['punct'],item['_indextime'],item['_si']
    except KeyError as ke:
        pass # FIXME: Just ignore any key errors for now.
    data.append(item)   
    
netflows = pd.DataFrame(data)
print('Processed {} netflows'.format(len(netflows)))

Processed 1000 netflows


In [4]:
# FIXME: Why are these not deleted though the initial del item[stuff]?
netflows.drop(['_si'], axis=1, inplace=True)
netflows.drop(['_sourcetype'], axis=1, inplace=True)
netflows.drop(['_indextime'], axis=1, inplace=True)
netflows.drop(['punct'], axis=1, inplace=True)
netflows.drop(['date_wday'], axis=1, inplace=True)
netflows.drop(['splunk_server'], axis=1, inplace=True)
netflows.head(5)

Unnamed: 0,ack_flag_count,act_data_pkt_fwd,active_max,active_mean,active_min,active_std,average_packet_size,avg_bwd_segment_size,avg_fwd_segment_size,bwd_avg_bulk_rate,bwd_avg_bytes_per_bulk,bwd_avg_packets_per_bulk,bwd_header_length,bwd_iat_max,bwd_iat_mean,bwd_iat_min,bwd_iat_std,bwd_iat_total,bwd_packet_length_max,bwd_packet_length_mean,bwd_packet_length_min,bwd_packet_length_std,bwd_packets_per_s,bwd_psh_flags,bwd_urg_flags,cwe_flag_count,date_second,destination_ip,destination_port,down_per_up_ratio,ece_flag_count,fin_flag_count,flow_bytes_per_s,flow_duration,flow_iat_max,flow_iat_mean,flow_iat_min,flow_iat_std,flow_packets_per_s,fwd_avg_bulk_rate,fwd_avg_bytes_per_bulk,fwd_avg_packets_per_bulk,fwd_header_length,fwd_iat_max,fwd_iat_mean,fwd_iat_min,fwd_iat_std,fwd_iat_total,fwd_packet_length_max,fwd_packet_length_mean,fwd_packet_length_min,fwd_packet_length_std,fwd_packets_per_s,fwd_psh_flags,fwd_urg_flags,idle_max,idle_mean,idle_min,idle_std,init_win_bytes_backward,init_win_bytes_forward,label,max_packet_length,min_packet_length,min_seg_size_forward,packet_length_mean,packet_length_std,packet_length_variance,protocol,psh_flag_count,rst_flag_count,source_ip,source_port,subflow_bwd_bytes,subflow_bwd_packets,subflow_fwd_bytes,subflow_fwd_packets,syn_flag_count,total_backward_packets,total_fwd_packets,total_length_of_bwd_packets,total_length_of_fwd_packets,urg_flag_count,external_ip
0,1,21,1388645,154596.2222,315,462768.2921,245.7727273,316.0,201.5,0,0,0,704,16600000.0,5419158.286,3,6722736.628,114000000.0,316,316.0,316,0.0,0.19331713,0,0,0,0,192.168.10.3,389,0,0,0,138.9950168,113802641,16600000.0,1750809.862,2,4543699.313,0.579951391,0,0,0,"[1408, 1408]",16600000.0,2646573.047,2,5388154.566,114000000.0,403,201.5,0,203.8295572,0.386634261,1,0,16600000.0,12500000.0,7873388,3784672.562,2081,525,BENIGN,403,0,32,242.1044776,174.2978237,30379.73134,6,0,0,192.168.10.50,42576,6952,22,8866,44,1,22,44,6952,8866,0,
1,0,24,0,0.0,0,0.0,113.1529412,158.0454545,64.97560976,0,0,0,1424,106006.0,8678.465116,1,21852.1674,373174.0,976,158.0454545,0,312.6752498,33.16684582,0,0,0,0,192.168.10.50,22,1,0,0,7249.970979,1326626,953354.0,15793.16667,0,104621.7421,64.07231578,0,0,0,"[1328, 1328]",996355.0,33165.65,0,159221.5213,1326626.0,456,64.97560976,0,109.864573,30.90546997,0,0,0.0,0.0,0,0.0,243,29200,BENIGN,976,0,32,111.8372093,239.6868477,57449.78495,6,1,0,192.168.10.19,36558,6954,44,2664,41,0,44,41,6954,2664,0,
2,1,0,24503,24503.0,24503,0.0,6.857142857,6.0,6.0,0,0,0,120,9459512.0,1896146.8,1,4228058.524,9480734.0,6,6.0,6,0.0,0.632643453,0,0,0,0,172.16.0.1,58132,6,0,0,4.428504173,9484015,9459512.0,1580669.167,1,3859836.775,0.738084029,0,0,0,"[20, 20]",0.0,0.0,0,0.0,0.0,6,6.0,6,0.0,0.105440576,0,0,9459512.0,9459512.0,9459512,0.0,0,229,BENIGN,6,6,20,6.0,0.0,0.0,6,0,0,192.168.10.50,80,36,6,6,1,0,6,1,36,6,1,
3,1,0,5511,5511.0,5511,0.0,7.0,6.0,6.0,0,0,0,100,9802510.0,2451433.75,1,4900717.735,9805735.0,6,6.0,6,0.0,0.509786837,0,0,0,0,172.16.0.1,58120,5,0,0,3.670465224,9808021,9802510.0,1961604.2,1,4383199.822,0.611744204,0,0,0,"[20, 20]",0.0,0.0,0,0.0,0.0,6,6.0,6,0.0,0.101957367,0,0,9802510.0,9802510.0,9802510,0.0,0,229,BENIGN,6,6,20,6.0,0.0,0.0,6,0,0,192.168.10.50,80,30,5,6,1,0,5,1,30,6,1,
4,1,0,20942,20942.0,20942,0.0,6.857142857,6.0,6.0,0,0,0,120,9458106.0,1895558.0,4,4227600.328,9477790.0,6,6.0,6,0.0,0.632974957,0,0,0,0,172.16.0.1,58135,6,0,0,4.430824699,9479048,9458106.0,1579841.333,4,3859552.527,0.738470783,0,0,0,"[20, 20]",0.0,0.0,0,0.0,0.0,6,6.0,6,0.0,0.105495826,0,0,9458106.0,9458106.0,9458106,0.0,0,229,BENIGN,6,6,20,6.0,0.0,0.0,6,0,0,192.168.10.50,80,36,6,6,1,0,6,1,36,6,1,


In [7]:
# For debugging and speedup: Write this to json to skip the need for splunk queries time and time again
netflows.to_json('netflows_cic.json')

Splunk returns the newest entries at the top, which normally is a great idea. We, however need the correct spatial alignment.  
The Splunk Search API has a *sort* feature; however, it seems to act up sometimes on returning the full dataset. Clear **FIXME**

## One Hot Encoding

Let's use the [Keras text preprocessing utils](https://keras.io/preprocessing/text/) for the task of encoding all needed data.

In [None]:
import numpy as np
from keras.preprocessing.text import Tokenizer
# Instantiate a new tokenizer with 20 words. More labels are not present
label_tokenizer = Tokenizer(num_words=20) 
label_tokenizer.fit_on_texts(netflows['label'])

# Run the fitted tokenizer on the label column and save the encoded data as dataframe
enc_labels = label_tokenizer.texts_to_sequences(netflows['label'])

type(np.asarray(enc_labels))

enc_labels = np.concatenate(enc_labels).ravel()

In [None]:
# from sklearn.preprocessing import OneHotEncoder

# enc = OneHotEncoder(sparse=False) # Key here is sparse=False!
# y_categorical = enc.fit_transform(y.reshape((y.shape[0]),1))

In [None]:
# Generate OHE labels
df_source_ip = pd.get_dummies(netflows['source_ip'], prefix='source_ip')
df_destination_ip = pd.get_dummies(netflows['destination_ip'], prefix='destination_ip')
df_external_ip = pd.get_dummies(netflows['external_ip'], prefix='external_ip')

#Append the label columns to the dataset
netflows = pd.concat([netflows, df_source_ip], axis=1)
netflows = pd.concat([netflows, df_destination_ip], axis=1)
netflows = pd.concat([netflows, df_external_ip], axis=1)

#drop the original columns
netflows.drop(['source_ip'], axis=1, inplace=True)
netflows.drop(['destination_ip'], axis=1, inplace=True)
netflows.drop(['external_ip'], axis=1, inplace=True)
netflows.drop(['label'], axis=1, inplace=True)

In [None]:
netflows.head(5)

## Building the Keras model

Prepare the tensorboard callbacks and logdir stuff

In [None]:
from time import time
from keras.callbacks import TensorBoard
tensorboard = TensorBoard(log_dir="logs/lstm-{}".format(time()))

In [None]:
print(netflows.values.shape)

### A first simple RMSProp approximation

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import LSTM

model = Sequential()

model.add(Embedding(netflows.values.shape[-1], 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

history = model.fit(
    netflows.values.tolist(), 
    enc_labels.tolist(), 
    batch_size=16, 
    epochs=2, 
    validation_split=0.2,
    callbacks=[tensorboard],
    verbose=1,
)