# Building an ML classifier for malicious ICS traffic

In this part of the workshop we will build a classifier to detect malicious traffic in an Industrial Control System (ICS) network.



In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
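# get_testing_set is called further down; it is assumed to live in utils alongside get_training_data
from utils import get_training_data, get_testing_set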

### Dataset

The data used in these experiments was originally collected by researchers at the University of Coimbra. [Original paper here](https://link.springer.com/chapter/10.1007/978-3-030-05849-4_19): Frazão, I., Abreu, P.H., Cruz, T., Araújo, H. and Simões, P., 2018, September. Denial of service attacks: Detecting the frailties of machine learning algorithms in the classification process. In International Conference on Critical Information Infrastructures Security (pp. 230-235). Springer, Cham.

Thank you to the authors for their help labelling the dataset!

The dataset was collected on an ICS testbed during benign activity and during three kinds of DDoS attack: TCP SYN flood, ping flood, and Modbus query flood.

Here we use a subset of the whole dataset to save computational resources. 

Features were extracted from the pcap files. Only a subset of the possible features is used here; more could be added. Non-repeatable features (e.g. timestamps, ID numbers) were removed, and any feature with low or zero variance on the training set was omitted to avoid unnecessary computation.
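As an illustration of the variance filter described above, here is a minimal sketch using only the standard library. The helper name `high_variance_columns`, the threshold, and the toy packet values are assumptions for illustration; the workshop's actual preprocessing code is not shown here.

```python
from statistics import pvariance


def high_variance_columns(rows, threshold=0.0):
    """Return the column names whose population variance exceeds `threshold`."""
    columns = rows[0].keys()
    return [c for c in columns if pvariance([r[c] for r in rows]) > threshold]


# Toy packets (made-up values): IP__ihl is constant, so it carries no signal.
packets = [
    {"IP__ihl": 5, "IP__ttl": 64, "IP__len": 60},
    {"IP__ihl": 5, "IP__ttl": 128, "IP__len": 1500},
    {"IP__ihl": 5, "IP__ttl": 64, "IP__len": 40},
]

kept = high_variance_columns(packets)
print(kept)  # ['IP__ttl', 'IP__len']
```

The variance is measured on the training set only, so the same column list can then be applied unchanged to the test set.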

Categorical features use one-hot representation. 
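A one-hot representation turns a categorical column into one 0/1 indicator column per category. The sketch below shows the idea in plain Python; the helper `one_hot` and the output column naming (`IP__proto_6` and so on) are hypothetical and may not match the names used in the workshop data.

```python
def one_hot(values, prefix):
    """Encode each value as a dict of 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


# Toy IP protocol numbers: 6 = TCP, 17 = UDP, 1 = ICMP
protocols = [6, 17, 6, 1]
encoded = one_hot(protocols, prefix="IP__proto")
print(encoded[0])  # {'IP__proto_1': 0, 'IP__proto_6': 1, 'IP__proto_17': 0}
```

Encoding protocol numbers this way stops a model from treating them as ordered quantities (UDP is not "greater than" TCP).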

<b>The models here use per-packet classification. Flow-based analysis, rolling averages and time-series methods are common alternatives but are not used here because of the additional computational resources they require.</b>

In [2]:
# List of possible features 
features = [
    # Ethernet
    "Ethernet__type",
    # IP
    "IP__ihl", 
    "IP__tos", 
    "IP__len",  
    "IP__flags", 
    "IP__frag", 
    "IP__ttl", 
    "IP__proto",
    "IP__src", 
    "IP__dst", 
    # TCP
    "TCP__sport", 
    "TCP__dport", 
    "TCP__seq", 
    "TCP__ack", 
    "TCP__dataofs", 
    "TCP__flags", 
    # ModbusADU 
    "ModbusADU__protoId", 
    "ModbusADU__len", 
    "ModbusADU__unitId", 
    # UDP
    "UDP__sport", 
    "UDP__dport" ,  
    "UDP__len",
    # BOOTP
    "BOOTP__secs", 
    # ICMP(v6)
    "ICMP__type",
    "ICMPv6 Neighbor Discovery - Neighbor Solicitation__type",
    "ICMPv6 Neighbor Discovery - Router Solicitation__type", 
    "ICMPv6 Neighbor Discovery Option - Source Link-Layer Address__type",
    "ICMPv6 Neighbor Discovery Option - Source Link-Layer Address__len",
    # DHCP
    "DHCPv6 Solicit Message__msgtype",
    "DHCP6 Elapsed Time Option__optcode",
    "DHCP6 Elapsed Time Option__optlen",
    "DHCP6 Elapsed Time Option__elapsedtime",
    "DHCP6 Client Identifier Option__optcode",
    "DHCP6 Option Request Option__optcode",
    "DHCP6 Option Request Option__optlen",

    # LLMNR
    "Link Local Multicast Node Resolution - Query__qdcount",

    # interpacket time
    "time_delta",
]

True     30584
False    12397
Name: malicious, dtype: int64

## Models

We have built wrappers for three anomaly detection models and six supervised learning models.

All models take the chosen features as input before training. The anomaly detection models also take a contamination ratio (the fraction of samples that are malicious) as input.
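One common way an anomaly detector uses a contamination ratio is to set its decision threshold so that roughly that fraction of the training samples is flagged. The sketch below illustrates the idea with made-up anomaly scores; `flag_anomalies` is a hypothetical helper, and the internals of the workshop's wrappers may differ.

```python
def flag_anomalies(scores, contamination):
    """Flag the top `contamination` fraction of samples by anomaly score."""
    n_flagged = round(contamination * len(scores))
    if n_flagged == 0:
        return [False] * len(scores)
    # The n-th highest score becomes the decision threshold.
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [s >= threshold for s in scores]


# Toy labels and scores (made up): higher score = more anomalous
labels = [True, False, False, False]
contamination = sum(labels) / len(labels)  # fraction of malicious samples
scores = [0.9, 0.1, 0.2, 0.3]

flags = flag_anomalies(scores, contamination)
print(contamination, flags)  # 0.25 [True, False, False, False]
```

If several samples tie at the threshold, this sketch flags all of them, so slightly more than the requested fraction can be flagged.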

To save your computers, we have pretrained some models, which can be loaded by name from the list below.

In [3]:
# List of available models

from models import ISOF, OneClassSVM, RandomForest, DecisionTree, MLP, Adaboost, SVM, XGBoost, Model

anomaly_detection_models = [
    OneClassSVM, # One class support vector machine
    ISOF, # Isolation Forest
]

classification_models = [
    RandomForest, 
    DecisionTree, # (fastest to train)
    MLP, # multi-layer perceptron (feed-forward neural network) (can be slow)
    SVM # support vector machine (can be slow with many features)
]

# feature sets:
# all = all features
# time_model = ["time_delta", "IP__ttl", "Ethernet__type"]
# src_dst_features = IP sources, destinations, TCP and UDP source and destination ports (only top 5 most common ports used to save computational capacity)
# all_except_src_dst = all features excluding those used in src_dst 
# IP_features = all features starting with IP__ from list above
# tcp_udp_modbus_icmp_boot = all features relating to TCP, UDP, MODBUS, ICMP, and BOOTP

model_names = [
    # pretrained models, named <feature set>_<model> (dt = DecisionTree, rf = RandomForest)
    "all_dt",
    "all_except_src_dst_dt",
    "all_except_src_dst_ISOF",
    "all_except_src_dst_MLP",
    "all_except_src_dst_rf",
    "all_ISOF",
    "all_LOF",
    "all_MLP",
    "all_rf",
    "IP_features_dt",
    "IP_features_ISOF",
    "IP_features_MLP",
    "IP_features_rf",
    "src_dst_features_dt",
    "src_dst_features_ISOF",
    "src_dst_features_LOF",
    "src_dst_features_MLP",
    "src_dst_features_rf",
    "tcp_udp_modbus_icmp_boot_dt",
    "tcp_udp_modbus_icmp_boot_ISOF",
    "tcp_udp_modbus_icmp_boot_MLP",
    "tcp_udp_modbus_icmp_boot_rf",
    "time_model_dt",
    "time_model_ISOF",
    "time_model_LOF",
    "time_model_MLP",
    "time_model_rf",
]
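The feature sets named in the comments above could be derived from the feature list roughly as follows. This is a sketch on a shortened feature list, not the code used to train the saved models; the variable names are assumptions, and the restriction to the top 5 most common ports is not reproduced.

```python
# Shortened version of the feature list, for illustration only
features = [
    "Ethernet__type",
    "IP__ttl", "IP__proto", "IP__src", "IP__dst",
    "TCP__sport", "TCP__dport", "TCP__flags",
    "UDP__sport", "UDP__dport",
    "time_delta",
]

# IP_features: everything whose name starts with "IP__"
ip_features = [f for f in features if f.startswith("IP__")]

# time_model: the fixed three-feature set from the comments
time_model = ["time_delta", "IP__ttl", "Ethernet__type"]

# src_dst_features: addresses and ports; all_except_src_dst is the complement
src_dst_columns = {
    "IP__src", "IP__dst",
    "TCP__sport", "TCP__dport",
    "UDP__sport", "UDP__dport",
}
src_dst_features = [f for f in features if f in src_dst_columns]
all_except_src_dst = [f for f in features if f not in src_dst_columns]

print(ip_features)
print(all_except_src_dst)
```

Any of these lists can be passed as the first argument when training a new model in the exercises.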

In [None]:
# the training set consists of clean traffic plus TCP SYN flood and ping flood attacks

print("Training set")
train_data = get_training_data()
contamination_ratio = train_data["malicious"].sum() / len(train_data)

# the test set consists of clean traffic and ping and MODBUS query flood attacks
print("Testing set")
test_data = get_testing_set()

# try loading an existing model and observing its results on the test data
# (for an existing model we do not need to supply the features)
# the output will show the accuracy, F1 score, true positive rate, true negative rate and confusion matrix:
# [[tn, fp],
#  [fn, tp]]
model = RandomForest(None, save_model_name="time_model_rf")
model.test(test_data)
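To help read the output, the sketch below shows how the reported metrics relate to a confusion matrix laid out as `[[tn, fp], [fn, tp]]`. The function `summarize` and the counts are made up for illustration; they are not results from any of the workshop models.

```python
def summarize(confusion):
    """Compute accuracy, TPR, TNR and F1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate (recall)
    tnr = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tpr,
        "tnr": tnr,
        "f1": f1,
    }


# Made-up counts: 60 benign packets (10 misflagged), 40 malicious (5 missed)
metrics = summarize([[50, 10], [5, 35]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can mislead when the classes are imbalanced, which is why the true positive and true negative rates are reported separately.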

## Part 1 exercises:

In [1]:
# 1. try loading some other existing models and observe the performance on the test set
# NB, if you don't know what kind of model it was you can use the generic class Model
# 
# model = RandomForest(None, save_model_name=<model_name_here>)
# model.test(test_data)
#
# e.g.1 
# model = Model(None, save_model_name="all_ISOF")
# model.test(test_data)




In [2]:
# 2. if you have the computational resources, try training your own model
# TIP: a decision tree is the quickest to train, and fewer features = faster training
# use save_model_name to save your model so you can load it later (skip this if you don't want to save)
# 
# model = <Algo_name>(<list_of_features>, [save_model_name=<model_name_here>, [contamination=contamination_ratio]])
# model.test(test_data)
# 
# e.g.1
# model = DecisionTree(["IP__proto", "IP__flags"], save_model_name="2IPfeatures_dt")
# model.test(test_data)
#
# e.g.2
# model = ISOF(["IP__len", "ModbusADU__len", "UDP__len"], save_model_name="len_ISOF", contamination=contamination_ratio)
# model.test(test_data)



In [3]:
# 3. Based on the performance metrics you have observed, which is the model you would use?

In [5]:
# 4. Which model do you think will be the most robust to adversarial manipulation? 
# Are the answers to 3. and 4. the same?

### ----- end of part 1 ------