# Building an ML classifier for malicious ICS traffic

In this part of the workshop we will build a classifier to detect malicious traffic in an Industrial Control System (ICS) network.



In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
from utils import get_training_data, get_testing_data
from models import ISOF, OneClassSVM, RandomForest, DecisionTree, MLP, SVM, Model

### Dataset

The data used in these experiments was originally collected by researchers at the University of Coimbra ([original paper](https://link.springer.com/chapter/10.1007/978-3-030-05849-4_19)):

Frazão, I., Abreu, P.H., Cruz, T., Araújo, H. and Simões, P., 2018, September. Denial of service attacks: Detecting the frailties of machine learning algorithms in the classification process. In International Conference on Critical Information Infrastructures Security (pp. 230-235). Springer, Cham.

Thank you to the authors for their help labelling the dataset!

The dataset was collected on an ICS testbed during benign activity and during three kinds of DDoS attack: TCP SYN flood, ping flood, and Modbus query flood.

Here we use a subset of the whole dataset to save computational resources. 

Features were extracted from the pcap files. Only a subset of the possible features is used here; more could be added. Non-repeatable features (e.g. timestamps, ID numbers) were removed, and any feature with low or zero variance on the training set was omitted to avoid unnecessary computation.
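As a rough illustration of the variance filter described above, the sketch below keeps only the columns whose training-set variance exceeds a threshold. The toy DataFrame, the `drop_low_variance` helper and the threshold are assumptions made for illustration; the workshop's actual preprocessing code is not shown in this notebook.

```python
import pandas as pd

def drop_low_variance(train_df, threshold=0.0):
    """Return the column names whose training-set variance exceeds `threshold`.

    Hypothetical helper, not part of the workshop's utils module.
    """
    variances = train_df.var()  # per-column sample variance
    return variances[variances > threshold].index.tolist()

# Toy stand-in for extracted packet features (values are made up)
toy_train = pd.DataFrame({
    "IP__ttl": [64, 128, 64, 255],
    "IP__ihl": [5, 5, 5, 5],        # constant column: zero variance
    "IP__len": [60, 1500, 52, 84],
})

kept = drop_low_variance(toy_train)
print(kept)  # ['IP__ttl', 'IP__len']
```

The variance is computed on the training set only, so the same list of kept columns can then be applied unchanged to the test set.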

Categorical features use one-hot representation. 
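A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The toy values are made up, and whether the workshop's pipeline uses this exact call is an assumption, although the resulting column names follow the same `<feature>_<value>` pattern as the feature names printed when a model is loaded later in this notebook.

```python
import pandas as pd

# Toy packets (values are made up); Ethernet__type is treated as categorical
train = pd.DataFrame({
    "IP__ttl": [64, 128, 64],
    "Ethernet__type": [2048.0, 2054.0, 2048.0],
})
test = pd.DataFrame({
    "IP__ttl": [64, 255],
    "Ethernet__type": [2048.0, 34525.0],  # 34525.0 never appears in training
})

train_enc = pd.get_dummies(train, columns=["Ethernet__type"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Ethernet__type"], dtype=int)

# Align the test columns with the training columns: categories unseen in
# training are dropped, and categories missing from the test set become 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
# ['IP__ttl', 'Ethernet__type_2048.0', 'Ethernet__type_2054.0']
```

Aligning the test columns with the training columns matters because a trained model expects exactly the columns it saw during training.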

<b>The models here use per-packet classification. Flow-based analysis, rolling averages and time-series methods are common alternatives but are not used here because of the additional computational resources they require.</b>

In [11]:
# List of possible features 
features = [
    # Ethernet
    "Ethernet__type",
    # IP
    "IP__ihl", 
    "IP__tos", 
    "IP__len",  
    "IP__flags", 
    "IP__frag", 
    "IP__ttl", 
    "IP__proto",
    "IP__src", 
    "IP__dst", 
    # TCP
    "TCP__sport", 
    "TCP__dport", 
    "TCP__seq", 
    "TCP__ack", 
    "TCP__dataofs", 
    "TCP__flags", 
    # ModbusADU 
    "ModbusADU__protoId", 
    "ModbusADU__len", 
    "ModbusADU__unitId", 
    # UDP
    "UDP__sport", 
    "UDP__dport" ,  
    "UDP__len",
    # BOOTP
    "BOOTP__secs", 
    # ICMP(v6)
    "ICMP__type",
    "ICMPv6 Neighbor Discovery - Neighbor Solicitation__type",
    "ICMPv6 Neighbor Discovery - Router Solicitation__type", 
    "ICMPv6 Neighbor Discovery Option - Source Link-Layer Address__type",
    "ICMPv6 Neighbor Discovery Option - Source Link-Layer Address__len",
    # DHCP
    "DHCPv6 Solicit Message__msgtype ",
    "DHCP6 Elapsed Time Option__optcode", 
    "DHCP6 Elapsed Time Option__optlen", 
    "DHCP6 Elapsed Time Option__elapsedtime", 
    "DHCP6 Client Identifier Option__optcode",
    "DHCP6 Option Request Option__optcode",
    "DHCP6 Option Request Option__optlen", 
    # LLMNR
    "Link Local Multicast Node Resolution - Query__qdcount",
    
    # interpacket time
    "time_delta"
]

## Models

We have built wrappers for two anomaly detection models and four supervised learning models.

All models take the chosen features as input before training. The anomaly detection models all take a contamination ratio (the fraction of samples that are malicious) as an additional input.
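To make the contamination ratio concrete, the sketch below recomputes it from the label counts printed in the training-set output further down this notebook. The range check reflects scikit-learn's `IsolationForest`, which accepts `contamination` only as `'auto'` or a float in (0, 0.5]. How the workshop's wrappers pass this value to the underlying library is not shown here, so treat the check as an aside rather than a description of their behaviour.

```python
# Label counts copied from the training-set output in this notebook
n_malicious = 1_080_540
n_clean = 503_716

# Fraction of training packets labelled malicious
contamination_ratio = n_malicious / (n_malicious + n_clean)
print(f"contamination ratio: {contamination_ratio:.3f}")  # 0.682

# scikit-learn's IsolationForest only accepts a float contamination in (0, 0.5]
fits_sklearn_range = 0 < contamination_ratio <= 0.5
print(f"within (0, 0.5]: {fits_sklearn_range}")  # False
```

Because malicious packets are the majority class in this training set, the ratio is well above the range usually assumed for anomaly detection, where anomalies are expected to be rare.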

To save your computers, we have pretrained some models, which can be loaded using the names listed below.

In [12]:
# List of available models
anomaly_detection_models = [
    OneClassSVM, # One class support vector machine
    ISOF, # Isolation Forest
]

classification_models = [
    RandomForest, 
    DecisionTree, # (fastest to train)
    MLP, # multi-layer perceptron (feed-forward neural network) (can be slow)
    SVM # support vector machine (can be slow with many features)
]

## feature sets:
# all = all features
# time_model = ["time_delta", "IP__ttl", "Ethernet__type"]
# src_dst_features = IP sources, destinations, TCP and UDP source and destination ports (only top 5 most common ports used to save computational capacity)
# all_except_src_dst = all features excluding those used in src_dst 
# IP_features = all features starting with IP__ from list above
# tcp_udp_modbus_icmp_boot = all features relating to TCP, UDP, MODBUS, ICMP, and BOOTP

model_names = [
    # decision tree models
    "all_dt",
    "all_except_src_dst_dt",
    "IP_features_dt",
    "src_dst_features_dt",
    "tcp_udp_modbus_icmp_boot_dt",
    "time_model_dt",
    
    # random forest models
    "all_except_src_dst_rf",
    "all_rf",
    "time_model_rf",  # used in the example below
    
    # MLP models
    "tcp_udp_modbus_icmp_boot_MLP",
    "all_MLP",

    # SVM models
    
    # One Class SVM
    
    # Isolation forest
    "all_ISOF",
    "src_dst_features_ISOF",
]
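The named feature sets described in the comments above can be derived from the feature list by simple name matching. The sketch below is one way to do that on a shortened copy of the list. The variable names and the suffix-matching rule are assumptions for illustration, and the sketch ignores the "top 5 most common ports" restriction mentioned above.

```python
# Shortened copy of the feature list, for illustration only
features_subset = [
    "Ethernet__type", "IP__ttl", "IP__src", "IP__dst",
    "TCP__sport", "TCP__dport", "UDP__sport", "UDP__dport",
    "time_delta",
]

# IP_features: every feature whose name starts with "IP__"
ip_features = [f for f in features_subset if f.startswith("IP__")]

# src_dst_features: IP addresses and TCP/UDP ports (matched by suffix here)
src_dst_features = [
    f for f in features_subset
    if f.endswith(("__src", "__dst", "__sport", "__dport"))
]

# all_except_src_dst: everything not in src_dst_features
all_except_src_dst = [f for f in features_subset if f not in src_dst_features]

print(ip_features)         # ['IP__ttl', 'IP__src', 'IP__dst']
print(all_except_src_dst)  # ['Ethernet__type', 'IP__ttl', 'time_delta']
```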

In [13]:
# the training set consists of clean traffic plus TCP SYN flood, ping flood and Modbus query flood attacks (see the attack_type counts in the output)

print("Training set")
train_data = get_training_data()
contamination_ratio = train_data["malicious"].sum() / len(train_data)

# the test set consists of clean traffic plus TCP SYN flood, ping flood and Modbus query flood attacks (see the attack_type counts in the output)
print("\nTesting set")
test_data = get_testing_data()

# try loading an existing model and observing the results on the test data
# (for an existing model we do not need to supply the features, so we pass None)
# the output will show the accuracy, F1 score, true positive rate, true negative rate and confusion matrix:
# [[tn, fp],
#  [fn, tp]]
model = RandomForest(None, "time_model_rf")
model.test(test_data)

Training set
1    1080540
0     503716
Name: malicious, dtype: int64
modbusQueryFlooding    631677
clean                  503716
pingFloodDDoS          390831
tcpSYNFloodDDoS         58032
Name: attack_type, dtype: int64
Testing set
1    418874
0    148467
Name: malicious, dtype: int64
pingFloodDDoS          194436
tcpSYNFloodDDoS        182094
clean                  148467
modbusQueryFlooding     42344
Name: attack_type, dtype: int64
save_model_path exists, loading model and config....
RandomForestClassifier()
['time_delta', 'IP__ttl', 'Ethernet__type_2048.0', 'Ethernet__type_2054.0', 'Ethernet__type_0.0', 'Ethernet__type_34525.0', 'Ethernet__type_32821.0', 'IP__proto_6.0', 'IP__proto_17.0', 'IP__proto_0.0', 'IP__proto_1.0', 'IP__proto_2.0']
-----
Testing acc: 0.93, f1: 0.95, tpr: 0.93, tnr 0.94
[[139191   9276]
 [ 29439 389435]]
-----
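The reported metrics can be recomputed by hand from the confusion matrix, which is a useful sanity check when comparing models. The sketch below uses the `[[tn, fp], [fn, tp]]` layout described above and the matrix printed for `time_model_rf`. The `summarise` helper is hypothetical and is not part of the workshop's `models` module.

```python
def summarise(confusion):
    """Compute accuracy, F1, TPR and TNR from a [[tn, fp], [fn, tp]] matrix."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "acc": (tp + tn) / total,          # fraction of packets classified correctly
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
        "tpr": tp / (tp + fn),             # malicious packets that were detected
        "tnr": tn / (tn + fp),             # clean packets that were passed
    }

# Confusion matrix copied from the output above
metrics = summarise([[139191, 9276], [29439, 389435]])
rounded = {name: round(value, 2) for name, value in metrics.items()}
print(rounded)  # {'acc': 0.93, 'f1': 0.95, 'tpr': 0.93, 'tnr': 0.94}
```

Reading TPR and TNR separately matters here because the test set is imbalanced (roughly 74% malicious), so accuracy alone can hide poor performance on clean traffic.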


## Part 1 exercises:

In [None]:
# 1. try loading some other existing models and observe the performance on the test set
# NB, if you don't know what kind of model it was you can use the generic class Model
# 
# model = RandomForest(None, save_model_name=<model_name_here>)
# model.test(test_data)
#
# e.g.1 
# model = Model(None, save_model_name="all_ISOF")
# model.test(test_data)




In [None]:
# 2. if you have the computational resources, try training your own model
# TIP: decision tree is the quickest to train, and fewer features = faster training 
# use save_model_name to save your model so you can load it later (skip this if you don't want to save)
# 
# model = <Algo_name>(<list_of_features>, [save_model_name=<model_name_here>, [contamination=contamination_ratio]])
# model.train(train_data)
# model.test(test_data)
# 
# e.g.1
# model = DecisionTree(["IP__proto", "IP__flags"], save_model_name="2IPfeatures_dt")
# model.train(train_data)
# model.test(test_data)
#
# e.g.2
# model = ISOF(["IP__len", "ModbusADU__len", "UDP__len"], save_model_name="len_ISOF", contamination=contamination_ratio)
# model.train(train_data)
# model.test(test_data)


In [None]:
# 3. Based on the performance metrics you have observed, which model would you use?

In [None]:
# 4. Which model do you think will be the most robust to adversarial manipulation? 
# Are the answers to 3. and 4. the same?

### ----- end of part 1 ------