# Unsupervised Learning Algorithms for Outlier Detection

netml is a network anomaly detection tool and library written in Python. It uses outlier detection algorithms, which can also operate on unlabeled traffic. In this lab, the traffic is labeled as either outlier or normal, which allows these algorithms to learn thresholds for outlier detection.
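To illustrate how labels can be used only for choosing a threshold, here is a minimal sketch that fits a scorer on unlabeled values and then picks a cutoff from a small labeled set. It uses a simple median-based score rather than any algorithm from netml, and the data, function names, and the choice of F1 as the selection criterion are all assumptions made for illustration.

```python
from statistics import median


def anomaly_scores(train, points):
    """Score points by their distance from the training median, in MAD units."""
    center = median(train)
    # Median absolute deviation; fall back to 1.0 if every value is identical.
    mad = median(abs(x - center) for x in train) or 1.0
    return [abs(x - center) / mad for x in points]


def best_threshold(scores, labels):
    """Return (f1, cutoff) for the cutoff with the highest F1 (label 1 = outlier)."""
    best_f1, best_cutoff = 0.0, None
    for cutoff in sorted(set(scores)):
        predicted = [int(s >= cutoff) for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_cutoff = f1, cutoff
    return best_f1, best_cutoff


# Hypothetical flow durations in seconds; no labels are needed to fit the scorer.
unlabeled_train = [0.9, 0.95, 1.0, 1.0, 1.05, 1.1]

# A small labeled set (1 = outlier) is used only to choose the cutoff.
labeled_points = [1.0, 1.1, 5.0, 0.9, 7.5]
labels = [0, 0, 1, 0, 1]

scores = anomaly_scores(unlabeled_train, labeled_points)
f1, cutoff = best_threshold(scores, labels)
predictions = [int(s >= cutoff) for s in scores]
print(f"F1 at the chosen cutoff: {f1:.2f}; predictions: {predictions}")
```

The scorer never sees a label. Only the cutoff that turns scores into outlier or normal decisions depends on the labeled examples.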

## Learning Objectives

1. Learn about the netml Python library.
2. Apply the library to example network traffic.
3. Learn how to apply a trained outlier detection model to your own traffic capture.

## Tasks

1. Install the netml library.
2. Run it on the provided test traffic to train a model.
3. Capture a traffic trace of your own and try to detect outliers in it.
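For the first task, a quick way to confirm the installation is to check whether the package can be found before running the rest of the lab. The sketch below uses only the standard library. The `pip install netml` hint assumes the library is published on PyPI under that name, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be located for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("netml"):
    print("netml is available")
else:
    # Assumption: the library is published on PyPI under the name "netml".
    print("netml not found; try: pip install netml")
```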

---


## Labeling the Network Traffic

```python
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data

pcap = PCAP(
    '../data/demo.pcap',
    flow_ptks_thres=2,
    random_state=42,
    verbose=10,
)

# extract flows from pcap
pcap.pcap2flows(q_interval=0.9)

# label each flow (optional)
pcap.label_flows(label_file='../data/demo.csv')

# extract features from each flow via IAT
pcap.flow2features('IAT', fft=False, header=False)

# dump data to disk
dump_data((pcap.features, pcap.labels), out_file='../out/IAT-features.dat')

# stats
print(pcap.features.shape, pcap.pcap2flows.tot_time, pcap.flow2features.tot_time)
```

```text
'_pcap2flows()' starts at 2021-04-12 09:49:50
pcap_file: ../data/demo.pcap
ith_packets: 0
len(flows): 188
total number of flows: 188. Num of flows < 2 pkts: 0, and >=2 pkts: 188 without timeout splitting.
kept flows: 188. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (188, 1)
        col_0
count 188.000
mean    7.901
std    28.353
min     0.000
25%     0.000
50%     0.001
75%     0.593
max   172.205
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_0   188 non-null    float64
dtypes: float64(1)
memory usage: 1.6 KB
None
0th_flow: len(pkts): 52
After splitting flows, the number of subflows: 196 and each of them has at least 2 packets.
'_pcap2flows()' ends at 2021-04-12 09:49:51 and takes 0.0277 mins.
'_label_flows()' starts at 2021-04-12 09:49:51
Label CSV 0th row
Number of labelled flows: 88; number of not existed flo
```
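The extraction steps in this section (grouping packets into flows, splitting long flows into subflows, and computing inter-arrival time features) can be sketched in plain Python. This is a simplified illustration and not netml's implementation: the packet tuples, the function names, the fixed splitting interval, and the feature dimension are all assumptions, whereas netml chooses its own values internally.

```python
from collections import defaultdict


def group_into_flows(packets, min_packets=2):
    """Group (timestamp, five_tuple) packets into flows keyed by five-tuple."""
    flows = defaultdict(list)
    for timestamp, five_tuple in packets:
        flows[five_tuple].append(timestamp)
    # Keep only flows with enough packets, with timestamps in arrival order.
    return {key: sorted(ts) for key, ts in flows.items() if len(ts) >= min_packets}


def split_by_interval(timestamps, interval):
    """Split one flow into subflows that each span at most `interval` seconds."""
    subflows, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[0] > interval:
            subflows.append(current)
            current = [ts]
        else:
            current.append(ts)
    subflows.append(current)
    return subflows


def iat_features(timestamps, dim):
    """Inter-arrival times, truncated or zero-padded to a fixed length."""
    iats = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return (iats + [0.0] * dim)[:dim]


# Hypothetical packets: (timestamp in seconds, (src, sport, dst, dport, proto)).
web = ("10.0.0.1", 1234, "10.0.0.2", 80, "TCP")
dns = ("10.0.0.3", 5555, "10.0.0.2", 53, "UDP")
packets = [(0.0, web), (0.5, web), (0.25, dns), (2.0, web), (2.5, web)]

flows = group_into_flows(packets, min_packets=2)
features = []
for timestamps in flows.values():
    for subflow in split_by_interval(timestamps, interval=1.0):
        if len(subflow) >= 2:  # a single packet has no inter-arrival time
            features.append(iat_features(subflow, dim=3))

print(len(flows), features)
```

Padding or truncating to a fixed dimension gives every flow a feature vector of the same length, which is what a detector such as a one-class SVM expects as input.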

## Training the Models

```python
from sklearn.model_selection import train_test_split

from netml.ndm.model import MODEL
from netml.ndm.ocsvm import OCSVM
from netml.utils.tool import dump_data, load_data

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

RANDOM_STATE = 42

# load data
(features, labels) = load_data('../out/IAT-features.dat')

# split train and test sets
(
    features_train,
    features_test,
    labels_train,
    labels_test,
) = train_test_split(features, labels, test_size=0.33, random_state=RANDOM_STATE)

# create detection model
ocsvm = OCSVM(kernel='rbf', nu=0.5, random_state=RANDOM_STATE)
ocsvm.name = 'OCSVM'
ndm = MODEL(ocsvm, score_metric='auc', verbose=10, random_state=RANDOM_STATE)

# train the model from the train set
ndm.train(features_train)

# evaluate the trained model
ndm.test(features_test, labels_test)

# dump data to disk
dump_data((ocsvm, ndm.history), out_file='../out/OCSVM-results.dat')

# stats
print(ndm.train.tot_time, ndm.test.tot_time, ndm.score)
```

```text
'_train()' starts at 2021-04-12 09:50:03
'_train()' ends at 2021-04-12 09:50:03 and takes 0.0 mins.
'_test()' starts at 2021-04-12 09:50:03
'_test()' ends at 2021-04-12 09:50:03 and takes 0.0 mins.
0.0 0.0 0.43200000000000005
```
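The last value printed above is the model's score under `score_metric='auc'`. To help interpret it, the sketch below computes AUC directly from its definition: the probability that a randomly chosen outlier receives a higher anomaly score than a randomly chosen normal flow, with ties counted as half. The label convention (1 = outlier), the assumption that higher scores mean more anomalous, and the example values are illustrative; how netml computes the metric internally is not shown here.

```python
def auc(labels, scores):
    """Fraction of (outlier, normal) pairs in which the outlier scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = outlier) and anomaly scores for four flows.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(example_labels, example_scores)
print(example_auc)
```

An AUC of 1.0 means every outlier is ranked above every normal flow, and 0.5 is what a random ranking would achieve. A score of about 0.432 is therefore slightly below chance on this test split, which suggests that the features or model settings need further tuning before the detector is applied to new captures.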


## Classification of Network Traffic for Outlier Detection