# Anomaly Detection with Unsupervised Learning

In this assignment, you will use `netml` to perform unsupervised learning on a dataset that contains attack traffic. Unsupervised learning refers to training models without labeled examples, that is, without "supervision". Such models can be used to detect structure in data, including outliers.
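To make the idea of label-free outlier detection concrete, here is a minimal sketch that flags outliers using the median absolute deviation (MAD). It is a simple stand-in for illustration only, not one of the `netml` models, and the byte counts and the `mad_outlier_flags` helper are hypothetical.

```python
import statistics


def mad_outlier_flags(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    abs_dev = [abs(v - median) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        # At least half the points equal the median; flag anything that differs.
        return [d > 0 for d in abs_dev]
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normally distributed.
    return [0.6745 * d / mad > threshold for d in abs_dev]


# Hypothetical per-flow byte counts; no labels are used anywhere.
flow_bytes = [510, 498, 505, 502, 495, 507, 5000]
flags = mad_outlier_flags(flow_bytes)
print(flags)
```

Only the last value is flagged, because it lies far from the bulk of the data. The models used in this assignment apply the same principle to multi-dimensional flow features.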

In the example below, an OCSVM model is trained on features extracted from demo traffic included in the library and evaluated against labels from a CSV file. Both files come from the University of New Brunswick's Intrusion Detection Systems dataset.

## Part 1: Warmup

from sklearn.model_selection import train_test_split

from netml.ndm.model import MODEL
from netml.ndm.ocsvm import OCSVM
from netml.pparser.parser import PCAP

RANDOM_STATE = 42

pcap = PCAP(
    'data/demo/demo.pcap',
    flow_ptks_thres=2,
    random_state=RANDOM_STATE,
    verbose=0,
)

# extract flows from pcap
pcap.pcap2flows(q_interval=0.9)

# label each flow (optional)
pcap.label_flows(label_file='data/demo/demo.csv')

# extract features from each flow via IAT
pcap.flow2features('IAT', fft=False, header=False)

# load data
(features, labels) = (pcap.features,
                      pcap.labels)

# split train and test sets
(
    features_train,
    features_test,
    labels_train,
    labels_test,
) = train_test_split(features, labels, test_size=0.33, random_state=RANDOM_STATE)

# create detection model
ocsvm = OCSVM(kernel='rbf', nu=0.5, random_state=RANDOM_STATE)
ocsvm.name = 'OCSVM'
ndm = MODEL(ocsvm, score_metric='auc', verbose=0, random_state=RANDOM_STATE)

# train the model from the train set
ndm.train(features_train)

# evaluate the trained model
ndm.test(features_test, labels_test)

# stats
print(ndm.train.tot_time, ndm.test.tot_time, ndm.score)

'_pcap2flows()' starts at 2022-11-17 09:56:16
'_pcap2flows()' ends at 2022-11-17 09:56:17 and takes 0.024 mins.
'_label_flows()' starts at 2022-11-17 09:56:17
'_label_flows()' ends at 2022-11-17 09:56:18 and takes 0.0073 mins.
'_flow2features()' starts at 2022-11-17 09:56:18
'_flow2features()' ends at 2022-11-17 09:56:18 and takes 0.0 mins.
'_train()' starts at 2022-11-17 09:56:18
'_train()' ends at 2022-11-17 09:56:18 and takes 0.0 mins.
'_test()' starts at 2022-11-17 09:56:18
'_test()' ends at 2022-11-17 09:56:18 and takes 0.0 mins.
0.0 0.0 0.632


## Part 2: Explore Other Anomaly Detection Models

The `netml` library supports a number of anomaly detection models, including:

* Autoencoder
* Gaussian Mixture Model
* Isolation Forest
* Kernel Density Estimation
* One-Class SVM (OCSVM) (as shown above)
* Principal Component Analysis

Try the warmup example with two (or more) of the models listed above. Which tend to perform better?
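Comparing models requires a common metric; the warmup uses `score_metric='auc'`. The sketch below shows one way AUC can be computed from anomaly scores, as the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score. It is an illustration of the metric, not necessarily how `netml` computes it, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random anomaly outscores a random normal flow.

    labels: 0 for normal, 1 for anomaly. Ties between scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one flow of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores from two models on the same five test flows.
labels = [0, 0, 0, 1, 1]
model_a_scores = [0.1, 0.4, 0.35, 0.8, 0.3]
model_b_scores = [0.2, 0.1, 0.3, 0.9, 0.7]

auc_a = roc_auc(labels, model_a_scores)
auc_b = roc_auc(labels, model_b_scores)
print(f"model A: {auc_a:.3f}, model B: {auc_b:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the model with the higher value ranks anomalies above normal flows more consistently.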

## Part 3: Anomaly Detection for Activity Detection

Now that you have some basic experience with using the `netml` library for anomaly detection, your assignment is to apply anomaly detection on a different dataset and problem: activity detection.

When users interact with smart home devices in a connected home setting, these devices may generate traffic that differs from idle behavior (e.g., an increase in traffic volume).  Your task is to train anomaly detection models to distinguish *idle* traffic from *device interaction*. We have provided two traces from an Amazon Echo device: 

* An idle trace ("normal")
* A trace where a user plays music ("anomaly")

Your task will be to:

1. Use `netml` to construct flows from these traces.
2. Use `netml` to label the flows (you will probably have to override the `label_flows` function).
3. Evaluate one or more anomaly detection models to detect activity.

## Part 3.1: Construct the Flows

Construct flows from the provided pcaps, which cover two classes: (1) no interaction and (2) playing music with the device. There are multiple samples for each class, so you will need to manipulate the data before you can train your model with it. Probably the easiest way to do this is to concatenate the pcaps for each class and then load them into the `netml` `PCAP` class, but there are other ways to do this.
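One possible way to concatenate captures without extra tools is sketched below. It assumes the traces are in the classic pcap format (not pcapng) and share the same byte order and link type; the `concat_pcaps` helper and the file names in the usage comment are hypothetical.

```python
import shutil

PCAP_HEADER_LEN = 24  # size of the classic pcap global header in bytes
# Magic numbers of the classic pcap format in both byte orders
# (microsecond and nanosecond timestamp variants).
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1", b"\xa1\xb2\xc3\xd4",
    b"\x4d\x3c\xb2\xa1", b"\xa1\xb2\x3c\x4d",
}


def concat_pcaps(input_paths, output_path):
    """Append the packet records of several classic pcap files into one file."""
    first_header = None
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                header = src.read(PCAP_HEADER_LEN)
                if len(header) != PCAP_HEADER_LEN or header[:4] not in PCAP_MAGICS:
                    raise ValueError(f"{path} is not a classic pcap file")
                if first_header is None:
                    # Keep the global header of the first file only.
                    first_header = header
                    out.write(header)
                elif (header[:4], header[20:24]) != (first_header[:4], first_header[20:24]):
                    # Bytes 20-23 hold the link type; it must match across files.
                    raise ValueError(f"{path} has a different byte order or link type")
                # Copy the remaining bytes, which are the packet records.
                shutil.copyfileobj(src, out)


# Example usage with hypothetical file names:
# concat_pcaps(["idle_1.pcap", "idle_2.pcap"], "idle_all.pcap")
```

Packets stay in file order, so timestamps in the merged file may not increase monotonically across the original captures. If that matters for flow construction, a merging tool that sorts by timestamp may be a better fit.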

## Part 3.2: Label the Flows

You might consider using `netml` to label your flows. To do so, you will need either to generate a label file in the format that `label_flows` expects or to override the `label_flows` function.
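Because each trace contains a single class, every flow from a given trace can receive the same label. The sketch below shows that idea with plain Python lists, independent of `netml`; the helper names are hypothetical, and the convention of 0 for normal and 1 for anomaly is an assumption you should check against the library's scoring.

```python
NORMAL = 0   # assumed label for idle flows
ANOMALY = 1  # assumed label for flows from the music trace


def constant_labels(flows, label):
    """Return one copy of the label per flow."""
    return [label] * len(flows)


def combine_classes(idle_flows, music_flows):
    """Concatenate both classes and build the matching label list."""
    flows = list(idle_flows) + list(music_flows)
    labels = constant_labels(idle_flows, NORMAL) + constant_labels(music_flows, ANOMALY)
    return flows, labels


# Placeholder flow identifiers standing in for flows built from each trace.
idle_flows = ["idle-flow-1", "idle-flow-2", "idle-flow-3"]
music_flows = ["music-flow-1", "music-flow-2"]

flows, labels = combine_classes(idle_flows, music_flows)
print(labels)
```

An overridden `label_flows` could apply the same logic by assigning one constant label to all flows of the capture it was loaded from.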

## Part 3.3: Train and Evaluate a Model

Select one of the unsupervised models in the `netml` library, then train and evaluate it to detect interactions with the Amazon Echo device (vs. idle traffic).
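One common way to set up this evaluation is to train only on idle flows and to test on held-out idle flows plus all interaction flows. The sketch below shows such a split with the standard library; the `novelty_split` helper is hypothetical, and it assumes labels of 0 for idle and 1 for interaction.

```python
import random


def novelty_split(features, labels, test_fraction=0.33, seed=42):
    """Split so that training data contains only normal (label 0) samples.

    Returns (train_features, test_features, test_labels), where the test set
    holds the held-out normal samples followed by every anomalous sample.
    """
    normal = [f for f, y in zip(features, labels) if y == 0]
    anomalous = [f for f, y in zip(features, labels) if y != 0]
    if len(normal) < 2:
        raise ValueError("need at least two normal samples to split")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(normal)
    n_test = min(len(normal) - 1, max(1, round(len(normal) * test_fraction)))

    held_out, train = normal[:n_test], normal[n_test:]
    test_features = held_out + anomalous
    test_labels = [0] * len(held_out) + [1] * len(anomalous)
    return train, test_features, test_labels


# Placeholder feature values: ten idle samples and three interaction samples.
features = list(range(10)) + [100, 101, 102]
labels = [0] * 10 + [1] * 3

train, test_features, test_labels = novelty_split(features, labels)
print(len(train), len(test_features), test_labels)
```

The resulting sets could then be passed to `ndm.train(...)` and `ndm.test(...)` as in the warmup, possibly after converting them to the array type the library expects.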