# Feature Extraction

Data representation plays a critical role in the performance of many machine learning methods. How network traffic is represented often determines a model's effectiveness as much as the choice of model itself. The wide range of events that network operators need to detect (e.g., attacks, malware, new applications, changes in traffic demands) allows for a broad range of possible models and data representations.

[NetML](https://pypi.org/project/netml/) is an open-source tool and end-to-end pipeline for anomaly detection in network traffic. This notebook walks through the use of that library.

First, let us load the library.

In [1]:
from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data



## Specify a Packet Capture File

Create a `PCAP` object for the packet capture from which you would like to extract features. You can set the minimum number of packets that a flow must contain to be included.

## Convert the Packet Capture Into Flows

Find the function in `netml` that converts the pcap file into flows. Examine the resulting data structure. What does it contain?
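
To build intuition for what a flow conversion step does, here is a minimal sketch in plain Python. It is not `netml`'s implementation: the packet tuples, the `group_into_flows` helper, and the `min_packets` parameter are all hypothetical, and a flow is assumed to be the set of packets sharing a five-tuple (source IP, destination IP, source port, destination port, protocol).

```python
from collections import defaultdict

# Hypothetical packet records:
# (timestamp, src_ip, dst_ip, src_port, dst_port, protocol)
packets = [
    (0.00, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
    (0.05, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
    (0.07, "10.0.0.3", "10.0.0.2", 6000, 53, "UDP"),
    (0.20, "10.0.0.1", "10.0.0.2", 5000, 80, "TCP"),
]


def group_into_flows(packets, min_packets=1):
    """Group packet timestamps by five-tuple.

    Flows with fewer than `min_packets` packets are dropped.
    """
    flows = defaultdict(list)
    for ts, src, dst, sport, dport, proto in packets:
        flows[(src, dst, sport, dport, proto)].append(ts)
    return {
        key: sorted(times)
        for key, times in flows.items()
        if len(times) >= min_packets
    }


flows = group_into_flows(packets, min_packets=2)
print(len(flows))  # number of flows that met the packet threshold
```

With a threshold of two packets, the single-packet UDP flow is discarded. Counting flows then amounts to taking the length of the resulting dictionary.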

### Count the Flows

How many flows are in your data structure?

## Extract Features from Each Flow

Use the `netml` library to extract features from each flow. The documentation and [accompanying paper](https://arxiv.org/pdf/2006.16993.pdf) provide examples of features that you can try to extract. First try to extract the inter-arrival times for each flow.

### Inter-Arrival Times
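
Inter-arrival times are the gaps between consecutive packet timestamps within a flow. The sketch below computes them in plain Python and pads each flow to a common length so that the results form a rectangular matrix. The helper names, the sample timestamps, and the `dim` value are hypothetical. Zero-padding is one possible explanation for trailing zeros in a feature matrix; whether `netml` does exactly this is an assumption to check against its documentation.

```python
def inter_arrival_times(timestamps):
    """Return the gaps between consecutive packet timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def to_fixed_length(values, dim):
    """Truncate or zero-pad `values` to exactly `dim` entries."""
    return (values + [0.0] * dim)[:dim]


# Hypothetical packet timestamps (in seconds) for two flows.
flow_timestamps = {
    "flow_a": [0.0, 0.5, 2.0, 2.25],
    "flow_b": [1.0, 1.5],
}

dim = 3  # assumed feature dimension shared by all flows
features = [
    to_fixed_length(inter_arrival_times(ts), dim)
    for ts in flow_timestamps.values()
]
print(features)
```

A flow with `n` packets yields `n - 1` inter-arrival times, so short flows end up with more padding than long ones.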

### Print the Features

Inspect the features for each flow.

In [7]:
iat_features

array([[3.47244024e-01, 7.56025314e-04, 4.64078903e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.50138903e-01, 2.55107880e-04, 4.32914972e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.70064783e-01, 2.50234127e-01, 1.03995309e+01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.32479668e-02, 1.62124634e-04, 3.94445896e-01, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.27911568e-02, 6.77895546e-03, 1.35493279e-03, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.31859779e-02, 3.81400824e-01, 2.24113464e-04, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
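
Printing the raw array shows only a few rows and columns, so a short summary can be easier to read. The sketch below uses a small hypothetical matrix of nested lists and a hypothetical `summarize` helper to report the number of flows, the feature dimension, and how many entries in each row are non-zero. For a NumPy array such as the one above, the `shape` attribute gives the first two values directly.

```python
# Hypothetical feature matrix: one row per flow, padded to a common width.
features = [
    [0.35, 0.0008, 0.46, 0.0],
    [0.57, 0.25, 0.0, 0.0],
]


def summarize(matrix):
    """Return (number of flows, feature dimension, non-zero count per flow)."""
    n_flows = len(matrix)
    dim = len(matrix[0]) if matrix else 0
    nonzero = [sum(1 for value in row if value != 0.0) for row in matrix]
    return n_flows, dim, nonzero


n_flows, dim, nonzero = summarize(features)
print(n_flows, dim, nonzero)
```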

## Thought Questions

What other features might you want to extract from packet captures that are not provided by the `netml` library?