# NFStream: a Flexible Network Data Analysis Framework

[**nfstream**][repo] is a Python package providing fast, flexible, and expressive data structures designed to make working with **online** or **offline** network data both easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real-world** network data analysis in Python. Additionally, it has
the broader goal of becoming **a common network data processing framework for researchers**, providing data reproducibility across experiments.

* **Performance:** **nfstream** is designed to be fast (10x faster with pypy3 support) with a small CPU and memory footprint.
* **Layer-7 visibility:** The **nfstream** deep packet inspection engine is based on [**nDPI**][ndpi]. It allows nfstream to perform [**reliable**][reliable] identification of encrypted applications and metadata extraction (e.g. TLS, QUIC, TOR, HTTP, SSH, DNS).
* **Flexibility:** add a flow feature in 2 lines as an [**NFPlugin**][nfplugin].
* **Machine Learning oriented:** add your trained model as an [**NFPlugin**][nfplugin].

In this notebook, we demonstrate a subset of features provided by [**nfstream**][repo].

[documentation]: https://nfstream.github.io/
[ndpi]: https://github.com/ntop/nDPI
[nfplugin]: https://nfstream.github.io/docs/api#nfplugin
[reliable]: http://people.ac.upc.edu/pbarlet/papers/ground-truth.pam2014.pdf
[repo]: https://nfstream.github.io/

In [None]:
from nfstream import NFStreamer, NFPlugin
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

## Flow aggregation made simple

In the following, we use the main object provided by nfstream, `NFStreamer`, which has the following parameters:

* `source` [default= `None` ]: Source of packets. Possible values: `live_interface_name` or `pcap_file_path`.
* `snaplen` [default= `65535` ]: Packet capture length.
* `idle_timeout` [default= `30` ]: Flows that are inactive for more than this value in seconds will be exported.
* `active_timeout` [default= `300` ]: Flows that are active for more than this value in seconds will be exported.
* `plugins` [default= `()` ]: Set of user defined NFPlugins.
* `dissect` [default= `True` ]: Enable nDPI deep packet inspection library for Layer 7 visibility.
* `max_tcp_dissections` [default= `80` ]: Maximum per flow TCP packets to dissect (ignored when dissect=False).
* `max_udp_dissections` [default= `16` ]: Maximum per flow UDP packets to dissect (ignored when dissect=False).
* `statistics` [default= `False` ]: Enable statistical flow features extraction.
* `account_ip_padding_size` [default= `False` ]: Enable Ethernet padding accounting when reporting IP sizes.
* `enable_guess` [default= `True` ]: Enable/Disable identification engine port guess heuristic.
* `decode_tunnels` [default= `True` ]: Enable/Disable GTP/TZSP tunnels dissection.
* `bpf_filter` [default= `None` ]: Specify a BPF filter for filtering selected traffic.
* `promisc` [default= `True` ]: Enable/Disable promiscuous capture mode.
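
To make the two timeout parameters more concrete, here is a minimal sketch of how an idle and an active timeout could decide when a flow is exported. It is not nfstream's actual implementation: the function name `expiry_reason`, its arguments, and the choice to measure the active duration from the first packet to the current time are assumptions made for illustration. Only the default values (30 and 300 seconds) come from the parameter list above.

```python
def expiry_reason(first_seen, last_seen, now, idle_timeout=30, active_timeout=300):
    """Return why a flow would be exported, or None if it stays in the cache.

    All times are in seconds. Hypothetical helper, not part of nfstream.
    """
    # No packet seen for more than idle_timeout seconds: the flow is inactive.
    if now - last_seen > idle_timeout:
        return "idle"
    # The flow has been running for more than active_timeout seconds.
    if now - first_seen > active_timeout:
        return "active"
    return None


# A flow that started at t=0 and last saw a packet at t=10.
still_cached = expiry_reason(first_seen=0, last_seen=10, now=20)
idle_expired = expiry_reason(first_seen=0, last_seen=10, now=41)
# A long-lived flow that is still receiving packets.
active_expired = expiry_reason(first_seen=0, last_seen=295, now=301)
```

Under these assumptions, a long-lived connection is exported in slices of at most `active_timeout` seconds, while a short exchange is exported once it has been silent for `idle_timeout` seconds.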

`NFStreamer` returns a flow iterator. We can iterate over the flows or convert them directly to a pandas DataFrame using the `to_pandas()` method.
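
Since the streamer is an iterator, flows can also be consumed one by one without building a DataFrame. The sketch below counts flows per destination port. The helper `count_flows_by_dst_port` is hypothetical, and it assumes each flow exposes a `dst_port` attribute matching the `dst_port` column used later in this notebook. The flows here are synthetic stand-ins so the example runs without a capture file.

```python
from collections import Counter
from types import SimpleNamespace


def count_flows_by_dst_port(flows):
    """Count flows per destination port from any iterable of flow-like objects."""
    counts = Counter()
    for flow in flows:
        counts[flow.dst_port] += 1
    return counts


# Synthetic stand-ins for flows (not real capture data).
fake_flows = [
    SimpleNamespace(dst_port=443),
    SimpleNamespace(dst_port=443),
    SimpleNamespace(dst_port=53),
]
port_counts = count_flows_by_dst_port(fake_flows)

# With nfstream installed, the same function would take the streamer itself:
# count_flows_by_dst_port(NFStreamer(source="pcaps/instagram.pcap"))
```

Iterating this way keeps only one flow in memory at a time, which may be preferable for large captures or live interfaces.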

In [None]:
df = NFStreamer(source="pcaps/instagram.pcap").to_pandas()

In [None]:
df.head()

We can enable statistical flow features extraction as follows:

In [None]:
df = NFStreamer(source="pcaps/instagram.pcap", statistics=True).to_pandas()

In [None]:
df.head()

We can enable IP anonymization as follows:

In [None]:
df = NFStreamer(source="pcaps/instagram.pcap", statistics=True).to_pandas(ip_anonymization=True)

In [None]:
df.head()

Now that we have our DataFrame, we can analyze it like any other tabular data. For example, we can compute additional features:

* Compute the data ratio in both directions (src2dst and dst2src):

In [None]:
df["src2dst_raw_bytes_data_ratio"] = df['src2dst_raw_bytes'] / df['bidirectional_raw_bytes']
df["dst2src_raw_bytes_data_ratio"] = df['dst2src_raw_bytes'] / df['bidirectional_raw_bytes']

In [None]:
df.head()

* Filter data according to some criteria:

In [None]:
df[df["dst_port"] == 443].head()

## Extend nfstream

In some use cases, we need to add features that are computed at packet level. nfstream handles such scenarios using [**NFPlugin**][nfplugin].

[nfplugin]: https://nfstream.github.io/docs/api#nfplugin

* Suppose that we want a per-flow counter of bidirectional packets whose IP size is exactly 40.

In [None]:
class packet_with_40_ip_size(NFPlugin):
    def on_init(self, pkt): # flow creation with the first packet
        if pkt.ip_size == 40:
            return 1
        else:
            return 0
        
    def on_update(self, pkt, flow): # flow update with each packet belonging to the flow
        if pkt.ip_size == 40:
            flow.packet_with_40_ip_size += 1

In [None]:
df = NFStreamer(source="pcaps/google_ssl.pcap", plugins=[packet_with_40_ip_size()]).to_pandas()

In [None]:
df.head()

Our DataFrame now has a new column named `packet_with_40_ip_size`.
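
To see how the two callbacks cooperate, the following sketch replays a handful of synthetic packets through the same plugin logic. `ToyPlugin` and `replay` are hypothetical stand-ins, not nfstream classes. The sketch assumes that `on_init` is called once with the first packet of a flow, that `on_update` is called for every later packet, and that the feature is stored on the flow under the plugin's class name.

```python
from types import SimpleNamespace


class ToyPlugin:
    """Hypothetical stand-in for NFPlugin, for illustration only."""

    @property
    def name(self):
        # Assumption: the feature name is the plugin class name.
        return type(self).__name__


class packet_with_40_ip_size(ToyPlugin):
    def on_init(self, pkt):  # first packet of the flow
        return 1 if pkt.ip_size == 40 else 0

    def on_update(self, pkt, flow):  # every following packet
        if pkt.ip_size == 40:
            flow.packet_with_40_ip_size += 1


def replay(packets, plugins):
    """Feed packets of a single flow through the plugins and return the flow."""
    flow = None
    for pkt in packets:
        if flow is None:
            flow = SimpleNamespace()
            for plugin in plugins:
                setattr(flow, plugin.name, plugin.on_init(pkt))
        else:
            for plugin in plugins:
                plugin.on_update(pkt, flow)
    return flow


# Synthetic packets: two of the four have an IP size of 40.
packets = [SimpleNamespace(ip_size=size) for size in (40, 1500, 40, 52)]
flow = replay(packets, [packet_with_40_ip_size()])
```

In this toy run the first packet is counted by `on_init` and the third by `on_update`, which is why the initial value returned by `on_init` has to account for the first packet itself.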

In some cases, we need volatile features, that is, intermediate values that other plugins rely on but that we do not want in the output.
Consider the following use case:

* We want to compute the maximum packet inter-arrival time per flow.
* Our feature will be based on `iat`, which we do not want as a feature.

Note that this feature is already implemented within the nfstream statistical features.

In [None]:
class iat(NFPlugin):
    def on_init(self, pkt):
        return [-1, pkt.time] # [iat value, last packet timestamp]
    def on_update(self, pkt, flow):
        flow.iat = [pkt.time - flow.iat[1], pkt.time]

class maximum_iat_ms(NFPlugin):
    def on_init(self, pkt):
        return -1 # we will set it as -1 as init value
    def on_update(self, pkt, flow):
        if flow.iat[0] > flow.maximum_iat_ms:
            flow.maximum_iat_ms = flow.iat[0]

In [None]:
df = NFStreamer(source="pcaps/instagram.pcap", plugins=[iat(volatile=True), maximum_iat_ms()]).to_pandas()

In [None]:
df.head()

Our DataFrame now has a new column named `maximum_iat_ms` containing the maximum observed packet
inter-arrival time per flow, set to -1 when the flow contains only one packet.
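
The interplay between the two plugins can be traced by hand with a small sketch. The function `max_inter_arrival` is hypothetical and simply mirrors the `on_init` and `on_update` logic of `iat` and `maximum_iat_ms` on a non-empty list of synthetic timestamps. It assumes the plugins run in the order they are listed, so that `iat` is refreshed before `maximum_iat_ms` reads it.

```python
def max_inter_arrival(timestamps):
    """Replay the iat / maximum_iat_ms logic over one flow's packet timestamps."""
    iat = [-1, timestamps[0]]  # mirrors iat.on_init: [iat value, last timestamp]
    maximum = -1               # mirrors maximum_iat_ms.on_init
    for t in timestamps[1:]:
        iat = [t - iat[1], t]  # iat.on_update runs first
        if iat[0] > maximum:   # then maximum_iat_ms.on_update
            maximum = iat[0]
    return maximum


# Synthetic timestamps; gaps between packets are 10, 240 and 50.
several_packets = max_inter_arrival([1000, 1010, 1250, 1300])
single_packet = max_inter_arrival([1000])
```

If the order were reversed, `maximum_iat_ms` would compare against the inter-arrival time of the previous packet and could miss the last gap of the flow.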