# Graphing network packets

## Preparing data

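The data comes from a publicly available network forensics repository: http://www.netresec.com/?page=PcapFiles. The file used here is https://download.netresec.com/pcap/maccdc-2012/maccdc2012_00000.pcap.gz.

After decompressing the archive, convert the capture to text with `tcpdump` (`-n` keeps addresses numeric and `-q` prints less protocol information per line):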

```
tcpdump -qns 0 -r maccdc2012_00000.pcap > maccdc2012_00000.txt
```

Here is a sample of the resulting output:

```
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.27.100.37877 > 192.168.204.45.41936: tcp 0
09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
```

Network traffic is directional and each node uses many ports, so we simplify the graph by treating traffic between two nodes as undirected and ignoring the distinction between ports. Each edge is weighted by the total number of bytes exchanged between its two nodes, in either direction.

Next, convert the text dump into graph data:

```
python pcap_to_parquet.py maccdc2012_00000.txt
```

The output is two Parquet files: `maccdc2012_nodes.parq` (nodes) and `maccdc2012_edges.parq` (edges).
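
The conversion script is not shown here, so the following is only a minimal sketch of the aggregation described above, not the actual `pcap_to_parquet.py`. It assumes every line has the same shape as the sample output (two `address.port` endpoints, a protocol name, and a byte count) and skips anything else. The names `LINE_RE` and `aggregate_edges` are hypothetical, and keeping one total per protocol is an assumption based on the `protocol` column used later in this notebook.

```python
import re
from collections import defaultdict

# Matches lines shaped like the sample, for example:
# 09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380
LINE_RE = re.compile(
    r"^\S+ IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.\d+ > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.\d+: "
    r"(?P<protocol>\w+) (?P<size>\d+)"
)

def aggregate_edges(lines):
    """Sum bytes per (protocol, unordered host pair), ignoring ports."""
    weights = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not fit this simple pattern
        # Sorting the two hosts makes A -> B and B -> A share one key
        a, b = sorted((match["src"], match["dst"]))
        weights[(match["protocol"], a, b)] += int(match["size"])
    return dict(weights)

sample = [
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380",
    "09:30:07.780000 IP 192.168.24.100.1038 > 192.168.202.68.8080: tcp 0",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380",
    "09:30:07.780000 IP 192.168.202.68.8080 > 192.168.24.100.1038: tcp 1380",
    "09:30:07.780000 IP 192.168.27.100.37877 > 192.168.204.45.41936: tcp 0",
]
edges = aggregate_edges(sample)
print(edges[("tcp", "192.168.202.68", "192.168.24.100")])  # 4140
```

The first four sample lines collapse into a single edge of 3 × 1380 = 4140 bytes, because both directions and all ports map to the same host pair.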

## Loading data

In [None]:
import datashader as ds
import datashader.transfer_functions as tf

from bokeh.palettes import Blues9, Greens9, Oranges9
from colorcet import fire
from datashader.bundling import hammer_bundle
from datashader.layout import circular_layout, forceatlas2_layout, random_layout

from dask.distributed import Client
from fastparquet import ParquetFile

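# Start a local Dask cluster; hammer_bundle schedules its work through Dask
client = Client()

# Canvas size in pixels, with ranges that leave a small margin around the unit square
width, height = 2000, 2000
x_range = (-0.01, 1.01)
y_range = (-0.01, 1.01)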

In [None]:
nodes_df = ParquetFile('data/maccdc2012_full_nodes.parq').to_pandas()
len(nodes_df)

In [None]:
edges_df = ParquetFile('data/maccdc2012_full_edges.parq').to_pandas()
edges_df.head()

## Edge bundling

In [None]:
def create_image(bundle, aggregator, nodes, edges, cmap=fire):
    """Bundle the edges, rasterize the result, and shade it on a black background."""
    bundled_df = bundle(nodes, edges)
    img = tf.shade(aggregator(bundled_df, 'x', 'y'), cmap=cmap)
    return tf.set_background(img, color='black')

In [None]:
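def assign_positions(nodes, edges, layout):
    """Lay out the nodes using the edges of all protocols combined."""
    # Drop the protocol column, then remove rows that are now identical
    bare_edges = edges.drop(columns='protocol').drop_duplicates()
    return layout(nodes, bare_edges)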
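
One subtlety of `assign_positions` is worth a toy example. After the `protocol` column is dropped, `drop_duplicates` only removes rows that match in every remaining column, so the same node pair can survive more than once if its weights differ. The sketch below uses made-up values and assumes the edge table has `source`, `target`, `protocol`, and `weight` columns, as the other cells suggest. The `subset` variant is shown only for comparison and is not what the notebook does.

```python
import pandas as pd

# Toy edge table (made-up values): the 0-1 pair appears once per protocol
edges = pd.DataFrame({
    "source":   [0, 0, 0, 1],
    "target":   [1, 1, 1, 2],
    "protocol": ["tcp", "udp", "icmp", "tcp"],
    "weight":   [100, 100, 50, 10],
})

# Same steps as assign_positions: drop the protocol, then drop identical rows
bare_edges = edges.drop(columns="protocol").drop_duplicates()

# Stricter alternative: keep exactly one row per (source, target) pair
unique_pairs = edges.drop_duplicates(subset=["source", "target"])[["source", "target"]]

print(len(bare_edges))    # 3: the 0-1 pair remains twice (weights 100 and 50)
print(len(unique_pairs))  # 2
```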

In [None]:
config = {
    'tcp': {'colormap': Blues9},
    'udp': {'colormap': Greens9},
    'icmp': {'colormap': Oranges9}
}
for protocol in config.keys():
    config[protocol]['edges'] = edges_df[edges_df['protocol'] == protocol]

In [None]:
cvs = ds.Canvas(width, height, x_range, y_range)
nodes_by_layout = {name: assign_positions(nodes_df, edges_df, layout)
                   for name, layout in [('random', random_layout),
                                        ('circular', circular_layout),
                                        ('forceatlas2', forceatlas2_layout)]}

In [None]:
images = {(protocol, layout): create_image(hammer_bundle,
                                           cvs.points,
                                           nodes=nodes_by_layout[layout],
                                           edges=config[protocol]['edges'],
                                           cmap=config[protocol]['colormap'])
          for protocol in config.keys()
          for layout in nodes_by_layout.keys()}

In [None]:
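# Composite the per-protocol images; adding the packed RGBA integers directly would mix channels
tf.stack(*[images[(protocol, 'random')] for protocol in config.keys()], how='add')

In [None]:
tf.stack(*[images[(protocol, 'circular')] for protocol in config.keys()], how='add')

In [None]:
tf.stack(*[images[(protocol, 'forceatlas2')] for protocol in config.keys()], how='add')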

## Nodes with active traffic

In [None]:
def nodes_by_protocol(protocol, nodes, aggregator, min_weight=0):
    """Shade in red the nodes whose `protocol` edges sum to at least `min_weight` bytes."""
    # Sum edge weights per 'source' node; 'source' becomes the index of the result
    grouped_edges_df = config[protocol]['edges'].groupby('source')[['weight']].sum()
    active_edges_df = grouped_edges_df[grouped_edges_df['weight'] >= min_weight]

    # Keep only the index (node ids) and attach positions by joining on the nodes index
    active_nodes_df = active_edges_df.drop(columns='weight').join(nodes)

    agg = aggregator(active_nodes_df, 'x', 'y')
    return tf.spread(tf.shade(agg, cmap='red'), px=3)

In [None]:
def create_graph_for_protocol_and_layout(protocol, layout, min_weight):
    """Overlay the active-node markers on the bundled-edge image."""
    nodes_img = nodes_by_protocol(protocol, nodes_by_layout[layout], cvs.points, min_weight)
    return tf.stack(images[(protocol, layout)], nodes_img)
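
To make the filtering in `nodes_by_protocol` concrete, here is a small sketch with made-up numbers that applies the same group-sum-threshold steps to a toy edge table. The column names mirror the ones used above; the values are purely illustrative.

```python
import pandas as pd

# Toy edges (made-up byte counts)
edges = pd.DataFrame({
    "source": [0, 0, 1, 2],
    "target": [1, 2, 2, 3],
    "weight": [600_000, 500_000, 2_000_000, 10],
})
min_weight = 1024 * 1024  # 1,048,576 bytes

# Total bytes per 'source' node, then keep the nodes at or above the threshold
totals = edges.groupby("source")["weight"].sum()
active_sources = totals[totals >= min_weight].index.tolist()
print(active_sources)  # [0, 1]
```

Node 0 passes only because its two edges add up to 1,100,000 bytes. Node 2 appears as the `target` of 2,500,000 bytes but is not selected, since the grouping looks at the `source` column alone.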

### Nodes with at least 1 MB of TCP traffic

In [None]:
create_graph_for_protocol_and_layout('tcp', 'random', 1024*1024)

In [None]:
create_graph_for_protocol_and_layout('tcp', 'circular', 1024*1024)

In [None]:
create_graph_for_protocol_and_layout('tcp', 'forceatlas2', 1024*1024)

### Nodes with at least 16 KB of UDP traffic

In [None]:
create_graph_for_protocol_and_layout('udp', 'random', 16*1024)

In [None]:
create_graph_for_protocol_and_layout('udp', 'circular', 16*1024)

In [None]:
create_graph_for_protocol_and_layout('udp', 'forceatlas2', 16*1024)

### Nodes with at least 1 KB of ICMP traffic

In [None]:
create_graph_for_protocol_and_layout('icmp', 'random', 1024)

In [None]:
create_graph_for_protocol_and_layout('icmp', 'circular', 1024)

In [None]:
create_graph_for_protocol_and_layout('icmp', 'forceatlas2', 1024)