# Visual GPU Log Analytics Part II: GPU dataframes with RAPIDS Python cudf bindings

Graphistry is great; Graphistry with RAPIDS/BlazingDB is better!

This tutorial series visually analyzes Zeek/Bro network connection logs using different compute engines:

* Part I: [CPU Baseline in Python Pandas](./part_i_cpu_pandas.ipynb)
* Part II: [GPU Dataframe with RAPIDS Python cudf bindings](./part_ii_gpu_cudf)


**Part II Contents:**

Time a full ETL and visual analysis flow using the GPU-based RAPIDS Python cudf bindings and Graphistry:

1. Load data
2. Analyze data
3. Visualize data
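
The cells below use the `%%time` cell magic, which only works inside Jupyter/IPython. If you run the same steps as a plain script, a minimal stand-in could look like the sketch below. The `timed` helper is a hypothetical name for illustration and is not part of cudf or Graphistry; it reports wall-clock time only, whereas `%%time` also reports CPU time.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Hypothetical helper: print (and optionally record) wall-clock time for a block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")

# Toy workload standing in for a load / analyze / visualize step
timings = {}
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))
```

Wrapping each of the three steps in its own `with timed(...)` block would give per-step timings comparable to the notebook's `%%time` output.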

_**TIP**_: If you get out-of-memory errors, you usually need to restart the kernel and refresh the page.



In [None]:
#!pip install graphistry -q

import pandas as pd
import cudf

import graphistry
#graphistry.register(key='MY_KEY', protocol='https', server='graphistry.site.com')
graphistry.__version__

## 1. Load data

In [1]:
%%time
!curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log
  
!head -n 3 conn.log

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  523M  100  523M    0     0   560k      0  0:15:56  0:15:56 --:--:--  680k


In [None]:
# OPTIONAL: For slow devices, work on a subset
#!awk 'NR % 20 == 0' < conn.log > conn-5pc.log
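
The commented-out `awk` command keeps every 20th line, which is roughly 5% of the log. If `awk` is not available, the same sampling can be sketched in pure Python. The `sample_every_nth` function and the toy file names are hypothetical and used only for this illustration; the demo runs on a small temporary file rather than the real `conn.log`.

```python
import os
import tempfile

def sample_every_nth(src_path, dst_path, n=20):
    """Copy every n-th line (1-based, like awk's NR % n == 0) from src to dst."""
    kept = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line_number, line in enumerate(src, start=1):
            if line_number % n == 0:
                dst.write(line)
                kept += 1
    return kept

# Demo on a 100-line temporary file
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "toy.log")
    dst_file = os.path.join(tmp, "toy-5pc.log")
    with open(src_file, "w") as f:
        f.writelines(f"row {i}\n" for i in range(1, 101))
    kept_count = sample_every_nth(src_file, dst_file, n=20)
    with open(dst_file) as f:
        sampled = f.read().splitlines()
```

Because the function streams line by line, it should not need to hold the whole log in memory.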

In [None]:
%%time
# Reads the full log; if you created the optional subset above, use "./conn-5pc.log" instead
cdf = cudf.read_csv("./conn.log", sep="\t", header=None,
                   names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"], 
                   na_values=['-'], index_col=False)
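
To see what these `read_csv` options do to a single record, here is a CPU-only sketch using the standard-library `csv` module. The sample line is made up for illustration and is not taken from the dataset; `parse_conn_log` is a hypothetical helper. It splits on tabs, applies the same column names, and maps `-` to `None`, mirroring `na_values=['-']`.

```python
import csv
import io

COLUMNS = ["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
           "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
           "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"]

# Made-up record in conn.log layout (tab-separated, '-' for missing values)
sample = "\t".join([
    "1331901000.000000", "Cabc123", "192.168.0.2", "1234", "192.168.0.1", "80", "tcp", "-",
    "0.25", "100", "200", "SF", "-", "0", "ShADadfF", "5", "300", "4", "360", "(empty)",
]) + "\n"

def parse_conn_log(text):
    """Hypothetical helper: parse tab-separated conn.log text into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Treat '-' as missing, as na_values=['-'] does in read_csv
        rows.append({name: (None if value == "-" else value)
                     for name, value in zip(COLUMNS, fields)})
    return rows

records = parse_conn_log(sample)
```

Unlike `read_csv`, this sketch leaves every value as a string and does no type inference.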

In [None]:
cdf.head(3)

## 2. Analyze Data

Summarize network activity between every communicating src/dst IP pair, split by connection state.

In [15]:
%%time
cdf_summary = (
    cdf
    .groupby(['id.orig_h', 'id.resp_h', 'conn_state'])
    .agg({
        'time': ['min', 'max', 'count'],
        'duration': ['min', 'max', 'mean'],
        'orig_bytes': ['min', 'max', 'sum', 'mean'],
        'resp_bytes': ['min', 'max', 'sum', 'mean'],
    })
)
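
As a toy illustration of what one row of this summary contains, the same grouping can be written in pure Python over a few made-up records. This only shows the semantics of the group-by (one entry per src IP, dst IP, and connection state); it is not how cudf computes it, and it covers only a subset of the aggregates above.

```python
from collections import defaultdict

# Made-up connection records: (time, id.orig_h, id.resp_h, conn_state, orig_bytes)
rows = [
    (1.0, "10.0.0.1", "10.0.0.2", "SF", 100),
    (2.0, "10.0.0.1", "10.0.0.2", "SF", 300),
    (3.0, "10.0.0.1", "10.0.0.2", "REJ", 0),
    (4.0, "10.0.0.3", "10.0.0.2", "SF", 50),
]

# Group by (src IP, dst IP, connection state)
groups = defaultdict(list)
for ts, src, dst, state, orig_bytes in rows:
    groups[(src, dst, state)].append((ts, orig_bytes))

# Aggregate each group
summary = {}
for key, values in groups.items():
    times = [t for t, _ in values]
    sizes = [b for _, b in values]
    summary[key] = {
        "time_min": min(times),
        "time_max": max(times),
        "time_count": len(times),
        "orig_bytes_sum": sum(sizes),
        "orig_bytes_mean": sum(sizes) / len(sizes),
    }
```

Here the two `SF` sessions from `10.0.0.1` to `10.0.0.2` collapse into a single summary entry, while the `REJ` session between the same pair stays separate.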

In [None]:
cdf_summary.head(3)

## 3. Visualize data

* Nodes: 
  * IPs
  * Bigger when more sessions (split by connection state) involving them
* Edges:
  * src_ip -> dest_ip, split by connection state
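
The encoding above can be sketched on made-up summary rows: each distinct IP becomes a node, each (src, dst, connection state) row becomes an edge, and a node's weight counts the summary rows that involve it. This illustrates the mapping only; it is not Graphistry's internal implementation, and the field names are assumptions chosen for the example.

```python
from collections import Counter

# Made-up summary rows: (src_ip, dst_ip, conn_state, session_count)
summary_rows = [
    ("10.0.0.1", "10.0.0.2", "SF", 2),
    ("10.0.0.1", "10.0.0.2", "REJ", 1),
    ("10.0.0.3", "10.0.0.2", "SF", 1),
]

# One edge per (src, dst, conn_state) summary row
edges = [{"src": s, "dst": d, "conn_state": st, "sessions": n}
         for s, d, st, n in summary_rows]

# Node weight: number of summary rows (sessions split by state) touching the IP
degree = Counter()
for s, d, _, _ in summary_rows:
    degree[s] += 1
    degree[d] += 1

nodes = [{"ip": ip, "degree": degree[ip]} for ip in sorted(degree)]
```

In this toy data `10.0.0.2` is involved in all three summary rows, so it would be drawn as the largest node.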

In [None]:
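# Move the group keys back into columns and flatten the (column, aggregate) MultiIndex
df_summary = cdf_summary.to_pandas().reset_index()
df_summary.columns = ['_'.join(c).rstrip('_') if isinstance(c, tuple) else c for c in df_summary.columns]

# Entity columns must exist in the dataframe; direct=True links src IP -> dst IP directly
hg = graphistry.hypergraph(df_summary, entity_types=['id.orig_h', 'id.resp_h'], direct=True)
hg['graph'].plot()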

## Next Steps

* Part I: [CPU Baseline in Python Pandas](./part_i_cpu_pandas.ipynb)
* Part II: [GPU Dataframe with RAPIDS Python cudf bindings](./part_ii_gpu_cudf)