# Visual GPU Log Analytics Part I: CPU Baseline in Python Pandas

Graphistry is great -- Graphistry with RAPIDS/BlazingDB is better!

This tutorial series visually analyzes Zeek/Bro network connection logs using three different compute engines:

* Part I: CPU Baseline in Python Pandas
* Part II: GPU Dataframe with RAPIDS Python bindings
* Part III: GPU Dataframe with BlazingDB SQL interface to RAPIDS


**Part I Contents:**

Time a full ETL and visual analysis flow using CPU-based Python Pandas and Graphistry:

1. Load data
2. Analyze data
3. Visualize data
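
The cells in this notebook use the `%%time` magic, which reports both CPU time and wall time for each step. As a rough sketch of how the same two measurements could be taken outside Jupyter, the snippet below uses only the standard library. The `timed` helper and the toy workload are hypothetical and are not part of the original notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record wall-clock and CPU seconds for the enclosed block into `results`."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    try:
        yield
    finally:
        results[label] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }

# Toy stand-ins for the load / analyze steps (hypothetical workload).
timings = {}
with timed("load", timings):
    data = list(range(100_000))
with timed("analyze", timings):
    total = sum(data)

for step, t in timings.items():
    print(f"{step}: wall {t['wall_s']:.4f}s, cpu {t['cpu_s']:.4f}s")
```

A large gap between wall time and CPU time, as in the download cell, usually means the step spent most of its time waiting on I/O rather than computing.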



In [0]:
#!pip install graphistry -q


import pandas as pd

import graphistry
#graphistry.register(key='MY_KEY', protocol='https', server='graphistry.site.com')
graphistry.__version__

'0.9.64'

## 1. Load data

In [0]:
%%time
!curl https://www.secrepo.com/maccdc2012/conn.log.gz | gzip -d > conn.log
  
!head -n 3 conn.log

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  523M  100  523M    0     0  10.4M      0  0:00:50  0:00:50 --:--:-- 4361k
1331901000.000000	CCUIP21wTjqkj8ZqX5	192.168.202.79	50463	192.168.229.251	80	tcp	-	-	-	-	SH	-	0	Fa	1	52	1	52	(empty)
1331901000.000000	Csssjd3tX0yOTPDpng	192.168.202.79	46117	192.168.229.254	443	tcp	-	-	-	-	SF	-	0	dDafFr	3	382	9	994	(empty)
1331901000.000000	CHEt7z3AzG4gyCNgci	192.168.202.79	50465	192.168.229.251	80	tcp	http	0.010000	166	214	SF	-	0	ShADfFa	4	382	3	382	(empty)
CPU times: user 429 ms, sys: 83.2 ms, total: 512 ms
Wall time: 58.3 s


In [0]:
!awk 'NR % 20 == 0' < conn.log > conn-5pc.log
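
The `awk` command above keeps every 20th line, which gives the 5% sample used in the rest of the notebook. For environments without `awk`, a minimal Python sketch of the same selection rule follows. The `sample_every_nth` helper and the demo file names are hypothetical, and the demo runs on a throwaway 100-line file rather than on `conn.log`.

```python
import os
import tempfile

def sample_every_nth(src_path, dst_path, n=20):
    """Copy every n-th line (1-indexed, like awk's NR % n == 0) from src to dst."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line_number, line in enumerate(src, start=1):
            if line_number % n == 0:
                dst.write(line)
                kept += 1
    return kept

# Demo on a small generated file (hypothetical names, not the real log).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "demo.log")
dst_path = os.path.join(tmp_dir, "demo-5pc.log")
with open(src_path, "w") as f:
    f.writelines(f"line {i}\n" for i in range(1, 101))

kept = sample_every_nth(src_path, dst_path, n=20)
with open(dst_path) as f:
    sampled = f.read().splitlines()
print(kept, sampled)
```

Because the selection is deterministic, rerunning it yields the same sample, which keeps timings comparable across runs.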

In [0]:
%%time
df = pd.read_csv("./conn-5pc.log", sep="\t", header=None, 
                 names=["time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
                        "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
                        "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents"], 
                 na_values=['-'], index_col=False)

CPU times: user 3.87 s, sys: 107 ms, total: 3.98 s
Wall time: 3.98 s
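
To see what the `read_csv` options above do, here is a small sketch that parses two of the sample rows printed earlier from an in-memory string instead of the file. It assumes the same column names and `na_values=['-']` setting; the `ZEEK_CONN_COLUMNS`, `rows`, and `sample` names are introduced only for this illustration.

```python
import io
import pandas as pd

ZEEK_CONN_COLUMNS = [
    "time", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "proto", "service",
    "duration", "orig_bytes", "resp_bytes", "conn_state", "local_orig", "missed_bytes",
    "history", "orig_pkts", "orig_ip_bytes", "resp_pkts", "resp_ip_bytes", "tunnel_parents",
]

# Two tab-separated records copied from the `head` output above.
rows = [
    ["1331901000.000000", "CCUIP21wTjqkj8ZqX5", "192.168.202.79", "50463",
     "192.168.229.251", "80", "tcp", "-", "-", "-", "-", "SH", "-", "0",
     "Fa", "1", "52", "1", "52", "(empty)"],
    ["1331901000.000000", "CHEt7z3AzG4gyCNgci", "192.168.202.79", "50465",
     "192.168.229.251", "80", "tcp", "http", "0.010000", "166", "214", "SF", "-", "0",
     "ShADfFa", "4", "382", "3", "382", "(empty)"],
]
raw = "\n".join("\t".join(r) for r in rows) + "\n"

sample = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                     names=ZEEK_CONN_COLUMNS, na_values=["-"], index_col=False)

# '-' placeholders become NaN, so numeric columns parse as floats.
print(sample[["service", "duration", "orig_bytes", "resp_bytes"]])
print(sample.dtypes[["duration", "orig_bytes"]])
```

Treating `-` as missing is what lets columns such as `duration` and `orig_bytes` load as numeric types, which the aggregation step relies on.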


In [0]:
df.sample(3)

Unnamed: 0,time,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,resp_bytes,conn_state,local_orig,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
325356,1331909000.0,Cbmxx3GZgu3UTY1z5,192.168.202.110,54276,192.168.22.25,16632,tcp,,,,,S0,,0,S,1,48,0,0,(empty)
393047,1331919000.0,Cuefli1GJCgCJFRCWc,192.168.202.110,51711,192.168.229.156,24452,tcp,,,,,REJ,,0,Sr,1,48,1,40,(empty)
1118412,1332015000.0,CQGrBY1wejAKTunwy3,192.168.202.136,38980,192.168.21.152,3659,tcp,,,,,S0,,0,S,1,44,0,0,(empty)


## 2. Analyze Data

Summarize network activity between each pair of communicating src/dst IPs, split by connection state.

In [0]:
%%time
df_summary = df\
.assign(
    sum_bytes=df.apply(lambda row: row['orig_bytes'] + row['resp_bytes'], axis=1))\
.groupby(['id.orig_h', 'id.resp_h', 'conn_state'])\
.agg({
    'time': ['min', 'max', 'size'],
    'id.resp_p':  ['nunique'],
    'uid': ['nunique'],
    'duration':   ['min', 'max', 'mean'],
    'orig_bytes': ['min', 'max', 'sum', 'mean'],
    'resp_bytes': ['min', 'max', 'sum', 'mean'],
    'sum_bytes':  ['min', 'max', 'sum', 'mean']
}).reset_index()

df_summary.columns = [' '.join(col).strip() for col in df_summary.columns.values]
df_summary = df_summary\
.rename(columns={'time size': 'count'})\
.assign(
    conn_state_uid=df_summary.apply(lambda row: row['id.orig_h'] + '_' + row['id.resp_h'] + '_' + row['conn_state'], axis=1))

CPU times: user 33.2 s, sys: 188 ms, total: 33.4 s
Wall time: 33.4 s
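
Much of the time in the cell above likely goes to the two row-wise `apply(..., axis=1)` calls, which invoke a Python lambda once per row. As a hedged sketch, not a change to the measured baseline, the same two derived columns can be computed with column-wise operations. The `toy` frame below is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up records using the notebook's column names.
toy = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "conn_state": ["SF", "SF", "S0"],
    "orig_bytes": [100.0, 50.0, np.nan],
    "resp_bytes": [200.0, 25.0, np.nan],
})

# Row-wise form, as in the notebook: one Python call per row.
row_wise = toy.apply(lambda row: row["orig_bytes"] + row["resp_bytes"], axis=1)

# Column-wise form: a single vectorized addition; NaN propagates the same way.
vectorized = toy["orig_bytes"] + toy["resp_bytes"]

# String columns concatenate element-wise too.
conn_state_uid = toy["id.orig_h"] + "_" + toy["id.resp_h"] + "_" + toy["conn_state"]

print(vectorized.tolist())
print(conn_state_uid.tolist())
```

The row-wise version is kept in this notebook because it is the recorded CPU baseline; swapping in the vectorized form would change the timings shown above.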


In [0]:
print('# rows', len(df_summary))
df_summary.sample(3)

# rows 29784


Unnamed: 0,id.orig_h,id.resp_h,conn_state,time min,time max,count,id.resp_p nunique,uid nunique,duration min,duration max,...,orig_bytes mean,resp_bytes min,resp_bytes max,resp_bytes sum,resp_bytes mean,sum_bytes min,sum_bytes max,sum_bytes sum,sum_bytes mean,conn_state_uid
7412,192.168.202.110,192.168.27.222,S0,1331903000.0,1331904000.0,4,4,4,,,...,,,,0.0,,,,0.0,,192.168.202.110_192.168.27.222_S0
1008,192.168.202.102,192.168.21.176,S0,1331903000.0,1331903000.0,1,1,1,,,...,,,,0.0,,,,0.0,,192.168.202.102_192.168.21.176_S0
22046,192.168.202.79,192.168.24.98,OTH,1331920000.0,1331920000.0,1,1,1,,,...,,,,0.0,,,,0.0,,192.168.202.79_192.168.24.98_OTH


## 3. Visualize data

* Nodes: 
  * IPs
  * Sized larger when more sessions (split by connection state) involve them
* Edges:
  * src_ip -> dest_ip, split by connection state

In [0]:
%%time
hg = graphistry.hypergraph(
    df_summary,
    ['id.orig_h', 'id.resp_h'],
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h']
        }
    })

# links 29784
# events 29784
# attrib entities 3703
CPU times: user 2.27 s, sys: 42.9 ms, total: 2.31 s
Wall time: 2.31 s


In [0]:
%%time
hg['graph'].plot()

CPU times: user 3.41 s, sys: 29 ms, total: 3.44 s
Wall time: 6.42 s
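
The node and edge encoding described in this section can be approximated in plain pandas to sanity-check the plot. The sketch below counts, for each IP, how many summary rows touch it and how many sessions those rows represent. It runs on a made-up `toy_summary` frame, and it is only an illustration: how Graphistry derives its default node sizes is not shown in this notebook.

```python
import pandas as pd

# Made-up summary rows: one per (src IP, dst IP, connection state).
toy_summary = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.8", "10.0.0.9"],
    "conn_state": ["SF", "S0", "SF"],
    "count": [3, 1, 2],
})

# Stack source and destination IPs into one 'ip' column, keeping the session count.
endpoints = toy_summary.melt(
    id_vars=["count"],
    value_vars=["id.orig_h", "id.resp_h"],
    value_name="ip",
)

# Sessions involving each IP, as either source or destination.
sessions_per_ip = endpoints.groupby("ip")["count"].sum().sort_values(ascending=False)

# Number of summary rows (edges) touching each IP.
edge_rows_per_ip = endpoints["ip"].value_counts()

print(sessions_per_ip)
print(edge_rows_per_ip)
```

IPs that rank high in either series should appear as the larger hubs in the visualization.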


## Next Steps

* Part I: CPU Baseline in Python Pandas
* Part II: GPU Dataframe with RAPIDS Python bindings
* Part III: GPU Dataframe with BlazingDB SQL interface to RAPIDS