# Threat Hunting Masterclass: Three data science notebooks for finding bad actors in your network logs

More info: https://www.graphistry.com/blog/zeek-masterclass

To get started, download `logs.tar` from https://data.world/graphistry/networkforensics and load it into your Splunk instance. Either name the index `corelight_tutorial` or replace the index name used in the queries below. From there, follow the cells below.

## Purpose

These tutorials cover multiple useful topic areas:

* **Hunts**: Sample queries & visualizations for looking at encrypted traffic, DNS tunneling, network shares & logins, & file obfuscation. 
* **Data types**: Network logs around TLS, DNS, NTLM, SMB, and more
* **Methodologies**: Data science notebooks, SIEM queries, visual graph analytics
* **Tools**: Jupyter, Splunk, Bro/Zeek/Corelight, and Graphistry

## Configure

* If you are using Graphistry Marketplace, leave `GRAPHISTRY` unedited; otherwise, uncomment and fill it in
* Fill in `SPLUNK`. Make sure the user has capabilities for REST API access and reading the index in which you put `logs.tar`

In [0]:
#graphistry
GRAPHISTRY = {
    'key': 'MY_KEY',
    'protocol': 'https',
    'server': 'beta.graphistry.com',
    'api': 2
}    

#splunk
SPLUNK = {
    'host': 'SPLUNK.MYSITE.COM',
    'scheme': 'https',
    'port': 8089,
    'username': 'corelight_tutorial',
    'password': 'MY_SPLUNK_PWD'   
}

## Imports

In [2]:
!pip install graphistry -q
!pip install splunk-sdk -q



In [0]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import sys
import numpy as np
import math
np.set_printoptions(threshold=sys.maxsize)

import re

import graphistry
graphistry.register(**GRAPHISTRY)

In [0]:
import splunklib
import splunklib.client as client
import splunklib.results as results

service = client.connect(**SPLUNK)

## Helpers

### General

In [0]:
def safe_log(v):
  # log(v + 1) for numeric input; 0 for NaN, non-numeric, or out-of-domain values
  try:
    v2 = float(v)
    return math.log(round(v2) + 1) if not np.isnan(v2) else 0
  except (TypeError, ValueError, OverflowError):
    return 0


# Add a log(<col>) column for every column whose name contains "bytes"
# Running this twice is safe (idempotent): existing log(...) columns are skipped
# Returns a copy (no mutation of the original)
def log_of_bytes(df):
  df2 = df.copy()
  for c in [c for c in df.columns if re.match(r'.*bytes.*', c) and not re.match(r'log\(.*', c)]:
    df2['log(' + c + ')'] = df[c].apply(safe_log)
  return df2

### Splunk
* Query Splunk, with optional args like `sample_ratio`
* Automatically paginate when the results are split over multiple responses
* Return a Pandas dataframe (Note: treats all columns as strings)

In [0]:
STEP = 50000

def splunkToPandas(qry, overrides=None):
    # Search settings; entries in `overrides` (e.g. sample_ratio) take precedence
    kwargs_blockingsearch = {
        "count": 0,
        "earliest_time": "2010-01-24T07:20:38.000-05:00",
        "latest_time": "now",
        "search_mode": "normal",
        "exec_mode": "blocking",
        **(overrides or {})}
    job = service.jobs.create(qry, **kwargs_blockingsearch)

    print("Search results:\n")
    resultCount = int(job["resultCount"])
    offset = 0

    print('results', resultCount)
    pages = []
    while offset < resultCount:
        print("fetching:", offset, '-', offset + STEP)
        kwargs_paginate = {**kwargs_blockingsearch,
                           "count": STEP,
                           "offset": offset}

        # Fetch one page of results and collect it as a dataframe
        blocksearch_results = job.results(**kwargs_paginate)
        reader = results.ResultsReader(blocksearch_results)
        pages.append(pd.DataFrame([x for x in reader]))
        offset += STEP

    # No results: return an empty dataframe instead of failing on None
    if not pages:
        return pd.DataFrame()
    out = pd.concat(pages, ignore_index=True)
    # Treat all columns as strings
    return out.astype(str)
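
The pagination in the helper above walks the result set in `STEP`-sized windows. As a minimal, Splunk-free sketch of that windowing (the helper name `page_windows` is hypothetical, and unlike the loop above it also trims the final window to the remaining count):

```python
def page_windows(total, step):
    """Yield (offset, count) pairs that cover `total` results in pages of at most `step`."""
    offset = 0
    while offset < total:
        # The last page may be shorter than `step`
        yield offset, min(step, total - offset)
        offset += step

# Hypothetical job with 120,000 results and the notebook's page size of 50,000
windows = list(page_windows(total=120000, step=50000))
print(windows)

# With zero results, no page is requested at all
empty = list(page_windows(total=0, step=50000))
print(empty)
```

The zero-result case is worth keeping in mind: the loop body never runs, so nothing is fetched.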

### Bro/Zeek

Useful bindings for hypergraphs

In [0]:
categories = {
    'ip': ['id.orig_h', 'id.resp_h']
}

opts={
    'CATEGORIES': categories 
}
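
To build intuition for what the `CATEGORIES` binding does, here is a rough pandas-only sketch. It assumes that hypergraph node IDs look like `<category>::<value>`, which is suggested by the `::nan` pattern handled later in this notebook, but it is not taken from the Graphistry source. The helper `node_ids` and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical mini log; column names follow the Zeek fields used in this notebook
df = pd.DataFrame({
    'id.orig_h': ['192.168.0.54', '192.168.0.53'],
    'id.resp_h': ['192.168.0.53', '10.0.0.1'],
})

categories = {'ip': ['id.orig_h', 'id.resp_h']}

# Invert the mapping: column name -> category name
col_to_category = {col: cat for cat, cols in categories.items() for col in cols}

def node_ids(df, cols, col_to_category):
    """Collect '<prefix>::<value>' IDs, where the prefix is the category if one is set, else the column name."""
    ids = set()
    for col in cols:
        prefix = col_to_category.get(col, col)
        ids.update(prefix + '::' + df[col].astype(str))
    return ids

cols = ['id.orig_h', 'id.resp_h']
with_categories = node_ids(df, cols, col_to_category)
without_categories = node_ids(df, cols, {})

print(len(with_categories), sorted(with_categories))
print(len(without_categories), sorted(without_categories))
```

With the shared `ip` category, `192.168.0.53` becomes a single node whether it appears as an originator or a responder; without it, the same address would show up as two separate nodes.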

### Graphistry

In [0]:
## Extend graphistry.plotter.Plotter to add chainable method "my_graph.color_points_by('some_column_name')..." (and "color_edges_by")

import graphistry.plotter

def color_col_by_categorical(df, type_col):
  types = list(df[type_col].unique())
  type_to_color = {t: i for (i, t) in enumerate(types)}
  return df[type_col].apply(lambda t: type_to_color[t])

def color_col_by_continuous(df, type_col):
  mn = df[type_col].astype(float).min()
  mx = df[type_col].astype(float).max()
  if mx - mn < 0.000001:
    print('warning: too small values for color_col_by_continuous')
    return color_col_by_categorical(df, type_col)
  else:
    print('coloring for range', mn, mx)
  return df[type_col].apply(lambda v: 228010 - round(10 * (float(v) - mn)/(mx - mn) ))
  

## g * str * 'categorical' | 'continuous' -> g
def color_points_by(g, type_col, kind='categorical'):
  fn = color_col_by_categorical if kind == 'categorical' else color_col_by_continuous
  colors = fn(g._nodes, type_col)
  return g.nodes( g._nodes.assign(point_color=colors) ).bind(point_color='point_color')

## g * str * 'categorical' | 'continuous' -> g
def color_edges_by(g, type_col, kind='categorical'):
  fn = color_col_by_categorical if kind == 'categorical' else color_col_by_continuous
  colors = fn(g._edges, type_col)
  return g.edges( g._edges.assign(edge_color=colors) ).bind(edge_color='edge_color')

graphistry.plotter.Plotter.color_points_by = color_points_by
graphistry.plotter.Plotter.color_edges_by = color_edges_by
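
As a small worked example of the continuous coloring arithmetic used in `color_col_by_continuous`: values are normalized to the range 0..1, scaled to 0..10, rounded, and subtracted from `228010`. The standalone function name below is hypothetical, and reading the resulting codes as entries of a Graphistry palette is an assumption.

```python
def continuous_color_code(v, mn, mx):
    """Same arithmetic as color_col_by_continuous, for a single value."""
    return 228010 - round(10 * (float(v) - mn) / (mx - mn))

# String inputs on purpose: the Splunk helper in this notebook returns all columns as strings
values = ['0', '5', '10']
codes = [continuous_color_code(v, mn=0.0, mx=10.0) for v in values]
print(codes)
```

The smallest value maps to `228010` and the largest to `228000`, so the code decreases as the value grows.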

In [0]:
## remove node/edges pointing to "*::nan" values
def safe_not_nan(prog, v):
  try: 
    return not prog.match(v)
  except:
    return True
  
def drop_nan_col(df, col, prog):
  not_nans = df[col].apply(lambda v: safe_not_nan(prog, v))
  return df[ not_nans == True ]
  
def drop_nan(g, edges = ['src', 'dst'], nodes = ['nodeID']):
  prog = re.compile(".*::nan$")
  edges2 = g._edges
  for col_name in g._edges.columns:
    edges2 = drop_nan_col(edges2, col_name, prog)
  nodes2 = g._nodes
  for col_name in g._nodes.columns:
    nodes2 = drop_nan_col(nodes2, col_name, prog)
  return g.nodes(nodes2).edges(edges2)
  
graphistry.plotter.Plotter.drop_hyper_nans = drop_nan  
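
The same filtering idea can be tried on a plain dataframe, without a Graphistry plotter. This is a minimal sketch with hypothetical sample edges; it drops every row in which any column holds a `...::nan` placeholder and keeps non-string values untouched.

```python
import re
import pandas as pd

prog = re.compile(r".*::nan$")

def is_hyper_nan(v):
    # Non-string values (e.g. numbers) cannot match the pattern, so they count as valid
    return isinstance(v, str) and prog.match(v) is not None

# Hypothetical edge table in the '<category>::<value>' style
edges = pd.DataFrame({
    'src': ['ip::10.0.0.1', 'ip::10.0.0.2', 'ip::10.0.0.3'],
    'dst': ['ja3::abc123', 'ja3::nan', 'subject::nan'],
    'weight': [1, 2, 3],
})

# A row is dropped if any of its cells is a '::nan' placeholder
has_nan = edges.apply(lambda col: col.map(is_hyper_nan)).any(axis=1)
clean = edges[~has_nan]
print(clean)
```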

## Notebook intro:

## What are notebooks & why

Notebooks and their code ecosystem do a few things at the technical level:
* Web-based UI that exposes a paired Python shell session: a super terminal
* Write code, run it, see results, try again, and save your session
* Quickly connect to databases and wrangle data using the `pydata` Python ecosystem

Top teams and large organizations are adopting notebook environments like Jupyter to solve some key problems:
* Advanced individual users rely on them to access the increasingly dominant Python ecosystem
  * Fast: Use at the beginning of a project for rapid analysis & rapid prototyping
  * Smart: Easiest way to use most machine learning tools
* Teams use them as a way to collaborate: 
  * Share executable investigations for one-offs
  * Lightweight automation:  investigation plays & rule/model analyses
  * Training



### Jupyter
* Edit and run a code cell and see its output: **shift-enter** or via the UI
* You can always edit it and rerun
* Best practice: Write in order as if a full program, so you can always restart and run from the top

### Google Colab
* Hit **Connect** on the top-right to start a personal running session for this notebook -- it is ready when it says *Connected*.
* Run each *cell* of the notebook in sequence: either press the **play** button to the left of the cell, or select the cell and hit **shift-enter**. Feel free to edit a cell and rerun it (plus any cells below it that are likely impacted).
* Best practice: Write in order as if a full program, so you can always restart and run from the top


### Pandas
Most of the preprocessing code is `pandas`, the most popular Python data science tool (https://pandas.pydata.org ). Graphistry enterprise enables you to replace this kind of manual data wrangling code with shareable point-and-click solutions.

## Graphistry intro:

* Graphistry loads below in every cell that says  `...plot()`

* If you see a giant Graphistry logo over a gray background and nothing else, click the logo to start the Graphistry session

* UI Guide: https://labs.graphistry.com/graphistry/ui.html 

* Graphistry notebook examples: https://github.com/graphistry/pygraphistry

* Palettes: https://labs.graphistry.com/graphistry/docs/palette.html


Try changing "`... | head 100`"  to  "`... | head 10000`"!

In [10]:
df = splunkToPandas(
    """
    search index=corelight_tutorial 
    | dedup id.orig_h, id.resp_h, name 
    | fields - _* 
    | head 100
    """,
    {'sample_ratio': 10}) # Optional, means "sample 1 in 10"

print('# rows', len(df))
df.sample(3)

Search results:

results 79
fetching: 0 - 50000
# rows 79


Unnamed: 0,host,id.orig_h,id.resp_h,index,linecount,name,source,sourcetype,splunk_server,uid,size
24,splunk.graphistry.com,192.168.0.54,213.155.151.181,corelight_tutorial,1,data_before_established,logs.tar:./weird_20180803_16:37:08-16:40:00-07...,weird,splunk.graphistry.com,C4cXEP3YqYEkVgiD5i,
11,splunk.graphistry.com,192.168.0.53,192.168.0.1,corelight_tutorial,1,unknown_HTTP_method,logs.tar:./weird_20180803_16:37:08-16:40:00-07...,weird,splunk.graphistry.com,CkbyH62jwxViOw5VN2,
52,splunk.graphistry.com,192.168.0.54,193.149.88.236,corelight_tutorial,1,data_before_established,logs.tar:./weird_20180803_16:37:08-16:40:00-07...,weird,splunk.graphistry.com,CA76K70LJd4XYlDl4,


In [11]:
hg = graphistry.hypergraph(
    df, 
    ["id.orig_h", "id.resp_h", "name", "uid"], 
    direct=True,
    opts={
        'CATEGORIES': {
            'ip': ['id.orig_h', 'id.resp_h'] # combine repeats across columns into the same nodes
        }
    })
hg['graph'].plot()

# links 474
# events 79
# attrib entities 154


# 1. Hunting through encrypted traffic
* **Motivation**: Internal and perimeter traffic is increasingly encrypted, but we still need to look at it for reasons including auditing encryption hygiene and understanding disguised malicious traffic.
* **Input:** SSL logs
* **Methodology**
  * Search for expired, self-signed, internal CAs, old TLS versions, ...
  * Map out & investigate
    * Work through combos of `version` (e.g., anything other than TLS 1.2, which is likely old) and `validation_status` searches
    * Look for funny issuers, subjects
    * Use JA3 to fingerprint & whitelist known-good TLS clients; then just focus on traffic outside the whitelist
* **Insights**
  * 1: Clear clusters of TLS version hygiene issues across the various users & applications
  * 2: One cluster is signed... Obama?!?! 
  
* **Generalize**
   * Build whitelist of JA3 and look for violators
   * For unknown certs, characterize nature of activity based on behavior like periodic beaconing, heavy back-and-forth (tunnel), heavy data movement (exfil), ...
   * Map structure of certs in general: services -> certs -> authorities
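
The "whitelist of JA3 and look for violators" idea can be sketched in pandas roughly as follows. The whitelist contents and the helper `ja3_violators` are hypothetical; the hashes are copied from the sample output in this notebook purely for illustration and are not a vetted allowlist. Missing values are assumed to arrive as the string `nan`, since the Splunk helper casts every column to a string.

```python
import pandas as pd

# Hypothetical whitelist (illustration only)
JA3_WHITELIST = {
    'e03fdb6b99211ce6d1ed8a21abf4b25b',
    'e417b0731e0f2c81dc81ca57cb597b25',
}

def ja3_violators(df, whitelist, ja3_col='ja3'):
    """Return rows that carry a JA3 hash that is not on the whitelist."""
    has_ja3 = df[ja3_col].notna() & ~df[ja3_col].isin(['', 'nan'])
    return df[has_ja3 & ~df[ja3_col].isin(whitelist)]

# Hypothetical sample in the shape of the SSL log rows
sample = pd.DataFrame({
    'id.orig_h': ['192.168.0.54', '192.168.0.53', '192.168.0.54', '192.168.0.51'],
    'ja3': [
        'e03fdb6b99211ce6d1ed8a21abf4b25b',
        'de350869b8c85de67a350c8d186f11e6',
        'nan',
        '01f79a7537bf2cb8b8e8f450d291c632',
    ],
})

violators = ja3_violators(sample, JA3_WHITELIST)
# Which clients use each unlisted fingerprint
by_fingerprint = violators.groupby('ja3')['id.orig_h'].nunique()
print(by_fingerprint)
```

In a real hunt the whitelist would come from fingerprints you have reviewed yourself, and the violators would feed the graph views used in this section.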

In [11]:

#optional - add:     OR (version=* AND version != TLSv12)   

certs_a_df = splunkToPandas("""

    search index="corelight_tutorial" cert_chain_fuids{}=* 
    validation_status="certificate has expired" 
    OR validation_status="self signed certificate" 
    OR validation_status ="self signed certificate in certificate chain"
    
    
    | fields *
    | fields - _*
                                   

    | head 50000

    """,
    {'sample_ratio': 1})

print('# rows', len(certs_a_df))
certs_a_df.sample(10)

Search results:

results 5429
fetching: 0 - 50000
# rows 5429


Unnamed: 0,cert_chain_fuids{},cipher,curve,date_hour,date_mday,date_minute,date_month,date_second,date_wday,date_year,date_zone,established,eventtype,host,id.orig_h,id.orig_p,id.resp_h,id.resp_p,index,issuer,ja3,linecount,punct,resumed,server_name,source,sourcetype,splunk_server,splunk_server_group,subject,timeendpos,timestartpos,ts,uid,unix_category,unix_group,validation_status,version,last_alert,next_protocol
1320,"['FQZjFv40RSQpUy84Uj', 'FLBUia3CvV6rXlaco9', '...",TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,secp256r1,23,3,38,august,48,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,58973,108.160.162.115,443,corelight_tutorial,"CN=Go Daddy Secure Certificate Authority - G2,...",8d0230b6ce881f161d1875364f4a156b,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.dropbox.com,OU=Domain Control Validated",34,7,2018-08-03T23:38:48.865254Z,Cb2FMh1dXasX8ErQS7,all_hosts,default,certificate has expired,TLSv10,,
447,"['F5FcVbezfJGgnXJMh', 'FMOZCdFRjdUImboB7', 'Ft...",TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,secp256r1,23,3,38,august,52,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,60552,213.155.151.185,443,corelight_tutorial,"CN=Google Internet Authority G2,O=Google Inc,C=US",e417b0731e0f2c81dc81ca57cb597b25,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,play.google.com,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.google.com,O=Google Inc,L=Mountain View,S...",34,7,2018-08-03T23:38:52.096438Z,CTmqJD21MWioFpGYj5,all_hosts,default,certificate has expired,TLSv12,,
1365,"['FPgvM233HvfObdRu1a', 'FBXJYA2TX28U8rhjG8', '...",TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,secp256r1,23,3,38,august,48,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,58691,54.230.99.217,443,corelight_tutorial,"CN=VeriSign Class 3 Secure Server CA - G3,OU=T...",e03fdb6b99211ce6d1ed8a21abf4b25b,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,d2d8g5sjza4b48.cloudfront.net,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.cloudfront.net,O=Amazon.com\, Inc.,L=Seat...",34,7,2018-08-03T23:38:48.149782Z,COskUkJOeimG7c8He,all_hosts,default,certificate has expired,TLSv12,,
4318,"['FTPLH52E1n47WYWsZ3', 'FKngDk166EFPRwl8Kj', '...",TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,secp256r1,23,3,37,august,47,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.51,47228,217.72.201.130,443,corelight_tutorial,"CN=thawte SSL CA - G2,O=thawte\, Inc.,C=US",01f79a7537bf2cb8b8e8f450d291c632,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,3c-bs.gmx.com,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=3c-bs.gmx.com,O=1&1 Mail & Media Inc.,L=Che...",34,7,2018-08-03T23:37:47.443768Z,C9zRQz4DJgIiOzRG68,all_hosts,default,certificate has expired,TLSv12,,
3832,"['F3smzS84KN50e0ail', 'FMiALu2wPM6lmSaYNf', 'F...",TLS_RSA_WITH_3DES_EDE_CBC_SHA,,23,3,37,august,54,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.53,2140,212.227.111.53,443,corelight_tutorial,"CN=thawte SSL CA - G2,O=thawte\, Inc.,C=US",de350869b8c85de67a350c8d186f11e6,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=navigator-bs.gmx.com,O=1&1 Mail & Media Inc...",34,7,2018-08-03T23:37:54.484612Z,CndgSe2hXssGB423Cb,all_hosts,default,certificate has expired,TLSv10,,
108,"['FWNhow4DgTw6wjM2n5', 'FyqNa4Jfi3ugk0kNi', 'F...",TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,secp256r1,23,3,38,august,58,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,65169,173.194.71.189,443,corelight_tutorial,"CN=Google Internet Authority G2,O=Google Inc,C=US",e03fdb6b99211ce6d1ed8a21abf4b25b,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,6.client-channel.google.com,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.mail.google.com,O=Google Inc,L=Mountain V...",34,7,2018-08-03T23:38:58.478453Z,CI0Gi9sVsgaAxk8ea,all_hosts,default,certificate has expired,TLSv12,,
3647,"['FpqQSy49nt4FcSjhb', 'FG2diV2hbPbjisW39h', 'F...",TLS_RSA_WITH_3DES_EDE_CBC_SHA,,23,3,37,august,55,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.53,3172,217.72.201.130,443,corelight_tutorial,"CN=thawte SSL CA - G2,O=thawte\, Inc.,C=US",de350869b8c85de67a350c8d186f11e6,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=3c-bs.gmx.com,O=1&1 Mail & Media Inc.,L=Che...",34,7,2018-08-03T23:37:55.876996Z,C1Yfgb4mn7cCixGvmg,all_hosts,default,certificate has expired,TLSv10,,
3490,"['FWIxfa2SDjdMzrnoCi', 'FWh9rP1nEb2WaKquqc', '...",TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,secp256r1,23,3,38,august,1,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,54325,213.155.151.151,443,corelight_tutorial,"CN=Google Internet Authority G2,O=Google Inc,C=US",e03fdb6b99211ce6d1ed8a21abf4b25b,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,encrypted-tbn0.gstatic.com,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.google.com,O=Google Inc,L=Mountain View,S...",34,7,2018-08-03T23:38:01.197148Z,Cba4DE4Joy5sToH3Ga,all_hosts,default,certificate has expired,TLSv12,,
4254,"['FuTIDr2sly1alWQmdc', 'FIvCEv4orb8wRYCzVb', '...",TLS_RSA_WITH_AES_128_GCM_SHA256,,23,3,37,august,47,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.54,49791,37.252.162.22,443,corelight_tutorial,"CN=GeoTrust SSL CA - G2,O=GeoTrust Inc.,C=US",bfa1674e65282fa3b5444623156e83bd,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,secure.adnxs.com,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...","CN=*.adnxs.com,O=AppNexus\, Inc.,L=New York,ST...",34,7,2018-08-03T23:37:47.528301Z,CV5iOU3t71sYvLrsdl,all_hosts,default,certificate has expired,TLSv12,,
984,"['FWDhHf30wTgq26Xkdd', 'FU0AuM32YTbHwyQPSi']",TLS_RSA_WITH_RC4_128_SHA,,23,3,38,august,49,friday,2018,0,True,nix-all-logs,splunk.graphistry.com,192.168.0.53,4094,157.55.239.247,443,corelight_tutorial,"CN=Microsoft IT SSL SHA2,OU=Microsoft IT,O=Mic...",de350869b8c85de67a350c8d186f11e6,1,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",False,,logs.tar:./ssl_20180803_16:37:08-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",CN=urs.microsoft.com,34,7,2018-08-03T23:38:49.851328Z,CEuC7D2BGQVsNZCn83,all_hosts,default,certificate has expired,TLSv10,,


### The graph:
* **Nodes**: IPs, ja3 (TLS metadata hashes), cert subject/issuers, colored by category
* **Edges**: Color by TLS version, title by issuer

In [26]:
hg = graphistry.hypergraph(
    certs_a_df, 
    ["id.orig_h", "id.resp_h", "uid", "ja3", "issuer", "subject"], ### "uid", "protocol", ....
    direct=True,
    opts={
        **opts,
        'EDGES': {
            'id.orig_h': ["id.resp_h", "ja3", "subject"],
            'ja3': ['id.resp_h'],
            "subject": ['id.resp_h'],
            'issuer': ['id.resp_h']
        }})

hg['graph'].bind(edge_title='category').drop_hyper_nans().color_points_by('category').color_edges_by('version').plot()

# links 32574
# events 5429
# attrib entities 6647


# 2. Hunting Insider Threats with NTLM+SMB

* **Motivation**: NTLM (NT LAN Manager) logins are suspicious, especially on sensitive data shares, and worth auditing.
* **Input**: NTLM + SMB + other network logs (for any other IPs/activities)
* **Methodology**:
  * Seed search by NTLM activity
  * Get all other logs involving those UIDs
  * Map & audit

* **Insights**:
  * Cluster 1: Sonos smart speakers seem to be opening network-shared data that have nothing to do with listening to music
  * Cluster 2: Same `Workgroup` domain name, yet on a rogue host
  
* **Generalize**:
  * Map & audit NTLM and other remote logins to begin with
  * From those hits, look at other file shares beyond SMB - Dropbox, wikis, ...
  * Map & audit file shares in general
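
The seed-then-expand methodology above (find NTLM activity, then pull every log sharing those UIDs) can be sketched client-side in pandas. The sample rows are hypothetical, and matching on `sourcetype` is an assumption made for the sketch; the Splunk query in this section seeds on the bare term `ntlm` instead.

```python
import pandas as pd

# Hypothetical mixed Zeek logs keyed by connection UID
logs = pd.DataFrame({
    'uid': ['C1', 'C1', 'C2', 'C3'],
    'sourcetype': ['ntlm', 'smb_files', 'conn', 'ntlm'],
    'id.orig_h': ['172.16.1.8', '172.16.1.8', '192.168.0.54', '125.5.61.130'],
})

# Step 1: seed with the UIDs of NTLM events
seed_uids = logs.loc[logs['sourcetype'].str.contains('ntlm'), 'uid'].unique()

# Step 2: expand to every log line that shares one of those UIDs
related = logs[logs['uid'].isin(seed_uids)]
print(related)
```

Here the `smb_files` row is pulled in because it shares UID `C1` with an NTLM login, while the unrelated `conn` row is left out.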


In [23]:
ntlm_a_df = splunkToPandas("""

    search index="corelight_tutorial" 
        [ search index="corelight_tutorial" ntlm | dedup uid | fields + uid  ]
    | fields * 
                                   

    | head 1000

    """,
    {'sample_ratio': 1})

print('# rows', len(ntlm_a_df))
ntlm_a_df.sample(3)

Search results:

results 46
fetching: 0 - 50000
# rows 46


Unnamed: 0,action,date_hour,date_mday,date_minute,date_month,date_second,date_wday,date_year,date_zone,eventtype,host,id.orig_h,id.orig_p,id.resp_h,id.resp_p,index,linecount,name,punct,size,source,sourcetype,splunk_server,splunk_server_group,timeendpos,times.accessed,times.changed,times.created,times.modified,timestartpos,ts,uid,unix_category,unix_group,_bkt,_cd,_eventtype_color,_indextime,_raw,_serial,_si,_sourcetype,_subsecond,_time,native_file_system,path,service,share_type,domainname,hostname,status,success,username,tag,tag::eventtype,actions{},dropped,dst,msg,note,p,peer_descr,proto,src,suppress_for,conn_state,duration,history,local_orig,local_resp,missed_bytes,orig_bytes,orig_ip_bytes,orig_pkts,resp_bytes,resp_ip_bytes,resp_pkts,orig_cc
40,,23,3,39,august,1,friday,2018,0,nix-all-logs,splunk.graphistry.com,125.5.61.130,4577,10.0.0.11,445,corelight_tutorial,1,,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",,logs.tar:./notice_20180803_16:37:37-16:40:00-0...,notice-too_small,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",34,,,,,7,2018-08-03T23:39:01.314346Z,C2P6jt32gESqlJqb32,all_hosts,default,corelight_tutorial~0~67A851F4-1BFE-4874-B653-8...,0:8559886,none,1558081367,"{""ts"":""2018-08-03T23:39:01.314346Z"",""uid"":""C2P...",40,"['splunk.graphistry.com', 'corelight_tutorial']",notice-too_small,0.314346,2018-08-03T23:39:01.314+00:00,,,,,,,,,,,,Notice::ACTION_LOG,False,10.0.0.11,SMBv1 Connection 125.5.61.130 to 10.0.0.11,FindSMBv1::Seen,445.0,bro,tcp,125.5.61.130,3600.0,,,,,,,,,,,,,
30,,23,3,39,august,2,friday,2018,0,nix-all-logs,splunk.graphistry.com,172.16.1.8,38889,172.16.1.7,445,corelight_tutorial,1,,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",,logs.tar:./ntlm_20180803_16:39:01-16:40:00-070...,ntlm-too_small,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",34,,,,,7,2018-08-03T23:39:02.806384Z,CEYfiD3mbXWS12t6c1,all_hosts,default,corelight_tutorial~0~67A851F4-1BFE-4874-B653-8...,0:8560022,none,1558081367,"{""ts"":""2018-08-03T23:39:02.806384Z"",""uid"":""CEY...",30,"['splunk.graphistry.com', 'corelight_tutorial']",ntlm-too_small,0.806384,2018-08-03T23:39:02.806+00:00,,,,,WORKGROUP,INTENSE,SUCCESS,True,sonos,,,,,,,,,,,,,,,,,,,,,,,,,
13,SMB::FILE_OPEN,23,3,39,august,2,friday,2018,0,nix-all-logs,splunk.graphistry.com,172.16.1.8,38896,172.16.1.7,445,corelight_tutorial,1,\hack\reporter.log,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",498.0,logs.tar:./smb_files_20180803_16:39:01-16:40:0...,smb_files-too_small,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",34,2018-07-24T17:56:04.616403Z,2018-07-24T17:56:04.620403Z,2018-07-24T17:56:04.616403Z,2018-07-24T17:56:04.620403Z,7,2018-08-03T23:39:02.858240Z,COGaRD3cM7jP2XFdy8,all_hosts,default,corelight_tutorial~0~67A851F4-1BFE-4874-B653-8...,0:8560959,none,1558081367,"{""ts"":""2018-08-03T23:39:02.858240Z"",""uid"":""COG...",13,"['splunk.graphistry.com', 'corelight_tutorial']",smb_files-too_small,0.85824,2018-08-03T23:39:02.858+00:00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


### The graph:

Focus on representing NTLM, SMB, and generic Bro/Zeek logs.

* **Nodes**: IPs, domains/hosts/usernames, files/paths, colored by category
* **Edges**: Zeek events connecting them, colored by username

In [24]:
hg = graphistry.hypergraph(
    ntlm_a_df, 
    ["id.orig_h", "name", "id.resp_h", "path", "hostname", "domainname", "username"], ### "uid", "protocol", ....
    direct=True,
    opts={
        **opts,
        'EDGES': {
            "username": ['id.orig_h'],
            "id.orig_h": ['name', 'id.resp_h',  "hostname", "domainname"],       
            'path': ['name'],
            'hostname': ['id.resp_h'],
            'domainname': ['id.resp_h'],
            "name": ['id.resp_h'],
            "id.resp_h": ['username']
        }})
        

hg['graph'].bind(edge_title='name').drop_hyper_nans().color_points_by('category').color_edges_by('username').plot()

# links 460
# events 46
# attrib entities 31


# 3. DNS Tunneling

### 3.A. Setup -- General DNS map 

General query for looking at DNS connections with Bro/Zeek. 

Instead of showing each connection, summarize all activity for each unique IP<>IP pair (capped at 50,000 pairs by the `| head 50000` in the query): max bytes, first/last communication, ...

For UI work, compute the `log(..)` of bytes
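
As a rough client-side analogue of this per-pair summary, the pandas sketch below groups hypothetical connection rows by IP pair and adds a log-scaled byte column. The column names follow the Zeek fields used in this notebook; the sample values and the output column names are made up for illustration, and `np.log1p` (natural log of `v + 1`) stands in for the `log(v + 1)` scaling of the `safe_log` helper.

```python
import numpy as np
import pandas as pd

# Hypothetical connection records
conn = pd.DataFrame({
    'id.orig_h': ['192.168.0.51', '192.168.0.51', '192.168.0.54'],
    'id.resp_h': ['131.103.28.9', '131.103.28.9', '54.149.255.94'],
    'ts': [100.0, 101.5, 200.0],  # epoch seconds, made up
    'orig_ip_bytes': [1000, 674, 104],
    'resp_ip_bytes': [6000, 243, 0],
})

conn['total_bytes'] = conn['orig_ip_bytes'] + conn['resp_ip_bytes']

# One row per IP<>IP pair: counts, first/last time seen, and byte totals
summary = (
    conn.groupby(['id.orig_h', 'id.resp_h'])
        .agg(count=('ts', 'size'),
             first_time=('ts', 'min'),
             last_time=('ts', 'max'),
             max_total_bytes=('total_bytes', 'max'),
             sum_total_bytes=('total_bytes', 'sum'))
        .reset_index()
)

# Log scale keeps a few huge transfers from dominating the color range in the UI
summary['log_max_total_bytes'] = np.log1p(summary['max_total_bytes'])
print(summary)
```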

In [110]:
dns_a_df = splunkToPandas("""

    search index="corelight_tutorial" sourcetype="conn"
    
    | eval total_bytes = orig_ip_bytes + resp_ip_bytes
    | eval log_total_bytes = log(orig_ip_bytes + resp_ip_bytes)

    | stats
    count(_time) as count,
    earliest(_time), latest(_time),
    values(answers{}) as answers,
    values(conn_state),
    values(history),
    values(issuer),
    values(ja3),
    values(last_alert),
    values(qtype_name),
    values(subject),
    max(*bytes), avg(*bytes), sum(*bytes),

    by id.orig_h, id.resp_h

    | eval duration_ms = last_time_ms - first_time_ms

    | head 50000

    """,
    {'sample_ratio': 1})

print('# rows', len(dns_a_df))
dns_a_df.sample(3)

Search results:

results 13412
fetching: 0 - 50000
# rows 13412


Unnamed: 0,id.orig_h,id.resp_h,count,earliest(_time),latest(_time),values(conn_state),max(log_total_bytes),max(missed_bytes),max(orig_ip_bytes),max(resp_ip_bytes),max(total_bytes),avg(log_total_bytes),avg(missed_bytes),avg(orig_ip_bytes),avg(resp_ip_bytes),avg(total_bytes),sum(log_total_bytes),sum(missed_bytes),sum(orig_ip_bytes),sum(resp_ip_bytes),sum(total_bytes),values(history),max(orig_bytes),max(resp_bytes),avg(orig_bytes),avg(resp_bytes),sum(orig_bytes),sum(resp_bytes),answers,values(qtype_name),values(issuer),values(ja3),values(subject),values(last_alert)
7526,192.168.0.54,54.149.255.94,1,1533339495.89632,1533339495.89632,S0,2.0170333392987803,0,104,0,104,2.0170333392987803,0,104,0,104,2.0170333392987803,0,104,0,104,S,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,
2073,192.168.0.51,131.103.28.9,2,1533339457.494446,1533339457.495081,SF,3.898560644939712,0,1674,6243,7917,3.898560644939712,0,1674,6243,7917,3.898560644939712,0,1674,6243,7917,ShADadtfF,822.0,5299.0,822.0,5299.0,822.0,5299.0,,,"CN=DigiCert SHA2 High Assurance Server CA,OU=w...",aa7f5e2ada5d7bb8a7dceed01f5ffd7c,"CN=*.atlassian.com,O=Atlassian Pty Ltd,L=Sydne...",
8156,192.168.0.54,70.83.216.152,1,1533339495.028559,1533339495.028559,S0,1.7323937598229684,0,54,0,54,1.7323937598229684,0,54,0,54,1.7323937598229684,0,54,0,54,D,,,,,,,,,,,,


### Graph demo

* Nodes are IPs
* Edges summarize all activity per IP<>IP: first time, ... 
  * Color by total bytes in/out

In [111]:
hg = graphistry.hypergraph(
    dns_a_df, 
    ["id.orig_h", "id.resp_h"], ### "uid", "protocol", ....
    direct=True,
    opts=opts)

hg['graph'].color_points_by('category').color_edges_by('max(log_total_bytes)', 'continuous').bind(edge_title='max(total_bytes)').plot()

# links 13412
# events 13412
# attrib entities 11814
coloring for range 1.591064607026499 8.528258188610675


### 3.B. DNS Tunnel:

* **Motivation**: DNS is a sneaky channel for hiding activity. Need to detect & unravel, whether proactive or post-breach.

* **Input**: DNS connections

* **Methodology**: 

 * Search for the top 10,000  ip->(unique dns query)->ip summaries matching tunneling heuristics:
   *  length(query) > 25: exfil / command request
   *  length(answer) > 45: received command 
 * Inspect & explain all flagged behavior
   * Pay attention to long and artificial looking queries & answers
   * Exfil: big or many queries
   * Command: strange responses
   * Tunneling: heavy back-and-forth

* **Insights**
  * Two clusters of activity
  * One seems to be tunneling: back-and-forth
  * The other seems to not have answers
  
 
* **Generalize**

  The hunt continues on the identified UIDs and IPs for demo purposes, but does not reveal much. What can you find?

  On a full SIEM:
  * Check periodicity (timebar) for bot vs human
  * Combine with endpoint logs to correlate processes, file accesses, and users
  * Combine with alert logs to trace back to initial breach and subsequent behavior
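
One simple way to approximate the "bot vs human" periodicity check is to look at how regular the gaps between events are. The sketch below computes the coefficient of variation (standard deviation divided by mean) of inter-arrival times; the function name, the sample timestamps, and any threshold you might apply are assumptions, not part of the original hunt.

```python
import numpy as np

def interarrival_cv(timestamps):
    """Coefficient of variation of the gaps between events; None if there are too few events."""
    ts = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(ts)
    if len(gaps) < 2 or gaps.mean() == 0:
        return None
    return float(gaps.std() / gaps.mean())

# Hypothetical epoch-second timestamps
bot_like = [0, 60, 120, 180, 240, 300]    # fires every 60 seconds
human_like = [0, 5, 200, 210, 900, 905]   # bursty and irregular

bot_cv = interarrival_cv(bot_like)
human_cv = interarrival_cv(human_like)
print(bot_cv, human_cv)
```

Values near zero suggest machine-like regularity, while bursty human activity tends to score much higher. Real beacons often add jitter, so treat this as a first-pass signal only.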

In [121]:
dns_b_df = splunkToPandas("""

    search index="corelight_tutorial" sourcetype="conn"
    
    | eval total_bytes = orig_ip_bytes + resp_ip_bytes
    | eval log_total_bytes = log(orig_ip_bytes + resp_ip_bytes)

    | eval query_length = length(query)
    | eval long_answers=mvfilter(length('answers{}') > 45)
    | eval long_answers_length = max(length(long_answers))
    | where query_length > 25 OR long_answers_length > 45


    | stats
    count(_time) as count,
    earliest(_time), latest(_time),
    values(answers{}) as answers,
    max(long_answers_length) as max_long_answers_length,
    values(conn_state),
    values(history),
    values(issuer),
    values(ja3),
    values(last_alert),
    values(subject),
    max(*bytes), avg(*bytes), sum(*bytes),
    values(qtype_name),
    first(uid),

    
    by id.orig_h, id.resp_h, query, query_length                                

    | eval duration_ms = last_time_ms - first_time_ms
    
    | eval query=substr(query,1,100)
    | eval max_query_or_answer_length = max(query_length, max_long_answers_length)
    | sort max_query_or_answer_length desc                                           

    | head 50000

    """,
    {'sample_ratio': 1})

print('# rows', len(dns_b_df))
dns_b_df.sample(3)

Search results:

results 10000
fetching: 0 - 50000
# rows 10000


Unnamed: 0,id.orig_h,id.resp_h,query,query_length,count,earliest(_time),latest(_time),answers,values(qtype_name),first(uid),max_query_or_answer_length,max_long_answers_length
4002,192.168.1.128,34.215.241.13,586301a21f2856f046af6810d4c9f859b4d2c256a9b638...,228,1,1533339541.689877,1533339541.689877,108301a21f368b8052f9baffff18fed30b.sweetcoldwa...,MX,CaAbvy2ureWe5sifRf,228,53
3204,192.168.1.128,34.215.241.13,469501a21fd21adfdab9d4009aca8b170f9f3c8d5f060c...,228,1,1533339541.637895,1533339541.637895,da6d01a21f319600da70b5ffff18e6782c.sweetcoldwa...,CNAME,CaAbvy2ureWe5sifRf,228,53
9399,192.168.1.128,34.215.241.13,cf4201a21fed8911b08e25090722890fcf8d5fa7a4c436...,228,1,1533339541.664138,1533339541.664138,fac401a21f7c72f814e2bbffff18fe302d.sweetcoldwa...,CNAME,CaAbvy2ureWe5sifRf,228,53
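
The length heuristics used in the query above (query longer than 25 characters, or any answer longer than 45) can also be applied client-side. This is a minimal pandas sketch under the assumption that `answers` is available as a list of strings per row, which is not how the Splunk helper in this notebook returns it; the function name and sample domains are hypothetical.

```python
import pandas as pd

QUERY_LEN_MIN = 25    # length(query) > 25: possible exfil / command request
ANSWER_LEN_MIN = 45   # length(answer) > 45: possible received command

def flag_dns_tunneling(df):
    """Add length columns and a boolean 'flagged' column based on the two thresholds."""
    out = df.copy()
    out['query_length'] = out['query'].fillna('').str.len()
    # Longest answer per row; rows without answers get 0
    out['max_answer_length'] = out['answers'].apply(
        lambda answers: max((len(a) for a in answers), default=0))
    out['flagged'] = (
        (out['query_length'] > QUERY_LEN_MIN)
        | (out['max_answer_length'] > ANSWER_LEN_MIN)
    )
    return out

# Hypothetical DNS rows
dns = pd.DataFrame({
    'query': ['www.example.com', 'a' * 40 + '.example.com', 'short.example.com'],
    'answers': [['93.184.216.34'], [], ['b' * 50 + '.example.com']],
})

flagged = flag_dns_tunneling(dns)
print(flagged[['query_length', 'max_answer_length', 'flagged']])
```

The second row is flagged for its long query even though it has no answers, and the third for its long answer; an ordinary lookup like the first row passes both checks.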


### The graph:

* **Nodes**: IPs, queries, answers
* **Edges**: Summaries along each  orig_h->query->resp_h->answer->orig_h

**Results**:
* **UIDs**: C3ApkJ3TwWW64DtnWb , CaAbvy2ureWe5sifRf
* **IPs**: 10.0.2.30 10.0.2.20  34.215.241.13 192.168.1.128


In [122]:
hg = graphistry.hypergraph(
    dns_b_df, 
    ["id.orig_h", "id.resp_h", "query", "answers"], ### "uid", "protocol", ....
    direct=True,
    opts={
        **opts,
        'EDGES': {
            'id.orig_h': ['query'],
            'query': ['id.resp_h'],
            'id.resp_h': ['answers'],
            'answers': ['id.orig_h']
        }})

g = hg['graph'].bind(edge_title='query').drop_hyper_nans().color_points_by('category').color_edges_by('max_query_or_answer_length', 'continuous')

g.plot()

# links 40000
# events 10000
# attrib entities 19444
coloring for range 228.0 252.0
Uploading 5279 kB. This may take a while...
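
The `# links 40000` figure is consistent with four column pairs in `EDGES` times 10,000 rows. The sketch below shows that expansion in plain Python, assuming one link per row per column pair. It only illustrates the idea and is not Graphistry's implementation; `expand_edges` and the sample rows are made up for the example.

```python
# Hypothetical stand-in for the EDGES option: source column -> destination columns.
EDGES = {
    "id.orig_h": ["query"],
    "query": ["id.resp_h"],
    "id.resp_h": ["answers"],
    "answers": ["id.orig_h"],
}

def expand_edges(rows, edges=EDGES):
    """Emit one (src_col, src_val, dst_col, dst_val) link per row per column pair."""
    links = []
    for row in rows:
        for src_col, dst_cols in edges.items():
            for dst_col in dst_cols:
                src, dst = row.get(src_col), row.get(dst_col)
                if src is None or dst is None:
                    continue  # in this sketch, skip pairs with a missing endpoint
                links.append((src_col, src, dst_col, dst))
    return links

# Made-up rows shaped like the query result; the second one has no answer.
rows = [
    {"id.orig_h": "192.168.1.128", "id.resp_h": "34.215.241.13",
     "query": "q1.example", "answers": "a1.example"},
    {"id.orig_h": "192.168.1.128", "id.resp_h": "34.215.241.13",
     "query": "q2.example", "answers": None},
]
links = expand_edges(rows)
print(len(links))
```

A fully populated row contributes four links, one per hop in the chain, which is why the link count is a multiple of the event count.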


### Dig into interesting UIDs and IPs 1: IP map
* Surface the IPs that the flagged UIDs and IPs interacted with
* Check which log types are available
  * `sourcetype`s: `conn` and `weird`
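
The cell that follows pastes the UIDs and IPs into the search by hand. A small helper could build the same OR-joined term list from Python lists. `build_search` is a hypothetical name and not part of the notebook, and it wraps the terms in parentheses, which the original query does not.

```python
def build_search(index, uids, ips, sourcetype=None):
    """Return an SPL search string that matches any of the given UIDs or IPs."""
    terms = " OR ".join(list(uids) + list(ips))
    head = f'search index="{index}"'
    if sourcetype:
        head += f' sourcetype="{sourcetype}"'
    # Parentheses keep the OR group separate from the index/sourcetype filters.
    return f"{head} ({terms})"

uids = ["C3ApkJ3TwWW64DtnWb", "CaAbvy2ureWe5sifRf"]
ips = ["10.0.2.30", "10.0.2.20", "34.215.241.13", "192.168.1.128"]
query = build_search("corelight_tutorial", uids, ips)
print(query)
```

The returned string could be passed to `splunkToPandas` in place of the hand-written search, followed by the same `| eval`, `| fields`, and `| head` stages.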

In [124]:
dns_b2_df = splunkToPandas("""

    search index="corelight_tutorial" 
    C3ApkJ3TwWW64DtnWb OR CaAbvy2ureWe5sifRf OR 10.0.2.30 OR 10.0.2.20  OR 34.215.241.13 OR 192.168.1.128
    | eval time=ts
    | rename answers{} as answers
    | fields *
    | fields - _*
                                   

    | head 50000

    """,
    {'sample_ratio': 1})

print('# rows', len(dns_b2_df))
dns_b2_df.sample(3)

Search results:

results 36352
fetching: 0 - 50000
# rows 36352


Unnamed: 0,date_hour,date_mday,date_minute,date_month,date_second,date_wday,date_year,date_zone,eventtype,host,id.orig_h,id.orig_p,id.resp_h,id.resp_p,index,linecount,name,notice,punct,source,sourcetype,splunk_server,splunk_server_group,time,timeendpos,timestartpos,ts,uid,unix_category,unix_group,conn_state,duration,history,local_orig,local_resp,missed_bytes,orig_bytes,orig_ip_bytes,orig_pkts,proto,resp_bytes,resp_ip_bytes,resp_pkts,AA,RA,RD,TC,Z,qclass,qclass_name,qtype,qtype_name,query,rejected,trans_id,TTLs{},answers,rcode,rcode_name,rtt,resp_cc,service,addl
27809,23,3,39,august,1,friday,2018,0,nix-all-logs,splunk.graphistry.com,192.168.1.128,62035,34.215.241.13,53,corelight_tutorial,1,,,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",logs.tar:./dns_20180803_16:36:44-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",2018-08-03T23:39:01.801257Z,34,7,2018-08-03T23:39:01.801257Z,CaAbvy2ureWe5sifRf,all_hosts,default,,,,,,,,,,udp,,,,False,True,True,False,0.0,1.0,C_INTERNET,16.0,TXT,558601a21fe8d80facb642208680d56ffd9a327fabbde2...,False,52644.0,60.0,TXT 34 1ded01a21f9d26a538aec8ffff18fe16c1,0.0,NOERROR,2.6e-05,,,
4819,23,3,39,august,2,friday,2018,0,nix-all-logs,splunk.graphistry.com,192.168.1.128,56308,192.168.1.180,1070,corelight_tutorial,1,,,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",logs.tar:./conn_20180803_16:37:13-16:40:00-070...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",2018-08-03T23:39:02.325427Z,34,7,2018-08-03T23:39:02.325427Z,CSJA4w3NXtHNCDjKff,all_hosts,default,REJ,4.7e-05,Sr,True,True,0.0,0.0,44.0,1.0,tcp,0.0,40.0,1.0,,,,,,,,,,,,,,,,,,,,
27698,23,3,39,august,1,friday,2018,0,nix-all-logs,splunk.graphistry.com,192.168.1.128,62035,34.215.241.13,53,corelight_tutorial,1,,,"{"""":""--::."","""":"""",""."":""..."",""."":,""."":""..."","".""...",logs.tar:./dns_20180803_16:36:44-16:40:00-0700...,conn,splunk.graphistry.com,"['dmc_group_cluster_master', 'dmc_group_deploy...",2018-08-03T23:39:01.803307Z,34,7,2018-08-03T23:39:01.803307Z,CaAbvy2ureWe5sifRf,all_hosts,default,,,,,,,,,,udp,,,,False,True,True,False,0.0,1.0,C_INTERNET,5.0,CNAME,2a6901a21f468022f3acf320ede6e51fc0b952fb2be047...,False,12869.0,60.0,ab6101a21f33dc16682c22ffff18fe346b.sweetcoldwa...,0.0,NOERROR,7e-06,,,
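
To see which log types the returned rows cover, count them by `sourcetype`. The sketch below does this in plain Python on made-up records; `count_by` is a hypothetical helper. On the real frame, `dns_b2_df['sourcetype'].value_counts()` would give the equivalent tally.

```python
from collections import Counter

def count_by(rows, field):
    """Count how many rows carry each value of `field` (missing values count as None)."""
    return Counter(row.get(field) for row in rows)

# Made-up records; only the field name mirrors the Splunk output above.
rows = [
    {"sourcetype": "conn", "uid": "uid-1"},
    {"sourcetype": "conn", "uid": "uid-2"},
    {"sourcetype": "weird", "uid": "uid-3"},
]
counts = count_by(rows, "sourcetype")
for sourcetype, n in counts.most_common():
    print(sourcetype, n)
```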


In [125]:
hg = graphistry.hypergraph(
    dns_b2_df, 
    ["id.orig_h", "id.resp_h"], ### "uid", "protocol", ....
    direct=True,
    opts=opts)

hg['graph'].bind(edge_title='sourcetype').drop_hyper_nans().color_points_by('category').color_edges_by('sourcetype').plot()

# links 36352
# events 36352
# attrib entities 32


### Dig into interesting UIDs and IPs 2: Connections
* The events are mostly connections, so inspect them from that perspective
* Reuse the DNS query from before

In [132]:
dns_b3_df = splunkToPandas("""

    search index="corelight_tutorial" sourcetype="conn"
    C3ApkJ3TwWW64DtnWb OR CaAbvy2ureWe5sifRf OR 10.0.2.30 OR 10.0.2.20  OR 34.215.241.13 OR 192.168.1.128
    
    | eval total_bytes = orig_ip_bytes + resp_ip_bytes
    | eval log_total_bytes = log(orig_ip_bytes + resp_ip_bytes)

    | eval query_length = length(query)
    | eval long_answers=mvfilter(length('answers{}') > 45)
    | eval long_answers_length = max(length(long_answers))
    | where query_length > 25 OR long_answers_length > 45


    | stats
    count(_time) as count,
    earliest(_time), latest(_time),
    values(answers{}) as answers,
    max(long_answers_length) as max_long_answers_length,
    values(conn_state),
    values(history),
    values(issuer),
    values(ja3),
    values(last_alert),
    values(subject),
    max(*bytes), avg(*bytes), sum(*bytes),
    values(qtype_name),
    first(uid)
    
    by id.orig_h, id.resp_h, query, query_length                               

    | eval duration_ms = last_time_ms - first_time_ms
    
    | eval query=substr(query,1,100)
    | eval max_query_or_answer_length = max(query_length, max_long_answers_length)
    | sort max_query_or_answer_length desc                                           

    | head 50000

    """,
    {'sample_ratio': 1})

print('# rows', len(dns_b3_df))
dns_b3_df.sample(3)

Search results:

results 10000
fetching: 0 - 50000
# rows 10000


Unnamed: 0,id.orig_h,id.resp_h,query,query_length,count,earliest(_time),latest(_time),answers,values(qtype_name),first(uid),max_query_or_answer_length,max_long_answers_length
1450,192.168.1.128,34.215.241.13,1fa301a21fad17d74da48c2691f28cafc9d1174b4b4aa0...,228,1,1533339541.827051,1533339541.827051,719b01a21fb4847a8590f6ffff18fe292d.sweetcoldwa...,,CaAbvy2ureWe5sifRf,228,53
9370,192.168.1.128,34.215.241.13,ceb101a21fa46c6c9589542b03c29258c9ed5a1eb36f55...,228,1,1533339541.838568,1533339541.838568,ba4a01a21f2a46458b317effff18feab79.sweetcoldwa...,,CaAbvy2ureWe5sifRf,228,53
3141,192.168.1.128,34.215.241.13,450601a21f17ea238c375d0eafb2c10619d0f9fde7b685...,228,1,1533339541.685329,1533339541.685329,0d7901a21f18392f8a98f9ffff18fe2b9d.sweetcoldwa...,MX,CaAbvy2ureWe5sifRf,228,53
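
The `where` clause in the query keeps events whose query is longer than 25 characters or that carry an answer longer than 45 characters. The same test can be sketched in plain Python. `is_long_dns` and the sample events are made up for illustration, and `answers` is assumed to be a list of strings.

```python
def is_long_dns(event, query_threshold=25, answer_threshold=45):
    """Mirror the SPL filter: long query OR at least one long answer."""
    query = event.get("query") or ""
    answers = event.get("answers") or []
    # Length of the longest answer, or 0 when the event has no answers.
    longest_answer = max((len(a) for a in answers), default=0)
    return len(query) > query_threshold or longest_answer > answer_threshold

# Made-up events: a normal lookup, a long query, and a long answer.
events = [
    {"query": "example.com", "answers": ["93.184.216.34"]},
    {"query": "a" * 30 + ".example.com", "answers": []},
    {"query": "short.example", "answers": ["b" * 53]},
]
flags = [is_long_dns(e) for e in events]
print(flags)
```

Unusually long names and answers are what make DNS tunneling stand out, which is why the query also sorts by `max_query_or_answer_length`.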


In [138]:
hg = graphistry.hypergraph(
    dns_b3_df, 
    ["id.orig_h", "id.resp_h", "query", "answers", "first(uid)"], ### "uid", "protocol", ....
    direct=True,
    opts={
        **opts,
        'EDGES': {
            'id.orig_h': ['query'],
            'query': ['id.resp_h'],
            'id.resp_h': ['answers'],
            'answers': ['id.orig_h']
        }})

hg['graph'].bind(edge_title='query').drop_hyper_nans().color_points_by('category').color_edges_by('max_query_or_answer_length', 'continuous').plot()

# links 40000
# events 10000
# attrib entities 19446
coloring for range 228.0 252.0
Uploading 5280 kB. This may take a while...


# 4. Mimetype Mismatch

* **Motivation**: When following up on an incident or doing a sweep, a common finding is an executable file hiding behind an extension like ".jpeg". Such a file calls into question the UIDs of all entities involved.

* **Data**: Multiple. Ex:

* **Methodology**: 
  * Entity of interest: `index=main sourcetype=corelight* filename!=*.exe mime_type=application/x-dosexec`
  * Look for files whose names lack the proper extension. Pivot on the MD5/SHA1/SHA256 hashes, and track `tx_host` and `rx_host`.

* **Insight**: 


In [139]:
mime_df = splunkToPandas("""

    search index=corelight_tutorial filename!=*.exe mime_type=application/x-dosexec                                         

    | head 200

    """,
    {'sample_ratio': 1})

print('# rows', len(mime_df))
mime_df.sample(3)

Search results:

results 5
fetching: 0 - 50000
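
The mismatch filter in the query above can also be expressed in plain Python, which helps when re-checking exported records offline. This is a minimal sketch on made-up records; `is_mime_mismatch` is a hypothetical helper. It assumes, as in Splunk search, that matching is case-insensitive and that `filename!=*.exe` only matches events that have a `filename` field.

```python
def is_mime_mismatch(record, mime="application/x-dosexec", ext=".exe"):
    """True when a file is detected as `mime` but its name lacks the `ext` extension."""
    filename = record.get("filename")
    if not filename:
        return False  # no filename field, so the filename!=*.exe term cannot match
    is_executable = (record.get("mime_type") or "").lower() == mime
    return is_executable and not filename.lower().endswith(ext)

# Made-up records; field names follow the query above.
records = [
    {"filename": "holiday.jpeg", "mime_type": "application/x-dosexec"},
    {"filename": "setup.exe", "mime_type": "application/x-dosexec"},
    {"filename": "photo.jpeg", "mime_type": "image/jpeg"},
    {"mime_type": "application/x-dosexec"},
]
mismatches = [r for r in records if is_mime_mismatch(r)]
print([r["filename"] for r in mismatches])
```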


### The graph:
??

In [0]:
## Old a

# Old A

Graph Modeling