# Analyzing Network Identity Data and Red Team Response with Graphistry AutoML + UMAP

We find a simple model that, when clustered in a 2D plane via UMAP, allows fast identification of anomalous 
computer-to-computer connections.

In [None]:
# ! pip install graphistry[ai] 
! pip install --user --no-deps git+https://github.com/graphistry/pygraphistry.git@cudf-alex3

In [None]:
import os
import pandas as pd
from joblib import load, dump
from collections import Counter

import numpy as np
import matplotlib.pylab as plt

import graphistry
from graphistry.features import topic_model, search_model, ModelDict

In [None]:
graphistry.__version__

In [None]:
np.random.seed(137)

In [None]:
RENDER = True # set to True to render Graphistry UI inline

In [None]:
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username = '..',
                    #os.environ['USERNAME'], 
                    password='..'
                    #os.environ['GRAPHISTRY_PASSWORD']
                   )
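
The commented-out lines above hint at reading credentials from the environment instead of hard-coding them. A minimal sketch of that pattern follows; the helper name `graphistry_credentials` is hypothetical, and the variable names `USERNAME` and `GRAPHISTRY_PASSWORD` are taken from the comments in the cell above.

```python
import os

def graphistry_credentials(env=os.environ):
    """Collect login settings from environment variables, failing early if any are unset."""
    missing = [k for k in ('USERNAME', 'GRAPHISTRY_PASSWORD') if k not in env]
    if missing:
        raise KeyError(f'missing environment variables: {missing}')
    return dict(api=3, protocol='https', server='hub.graphistry.com',
                username=env['USERNAME'], password=env['GRAPHISTRY_PASSWORD'])

# Demonstrated with a plain dict standing in for os.environ (invented values)
creds = graphistry_credentials({'USERNAME': 'alice', 'GRAPHISTRY_PASSWORD': 'secret'})
# graphistry.register(**creds)  # uncomment in the notebook, where graphistry is imported
```

Failing early on a missing variable keeps an empty password from silently reaching the server.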

Alert on & visualize anomalous identity events

Demo dataset: 1.6B Windows events over 58 days => logins by 12K users over 14K systems.
The approach adapts to any identity system with logins. Here we subsample down to a small set of 50K events to prove out the pipeline. 

* => Can we identify accounts & computers acting anomalously? Resources being oddly accessed?
* => Can we spot the red team?
* => Operations: Identity incident alerting + identity data investigations

Community/contact for help handling bigger-than-memory & additional features

Runs on both CPU + multi-GPU
Tools: PyGraphistry[AI], DGL + PyTorch, and NVIDIA RAPIDS / umap-learn

In [None]:
# data source citation
# """A. D. Kent, "Cybersecurity Data Sources for Dynamic Network Research,"
# in Dynamic Networks in Cybersecurity, 2015.

# @InProceedings{akent-2015-enterprise-data,
#    author = {Alexander D. Kent},
#    title = {{Cybersecurity Data Sources for Dynamic Network Research}},
#    year = 2015,
#    booktitle = {Dynamic Networks in Cybersecurity},
#    month =        jun,
#    publisher = {Imperial College Press}
# }"""

# Get the Data


In [None]:
# small sample (get almost equivalent results without overheating computer over the 1.6B events in the full dataset)
df = pd.read_csv('https://gist.githubusercontent.com/silkspace/c7b50d0c03dc59f63c48d68d696958ff/raw/31d918267f86f8252d42d2e9597ba6fc03fcdac2/redteam_50k.csv', index_col=0)
df.head(5)

# Graphistry UMAP in a single line of code

In [None]:
# umap pipeline in one line
g = graphistry.nodes(df.sample(1000)).umap(engine='umap_learn')
g.plot()

In [None]:
print(df.shape) # -> 50+k

In [None]:
# here are the post-facto red team events
red_team = pd.read_csv('https://gist.githubusercontent.com/silkspace/5cf5a94b9ac4b4ffe38904f20d93edb1/raw/888dabd86f88ea747cf9ff5f6c44725e21536465/redteam_labels.csv', index_col=0)
red_team['feats2'] = red_team.feats  # since red team data didn't include metadata

# Modeling

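The cell below writes `auth-feats-one-column.parquet` to the current working directory. Change the path if you want it elsewhere (for example, `mkdir data` first and prefix the filename).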


In [None]:
process = True  
# makes a combined feature we can use for topic modeling!
if process:
    # we create two types of models
    df['feats'] = df.src_computer + ' ' + df.dst_computer + ' ' + df.auth_type + ' ' + df.logontype
    # and one of just computer to computer 
    df['feats2'] = df.src_computer + ' ' + df.dst_computer
    ndf = df.drop_duplicates(subset=['feats'])
    ndf.to_parquet('auth-feats-one-column.parquet')
else:
    ndf = pd.read_parquet('auth-feats-one-column.parquet')
    
print(ndf.shape)
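
The feature-building step above can be sketched on a toy frame. The column names follow the dataset; the rows are invented for illustration.

```python
import pandas as pd

# Toy authentication events (invented rows; column names follow the dataset above)
toy = pd.DataFrame({
    'src_computer': ['C1', 'C1', 'C2', 'C1'],
    'dst_computer': ['C9', 'C9', 'C9', 'C9'],
    'auth_type': ['NTLM', 'NTLM', 'Kerberos', 'Kerberos'],
    'logontype': ['Network', 'Network', 'Network', 'Network'],
})

# Combined text feature with metadata, and a computer-to-computer-only variant
toy['feats'] = toy.src_computer + ' ' + toy.dst_computer + ' ' + toy.auth_type + ' ' + toy.logontype
toy['feats2'] = toy.src_computer + ' ' + toy.dst_computer

# Keep one row per distinct combined feature
unique_feats = toy.drop_duplicates(subset=['feats'])
print(unique_feats.shape)
```

Deduplicating on `feats` keeps rows that share a computer pair but differ in metadata, so the same `feats2` value can still appear more than once.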

## Red Team Data 
Add it to the front of the DataFrame so we can keep track of it

In [None]:
# make a subsampled dataframe with the anom red-team data at top...so we can keep track.
# we don't need the full `df`, only the unique entries of 'feats' in `ndf` for 
# fitting a model (in a few cells below)

tdf = pd.concat([red_team.reset_index(), ndf.reset_index()])
tdf

In [None]:
# add a fiducial index used later
tdf['node'] = range(len(tdf))

In [None]:
# total number of red team events
tdf.RED.sum()

## Enrichment

In [None]:
def get_confidences_per_cluster(g, col='RED', verbose=False):
    """
        For each DBSCAN cluster, count the Red Team events it contains and
        report their share of the cluster as a confidence score.
        
    """
    resses = []
    df = g._nodes
    labels = df._dbscan
    cnt = Counter(labels)
    for clust, count in cnt.most_common():
        res = df[df._dbscan==clust]
        n = res.shape[0]
        n_reds = res[col].sum()
        resses.append([clust, n_reds/n, n_reds, n])
        if n_reds>0 and verbose:
            print('-'*20)
            print(f'cluster: {clust}\n   red {100*n_reds/n:.2f}% or {n_reds} out of {count}')
    conf_dict = {k[0]: k[1] for k in resses}
    confidence = [conf_dict[k] for k in df._dbscan.values]
    # enrichment
    g._nodes['confidence'] = confidence
    conf_df = pd.DataFrame(resses, columns=['_dbscan', 'confidence', 'n_red', 'total_in_cluster'])
    conf_df = conf_df.sort_values(by='confidence', ascending=False)
    return g, conf_df
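
The same per-cluster summary can be expressed with a pandas `groupby`. This is a sketch on an invented node table, not a replacement for the function above; it assumes `RED` is a 0/1 column with no missing values.

```python
import pandas as pd

# Toy node table: `_dbscan` cluster label and a 0/1 `RED` flag (invented values)
nodes = pd.DataFrame({
    '_dbscan': [0, 0, 0, 0, 1, 1, 2],
    'RED':     [0, 0, 0, 1, 1, 1, 0],
})

# One row per cluster: share of red-team events, their count, and cluster size
conf = (nodes.groupby('_dbscan')['RED']
             .agg(confidence='mean', n_red='sum', total_in_cluster='count')
             .reset_index()
             .sort_values(by='confidence', ascending=False))

# Broadcast each cluster's confidence back onto its member rows
nodes['confidence'] = nodes.groupby('_dbscan')['RED'].transform('mean')
print(conf)
```

Because `RED` is binary, its mean within a cluster equals `n_red / n`, which is the confidence computed in the loop above.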
    

# The Full UMAP Pipelines
Fit a model on the `feats2` column (computer-to-computer only)

In [None]:
# this is a convenience method for setting parameters in `g.featurize()/umap()` -- just a verbose dictionary
cyber_model = ModelDict('A topic model for computer to computer', **topic_model)

# umap_params_gpu = {'n_components': 2, 
#                    'n_neighbors': 20,
#                    'min_dist': 0.1, 
#                    'spread': 1, 
#                    'local_connectivity': 1, 
#                    'repulsion_strength': 2, 
#                    'negative_sample_rate': 5}
#cyber_model.update(umap_params_gpu)

cyber_model.update(dict(n_topics=32, X=['feats2']))  # name the column to featurize, which we lumped into `feats2`

cyber_model
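
Assuming `ModelDict` behaves like a plain dictionary, as the comment above suggests, the update-then-unpack pattern can be sketched as follows. The `fit` function and the default values here are hypothetical stand-ins, not Graphistry's actual signature or presets.

```python
# Hypothetical stand-in for a method that accepts keyword arguments, as `g.umap()` does
def fit(X=None, n_topics=10, **other):
    return {'X': X, 'n_topics': n_topics, **other}

defaults = {'n_topics': 20, 'min_words': 100000}   # invented defaults, for illustration only
params = dict(defaults)
params.update(dict(n_topics=32, X=['feats2']))     # later values override earlier ones

result = fit(**params)                             # each key becomes a keyword argument
print(result)
```

Keeping parameters in one dictionary makes it easy to reuse the same settings across several `umap()` calls.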

In [None]:
# if you stop processing during execution, sometimes calling this will unblock you on subsequent calls should it give an error.
#g.reset_caches()

In [None]:
%%time
process = True  # set to false after it's run for ease of speed
if process:
    # ##################################
    g = graphistry.nodes(tdf, 'node')  # two lines does the heavy lifting
    # gpu version, will detect gpu and run
    #g5 = g.umap(engine='auto', **cyber_model, verbose=True).dbscan(min_dist=1, verbose=True)
    
    # cpu version
    g5 = g.umap(engine='umap_learn', **cyber_model, verbose=True).dbscan(min_dist=0.1, verbose=True)
    # #########################
    
    g5, cluster_confidences = get_confidences_per_cluster(g5, verbose=True)
    g5.save_search_instance('auth-feat-topic.search')
else:
    g = graphistry.bind()
    g5 = g.load_search_instance('auth-feat-topic.search')
    g5, cluster_confidences = get_confidences_per_cluster(g5)

In [None]:
# nodes dataframe is now enriched with _dbscan label
g5._nodes._dbscan

In [None]:
# the UMAP coordinates
g5._node_embedding

## Plot Graph
Color by `confidence` and hover over `red` team histogram to see where events occur. Alternatively, color by `_dbscan` assignment.

In [None]:
g5.name('auth test').plot(render=RENDER)

In [None]:
# see how the model has organized features
X = g5._node_features
X

In [None]:
x = g5.get_matrix(['interactive', 'c17', 'microsoft'])
x.plot()

## Predict | Online Mode

Once a model is fit, predict on new batches, as we demonstrate here.

There are three main methods: `g.transform`, `g.transform_umap`, and, if DBSCAN has been run, `g.transform_dbscan`.

See `help(...)` on each to learn more.

One may save the model as above, load it, and wrap in a FastAPI endpoint, etc, to serve in production pipelines.
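
One way such an endpoint might triage a new batch: once each event has a cluster label (for example from `g.transform_dbscan`), look up that cluster's historical confidence and alert only above a threshold. A minimal sketch, in which the `triage` helper, the 0.1 threshold, the label values, and the fallback score for unseen clusters are all assumptions:

```python
def triage(labels, cluster_confidence, threshold=0.1, unseen=1.0):
    """Return indices of events whose cluster confidence meets the threshold.

    Clusters never seen during fitting get the `unseen` score, so that
    novel behavior is surfaced rather than silently dropped (an assumption).
    """
    alerts = []
    for i, label in enumerate(labels):
        score = cluster_confidence.get(label, unseen)
        if score >= threshold:
            alerts.append(i)
    return alerts

cluster_confidence = {0: 0.0, 1: 0.6, 2: 0.02}   # invented per-cluster confidences
new_labels = [0, 1, 2, 7, 1]                     # invented labels; 7 was never seen in fitting
alerts = triage(new_labels, cluster_confidence)
print(alerts)
```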

In [None]:
# first sample a batch from the normal data (auth=df)
sdf = df.sample(200)
emb_normal, xp_normal, _ = g5.transform_umap(sdf, None, kind='nodes', return_graph=False)
# then transform all the red team data
emb_red, xp_red, _ = g5.transform_umap(red_team, None, kind='nodes', return_graph=False)

In [None]:
emb_red

In [None]:
# transform_dbscan will predict on new data (here just red_team to prove it works)
g7 = g5.transform_dbscan(red_team, None, kind='nodes', return_graph=True, verbose=True)

In [None]:
_, ccdf = get_confidences_per_cluster(g7)

In [None]:
print(f'mean confidence across clusters {ccdf.confidence.mean()*100:.2f}%')

In [None]:
g7.plot()

# We can simulate how a batch of new data would behave

In [None]:
# cpu version
plt.figure(figsize=(10,7))
plt.scatter(g5._node_embedding.x, g5._node_embedding.y, c='b', s=60, alpha=0.5)  # the totality of the fit data
plt.scatter(emb_normal.x, emb_normal.y, c='g') # batch of new data
plt.scatter(emb_red.x, emb_red.y, c='r') # red labels to show good cluster separation
plt.scatter(emb_normal.x, emb_normal.y, c='g') # batch of new data, to see if they occlude 

In [None]:
# gpu version
# scatter to see how well it does.
plt.figure(figsize=(10,7))
plt.scatter(g5._node_embedding.x.to_numpy(), g5._node_embedding.y.to_numpy() , c='b', s=60, alpha=0.5)  # the totality of the fit data
plt.scatter(emb_normal.x.to_numpy(), emb_normal.y.to_numpy(), c='g') # batch of new data
plt.scatter(emb_red.x.to_numpy(), emb_red.y.to_numpy(), c='r') # red labels to show good cluster separation
plt.scatter(emb_normal.x.to_numpy(), emb_normal.y.to_numpy(), c='g') # batch of new data, to see if they occlude 

## 96% Reduction in Alerts

Clusters that contain few or no known red-team events can be deprioritized, which greatly shrinks the search space.

Since we have clear cluster assignments along with (post facto) confidences of known anomalous activity, we can reduce the search space on new events (arriving via Kafka, Splunk, etc.).

In [None]:
# share of events that are RED team within clusters of confidence above 10%
p = cluster_confidences[cluster_confidences.confidence>0.1].n_red.sum()/cluster_confidences[cluster_confidences.confidence>0.1].total_in_cluster.sum()
print(f'{100*p:.2f}%')

In [None]:
# number of data points *not* to consider (and it's more if we look at df proper!)
cluster_confidences[cluster_confidences.confidence<0.1].total_in_cluster.sum()

In [None]:
p = cluster_confidences[cluster_confidences.confidence<0.1].total_in_cluster.sum()/cluster_confidences.total_in_cluster.sum()
print(f'Alert Reduction {100*p:.2f}%')
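
The two ratios above can be checked by hand on a small table. The numbers below are invented and only mirror the columns of `cluster_confidences`; they are not results from the dataset.

```python
import pandas as pd

# Invented per-cluster summary with the same columns as `cluster_confidences`
cc = pd.DataFrame({
    '_dbscan': [3, 1, 0, 2],
    'confidence': [0.5, 0.2, 0.0, 0.0],
    'n_red': [5, 2, 0, 0],
    'total_in_cluster': [10, 10, 500, 480],
})

high = cc[cc.confidence > 0.1]   # clusters worth investigating
low = cc[cc.confidence < 0.1]    # clusters that can be deprioritized

# Share of flagged events that are red team, and share of events set aside
precision = high.n_red.sum() / high.total_in_cluster.sum()
reduction = low.total_in_cluster.sum() / cc.total_in_cluster.sum()
print(f'{100 * precision:.2f}% red team among flagged events')
print(f'Alert Reduction {100 * reduction:.2f}%')
```

Note that a cluster with confidence exactly 0.1 falls into neither group under these strict inequalities.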

In [None]:
plt.figure(figsize=(10,7))
plt.plot(np.cumsum([k[2] for k in cluster_confidences.values]))
plt.xlabel('Anomalous Cluster Number')  # shows that we can ignore first clusters (containing most of the alerts)
plt.ylabel('Number of Identified Red Team Events')
print()
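
The cumulative curve can also be read off numerically: how many top-ranked clusters are needed to cover a given share of red-team events. A sketch with invented counts; the 90% target is an arbitrary assumption.

```python
import numpy as np

# Invented red-team counts per cluster, ordered by descending confidence
n_red = np.array([5, 2, 1, 0, 0, 0])

cumulative = np.cumsum(n_red)
coverage = cumulative / cumulative[-1]   # fraction of all red-team events covered so far

# Smallest number of top-ranked clusters that covers at least 90% of red-team events
k = int(np.searchsorted(coverage, 0.9) + 1)
print(cumulative, k)
```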

## Supervised UMAP
Here we use the RED team label to help supervise the UMAP fit. 
This might be useful once teams have actually identified RED team events 
and want to help separate clusters. 
While separation is better, the unsupervised version does well without.

In [None]:
# g.reset_caches()

In [None]:
%%time
process = True
if process:
    # ##################################  # an example of setting features explicitly, could use ModelDict 
    g = graphistry.nodes(tdf, 'node')
    g6 = g.umap(X=['feats'], y =['RED'], 
                min_words=100000, # set high to bypass sbert encoding
                cardinality_threshold=2, # set low to force topic modeling
                n_topics=32,
                spread=1,
                use_scaler_target=None,  # keep labels unscaled
                dbscan=True, engine='umap_learn')  # add dbscan here
    # ##################################
    
    g6, cluster_confidences6  = get_confidences_per_cluster(g6, verbose=True)
    g6.save_search_instance('auth-feat-supervised-topic.search')
else:
    g = graphistry.bind()
    g6 = g.load_search_instance('auth-feat-supervised-topic.search')
    g6, cluster_confidences6  = get_confidences_per_cluster(g6)
    

In [None]:
g6.get_matrix(target=True).astype(int)

### Plot
Color by `confidence` and hover over `red` team histogram to see where events occur. Alternatively, color by `_dbscan` assignment

In [None]:
g6.name('auth topic with supervised umap').plot(render=RENDER)

## A model of Computer-Computer and metadata features
Here we include `auth_type` and `logontype` 

In [None]:
tdf['feats']

In [None]:
%%time
process = True
if process:
    # #####################################
    g = graphistry.nodes(tdf, 'node')
    g7 = g.umap(X=['feats'], #y =['RED'], 
                min_words=100000, 
                cardinality_threshold=2, 
                n_topics=32,
                use_scaler=None,
                use_scaler_target=None, 
                spread=1,
                dbscan=True, engine='auto')  # add dbscan here
    # ###################################
    g7, cluster_confidences7  = get_confidences_per_cluster(g7)
    #g7.save_search_instance('auth-just-ip-topic.search')
else:
    g7 = graphistry.bind().load_search_instance('auth-just-ip-topic.search')
    g7, cluster_confidences7  = get_confidences_per_cluster(g7)


In [None]:
cluster_confidences7

### Plot
Color by `confidence` and hover over `red` team histogram to see where events occur. Alternatively, color by `_dbscan` assignment.

In [None]:
g7.name('auth topic with metadata, no supervision').plot(render=RENDER)
# very similar to the computer-to-computer-only graph, showing that ip-ip is a strong indicator of the phenomenon

# Conditional Probability
Let's see whether the conditional probability of computer-to-computer connections can give us good histograms to tease out red team nodes. This baselines the above UMAP models, and we find, in retrospect, that UMAP wins. 

The conditional graph is, however, useful for seeing aggregate behavior, and coloring by 'red' team shows the topology of infection.

In [None]:
g = graphistry.edges(tdf, "src_computer", "dst_computer")

In [None]:
x='dst_computer'
given='src_computer'
cg = g.conditional_graph(x, given, kind='edges')

In [None]:
# the new edge dataframe assesses the conditional probability of computer-to-computer connections
cprob = cg._edges
cprob
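
For intuition, the textbook estimate of P(dst | src) is the count of each (src, dst) pair divided by the total count for that source. The sketch below computes it on invented edges; it illustrates the concept and is not necessarily how `conditional_graph` computes its values internally.

```python
import pandas as pd

# Invented edges; column names follow the dataset above
edges = pd.DataFrame({
    'src_computer': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'dst_computer': ['C9', 'C9', 'C8', 'C9', 'C7'],
})

# Count each (src, dst) pair, then normalize by the total count per source
pair_counts = edges.groupby(['src_computer', 'dst_computer']).size().rename('n').reset_index()
src_totals = pair_counts.groupby('src_computer')['n'].transform('sum')
pair_counts['p_dst_given_src'] = pair_counts['n'] / src_totals
print(pair_counts)
```

By construction, the probabilities for each source sum to 1.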

In [None]:
# enrich the edges dataframe with the redteam data
# since cprobs lost those labels during the function call
indx = cprob.src_computer.isin(red_team.src_computer) & cprob.dst_computer.isin(red_team.dst_computer)
cprob.loc[indx, 'red'] = 1
cprob.loc[~indx, 'red'] = 0
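
Note that the mask above tests source and destination membership independently, so an edge is flagged whenever its source appears in any red-team row and its destination appears in any red-team row, even if that exact pair never occurred. If exact pair matching is wanted, a left merge on both columns is one option. A sketch on invented data:

```python
import pandas as pd

edges = pd.DataFrame({
    'src_computer': ['C1', 'C1', 'C2', 'C2'],
    'dst_computer': ['C9', 'C8', 'C9', 'C8'],
})
red = pd.DataFrame({
    'src_computer': ['C1', 'C2'],
    'dst_computer': ['C9', 'C8'],
})

# Independent membership test, as in the cell above
loose = edges.src_computer.isin(red.src_computer) & edges.dst_computer.isin(red.dst_computer)

# Exact (src, dst) pair match via a left merge with an indicator column
pairs = red[['src_computer', 'dst_computer']].drop_duplicates()
merged = edges.merge(pairs, on=['src_computer', 'dst_computer'], how='left', indicator=True)
edges['red'] = (merged['_merge'] == 'both').astype(int).to_numpy()
print(int(loose.sum()), int(edges['red'].sum()))
```

The independent test over-counts whenever red-team sources and destinations also take part in ordinary traffic with each other.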

In [None]:
cprob

In [None]:
# add edges back to graphistry instance
cg._edges = cprob

In [None]:
# full condprob graph
cg.plot(render=RENDER)

## Learning
The conditional graph shows that most edge probabilities fall between 4e-7 and 0.03, and that bucket contains most of the events. Thus the chance of finding the red team edges is ~1e-4: slim indeed. UMAP wins.

Likewise, the transpose conditional is even worse, 
with prob_detection ~6e-5.