# Interactive UMAP visualization with Graphistry for scale and explainability

UMAP is a great algorithm for exploring and clustering rich data. It is commonly used for turning data with many columns into more approachable 2D/3D visualizations, finding clusters, and building cluster classifiers. You may be familiar with earlier dimensionality-reduction and clustering algorithms like PCA, K-Means, and t-SNE, which UMAP generally improves upon for these tasks. As great as UMAP is, traditional UMAP results are still hard to understand for questions like which elements are in a cluster and why, and hard to interact with as you tweak UMAP's many settings. This notebook shows how to use Graphistry and graphs to quickly understand and interact with the results.

We demonstrate working with the leading CPU implementation of UMAP, [umap_learn](https://github.com/lmcinnes/umap). You may also enjoy our [end-to-end GPU tutorial](https://github.com/RAPIDSAcademy/rapidsacademy/blob/master/tutorials/security/tour/Tutorial_3_incident_umap_knn.ipynb) that uses the GPU-accelerated RAPIDS cuML implementation of UMAP, though it currently requires the extra step of manually computing the k-nn.

As a running example, we use a security event log (IPs, timestamps, counts, alert names, ...). We have seen similarly great results in areas like fraud, genomics, and misinformation. The notebook proceeds in these steps:

* Prep 1: Install
* Prep 2: Load and clean data
* Prep 3: Featurization
* Prep 4: Normalize & weight
* UMAP
* Visualize 1: Interactive UMAP using graphs
* Visualize 2: Explainable UMAP connections

## Prep 1: Install

Install umap-learn and graphistry if you have not already.

If you are not running a Graphistry server, you can use a [free Hub account](https://www.graphistry.com/get-started) via the username/password option.

In [1]:
# Already installed in Graphistry & RAPIDS distros
# ! pip install --user umap-learn
# ! pip install --user graphistry

In [2]:
import graphistry, pandas as pd, umap

# To specify Graphistry account & server, use:
# graphistry.register(api=3, username='...', password='...', protocol='https', server='hub.graphistry.com')
# For more options, see https://github.com/graphistry/pygraphistry#configure

## Prep 2: Load and clean data

UMAP works with most tabular data. You can use it with rows that have strings, numbers, dates and more!

The small example below is a set of server logs from security honeypots being attacked.

In [3]:
df = pd.read_csv('../../data/honeypot.csv')
df['victimPort'] = df['victimPort'].astype('uint32')
# Timestamps are stored as epoch seconds; unit='s' converts them directly
df['time(max)'] = pd.to_datetime(df['time(max)'], unit='s')
df['time(min)'] = pd.to_datetime(df['time(min)'], unit='s')
print(df.info())
df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   attackerIP  220 non-null    object        
 1   victimIP    220 non-null    object        
 2   victimPort  220 non-null    uint32        
 3   vulnName    220 non-null    object        
 4   count       220 non-null    int64         
 5   time(max)   220 non-null    datetime64[ns]
 6   time(min)   220 non-null    datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(3), uint32(1)
memory usage: 11.3+ KB
None


Unnamed: 0,attackerIP,victimIP,victimPort,vulnName,count,time(max),time(min)
47,151.252.204.92,172.31.14.66,139,MS08067 (NetAPI),1,2015-02-03 02:11:12,2015-02-03 02:11:12
35,125.64.35.67,172.31.14.66,443,IIS Vulnerability,17,2015-01-11 06:54:02,2014-11-24 19:53:30
61,178.186.90.105,172.31.14.66,445,MS08067 (NetAPI),4,2015-01-24 20:01:50,2015-01-24 19:53:26
64,178.77.190.33,172.31.14.66,445,MS08067 (NetAPI),6,2014-12-30 19:30:04,2014-12-30 19:09:12
206,89.188.229.82,172.31.14.66,445,MS08067 (NetAPI),5,2015-01-15 11:28:46,2015-01-15 11:16:59


## Prep 3: Featurization

UMAP operates on numeric columns, so we create a new table of numeric values using several common feature encodings:
* Replace categorical values like specific IPs, ports, and alert names with many one-hot encoded columns. Ex: For column `"victimIP"`, add many columns like `"victimIP_oh_127.0.0.1"` whose values are 0/1
* Component columns: Split IPs like "172.31.13.124" into parts like "172" vs "31" in case there are phenomena like coordinated IP ranges
* Compute derived and entangled columns, like augmenting the min/max times of an alert being seen with the duration (`max - min`)

While the original data only had 7 columns, the new one has 33. We've worked with 10K+ columns in GPU-accelerated use cases.

You may benefit from using libraries to streamline the featurization and normalization. We use only pandas calls for clarity, and in a way that is directly translatable to cuDF for automatic GPU acceleration on bigger workloads.
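As a minimal sketch of the three encodings, here they are applied to a tiny made-up table. The `toy` frame and its values are hypothetical and only mirror the shape of the honeypot data; the notebook's own cell follows the same idea on the real columns.

```python
import pandas as pd

# Hypothetical three-row log shaped like the honeypot data
toy = pd.DataFrame({
    'attackerIP': ['10.1.2.3', '10.1.9.9', '192.168.0.7'],
    'vulnName': ['A', 'B', 'A'],
    'time(min)': pd.to_datetime(['2015-01-01 00:00:00', '2015-01-01 00:00:00', '2015-01-02 00:00:00']),
    'time(max)': pd.to_datetime(['2015-01-01 00:10:00', '2015-01-03 00:00:00', '2015-01-02 00:00:00']),
})

# 1. One-hot: one 0/1 column per distinct category value
one_hot = pd.get_dummies(toy['vulnName'], prefix='vulnName_oh', dtype='uint8')

# 2. Component columns: split each IP into its four octets
octets = toy['attackerIP'].str.split('.', expand=True).astype('uint8')
octets.columns = [f'attackerIP_{part}' for part in 'abcd']

# 3. Derived column: how long each alert was observed
duration = (toy['time(max)'] - toy['time(min)']).rename('duration')

toy_features = pd.concat([octets, one_hot, duration], axis=1)
print(toy_features)
```

Splitting on `'.'` is a shortcut that assumes well-formed IPv4 strings; the regular-expression approach is stricter about malformed values.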

In [4]:
dummies = [
    # dtype='uint8' keeps the 0/1 columns numeric so they can be rescaled later
    pd.get_dummies(df[c], prefix=f'{c}_oh', dtype='uint8')
    for c in ['victimIP', 'victimPort', 'vulnName']
]
# Split each IP into its four octets; raw strings avoid invalid-escape warnings
encoded_ips = [
    df[[]].assign(
        attackerIP_a = df['attackerIP'].str.extract(r"^(\d+)\.", expand=False).astype('uint8'),
        attackerIP_b = df['attackerIP'].str.extract(r"^\d+\.(\d+)\.", expand=False).astype('uint8'),
        attackerIP_c = df['attackerIP'].str.extract(r"^\d+\.\d+\.(\d+)\.", expand=False).astype('uint8'),
        attackerIP_d = df['attackerIP'].str.extract(r"^\d+\.\d+\.\d+\.(\d+)$", expand=False).astype('uint8'),
        victimIP_a = df['victimIP'].str.extract(r"^(\d+)\.", expand=False).astype('uint8'),
        victimIP_b = df['victimIP'].str.extract(r"^\d+\.(\d+)\.", expand=False).astype('uint8'),
        victimIP_c = df['victimIP'].str.extract(r"^\d+\.\d+\.(\d+)\.", expand=False).astype('uint8'),
        victimIP_d = df['victimIP'].str.extract(r"^\d+\.\d+\.\d+\.(\d+)$", expand=False).astype('uint8')
    )
]
# Keep the original continuous columns and add the derived duration
orig_continuous = [
    df[['victimPort', 'count', 'time(max)', 'time(min)']].assign(
        duration=df['time(max)'] - df['time(min)']
    )
]

df2 = pd.concat(encoded_ips + dummies + orig_continuous, axis=1)
df2.info()  # info() prints directly and returns None
print('new shape:', df2.shape)
df2.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 33 columns):
 #   Column                              Non-Null Count  Dtype          
---  ------                              --------------  -----          
 0   attackerIP_a                        220 non-null    uint8          
 1   attackerIP_b                        220 non-null    uint8          
 2   attackerIP_c                        220 non-null    uint8          
 3   attackerIP_d                        220 non-null    uint8          
 4   victimIP_a                          220 non-null    uint8          
 5   victimIP_b                          220 non-null    uint8          
 6   victimIP_c                          220 non-null    uint8          
 7   victimIP_d                          220 non-null    uint8          
 8   victimIP_oh_172.31.13.124           220 non-null    uint8          
 9   victimIP_oh_172.31.14.66            220 non-null    uint8          
 10  victimPort_oh_

Unnamed: 0,attackerIP_a,attackerIP_b,attackerIP_c,attackerIP_d,victimIP_a,victimIP_b,victimIP_c,victimIP_d,victimIP_oh_172.31.13.124,victimIP_oh_172.31.14.66,...,vulnName_oh_MS08067 (NetAPI),vulnName_oh_MYDOOM Vulnerability,vulnName_oh_MaxDB Vulnerability,vulnName_oh_SYMANTEC Vulnerability,vulnName_oh_TIVOLI Vulnerability,victimPort,count,time(max),time(min),duration
36,125,64,35,67,172,31,14,66,0,1,...,0,1,0,0,0,3128,2,2014-11-13 12:17:18,2014-11-12 03:59:29,1 days 08:17:49
206,89,188,229,82,172,31,14,66,0,1,...,1,0,0,0,0,445,5,2015-01-15 11:28:46,2015-01-15 11:16:59,0 days 00:11:47
37,125,64,35,67,172,31,14,66,0,1,...,0,0,1,0,0,9999,20,2015-01-06 23:39:12,2014-10-01 09:27:12,97 days 14:12:00
91,188,44,107,239,172,31,14,66,0,1,...,1,0,0,0,0,139,2,2015-02-22 09:01:55,2015-02-22 08:39:39,0 days 00:22:16
6,110,39,211,169,172,31,14,66,0,1,...,1,0,0,0,0,445,4,2015-01-01 03:52:15,2015-01-01 03:39:39,0 days 00:12:36


## Prep 4: Normalize & weight
Once you have numeric data, UMAP is still sensitive to how you normalize each column. We do a simple min-max conversion of each column to values between 0 and 1.

Fancier normalizations of some columns might also adjust for aspects like the distribution. Likewise, you can try rescaling specific columns to a wider range, such as 0-10, to increase their relative weight.
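The weighting idea can be sketched as follows. The helper `min_max_scale` and its `weight` parameter are hypothetical names introduced for illustration, and the toy values are made up. With a Euclidean metric, a column that spans 0-10 contributes up to ten times more to each distance than one that spans 0-1.

```python
import pandas as pd

def min_max_scale(col: pd.Series, weight: float = 1.0) -> pd.Series:
    """Rescale a numeric column to [0, weight]; constant columns become 0."""
    span = col.max() - col.min()
    if span == 0:
        # Avoid dividing by zero when every value is identical
        return pd.Series(0.0, index=col.index, name=col.name)
    return (col - col.min()) / span * weight

# Hypothetical toy data
toy = pd.DataFrame({
    'count': [1, 5, 9],
    'victimPort': [80, 445, 445],
    'constant': [7, 7, 7],
})

scaled = pd.DataFrame({
    'count': min_max_scale(toy['count'], weight=10.0),  # up-weighted to 0-10
    'victimPort': min_max_scale(toy['victimPort']),     # default 0-1
    'constant': min_max_scale(toy['constant']),         # all zeros
})
print(scaled)
```

Which columns deserve extra weight is a judgment call that depends on the question being asked, so treat the factor of 10 as a starting point to experiment with.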

In [5]:
df3 = df2.copy()
for c in df3:
    # Min-max scale to [0, 1]; constant columns yield NaN (0/0), so fill with 0
    df3[c] = ((df3[c] - df3[c].min())/(df3[c].max() - df3[c].min())).fillna(0)
print(df3.info())
df3.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 33 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   attackerIP_a                        220 non-null    float64
 1   attackerIP_b                        220 non-null    float64
 2   attackerIP_c                        220 non-null    float64
 3   attackerIP_d                        220 non-null    float64
 4   victimIP_a                          220 non-null    float64
 5   victimIP_b                          220 non-null    float64
 6   victimIP_c                          220 non-null    float64
 7   victimIP_d                          220 non-null    float64
 8   victimIP_oh_172.31.13.124           220 non-null    float64
 9   victimIP_oh_172.31.14.66            220 non-null    float64
 10  victimPort_oh_80                    220 non-null    float64
 11  victimPort_oh_135                   220 non-n

Unnamed: 0,attackerIP_a,attackerIP_b,attackerIP_c,attackerIP_d,victimIP_a,victimIP_b,victimIP_c,victimIP_d,victimIP_oh_172.31.13.124,victimIP_oh_172.31.14.66,...,vulnName_oh_MS08067 (NetAPI),vulnName_oh_MYDOOM Vulnerability,vulnName_oh_MaxDB Vulnerability,vulnName_oh_SYMANTEC Vulnerability,vulnName_oh_TIVOLI Vulnerability,victimPort,count,time(max),time(min),duration
75,0.815315,0.267717,0.632411,0.916,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.036798,0.071429,0.404545,0.404472,0.000117
192,0.378378,0.728346,0.928854,0.772,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.036798,0.080357,0.069167,0.069074,0.000128
203,0.391892,0.976378,0.683794,0.032,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.036798,0.008929,0.275634,0.275661,1e-05
67,0.801802,0.098425,0.822134,0.612,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.036798,0.035714,0.635178,0.635118,0.000111
206,0.396396,0.740157,0.905138,0.324,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.036798,0.035714,0.697024,0.69702,5.5e-05


## UMAP

UMAP has many options -- run `help(umap.UMAP)` for details. It returns two things we use:
* An (x,y) position pair for each record
* A weighted edgelist (sparse matrix) listing higher-value similarities between records

We enrich our original data frame with the x/y positions and create a new one with the edges. Note that we throw away most of the features: we'll get explainable summaries later.
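To make the sparse-matrix-to-edgelist step concrete, here is a small sketch on a made-up 3x3 matrix. `toy_graph` is a hypothetical stand-in for the graph of a fitted model, and the final filtering step assumes the matrix is symmetric, as it is in this toy.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical symmetric similarity matrix over three records
toy_graph = csr_matrix(np.array([
    [0.0, 0.8, 0.0],
    [0.8, 0.0, 0.3],
    [0.0, 0.3, 0.0],
]))

# COO format exposes parallel arrays of (row, col, value) for nonzero entries
coo = toy_graph.tocoo()
toy_edges = pd.DataFrame({'s': coo.row, 'd': coo.col, 'w': coo.data})

# A symmetric matrix lists each pair twice (s->d and d->s); keep one copy
undirected = toy_edges[toy_edges['s'] < toy_edges['d']].reset_index(drop=True)
print(undirected)
```

Keeping both directions is also fine for plotting; deduplicating mainly helps when counting or summarizing edges.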

In [6]:
# see help(umap.UMAP)
umap_options = {
    'n_components': 2,
    'metric': 'euclidean'
}

In [7]:
%%time
embedding = umap.UMAP(**umap_options).fit(df3)
embedding

CPU times: user 6.7 s, sys: 71.7 ms, total: 6.78 s
Wall time: 5.34 s


UMAP(dens_frac=0.0, dens_lambda=0.0)

In [8]:
%%time
coo = embedding.graph_.tocoo()
print('coo lens', len(coo.row), len(coo.col), len(coo.data))
print(coo.row[0:5], coo.col[0:5], coo.data[0:5])
weighted_edges_df = pd.DataFrame({
    's': coo.row,
    'd': coo.col,
    'w': coo.data
})
weighted_edges_df.sample(3)

coo lens 3868 3868 3868
[0 0 0 0 0] [ 47  60  70  91 112] [0.24795613 0.16554515 0.14451326 0.15970446 0.49268547]
CPU times: user 2.2 ms, sys: 0 ns, total: 2.2 ms
Wall time: 1.68 ms


Unnamed: 0,s,d,w
1793,97,131,0.451485
2762,156,193,0.180001
221,12,7,0.441724


In [9]:
nodes_df = pd.concat([
    df,
    pd.DataFrame(embedding.embedding_).rename(columns={0: 'x', 1: 'y'})
], axis=1)
nodes_df['x'] = nodes_df['x'] * 100
nodes_df['y'] = nodes_df['y'] * 100
nodes_df = nodes_df.reset_index().rename(columns={'index': 'n'})
print(nodes_df.info())
nodes_df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   n           220 non-null    int64         
 1   attackerIP  220 non-null    object        
 2   victimIP    220 non-null    object        
 3   victimPort  220 non-null    uint32        
 4   vulnName    220 non-null    object        
 5   count       220 non-null    int64         
 6   time(max)   220 non-null    datetime64[ns]
 7   time(min)   220 non-null    datetime64[ns]
 8   x           220 non-null    float32       
 9   y           220 non-null    float32       
dtypes: datetime64[ns](2), float32(2), int64(2), object(3), uint32(1)
memory usage: 14.7+ KB
None


Unnamed: 0,n,attackerIP,victimIP,victimPort,vulnName,count,time(max),time(min),x,y
134,134,27.54.176.156,172.31.14.66,445,MS08067 (NetAPI),10,2015-01-22 12:53:34,2015-01-22 12:28:32,150.326599,1203.725098
2,2,105.186.127.152,172.31.14.66,445,MS04011 (LSASS),1,2014-12-30 18:59:20,2014-12-30 18:59:20,504.711639,721.059692
199,199,87.116.229.250,172.31.14.66,445,MS08067 (NetAPI),9,2014-10-19 23:59:28,2014-10-19 23:40:39,263.191925,1076.88562
84,84,187.58.58.135,172.31.14.66,445,MS08067 (NetAPI),7,2015-01-03 13:22:37,2015-01-03 13:09:39,136.485535,1302.276855
51,51,176.108.184.1,172.31.14.66,445,MS08067 (NetAPI),11,2014-11-05 03:01:01,2014-11-05 02:33:43,185.407608,843.841431


## Visualize 1: Interactive UMAP using graphs

We first use the UMAP data for an interactive graph visualization you can inspect and manipulate:

* Nodes: Represent the original records
  * Position: From the UMAP embedding
  * Size: Bind to the original 'count' column
  * Color: Use Graphistry's default to autoinfer a community label based on edges (below)
* Edges: Show UMAP's [inferred connectivities](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html) (similarity strengths between records)
  * Color: Edge weight, cold to hot
  * Weights: From UMAP's inferred connectivities
 
You can think of UMAP's edges as being the most important pairwise weighted votes stating "these records should be close together". In force-directed graph layout algorithms, they act as elastic springs that prevent the nodes from drifting too far apart.

The visualization lets you interactively explore phenomena like coloring by alert name and time and drilling into specific clusters. As UMAP is fairly fast here (and can be faster via the cuML flow), we have both a fast visual interaction loop and decently fast coding loop.

In [15]:
# Most of the settings are optional and can be changed on-the-fly in the UI
g = (
    graphistry
    .nodes(nodes_df, 'n')
    .edges(weighted_edges_df, 's', 'd')
    .bind(point_x='x', point_y='y', edge_weight='w')
    .settings(url_params={'play': 0, 'edgeInfluence': 5})
    .encode_edge_color('w', ['maroon', 'pink', 'white'], as_continuous=True)
    .encode_point_size('count')
)

In [11]:
g.plot()


Fascinatingly, when Graphistry's force-directed graph layout algorithm reuses UMAP's inferred edge connectivities, the layout does not significantly change from what UMAP computes. Try hitting the "play" button in the tool to see for yourself! That means the graph-based intuitions for subsequent interactions, such as removing key nodes/edges and reclustering, should be consistent.
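As one way to experiment with removing edges, the sketch below drops weak edges from a made-up edge table and reports which nodes lose all their connections. The cutoff `min_weight` and the toy values are assumptions for illustration, not recommended settings.

```python
import pandas as pd

# Hypothetical weighted edge table over four nodes (0..3)
toy_edges = pd.DataFrame({
    's': [0, 0, 1, 2],
    'd': [1, 2, 2, 3],
    'w': [0.9, 0.1, 0.5, 0.05],
})

min_weight = 0.4  # hypothetical cutoff; tune per dataset
strong_edges = toy_edges[toy_edges['w'] >= min_weight]

# Nodes that no longer touch any edge after filtering
connected = set(strong_edges['s']) | set(strong_edges['d'])
isolated = sorted(set(range(4)) - connected)

print(strong_edges)
print('isolated nodes:', isolated)
```

With the notebook's objects, the analogous experiment would be to pass a filtered copy of `weighted_edges_df` to `g.edges(...)` and plot again.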

## Visualize 2: Explainable UMAP connections

When nodes have many features, even having UMAP's edges showing their nearest neighbors does not clarify which attributes best explain why they are being clustered. For example, the primary partitioning largely follows alert name, but those split into subclusters with interesting designs based on secondary factors like IP address and time.

To visually clarify which features a pair of nodes have in common, we add edges between nodes that UMAP already connects, one for each attribute in common. The more features two nodes share, the more edges connect them. This is similar to how [graphistry.hypergraph(df, ...)['graph'].plot()](https://github.com/graphistry/pygraphistry) works. For simplicity, the algorithm below computes new edges only for exact feature matches.

For initial intuition, we color the edges based on the type -- IP, alert name, etc. -- but it may also make sense to color them by specific values, like a particular IP address or alert name.
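The matching idea can be sketched on a tiny made-up graph. The node and edge values are hypothetical, and `add_prefix` is used here as a shortcut for labeling the source and destination attributes.

```python
import pandas as pd

# Hypothetical nodes and the pairs an embedding step connected
nodes = pd.DataFrame({
    'n': [0, 1, 2],
    'victimPort': [445, 445, 139],
    'vulnName': ['MS08067', 'MS04011', 'MS08067'],
})
edges = pd.DataFrame({'s': [0, 0], 'd': [1, 2]})

# Attach source and destination attributes to every edge
pairs = (
    edges
    .merge(nodes.add_prefix('src_'), left_on='s', right_on='src_n')
    .merge(nodes.add_prefix('dst_'), left_on='d', right_on='dst_n')
)

# Emit one explanatory edge per attribute whose values match exactly
explain = []
for col in ['victimPort', 'vulnName']:
    same = pairs[pairs[f'src_{col}'] == pairs[f'dst_{col}']]
    explain.append(pd.DataFrame({
        's': same['s'],
        'd': same['d'],
        'type': col,                                  # which attribute matched
        'match_val': same[f'src_{col}'].astype(str),  # the shared value
    }))
explain_df = pd.concat(explain, ignore_index=True)
print(explain_df)
```

Here nodes 0 and 1 share a port while nodes 0 and 2 share an alert name, so each pair gains one explanatory edge. Coloring by `match_val` instead of `type` would highlight specific shared values.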

In [12]:
#triple: src_node_EDGE_dst_node
edge_triples = (g
 ._edges
 .merge(g._nodes, left_on=g._source, right_on=g._node)
 .rename(columns={c: f'src_{c}' for c in g._nodes})
 .merge(g._nodes, left_on=g._destination, right_on=g._node)
 .rename(columns={c: f'dst_{c}' for c in g._nodes})
)

#print(edge_triples.columns)
equivs = []
for c in g._nodes:
    equiv = edge_triples[ edge_triples[f'src_{c}'] == edge_triples[f'dst_{c}'] ]
    if len(equiv) > 0:
        equiv = equiv[[g._source, g._destination]].assign(
            type=c,
            match_val=edge_triples[f'src_{c}'],
            w=0.1)
        equiv[c] = edge_triples[f'src_{c}']
        print('adding', c, len(equiv))
        equivs.append(equiv)
    else:
        print('no hits on col', c)

equivs_df = pd.concat(equivs)
equivs_df['match_val'] = equivs_df['match_val'].astype(str)  # mixed-type column: cast to str so Arrow conversion works
#equivs_df.sample(10)

edges2 = pd.concat([g._edges.assign(type='umap', match_val='1'), equivs_df])
g2 = (g
      .edges(edges2)
      #.edges(edges2[edges2['type'] == 'attackerIP'])
      .bind(edge_label='match_val')
      .encode_edge_color('type', categorical_mapping={
          'umap': 'grey',
          'victimIP': 'blue',
          'attackerIP': 'lightblue',
          'victimPort': 'green',
          'vulnName': 'yellow',
          'count': 'white'
      })
)

no hits on col n
adding attackerIP 46
adding victimIP 3704
adding victimPort 3410
adding vulnName 3374
adding count 464
no hits on col time(max)
no hits on col time(min)
no hits on col x
no hits on col y


In [13]:
g2.plot()

In [14]:
print(g2._edges.info())
g2._edges.sample(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14866 entries, 0 to 3863
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   s           14866 non-null  int32  
 1   d           14866 non-null  int32  
 2   w           14866 non-null  float64
 3   type        14866 non-null  object 
 4   match_val   14866 non-null  object 
 5   attackerIP  46 non-null     object 
 6   victimIP    3704 non-null   object 
 7   victimPort  3410 non-null   float64
 8   vulnName    3374 non-null   object 
 9   count       464 non-null    float64
dtypes: float64(3), int32(2), object(5)
memory usage: 1.1+ MB
None


Unnamed: 0,s,d,w,type,match_val,attackerIP,victimIP,victimPort,vulnName,count
540,203,214,0.1,victimPort,445,,,445.0,,
2680,176,27,0.1,victimIP,172.31.13.124,,172.31.13.124,,,
40,132,70,0.1,victimPort,139,,,139.0,,
