# String Resemblance Grouping

## Table of Contents

* [Introduction](#introduction)
* [Windows Logs Example](#windows-logs-example)
    * [Dataset](#dataset)
    * [Data Processing and Sampling](#data-processing-and-sampling)
    * [Model Training](#model-training)
    * [Results](#results)
        * [Group Counts](#group-counts)
        * [Number of unique logs trained over](#number-of-unique-logs-trained-over)
        * [Number of groups identified by model training](#number-of-groups-identifed-by-model-training)
        * [Exploring group spread](#exploring-group-spread)
* [Conclusions](#conclusions)
* [References](#references)

## Introduction
String Resemblance Grouping (SRG) is designed to find a subset of representative strings within a large collection of messages. Each representative string defines a group, and the groups categorize the messages for further exploration or triage.
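
To make the idea concrete, the sketch below assigns each message to the most similar of a few hand-picked representatives, using Jaccard distance over character 4-shingles. It is an illustration under assumptions, not SRG's implementation: the helper names (`shingles`, `jaccard_distance`, `assign`), the character-level shingling, and the sample strings are all hypothetical.

```python
def shingles(text, size=4):
    """Return the set of overlapping character substrings of length `size`."""
    if len(text) < size:
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard_distance(a, b, size=4):
    """Return 1 - |A & B| / |A | B| over the two shingle sets."""
    sa, sb = shingles(a, size), shingles(b, size)
    return 1.0 - len(sa & sb) / len(sa | sb)


def assign(message, representatives):
    """Return the representative with the smallest distance to `message`."""
    return min(representatives, key=lambda rep: jaccard_distance(message, rep))


# Hypothetical representatives and messages, for illustration only
reps = ["Info CBS Appl: detect Parent", "Version = 6.1.7600.16385"]
messages = [
    "Info CBS Appl: detectParent: package: Foo",
    "Version = 6.1.7601.23714",
]
groups = {m: assign(m, reps) for m in messages}
```

In this toy setup the representatives are given up front; SRG's job is to discover a suitable set of representatives from the data during `fit`.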

## Windows Logs Example

### Dataset
Example logs are available through Zenodo. This notebook uses the Windows logs.


In [1]:
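import os
import urllib.request

url = 'https://zenodo.org/record/3227177/files/Windows.tar.gz'
filename = '../datasets/Windows.tar.gz'
# Create the target directory first so the download does not fail if it is missing
os.makedirs(os.path.dirname(filename), exist_ok=True)
urllib.request.urlretrieve(url, filename)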

('../datasets/Windows.tar.gz', <http.client.HTTPMessage at 0x7fbcc43e8b50>)

In [2]:
from subprocess import call
call(['tar', '-xzf', '../datasets/Windows.tar.gz', '-C', '../datasets/'])

0

In [3]:
import os
import sys
import glob
srg_path = os.path.abspath('../')
if srg_path not in sys.path:
    sys.path.append(srg_path)
from srg import SRG
import dask_cudf

In [4]:
import time

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)
client

The client reports a `dask_cuda.LocalCUDACluster` with 8 workers (one thread each, 8 threads total) and 503.79 GiB of total memory. Each worker has 62.97 GiB of memory and one Tesla V100-SXM2-32GB GPU (31.75 GiB of GPU memory).


### Data processing and sampling
The logs are loaded into a `dask_cudf` DataFrame. A 0.1% random sample is then drawn for training, and runs of whitespace in each sampled log are collapsed into a single space.

In [5]:
win_log_path = '../datasets/Windows.log'
win_logs = dask_cudf.read_csv(win_log_path, delimiter=',', names=['ts', 'log']).dropna()

In [6]:
def collapse_whitespace(x):
    return ' '.join(x.split())
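
As a quick check of what this cleaning step does, the sketch below runs the same function on a made-up log line and compares it with a regular-expression variant. The sample string and the `collapse_whitespace_re` helper are hypothetical and only serve the illustration.

```python
import re


def collapse_whitespace(x):
    # split() with no arguments splits on runs of whitespace and drops empty pieces
    return ' '.join(x.split())


def collapse_whitespace_re(x):
    # Replace each run of whitespace with one space, then trim both ends
    return re.sub(r'\s+', ' ', x).strip()


# Hypothetical log line with leading, trailing, and repeated whitespace
sample = '  Info   CBS\tAppl:  DetectUpdate \n'
cleaned = collapse_whitespace(sample)
cleaned_re = collapse_whitespace_re(sample)
```

Both variants give the same result on this sample; the `split`/`join` form avoids compiling a pattern and is the one used in this notebook.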

In [7]:
def clean_up(df):
    # Move the partition from GPU (cudf) to host (pandas) so the Python function can be applied
    pdf = df.to_pandas()
    pdf['cleaned'] = pdf['log'].map(collapse_whitespace)
    return pdf

In [8]:
# Sample 0.1% of the logs, then clean each partition
L = win_logs.sample(frac=0.001).map_partitions(clean_up)

In [9]:
len(L)

114535

### Model Training

In [10]:
win_srg = SRG()
win_srg.fit(L, column='cleaned', shingle_size=4)
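
The `shingle_size` argument sets the length of the overlapping pieces (shingles) that are compared between strings. The sketch below shows how that length changes the resemblance of two similar log lines. It assumes character-level shingles and uses hypothetical helpers (`shingles`, `resemblance`); how SRG builds its shingles internally may differ.

```python
def shingles(text, size):
    """Set of overlapping character substrings of length `size`."""
    return {text[i:i + size] for i in range(max(len(text) - size + 1, 1))}


def resemblance(a, b, size):
    """Jaccard resemblance: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, size), shingles(b, size)
    return len(sa & sb) / len(sa | sb)


a = "Version = 6.1.7601.23714"
b = "Version = 6.1.7600.16385"

# Longer shingles are stricter: one differing character spoils more of them
for size in (2, 4, 8):
    print(size, round(resemblance(a, b, size), 3))
```

Smaller shingles tend to make strings look more alike, while larger ones demand longer exact matches, so the choice affects how coarse or fine the resulting groups are.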


### Results

In [11]:
win_labeled = win_srg.transform(L, column='cleaned')

#### Group counts

In [12]:
label_counts = win_labeled[['srg_label', 'srg_rep']].groupby('srg_rep').count().compute()
label_counts

| srg_rep | srg_label (count) |
| --- | --- |
| `Info CBS Appl: DetectUpdate` | 180 |
| `Info CBS Appl: Evaluating package applicability for package Package_30_for_KB2791765~31bf3856ad364e35~amd64~~6.1.1.2` | 2128 |
| `Info CBS Appl: Partial install Status testing` | 70 |
| `Info CBS Appl: SelfUpdate detect` | 45461 |
| `Info CBS Appl: detect Parent` | 8803 |
| `Info CBS Appl: detectParent: package: Microsoft-Windows-RDP-BlueIP-Package-TopLevel~31bf3856ad364e35~amd64~~7.2.7601.16415` | 29 |
| `Info CBS Appl: detectParent: package: Package_60_for_KB3147071~31bf3856ad364e35~amd64~~6.1.1.1` | 8728 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: wow64_microsoft-windows-p..gssystems.resources_31bf3856ad364e35_6.1.7601.22183_pl-pl_c4015429ab7cd276` | 2284 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: wow64_microsoft-windows-systemrestore-main_31bf3856ad364e35_6.1.7601.23136_none_afd04622176b963e` | 414 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: x86_microsoft-windows-a..on-authui.resources_31bf3856ad364e35_6.1.7601.22269_sk-sk_76188dfc5aac5101` | 40709 |


#### Number of unique logs trained over

In [13]:
len(win_labeled['cleaned'].unique())

15610

#### Number of groups identifed by model training

In [14]:
len(win_srg._groups)

45

#### Exploring group spread
The cells below compute the Jaccard distance between each log and its group's representative, then report the count, maximum, mean, and standard deviation of these distances for each group. Lower values indicate logs that sit closer to their representative, so these statistics show how well each group is represented. They can also guide additional subsetting and further SRG training when a large, spread-out group warrants closer investigation.
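
The per-group aggregation can also be sketched in plain Python, which makes explicit what each statistic means. The `rows` data below is hypothetical, and the standard deviation is the sample standard deviation (the pandas default), which is undefined for a group with a single member.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (group label, distance to representative) pairs
rows = [(0, 0.0), (0, 0.5), (0, 0.4), (1, 0.0), (1, 0.0), (2, 0.9)]

# Collect the distances belonging to each group
by_group = defaultdict(list)
for label, dist in rows:
    by_group[label].append(dist)

summary = {
    label: {
        "count": len(dists),
        "max": max(dists),
        "mean": mean(dists),
        # Sample standard deviation needs at least two observations
        "std": stdev(dists) if len(dists) > 1 else float("nan"),
    }
    for label, dists in by_group.items()
}
```

A group whose maximum, mean, and standard deviation are all zero contains only logs identical (in shingle terms) to the representative, while a high mean or standard deviation points to a looser group.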

In [15]:
from srg.fastmap.distances import Jaccard

In [16]:
def rep_dists(x, y):
    return Jaccard(shingle_size=4).calculate(x, y)

In [17]:
def calc_dists(df):
    pdf = df.copy()
    pdf['rep_dist'] = pdf.apply(lambda row: rep_dists(row['cleaned'], row['srg_rep']), axis=1)
    return pdf

In [18]:
labeled_dists = win_labeled.map_partitions(lambda df: calc_dists(df))

In [19]:
avg_rep_dists = labeled_dists[['srg_label', 'srg_rep', 'rep_dist']].groupby(['srg_label', 'srg_rep']).agg({'rep_dist': ['count', 'max', 'mean', 'std']}).compute()

In [20]:
avg_rep_dists

| srg_label | srg_rep | count | max | mean | std |
| --- | --- | --- | --- | --- | --- |
| 0 | `Info CBS Appl: SelfUpdate detect` | 45461 | 0.454545 | 0.441977 | 0.074532 |
| 1 | `Info CBS Appl: DetectUpdate` | 180 | 0.0 | 0.0 | 0.0 |
| 2 | `Info CBS Appl: detect Parent` | 8803 | 0.0 | 0.0 | 0.0 |
| 3 | `Info CBS Appl: Partial install Status testing` | 70 | 0.0 | 0.0 | 0.0 |
| 4 | `Info CBS EvaluateApplicability` | 1684 | 0.909091 | 0.0653 | 0.13285 |
| 5 | `Info CSI 00001083 Transaction merge required` | 35 | 0.976471 | 0.650783 | 0.334025 |
| 6 | `pA = PROCESSOR_ARCHITECTURE_INTEL (0)` | 19 | 1.0 | 0.38238 | 0.215892 |
| 7 | `Version = 6.1.7601.23714` | 59 | 0.6 | 0.323588 | 0.096725 |
| 8 | `Version = 6.1.7600.16385` | 25 | 0.47619 | 0.019048 | 0.095238 |
| 9 | `Version = 11.2.9600.18059` | 18 | 0.307692 | 0.24547 | 0.068588 |
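
One way to act on these statistics is to flag groups that are both large and loosely clustered as candidates for further subsetting. The sketch below does this for a few rows copied from the table above; the thresholds `MIN_COUNT` and `MAX_MEAN_DIST` are hypothetical values chosen for illustration, not recommendations.

```python
# (label, representative, count, max, mean, std) copied from the table above
stats = [
    (0, "Info CBS Appl: SelfUpdate detect", 45461, 0.454545, 0.441977, 0.074532),
    (4, "Info CBS EvaluateApplicability", 1684, 0.909091, 0.0653, 0.13285),
    (5, "Info CSI 00001083 Transaction merge required", 35, 0.976471, 0.650783, 0.334025),
    (7, "Version = 6.1.7601.23714", 59, 0.6, 0.323588, 0.096725),
]

MIN_COUNT = 1000     # hypothetical: only consider large groups
MAX_MEAN_DIST = 0.4  # hypothetical: a mean distance above this counts as spread out

# Keep groups that are large and whose logs sit far from the representative on average
candidates = [
    (label, rep)
    for label, rep, count, _max, mean_dist, _std in stats
    if count >= MIN_COUNT and mean_dist > MAX_MEAN_DIST
]
```

With these thresholds only group 0 is selected: it is the largest group and its logs sit at a mean distance of about 0.44 from the representative.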


## Conclusions
This notebook shows how logs (or any string data) can be grouped syntactically with minimal *a priori* parameterization. After fitting the model to your data, you can compute group statistics for insight into the dataset, or explore specific groups of interest for triage and informed filtering. Such filtering and triage can be particularly helpful in incident investigations and root cause analyses. Fitted models can be saved for later inference.

## References
* [Loghub](https://github.com/logpai/loghub)