# Syslog Resemblance Grouping

## Table of Contents

* Introduction
* Windows Logs Example
    * Dataset
    * Data Processing and Sampling
    * Model Training
    * Results
        * Group Counts
        * Number of unique logs trained over
        * Number of groups identified by model training
        * Exploring group spread
* Conclusions
* References

## Introduction
SRG is designed to find a subset of representative strings within a large collection of messages. Each representative string defines a group, and every message is assigned to one of these groups so the collection can be explored or triaged group by group.

## Windows Logs Example

### Dataset
Example logs from the Loghub collection (see References) are available through Zenodo. This notebook uses the Windows logs.


In [1]:
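import os
import urllib.request
url = 'https://zenodo.org/record/3227177/files/Windows.tar.gz'
filename = '../datasets/Windows.tar.gz'
# Make sure the target directory exists before downloading
os.makedirs(os.path.dirname(filename), exist_ok=True)
urllib.request.urlretrieve(url, filename)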

('../datasets/Windows.tar.gz', <http.client.HTTPMessage at 0x7f8ec00815b0>)

In [3]:
from subprocess import call
call(['tar', '-xzf', '../datasets/Windows.tar.gz', '-C', '../datasets/'])

0

In [10]:
import os
import sys
import glob
srg_path = os.path.abspath('../')
if srg_path not in sys.path:
    sys.path.append(srg_path)
from srg import SRG
import dask_cudf

In [11]:
import time

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)
client

Connection method: Cluster object, Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:40635/status

Workers: 8, Total threads: 8, Total memory: 503.79 GiB
Status: running, Using processes: True

Each of the 8 workers: Total threads: 1, Memory: 62.97 GiB
GPU: Tesla V100-SXM2-32GB, GPU memory: 31.75 GiB


### Data Processing and Sampling
The logs are loaded into a `dask_cudf` DataFrame, a 0.1% sample is drawn for training, and runs of whitespace in each sampled log are collapsed into a single space.

In [12]:
win_log_dir = '../datasets/Windows.log'
win_logs = dask_cudf.read_csv(win_log_dir, delimiter = ',', names = ['ts', 'log']).dropna()

In [13]:
def collapse_whitespace(x):
    return ' '.join(x.split())
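
As a quick check of what this helper does to a single line, the sketch below runs it on one string. The spacing in `raw` is made up for illustration; only the message text is taken from this notebook's results. Because `str.split()` with no arguments splits on any run of whitespace, tabs and newlines are normalized as well, and leading or trailing whitespace is dropped.

```python
def collapse_whitespace(x):
    # split() with no arguments splits on any whitespace run
    # and discards leading/trailing whitespace
    return ' '.join(x.split())


# Hypothetical raw line with mixed spaces, a tab, and a trailing newline
raw = "  Info \t CBS    Appl:  detect   Parent \n"
cleaned = collapse_whitespace(raw)
print(repr(cleaned))  # 'Info CBS Appl: detect Parent'
```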

In [14]:
def clean_up(df):
    pdf = df.to_pandas().copy()
    pdf['cleaned'] = pdf.apply(lambda row: collapse_whitespace(row['log']), axis=1)
    return pdf

In [15]:
L = win_logs.sample(frac=0.001).map_partitions(lambda df: clean_up(df))

In [16]:
len(L)

114535

### Model Training

In [17]:
win_srg = SRG()
win_srg.fit(L, column='cleaned', shingle_size=4)


### Results

In [18]:
win_labeled = win_srg.transform(L, column='cleaned')

#### Group counts

In [19]:
label_counts = win_labeled[['srg_label', 'srg_rep']].groupby('srg_rep').count().compute()
label_counts

| srg_rep | srg_label (count) |
| --- | --- |
| `Info CBS Appl: DetectUpdate` | 169 |
| `Info CBS Appl: Evaluating applicability block(non detectUpdate part)` | 1858 |
| `Info CBS Appl: Evaluating package applicability for package Package_34_for_KB2604115~31bf3856ad364e35~amd64~~6.1.1.3` | 2062 |
| `Info CBS Appl: Partial install Status testing` | 58 |
| `Info CBS Appl: SelfUpdate detect` | 45020 |
| `Info CBS Appl: detect Parent` | 8918 |
| `Info CBS Appl: detectParent: package: Package_8_for_KB3127220~31bf3856ad364e35~amd64~~6.1.1.0` | 8909 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: amd64_microsoft-windows-b..t-windows.resources_31bf3856ad364e35_6.1.7601.23126_ja-jp_a9f3dcbf1d3303df` | 26835 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: amd64_system.web_b03f5f7f11d50a3a_6.1.7601.18410_none_83d72e4ebeaa89e8` | 47 |
| `Info CBS Applicability(ComponentAnalyzerEvaluateSelfUpdate): Component: msil_system.web.regularexpressions_b03f5f7f11d50a3a_6.1.7601.18758_none_21ead132db2029f7` | 49 |


#### Number of unique logs trained over

In [20]:
len(win_labeled['cleaned'].unique())

15665

#### Number of groups identified by model training

In [21]:
len(win_srg._groups)

51

#### Exploring group spread
The cells below calculate the Jaccard distance between each log and its group representative (a distance of 0 means the two have identical shingle sets), then report the count, maximum, mean, and standard deviation of these distances for each group. These statistics indicate how well each representative describes its group. They can also guide additional subsetting and further SRG training when a large, widely spread group warrants closer investigation.
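
For intuition about the distance itself, here is a minimal pure-Python sketch of a Jaccard distance over shingles. It assumes shingles are overlapping 4-character substrings; that is an assumption about what `Jaccard(shingle_size=4)` does, and the library's actual implementation may tokenize or handle edge cases differently. The helper names `shingles` and `jaccard_distance` are hypothetical.

```python
def shingles(text, size=4):
    """Return the set of overlapping character substrings of length `size`."""
    if len(text) <= size:
        # Strings shorter than one shingle are treated as a single shingle
        return {text}
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard_distance(a, b, size=4):
    """Compute 1 - |A & B| / |A | B| over the two shingle sets."""
    sa, sb = shingles(a, size), shingles(b, size)
    return 1.0 - len(sa & sb) / len(sa | sb)


rep = "Info CBS Appl: detect Parent"
same = jaccard_distance(rep, rep)                           # identical strings: 0.0
near = jaccard_distance(rep, "Info CBS Appl: detectParent")  # partial overlap
far = jaccard_distance(rep, "Version = 11.2")                # no shared shingles: 1.0
print(same, near, far)
```

Under this definition a group whose members all equal the representative has a maximum, mean, and standard deviation of 0, which matches the pattern seen for several groups in the results.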

In [22]:
from srg.fastmap.distances import Jaccard

In [23]:
def rep_dists(x, y):
    return Jaccard(shingle_size=4).calculate(x, y)

In [24]:
def calc_dists(df):
    pdf = df.copy()
    pdf['rep_dist'] = pdf.apply(lambda row: rep_dists(row['cleaned'], row['srg_rep']), axis=1)
    return pdf

In [25]:
labeled_dists = win_labeled.map_partitions(lambda df: calc_dists(df))

In [26]:
avg_rep_dists = labeled_dists[['srg_label', 'srg_rep', 'rep_dist']].groupby(['srg_label', 'srg_rep']).agg({'rep_dist': ['count', 'max', 'mean', 'std']}).compute()

In [27]:
avg_rep_dists

| srg_label | srg_rep | rep_dist count | rep_dist max | rep_dist mean | rep_dist std |
| --- | --- | --- | --- | --- | --- |
| 0 | `Info CBS Appl: SelfUpdate detect` | 45020 | 0.454545 | 0.441602 | 0.075605 |
| 1 | `Info CBS Appl: DetectUpdate` | 169 | 0.0 | 0.0 | 0.0 |
| 2 | `Info CBS Appl: detect Parent` | 8918 | 0.0 | 0.0 | 0.0 |
| 3 | `Info CBS Appl: Partial install Status testing` | 58 | 0.0 | 0.0 | 0.0 |
| 4 | `Info CBS EvaluateApplicability` | 1729 | 0.911765 | 0.070192 | 0.142489 |
| 5 | `Info CSI 000007e5 Transaction merge required` | 34 | 0.97619 | 0.58755 | 0.337404 |
| 6 | `pA = PROCESSOR_ARCHITECTURE_AMD64 (9)` | 18 | 1.0 | 0.284296 | 0.338176 |
| 7 | `Version = 11.2.9600.17914` | 13 | 0.818182 | 0.312444 | 0.176746 |
| 8 | `Version = 11.2` | 4 | 0.72 | 0.54 | 0.36 |
| 9 | `Version = 6.2.7601.18514` | 3 | 0.6875 | 0.429167 | 0.374235 |
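
One possible way to act on these statistics is to flag groups that are both large and far from their representative on average, since those are the candidates for further subsetting and retraining. The sketch below does this in plain Python on the rows shown above. The thresholds `MIN_COUNT` and `MIN_MEAN_DIST` and the helper `spread_out_groups` are hypothetical and would need tuning for a real dataset.

```python
# (srg_label, srg_rep, count, max, mean, std) copied from the table above
group_stats = [
    (0, "Info CBS Appl: SelfUpdate detect", 45020, 0.454545, 0.441602, 0.075605),
    (1, "Info CBS Appl: DetectUpdate", 169, 0.0, 0.0, 0.0),
    (2, "Info CBS Appl: detect Parent", 8918, 0.0, 0.0, 0.0),
    (3, "Info CBS Appl: Partial install Status testing", 58, 0.0, 0.0, 0.0),
    (4, "Info CBS EvaluateApplicability", 1729, 0.911765, 0.070192, 0.142489),
    (5, "Info CSI 000007e5 Transaction merge required", 34, 0.97619, 0.58755, 0.337404),
    (6, "pA = PROCESSOR_ARCHITECTURE_AMD64 (9)", 18, 1.0, 0.284296, 0.338176),
    (7, "Version = 11.2.9600.17914", 13, 0.818182, 0.312444, 0.176746),
    (8, "Version = 11.2", 4, 0.72, 0.54, 0.36),
    (9, "Version = 6.2.7601.18514", 3, 0.6875, 0.429167, 0.374235),
]

MIN_COUNT = 1000      # hypothetical: ignore small groups
MIN_MEAN_DIST = 0.25  # hypothetical: mean distance that counts as "spread out"


def spread_out_groups(stats, min_count=MIN_COUNT, min_mean=MIN_MEAN_DIST):
    """Return rows whose group is large and, on average, far from its representative."""
    return [row for row in stats if row[2] >= min_count and row[4] >= min_mean]


candidates = spread_out_groups(group_stats)
for label, rep, count, _max, mean, _std in candidates:
    print(f"group {label} ({rep!r}): {count} logs, mean distance {mean:.3f}")
```

With these thresholds only group 0 is flagged: it holds 45020 logs at a mean distance of about 0.44 from its representative, whereas group 4 is large but tightly clustered (mean about 0.07).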


## Conclusions
This notebook shows how logs (or any string data) can be syntactically grouped with minimal *a priori* parameterization. After fitting the model to your data, you can calculate group statistics for insight into the dataset, or explore specific groups of interest for triage and informed filtering. Models can be saved for future inference.

## References
https://github.com/logpai/loghub