ClusterLog: Unsupervised Clusterization of Error Messages

Requirements:

Python >= 3.7, < 3.8

This package currently does not work with Python 2.7 because of the kneed library, or with Python 3.8 because of gensim.

editdistance==0.5.3
gensim==3.8.1
kneed==0.5.0
nltk==3.4.5
numpy==1.16.4
pandas==1.0.1
pyonmttok==1.10.1
scikit-learn==0.21.2
matplotlib==3.0.3
hdbscan==0.8.24
python-rake==1.4.5
pytextrank==2.0.1
Jinja2==2.11.1

Run the following command to download the spaCy English model required by the pyTextRank library:

python -m spacy download en_core_web_sm

Input: a Pandas DataFrame with error log messages. The DataFrame may have arbitrary columns and column names, but it must contain an index column with IDs and a column with the text of the log messages. The name of the log column is not fixed, but it must be specified explicitly in the settings as 'target'. A possible DataFrame structure is the following (in this example, target='log_message'):

ID   |   log_message                                                            | timestamp
-----------------------------------------------------------------------------------------------------
 1   |   No events to process: 16000 (skipEvents) >= 2343 (inputEvents of HITS  | 2019-10-01T10:18:49
 2   |   AODtoDAOD got a SIGKILL signal (exit code 137)                         | 2019-10-01T09:01:57
 ...
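
As a minimal sketch, the table above could be built as a DataFrame like this. The rows are copied from the example; the index name "ID" and the use of plain strings for timestamps are assumptions made for illustration.

```python
import pandas as pd

# Sample rows taken from the table above; real input would come from a log store.
df = pd.DataFrame(
    {
        "log_message": [
            "No events to process: 16000 (skipEvents) >= 2343 (inputEvents of HITS",
            "AODtoDAOD got a SIGKILL signal (exit code 137)",
        ],
        "timestamp": ["2019-10-01T10:18:49", "2019-10-01T09:01:57"],
    },
    # The index column carries the message IDs.
    index=pd.Index([1, 2], name="ID"),
)

# Name of the column that holds the log text; this is the 'target' setting.
target = "log_message"
```

Any extra columns, such as timestamp here, are allowed alongside the target column.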

Required input:

df
target

Optional input:

clusterization_settings
- w2v_size (default: 100)
- w2v_window (default: 7)
- min_samples (default: 1)
model_name (path to a file with word2vec model)
mode ('create'(default) | 'update' | 'load')
output_file (path to report file)
add_placeholder (default: False)
threshold (clustering threshold, default = 5000)
matching_accuracy (accuracy threshold, default = 0.8)
clustering_type (ML | SIMILARITY, default=SIMILARITY)
algorithm (dbscan|hdbscan|hierarchical, default=dbscan)

Modes:

create
- Create a word2vec model from a large sample of error logs
- Save it to the file 'word2vec.model' on the server for later use
process
- Load the word2vec model from a file (without retraining the model)
update
- Load the word2vec model from a file and train (update) it with new error logs
- Save the updated model to a file

Clusterization of error log messages is implemented as a chain of methods:

data_preparation - remove all substrings that contain digits from the initial log messages
grouping equals - group the DataFrame by identical cleaned messages
tokenization - split each log message into tokens (pyonmttok, retaining spaces)
tokens_vectorization - train the word2vec model
sentence_vectorization - convert word2vec to a sent2vec model
kneighbors - calculate k-neighbors
epsilon_search - search for the epsilon value for the DBSCAN algorithm
dbscan - run DBSCAN clusterization and return cluster labels
reclusterization - recluster the existing clusters using the Levenshtein distances between sequences of tokens
validation - calculate a similarity score for each cluster
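
The first two steps of the chain can be illustrated with a small standard-library sketch. This is not the library's own code: the function names clean_message and group_equal are hypothetical, and the rule that a "substring with digits" means a whitespace-delimited token containing at least one digit is an assumption.

```python
import re
from collections import defaultdict

# Assumption: a "substring with digits" is any whitespace-delimited token
# that contains at least one digit.
DIGIT_TOKEN = re.compile(r"\S*\d\S*")


def clean_message(message):
    """Remove digit-bearing tokens and collapse the leftover whitespace."""
    without_digits = DIGIT_TOKEN.sub(" ", message)
    return " ".join(without_digits.split())


def group_equal(messages):
    """Map each cleaned message to the IDs of the original messages."""
    groups = defaultdict(list)
    for msg_id, text in messages.items():
        groups[clean_message(text)].append(msg_id)
    return dict(groups)


logs = {
    1: "AODtoDAOD got a SIGKILL signal (exit code 137)",
    2: "AODtoDAOD got a SIGKILL signal (exit code 9)",
    3: "Transfer timed out after 3600 seconds",
}
groups = group_equal(logs)
```

Messages 1 and 2 differ only in the exit code, so they collapse into one group after cleaning, which reduces the number of distinct messages passed to the later steps.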

Output:

The output is available in different views:

ALL - DataFrame grouped by cluster numbers
INDEX - dictionary of lists of indexes for all clusters
TARGET - dictionary of lists of error messages for all clusters
cluster labels - array of cluster labels (as output of DBSCAN -> fit_predict())
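
To show how these views relate, here is a hedged sketch that derives INDEX-like and TARGET-like dictionaries from an array of cluster labels. The helper build_views is hypothetical and only mirrors the shapes described above; the library's own accessors may differ.

```python
from collections import defaultdict


def build_views(ids, messages, labels):
    """Group message IDs and message texts by cluster label."""
    index_view = defaultdict(list)   # label -> list of message IDs
    target_view = defaultdict(list)  # label -> list of message texts
    for msg_id, text, label in zip(ids, messages, labels):
        index_view[label].append(msg_id)
        target_view[label].append(text)
    return dict(index_view), dict(target_view)


ids = [1, 2, 3]
messages = [
    "AODtoDAOD got a SIGKILL signal (exit code 137)",
    "AODtoDAOD got a SIGKILL signal (exit code 9)",
    "Transfer timed out after 3600 seconds",
]
labels = [0, 0, 1]  # one label per message, in the same order as the input

index_view, target_view = build_views(ids, messages, labels)
```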

Cluster statistics:

Cluster statistics are returned as a DataFrame or dictionary with the following fields for each cluster:

cluster_name - name of a cluster
cluster_size - number of log messages in cluster
pattern - all common substrings in messages in the cluster
mean_similarity - average similarity of log messages in cluster
std_similarity - standard deviation of similarity of log messages in cluster
indices - indices of the initial dataframe, corresponding to the cluster
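
The sketch below shows one way such per-cluster statistics could be computed. It is an illustration under stated assumptions, not the library's implementation: similarity is measured with difflib.SequenceMatcher against the first message of each cluster, the standard deviation is the population form, and the pattern field is omitted.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def cluster_statistics(clusters):
    """clusters: dict mapping cluster name -> list of (index, message) pairs."""
    stats = []
    for name, members in clusters.items():
        reference = members[0][1]  # assumption: compare against the first message
        similarities = [
            SequenceMatcher(None, reference, text).ratio() for _, text in members
        ]
        stats.append(
            {
                "cluster_name": name,
                "cluster_size": len(members),
                "mean_similarity": mean(similarities),
                "std_similarity": pstdev(similarities),
                "indices": [idx for idx, _ in members],
            }
        )
    return stats


clusters = {
    "0": [(1, "AODtoDAOD got a SIGKILL signal"), (2, "AODtoDAOD got a SIGKILL signal")],
    "1": [(3, "Transfer timed out"), (4, "Transfer timed out after retry")],
}
stats = cluster_statistics(clusters)
```

A cluster of identical messages has a mean similarity of 1.0 and a standard deviation of 0.0; lower means and larger deviations indicate a more heterogeneous cluster.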

Installation:

pip install clusterlogs

Usage:

from clusterlogs import pipeline

Detailed usage of this library is described in clusterlogs_notebook.ipynb.

Author: maria.grigorieva@cern.ch (Maria Grigorieva)
