<font size=20>A novel pattern-based edit distance for automatic log parsing</font>

This notebook lets users reproduce the experiments described in the paper "A novel pattern-based edit distance for automatic log parsing" by M. Raynal, M.O. Buob and G. Quénot, presented at the International Conference on Pattern Recognition (ICPR) 2022.

# Installation

The installation instructions below target Debian and Ubuntu.

## Requirements

__Python__: the three algorithms tested in this notebook are implemented (at least partially) in Python 3, so a Python interpreter is required, version 3.5 or newer.  
Installation instructions can be found on the [official Python website](https://www.python.org/).

__Jupyter Notebook or JupyterLab__: the experiment pipeline is implemented within this notebook for ease of use. Installation instructions can be found on the [official Jupyter website](https://jupyter.org/).

__The following Python modules can be installed using pip:__
* pytest: `[sudo] pip3 install [-U] pytest`
* pytest-cov: `[sudo] pip3 install [-U] pytest-cov`
* setuptools: `[sudo] pip3 install [-U] setuptools`
* pybgl: `[sudo] pip3 install [-U] pybgl`

__A C++ compiler and the Boost.Python library__:
* g++: `sudo apt install g++`
* Boost.Python: `sudo apt install libboost-python-dev`

__git__: `sudo apt install git`
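
Before cloning the repositories, it can help to confirm that the requirements listed above are actually visible from Python. The following is a minimal sketch, not part of the original notebook: `check_requirements` is a hypothetical helper, and the import names `pytest_cov` and `pybgl` are assumed to match the pip packages of the same name.

```python
import importlib.util
import shutil
import sys

# Import names assumed for the pip packages listed in the requirements.
REQUIRED_MODULES = ["pytest", "pytest_cov", "setuptools", "pybgl"]
# Command-line tools installed through apt in the requirements.
REQUIRED_TOOLS = ["git", "g++"]


def check_requirements(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return a dict mapping each requirement to True if it was found."""
    report = {"python>=3.5": sys.version_info >= (3, 5)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report["module " + name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on the PATH.
        report["tool " + name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for requirement, found in check_requirements().items():
        print("{:<20} {}".format(requirement, "ok" if found else "MISSING"))
```

This sketch only checks that modules and executables can be located. It does not verify their versions, nor that the Boost.Python headers are installed.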

## Log clustering repositories

One repository needs to be cloned and installed for each algorithm.

__Pattern-based clustering__:   
`cd ~/git/`  
`git clone git@github.com:nokia/pattern-clustering.git`  
`cd pattern-clustering`  
`[sudo] python setup.py install [--user]`  

__Drain3__:  
`cd ~/git/`  
`git clone git@github.com:raynalm/Drain3.git`  
`cd Drain3`  
`[sudo] python setup.py install [--user]`  

__LogMine__:  
`cd ~/git/`  
`git clone git@github.com:raynalm/logmine.git`  
`cd logmine`  
`[sudo] python setup.py install [--user]`  

In [1]:
import sys

# Search the current directory first when importing modules.
sys.path.insert(0, '.')

## Loading datasets and hyperparameters

### Dataset

In [2]:
from experiments_pattern_clustering import load_data


ROOT_LOG_PATH = "../logs/"

templates_dict, logs_dict = load_data(ROOT_LOG_PATH)
templates_dict_modified, logs_dict_modified = load_data(ROOT_LOG_PATH, load_modified=True)

### Hyperparameters

In [3]:
from experiments_pattern_clustering import load_pattern_collection

FUNDAMENTAL_COLLECTION = load_pattern_collection("../parameters/fundamental_collection.json")
BASIC_COLLECTION = load_pattern_collection("../parameters/basic_collection.json")
SPECIFIC_COLLECTION = load_pattern_collection("../parameters/specific_collection.json")

## Experiments

In [4]:
from experiments_pattern_clustering import METRICS, TIME, PA, ARI, NUM_CLUSTERS

### LogMine 

In [5]:
from experiments_pattern_clustering import evaluate_logmine_clustering, to_logmine_params

LOGMINE_REPO_PATH = "../../logmine"
print(
    evaluate_logmine_clustering(
        LOGMINE_REPO_PATH,
        "../logs/Apache/Apache_2k.log",
        templates_dict["Apache"],
        METRICS,
        max_dist=0.06,
        logmine_regexps=to_logmine_params(BASIC_COLLECTION)
    )
)

{'parsing accuracy': 0.2905, 'adjusted rand index': 0.9628006226024521, 'time': 0.43281006813049316, 'number of clusters': 4}
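
The output above reports a low parsing accuracy next to a high adjusted Rand index. The sketch below illustrates how the two metrics can diverge, assuming the definitions commonly used in log-parsing benchmarks: a line counts as correctly parsed only if its predicted cluster has exactly the same members as its ground-truth cluster, while the adjusted Rand index scores pairs of lines. The functions `parsing_accuracy` and `adjusted_rand_index` are written for this illustration only. The implementations in `experiments_pattern_clustering` may differ.

```python
from collections import Counter


def _pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def parsing_accuracy(truth, pred):
    """Fraction of lines whose predicted cluster has exactly the same
    members as their ground-truth cluster (assumed definition)."""
    truth_groups, pred_groups = {}, {}
    for i, (t, p) in enumerate(zip(truth, pred)):
        truth_groups.setdefault(t, set()).add(i)
        pred_groups.setdefault(p, set()).add(i)
    correct = sum(
        1 for t, p in zip(truth, pred) if truth_groups[t] == pred_groups[p]
    )
    return correct / len(truth)


def adjusted_rand_index(truth, pred):
    """Pair-counting agreement between two clusterings, corrected for chance."""
    n = len(truth)
    if n < 2:
        return 1.0
    pairs_both = sum(_pairs(c) for c in Counter(zip(truth, pred)).values())
    pairs_truth = sum(_pairs(c) for c in Counter(truth).values())
    pairs_pred = sum(_pairs(c) for c in Counter(pred).values())
    expected = pairs_truth * pairs_pred / _pairs(n)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Toy example: templates B and C are merged into a single predicted cluster.
truth = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
pred = [0] * 6 + [1] * 4
print(parsing_accuracy(truth, pred), adjusted_rand_index(truth, pred))
```

In this toy example, merging two small templates makes every line of both templates count as wrong for parsing accuracy, while most pairs of lines are still grouped consistently. Under these assumed definitions, one merged or split cluster can therefore lower parsing accuracy much more than the adjusted Rand index.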


### Drain

In [6]:
from experiments_pattern_clustering import evaluate_drain_clustering


print(
    evaluate_drain_clustering(
        "../logs/Apache/Apache_2k.log",
        templates_dict["Apache"],
        METRICS,
        BASIC_COLLECTION,
        show_clusters=False,
        sim_th=0.03,
        depth=3
    )
)


{'parsing accuracy': 1.0, 'adjusted rand index': 1.0, 'time': 0.5992441177368164, 'number of clusters': 6}
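
To give some intuition for the `sim_th` parameter used above, here is a simplified sketch of the token-level similarity described in the original Drain paper: the fraction of positions at which a template and a message of equal length carry the same token, with a message joining a cluster when that fraction reaches the threshold. The function `token_similarity` and the sample line are illustrative assumptions. The Drain3 fork used in this notebook may compute similarity differently, for instance in how it treats wildcards.

```python
def token_similarity(template, message, wildcard="<*>"):
    """Fraction of positions where template and message share a literal token.

    Both arguments are lists of tokens of the same length. Wildcard
    positions in the template are not counted as matches.
    """
    if len(template) != len(message):
        raise ValueError("template and message must have the same length")
    if not template:
        return 1.0
    same = sum(
        1 for t, m in zip(template, message) if t != wildcard and t == m
    )
    return same / len(template)


# Illustrative Apache-style line and template, split on whitespace.
template = "jk2_init() Found child <*> in scoreboard slot <*>".split()
message = "jk2_init() Found child 6725 in scoreboard slot 10".split()

SIM_TH = 0.03  # same value as the sim_th argument in the cell above
similarity = token_similarity(template, message)
print(similarity, similarity >= SIM_TH)
```

With a threshold as low as 0.03, almost any message that reaches the same group of candidate templates is accepted under this simplified rule, so the grouping that happens before the similarity test would do most of the separation.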


### Pattern clustering

In [8]:
import string
from experiments_pattern_clustering import evaluate_fast_pattern_clustering, make_map_name_dfa_densities
from siva.multi_grep import MultiGrepFonctorLargest

PC_BASIC_COLLECTION = {**FUNDAMENTAL_COLLECTION, **BASIC_COLLECTION}
ALPHABET = set(string.printable)
MAP_NAME_DFA, MAP_NAME_DENSITY = make_map_name_dfa_densities(PC_BASIC_COLLECTION, ALPHABET)


print(
    evaluate_fast_pattern_clustering(
        "../logs/Apache/Apache_2k.log",
        templates_dict["Apache"],
        METRICS,
        MAP_NAME_DFA,
        MultiGrepFonctorLargest,
        MAP_NAME_DENSITY,
        0.15,
        show_clusters=False,
    )
)

{'parsing accuracy': 0.3065, 'adjusted rand index': 0.5517378285183966, 'time': 5.392209053039551, 'number of clusters': 4}
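
Two details of the setup cell above are easy to overlook: how `{**FUNDAMENTAL_COLLECTION, **BASIC_COLLECTION}` resolves duplicate names, and which characters `set(string.printable)` contains. The sketch below uses small hypothetical dictionaries in place of the loaded collections. Their keys and regular expressions are invented for illustration and are not the contents of the JSON files.

```python
import string

# Hypothetical stand-ins for the loaded collections (pattern name -> regex).
fundamental = {"int": r"[0-9]+", "word": r"[a-zA-Z]+"}
basic = {"int": r"-?[0-9]+", "hex": r"0x[0-9a-fA-F]+"}

# When unpacking several mappings, later ones override earlier ones,
# so a name defined in both keeps the value from `basic`.
merged = {**fundamental, **basic}
print(merged["int"])     # the value from `basic`
print(sorted(merged))    # names from both collections

# string.printable holds the 100 printable ASCII characters:
# digits, letters, punctuation and whitespace.
alphabet = set(string.printable)
print(len(alphabet))
print("a" in alphabet, "\t" in alphabet, "é" in alphabet)
```

One consequence is that characters outside printable ASCII, such as accented letters, are not part of this alphabet. How the pattern-clustering code handles such characters in log lines is not covered by this sketch.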
