<font size=20>A novel pattern-based edit distance for automatic log parsing</font>

This notebook shows how to run the three log clustering algorithms considered in [1], namely:
* LogMine
* Drain
* Pattern clustering

[1] _"A novel pattern-based edit distance for automatic log parsing"_, by M. Raynal, M.O. Buob and G. Quénot, in the International Conference on Pattern Recognition 2022.

# Pre-requisites

Please follow [the installation steps](https://github.com/nokia/pattern-clustering/wiki/Experiments-with-Drain-and-Logmine).

# Loading datasets and hyperparameters

## Dataset

Load the log files and their corresponding (modified) templates. Compared to the standard templates, the modified templates are more consistent.

In [None]:
from pathlib import Path
from utils import load_data

LOG_DIR = Path("logs")

# Pass load_modified=False to load the standard templates
(map_name_templates, map_name_lines) = load_data(LOG_DIR, load_modified=True)  

assert map_name_templates["Apache"]  # The templates related to Apache
assert map_name_lines["Apache"]      # The log lines related to Apache

## Pattern collection

Load the pattern collection, used to represent each input log line at the pattern scale.

In [None]:
from pathlib import Path
from utils import load_pattern_collection

PARAMETERS_DIR = Path("parameters")

BASIC_COLLECTION = load_pattern_collection(PARAMETERS_DIR / "basic_collection.json")
assert BASIC_COLLECTION
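
To make "the pattern scale" concrete, here is a minimal sketch of how a log line could be rewritten as a sequence of pattern names. It is only an illustration under stated assumptions: `TOY_COLLECTION` and `to_pattern_scale` are hypothetical names, the three regular expressions are not the content of `basic_collection.json`, and the greedy longest-match scan below stands in for whatever matching `pattern_clustering` actually performs.

```python
import re

# Hypothetical mini-collection mapping pattern names to regular expressions.
# This is NOT the content of basic_collection.json.
TOY_COLLECTION = {
    "int": r"\d+",
    "ipv4": r"\d{1,3}(?:\.\d{1,3}){3}",
    "word": r"[A-Za-z]+",
}


def to_pattern_scale(line, collection):
    """Rewrite `line` as a list of pattern names (assumed helper).

    At each position, the pattern with the longest match wins; on ties,
    the first pattern in `collection` is kept. Characters matched by no
    pattern are kept as literals, except whitespace, which is skipped.
    """
    compiled = {name: re.compile(regex) for name, regex in collection.items()}
    result = []
    i = 0
    while i < len(line):
        best_name, best_end = None, i
        for name, pattern in compiled.items():
            match = pattern.match(line, i)  # anchored match starting at index i
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            if not line[i].isspace():
                result.append(line[i])
            i += 1
        else:
            result.append(best_name)
            i = best_end
    return result


print(to_pattern_scale("error from 10.0.0.1 port 22", TOY_COLLECTION))
# ['word', 'word', 'ipv4', 'word', 'int']
```

Two lines that differ only in their variable parts (here, the address and the port) map to the same sequence of pattern names, which is what makes this representation convenient for clustering.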

## Experiments

For each experiment reported below, you can pass `show_clusters=True` to display the cluster to which each line of the input log file is assigned.
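
Each experiment compares the cluster assigned to every log line against the template that line is expected to match. This notebook does not show what `METRICS` contains, so the sketch below is only an assumption about the kind of score involved: it implements parsing accuracy as commonly defined in log parsing benchmarks, where a line counts as correct only if its predicted cluster contains exactly the same lines as its ground-truth template group. The function name `parsing_accuracy` and the toy labels are hypothetical.

```python
from collections import defaultdict


def parsing_accuracy(true_labels, pred_labels):
    """Fraction of lines whose predicted cluster equals their true group.

    `true_labels[i]` is the template of line i and `pred_labels[i]` is the
    cluster assigned to line i (illustrative inputs, not the `utils` API).
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("Both label sequences must have the same length")
    if not true_labels:
        return 0.0

    # Group line indices by true template and by predicted cluster.
    true_groups = defaultdict(set)
    pred_groups = defaultdict(set)
    for i, (t, p) in enumerate(zip(true_labels, pred_labels)):
        true_groups[t].add(i)
        pred_groups[p].add(i)

    # A line is correct when both groups contain exactly the same lines.
    correct = sum(
        1
        for t, p in zip(true_labels, pred_labels)
        if true_groups[t] == pred_groups[p]
    )
    return correct / len(true_labels)


# Lines 0 and 1 are grouped correctly; lines 2 and 3 are wrongly merged.
print(parsing_accuracy(["A", "A", "B", "C"], [0, 0, 1, 1]))  # 0.5
```

This metric is strict: merging two templates into one cluster, or splitting one template across two clusters, marks every line involved as incorrect.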

### LogMine 

Run the [modified algorithm](https://github.com/raynalm/logmine), as the [standard implementation](https://github.com/trungdq88/logmine) does not return the cluster assigned to each input log line.

In [None]:
from pathlib import Path
from utils import METRICS, evaluate_logmine_clustering, to_logmine_params, canonic_log_name

LOGMINE_REPO_DIR = Path("../../../logmine")
APACHE_LOG_PATH = LOG_DIR / "Apache/Apache_2k.log"
APACHE_LOG_NAME = canonic_log_name(APACHE_LOG_PATH)

print(
    evaluate_logmine_clustering(
        LOGMINE_REPO_DIR,
        APACHE_LOG_PATH,
        map_name_templates[APACHE_LOG_NAME],
        METRICS,
        max_dist=0.06,
        logmine_regexps=to_logmine_params(BASIC_COLLECTION),
        show_clusters=False
    )
)

### Drain

Run the [modified algorithm](https://github.com/raynalm/Drain3), as the [standard implementation](https://github.com/IBM/Drain3) does not return the cluster assigned to each input log line.

In [None]:
from utils import METRICS, evaluate_drain_clustering

print(
    evaluate_drain_clustering(
        APACHE_LOG_PATH,
        map_name_templates[APACHE_LOG_NAME],
        METRICS,
        BASIC_COLLECTION,
        sim_th=0.03,
        depth=3,
        show_clusters=False
    )
)

### Pattern clustering

Run the [`pattern_clustering` algorithm](https://github.com/nokia/pattern-clustering/).

In [None]:
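import string
from utils import METRICS, evaluate_pattern_clustering, make_map_name_dfa_densities
from pattern_clustering.multi_grep import MultiGrepFunctorLargest

# NOTE: FUNDAMENTAL_COLLECTION is not defined in any cell of this notebook;
# it must be imported or defined before running this cell.
# Merge both collections; if a name appears in both, the BASIC_COLLECTION
# entry takes precedence.
PC_BASIC_COLLECTION = {**FUNDAMENTAL_COLLECTION, **BASIC_COLLECTION}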
ALPHABET = set(string.printable)
(MAP_NAME_DFA, MAP_NAME_DENSITY) = make_map_name_dfa_densities(PC_BASIC_COLLECTION, ALPHABET)

print(
    evaluate_pattern_clustering(
        APACHE_LOG_PATH,
        map_name_templates["Apache"],
        METRICS,
        MAP_NAME_DFA,
        MAP_NAME_DENSITY,
        0.15,
        show_clusters=False
    )
)