Add ULP from ICSME'22
zhujiem committed Sep 7, 2023
1 parent fcb019d commit a90854d
Showing 13 changed files with 460 additions and 7 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -44,6 +44,7 @@ Logparser provides a machine learning toolkit and benchmarks for automated log p
| ICWS'17 | [Drain](https://github.com/logpai/logparser/tree/main/logparser/Drain#drain) | [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.|
| ICPC'18 | [MoLFI](https://github.com/logpai/logparser/tree/main/logparser/MoLFI#molfi) | [A Search-based Approach for Accurate Identification of Log Message Formats](http://publications.uni.lu/bitstream/10993/35286/1/ICPC-2018.pdf), by Salma Messaoudi, Annibale Panichella, Domenico Bianculli, Lionel Briand, Raimondas Sasnauskas. |
| TSE'20 | [Logram](https://github.com/logpai/logparser/tree/main/logparser/Logram#logram) | [Logram: Efficient Log Parsing Using n-Gram Dictionaries](https://arxiv.org/pdf/2001.03038.pdf), by Hetong Dai, Heng Li, Che-Shao Chen, Weiyi Shang, and Tse-Hsun (Peter) Chen. |
| ICSME'22 | [ULP](https://github.com/logpai/logparser/tree/main/logparser/ULP#ULP) | [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), by Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. |

:bulb: Welcome to submit a PR to push your parser code to logparser and add your paper to the table.

1 change: 1 addition & 0 deletions THIRD_PARTIES.md
@@ -8,3 +8,4 @@ The logparser package is built on top of the following third-party libraries:
| MoLFI | https://github.com/SalmaMessaoudi/MoLFI | Apache-2.0 |
| alignment (LogMine) | https://gist.github.com/aziele/6192a38862ce569fe1b9cbe377339fbe | GPL |
| Logram | https://github.com/BlueLionLogram/Logram | NA |
| ULP | https://github.com/SRT-Lab/ULP | MIT |
2 changes: 0 additions & 2 deletions docs/tools/Drain.md
@@ -7,8 +7,6 @@ Drain is one of the representative algorithms for log parsing. It can parse logs

Drain first preprocesses logs according to user-defined domain knowledge, i.e., regex. Second, Drain starts from the root node of the parse tree with the preprocessed log message. The 1st-layer nodes in the parse tree represent log groups whose log messages have different lengths. Third, Drain traverses from a 1st-layer node to a leaf node, selecting the next internal node by the tokens in the beginning positions of the log message. Drain then calculates the similarity between the log message and the log event of each log group to decide whether to put the log message into an existing log group. Finally, Drain updates the parse tree by scanning the tokens in the same positions of the log message and the log event.
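
The token-position similarity used in the third step can be sketched as follows (a minimal illustration, not Drain's exact implementation; the message, template, and threshold below are made up):

```python
def seq_similarity(message_tokens, template_tokens):
    """Fraction of positions where the message token equals the template token.
    In this simplified version, "<*>" template positions do not count as matches."""
    assert len(message_tokens) == len(template_tokens)
    same = sum(1 for m, t in zip(message_tokens, template_tokens) if m == t)
    return same / len(template_tokens)

# The message joins an existing log group only if the similarity exceeds a threshold st.
msg = "Receiving block blk_3587 src: /10.0.0.1 dest: /10.0.0.2".split()
tpl = "Receiving block <*> src: <*> dest: <*>".split()
print(seq_similarity(msg, tpl))  # 0.571..., compared against st
```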



Read more information about Drain from the following paper:

+ Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. [Drain: An Online Log Parsing Approach with Fixed Depth Tree](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf), *IEEE International Conference on Web Services (ICWS)*, 2017.
4 changes: 2 additions & 2 deletions docs/tools/LKE.md
@@ -1,13 +1,13 @@
LKE
===

LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine-tuning, log keys are generated from the resulting clusters.

**Step 1**: Log clustering. Weighted edit distance is designed to evaluate the similarity between two logs: WED=\sum_{i=1}^{n}\frac{1}{1+e^{x_{i}-v}}, where n is the number of edit operations needed to make the two logs identical, x_{i} is the column index of the word edited by the i-th operation, and v is a parameter that controls the weight. LKE links two logs if the WED between them is less than a threshold \sigma. After going through all pairs of logs, each connected component is regarded as a cluster. The threshold \sigma is calculated automatically by applying K-means clustering to the WED values of all pairs of logs to separate them into 2 groups; the largest distance in the group containing the smaller WEDs is selected as the value of \sigma.
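
A small sketch of the WED formula above (illustrative only; the edit-operation column indices and the parameter v are made-up values):

```python
import math

def wed(edit_columns, v):
    """Weighted edit distance: an edit at word column x_i contributes 1 / (1 + exp(x_i - v)),
    so edits near the beginning of a log message weigh more than edits near the end."""
    return sum(1.0 / (1.0 + math.exp(x - v)) for x in edit_columns)

# Two logs differing in the words at columns 3 and 7, with weight parameter v = 5:
print(wed([3, 7], v=5))  # 0.881 + 0.119 = 1.0
```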

**Step 2**: Cluster splitting. In this step, some clusters are further partitioned. LKE first finds the longest common sequence (LCS) of all the logs in the same cluster. The rest of each log consists of dynamic parts separated by the common words, such as “/10.251.43.210:55700” or “blk_904791815409399662”. The number of unique words in each dynamic-part column, denoted as |DP|, is counted. For example, |DP|=2 for the dynamic-part column between “src:” and “dest:” in log 2 and log 3. If the smallest |DP| is less than a threshold \phi, LKE will use this dynamic-part column to partition the cluster.
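
A toy illustration of |DP| counting (simplified: the cluster's logs are assumed to have the same length and the LCS is not computed explicitly):

```python
# Align the words of a cluster's logs column by column and count unique words per column.
logs = [
    "Received block blk_1 src: /10.0.0.1:5570 dest: /10.0.0.2:5570",
    "Received block blk_2 src: /10.0.0.1:5570 dest: /10.0.0.3:5570",
]
columns = zip(*[log.split() for log in logs])
dp_sizes = [len(set(col)) for col in columns]
print(dp_sizes)  # [1, 1, 2, 1, 1, 1, 2]; the dynamic-part columns here have |DP| = 2
```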

**Step 3**: Log template extraction. This step is similar to the step 4 of IPLoM. The only difference is that LKE removes all variables when they generate log templates, instead of representing them by wildcards.
**Step 3**: Log template extraction. This step is similar to step 4 of IPLoM. The only difference is that LKE removes all variables when they generate log templates, instead of representing them by wildcards.

Read more information about LKE from the following paper:

2 changes: 1 addition & 1 deletion logparser/LKE/README.md
@@ -1,6 +1,6 @@
# LKE

LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messsages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.
LKE (Log Key Extraction) is one of the representative algorithms for log parsing. It first leverages empirical rules for preprocessing and then uses weighted edit distance for hierarchical clustering of log messages. After further group splitting with fine tuning, log keys are generated from the resulting clusters.

Read more information about LKE from the following paper:

59 changes: 59 additions & 0 deletions logparser/ULP/README.md
@@ -0,0 +1,59 @@
# ULP

ULP (Universal Log Parsing) is a highly accurate log parsing tool with the ability to extract templates from unstructured log data. ULP learns from sample log data to recognize future log events. It combines pattern matching and frequency analysis techniques. First, log events are organized into groups using a text processing method. Frequency analysis is then applied locally to instances of the same group to identify the static and dynamic content of log events. When applied to 10 log datasets of the Loghub benchmark, ULP achieves an average accuracy of 89.2%, outperforming four leading log parsing tools, namely Drain, Logram, Spell, and AEL. Additionally, ULP can parse up to four million log events in less than 3 minutes. ULP can be readily used by practitioners and researchers to parse large log files effectively and efficiently in support of log analysis tasks.
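
A toy sketch of the local frequency-analysis idea (illustrative only, not ULP's exact implementation): within a group of similar events, tokens that appear in every message are kept as static text, while the remaining tokens are masked as dynamic parameters.

```python
from collections import Counter

group = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.7 closed",
]

# A token is treated as static if it occurs in every message of the group, otherwise dynamic.
counts = Counter(token for msg in group for token in msg.split())
template = " ".join(
    tok if counts[tok] == len(group) else "<*>" for tok in group[0].split()
)
print(template)  # Connection from <*> closed
```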

Read more information about ULP from the following paper:

+ Issam Sedki, Abdelwahab Hamou-Lhadj, Otmane Ait-Mohamed, Mohammed A. Shehab. [An Effective Approach for Parsing Large Log Files](https://users.encs.concordia.ca/~abdelw/papers/ICSME2022_ULP.pdf), *Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME)*, 2022.

### Running

The code has been tested in the following environment:
+ python 3.7.6
+ regex 2022.3.2
+ pandas 1.0.1
+ numpy 1.18.1
+ scipy 1.4.1

Run the following script to start the demo:

```
python demo.py
```
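
For reference, a minimal usage sketch of the `LogParser` class in `ULP.py` (the input directory, `log_format`, and file name below are illustrative assumptions following the usual logparser demo layout, not necessarily the shipped `demo.py`):

```python
from logparser.ULP import LogParser

# Illustrative HDFS settings; adjust the paths and log_format for your own dataset.
input_dir = "../../data/loghub_2k/HDFS/"  # directory containing the raw log file (assumed path)
output_dir = "demo_result/"               # parsing results are written here
log_file = "HDFS_2k.log"
log_format = "<Date> <Time> <Pid> <Level> <Component>: <Content>"  # HDFS message header format

parser = LogParser(log_format, indir=input_dir, outdir=output_dir)
parser.parse(log_file)
```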

Run the following script to execute the benchmark:

```
python benchmark.py
```

### Benchmark

Running the benchmark script on the Loghub_2k datasets yields the following results.

| Dataset | F1_measure | Accuracy |
|:-----------:|:----------|:--------|
| HDFS | 0.999984 | 0.9975 |
| Hadoop | 0.999923 | 0.9895 |
| Spark | 0.994593 | 0.922 |
| Zookeeper | 0.999876 | 0.9925 |
| BGL | 0.999453 | 0.93 |
| HPC | 0.994433 | 0.9505 |
| Thunderbird | 0.998665 | 0.6755 |
| Windows | 0.989051 | 0.41 |
| Linux | 0.476099 | 0.3635 |
| Android | 0.971417 | 0.838 |
| HealthApp | 0.993431 | 0.9015 |
| Apache | 1 | 1 |
| Proxifier | 0.739766 | 0.024 |
| OpenSSH | 0.939796 | 0.434 |
| OpenStack | 0.834337 | 0.4915 |
| Mac | 0.981294 | 0.814 |


### Citation

:telescope: If you use our logparser tools or benchmarking results in your publication, please kindly cite the following papers.

+ [**ICSE'19**] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. [Tools and Benchmarks for Automated Log Parsing](https://arxiv.org/pdf/1811.03509.pdf). *International Conference on Software Engineering (ICSE)*, 2019.
+ [**DSN'16**] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. [An Evaluation Study on Log Parsing and Its Use in Log Mining](https://jiemingzhu.github.io/pub/pjhe_dsn2016.pdf). *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)*, 2016.
223 changes: 223 additions & 0 deletions logparser/ULP/ULP.py
@@ -0,0 +1,223 @@
# =========================================================================
# This file is modified from https://github.com/SRT-Lab/ULP
#
# MIT License
# Copyright (c) 2022 Universal Log Parser
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# =========================================================================

import os
import pandas as pd
import regex as re
import time
import warnings
from collections import Counter
from string import punctuation

warnings.filterwarnings("ignore")


class LogParser:
def __init__(self, log_format, indir="./", outdir="./result/", rex=[]):
"""
Attributes
----------
rex : regular expressions used in preprocessing (step1)
path : the input path stores the input log file name
logName : the name of the input file containing raw log messages
savePath : the output path stores the file containing structured logs
"""
self.path = indir
self.indir = indir
self.outdir = outdir
self.logName = None
self.savePath = outdir
self.df_log = None
self.log_format = log_format
self.rex = rex

    def tokenize(self):
        # Preprocess each raw log message: strip quotes and backslashes, drop special
        # characters, mask obvious dynamic variables (MAC addresses, dates, times,
        # hex values and dotted names) with "<*>", and pad brackets and equals signs
        # with spaces so they become separate tokens.
        event_label = []
for idx, log in self.df_log["Content"].iteritems():
tokens = log.split()
tokens = re.sub(r"\\", "", str(tokens))
tokens = re.sub(r"\'", "", str(tokens))
tokens = tokens.translate({ord(c): "" for c in "!@#$%^&*{}<>?\|`~"})

re_list = [
"([\da-fA-F]{2}:){5}[\da-fA-F]{2}",
"\d{4}-\d{2}-\d{2}",
"\d{4}\/\d{2}\/\d{2}",
"[0-9]{2}:[0-9]{2}:[0-9]{2}(?:[.,][0-9]{3})?",
"[0-9]{2}:[0-9]{2}:[0-9]{2}",
"[0-9]{2}:[0-9]{2}",
"0[xX][0-9a-fA-F]+",
"([\(]?[0-9a-fA-F]*:){8,}[\)]?",
"^(?:[0-9]{4}-[0-9]{2}-[0-9]{2})(?:[ ][0-9]{2}:[0-9]{2}:[0-9]{2})?(?:[.,][0-9]{3})?",
"(\/|)([a-zA-Z0-9-]+\.){2,}([a-zA-Z0-9-]+)?(:[a-zA-Z0-9-]+|)(:|)",
]

pat = r"\b(?:{})\b".format("|".join(str(v) for v in re_list))
tokens = re.sub(pat, "<*>", str(tokens))
tokens = tokens.replace("=", " = ")
tokens = tokens.replace(")", " ) ")
tokens = tokens.replace("(", " ( ")
tokens = tokens.replace("]", " ] ")
tokens = tokens.replace("[", " [ ")
event_label.append(str(tokens).lstrip().replace(",", " "))

self.df_log["event_label"] = event_label

return 0

    def getDynamicVars2(self, petit_group):
        # Local frequency analysis: within one group of similar messages, any token that
        # occurs less often than the group's most frequent token is treated as a dynamic variable.
petit_group["event_label"] = petit_group["event_label"].map(
lambda x: " ".join(dict.fromkeys(x.split()))
)
petit_group["event_label"] = petit_group["event_label"].map(
lambda x: " ".join(
filter(None, (word.strip(punctuation) for word in x.split()))
)
)

lst = petit_group["event_label"].values.tolist()

vec = []
big_lst = " ".join(v for v in lst)
this_count = Counter(big_lst.split())

if this_count:
max_val = max(this_count, key=this_count.get)
for word in this_count:
if this_count[word] < this_count[max_val]:
vec.append(word)

return vec

    def remove_word_with_special(self, sentence):
        # Build a grouping key: drop punctuation, keep only purely alphabetic words longer
        # than one character, and append the token count, so messages that share the same
        # static words and length map to the same group (EventId).
sentence = sentence.translate(
{ord(c): "" for c in "!@#$%^&*()[]{};:,/<>?\|`~-=+"}
)
length = len(sentence.split())

finale = ""
for word in sentence.split():
if (
not any(ch.isdigit() for ch in word)
and not any(not c.isalnum() for c in word)
and len(word) > 1
):
finale += word

finale = finale + str(length)
return finale

def outputResult(self):
self.df_log.to_csv(
os.path.join(self.savePath, self.logName + "_structured.csv"), index=False
)

def load_data(self):
headers, regex = self.generate_logformat_regex(self.log_format)

self.df_log = self.log_to_dataframe(
os.path.join(self.path, self.logname), regex, headers, self.log_format
)

def generate_logformat_regex(self, logformat):
"""Function to generate regular expression to split log messages"""
headers = []
splitters = re.split(r"(<[^<>]+>)", logformat)
regex = ""
for k in range(len(splitters)):
if k % 2 == 0:
splitter = re.sub(" +", "\\\s+", splitters[k])
regex += splitter
else:
header = splitters[k].strip("<").strip(">")
regex += "(?P<%s>.*?)" % header
headers.append(header)
regex = re.compile("^" + regex + "$")
return headers, regex

def log_to_dataframe(self, log_file, regex, headers, logformat):
"""Function to transform log file to dataframe"""
log_messages = []
linecount = 0
with open(log_file, "r") as fin:
for line in fin.readlines():
try:
match = regex.search(line.strip())
message = [match.group(header) for header in headers]
log_messages.append(message)
linecount += 1
except Exception as e:
print("[Warning] Skip line: " + line)
logdf = pd.DataFrame(log_messages, columns=headers)
logdf.insert(0, "LineId", None)
logdf["LineId"] = [i + 1 for i in range(linecount)]
return logdf

def parse(self, logname):
start_timeBig = time.time()
print("Parsing file: " + os.path.join(self.path, logname))

self.logname = logname

regex = [r"blk_-?\d+", r"(\d+\.){3}\d+(:\d+)?"]

self.load_data()
self.df_log = self.df_log.sample(n=2000)
self.tokenize()
self.df_log["EventId"] = self.df_log["event_label"].map(
lambda x: self.remove_word_with_special(str(x))
)
groups = self.df_log.groupby("EventId")
keys = groups.groups.keys()
stock = pd.DataFrame()
count = 0

re_list2 = ["[ ]{1,}[-]*[0-9]+[ ]{1,}", ' "\d+" ']

generic_re = re.compile("|".join(re_list2))

for i in keys:
l = []
slc = groups.get_group(i)

template = slc["event_label"][0:1].to_list()[0]
count += 1
if slc.size > 1:
l = self.getDynamicVars2(slc.head(10))
pat = r"\b(?:{})\b".format("|".join(str(v) for v in l))
if len(l) > 0:
template = template.lower()
template = re.sub(pat, "<*>", template)

template = re.sub(generic_re, " <*> ", template)
slc["event_label"] = [template] * len(slc["event_label"].to_list())

stock = stock.append(slc)
stock = stock.sort_index()

self.df_log = stock

self.df_log["EventTemplate"] = self.df_log["event_label"]
if not os.path.exists(self.savePath):
os.makedirs(self.savePath)
self.df_log.to_csv(
os.path.join(self.savePath, logname + "_structured.csv"), index=False
)
elapsed_timeBig = time.time() - start_timeBig
print(f"Parsing done in {elapsed_timeBig} sec")
return 0
1 change: 1 addition & 0 deletions logparser/ULP/__init__.py
@@ -0,0 +1 @@
from .ULP import *
