The loghub_2k datasets are sampled from the Loghub logs, with 2,000 log messages drawn from each log. The message templates were extracted using regular expressions and then manually validated and annotated. The loghub_2k datasets were first used to benchmark log parsers in "Tools and Benchmarks for Automated Log Parsing" (ICSE 2019).
The loghub_2k_corrected datasets were developed in "Guidelines for Assessing the Accuracy of Log Message Template Identification Techniques" (ICSE 2022), which refines the original loghub_2k datasets by fixing some of their incorrect ground-truth event templates.
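To illustrate what a ground-truth event template means in practice, here is a minimal Python sketch of checking a log message against a template. It assumes templates mark variable parts with a `<*>` wildcard; the helper names, the example template, and the example message are hypothetical and are not taken from the datasets or from any benchmark tooling.

```python
import re
from typing import Optional, Tuple

WILDCARD = "<*>"  # assumed placeholder for the variable parts of a template


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn a template with wildcards into an anchored regular expression."""
    # Escape the constant parts so they are matched literally,
    # then join them with a lazy capture group for each wildcard.
    constant_parts = [re.escape(part) for part in template.split(WILDCARD)]
    return re.compile("^" + "(.*?)".join(constant_parts) + "$")


def extract_parameters(message: str, template: str) -> Optional[Tuple[str, ...]]:
    """Return the wildcard values if the message fits the template, else None."""
    match = template_to_regex(template).match(message)
    return match.groups() if match else None


# Hypothetical template and message, for illustration only.
template = "Received block <*> of size <*> from <*>"
message = "Received block blk_123 of size 67108864 from 10.0.0.1"

print(extract_parameters(message, template))
print(extract_parameters("Connection closed", template))
```

A check like this can help spot messages whose annotated template does not actually match them, which is the kind of ground-truth error the corrected datasets address.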
Loghub provides a large collection of system log datasets, which are freely accessible for AI-driven log analytics research. The raw logs can be accessed at https://github.com/logpai/loghub.
Loghub provides large-scale raw logs but lacks annotated event templates at that scale. To evaluate log parsers in a more rigorous and practical setting, LogPub provides large-scale manual annotations for the raw logs in Loghub. The LogPub datasets can be accessed at https://github.com/logpai/LogPub.