TADBench is a comprehensive benchmark suite for evaluating trace anomaly detection algorithms. It bundles five labeled trace datasets and seven state-of-the-art detection methods, all implemented under a unified SDK for consistency and ease of use. By standardizing both data formats and code implementations, TADBench enables rigorous, reproducible experiments and gives researchers insight into each method's strengths and limitations, supporting more informed algorithm selection for real-world deployments and bridging the gap between academic research and industrial practice.
To enable efficient algorithm evaluation and cross-dataset comparisons, we implement a unified data format for all trace datasets. New datasets can be easily integrated while maintaining compatibility with existing detection algorithms.
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Span:
    """Represents a single operation in a distributed trace"""
    trace_id: str                        # ID of the trace this span belongs to
    span_id: str                         # Unique ID of this span
    parent_span_id: Optional[str]        # ID of the parent span (None for the root)
    children_span_list: List['Span']     # Direct child spans

    start_time: datetime                 # Timestamp when the operation started
    duration: float                      # Execution time in milliseconds
    service_name: str                    # Service that emitted the span
    anomaly: int = 0                     # 0=normal, 1=abnormal
    status_code: Optional[str] = None    # Status code of the operation, if available
    operation_name: Optional[str] = None # Operation name, if available
    root_cause: Optional[bool] = None    # True if root cause span
    latency: Optional[int] = None        # 1=latency anomaly
    structure: Optional[int] = None      # 1=structural anomaly
    extra: Optional[dict] = None         # Dataset-specific extra attributes

@dataclass
class Trace:
    """Complete trace representation"""
    trace_id: str                        # Unique ID of the trace
    root_span: Span                      # Root span of the call tree
    span_count: int = 0                  # Total number of spans in the trace
    anomaly_type: Optional[int] = None   # 0=normal, 1=latency, 2=structure, 3=both
    source: Optional[str] = None         # Origin info (e.g., source dataset)
```
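As a quick illustration of the unified format, the snippet below assembles a two-span trace by hand. The service names, operation names, and durations are made-up demonstration values, not data from the bundled datasets.

```python
from datetime import datetime

# Hypothetical example: build a minimal trace in the unified format.
root = Span(
    trace_id="t-001", span_id="s-1", parent_span_id=None, children_span_list=[],
    start_time=datetime(2024, 1, 1, 12, 0, 0), duration=120.0,
    service_name="frontend", operation_name="GET /order",
)
child = Span(
    trace_id="t-001", span_id="s-2", parent_span_id="s-1", children_span_list=[],
    start_time=datetime(2024, 1, 1, 12, 0, 0, 5000), duration=80.0,
    service_name="order-service", operation_name="queryOrder",
)
root.children_span_list.append(child)

trace = Trace(trace_id="t-001", root_span=root, span_count=2, anomaly_type=0)
```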
This benchmark uses a standardized template (TADTemplate) to ensure consistent implementation and evaluation of trace anomaly detection algorithms. All algorithms inherit from this abstract base class, which standardizes the format of inputs and outputs across algorithms.

```python
from abc import ABC, abstractmethod


class TADTemplate(ABC):
    def __init__(self, dataset_name=None, data_path=None):
        self.dataset_name = dataset_name
        self.data_path = data_path

    @abstractmethod
    def preprocess_data(self):
        """Preprocess raw trace data (feature engineering, normalization, etc.)"""
        pass

    @abstractmethod
    def train(self):
        """Train the model and save the trained model"""
        pass

    @abstractmethod
    def test(self):
        """Evaluate the trained model on the specified dataset"""
        pass
```

Researchers can add a new detection algorithm by:
- Inheriting from `TADTemplate`
- Implementing all abstract methods
- Maintaining consistent input/output formats
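As an illustration, the sketch below outlines a hypothetical detector built on the template. The class name and its internal logic (a simple per-span duration threshold) are invented for demonstration and do not correspond to any of the bundled algorithms.

```python
class ThresholdDetector(TADTemplate):
    """Hypothetical example: flag a trace as abnormal if any span is too slow."""

    def __init__(self, dataset_name=None, data_path=None, threshold_ms=500.0):
        super().__init__(dataset_name=dataset_name, data_path=data_path)
        self.threshold_ms = threshold_ms
        self.traces = []

    def preprocess_data(self):
        # Load raw data from self.data_path and convert it into Trace objects.
        # Left as a stub here; the loading logic depends on the dataset.
        self.traces = []

    def train(self):
        # This toy detector has no trainable parameters; a real method would
        # fit a model and save its checkpoint here.
        pass

    def test(self, anomaly_type=None):
        # Predict 1 (abnormal) if any span in the trace exceeds the threshold.
        def iter_spans(span):
            yield span
            for child in span.children_span_list:
                yield from iter_spans(child)

        predictions = [
            int(any(s.duration > self.threshold_ms for s in iter_spans(t.root_span)))
            for t in self.traces
        ]
        return predictions
```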
 
Each algorithm implementation includes a main.py entry point with a unified interface. Depending on your task (preprocessing, training, or testing), you can execute the corresponding mode via command-line arguments.

```python
def main(mode, dataset_name, data_path, anomaly_type):
    # TraceAnomaly is one of the bundled detectors; other algorithms follow the same pattern.
    arg = TraceAnomaly(dataset_name=dataset_name, data_path=data_path)
    if mode == 'preprocess':
        arg.preprocess_data()
    elif mode == 'train':
        arg.train()
    elif mode == 'test':
        arg.test(anomaly_type=anomaly_type)
    else:
        print("Please choose a mode from 'preprocess', 'train' and 'test'")
```
If you want to preprocess data before training or testing, use the following command:
```bash
python main.py --mode preprocess --dataset_name <DATASET_NAME> --data_path <DATA_PATH>
```

- `<DATASET_NAME>`: Name of the dataset
- `<DATA_PATH>`: Path to the raw dataset
To train an algorithm on a specific dataset, run:
```bash
python main.py --mode train --dataset_name <DATASET_NAME>
```

This will load the dataset, train the model, and save the checkpoints accordingly.
To evaluate the trained model on the dataset, execute:
```bash
python main.py --mode test --dataset_name <DATASET_NAME> --anomaly_type <ANOMALY_TYPE>
```

This will load the trained model and output evaluation results such as precision, recall, F1-score, and accuracy.
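For reference, the sketch below shows one way these per-trace metrics could be computed with scikit-learn. The helper name and the use of scikit-learn are assumptions for illustration, not a description of the benchmark's internal evaluation code.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def evaluate(labels, predictions):
    """Hypothetical helper: compute the metrics reported in test mode.

    labels / predictions: lists of 0 (normal) / 1 (abnormal), one entry per trace.
    """
    return {
        'precision': precision_score(labels, predictions),
        'recall': recall_score(labels, predictions),
        'f1': f1_score(labels, predictions),
        'accuracy': accuracy_score(labels, predictions),
    }

# Example: evaluate([0, 1, 1, 0], [0, 1, 0, 0]) -> f1 ≈ 0.67, accuracy = 0.75
```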
All datasets are stored in ./Datasets. The following table summarizes the key characteristics of each dataset.
| Dataset | Average Trace Depth | Average Spans per Trace | Granularity | Service Scale | Average Services per Trace | Operation Scale | Average Operations per Trace |
|---|---|---|---|---|---|---|---|
| TrainTicket | 3.3 | 39.0 | Operation | 29 | 5.9 | 64 | 7.6 | 
| GAIA | 4.7 | 9.3 | Service | 10 | 6.4 | - | - | 
| AIOps2020 | 5.5 | 23.1 | Service | 10 | 7.8 | - | - | 
| AIOps2022 | 4.2 | 21.7 | Operation | 40 | 6.9 | 29 | 11.6 | 
| AIOps2023 | 3.8 | 14.5 | Service | 37 | 1.3 | - | - | 
We evaluate seven trace anomaly detection algorithms; the following table lists each algorithm together with its publication venue and reference.
| Algorithm | Publication | Paper Title | 
|---|---|---|
| Multimodal LSTM | CLOUD'19 | S. Nedelkoski, J. Cardoso, and O. Kao, “Anomaly detection from system tracing data using multimodal deep learning,” in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 2019, pp. 179–186. | 
| TraceAnomaly | ISSRE'20 | P. Liu, H. Xu, Q. Ouyang, R. Jiao, Z. Chen, S. Zhang, J. Yang, L. Mo, J. Zeng, W. Xue, and D. Pei, “Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks,” in 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 48–58. | 
| CRISP | USENIX ATC'22 | Z. Zhang, M. K. Ramanathan, P. Raj, A. Parwal, T. Sherwood, and M. Chabbi, “CRISP: Critical path analysis of Large-Scale microservice architectures,” in 2022 USENIX Annual Technical Conference (USENIX ATC 22) Carlsbad, CA: USENIX Association, Jul. 2022, pp. 655–672. | 
| TraceCRL | ESEC/FSE'22 | C. Zhang, X. Peng, T. Zhou, C. Sha, Z. Yan, Y. Chen, and H. Yang, “Tracecrl: contrastive representation learning for microservice trace analysis,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2022. New York, NY, USA: Association for Computing Machinery, 2022, p. 1221–1232. | 
| PUTraceAD | ISSRE'22 | K. Zhang, C. Zhang, X. Peng, and C. Sha, “Putracead: Trace anomaly detection with partial labels based on gnn and pu learning,” in 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE), 2022, pp. 239–250. | 
| TraceVAE | WWW'23 | Z. Xie, H. Xu, W. Chen, W. Li, H. Jiang, L. Su, H. Wang, and D. Pei, “Unsupervised anomaly detection on microservice traces through graph vae,” in Proceedings of the ACM Web Conference 2023, ser. WWW ’23. New York, NY, USA: Association for Computing Machinery, 2023, p. 2874–2884. | 
| GTrace | ESEC/FSE'23 | Z. Xie, C. Pei, W. Li, H. Jiang, L. Su, J. Li, G. Xie, and D. Pei, “From point-wise to group-wise: A fast and accurate microservice trace anomaly detection approach,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023. New York, NY, USA: Association for Computing Machinery, 2023, p. 1739–1749. | 
We record the F1-score and Accuracy for each algorithm on each dataset in the following table.
| Algorithm | TrainTicket F1 (%) | TrainTicket ACC (%) | GAIA F1 (%) | GAIA ACC (%) | AIOps2020 F1 (%) | AIOps2020 ACC (%) | AIOps2022 F1 (%) | AIOps2022 ACC (%) | AIOps2023 F1 (%) | AIOps2023 ACC (%) | 
|---|---|---|---|---|---|---|---|---|---|---|
| Multimodal LSTM | 68.5 | 73.7 | 89.8 | 70.7 | 54.5 | 41.6 | 59.2 | 51.4 | 64.4 | 80.5 | 
| TraceAnomaly | 62.6 | 82.1 | 46.3 | 65.3 | 56.8 | 77.9 | 58.5 | 78.7 | 55.4 | 82.9 | 
| CRISP | 62.5 | 82.0 | 44.5 | 64.1 | 57.0 | 78.0 | 58.1 | 78.5 | 66.4 | 87.2 | 
| PUTraceAD | 95.4 | 98.9 | 68.6 | 84.8 | 48.3 | 78.6 | 68.1 | 88.1 | 74.7 | 95.5 | 
| TraceCRL | 55.6 | 50.9 | 75.0 | 73.9 | 53.0 | 40.8 | 53.2 | 36.2 | 44.2 | 57.4 | 
| TraceVAE | 97.3 | 98.2 | 90.9 | 91.1 | 57.1 | 72.7 | 78.9 | 86.7 | 74.0 | 90.0 | 
| GTrace | 99.4 | 80.3 | 70.9 | 64.5 | 71.8 | 81.9 | 76.5 | 93.8 | 70.3 | 87.6 | 
We also compare the performance of the algorithms on datasets with varying data characteristics. The following figure summarizes the results:

We implement an evaluation leaderboard to systematically compare detection performance across algorithms and datasets.
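As a rough illustration only (the actual leaderboard code is not shown here), the snippet below sketches how per-dataset results might be aggregated into a ranked table with pandas. The layout and column names are assumptions; the sample rows reuse figures from the results table above.

```python
import pandas as pd

# Hypothetical aggregation: one row per (algorithm, dataset) with F1 and accuracy.
results = pd.DataFrame([
    {'algorithm': 'TraceVAE', 'dataset': 'TrainTicket', 'f1': 97.3, 'acc': 98.2},
    {'algorithm': 'GTrace',   'dataset': 'TrainTicket', 'f1': 99.4, 'acc': 80.3},
    # ... one entry per algorithm/dataset pair ...
])

# Rank algorithms by their mean F1-score across datasets.
leaderboard = (results.groupby('algorithm')[['f1', 'acc']]
               .mean()
               .sort_values('f1', ascending=False))
print(leaderboard)
```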
