Triangle: Empowering Incident Triage with Multi-Agents

Triangle is an end-to-end incident triage system using multiple LLM agents to route incidents to appropriate teams. It addresses challenges in cloud service incident management through semantic distillation and multi-role agent negotiation. Experiments show Triangle improves triage accuracy by over 20% while reducing response time. The system has been successfully deployed at a leading technology company serving millions of users.

🌟 Overview

This project implements an intelligent incident triage system using multiple Large Language Model (LLM) agents to analyze and route incidents to the most suitable teams. The system combines TF-IDF similarity matching with semantic analysis to achieve accurate incident assignment. By leveraging both rule-based and machine-learning components, Triangle ensures adaptability, scalability, and continuous improvement over time.

Key Goals

- Efficiency: Reduce incident response time and streamline communication across different teams.
- Accuracy: Increase correct team assignment by understanding the semantic context of incoming incidents.
- Scalability: Seamlessly integrate new capabilities and scale to handle an increasing number of incidents.
- Extensibility: Allow new routing policies, data sources, and integration points to be added with minimal overhead.

🚀 Features

Multi-Agent Architecture
- Triage Decider: Makes final routing decisions.
- Team Manager: Handles team information and negotiations.
- Analyzer: Performs semantic analysis and TF-IDF matching.

Intelligent Matching
- TF-IDF-based similarity scoring.
- Semantic analysis of incident descriptions.
- Multi-hop routing capability for complex incident redirections.
- Team function phrase matching for increased accuracy.

Performance Tracking
- Real-time accuracy monitoring through dashboards.
- Detailed logging of decisions for post-mortem analysis.
- Result analysis and visualization for iterative improvements.

Confidence Estimation
- Confidence scores for each triage decision.
- Threshold-based auto-assignment or manual review process.
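
The threshold-based choice between auto-assignment and manual review can be illustrated with a minimal sketch. The `TriageDecision` class, the `route` function, the `[0, 1]` confidence range, and the default threshold of `0.8` are all assumptions made for illustration; they are not taken from `triangle.py`.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    """Hypothetical record of one triage outcome."""
    incident_id: str
    team: str
    confidence: float  # assumed to lie in [0, 1]


def route(decision: TriageDecision, threshold: float = 0.8) -> str:
    """Return 'auto_assign' if confidence reaches the threshold, else 'manual_review'."""
    if not 0.0 <= decision.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if decision.confidence >= threshold:
        return "auto_assign"
    return "manual_review"


if __name__ == "__main__":
    for d in (
        TriageDecision("INC-1", "network", 0.91),
        TriageDecision("INC-2", "storage", 0.42),
    ):
        print(d.incident_id, d.team, route(d))
```

A higher threshold sends more incidents to manual review; a lower one trades review effort for a greater risk of misrouting.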

⚙️ Architecture

The system consists of three main components, each performing specialized tasks to ensure consistency and efficiency:

Triage Decider (TriageDecider)
- Gathers incident data from the Analyzer and Team Manager.
- Matches incidents with relevant teams based on confidence scores.
- Provides traceable reasoning for each routing decision.
Team Manager (TeamManager)
- Maintains team capability profiles and summary key phrases.
- Negotiates and escalates incidents when multiple teams are possible matches.
- Ensures that team availability and load constraints are respected.
Analyzer (Analyser)
- Performs TF-IDF analysis to derive initial similarity ranks.
- Conducts semantic distillation of incident data.
- Merges findings to produce final similarity and confidence metrics.
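
To make the Analyzer's TF-IDF step concrete, here is a standard-library-only sketch that ranks teams by cosine similarity between an incident message and each team's key phrases. It is not the project's implementation: the function names, the tokenizer, and the smoothed IDF formula are assumptions, and the team records merely follow the `person.json` shape shown under Data Format.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_teams(message, teams):
    """Rank teams (person.json-shaped dicts) by similarity to the message."""
    team_docs = [" ".join(t["summary_key_phrases"]) for t in teams]
    vectors = tfidf_vectors(team_docs + [message])
    incident_vec = vectors[-1]
    scored = [(t["name"], cosine(incident_vec, v)) for t, v in zip(teams, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical teams and incident, for illustration only.
    teams = [
        {"name": "network", "summary_key_phrases": ["dns resolution", "packet loss", "load balancer"]},
        {"name": "storage", "summary_key_phrases": ["disk quota", "blob upload", "volume mount"]},
    ]
    for name, score in rank_teams("Customers report packet loss behind the load balancer", teams):
        print(f"{name}: {score:.3f}")
```

In this sketch the ranking only provides an initial ordering; the semantic distillation and the merge into a final confidence metric described above are not modeled.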

⏱ Performance Tracking and Metrics

Triangle continuously monitors performance indicators to evaluate its effectiveness:

| Metric | Description |
| --- | --- |
| Accuracy | Percentage of correct team assignments |
| Response Time | Average triage completion time |
| Escalation Rate | Frequency of manual interventions or reassignments |
| Confidence Score Mean | Average confidence for automated triage decisions |

By measuring these metrics over time, Triangle helps identify improvements and ensure consistent, data-driven enhancements to incident triage workflows.
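
As an illustration of how the four metrics above could be computed from per-incident records, consider the sketch below. The record fields (`predicted_team`, `actual_team`, `seconds`, `escalated`, `confidence`) are hypothetical and do not reflect the actual output format of Triangle. The sketch also assumes that an escalated incident counts as a manual decision, so the confidence mean covers only non-escalated records.

```python
from statistics import mean


def summarize(records):
    """Compute triage metrics from a list of per-incident dicts (hypothetical schema)."""
    if not records:
        raise ValueError("at least one record is required")
    total = len(records)
    correct = sum(r["predicted_team"] == r["actual_team"] for r in records)
    escalated = sum(bool(r["escalated"]) for r in records)
    # Assumption: only non-escalated incidents count as automated decisions.
    automated = [r["confidence"] for r in records if not r["escalated"]]
    return {
        "accuracy": correct / total,
        "response_time": mean(r["seconds"] for r in records),
        "escalation_rate": escalated / total,
        "confidence_score_mean": mean(automated) if automated else None,
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"predicted_team": "network", "actual_team": "network", "seconds": 10, "escalated": False, "confidence": 0.5},
        {"predicted_team": "network", "actual_team": "network", "seconds": 20, "escalated": False, "confidence": 1.0},
        {"predicted_team": "storage", "actual_team": "storage", "seconds": 30, "escalated": False, "confidence": 0.75},
        {"predicted_team": "storage", "actual_team": "network", "seconds": 40, "escalated": True, "confidence": 0.25},
    ]
    for name, value in summarize(sample).items():
        print(name, value)
```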

📋 Requirements

Python 3.8+
Azure OpenAI API access
Required Python packages (see requirements.txt)

Recommended Environment

A stable internet connection for reliable LLM access.
Sufficient resource allocation (CPU/Memory) for larger incident volumes.

🛠️ Installation & Setup

Clone the repository
```
git clone <url>
cd triangle
```
Setting Up the Virtual Environment

Follow these steps to create and activate a virtual environment, then install the required packages from requirements.txt.

Prerequisites

Python Installation: Ensure Python 3.8 or later is installed on your system. You can download it from the official Python website.
Pip Verification: Verify pip is installed by running pip --version in your terminal or command prompt.

Steps

Create a Virtual Environment

Open your terminal (Linux/Mac) or command prompt (Windows) and navigate to your project directory:
```
cd /path/to/your/project
```
Create a virtual environment named venv:
```
python -m venv venv
```
Activate the Virtual Environment
- Windows:
```
.\venv\Scripts\activate
```
- Linux/Mac:
```
source venv/bin/activate
```
After activation, your terminal prompt will change to indicate that the virtual environment is active.
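
If the prompt change is easy to miss, the activation can also be checked from Python itself. This is a small optional sketch, not part of the repository; `in_virtualenv` is a hypothetical helper name. It relies on the fact that environments created with `python -m venv` make `sys.prefix` differ from `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```
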
Install Required Packages

With the virtual environment activated, install the dependencies listed in requirements.txt:
```
pip install -r requirements.txt
```
This command will read the requirements.txt file and install all necessary packages.

Configure Azure OpenAI Credentials

Create a config.json file with your Azure OpenAI credentials:

{
    "ENDPOINT_URL": "your_azure_endpoint",
    "DEPLOYMENT_NAME": "your_deployment_name",
    "API_VERSION": "your_api_version",
    "API_KEY": "your_api_key"
}
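
Before running the system, it can help to confirm that `config.json` contains all four keys shown above. The sketch below assumes nothing about how `triangle.py` reads the file; `load_config` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration.

```python
import json
from pathlib import Path

# The four keys listed in the config.json example above.
REQUIRED_KEYS = ("ENDPOINT_URL", "DEPLOYMENT_NAME", "API_VERSION", "API_KEY")


def load_config(path="config.json"):
    """Load the JSON config and fail early if a required value is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError("config is missing values for: " + ", ".join(missing))
    return config


if __name__ == "__main__":
    try:
        cfg = load_config()
    except (FileNotFoundError, KeyError) as exc:
        print("configuration problem:", exc)
    else:
        # Avoid printing the API key itself.
        print("endpoint:", cfg["ENDPOINT_URL"])
        print("deployment:", cfg["DEPLOYMENT_NAME"])
```

Because `config.json` holds a secret, consider keeping your filled-in copy out of version control.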

📊 Data Format

Team Data (person.json)

[
    {
        "name": "team_name",
        "summary_key_phrases": ["key_phrase1", "key_phrase2", ...]
    }
]

Incident Data (dataset.json)

[
    {
        "case": "incident_id",
        "message": "incident_description",
        "last_person": "assigned_team"
    }
]
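
A quick consistency check over both files can catch malformed records before a run. The sketch below works on already-parsed lists and uses only the field names shown above. `find_problems` is a hypothetical helper, and the check that `last_person` names a team defined in `person.json` is an assumption inferred from the `assigned_team` placeholder.

```python
TEAM_FIELDS = ("name", "summary_key_phrases")
INCIDENT_FIELDS = ("case", "message", "last_person")


def find_problems(teams, incidents):
    """Return human-readable problems found in team and incident records."""
    problems = []
    for i, team in enumerate(teams):
        for field in TEAM_FIELDS:
            if field not in team:
                problems.append(f"team[{i}] is missing '{field}'")
    known_teams = {t["name"] for t in teams if "name" in t}
    for i, incident in enumerate(incidents):
        for field in INCIDENT_FIELDS:
            if field not in incident:
                problems.append(f"incident[{i}] is missing '{field}'")
        assigned = incident.get("last_person")
        # Assumption: last_person should match a team name from person.json.
        if assigned is not None and assigned not in known_teams:
            problems.append(f"incident[{i}] refers to unknown team '{assigned}'")
    return problems


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    teams = [{"name": "network", "summary_key_phrases": ["dns resolution"]}]
    incidents = [
        {"case": "1", "message": "DNS lookups time out", "last_person": "network"},
        {"case": "2", "message": "Blob upload fails", "last_person": "storage"},
    ]
    for problem in find_problems(teams, incidents):
        print(problem)
```

To run it against the real files, parse them first with `json.load` and pass the resulting lists in.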

Advanced Topics for Data Management

Data Versioning: Use Git LFS or specialized tools to manage large datasets and historical changes.
Privacy & Security: Ensure that only sanitized or anonymized data is shared where needed, and follow your organization's data handling policies.

🎯 Usage

Run the main triage system:

python triangle.py

You can customize parameters in triangle.py to adjust agent behaviors, logging levels, or threshold settings for confidence scores.

📈 Results

When the triage process completes, the system generates detailed results in the results directory, including:

Assignment decisions for each incident
Confidence scores
Performance metrics
Routing paths

Review these results regularly to identify recurring issues and potential improvements in your triage logic.

📖 Contributing

We welcome contributions from the community to make Triangle even better:

Fork the Repo and create your branch from main.
Implement Features or bug fixes in alignment with the project’s guidelines.
Open a Pull Request, detailing your changes, improvements, and testing for easy review.

❓ FAQ

| Question | Answer |
| --- | --- |
| How do I add a new team? | Add a new JSON object in person.json and include relevant key phrases that describe the team’s domain. |
| How do I retrain or refine the model? | Update your training scripts using new incident data, then adjust the Analyzer module accordingly. |
| Is on-prem deployment supported? | Yes. Triangle can be self-hosted, provided your internal environment has stable access to the LLM API. |

🔎 Limitations & Future Work

Language Coverage: While the system supports English data, non-English data may require additional adjustments in the Analyzer.
Contextual Knowledge: Domain-specific knowledge bases can help enrich the semantic matching but are currently not fully integrated.
LLM Dependence: Triage decisions depend on the accuracy, availability, and cost of LLM services.
Future Enhancements: Plans include adding advanced multi-modal interfaces (voice, images) and incorporating user feedback loops for continuous learning.
