Triangle is an end-to-end incident triage system using multiple LLM agents to route incidents to appropriate teams. It addresses challenges in cloud service incident management through semantic distillation and multi-role agent negotiation. Experiments show Triangle improves triage accuracy by over 20% while reducing response time. The system has been successfully deployed at a leading technology company serving millions of users.
This project implements an intelligent incident triage system using multiple Large Language Model (LLM) agents to analyze and route incidents to the most suitable teams. The system combines TF-IDF similarity matching with semantic analysis to achieve accurate incident assignment. By leveraging both rule-based and machine-learning components, Triangle ensures adaptability, scalability, and continuous improvement over time.
- Efficiency: Reduce incident response time and streamline communication across different teams.
- Accuracy: Increase correct team assignment by understanding the semantic context of incoming incidents.
- Scalability: Seamlessly integrate new capabilities and scale to handle an increasing number of incidents.
- Extensibility: Allow new routing policies, data sources, and integration points to be added with minimal overhead.
- Multi-Agent Architecture
  - Triage Decider: Makes final routing decisions.
  - Team Manager: Handles team information and negotiations.
  - Analyzer: Performs semantic analysis and TF-IDF matching.
- Intelligent Matching
  - TF-IDF based similarity scoring.
  - Semantic analysis of incident descriptions.
  - Multi-hop routing capability for complex incident redirections.
  - Team function phrase matching for increased accuracy.
- Performance Tracking
  - Real-time accuracy monitoring through dashboards.
  - Detailed logging of decisions for post-mortem analysis.
  - Result analysis and visualization for iterative improvements.
- Confidence Estimation
  - Confidence scores for each triage decision.
  - Threshold-based auto-assignment or manual review process.
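The TF-IDF similarity scoring listed above can be illustrated with a minimal, standard-library-only sketch. It is not the project's Analyzer: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_teams`) and the smoothed IDF formula are assumptions made for illustration, and the only detail taken from this README is that each team carries a `name` and a list of `summary_key_phrases`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document (smoothed IDF)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_teams(incident_text, teams):
    """Return (team name, similarity) pairs sorted from best to worst match."""
    # Hypothetical choice: treat each team's joined key phrases as one document.
    docs = [" ".join(team["summary_key_phrases"]) for team in teams]
    vectors = tfidf_vectors(docs + [incident_text])
    incident_vector = vectors[-1]
    scores = [
        (team["name"], cosine(incident_vector, vector))
        for team, vector in zip(teams, vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_teams(incident_text, teams)` yields an initial ranking that a later semantic-analysis step could refine. A team whose key phrases share no tokens with the incident receives a score of 0.0.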
The system consists of three main components, each performing specialized tasks to ensure consistency and efficiency:
- Triage Decider (`TriageDecider`)
  - Gathers incident data from the Analyzer and Team Manager.
  - Matches incidents with relevant teams based on confidence scores.
  - Provides traceable reasoning for each routing decision.
- Team Manager (`TeamManager`)
  - Maintains team capability profiles and summary key phrases.
  - Negotiates and escalates incidents when multiple teams are possible matches.
  - Ensures that team availability and load constraints are respected.
- Analyzer (`Analyser`)
  - Performs TF-IDF analysis to derive initial similarity ranks.
  - Conducts semantic distillation of incident data.
  - Merges findings to produce final similarity and confidence metrics.
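As a rough illustration of how a decider might turn the Analyzer's confidence scores into a routing decision with traceable reasoning, consider the sketch below. Everything in it is hypothetical: the `TriageDecision` dataclass, the `decide` function, the action labels, and the default threshold of 0.7 are assumptions, not the actual `TriageDecider` interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TriageDecision:
    team: Optional[str]   # best candidate team, or None if there were no candidates
    confidence: float     # confidence score of the best candidate
    action: str           # "auto_assign" or "manual_review" (assumed labels)
    reasoning: str        # human-readable trace of why this action was chosen


def decide(candidates: List[Tuple[str, float]], threshold: float = 0.7) -> TriageDecision:
    """Pick the highest-confidence team and auto-assign only above the threshold.

    `candidates` is a list of (team name, confidence) pairs; the 0.7 default
    is an illustrative value, not one taken from the project.
    """
    if not candidates:
        return TriageDecision(None, 0.0, "manual_review", "no candidate teams")
    team, confidence = max(candidates, key=lambda pair: pair[1])
    if confidence >= threshold:
        reason = f"{team} scored {confidence:.2f}, at or above threshold {threshold:.2f}"
        return TriageDecision(team, confidence, "auto_assign", reason)
    reason = f"best score {confidence:.2f} for {team} is below threshold {threshold:.2f}"
    return TriageDecision(team, confidence, "manual_review", reason)
```

Keeping the reasoning string alongside the decision means every routing outcome can be logged and reviewed later.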
Triangle continuously monitors performance indicators to evaluate its effectiveness:
| Metric | Description |
|---|---|
| Accuracy | Percentage of correct team assignments |
| Response Time | Average triage completion time |
| Escalation Rate | Frequency of manual interventions or reassignments |
| Confidence Score Mean | Average confidence for automated triage decisions |
By measuring these metrics over time, Triangle helps identify improvements and ensure consistent, data-driven enhancements to incident triage workflows.
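The four metrics in the table could be computed from per-incident records along the following lines. This is a sketch under stated assumptions: the record fields (`predicted_team`, `actual_team`, `seconds`, `escalated`, `confidence`) and the `summarize` function are hypothetical and do not reflect the format of the project's result files.

```python
from statistics import mean


def summarize(records):
    """Compute the four tracked metrics from a list of per-incident records.

    Each record is assumed (hypothetically) to look like:
    {"predicted_team": str, "actual_team": str, "seconds": float,
     "escalated": bool, "confidence": float}
    """
    if not records:
        raise ValueError("at least one record is required")
    return {
        # Fraction of incidents routed to the correct team.
        "accuracy": mean(
            1.0 if r["predicted_team"] == r["actual_team"] else 0.0 for r in records
        ),
        # Average triage completion time in seconds.
        "response_time": mean(r["seconds"] for r in records),
        # Fraction of incidents that needed manual intervention or reassignment.
        "escalation_rate": mean(1.0 if r["escalated"] else 0.0 for r in records),
        # Average confidence of the triage decisions.
        "confidence_mean": mean(r["confidence"] for r in records),
    }
```

Running `summarize` on successive batches of results gives a simple time series for each metric.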
- Python 3.8+
- Azure OpenAI API access
- Required Python packages (see requirements.txt)
- A stable internet connection for reliable LLM access.
- Sufficient resource allocation (CPU/Memory) for larger incident volumes.
- Clone the repository

      git clone <url>
      cd triangle

- Setting Up the Virtual Environment

  Follow these steps to create and activate a virtual environment, then install the required packages from `requirements.txt`.

  - Python Installation: Ensure Python is installed on your system. You can download it from the official Python website.
  - Pip Verification: Verify `pip` is installed by running `pip --version` in your terminal or command prompt.

- Create a Virtual Environment

  Open your terminal (Linux/Mac) or command prompt (Windows) and navigate to your project directory:

      cd /path/to/your/project

  Create a virtual environment named `venv`:

      python -m venv venv

- Activate the Virtual Environment

  - Windows: `.\venv\Scripts\activate`
  - Linux/Mac: `source venv/bin/activate`

  After activation, your terminal prompt will change to indicate that the virtual environment is active.

- Install Required Packages

  With the virtual environment activated, install the dependencies listed in `requirements.txt`:

      pip install -r requirements.txt

  This command reads the `requirements.txt` file and installs all necessary packages.

- Configure Azure OpenAI Credentials

  Create a `config.json` file with your Azure OpenAI credentials:

      {
        "ENDPOINT_URL": "your_azure_endpoint",
        "DEPLOYMENT_NAME": "your_deployment_name",
        "API_VERSION": "your_api_version",
        "API_KEY": "your_api_key"
      }
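Before starting a run, it can help to check that `config.json` contains all four keys shown above. The loader below is a minimal sketch using only the standard library. The `load_config` function and `REQUIRED_KEYS` constant are hypothetical helpers, not part of the project, and the sketch makes no assumption about how `triangle.py` itself reads the file.

```python
import json
from pathlib import Path

# The four keys shown in the config.json example above.
REQUIRED_KEYS = ("ENDPOINT_URL", "DEPLOYMENT_NAME", "API_VERSION", "API_KEY")


def load_config(path="config.json"):
    """Read the JSON config and fail early if any required key is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"{path} is missing required keys: {', '.join(missing)}")
    return config
```

Failing at load time gives a clearer error than a rejected API call later in the run.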
Team profile format:

    [
      {
        "name": "team_name",
        "summary_key_phrases": ["key_phrase1", "key_phrase2", ...]
      }
    ]

Incident data format:

    [
      {
        "case": "incident_id",
        "message": "incident_description",
        "last_person": "assigned_team"
      }
    ]

- Data Versioning: Use Git LFS or specialized tools to manage large datasets and historical changes.
- Privacy & Security: Ensure that only sanitized or anonymized data is shared where needed, and follow your organization's data handling policies.
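A small validation step can catch malformed team or incident records before they reach the agents. The sketch below checks only the field names that appear in the two data formats above. The helper names (`validate_records`, `load_records`) and the constants are assumptions made for illustration, not project code.

```python
import json
from pathlib import Path

# Field names taken from the two data formats shown above.
TEAM_FIELDS = ("name", "summary_key_phrases")
INCIDENT_FIELDS = ("case", "message", "last_person")


def validate_records(records, required_fields, label):
    """Check that `records` is a list of objects carrying every required field."""
    if not isinstance(records, list):
        raise ValueError(f"{label} must be a JSON array")
    for index, record in enumerate(records):
        missing = [field for field in required_fields if field not in record]
        if missing:
            raise ValueError(f"{label}[{index}] is missing fields: {missing}")
    return records


def load_records(path, required_fields, label):
    """Load a JSON file from `path` and validate its records."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return validate_records(data, required_fields, label)
```

For example, `load_records("person.json", TEAM_FIELDS, "teams")` would validate the team file, assuming that is where team profiles are stored.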
Run the main triage system:

    python triangle.py

You can customize parameters in `triangle.py` to adjust agent behaviors, logging levels, or threshold settings for confidence scores.
When the triage process completes, the system generates detailed results in the results directory, including:
- Assignment decisions for each incident
- Confidence scores
- Performance metrics
- Routing paths
Review these logs continuously to identify recurring issues and potential improvements in your triage logic.
We welcome contributions from the community to make Triangle even better:
- Fork the Repo and create your branch from `main`.
- Implement Features or bug fixes in alignment with the project's guidelines.
- Open a Pull Request, detailing your changes, improvements, and testing for easy review.
| Question | Answer |
|---|---|
| How do I add a new team? | Add a new JSON object in `person.json` and include relevant key phrases that describe the team's domain. |
| How do I retrain or refine the model? | Update your training scripts using new incident data, then adjust the Analyzer module accordingly. |
| Is on-prem deployment supported? | Yes, you can run Triangle self-hosted, but you need a stable internal environment for the LLM API. |
- Language Coverage: While the system supports English data, non-English data may require additional adjustments in the Analyzer.
- Contextual Knowledge: Domain-specific knowledge bases can help enrich the semantic matching but are currently not fully integrated.
- LLM Dependence: Triage decisions depend on the accuracy, availability, and cost of LLM services.
- Future Enhancements: Plans include adding advanced multi-modal interfaces (voice, images) and incorporating user feedback loops for continuous learning.