AI-Powered Kubernetes Incident Analysis & Root Cause Analysis Tool
KubeRCA is an open-source tool that automatically collects incident context from Kubernetes environments and uses an LLM to provide Root Cause Analysis (RCA) and response guides.
When alerts fire in your cluster, KubeRCA:
- Receives alerts via Alertmanager webhook
- Creates/updates incidents and sends Slack thread notifications
- Analyzes context with AI (Strands Agents: Gemini/OpenAI/Anthropic)
- Streams realtime updates to the dashboard via SSE
- Supports similar incident search, feedback, and in-app chat workflows
- Automated Context Collection - Gather logs, metrics, and K8s events when alerts fire
- AI-Powered Analysis - LLM-based root cause analysis with Strands Agents (Gemini/OpenAI/Anthropic)
- Similar Incident Search - Vector similarity search using pgvector
- Slack Integration - Real-time notifications with threaded analysis results
- Realtime Dashboard Sync - Server-Sent Events (/api/v1/events) with polling fallback
- Operator Feedback Loop - Vote/comment APIs for incidents and alerts
- In-App AI Chat - Context-aware chat via Backend POST /api/v1/chat and Agent POST /chat
- Webhook Settings UI - CRUD management for outbound webhook integrations
- Google OIDC Login - One-click Google authentication with email allowlist
- Web Dashboard - React-based UI for incident management
- Helm Deployment - Easy installation via Helm charts
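
The similar incident search listed above relies on vector similarity in pgvector. As a rough illustration of the idea only, the sketch below ranks toy embeddings by cosine distance in plain Python, which is the quantity pgvector's `<=>` operator computes. The incident IDs, the 3-dimensional vectors, and the helper names are hypothetical; whether KubeRCA uses cosine distance rather than another pgvector metric is an assumption.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def most_similar(query, incidents, top_k=3):
    """Return up to top_k (incident_id, distance) pairs, closest first."""
    scored = [(incident_id, cosine_distance(query, emb))
              for incident_id, emb in incidents.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy embeddings (hypothetical); real embeddings have hundreds of dimensions.
incidents = {
    "inc-oom": [0.9, 0.1, 0.0],
    "inc-crashloop": [0.7, 0.3, 0.1],
    "inc-dns": [0.0, 0.2, 0.9],
}
results = most_similar([1.0, 0.0, 0.0], incidents, top_k=2)
for incident_id, distance in results:
    print(incident_id, round(distance, 3))
```

In a real deployment the ranking would be done by the database, which avoids loading every embedding into application memory.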
flowchart LR
%% External
AM[Alertmanager]
SL[Slack Bot]
LLM[LLM API Gemini OpenAI Anthropic]
PR[Prometheus]
K8S[Kubernetes API]
TP[Tempo]
LO[Loki]
GK[Grafana]
AL[Alloy]
OIDC[Google OIDC]
%% Internal
subgraph KubeRCA
FE[Frontend React TypeScript]
BE[Backend Go Gin]
AG[Agent Python FastAPI]
PG[(PostgreSQL pgvector)]
end
AM -->|Webhook| BE
BE -->|Thread notification| SL
FE -->|Auth Incident Alert API| BE
FE -->|SSE stream| BE
BE -->|Analyze and summarize| AG
BE -->|Chat request| AG
AG -->|K8s Context| K8S
AG -->|Metrics Query| PR
AG -->|LLM Analysis| LLM
AG -.->|Trace Query| TP
BE -->|Embeddings| LLM
BE -.->|OIDC Token Exchange| OIDC
FE -.->|OIDC Redirect| OIDC
BE <-->|Data| PG
AG -.->|Session optional| PG
AL -.->|Collector| PR
AL -.->|Collector| LO
AL -.->|Collector| TP
GK -.->|Dashboard| PR
GK -.->|Dashboard| LO
GK -.->|Dashboard| TP
| Step | Description |
|---|---|
| 1 | Alertmanager sends alerts to Backend via webhook |
| 2 | Backend creates/updates incidents, stores alerts, and posts Slack thread messages |
| 3 | Backend asynchronously sends POST /analyze to Agent |
| 4 | Agent collects K8s/Prometheus/Tempo context and calls LLM provider |
| 5 | Backend stores analysis history (alerts, alert_analyses, artifacts) |
| 6 | Backend emits SSE events and Frontend refreshes data in realtime |
| 7 | Resolving an incident triggers Agent POST /summarize-incident and embedding storage |
| 8 | Frontend searches similar incidents, sends feedback, and uses in-app AI chat |
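
Step 6 depends on Server-Sent Events. To make the wire format concrete, here is a minimal sketch of a parser for `text/event-stream` lines, such as a script reading /api/v1/events might use. The event name `incident_updated` and the JSON payload are hypothetical; the actual event names and payloads emitted by the Backend are not documented here.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        # Other fields (id, retry) are ignored in this sketch.


# Hypothetical stream content for illustration only.
sample = [
    ": keep-alive\n",
    "event: incident_updated\n",
    'data: {"incident_id": "inc-1"}\n',
    "\n",
    "data: plain message\n",
    "\n",
]
events = list(parse_sse(sample))
print(events)
```

A client that cannot keep the stream open can fall back to polling the regular API on a timer, which matches the polling fallback mentioned in the feature list.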
| Component | Technology |
|---|---|
| Backend | Go 1.24 + Gin |
| Agent | Python 3.10+ + FastAPI + Strands Agents |
| Frontend | React 18 + TypeScript + Vite + Tailwind CSS |
| Database | PostgreSQL + pgvector |
| Category | Technology |
|---|---|
| Deployment | Helm, ArgoCD |
| IaC | Terraform |
| Monitoring | Prometheus, Alertmanager, Grafana |
| Logging | Loki, Grafana Alloy |
| AI/LLM | Strands Agents (Gemini/OpenAI/Anthropic) |
| Category | Technology |
|---|---|
| Chaos Engineering | Chaos Mesh |
| Load Testing | k6 |
- Kubernetes cluster (1.25+)
- Helm 3.x
- AI provider API key (Gemini / OpenAI / Anthropic)
- PostgreSQL with pgvector extension (bundled subchart or external)
- Slack bot token + channel ID (optional)
# Optional: login to Public ECR (if your environment requires it)
aws ecr-public get-login-password --region us-east-1 | \
helm registry login --username AWS --password-stdin public.ecr.aws
# Install/upgrade (chart version from charts/kube-rca/Chart.yaml)
helm upgrade --install kube-rca oci://public.ecr.aws/r5b7j2e4/kube-rca-ecr/charts/kube-rca \
--namespace kube-rca --create-namespace \
--version <chart-version> \
-f values.yaml

git clone https://github.com/your-org/kube-rca.git
cd kube-rca/helm-charts/main
helm upgrade --install kube-rca charts/kube-rca \
--namespace kube-rca --create-namespace \
-f values.yaml

backend:
  embedding:
    provider: "gemini"
    apiKey:
      existingSecret: "kube-rca-ai"
      key: "ai-studio-api-key"
  postgresql:
    secret:
      existingSecret: "postgresql"
      key: "password"
  slack:
    enabled: true
    secret:
      existingSecret: "kube-rca-slack"
agent:
  aiProvider: "gemini"
  gemini:
    secret:
      existingSecret: "kube-rca-ai"
      key: "ai-studio-api-key"
  prometheus:
    url: "http://prometheus-server.monitoring:9090"
frontend:
  ingress:
    enabled: true
    hosts:
      - kube-rca.example.com

For OpenAI/Anthropic, set agent.aiProvider to openai or anthropic and point agent.openai.secret / agent.anthropic.secret to the corresponding secret key (openai-api-key / anthropic-api-key).
Add the KubeRCA webhook receiver to your Alertmanager configuration:
receivers:
  - name: "kube-rca"
    webhook_configs:
      - url: "http://<release>-backend.<namespace>.svc.cluster.local:8080/webhook/alertmanager"
        send_resolved: true
route:
  receiver: "kube-rca"
  # or add as a child route

Example (release: kube-rca, namespace: kube-rca):
http://kube-rca-backend.kube-rca.svc.cluster.local:8080/webhook/alertmanager
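
To check the receiver URL before wiring up Alertmanager, a synthetic alert can be posted by hand. The sketch below builds the URL from the release and namespace and prepares a minimal payload in Alertmanager's webhook format (version 4). The alert name, labels, and fingerprint are made up, and whether the Backend accepts this exact payload without further fields or authentication is an assumption.

```python
import json
import urllib.request


def webhook_url(release, namespace, port=8080):
    """Build the in-cluster webhook URL following the pattern shown above."""
    return (f"http://{release}-backend.{namespace}.svc.cluster.local:"
            f"{port}/webhook/alertmanager")


def sample_payload():
    """Minimal Alertmanager-style webhook body; all label values are hypothetical."""
    alert = {
        "status": "firing",
        "labels": {"alertname": "PodCrashLooping", "namespace": "demo",
                   "severity": "warning"},
        "annotations": {"summary": "Synthetic test alert"},
        "startsAt": "2024-01-01T00:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "",
        "fingerprint": "0000000000000001",
    }
    return {
        "version": "4",
        "groupKey": '{}:{alertname="PodCrashLooping"}',
        "truncatedAlerts": 0,
        "status": "firing",
        "receiver": "kube-rca",
        "groupLabels": {"alertname": "PodCrashLooping"},
        "commonLabels": alert["labels"],
        "commonAnnotations": alert["annotations"],
        "externalURL": "",
        "alerts": [alert],
    }


def build_request(url, payload):
    """Prepare (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"},
        method="POST")


url = webhook_url("kube-rca", "kube-rca")
request = build_request(url, sample_payload())
print(url)
# To actually send it from inside the cluster (or through a port-forward):
# urllib.request.urlopen(request, timeout=5)
```

The send call is left commented out so the script can be run anywhere without network access to the cluster.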
| Secret | Keys | Notes |
|---|---|---|
| postgresql | postgres-password, password | PostgreSQL (Bitnami subchart) |
| kube-rca-ai | ai-studio-api-key / openai-api-key / anthropic-api-key | Keys depend on agent.aiProvider / backend.embedding.provider |
| kube-rca-slack | kube-rca-slack-token, kube-rca-slack-channel-id | Required if Slack enabled |
| kube-rca-auth | admin-username, admin-password, kube-rca-jwt-secret, oidc-client-id, oidc-client-secret | Auth + OIDC (via ExternalSecret or manual) |
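
The secrets table can be turned into a small pre-install check. The sketch below maps a chosen provider configuration to the secret keys the table lists. The function name and return shape are hypothetical, and kube-rca-auth is left out because the table does not say which of its keys are mandatory.

```python
# Provider-to-key mapping taken from the kube-rca-ai row of the table above.
PROVIDER_KEYS = {
    "gemini": "ai-studio-api-key",
    "openai": "openai-api-key",
    "anthropic": "anthropic-api-key",
}


def required_secret_keys(ai_provider, embedding_provider, slack_enabled):
    """Return {secret_name: set_of_keys} for the given configuration."""
    for provider in (ai_provider, embedding_provider):
        if provider not in PROVIDER_KEYS:
            raise ValueError(f"unknown provider: {provider!r}")
    required = {
        "postgresql": {"postgres-password", "password"},
        "kube-rca-ai": {PROVIDER_KEYS[ai_provider],
                        PROVIDER_KEYS[embedding_provider]},
    }
    if slack_enabled:
        required["kube-rca-slack"] = {"kube-rca-slack-token",
                                      "kube-rca-slack-channel-id"}
    return required


# Example: OpenAI for analysis, Gemini for embeddings, Slack disabled.
needed = required_secret_keys("openai", "gemini", slack_enabled=False)
for name, keys in sorted(needed.items()):
    print(name, sorted(keys))
```

Comparing this output with the keys actually present in each Kubernetes Secret is one way to catch a missing key before the pods fail to start.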
For full configuration options, see the Helm chart values at helm-charts/main/charts/kube-rca/README.md.
cd backend/main
go mod tidy
go run .
# or
go test ./...

cd agent/main
make install # uv sync
make lint # ruff check
make test # pytest
make run # uvicorn dev server

cd frontend/main
npm ci
npm run dev # development server
npm run build # production build
npm run lint # eslint

Contributions are welcome! Please read our contributing guidelines before submitting PRs.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made with dedication for the Kubernetes community
