AI-Powered Analysis Service for Kubernetes Incidents
The KubeRCA Agent is a Python-based analysis service that performs Root Cause Analysis (RCA) on Kubernetes incidents. It receives alert payloads from the Backend, collects relevant context from the Kubernetes cluster and Prometheus, and uses an LLM via Strands Agents (Gemini/OpenAI/Anthropic) to generate comprehensive analysis reports.
- AI-Powered RCA - Uses Strands Agents with Gemini/OpenAI/Anthropic for intelligent analysis
- Kubernetes Context - Collects pod logs, events, and resource status
- Generic Manifest Read Tools - Reads namespaced core/CRD manifests via `apiVersion` + `resource`
- Prometheus Integration - Queries relevant metrics for analysis
- Session Persistence - PostgreSQL-backed session history when `SESSION_DB_*` is configured
- Fallback Mode - Returns a basic summary when the provider API key is unavailable
```mermaid
flowchart LR
    BE[Backend] -->|POST /analyze| AG[Agent]
    AG -->|Logs, Events| K8S[Kubernetes API]
    AG -->|PromQL Query| PR[Prometheus]
    AG -->|LLM Analysis| LLM[LLM Provider API]
    AG -.->|Session Storage| PG[(PostgreSQL)]
    AG -->|Analysis Result| BE
```
1. Receive alert payload from Backend
2. Collect Kubernetes context (logs, events, pod status)
3. Query Prometheus for relevant metrics
4. Build analysis prompt with collected context
5. Send to Strands Agents (Gemini/OpenAI/Anthropic) for RCA
6. Return structured analysis result
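The flow above can be sketched as a single service function. This is a minimal sketch: `AnalysisContext`, `build_prompt`, and `analyze` are hypothetical stand-ins for illustration, not the module's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisContext:
    """Context collected for one alert (hypothetical shape)."""
    logs: list[str] = field(default_factory=list)
    events: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)


def build_prompt(alert: dict, ctx: AnalysisContext) -> str:
    """Combine the alert and collected context into one RCA prompt."""
    parts = [
        f"Alert: {alert['labels'].get('alertname', 'unknown')}",
        "Recent logs:\n" + "\n".join(ctx.logs[-25:]),    # K8S_LOG_TAIL_LINES-style cap
        "Recent events:\n" + "\n".join(ctx.events[-25:]),
        "Metrics: " + ", ".join(f"{k}={v}" for k, v in ctx.metrics.items()),
    ]
    return "\n\n".join(parts)


def analyze(alert: dict, ctx: AnalysisContext, llm=None) -> dict:
    """Run the RCA flow; fall back to a basic summary when no LLM is configured."""
    prompt = build_prompt(alert, ctx)
    if llm is None:  # fallback mode: no provider API key
        return {"status": "ok", "analysis": "Alert received but AI analysis unavailable"}
    return {"status": "ok", "analysis": llm(prompt)}
```

Injecting the LLM as a plain callable keeps the flow testable without a provider key, which mirrors the service's fallback mode.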
| Category | Technology |
|---|---|
| Language | Python 3.10+ |
| Framework | FastAPI |
| AI/LLM | Strands Agents (Gemini/OpenAI/Anthropic) |
| Package Manager | uv |
| Linting | ruff |
| Testing | pytest |
| Container | Docker |
| CI/CD | GitHub Actions |
- Python 3.10+
- uv (Python package manager)
- (Optional) Kubernetes cluster access
- (Optional) AI provider API key
```bash
# Run in repository root
# (monorepo layout: cd agent/main)
make install

# or manually:
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
```

```bash
make run
# or manually:
uvicorn app.main:app --host 0.0.0.0 --port 8000
```

The server starts at http://localhost:8000.
```bash
make test
# or:
pytest
```

```bash
make lint    # Check code style
make format  # Auto-format code
```

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Service info |
| GET | `/ping` | Health check |
| GET | `/healthz` | Kubernetes health probe |
| POST | `/analyze` | Analyze single alert |
| POST | `/summarize-incident` | Summarize resolved incident |
| GET | `/openapi.json` | OpenAPI specification |
Analyzes a single alert with Kubernetes/Prometheus context.
Request:

```json
{
  "alert": {
    "status": "firing",
    "labels": {
      "alertname": "HighMemoryUsage",
      "severity": "critical",
      "namespace": "default",
      "pod": "example-pod"
    },
    "annotations": {
      "summary": "High memory usage detected",
      "description": "Pod memory usage > 90%"
    },
    "startsAt": "2024-01-01T00:00:00Z",
    "fingerprint": "abc123"
  },
  "thread_ts": "1234567890.123456"
}
```

Response:
```json
{
  "status": "ok",
  "thread_ts": "1234567890.123456",
  "analysis": "## Root Cause Analysis\n...",
  "analysis_summary": "Brief summary of the issue",
  "analysis_detail": "Detailed RCA markdown content...",
  "analysis_quality": "medium",
  "missing_data": ["alert.labels.pod"],
  "warnings": ["namespace/pod_name missing from alert labels"],
  "capabilities": {
    "k8s_core": "ok",
    "manifest_read": "ok",
    "prometheus": "unavailable",
    "tempo": "unavailable",
    "mesh": "unknown",
    "traffic_policy": "unknown"
  },
  "context": {
    "namespace": "default",
    "pod_name": "example-pod",
    "analysis_quality": "medium"
  },
  "artifacts": []
}
```

`POST /summarize-incident` summarizes a resolved incident with all associated alerts.
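Given a response like the one above, a caller can decide how much to trust the analysis from `analysis_quality` and `capabilities`. The helpers below are a hypothetical client-side sketch, not part of the service:

```python
def degraded_sources(response: dict) -> list[str]:
    """Return the data sources that were unavailable during analysis."""
    caps = response.get("capabilities", {})
    return [name for name, state in caps.items() if state == "unavailable"]


def needs_review(response: dict) -> bool:
    """Flag analyses that are low quality or missing required alert data."""
    return response.get("analysis_quality") == "low" or bool(response.get("missing_data"))
```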
The analysis engine can inspect namespaced Kubernetes manifests (core and CRD) with:

```python
get_manifest(namespace, api_version, resource, name)
list_manifests(namespace, api_version, resource, label_selector=None, limit=20)
```

Examples:

```python
get_manifest("bookinfo", "v1", "services", "reviews")
get_manifest("bookinfo", "networking.istio.io/v1", "virtualservices", "reviews-route")
list_manifests("bookinfo", "v1", "configmaps", "app=reviews", 10)
```

Notes:

- `api_version` supports both core (`v1`) and grouped (`group/version`) formats.
- `resource` must be a plural resource name (for example: `pods`, `services`, `virtualservices`).
- For security and readability, secret values are masked and `status`/`metadata.managedFields` are omitted in `get_manifest` responses.
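The masking described above can be approximated as follows. This is a simplified sketch of the behavior, not the service's actual implementation:

```python
def sanitize_manifest(manifest: dict) -> dict:
    """Mask Secret data and drop noisy fields, as get_manifest responses do."""
    # Drop the status subtree entirely
    out = {k: v for k, v in manifest.items() if k != "status"}
    # Drop metadata.managedFields but keep the rest of metadata
    meta = dict(out.get("metadata", {}))
    meta.pop("managedFields", None)
    if meta:
        out["metadata"] = meta
    # Mask secret values so they never reach the LLM or logs
    if out.get("kind") == "Secret":
        for key in ("data", "stringData"):
            if key in out:
                out[key] = {k: "***MASKED***" for k in out[key]}
    return out
```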
| Variable | Description | Default |
|---|---|---|
| `AI_PROVIDER` | LLM provider (`gemini`, `openai`, `anthropic`) | `gemini` |
| `GEMINI_API_KEY` | Gemini API key for Strands Agents | - |
| `OPENAI_API_KEY` | OpenAI API key for Strands Agents | - |
| `ANTHROPIC_API_KEY` | Anthropic API key for Strands Agents | - |
| `GEMINI_MODEL_ID` | Gemini model ID | `gemini-3-flash-preview` |
| `OPENAI_MODEL_ID` | OpenAI model ID | `gpt-4o` |
| `ANTHROPIC_MODEL_ID` | Anthropic model ID | `claude-sonnet-4-20250514` |
| `PROMETHEUS_URL` | Prometheus base URL | - (disabled) |
| `LOG_LEVEL` | Logging level | `info` |
| `WEB_CONCURRENCY` | Uvicorn worker count | `1` |
| Variable | Description | Default |
|---|---|---|
| `K8S_API_TIMEOUT_SECONDS` | K8s API timeout (seconds) | `5` |
| `K8S_EVENT_LIMIT` | Max events to fetch | `25` |
| `K8S_LOG_TAIL_LINES` | Log lines to fetch | `25` |
| Variable | Description | Default |
|---|---|---|
| `PROMETHEUS_URL` | Prometheus base URL | - |
| `PROMETHEUS_HTTP_TIMEOUT_SECONDS` | HTTP timeout (seconds) | `5` |
| Variable | Description | Default |
|---|---|---|
| `TEMPO_URL` | Tempo base URL (e.g. `http://tempo.monitoring.svc:3100`) | - |
| `TEMPO_HTTP_TIMEOUT_SECONDS` | Tempo HTTP timeout (seconds) | `10` |
| `TEMPO_TENANT_ID` | Tempo tenant header value (`X-Scope-OrgID`) | - |
| `TEMPO_TRACE_LIMIT` | Max traces fetched per alert | `5` |
| `TEMPO_LOOKBACK_MINUTES` | Minutes before `startsAt` for the trace search window | `15` |
| `TEMPO_FORWARD_MINUTES` | Minutes after `startsAt` for the trace search window | `5` |
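The lookback/forward settings define a trace search window around the alert's `startsAt` timestamp. A sketch of that arithmetic, assuming the defaults above:

```python
from datetime import datetime, timedelta


def trace_window(starts_at: str, lookback_min: int = 15, forward_min: int = 5):
    """Compute the Tempo search window around an alert's startsAt timestamp."""
    # Alertmanager emits RFC 3339 timestamps with a trailing Z
    t0 = datetime.fromisoformat(starts_at.replace("Z", "+00:00"))
    return t0 - timedelta(minutes=lookback_min), t0 + timedelta(minutes=forward_min)
```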
| Variable | Description | Default |
|---|---|---|
| `PROMPT_TOKEN_BUDGET` | Approximate token budget | `32000` |
| `PROMPT_MAX_LOG_LINES` | Max log lines in prompt | `25` |
| `PROMPT_MAX_EVENTS` | Max events in prompt | `25` |
| `PROMPT_SUMMARY_MAX_ITEMS` | Max session summaries | `3` |
| `MASKING_REGEX_LIST_JSON` | JSON array of regex patterns for masking before LLM/DB response flows | `[]` |
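`MASKING_REGEX_LIST_JSON` can be applied as a simple pass over outbound text. A sketch, assuming the configured patterns are plain `re` expressions:

```python
import json
import re


def mask_text(text: str, masking_regex_list_json: str = "[]") -> str:
    """Redact every match of the configured patterns before text leaves the service."""
    for pattern in json.loads(masking_regex_list_json):
        text = re.sub(pattern, "***", text)
    return text
```

With the default `[]` the function is a no-op, matching the table's default.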
| Variable | Description | Default |
|---|---|---|
| `LLM_RETRY_MAX_ATTEMPTS` | Max retry attempts for transient LLM API errors (5xx, 429) | `5` |
| `LLM_RETRY_MIN_WAIT` | Minimum backoff wait time in seconds | `1.0` |
| `LLM_RETRY_MAX_WAIT` | Maximum backoff wait time in seconds | `60.0` |
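The retry behavior these variables describe can be sketched with stdlib-only capped exponential backoff. This is an illustrative sketch; the real service may use a retry library instead:

```python
import random


class TransientLLMError(Exception):
    """Retryable provider error (HTTP 5xx or 429)."""


def retry_llm_call(call, max_attempts=5, min_wait=1.0, max_wait=60.0, sleep=lambda s: None):
    """Retry transient failures with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientLLMError:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the caller
            wait = min(max_wait, min_wait * 2 ** (attempt - 1))
            sleep(wait + random.uniform(0, wait * 0.1))  # jitter to avoid thundering herd
```

The `sleep` parameter is injectable for tests; production code would pass `time.sleep`.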
| Variable | Description |
|---|---|
| `SESSION_DB_HOST` | PostgreSQL host |
| `SESSION_DB_PORT` | PostgreSQL port |
| `SESSION_DB_NAME` | Database name |
| `SESSION_DB_USER` | Database user |
| `SESSION_DB_PASSWORD` | Database password |
If `GEMINI_API_KEY`/`OPENAI_API_KEY`/`ANTHROPIC_API_KEY` is set, configure the `SESSION_DB_*` variables as well.
```
agent/
├── app/
│   ├── main.py              # FastAPI entrypoint
│   ├── api/
│   │   ├── analysis.py      # POST /analyze, POST /summarize-incident
│   │   └── health.py        # GET /, /ping, /healthz
│   ├── clients/
│   │   ├── k8s.py
│   │   ├── prometheus.py
│   │   ├── tempo.py
│   │   ├── session_repository.py
│   │   ├── summary_store.py
│   │   ├── strands_agent.py
│   │   ├── strands_patch.py
│   │   └── llm_providers/
│   ├── core/
│   │   ├── config.py
│   │   ├── dependencies.py
│   │   └── logging.py
│   ├── models/
│   ├── schemas/
│   │   ├── alert.py
│   │   └── analysis.py
│   └── services/analysis.py
├── docs/openapi.json
├── scripts/export_openapi.py
├── tests/
├── Dockerfile
├── Makefile
└── pyproject.toml
```
| Command | Description |
|---|---|
| `make install` | Install dependencies |
| `make run` | Run development server |
| `make lint` | Run ruff linter |
| `make format` | Format code with ruff |
| `make test` | Run pytest |
| `make build IMAGE=<tag>` | Build Docker image |
| `make curl-analyze` | Test analyze endpoint |
| `make curl-analyze-local` | Test with local server |
When the API changes, regenerate the OpenAPI spec:

```bash
uv run python scripts/export_openapi.py
```

The spec is saved to `docs/openapi.json`.
Auto-regenerate OpenAPI on commit:

```bash
git config core.hooksPath .githooks
```

release-please parses conventional commits; merge commits that include the PR title can be double-counted in the changelog.

- Prefer `Squash and merge` or `Rebase and merge`.
- If `Create a merge commit` is used, keep the PR title non-conventional (e.g., "Merge PR #123").
- Use Conventional Commits for change commits that should appear in the changelog.
```bash
make test
# or:
pytest tests/
```

Requires a Kubernetes cluster and a provider API key:

```bash
AI_PROVIDER=gemini GEMINI_API_KEY=xxx KUBECONFIG=~/.kube/config make test-analysis-local
```

Call the endpoint directly with curl:

```bash
curl -X POST http://localhost:8000/analyze \
  -H 'Content-Type: application/json' \
  -d '{
    "alert": {
      "status": "firing",
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning",
        "namespace": "default",
        "pod": "example-pod"
      },
      "annotations": {
        "summary": "Test summary",
        "description": "Test description"
      },
      "startsAt": "2024-01-01T00:00:00Z",
      "fingerprint": "test-fingerprint"
    },
    "thread_ts": "test-thread"
  }'
```

```bash
docker build -t kube-rca-agent .
# or:
make build IMAGE=kube-rca-agent
```

```bash
docker run -d -p 8000:8000 \
  -e GEMINI_API_KEY=your-api-key \
  kube-rca-agent
```

Test the agent with a real OOMKilled scenario in Kubernetes.
- `kubectl` with cluster access
- `kube-rca` namespace exists
```bash
# Create OOM pod only
make test-oom-only

# Full test with analysis
GEMINI_API_KEY=xxx make test-analysis-local

# Cleanup
make cleanup-oom
```

| Variable | Description | Default |
|---|---|---|
| `KUBE_CONTEXT` | Kubernetes context | current |
| `LOCAL_OOM_NAMESPACE` | Test namespace | `kube-rca` |
| `LOCAL_OOM_DEPLOYMENT` | Deployment name | `oomkilled-test` |
| `LOCAL_OOM_MEMORY_LIMIT` | Memory limit | `64Mi` |
| `CLEANUP` | Auto-cleanup after test | `false` |
When `GEMINI_API_KEY` is not set, the agent returns a fallback summary:

```json
{
  "status": "ok",
  "analysis": {
    "summary": "Alert received but AI analysis unavailable",
    "detail": "Basic alert information..."
  }
}
```

- KubeRCA Backend - Go REST API server
- KubeRCA Frontend - React web dashboard
- Helm Charts - Kubernetes deployment
- Chaos Scenarios - Failure injection tests
This project is part of KubeRCA, licensed under the MIT License. See the LICENSE file for details.
