
KubeRCA Agent

AI-Powered Analysis Service for Kubernetes Incidents



Overview

The KubeRCA Agent is a Python-based analysis service that performs Root Cause Analysis (RCA) on Kubernetes incidents. It receives alert payloads from the Backend, collects relevant context from the Kubernetes cluster and Prometheus, and uses an LLM via Strands Agents (Gemini/OpenAI/Anthropic) to generate comprehensive analysis reports.

Key Features

  • AI-Powered RCA - Uses Strands Agents with Gemini/OpenAI/Anthropic for intelligent analysis
  • Kubernetes Context - Collects pod logs, events, and resource status
  • Generic Manifest Read Tools - Reads namespaced core/CRD manifests via apiVersion + resource
  • Prometheus Integration - Queries relevant metrics for analysis
  • Session Persistence - PostgreSQL-backed session history when SESSION_DB_* is configured
  • Fallback Mode - Returns basic summary when the provider API key is unavailable

Architecture

flowchart LR
  BE[Backend] -->|POST /analyze| AG[Agent]
  AG -->|Logs, Events| K8S[Kubernetes API]
  AG -->|PromQL Query| PR[Prometheus]
  AG -->|LLM Analysis| LLM[LLM Provider API]
  AG -.->|Session Storage| PG[(PostgreSQL)]
  AG -->|Analysis Result| BE

Analysis Flow

  1. Receive alert payload from Backend
  2. Collect Kubernetes context (logs, events, pod status)
  3. Query Prometheus for relevant metrics
  4. Build analysis prompt with collected context
  5. Send to Strands Agents (Gemini/OpenAI/Anthropic) for RCA
  6. Return structured analysis result
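The six steps above can be sketched as a single orchestration function. This is an illustrative sketch only: the `analyze` function, its injected collaborators, and the prompt shape are assumptions, not the service's actual internals.

```python
def analyze(alert: dict, collect_k8s, query_prometheus, run_llm) -> dict:
    """Hypothetical sketch of steps 1-6; collaborators are injected callables."""
    labels = alert.get("labels", {})
    context = {
        "namespace": labels.get("namespace"),
        "pod_name": labels.get("pod"),
        "k8s": collect_k8s(labels),            # step 2: logs, events, pod status
        "metrics": query_prometheus(labels),   # step 3: relevant metric results
    }
    # Step 4: build the analysis prompt from the collected context.
    prompt = f"Alert {labels.get('alertname')}\nContext: {context}"
    # Steps 5-6: send to the LLM and return a structured result.
    return {"status": "ok", "analysis": run_llm(prompt), "context": context}
```

Injecting the collectors keeps the pipeline testable without a live cluster or API key, which is also what makes the fallback mode (returning a basic summary) straightforward.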

Tech Stack

Category Technology
Language Python 3.10+
Framework FastAPI
AI/LLM Strands Agents (Gemini/OpenAI/Anthropic)
Package Manager uv
Linting ruff
Testing pytest
Container Docker
CI/CD GitHub Actions

Quick Start

Prerequisites

  • Python 3.10+
  • uv (Python package manager)
  • (Optional) Kubernetes cluster access
  • (Optional) AI provider API key

Installation

# Run in repository root
# (monorepo layout: cd agent/main)
make install
# or manually:
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

Run Development Server

make run
# or manually:
uvicorn app.main:app --host 0.0.0.0 --port 8000

The server starts at http://localhost:8000.

Run Tests

make test
# or:
pytest

Lint & Format

make lint    # Check code style
make format  # Auto-format code

API Endpoints

Method Endpoint Description
GET / Service info
GET /ping Health check
GET /healthz Kubernetes health probe
POST /analyze Analyze single alert
POST /summarize-incident Summarize resolved incident
GET /openapi.json OpenAPI specification

POST /analyze

Analyzes a single alert with Kubernetes/Prometheus context.

Request:

{
  "alert": {
    "status": "firing",
    "labels": {
      "alertname": "HighMemoryUsage",
      "severity": "critical",
      "namespace": "default",
      "pod": "example-pod"
    },
    "annotations": {
      "summary": "High memory usage detected",
      "description": "Pod memory usage > 90%"
    },
    "startsAt": "2024-01-01T00:00:00Z",
    "fingerprint": "abc123"
  },
  "thread_ts": "1234567890.123456"
}

Response:

{
  "status": "ok",
  "thread_ts": "1234567890.123456",
  "analysis": "## Root Cause Analysis\n...",
  "analysis_summary": "Brief summary of the issue",
  "analysis_detail": "Detailed RCA markdown content...",
  "analysis_quality": "medium",
  "missing_data": ["alert.labels.pod"],
  "warnings": ["namespace/pod_name missing from alert labels"],
  "capabilities": {
    "k8s_core": "ok",
    "manifest_read": "ok",
    "prometheus": "unavailable",
    "tempo": "unavailable",
    "mesh": "unknown",
    "traffic_policy": "unknown"
  },
  "context": {
    "namespace": "default",
    "pod_name": "example-pod",
    "analysis_quality": "medium"
  },
  "artifacts": []
}

POST /summarize-incident

Summarizes a resolved incident with all associated alerts.

Generic Manifest Read Tools

The analysis engine can inspect namespaced Kubernetes manifests (core and CRD) with:

  • get_manifest(namespace, api_version, resource, name)
  • list_manifests(namespace, api_version, resource, label_selector=None, limit=20)

Examples:

get_manifest("bookinfo", "v1", "services", "reviews")
get_manifest("bookinfo", "networking.istio.io/v1", "virtualservices", "reviews-route")
list_manifests("bookinfo", "v1", "configmaps", "app=reviews", 10)

Notes:

  • api_version supports both core (v1) and grouped (group/version) formats.
  • resource must be a plural resource name (for example: pods, services, virtualservices).
  • For security and readability, secret values are masked and status/metadata.managedFields are omitted in get_manifest responses.
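A minimal sketch of the masking behaviour described in the last note, assuming a helper shaped like this (the function name and exact field handling are hypothetical; the real implementation may differ):

```python
def sanitize_manifest(manifest: dict) -> dict:
    """Drop status and metadata.managedFields; mask Secret data values."""
    out = {k: v for k, v in manifest.items() if k != "status"}
    meta = dict(out.get("metadata", {}))
    meta.pop("managedFields", None)  # server-side bookkeeping, noise for RCA
    out["metadata"] = meta
    if out.get("kind") == "Secret":
        out["data"] = {k: "***" for k in out.get("data", {})}
    return out
```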

Configuration

Environment Variables

Variable Description Default
AI_PROVIDER LLM provider (gemini, openai, anthropic) gemini
GEMINI_API_KEY Gemini API key for Strands Agents -
OPENAI_API_KEY OpenAI API key for Strands Agents -
ANTHROPIC_API_KEY Anthropic API key for Strands Agents -
GEMINI_MODEL_ID Gemini model ID gemini-3-flash-preview
OPENAI_MODEL_ID OpenAI model ID gpt-4o
ANTHROPIC_MODEL_ID Anthropic model ID claude-sonnet-4-20250514
PROMETHEUS_URL Prometheus base URL - (disabled)
LOG_LEVEL Logging level info
WEB_CONCURRENCY Uvicorn worker count 1

Kubernetes Context

Variable Description Default
K8S_API_TIMEOUT_SECONDS K8s API timeout 5
K8S_EVENT_LIMIT Max events to fetch 25
K8S_LOG_TAIL_LINES Log lines to fetch 25

Prometheus

Variable Description Default
PROMETHEUS_URL Prometheus base URL -
PROMETHEUS_HTTP_TIMEOUT_SECONDS HTTP timeout 5
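For reference, Prometheus exposes instant queries at `/api/v1/query`; a request built from `PROMETHEUS_URL` and the timeout above might look like this (helper name is illustrative):

```python
def build_prom_request(base_url: str, promql: str, timeout_s: int = 5) -> tuple[str, dict]:
    """Return (url, params) for Prometheus's instant-query HTTP API."""
    url = f"{base_url.rstrip('/')}/api/v1/query"
    return url, {"query": promql, "timeout": f"{timeout_s}s"}
```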

Tempo (APM Traces)

Variable Description Default
TEMPO_URL Tempo base URL (e.g. http://tempo.monitoring.svc:3100) -
TEMPO_HTTP_TIMEOUT_SECONDS Tempo HTTP timeout 10
TEMPO_TENANT_ID Tempo tenant header value (X-Scope-OrgID) -
TEMPO_TRACE_LIMIT Max traces fetched per alert 5
TEMPO_LOOKBACK_MINUTES Minutes before startsAt for trace search window 15
TEMPO_FORWARD_MINUTES Minutes after startsAt for trace search window 5
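The lookback/forward settings define a search window around the alert's `startsAt`. A sketch of that computation, assuming epoch-seconds bounds (the helper name is hypothetical):

```python
from datetime import datetime, timedelta

def trace_search_window(starts_at: str, lookback_min: int = 15, forward_min: int = 5) -> tuple[int, int]:
    """Turn an alert's startsAt into a (start, end) unix-epoch range for trace search."""
    t = datetime.fromisoformat(starts_at.replace("Z", "+00:00"))
    start = int((t - timedelta(minutes=lookback_min)).timestamp())
    end = int((t + timedelta(minutes=forward_min)).timestamp())
    return start, end
```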

Prompt Configuration

Variable Description Default
PROMPT_TOKEN_BUDGET Approximate token budget 32000
PROMPT_MAX_LOG_LINES Max log lines in prompt 25
PROMPT_MAX_EVENTS Max events in prompt 25
PROMPT_SUMMARY_MAX_ITEMS Max session summaries 3
MASKING_REGEX_LIST_JSON JSON array of regex patterns for masking before LLM/DB response flows []
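`MASKING_REGEX_LIST_JSON` is a JSON array of patterns applied to text before it reaches the LLM or the database. A minimal sketch of how such a list could be applied (the function and the `***` placeholder are assumptions):

```python
import json
import re

def apply_masks(text: str, masking_regex_list_json: str = "[]") -> str:
    """Replace every match of each configured pattern before the text leaves the service."""
    for pattern in json.loads(masking_regex_list_json):
        text = re.sub(pattern, "***", text)
    return text
```

For example, setting the variable to `["token=\\w+"]` would mask bearer-token-style values in collected logs.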

LLM Retry

Variable Description Default
LLM_RETRY_MAX_ATTEMPTS Max retry attempts for transient LLM API errors (5xx, 429) 5
LLM_RETRY_MIN_WAIT Minimum backoff wait time in seconds 1.0
LLM_RETRY_MAX_WAIT Maximum backoff wait time in seconds 60.0
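These three settings describe a standard capped exponential backoff. A sketch of the wait-time curve they imply (illustrative; the actual retry implementation may add jitter or use a library):

```python
def backoff_wait(attempt: int, min_wait: float = 1.0, max_wait: float = 60.0) -> float:
    """Exponential backoff doubling from min_wait, capped at max_wait (attempt is 1-based)."""
    return min(max_wait, min_wait * 2 ** (attempt - 1))
```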

Session Storage (Required when LLM provider key is set)

Variable Description
SESSION_DB_HOST PostgreSQL host
SESSION_DB_PORT PostgreSQL port
SESSION_DB_NAME Database name
SESSION_DB_USER Database user
SESSION_DB_PASSWORD Database password

If any of GEMINI_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY is set, configure the SESSION_DB_* variables as well.


Project Structure

agent/
├── app/
│   ├── main.py                # FastAPI entrypoint
│   ├── api/
│   │   ├── analysis.py        # POST /analyze, POST /summarize-incident
│   │   └── health.py          # GET /, /ping, /healthz
│   ├── clients/
│   │   ├── k8s.py
│   │   ├── prometheus.py
│   │   ├── tempo.py
│   │   ├── session_repository.py
│   │   ├── summary_store.py
│   │   ├── strands_agent.py
│   │   ├── strands_patch.py
│   │   └── llm_providers/
│   ├── core/
│   │   ├── config.py
│   │   ├── dependencies.py
│   │   └── logging.py
│   ├── models/
│   ├── schemas/
│   │   ├── alert.py
│   │   └── analysis.py
│   └── services/analysis.py
├── docs/openapi.json
├── scripts/export_openapi.py
├── tests/
├── Dockerfile
├── Makefile
└── pyproject.toml

Development

Makefile Commands

Command Description
make install Install dependencies
make run Run development server
make lint Run ruff linter
make format Format code with ruff
make test Run pytest
make build IMAGE=<tag> Build Docker image
make curl-analyze Test analyze endpoint
make curl-analyze-local Test with local server

Export OpenAPI Spec

When the API changes, regenerate the OpenAPI spec:

uv run python scripts/export_openapi.py

The spec is saved to docs/openapi.json.

Git Hooks (Optional)

Auto-regenerate OpenAPI on commit:

git config core.hooksPath .githooks

Contributing

Merge Policy (release-please)

release-please parses Conventional Commits; a merge commit that repeats the PR title can cause the same change to be double-counted in the changelog.

  • Prefer Squash and merge or Rebase and merge.
  • If Create a merge commit is used, keep the PR title non-conventional (e.g., "Merge PR #123").
  • Use Conventional Commits for change commits that should appear in the changelog.

Testing

Unit Tests

make test
# or:
pytest tests/

Local Integration Test

Requires a Kubernetes cluster and provider API key:

AI_PROVIDER=gemini GEMINI_API_KEY=xxx KUBECONFIG=~/.kube/config make test-analysis-local

Manual API Test

curl -X POST http://localhost:8000/analyze \
  -H 'Content-Type: application/json' \
  -d '{
    "alert": {
      "status": "firing",
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning",
        "namespace": "default",
        "pod": "example-pod"
      },
      "annotations": {
        "summary": "Test summary",
        "description": "Test description"
      },
      "startsAt": "2024-01-01T00:00:00Z",
      "fingerprint": "test-fingerprint"
    },
    "thread_ts": "test-thread"
  }'

Docker

Build Image

docker build -t kube-rca-agent .
# or:
make build IMAGE=kube-rca-agent

Run Container

docker run -d -p 8000:8000 \
  -e GEMINI_API_KEY=your-api-key \
  kube-rca-agent

OOMKilled Test Scenario

Test the agent with a real OOMKilled scenario in Kubernetes.

Prerequisites

  • kubectl with cluster access
  • kube-rca namespace exists

Run Test

# Create OOM pod only
make test-oom-only

# Full test with analysis
GEMINI_API_KEY=xxx make test-analysis-local

# Cleanup
make cleanup-oom

Environment Variables

Variable Description Default
KUBE_CONTEXT Kubernetes context current
LOCAL_OOM_NAMESPACE Test namespace kube-rca
LOCAL_OOM_DEPLOYMENT Deployment name oomkilled-test
LOCAL_OOM_MEMORY_LIMIT Memory limit 64Mi
CLEANUP Auto-cleanup after test false

Fallback Behavior

When no provider API key (GEMINI_API_KEY/OPENAI_API_KEY/ANTHROPIC_API_KEY) is set, the agent returns a fallback summary:

{
  "status": "ok",
  "analysis": {
    "summary": "Alert received but AI analysis unavailable",
    "detail": "Basic alert information..."
  }
}

Related Components


License

This project is part of KubeRCA, licensed under the MIT License. See the LICENSE file for details.
