You've been there. 3 AM. PagerDuty fires. You stumble to your laptop, squint at a graph, dig through logs, cross-reference a recent deployment, form a hypothesis, write a Slack message, wait for approval, apply a fix, then spend an hour writing a post-mortem that nobody reads.
Every single time.
IRAS does all of that — automatically, in under 2 minutes — and only wakes you up to press Approve.
No Slack token? No PagerDuty key? No problem. Everything falls back to mock clients. You only need an Anthropic API key and Docker.
```bash
# 1. Clone
git clone https://github.com/krishnashakula/IRAS.git && cd IRAS

# 2. Start Postgres
docker run -d --name iras-postgres \
  -e POSTGRES_USER=iras -e POSTGRES_PASSWORD=secret -e POSTGRES_DB=iras \
  -p 5432:5432 postgres:16

# 3. Install
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# 4. Configure (only two fields required)
cp .env.example .env
# Set ANTHROPIC_API_KEY and POSTGRES_URL

# 5. Run
python run.py
```

```
INFO IRAS graph compiled and ready
INFO Uvicorn running on http://0.0.0.0:8000
```
```bash
# Fire an alert
curl -X POST http://localhost:8000/webhook/alert \
  -H "Content-Type: application/json" \
  -d '{
    "title": "High error rate on payment-service",
    "timestamp": "2026-05-03T10:30:00Z",
    "service": "payment-service",
    "error_rate": 0.45
  }'
# → {"incident_id": "550e8400-...", "status": "processing"}

# Watch it work in your terminal, then approve
curl -X POST http://localhost:8000/incidents/550e8400-.../approve
```

IRAS runs a 9-node LangGraph state machine. Each stage produces a typed Pydantic model — no raw strings, no prompt output you have to parse.
```
Alert → Triage → Context → RCA → Plan → [YOU] → Apply → Post-mortem
                    ↑       ↓
                    └───────┘  retry if confidence < 0.7
```
Stage 1 — Ingest: Any JSON webhook with title + timestamp is accepted. PagerDuty, Prometheus AlertManager, Datadog, Grafana — or a raw curl. Extra fields pass straight through to the AI.
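That permissive contract can be sketched in a few lines (illustrative only; the real validation lives in the ingestion node):

```python
# Sketch of the ingest contract: only title and timestamp are required;
# every other field rides along untouched for the AI agents to use.
def validate_alert(payload: dict) -> dict:
    missing = [k for k in ("title", "timestamp") if k not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    known = {"title", "timestamp"}
    return {
        "title": payload["title"],
        "timestamp": payload["timestamp"],
        # extra fields pass straight through as context
        "extra": {k: v for k, v in payload.items() if k not in known},
    }

alert = validate_alert({
    "title": "High error rate on payment-service",
    "timestamp": "2026-05-03T10:30:00Z",
    "error_rate": 0.45,
})
```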
Stage 2 — Triage (Claude Haiku): P0–P3 severity, affected services, estimated blast radius, confidence score. Fast and cheap — Haiku is used here intentionally.
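As a rough illustration, the triage output could be modeled like this simplified stand-in (the project uses Pydantic models; the field names here are inferred from the stage description, not copied from the repo):

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    # Hypothetical stand-in for the project's Pydantic model;
    # fields mirror the stage description, not the actual source.
    severity: str                  # "P0".."P3"
    affected_services: list[str]   # services in the blast radius
    blast_radius: str              # human-readable estimate
    confidence: float              # 0.0 - 1.0

    def __post_init__(self) -> None:
        # Validation rejects values the model should never produce.
        if self.severity not in {"P0", "P1", "P2", "P3"}:
            raise ValueError(f"invalid severity: {self.severity}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")

result = TriageResult("P1", ["payment-service"], "checkout flow", 0.9)
```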
Stage 3 — Context Gathering (Claude Haiku + tool calls): Three parallel tool calls:
- `fetch_logs` → error/warning lines from Elasticsearch or Loki
- `fetch_metrics` → current vs. 7-day baseline from Prometheus
- `fetch_deployments` → recent GitHub Deployments for the affected service
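The fan-out shape, where one failing tool degrades gracefully instead of sinking the others, might look like this sketch (stub tools and names are illustrative):

```python
import asyncio

# Stubs standing in for the real Elasticsearch/Prometheus/GitHub clients.
async def fetch_logs(service): return ["ERROR timeout in checkout"]
async def fetch_metrics(service): return {"error_rate": 0.45, "baseline": 0.01}
async def fetch_deployments(service): raise RuntimeError("GitHub unreachable")

async def gather_context(service: str) -> dict:
    results = await asyncio.gather(
        fetch_logs(service),
        fetch_metrics(service),
        fetch_deployments(service),
        return_exceptions=True,  # one failing tool must not sink the others
    )
    keys = ["logs", "metrics", "deployments"]
    return {k: (None if isinstance(v, Exception) else v)
            for k, v in zip(keys, results)}

context = asyncio.run(gather_context("payment-service"))
# deployments is None because that tool failed; logs and metrics survive
```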
Stage 4 — Root Cause Analysis (Claude Sonnet): Produces a `RootCauseHypothesis` with `primary_cause`, `contributing_factors`, `evidence` (specific log lines), and a confidence score.
Confidence gate: if the score is below 0.7, the graph loops back to context gathering for a broader evidence window. After `RCA_MAX_ATTEMPTS`, it escalates automatically.
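A minimal sketch of that routing decision (function and key names are illustrative, not the repo's actual signatures):

```python
RCA_CONFIDENCE_THRESHOLD = 0.7  # documented default
RCA_MAX_ATTEMPTS = 3            # documented default

def route_after_rca(state: dict) -> str:
    """Pick the next graph node based on RCA confidence and retry budget."""
    if state["confidence"] >= RCA_CONFIDENCE_THRESHOLD:
        return "generate_plan"
    if state["attempts"] < RCA_MAX_ATTEMPTS:
        return "context_gathering"  # loop back for a broader evidence window
    return "escalation"             # attempts exhausted
```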
Stage 5 — Remediation Planning (Claude Sonnet): Ordered steps with human-readable descriptions, exact rollback commands, risk levels, and estimated durations.
Stage 6 — Human Approval (you): LangGraph's interrupt() pauses the graph. State is checkpointed to PostgreSQL — the server can restart and the incident survives. You get a Slack message with Approve / Reject buttons, or hit the API directly.
Stage 7 — Apply Remediation: Steps execute sequentially. On failure, completed steps roll back in reverse order using their stored `rollback_command`.
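The reverse-order rollback can be sketched like this (an illustrative executor, not the project's actual code):

```python
def apply_with_rollback(steps, execute, rollback) -> bool:
    """Run steps in order; on failure, undo completed steps newest-first."""
    completed = []
    for step in steps:
        try:
            execute(step)
            completed.append(step)
        except Exception:
            for done in reversed(completed):  # undo in reverse order
                rollback(done)
            return False
    return True

log = []

def execute(step):
    if step == "restart-pods":
        raise RuntimeError("step failed")  # simulate a mid-plan failure
    log.append(("run", step))

ok = apply_with_rollback(
    ["scale-up", "flush-cache", "restart-pods"],
    execute=execute,
    rollback=lambda s: log.append(("undo", s)),
)
```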
Stage 8 — Post-mortem (Claude Sonnet): Timeline, root cause, resolution, action items. Written regardless of outcome — resolved or escalated. Stored in PostgreSQL, posted to Slack.
Most AI agent projects trust the model's output at face value. IRAS doesn't.
Safety invariants enforced in code, not prompts:
```python
# The model cannot generate an unsafe plan that bypasses approval.
# These checks run regardless of what the model returns.
if any(step.risk_level == "high" for step in plan.steps):
    plan.requires_human_approval = True  # forced

if any(not step.rollback_command.strip() for step in plan.steps):
    plan.reversible = False  # forced
    plan.requires_human_approval = True  # forced
```

292 tests, 99% coverage — including adversarial scenarios specifically designed to test model misbehavior:
- Model lies about `risk_level` → safety override catches it
- Model returns empty `rollback_command` → plan blocked
- All context tools fail simultaneously → graceful degradation, not a crash
- 20 concurrent incidents → zero state contamination
- Unicode, XSS, 10,000-character payloads → handled cleanly
- Model confidence never reaches threshold → automatic PagerDuty escalation
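In the same spirit, an adversarial test might look like this self-contained sketch (class and test names are invented for illustration, not copied from tests/stress/):

```python
from dataclasses import dataclass

@dataclass
class Step:
    risk_level: str
    rollback_command: str

@dataclass
class Plan:
    steps: list
    requires_human_approval: bool = False
    reversible: bool = True

def enforce_safety(plan: Plan) -> Plan:
    # The same invariants as above, applied no matter what the model said.
    if any(s.risk_level == "high" for s in plan.steps):
        plan.requires_human_approval = True
    if any(not s.rollback_command.strip() for s in plan.steps):
        plan.reversible = False
        plan.requires_human_approval = True
    return plan

def test_model_lies_about_risk():
    plan = Plan(steps=[Step("high", "kubectl rollout undo deploy/x")])
    plan.requires_human_approval = False  # model tried to skip approval
    assert enforce_safety(plan).requires_human_approval

def test_empty_rollback_blocks_plan():
    plan = Plan(steps=[Step("low", "   ")])  # whitespace-only rollback
    enforce_safety(plan)
    assert not plan.reversible and plan.requires_human_approval
```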
```bash
pytest -q                                 # 292 tests, ~30s
pytest tests/stress/ -v --no-cov          # adversarial scenarios
pytest --cov=src/iras --cov-report=html   # 99% coverage report
```

```mermaid
graph TB
    subgraph Sources["Alert Sources"]
        PD[PagerDuty]
        PROM[Prometheus AlertManager]
        DD[Datadog / Grafana]
        ANY[Any JSON Webhook]
    end

    subgraph API["FastAPI REST API"]
        WH["POST /webhook/alert"]
        APR["POST /incidents/{id}/approve"]
        REJ["POST /incidents/{id}/reject"]
        HLT["GET /health"]
    end

    subgraph Graph["LangGraph State Machine"]
        ING[Ingestion]
        TRI["Triage Agent\nClaude Haiku"]
        CTX["Context Gathering\nClaude Haiku"]
        RCA["RCA Agent\nClaude Sonnet"]
        GEN["Generate Plan\nClaude Sonnet"]
        APP["Approval\n⏸ interrupt"]
        REM[Apply Remediation]
        ESC[Escalation]
        PM["Post-mortem Agent\nClaude Sonnet"]
    end

    subgraph Integrations["External Integrations"]
        SL[Slack]
        PG2[PagerDuty]
        LOGS["Elasticsearch / Loki"]
        METRICS[Prometheus]
        DEPLOY[GitHub Deployments]
        DB[(PostgreSQL)]
    end

    Sources --> WH
    WH --> ING
    ING --> TRI --> CTX --> RCA
    RCA -->|"conf < 0.7, attempts < max"| CTX
    RCA -->|"conf >= 0.7"| GEN
    RCA -->|"attempts exhausted"| ESC
    GEN --> APP
    APP -->|approved| REM
    APP -->|rejected| ESC
    REM --> PM
    ESC --> PM
    PM --> DB
    PM --> SL
    APR --> APP
    REJ --> APP
    CTX --> LOGS & METRICS & DEPLOY
    ESC --> PG2 & SL
    GEN --> SL
    Graph --> DB
```
The most technically interesting part of IRAS is how it handles the human-in-the-loop approval step.
Most agent frameworks fake this with polling or timeouts. LangGraph's interrupt() is different: the graph genuinely pauses mid-execution, serializes its entire state to PostgreSQL, and resumes from exactly that point when the human responds — even across server restarts, deployments, or process crashes.
```python
# The graph pauses here. State is in Postgres.
# The server can restart. The incident is safe.
human_decision = interrupt({"message": "Approve remediation plan?"})

# Resumes here when POST /incidents/{id}/approve is called.
if human_decision["approved"]:
    return apply_remediation(state)
else:
    return escalate(state)
```

This is why IRAS is built on LangGraph and not a simpler framework. Durable execution matters for production incident response.
```mermaid
sequenceDiagram
    actor Monitor as Monitoring System
    participant API as FastAPI
    participant Graph as LangGraph
    participant Claude as Claude (Anthropic)
    participant Tools as External Tools
    participant DB as PostgreSQL
    participant Slack as Slack
    actor Human as On-Call Engineer

    Monitor->>API: POST /webhook/alert
    API-->>Monitor: 202 {"incident_id": "abc123"}
    API->>Graph: ainvoke(state, thread_id="abc123")
    Graph->>Graph: ingestion — validate + init state
    Graph->>Claude: triage_agent
    Claude-->>Graph: TriageResult {severity: P1, confidence: 0.9}
    Graph->>Claude: context_agent (tool-calling)
    Claude->>Tools: fetch_logs() + fetch_metrics() + fetch_deployments()
    Tools-->>Claude: raw evidence
    Claude-->>Graph: ContextBundle {logs, metrics, deployments}
    Graph->>Claude: rca_agent
    Claude-->>Graph: RootCauseHypothesis {confidence: 0.88} ✓
    Graph->>Claude: remediation_agent
    Claude-->>Graph: RemediationPlan {3 steps + rollbacks}
    Graph->>Slack: Post approval with [Approve] [Reject]
    Note over Graph,DB: interrupt() — graph pauses, state checkpointed to PostgreSQL
    Human->>API: POST /incidents/abc123/approve
    API->>Graph: Command(resume={"approved": True})
    Graph->>Graph: apply_remediation — execute all steps
    Graph->>Claude: postmortem_agent
    Claude-->>Graph: PostMortem {timeline, root_cause, action_items}
    Graph->>DB: INSERT INTO postmortems
    Graph->>Slack: Post post-mortem summary
```
```
IRAS/
├── src/iras/
│   ├── api/
│   │   ├── app.py                 # FastAPI lifespan: init checkpointer → build graph
│   │   ├── background.py          # Approval timeout monitor
│   │   └── routes/
│   │       ├── webhook.py         # POST /webhook/alert
│   │       └── approval.py        # POST /incidents/{id}/approve|reject
│   │
│   ├── graph/
│   │   ├── builder.py             # Wire 9 nodes + conditional edges → compile
│   │   ├── checkpointer.py        # AsyncPostgresSaver (singleton + asyncio.Lock)
│   │   ├── state.py               # IncidentState TypedDict
│   │   └── nodes/
│   │       ├── ingestion.py
│   │       ├── triage.py              # → Claude Haiku
│   │       ├── context_gathering.py   # → Claude Haiku + tool calls
│   │       ├── rca.py                 # → Claude Sonnet + retry routing
│   │       ├── generate_plan.py       # → Claude Sonnet + Slack notify
│   │       ├── approval.py            # interrupt() — durable human checkpoint
│   │       ├── apply_remediation.py   # Execute steps + rollback on failure
│   │       ├── escalation.py          # PagerDuty + Slack
│   │       └── postmortem.py          # → Claude Sonnet + persist to DB
│   │
│   ├── agents/              # One Pydantic AI agent per stage
│   ├── models/              # TriageResult · ContextBundle · RootCauseHypothesis
│   │                        # RemediationPlan · RemediationStep · PostMortem
│   ├── tools/               # Elasticsearch · Loki · Prometheus · GitHub · Slack · PagerDuty
│   └── config/settings.py   # Pydantic Settings — reads .env
│
├── tests/
│   ├── unit/          # Fully mocked
│   ├── integration/   # Live service tests (opt-in)
│   ├── e2e/           # Full graph runs with MemorySaver
│   └── stress/        # 47 adversarial + real-world scenarios
```
| Severity | Meaning | Approval window |
|---|---|---|
| P0 | Complete outage | 15 minutes, then auto-escalate |
| P1 | Major degradation | 2 hours |
| P2 | Partial degradation | 2 hours |
| P3 | Warning / informational | 2 hours |
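The lookup implied by the table reduces to a few lines (a sketch using the documented defaults: 15 minutes for P0, 120 minutes otherwise):

```python
# Severity-to-approval-window mapping from the table above.
APPROVAL_TIMEOUT_MINUTES = {"P0": 15}
DEFAULT_TIMEOUT_MINUTES = 120

def approval_timeout(severity: str) -> int:
    """Minutes to wait for human approval before auto-escalating."""
    return APPROVAL_TIMEOUT_MINUTES.get(severity, DEFAULT_TIMEOUT_MINUTES)
```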
Escalation is triggered when: RCA confidence never reaches threshold after max retries · Human rejects the plan · Approval timeout expires.
On escalation: idempotent PagerDuty incident fires + structured Slack message with full context. Post-mortem always runs — resolved or not.
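Idempotency here can lean on PagerDuty's Events API v2 `dedup_key`: sending the same key twice coalesces into one incident instead of paging twice. A sketch of the event payload (the wrapper function is invented for illustration; the payload fields follow the public Events API):

```python
def build_escalation_event(incident_id: str, summary: str) -> dict:
    """Build a PagerDuty Events API v2 trigger payload for one incident."""
    return {
        "routing_key": "<PAGERDUTY_INTEGRATION_KEY>",  # placeholder
        "event_action": "trigger",
        # Same incident -> same dedup_key -> PagerDuty coalesces, no double page
        "dedup_key": f"iras-{incident_id}",
        "payload": {
            "summary": summary,
            "source": "iras",
            "severity": "critical",
        },
    }

event = build_escalation_event("abc123", "RCA confidence never reached threshold")
```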
```bash
cp .env.example .env
```

| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | ✅ | Claude API key (`sk-ant-...`) |
| `POSTGRES_URL` | ✅ | `postgresql://user:pass@host:5432/db` |
| `SLACK_BOT_TOKEN` | ⬜ | Falls back to mock client if unset |
| `SLACK_ONCALL_CHANNEL_ID` | ⬜ | Slack channel for on-call alerts |
| `PAGERDUTY_INTEGRATION_KEY` | ⬜ | Falls back to mock client if unset |
| `PROMETHEUS_BASE_URL` | ⬜ | Falls back to mock client if unset |
| `ELASTICSEARCH_BASE_URL` | ⬜ | Pick one log backend |
| `LOKI_BASE_URL` | ⬜ | Pick one log backend |
| `LANGSMITH_API_KEY` | ⬜ | LangSmith graph tracing |
| `LOGFIRE_TOKEN` | ⬜ | Logfire agent tracing |
| `RCA_CONFIDENCE_THRESHOLD` | ⬜ | Default: `0.7` |
| `RCA_MAX_ATTEMPTS` | ⬜ | Default: `3` |
| `APPROVAL_TIMEOUT_P0_MINUTES` | ⬜ | Default: `15` |
| `APPROVAL_TIMEOUT_DEFAULT_MINUTES` | ⬜ | Default: `120` |

All integrations fall back to mock clients when tokens are absent. IRAS runs fully end-to-end with only `ANTHROPIC_API_KEY` + `POSTGRES_URL`.
Accepts any JSON with `title` + `timestamp`. All extra fields pass through to the AI agents.

Request:

```json
{ "title": "High error rate on payment-service", "timestamp": "2026-05-03T10:30:00Z" }
```

Response:

```json
{ "incident_id": "550e8400-...", "status": "processing" }
```

Resumes the paused graph. Approve routes to remediation. Reject routes to PagerDuty escalation.
```json
{ "status": "ok", "env": "development" }
```

| Signal | Tool | What it covers |
|---|---|---|
| Graph traces | LangSmith | Every node: inputs, outputs, timing, token usage |
| Agent traces | Logfire | Every LLM call: prompt, response, tool calls, validation |
| Structured logs | Python `logging` | Every node emits `incident_id`, `node_name`, `timestamp` |
| Post-mortems | PostgreSQL | Full records queryable by severity, duration, outcome |
```yaml
# docker-compose.yml
services:
  iras:
    build: .
    ports: ["8000:8000"]
    env_file: .env
    depends_on:
      postgres:
        condition: service_healthy
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: iras
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: iras
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U iras"]
      interval: 5s
      retries: 5
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```

```bash
docker compose up -d
```

Production checklist:
- Add auth to `/approve` and `/reject` (Slack request signing or OAuth)
- Set `APP_ENV=production`
- Configure real Slack + PagerDuty tokens
- Enable LangSmith + Logfire
- Add a reverse proxy (nginx / Caddy) with TLS
- Set up PgBouncer for Postgres connection pooling
Add a context tool: Implement a client in `src/iras/tools/` with a `MockXClient` fallback → add to `ContextDeps` → register `@context_agent.tool`.
Swap models per agent: Each agent instantiates its own `pydantic_ai.Agent`.

```python
rca_agent = Agent(model="claude-opus-4-5", ...)      # higher accuracy
triage_agent = Agent(model="claude-haiku-3-5", ...)  # faster/cheaper
```

Add a notification backend: Both `escalation_node` and `postmortem_node` accept injectable deps. Implement `post_message` / `trigger_incident` and swap it in.
- Fork and create a feature branch
- Run `pytest` — all 292 tests must pass
- Keep coverage above 98%: `pytest --cov=src/iras --cov-fail-under=98`
- Open a pull request
MIT — see LICENSE
Built with LangGraph · Pydantic AI · FastAPI · Claude
If IRAS handled your 3 AM incident, give it a ⭐