Incident Response Runbook
Status: v2 (S9.T01 — P1/P2/P3 SLAs + PagerDuty + full war-room protocol)
Canonical URL: .claude/docs/operations/incident-response-runbook.md
Cross-ref: incident-response.md (SEV definitions + decision tree), bus-factor.md, postmortem-template.md, pagerduty-setup.md, runbooks/ (per-scenario)
nSelf uses five priority levels, P0 through P4; P0 through P3 also carry the SEV1 through SEV4 labels. P1/P2/P3 map to the most common operator response patterns.
| Level | Alias | Threshold | Ack SLA | Mitigate SLA | Resolve SLA | Postmortem? | Status Page? |
|---|---|---|---|---|---|---|---|
| P0 / SEV1 | Total outage | Production down, data loss risk, or confirmed breach. >10% users blocked. | 5 min | 60 min | 4 hours | Required (48h) | YES — every 15 min |
| P1 / SEV2 | Partial outage | Single-customer or partial outage. <10% users, or one critical-tier customer blocked. Auth/billing broken. | 1 hour | 4 hours | 8 hours | Required (48h) | YES — on detect/mitigate/resolve |
| P2 / SEV3 | Degraded | User-visible degradation but workaround exists. Feature unavailable, slow response, non-critical plugin offline. | 4 hours | 24 hours | 48 hours | Optional | Optional |
| P3 / SEV4 | Monitoring | No user impact observed. Issue visible in monitoring only. Cert expiry >14 days, log volume warning. | Next business day | Next sprint | Next sprint | No | No |
| P4 | Info | Capacity warning, informational signal. No action needed. | — | — | — | No | No |
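The SLA columns above can be turned into a small lookup so a deadline is computed rather than remembered. This is a minimal sketch, not part of the nself CLI: the function names `sla_minutes` and `sla_deadline_epoch` are hypothetical, and times are handled as epoch seconds to stay portable across `date` implementations.

```shell
# sla_minutes LEVEL KIND -> SLA in minutes, taken from the severity table.
# KIND is one of: ack, mitigate, resolve.
# P3 targets ("next business day", "next sprint") are calendar-based and
# P4 has none, so anything outside P0..P2 prints "n/a" and returns 1.
sla_minutes() {
  case "$1:$2" in
    P0:ack)      echo 5 ;;
    P0:mitigate) echo 60 ;;
    P0:resolve)  echo 240 ;;
    P1:ack)      echo 60 ;;
    P1:mitigate) echo 240 ;;
    P1:resolve)  echo 480 ;;
    P2:ack)      echo 240 ;;
    P2:mitigate) echo 1440 ;;
    P2:resolve)  echo 2880 ;;
    *)           echo "n/a"; return 1 ;;
  esac
}

# sla_deadline_epoch LEVEL KIND DETECTED_EPOCH -> deadline in epoch seconds.
sla_deadline_epoch() {
  sla_min=$(sla_minutes "$1" "$2") || return 1
  echo $(( $3 + sla_min * 60 ))
}
```

For example, `sla_deadline_epoch P0 ack "$(date -u +%s)"` prints the epoch second by which a P0 detected right now must be acknowledged.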
- Auth broken for ANY user → P1 minimum.
- Payment/billing broken → P1 minimum.
- Any credential or secret exposed → P0 regardless of user count.
- If uncertain between P1 and P2: pick P1 and downgrade after triage.
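The floor rules above are mechanical enough to encode. The sketch below assumes a hypothetical helper, `enforce_floor`, that takes the level you would otherwise pick plus three yes/no flags; it is an illustration of the rules, not an existing nself command.

```shell
# enforce_floor PROPOSED AUTH_BROKEN BILLING_BROKEN SECRET_EXPOSED
# PROPOSED is P0..P4; the three flags are "yes" or "no".
# Prints the level after applying the floor rules listed above.
enforce_floor() {
  level_num=${1#P}            # "P2" -> 2 (lower number = more severe)
  if [ "$4" = "yes" ]; then
    level_num=0               # exposed secret: P0 regardless of user count
  elif [ "$2" = "yes" ] || [ "$3" = "yes" ]; then
    # auth or billing broken: P1 minimum, but never downgrade a P0
    if [ "$level_num" -gt 1 ]; then
      level_num=1
    fi
  fi
  echo "P$level_num"
}
```

`enforce_floor P2 yes no no` prints `P1`, and `enforce_floor P3 no no yes` prints `P0`.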
nSelf is a single-maintainer ecosystem. Primary on-call is aric.camarata@gmail.com (24/7). Backup on-call: per bus-factor.md (USER DECISION PENDING — nomination required for 9 critical services).
Via nself-incident-mgmt plugin UI (preferred when running):
- Open `https://your-nself-instance/incidents` or `localhost:3833/incidents`
- Click "Acknowledge" on the triggered alert
- Fill in: IC name, initial hypothesis, war-room channel name
- Set status to "Investigating"
Via Slack (fallback when UI unavailable):
Post to #incidents within ack SLA:
:ack: P{N} ack — {one-line summary} — IC: {your name} — {HH:MM UTC}
Opening war room: #incident-{YYYY-MM-DD}-{slug}
Via PagerDuty (if wired — see pagerduty-setup.md):
- Accept the page from the PagerDuty mobile app or email
- Post ack note in the PagerDuty incident: "Investigating — IC: {name}"
- PagerDuty ack automatically updates nself-incident-mgmt via webhook
Every action during the incident gets a timestamped entry. Minimum data per entry:
{HH:MM UTC} | {who} | {what} | {outcome/finding}
Examples:
14:23 UTC | Aric | Checked Hasura logs — confirmed query timeout spike | 500 errors since 14:17
14:28 UTC | Aric | Restarted Hasura container | Errors clearing
14:35 UTC | Aric | Confirmed resolution — no new errors for 5 min | Mitigated
Update the timeline in nself-incident-mgmt UI or in the war-room Slack channel pinned message.
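When the UI is unavailable, a one-line helper keeps entries in the required shape. This is a sketch only: `timeline_entry` is a hypothetical function, and where the line ends up (a local file, a pasted Slack message) is left to the operator.

```shell
# timeline_entry WHO WHAT OUTCOME [HH:MM]
# Prints one line as "{HH:MM UTC} | {who} | {what} | {outcome/finding}".
# The optional fourth argument overrides the clock, useful when backfilling.
timeline_entry() {
  entry_time=${4:-$(date -u +%H:%M)}
  printf '%s UTC | %s | %s | %s\n' "$entry_time" "$1" "$2" "$3"
}
```

`timeline_entry Aric "Restarted Hasura container" "Errors clearing" 14:28` reproduces the second example line above.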
| Severity | Notify | Channel | Method | Timing |
|---|---|---|---|---|
| P0 | Primary on-call + all engineers | #incident-{slug} | PagerDuty page + Slack | Immediately on detection |
| P1 | Primary on-call | #incidents + #incident-{slug} | PagerDuty or Slack push | Within ack SLA |
| P2 | Primary on-call | #incidents | Slack | Within ack SLA |
| P3 | Primary on-call | #ops-low | Slack message | Business hours |
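The channel column of the notification matrix can be expressed as a lookup, which helps when scripting announcements. A minimal sketch, assuming a hypothetical `notify_channels` helper; it only prints channel names and does not post anything.

```shell
# notify_channels LEVEL SLUG -> Slack channels to post in, one per line.
notify_channels() {
  case "$1" in
    P0) echo "#incident-$2" ;;
    P1) echo "#incidents"; echo "#incident-$2" ;;
    P2) echo "#incidents" ;;
    P3) echo "#ops-low" ;;
    *)  return 1 ;;          # P4 and unknown levels: nobody is notified
  esac
}
```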
| Severity | Status Page | Customer Email | Timing |
|---|---|---|---|
| P0 | Post immediately on detection | Send within 30 min of mitigation | Update every 15 min until resolved |
| P1 | Post on detection | Send within 1h of resolution | Update on detect/mitigate/resolve |
| P2 | Optional | No (unless customer contacts support) | Single update on resolution |
| P3 | No | No | — |
Initial post:
[Investigating] {service} — {one-line symptom} — {start-time UTC}
We are investigating reports of {symptom}. Affected users may see {impact}.
Next update: {now + 15 min}.
Identified:
[Identified] Root cause located — fix in progress.
We identified {cause}. Rollout is underway; impact should clear by {ETA UTC}.
Next update: {now + 15 min}.
Resolved:
[Resolved] Service restored at {time UTC}. Duration: {N} min.
{service} is fully restored. Post-mortem within 48h at {link}.
Subject: nSelf service incident — {one-line summary}
Hi {name or "there"},
We had an issue affecting {service} from {start UTC} to {end UTC} ({N} min).
What happened: {plain-English explanation — no jargon}.
Impact on you: {specific impact if known, else "some requests may have failed"}.
What we did: {plain-English mitigation steps}.
What's next: post-mortem within 48h; we'll send it your way.
If you noticed anything we missed, reply here.
— Aric, nSelf
(Tone rule: human, no "we apologize for the inconvenience" — see GCI Outbound Human Correspondence.)
nSelf uses PagerDuty in stub mode — routing is configured but single-escalation for now.
Alert triggered (nself-alert-router → PagerDuty)
└─ Tier 1: Primary on-call (push notification + SMS) — 5 min
└─ No ack → Tier 2: Backup on-call (see bus-factor.md) — +5 min
└─ No ack → Escalation policy: email + phone — +10 min
See pagerduty-setup.md for integration key setup and test flow.
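Read as a timeline, the escalation tree above pages Tier 1 for the first 5 minutes, Tier 2 from minute 5, and the escalation policy from minute 10. The sketch below encodes that reading; the function name `paging_tier` and the tier labels are hypothetical, and the actual timings live in the PagerDuty escalation policy, not in this script.

```shell
# paging_tier MINUTES_SINCE_TRIGGER -> who is being paged if nobody has acked.
paging_tier() {
  if [ "$1" -lt 5 ]; then
    echo "tier1-primary"
  elif [ "$1" -lt 10 ]; then
    echo "tier2-backup"
  else
    echo "escalation-policy"
  fi
}
```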
See War Room Protocol (Section 6).
Run in this order:
# Check overall health
nself health
# Check specific service
nself service status <postgres|hasura|auth|nginx|redis>
# Check recent logs (last 100 lines)
nself logs <service> --tail 100
# Check error rate spike
nself monitor --last 15m
- Which nself.org subdomains affected? (`curl -I https://{subdomain}.nself.org/health`)
- Which plugins returning errors? (Grafana → Plugin Error Rate panel)
- Cloud customers affected? (check `np_cloud_tenants` activity via Hasura)
- License validation broken? (`curl https://ping.nself.org/health`)
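The subdomain check above can be looped so every surface is probed in one pass. This is a sketch under assumptions: `http_status` and `check_subdomains` are hypothetical helpers, the list of subdomains is whatever you pass in, and each one is assumed to expose `/health` as in the `curl` examples above. Only standard `curl` flags are used.

```shell
# http_status URL -> HTTP status code, or 000 if the request fails.
# curl flags: -s silent, -o discard body, -m timeout in seconds, -w print status.
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# check_subdomains NAME... -> one "name status" line per subdomain.
# Returns non-zero if any /health endpoint did not answer 200.
# PROBE may be set to another command (for dry runs); default is http_status.
check_subdomains() {
  failed=0
  for sub in "$@"; do
    code=$("${PROBE:-http_status}" "https://$sub.nself.org/health") || true
    echo "$sub $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

`check_subdomains ping status` prints one line per subdomain and exits non-zero if either is unhealthy, so it can gate a later step.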
# What deployed in the last 2 hours?
cd /Volumes/X9/Sites/nself/cli && git log --oneline --since="2 hours ago"
# Any nself update recently?
nself version && nself update --dry-run
Navigate to the matching scenario, then follow it to resolution:
| Symptom | Runbook |
|---|---|
| Postgres down / query errors / deadlock | runbooks/postgres-deadlock.md |
| Hetzner server unreachable | hetzner-failover-runbook.md |
| Hasura metadata errors / migration failure | hasura-migration-runbook.md |
| Vercel deploy needed / rollback | vercel-failover-runbook.md |
| Cloudflare DNS failure / license validation down | cloudflare-dns-failure-runbook.md |
| Stripe billing broken | stripe-failover-runbook.md |
| Secret/credential exposed | runbooks/secret-rotation.md |
| Mass data leak / GDPR trigger | runbooks/mass-data-leak.md |
| AI provider (OpenAI/Anthropic) unreachable | runbooks/ai-provider-outage.md |
| License server (ping.nself.org) down | runbooks/license-server-outage.md |
| Malicious plugin behavior detected | runbooks/malicious-plugin-response.md |
| No matching runbook | Use Root-Cause Template below as live working doc |
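For quick lookup from a terminal, the table above can be mirrored in a `case` statement. A minimal sketch: `runbook_for` and its keywords are hypothetical shorthand chosen for illustration, while the paths are copied from the table.

```shell
# runbook_for KEYWORD -> runbook path from the symptom table.
runbook_for() {
  case "$1" in
    postgres)   echo "runbooks/postgres-deadlock.md" ;;
    hetzner)    echo "hetzner-failover-runbook.md" ;;
    hasura)     echo "hasura-migration-runbook.md" ;;
    vercel)     echo "vercel-failover-runbook.md" ;;
    cloudflare) echo "cloudflare-dns-failure-runbook.md" ;;
    stripe)     echo "stripe-failover-runbook.md" ;;
    secret)     echo "runbooks/secret-rotation.md" ;;
    data-leak)  echo "runbooks/mass-data-leak.md" ;;
    ai)         echo "runbooks/ai-provider-outage.md" ;;
    license)    echo "runbooks/license-server-outage.md" ;;
    plugin)     echo "runbooks/malicious-plugin-response.md" ;;
    *)          echo "no matching runbook: use the root-cause template"; return 1 ;;
  esac
}
```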
Rollback a bad deploy:
# Vercel — rollback to prior production deployment
vercel rollback --prod
# CLI fix pushed to prod — roll back to prior tag
cd /Volumes/X9/Sites/nself/cli
git tag --sort=-creatordate | head -5 # find prior tag
# then trigger deploy via nself deploy with prior version
Restart a crashed service:
nself service restart <service-name>
Enable graceful degradation (disable a bad plugin):
nself plugin disable <plugin-name>
Route around a failing dependency: See specific runbooks for Hetzner / Cloudflare / Vercel failovers.
- P0: required, within 48 hours.
- P1: required, within 48 hours.
- P2: optional but recommended for recurring issues.
- P3/P4: skip.
Use postmortem-template.md verbatim. Key sections:
## Post-Mortem: {one-line title}
**Date:** {YYYY-MM-DD}
**Severity:** P{N}
**Duration:** {start UTC} → {resolved UTC} ({N} min)
**IC:** {name}
**Authors:** {names}
### Impact
{Quantified user impact: N users affected, N min downtime, N failed requests.}
### Timeline
{Timestamped list — detection through resolution.}
### Root Cause
{Specific, factual. Use 5-Whys: keep asking "why" until you reach a process or system gap.}
### What Went Well
{Be honest — what actually helped?}
### What Went Wrong
{Be honest — what slowed response or caused the gap?}
### Action Items
| Action | Owner | Due | Status |
|---|---|---|---|
| {specific preventive or detective fix} | {name} | {YYYY-MM-DD} | open |
- Blameless means no "X did the wrong thing." Focus on system and process gaps.
- Every action item gets an owner and due date. Track in `.claude/tasks/active.md` until closed.
- Action items are always one specific, verifiable fix, never vague ("improve monitoring").
- Publish within 48 hours. Send link to affected customers by email.
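The 48-hour publish deadline is easy to compute at resolution time. The sketch below assumes the clock starts when the incident is resolved, which the rules above do not state explicitly; `postmortem_due_epoch` and `format_utc` are hypothetical helpers.

```shell
# postmortem_due_epoch RESOLVED_EPOCH -> deadline, 48 hours after resolution.
postmortem_due_epoch() {
  echo $(( $1 + 48 * 3600 ))
}

# format_utc EPOCH -> "YYYY-MM-DD HH:MM UTC".
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
format_utc() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M UTC' 2>/dev/null ||
    date -u -r "$1" '+%Y-%m-%d %H:%M UTC'
}
```

`format_utc "$(postmortem_due_epoch "$(date -u +%s)")"` gives a due timestamp in the same shape the resolution announcement uses.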
- P0: always.
- P1: always.
- P2: at IC discretion.
1. Create the Slack channel:
#incident-{YYYY-MM-DD}-{slug} e.g. #incident-2026-05-07-auth-down
Channel topic:
P{N} | {one-line summary} | IC: {name} | Zoom: {url} | started {HH:MM UTC}
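Building the channel name and topic by hand under pressure invites typos. The following sketch generates both from the incident details; `slugify`, `war_room_channel`, and `channel_topic` are hypothetical helpers, and creating the channel itself is still done in Slack.

```shell
# slugify TEXT -> lowercase, hyphen-separated slug (ASCII letters and digits).
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-' | sed 's/^-//; s/-$//'
}

# war_room_channel DATE SUMMARY -> "#incident-{YYYY-MM-DD}-{slug}"
war_room_channel() {
  printf '#incident-%s-%s\n' "$1" "$(slugify "$2")"
}

# channel_topic LEVEL SUMMARY IC ZOOM_URL HH:MM -> topic line in the format above.
channel_topic() {
  printf '%s | %s | IC: %s | Zoom: %s | started %s UTC\n' "$1" "$2" "$3" "$4" "$5"
}
```

`war_room_channel 2026-05-07 "Auth down"` prints `#incident-2026-05-07-auth-down`, matching the example above.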
2. Assign roles:
| Role | Responsibility | Required? |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Drives triage, calls mitigation, declares resolution. | Always |
| Technical Lead | Runs commands, reads logs, executes the runbook steps. | P0/P1 |
| Scribe | Posts timestamped updates every 15 min. Updates status page. | P0/P1 |
| Comms Officer | Drafts customer messages. Handles @-mentions + support tickets. | P0 (optional P1) |
For single-maintainer: IC = Technical Lead. Scribe and Comms can be the same person or deferred until mitigation is underway.
3. Open Zoom bridge:
Use the standing bridge URL stored in vault:
source ~/.claude/vault.env && echo $INCIDENT_ZOOM_URL
Paste the URL in the channel topic immediately. For P2: text-only in Slack is fine.
4. Pin the root-cause template:
Paste into the channel as a pinned message:
## Incident {YYYY-MM-DD} — {slug}
**P-level:** P{N}
**IC:** {name}
**Started:** {HH:MM UTC}
**Mitigated:** pending
**Resolved:** pending
### Symptoms
- {bullet each user-visible signal with timestamp}
### Affected surfaces
- [ ] {service / endpoint / customer segment}
### Hypotheses (live — update as ruled out or confirmed)
- H1: {hypothesis} — Evidence: {link/paste} — Status: open
- H2: …
### Actions taken
- {HH:MM} {action} → {outcome}
| Severity | Internal (Slack) | External (status page) |
|---|---|---|
| P0 | Every 10 min | Every 15 min |
| P1 | Every 30 min | On state change |
| P2 | On state change | On resolution |
Update format for internal:
[HH:MM UTC] Update #{N}: {one-line current state}. Next: {what we're doing now}.
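The internal cadence and update format can be scripted so the scribe only supplies the content. A sketch under assumptions: `internal_cadence_minutes` and `internal_update` are hypothetical helpers, and timer-based cadence is defined only for P0 and P1 because P2 updates on state change.

```shell
# internal_cadence_minutes LEVEL -> minutes between internal Slack updates.
internal_cadence_minutes() {
  case "$1" in
    P0) echo 10 ;;
    P1) echo 30 ;;
    *)  return 1 ;;   # P2 and below update on state change, not on a timer
  esac
}

# internal_update N STATE NEXT [HH:MM]
# Prints "[HH:MM UTC] Update #N: STATE. Next: NEXT."
internal_update() {
  upd_time=${4:-$(date -u +%H:%M)}
  printf '[%s UTC] Update #%s: %s. Next: %s.\n' "$upd_time" "$1" "$2" "$3"
}
```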
The IC declares resolution when:
- The user-visible symptom is confirmed gone (not just "looks better").
- Root cause is identified and either fixed or safely mitigated.
- No new errors in the past 10 min (P0) or 5 min (P1).
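The three criteria above can be checked as a single gate before the IC announces resolution. This is an illustrative sketch: `can_declare_resolved` is a hypothetical helper, the first two inputs are the IC's own yes/no judgment, and the 5-minute window used below P1 is an assumption because the list only defines P0 and P1.

```shell
# can_declare_resolved LEVEL SYMPTOM_GONE ROOT_CAUSE_HANDLED MINUTES_SINCE_LAST_ERROR
# SYMPTOM_GONE and ROOT_CAUSE_HANDLED are "yes" or "no".
# Returns 0 only when all three criteria hold.
can_declare_resolved() {
  case "$1" in
    P0) quiet_needed=10 ;;
    P1) quiet_needed=5 ;;
    *)  quiet_needed=5 ;;   # assumption: no window is defined below P1
  esac
  [ "$2" = "yes" ] && [ "$3" = "yes" ] && [ "$4" -ge "$quiet_needed" ]
}
```

`can_declare_resolved P0 yes yes 7` fails because a P0 needs 10 quiet minutes, while the same inputs pass for a P1.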
Say explicitly in the channel:
:white_check_mark: RESOLVED — P{N} resolved at {HH:MM UTC}. Duration: {N} min.
Post-mortem due: {YYYY-MM-DD HH:MM UTC}.
Then: post status page resolution, send customer email if required, and file the post-mortem task in .claude/tasks/active.md.
- nself-incident-mgmt UI — incident ack + timeline
- status.nself.org — public status page
- PagerDuty setup — alert routing + integration key
- postmortem-template.md — blameless format
- on-call-rotation.md — who is primary on-call
- vendor-contacts.md — Cloudflare / Hetzner / Vercel / Stripe escalation
- bus-factor.md — backup admin assignments
- dr-runbook.md — full disaster recovery
ɳSelf CLI v1.0.9. MIT licensed. Docs CC BY 4.0.