| name | diagnos |
|---|---|
| description | Structured incident intake, operational triage, blast-radius analysis, and investigation framing for engineering incidents, regressions, outages, production anomalies, rollout failures, and debugging requests. Use this skill before debugging begins — whenever a report is still ambiguous, severity is unclear, ownership is uncertain, or the organization needs to determine what is actually broken, who is affected, and what investigation path should start first. Trigger on phrases like "investigate this", "users are reporting", "something is broken", "production issue", "incident", "regression", "outage", "why is this happening", "triage this", "help classify this", or anytime the incoming report is operationally noisy, emotionally escalated, or technically incomplete. |
| license | MIT |
The purpose of this skill is not to debug.
It transforms ambiguous operational noise into a structured engineering investigation — before debugging begins.
Most incidents fail before debugging even starts: severity inflation, symptom confusion, chasing secondary effects, debugging before defining the failure, assigning the wrong owner, treating customer pain as implementation pain, conflating infrastructure symptoms with application defects.
diagnos establishes operational clarity so that investigation can begin from a known position.
raw incident
↓
diagnos ← this skill
↓
disciplina
↓
monumentum
This skill produces a structured handoff. It does not debug, fix, or postmortem.
Using the skills CLI (recommended):
npx skills add naiplawan/diagnosOr install manually:
# User-level (applies across all projects)
cp -r diagnos ~/.claude/skills/diagnos
# Or project-level
cp -r diagnos /path/to/project/.claude/skills/diagnosClaude Code will pick up the skill on next session start.
Explicit invocation:
/diagnos users can't check out, something is broken
Or describe an incident and Claude will apply the skill when the report is ambiguous, operationally noisy, or severity is unclear.
Do not debug a system until you can clearly describe what is failing, who is affected, how severe it is, what changed, what systems are involved, and what is still unknown.
Severity is operational, not emotional. It is determined by customer impact and blast radius, recovery ability and data integrity risk, revenue or workflow impact, and rollout blockage — NOT by log volume, emotional escalation, executive visibility, or engineering frustration.
Unknowns must remain unknown. Do not invent root causes, impact scope, mitigation status, or ownership certainty. Prefer:
"Impact scope still under verification."
over:
"Likely isolated."
if isolation has not been confirmed.
This skill owns:
- incident intake and symptom extraction
- operational triage and blast-radius analysis
- severity framing and ownership narrowing
- timeline reconstruction and regression classification
- dependency identification and investigation framing
- missing-information detection
This skill does NOT:
- implement fixes or perform deep debugging
- produce postmortems or write leadership updates
- route workflows across skills (that is
ordinator)
Before extracting or structuring anything, check whether the report contains enough observable information to triage.
Minimum viable report requires at least two of:
- a described symptom or behavior
- an affected system, workflow, or user group
- a time reference (even approximate)
If the report falls below this threshold, push back before proceeding:
"There is not yet enough operational evidence to classify this confidently. Can you share what was observed, which users or systems are affected, and when it started?"
Do not invent structure from noise. Proceed to Phase 1 only when minimum viability is met.
Extract: exact symptoms, affected workflows, affected users/systems, timestamps, environments, release versions, regions, services, reproduction information, alerts/logs, observed failures, and operational changes.
Separate every piece of information into one of four categories — never merge them:
| Category | Meaning |
|---|---|
| Fact | directly observed |
| Assumption | inferred but unverified |
| Hypothesis | possible explanation |
| Unknown | not yet established |
What still works is often more valuable than what fails.
- Is the failure global or partial?
- Does rollback restore behavior?
- Is the issue environment-specific?
- Is this isolated to one workflow?
- Does retry succeed?
- Are reads affected or only writes?
- Is degradation continuous or intermittent?
The failure boundary narrows the search space before any hypothesis is formed.
| Level | Example |
|---|---|
| User-localized | One account broken |
| Workflow-specific | Checkout flow degraded |
| Service-wide | Entire API failing |
| Dependency-wide | Shared auth system unstable |
| Platform-wide | Infrastructure outage |
Then identify downstream systems, upstream dependencies, release coordination implications, and customer-visible effects.
Identify recent changes: deployments, migrations, config changes, feature flags, infra scaling, dependency upgrades, certificate rotations, rollout sequencing, and operational policy changes.
Do NOT assume the most recent change caused the issue. Only establish:
"The issue began after X."
not:
"X caused the issue."
unless proven.
| Severity | Definition |
|---|---|
| S0 | catastrophic platform-wide outage |
| S1 | major customer-impacting degradation |
| S2 | significant workflow impairment |
| S3 | localized regression or operational issue |
| S4 | low-impact bug or investigation |
Severity must be justified with blast radius, user impact, operational impact, recovery constraints, or data integrity implications. Avoid dramatic language.
Identify likely ownership domains: frontend, backend, platform, infra, mobile, auth, data, networking, CI/CD, release engineering, observability.
Ownership narrowing is probabilistic. State confidence explicitly:
"Most likely within the API/auth boundary based on token refresh failures."
Produce the next investigation steps with enough specificity to hand off cleanly into disciplina.
Good: "Focus next investigation on session refresh handling during reconnect events after token expiry." Bad: "Debug auth."
## Incident Summary
## Current Symptoms
## Affected Scope
## Severity Assessment
## Known Facts
## Unknowns
## Recent Changes
## Likely Ownership
## Initial Hypotheses
## Recommended Investigation Path
## Immediate MitigationsGood framing names a specific system, behavior, or time range. It narrows — it does not gesture.
Bad: "Investigate backend instability." Good: "Investigate whether the new deployment altered request timeout behavior between the API gateway and worker pool."
Bad: "The database is overloaded." Good: "Connection acquisition latency increased sharply after deployment; determine whether pool exhaustion or lock contention explains the queue growth."
Aggressively identify what is absent: no timestamps, no reproduction path, no affected environment, no deployment history, no metrics or logs, unclear impact scope, unknown rollback state.
State gaps explicitly:
"The current report lacks enough information to determine whether this is isolated or systemic."
Push back when:
- severity is clearly inflated without observable evidence
- the user demands certainty that has not been established
- the report is purely hypothetical with no observed behavior
- emotional escalation is being used as a substitute for operational detail
Do not catastrophize, do not minimize. Match the language to what has actually been confirmed.
Good: "Users intermittently lose saved progress after session refresh." Bad: "Critical customer data loss issue." — unless data was truly lost.
| Criterion | How to verify |
|---|---|
| Failure boundary is defined | Output distinguishes what works from what fails — not just what fails |
| Severity is calibrated | Severity level (S0–S4) is stated and justified with at least one of: blast radius, user impact, recovery constraint, data integrity risk |
| Ownership is narrowed | Output names at least one specific domain or system boundary, not just "the backend" |
| Uncertainty is explicit | Every unconfirmed claim is labeled Assumption, Hypothesis, or Unknown — none stated as fact |
| Investigation path is specific | Recommended Investigation Path names a system, behavior, or time range — not a generic verb like "investigate" |
| No severity inflation | Output contains no dramatic language unsupported by stated facts |
The output should make the incident feel smaller, clearer, and more structured — not more dramatic.
MIT — see LICENSE.
Packaged by @rachaphol.