naiplawan/diagnos

name: diagnos
description: Structured incident intake, operational triage, blast-radius analysis, and investigation framing for engineering incidents, regressions, outages, production anomalies, rollout failures, and debugging requests. Use this skill before debugging begins, whenever a report is still ambiguous, severity is unclear, ownership is uncertain, or the organization needs to determine what is actually broken, who is affected, and what investigation path should start first. Trigger on phrases like "investigate this", "users are reporting", "something is broken", "production issue", "incident", "regression", "outage", "why is this happening", "triage this", "help classify this", or any time the incoming report is operationally noisy, emotionally escalated, or technically incomplete.
license: MIT

diagnos — Incident Intake & Operational Triage

The purpose of this skill is not to debug.

It transforms ambiguous operational noise into a structured engineering investigation — before debugging begins.

Most incident responses fail before debugging even starts, through severity inflation, symptom confusion, chasing secondary effects, debugging before the failure is defined, assigning the wrong owner, treating customer pain as implementation pain, or conflating infrastructure symptoms with application defects.

diagnos establishes operational clarity so that investigation can begin from a known position.

Where This Fits

raw incident
    ↓
diagnos   ← this skill
    ↓
disciplina
    ↓
monumentum

This skill produces a structured handoff. It does not debug, fix, or write postmortems.


Install

Using the skills CLI (recommended):

npx skills add naiplawan/diagnos

Or install manually:

# User-level (applies across all projects)
cp -r diagnos ~/.claude/skills/diagnos

# Or project-level
cp -r diagnos /path/to/project/.claude/skills/diagnos

Claude Code will pick up the skill on next session start.

Invoking

Explicit invocation:

/diagnos users can't check out, something is broken

Or describe an incident and Claude will apply the skill when the report is ambiguous, operationally noisy, or severity is unclear.


Core Principle

Do not debug a system until you can clearly describe what is failing, who is affected, how severe it is, what changed, what systems are involved, and what is still unknown.

Severity is operational, not emotional. It is determined by customer impact and blast radius, recovery ability and data integrity risk, revenue or workflow impact, and rollout blockage — NOT by log volume, emotional escalation, executive visibility, or engineering frustration.

Unknowns must remain unknown. Do not invent root causes, impact scope, mitigation status, or ownership certainty. Prefer:

"Impact scope still under verification."

over:

"Likely isolated."

if isolation has not been confirmed.


Responsibilities

This skill owns:

  • incident intake and symptom extraction
  • operational triage and blast-radius analysis
  • severity framing and ownership narrowing
  • timeline reconstruction and regression classification
  • dependency identification and investigation framing
  • missing-information detection

This skill does NOT:

  • implement fixes or perform deep debugging
  • produce postmortems or write leadership updates
  • route workflows across skills (ordinator handles that)

Viability Check — Run Before Any Phase

Before extracting or structuring anything, check whether the report contains enough observable information to triage.

Minimum viable report requires at least two of:

  • a described symptom or behavior
  • an affected system, workflow, or user group
  • a time reference (even approximate)

If the report falls below this threshold, push back before proceeding:

"There is not yet enough operational evidence to classify this confidently. Can you share what was observed, which users or systems are affected, and when it started?"

Do not invent structure from noise. Proceed to Phase 1 only when minimum viability is met.
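
The two-of-three threshold above can be sketched as a small check. This is only an illustration: the `Report` class and its field names are hypothetical and are not part of the skill itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    """Hypothetical container for the three signals the viability check looks for."""
    symptom: Optional[str] = None         # a described symptom or behavior
    affected: Optional[str] = None        # an affected system, workflow, or user group
    time_reference: Optional[str] = None  # even approximate, e.g. "since this morning"


def is_viable(report: Report) -> bool:
    """Return True when at least two of the three signals are present and non-blank."""
    signals = [report.symptom, report.affected, report.time_reference]
    present = sum(1 for s in signals if s and s.strip())
    return present >= 2
```

A report with only "something is broken" carries one signal and would trigger the pushback message instead of Phase 1.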


Workflow — Seven Phases

Phase 1 — Normalize the Report

Extract: exact symptoms, affected workflows, affected users/systems, timestamps, environments, release versions, regions, services, reproduction information, alerts/logs, observed failures, and operational changes.

Separate every piece of information into one of four categories — never merge them:

| Category | Meaning |
|---|---|
| Fact | directly observed |
| Assumption | inferred but unverified |
| Hypothesis | possible explanation |
| Unknown | not yet established |
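
One way to keep the four categories from merging is to make the label a required part of every statement. The sketch below assumes hypothetical `Category` and `Statement` types; the skill does not prescribe any particular data model.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    FACT = "directly observed"
    ASSUMPTION = "inferred but unverified"
    HYPOTHESIS = "possible explanation"
    UNKNOWN = "not yet established"


@dataclass(frozen=True)
class Statement:
    """A single piece of report information; it cannot exist without a category."""
    text: str
    category: Category


def group_by_category(statements):
    """Bucket statement texts by category so the four kinds are never mixed."""
    groups = {category: [] for category in Category}
    for statement in statements:
        groups[statement.category].append(statement.text)
    return groups
```

Because every category key is always present in the result, an empty "Fact" bucket stays visible rather than being silently omitted.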

Phase 2 — Define the Failure Boundary

What still works is often more valuable than what fails.

  • Is the failure global or partial?
  • Does rollback restore behavior?
  • Is the issue environment-specific?
  • Is this isolated to one workflow?
  • Does retry succeed?
  • Are reads affected or only writes?
  • Is degradation continuous or intermittent?

The failure boundary narrows the search space before any hypothesis is formed.
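
The boundary questions above can be tracked as a checklist in which an unanswered question stays open rather than being guessed. A minimal sketch, assuming answers are kept as free text in a plain dictionary (the function name and representation are illustrative only):

```python
from typing import Dict, List, Optional

BOUNDARY_QUESTIONS = [
    "Is the failure global or partial?",
    "Does rollback restore behavior?",
    "Is the issue environment-specific?",
    "Is this isolated to one workflow?",
    "Does retry succeed?",
    "Are reads affected or only writes?",
    "Is degradation continuous or intermittent?",
]


def open_questions(answers: Dict[str, Optional[str]]) -> List[str]:
    """Return boundary questions with no confirmed answer (missing, None, or blank)."""
    return [q for q in BOUNDARY_QUESTIONS if not (answers.get(q) or "").strip()]
```

Whatever this returns belongs under Unknowns in the final output until it has been verified.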

Phase 3 — Blast Radius Analysis

| Level | Example |
|---|---|
| User-localized | One account broken |
| Workflow-specific | Checkout flow degraded |
| Service-wide | Entire API failing |
| Dependency-wide | Shared auth system unstable |
| Platform-wide | Infrastructure outage |

Then identify downstream systems, upstream dependencies, release coordination implications, and customer-visible effects.
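
The blast-radius levels can be modeled as an ordered enumeration. The numeric ordering below is an assumption that follows the table's listing order from narrowest to widest; the skill itself does not assign numbers to the levels.

```python
from enum import IntEnum


class BlastRadius(IntEnum):
    # Assumed ordering: narrowest scope first, widest last.
    USER_LOCALIZED = 1
    WORKFLOW_SPECIFIC = 2
    SERVICE_WIDE = 3
    DEPENDENCY_WIDE = 4
    PLATFORM_WIDE = 5


def widest(confirmed_levels):
    """Return the widest confirmed level, or None when nothing has been confirmed."""
    return max(confirmed_levels, default=None)
```

Returning `None` for an empty input keeps an unverified scope unknown instead of defaulting to the smallest level.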

Phase 4 — Change Correlation

Identify recent changes: deployments, migrations, config changes, feature flags, infra scaling, dependency upgrades, certificate rotations, rollout sequencing, and operational policy changes.

Do NOT assume the most recent change caused the issue. Only establish:

"The issue began after X."

not:

"X caused the issue."

unless proven.
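
The distinction between temporal correlation and causation can be kept explicit in tooling. This sketch lists changes that landed shortly before the incident and words the result as "began after", never "caused by". The `Change` type and the 24-hour lookback window are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a deployment, migration, flag flip, or similar change."""
    description: str
    applied_at: datetime


def changes_before(changes, incident_start, window=timedelta(hours=24)):
    """Return changes applied within `window` before the incident start, newest first."""
    candidates = [
        c for c in changes
        if incident_start - window <= c.applied_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)


def correlation_statement(change):
    """Phrase a candidate as temporal correlation only."""
    return f'The issue began after "{change.description}". Causation is not established.'
```

Every candidate in the window is returned, not just the most recent one, so the latest change is not singled out by default.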

Phase 5 — Severity Classification

| Severity | Definition |
|---|---|
| S0 | catastrophic platform-wide outage |
| S1 | major customer-impacting degradation |
| S2 | significant workflow impairment |
| S3 | localized regression or operational issue |
| S4 | low-impact bug or investigation |

Severity must be justified with blast radius, user impact, operational impact, recovery constraints, or data integrity implications. Avoid dramatic language.
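
The justification requirement can be enforced structurally by refusing to construct a severity assessment without at least one accepted basis. This is a sketch under assumptions: the `SeverityAssessment` type and the exact basis strings are hypothetical, chosen to mirror the list in the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    S0 = "catastrophic platform-wide outage"
    S1 = "major customer-impacting degradation"
    S2 = "significant workflow impairment"
    S3 = "localized regression or operational issue"
    S4 = "low-impact bug or investigation"


# Accepted grounds for a severity level, mirroring the text above.
ALLOWED_BASES = {
    "blast radius",
    "user impact",
    "operational impact",
    "recovery constraints",
    "data integrity",
}


@dataclass(frozen=True)
class SeverityAssessment:
    severity: Severity
    justifications: dict  # maps a basis from ALLOWED_BASES to the observed evidence

    def __post_init__(self):
        if not self.justifications:
            raise ValueError("severity requires at least one justification")
        unsupported = set(self.justifications) - ALLOWED_BASES
        if unsupported:
            raise ValueError(f"unsupported justification basis: {sorted(unsupported)}")
```

A basis such as "log volume" or "executive visibility" is rejected, which matches the rule that severity is operational rather than emotional.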

Phase 6 — Ownership Narrowing

Identify likely ownership domains: frontend, backend, platform, infra, mobile, auth, data, networking, CI/CD, release engineering, observability.

Ownership narrowing is probabilistic. State confidence explicitly:

"Most likely within the API/auth boundary based on token refresh failures."

Phase 7 — Investigation Framing

Produce the next investigation steps with enough specificity to hand off cleanly into disciplina.

Good: "Focus next investigation on session refresh handling during reconnect events after token expiry."

Bad: "Debug auth."


Output Structure

## Incident Summary
## Current Symptoms
## Affected Scope
## Severity Assessment
## Known Facts
## Unknowns
## Recent Changes
## Likely Ownership
## Initial Hypotheses
## Recommended Investigation Path
## Immediate Mitigations
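
The section list above can be rendered with a small helper that always emits every heading in order. This is an illustrative sketch only: the `render_report` function and the "Not yet established." placeholder for empty sections are assumptions, not part of the skill.

```python
SECTIONS = [
    "Incident Summary",
    "Current Symptoms",
    "Affected Scope",
    "Severity Assessment",
    "Known Facts",
    "Unknowns",
    "Recent Changes",
    "Likely Ownership",
    "Initial Hypotheses",
    "Recommended Investigation Path",
    "Immediate Mitigations",
]


def render_report(content: dict) -> str:
    """Render all sections in order; empty sections get an explicit placeholder."""
    unexpected = set(content) - set(SECTIONS)
    if unexpected:
        raise KeyError(f"unexpected section(s): {sorted(unexpected)}")
    parts = []
    for title in SECTIONS:
        body = content.get(title, "").strip() or "Not yet established."
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)
```

Emitting a placeholder instead of dropping an empty section keeps gaps visible to whoever receives the handoff.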

Investigation Framing Rules

Good framing names a specific system, behavior, or time range. It narrows — it does not gesture.

Bad: "Investigate backend instability."
Good: "Investigate whether the new deployment altered request timeout behavior between the API gateway and worker pool."

Bad: "The database is overloaded."
Good: "Connection acquisition latency increased sharply after deployment; determine whether pool exhaustion or lock contention explains the queue growth."


Missing Information Detection

Aggressively identify what is absent: no timestamps, no reproduction path, no affected environment, no deployment history, no metrics or logs, unclear impact scope, unknown rollback state.

State gaps explicitly:

"The current report lacks enough information to determine whether this is isolated or systemic."


Refusal Conditions

Push back when:

  • severity is clearly inflated without observable evidence
  • the user demands certainty that has not been established
  • the report is purely hypothetical with no observed behavior
  • emotional escalation is being used as a substitute for operational detail

Do not catastrophize, do not minimize. Match the language to what has actually been confirmed.

Good: "Users intermittently lose saved progress after session refresh."

Bad: "Critical customer data loss issue." (unless data was truly lost)


Success Criteria

| Criterion | How to verify |
|---|---|
| Failure boundary is defined | Output distinguishes what works from what fails, not just what fails |
| Severity is calibrated | Severity level (S0–S4) is stated and justified with at least one of: blast radius, user impact, recovery constraint, data integrity risk |
| Ownership is narrowed | Output names at least one specific domain or system boundary, not just "the backend" |
| Uncertainty is explicit | Every unconfirmed claim is labeled Assumption, Hypothesis, or Unknown; none is stated as fact |
| Investigation path is specific | Recommended Investigation Path names a system, behavior, or time range, not a generic verb like "investigate" |
| No severity inflation | Output contains no dramatic language unsupported by stated facts |

The output should make the incident feel smaller, clearer, and more structured — not more dramatic.


License

MIT — see LICENSE.

Credit

Packaged by @rachaphol.

About

A Claude Code skill for structured incident intake and operational triage. Transforms ambiguous incident reports into a structured engineering investigation — before debugging begins.
