Automated runbook execution engine with YAML decision trees, AI-generated runbooks, blast radius checks, approval workflows, and cross-module integration for Windows infrastructure.
Infrastructure teams face the same issues repeatedly -- high CPU, disk space, service crashes, replication failures -- and the remediation steps are well-known but inconsistently applied. Junior admins might skip blast radius checks. Night-shift operators might not know the right order of operations. Knowledge lives in people's heads, not in code.
Manual runbooks in wikis and Word documents go stale. They cannot make decisions, cannot check dependencies, and cannot learn from past outcomes. When an alert fires at 3 AM, you need an engine that can walk through a diagnostic decision tree, check if the target is a Domain Controller before restarting a service, ask for approval when the blast radius is high, verify the fix actually worked, and record what happened for the morning brief.
Infra-RunbookEngine turns static documentation into executable decision trees.
Each runbook is a YAML file defining a decision tree. The engine reads the YAML, walks the tree step by step, and at each node decides what to do next based on conditions, integrations, and outcomes.
Alert Fires
|
v
[Read YAML Runbook] --> Validate parameters
|
v
[Check Maintenance Window] --> Skip if in window
|
v
[Step 1: Diagnose] --> Run PowerShell script, capture output
|
v
[Step 2: Decision] --> Evaluate condition
  /          \
 /            \
True          False
|             |
v             v
[Fix]         [Escalate]
|
v
[Blast Radius Check] --> Is this a DC? SQL Server? How many users?
|
v
[Approval Gate] --> Console / Email / Teams / Slack
|
v
[Execute Fix] --> Run remediation script
|
v
[Verify Fix] --> Wait, then check if the issue is resolved
|
v
[Record Learning] --> Track success/failure for next time
|
v
[Generate Report] --> HTML dashboard with full execution details
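As a rough illustration of the walk shown above, the sketch below steps through a tiny in-memory tree. It is not the engine's implementation: the Invoke-DecisionTree name and the run, condition, and next keys are assumptions made for this example, and script blocks stand in for YAML so that no parser is needed.

```powershell
# Hypothetical in-memory runbook. Keys mirror the YAML step fields where possible;
# 'run', 'condition' (as a script block) and 'next' are assumptions for this sketch.
$steps = @(
    @{ id = 'diagnose'; action = 'script';   run = { param($ctx) $ctx.cpu = 95 }; next = 'evaluate' }
    @{ id = 'evaluate'; action = 'decision'; condition = { param($ctx) $ctx.cpu -gt 90 }
       if_true = 'fix_it'; if_false = 'escalate' }
    @{ id = 'fix_it';   action = 'script';   run = { param($ctx) $ctx.fixed = $true } }
    @{ id = 'escalate'; action = 'escalate' }
)

function Invoke-DecisionTree {
    param([hashtable[]]$Steps, [string]$StartId)

    # Index steps by id so a decision can jump to any node.
    $byId = @{}
    foreach ($s in $Steps) { $byId[$s.id] = $s }

    $context = @{}
    $visited = [System.Collections.Generic.List[string]]::new()
    $current = $StartId

    while ($current) {
        if ($visited.Contains($current)) { throw "Cycle detected at step '$current'" }
        $visited.Add($current)
        $step = $byId[$current]
        if (-not $step) { throw "Unknown step '$current'" }

        switch ($step.action) {
            'script'   { & $step.run $context; $current = $step.next }
            'decision' {
                $current = if (& $step.condition $context) { $step.if_true } else { $step.if_false }
            }
            default    { $current = $null }   # escalate/notify end the walk in this sketch
        }
    }
    [pscustomobject]@{ Path = $visited.ToArray(); Context = $context }
}

$result = Invoke-DecisionTree -Steps $steps -StartId 'diagnose'
$result.Path -join ' -> '    # diagnose -> evaluate -> fix_it
```

The cycle check is a design choice for the sketch: a runbook whose branches point back at an earlier step would otherwise loop forever.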
This module is designed to work with the rest of the portfolio. Here is how they connect:
+-------------------------------------------------------------+
|                     Infra-RunbookEngine                     |
|                   (Decision & Execution)                    |
+-------------------------------------------------------------+
|                                                             |
|      +------------------+    +------------------+           |
|      | ITSM-Insights    |--->| AI-Generated     |           |
|      | (Ticket History) |    | Runbooks         |           |
|      +------------------+    +------------------+           |
|                                                             |
|      +------------------+    +------------------+           |
|      | Infra-Change     |--->| Context-Aware    |           |
|      | Tracker          |    | Decisions        |           |
|      +------------------+    +------------------+           |
|                                                             |
|      +------------------+    +------------------+           |
|      | Infra-LivingDoc  |--->| Expected State   |           |
|      | (Documentation)  |    | Comparison       |           |
|      +------------------+    +------------------+           |
|                                                             |
|      +------------------+    +------------------+           |
|      | Certificate-     |--->| Cert Expiry      |           |
|      | LifecycleMonitor |    | Runbooks         |           |
|      +------------------+    +------------------+           |
|                                                             |
|      +------------------+    +------------------+           |
|      | Admin-Morning    |<---| Runbook Results  |           |
|      | Brief            |    | & Alerts         |           |
|      +------------------+    +------------------+           |
|                                                             |
|      +------------------+    +------------------+           |
|      | AD-Security      |--->| Security Context |           |
|      | Audit            |    | for Blast Radius |           |
|      +------------------+    +------------------+           |
|                                                             |
+-------------------------------------------------------------+
New-Runbook -FromTicketHistory can query a supported ITSM source (ServiceNow, Jira, or a CSV export) for all tickets related to a configuration item (CI). The ticket data -- summaries, categories, resolutions -- is sent to an AI provider, which analyzes patterns and generates a YAML runbook targeting the most common issues. If you have been using ITSM-Insights to analyze ticket trends, the same data source can be fed directly into runbook generation. A server that keeps getting the same three types of tickets gets a runbook that handles all three, with a decision tree that distinguishes between them.
Runbook steps can use action: integration to call Get-ServerConfigChanges from the Infra-ChangeTracker module. This is used in the high-cpu template, for example: before deciding how to remediate high CPU, the runbook checks what changed on the server in the last 24 hours. If a new application was deployed yesterday, the runbook suggests investigating the change rather than blindly restarting services. This turns runbooks from "always do the same thing" into "decide based on context."
Runbooks can query Infra-LivingDoc to compare the current state of a server against its documented expected state. If a service is stopped but the documentation says it should be running, the runbook knows this is an anomaly. If the documentation says the server is decommissioned, the runbook knows not to remediate. This prevents runbooks from fighting against intentional changes.
The certificate-expiry template integrates directly with Certificate-LifecycleMonitor's Get-CertificateRenewalStatus function. When a certificate expiry alert triggers the runbook, it checks whether auto-renewal is already configured, identifies which services use the certificate (IIS, RDP, LDAPS), and decides whether to notify or escalate based on how close the expiry date is. The health score calculation also pulls certificate expiry data to factor into the overall CI health grade.
Every runbook execution is logged to $env:USERPROFILE\.runbookengine\executions\ as a JSON file. Get-ShiftHandoff collects these execution results -- what ran, what succeeded, what failed, what escalated, what is still pending approval -- and generates a summary that can be consumed by Admin-MorningBrief. The morning brief can show: "3 runbooks ran overnight, 2 completed successfully, 1 was escalated for manual review."
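A minimal sketch of how such a summary could be derived from a folder of execution files. The JSON schema is an assumption (RunbookName and Status fields with the values Success and Escalated), as is the Get-ExecutionSummary name; the real files written by the engine may look different. The demo uses a throwaway temp folder rather than the real log directory.

```powershell
function Get-ExecutionSummary {
    param([Parameter(Mandatory)][string]$Path)

    # Read every JSON file in the folder into an object.
    $records = @(Get-ChildItem -Path $Path -Filter '*.json' -File |
        ForEach-Object { Get-Content -Path $_.FullName -Raw | ConvertFrom-Json })

    # Count executions per status; [int]$null is 0, so missing keys start at zero.
    $counts = @{}
    foreach ($r in $records) { $counts[$r.Status] = 1 + [int]$counts[$r.Status] }
    $succeeded = [int]$counts['Success']
    $escalated = [int]$counts['Escalated']

    '{0} runbooks ran, {1} completed successfully, {2} escalated' -f $records.Count, $succeeded, $escalated
}

# Demo data (hypothetical schema) written to a temporary folder
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ('runbook-demo-' + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demo | Out-Null
$samples = @(
    @{ RunbookName = 'high-cpu';         Status = 'Success' }
    @{ RunbookName = 'disk-space';       Status = 'Success' }
    @{ RunbookName = 'service-recovery'; Status = 'Escalated' }
)
$i = 0
foreach ($s in $samples) {
    $i++
    $s | ConvertTo-Json | Set-Content -Path (Join-Path $demo "$i.json")
}

$summary = Get-ExecutionSummary -Path $demo
Remove-Item -Path $demo -Recurse -Force
$summary    # 3 runbooks ran, 2 completed successfully, 1 escalated
```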
Before executing any remediation step marked requires_approval: true, the engine runs Test-BlastRadius. This function checks Active Directory (via Get-ADComputer and service principal names) to detect if the target is a Domain Controller, SQL Server, Exchange server, file server, or print server. It also checks connected user sessions. A restart on a file server with 50 connected users gets flagged as High blast radius and requires approval. This security context comes from the same AD infrastructure that AD-SecurityAudit monitors.
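The classification idea can be sketched without touching Active Directory by passing the detected roles and session count in as plain values. Everything below is illustrative: Get-BlastRadiusEstimate, the role names, the Low/Medium/High scale, and the session threshold of 25 are assumptions, not the logic inside Test-BlastRadius.

```powershell
function Get-BlastRadiusEstimate {
    param(
        [string[]]$Roles = @(),        # e.g. 'DomainController', 'SqlServer', 'FileServer'
        [int]$ConnectedSessions = 0,
        [int]$SessionThreshold = 25    # assumed cut-off, not the module's real value
    )
    $critical = @('DomainController', 'SqlServer', 'Exchange')

    $level = 'Low'
    if ($Roles.Count -gt 0 -or $ConnectedSessions -gt 0) { $level = 'Medium' }

    # Any critical role, or many connected users, pushes the estimate to High.
    $hasCritical = @($Roles | Where-Object { $critical -contains $_ }).Count -gt 0
    if ($hasCritical -or $ConnectedSessions -ge $SessionThreshold) { $level = 'High' }

    [pscustomobject]@{ Level = $level; RequiresApproval = ($level -eq 'High') }
}

# The file server example from the paragraph above: 50 connected users
Get-BlastRadiusEstimate -Roles 'FileServer' -ConnectedSessions 50
```

Keeping the role detection separate from the scoring makes the scoring easy to test, since it needs no domain access.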
Every integration is optional. If you do not have Infra-ChangeTracker installed, integration steps that call it will be gracefully skipped. If you do not have Certificate-LifecycleMonitor, the health score simply omits certificate factors. The core engine -- YAML parsing, decision trees, approvals, verification, learning -- works entirely on its own.
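One common way to get this kind of graceful degradation in PowerShell is to probe for the function with Get-Command before calling it. The sketch below shows the pattern only; Invoke-OptionalIntegration is a hypothetical helper, and the -ComputerName parameter passed to Get-ServerConfigChanges is an assumption about that function's signature.

```powershell
function Invoke-OptionalIntegration {
    param(
        [Parameter(Mandatory)][string]$FunctionName,
        [hashtable]$Parameters = @{}
    )
    # Get-Command returns nothing when no installed module exports the function,
    # so the step can be reported as skipped instead of throwing.
    $cmd = Get-Command -Name $FunctionName -ErrorAction SilentlyContinue
    if (-not $cmd) {
        return [pscustomobject]@{ Skipped = $true; Function = $FunctionName; Output = $null }
    }
    [pscustomobject]@{ Skipped = $false; Function = $FunctionName; Output = (& $cmd @Parameters) }
}

# Parameter name is assumed; adjust to the real Infra-ChangeTracker signature
$changes = Invoke-OptionalIntegration -FunctionName 'Get-ServerConfigChanges' `
    -Parameters @{ ComputerName = 'SERVER01' }

if ($changes.Skipped) {
    'Infra-ChangeTracker not available; continuing without change context'
}
```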
# Import the module
Import-Module .\Infra-RunbookEngine
# Preview what a runbook would do (dry run)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'SERVER01' -WhatIf
# Execute a runbook with console approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'WEB01' `
-Parameters @{ ServiceName = 'W3SVC' } -RequireApproval -ApprovalMethod Console
# Execute with Force (skip approvals) and generate HTML report
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' `
-Force -OutputPath 'C:\Reports'
# Create a custom runbook from a template
New-Runbook -Name 'my-cpu-check' -FromTemplate 'high-cpu' -OutputPath 'C:\Runbooks'
# Generate an AI-powered runbook from ticket history
New-Runbook -Name 'webserver-issues' -FromTicketHistory `
-ITSMProvider ServiceNow -ITSMEndpoint 'https://instance.service-now.com' `
-ITSMCredential $cred -CIName 'WEB01' -Provider OpenAI -ApiKey $apiKey
# Check health score
Get-CIHealthScore -ComputerName 'DC01','DC02' -IncludeRunbookHistory
# Generate shift handoff
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html'
# Check execution history
Get-RunbookStatus -Last 10
Get-RunbookStatus -ComputerName 'SERVER01' -Status Failed

| Template | File | Description |
|---|---|---|
| High CPU | high-cpu.yml | Diagnose top CPU consumers, check for known offenders, query recent changes, restart or escalate |
| Disk Space | disk-space.yml | Check usage, find cleanable paths, remove temp/log files, identify large files, escalate if insufficient |
| Service Recovery | service-recovery.yml | Check service state, verify dependencies, attempt restart, check event logs for crash reasons |
| Certificate Expiry | certificate-expiry.yml | Scan for expiring certs, identify service bindings, check auto-renewal, escalate expired certs |
| DNS Resolution | dns-resolution.yml | Check DNS service, test resolution, check forwarders, flush cache, retry and escalate |
| Replication Failure | replication-failure.yml | Check repadmin status, identify failing partner, test network, verify time sync, force replication |
| Backup Failure | backup-failure.yml | Check backup service, verify disk space on target, test network, review logs, attempt re-run |
| Memory Pressure | memory-pressure.yml | Assess memory usage, detect memory leaks, check page file, restart safe services, escalate |
Runbooks are YAML files with this structure:
name: My Custom Runbook
version: "1.0"
description: What this runbook does
trigger:
  metric: some_metric
  threshold: 90
  duration_minutes: 15
parameters:
  - name: ComputerName
    required: true
  - name: CustomParam
    default: some_value
steps:
  - id: diagnose
    action: script
    description: Run a diagnostic script
    script: |
      # Get-Process -ComputerName was removed after PowerShell 5.1, so run remotely instead
      Invoke-Command -ComputerName $ComputerName -ScriptBlock {
        Get-Process | Sort-Object CPU -Descending | Select-Object -First 5
      }
    outputs:
      - process_list
  - id: evaluate
    action: decision
    description: Check the results
    condition: "$process_list.Count -gt 0"
    if_true: fix_it
    if_false: escalate
  - id: fix_it
    action: script
    description: Apply the fix
    requires_approval: true
    blast_radius: single_service
    script: |
      # Restart-Service has no -ComputerName parameter; Invoke-Command runs it on the target
      Invoke-Command -ComputerName $ComputerName -ScriptBlock { Restart-Service -Name MyService -Force }
    verify:
      wait_seconds: 30
      check: |
        Invoke-Command -ComputerName $ComputerName -ScriptBlock { (Get-Service -Name MyService).Status -eq 'Running' }
  - id: escalate
    action: escalate
    priority: high
    description: Cannot resolve automatically
    message: "Issue on $ComputerName requires manual intervention"

| Action | Purpose | Key Fields |
|---|---|---|
| script | Run PowerShell code | script, outputs, verify, requires_approval, blast_radius |
| decision | Branch the decision tree | condition, if_true, if_false |
| integration | Call another module | module, function, parameters, outputs |
| notify | Send a notification | message, include_data |
| escalate | Escalate to human | priority, message |
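The verify block on a script step (wait_seconds, then check) can be pictured as the following sketch. Invoke-StepWithVerify is a hypothetical helper, not a function exported by the module; the only real API it relies on is [scriptblock]::Create, which turns script text such as a runbook's check field into something invocable. A hashtable stands in for the service so the example runs anywhere.

```powershell
function Invoke-StepWithVerify {
    param(
        [Parameter(Mandatory)][scriptblock]$Fix,
        [Parameter(Mandatory)][scriptblock]$Check,
        [int]$WaitSeconds = 30
    )
    & $Fix | Out-Null                    # apply the remediation, discard its output
    Start-Sleep -Seconds $WaitSeconds    # give the target time to settle
    [bool](& $Check)                     # any truthy result counts as verified
}

# Script text as it might arrive from a runbook's verify.check field
$checkText = '$state.Running -eq $true'
$state     = @{ Running = $false }       # stand-in for a real service

$verified = Invoke-StepWithVerify -Fix { $state.Running = $true } `
    -Check ([scriptblock]::Create($checkText)) -WaitSeconds 0
"Verified: $verified"                    # Verified: True
```

Returning a plain boolean keeps the caller simple: a false result can route to the escalate step rather than being treated as success.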
| Level | Meaning | Approval |
|---|---|---|
| single_service | Affects one service on one server | Optional |
| single_server | Affects the entire server | Recommended |
| multi_server | Affects multiple servers | Required |
| domain_wide | Affects the entire domain | Required |
The New-Runbook -FromTicketHistory flow:
- Queries your ITSM for all tickets related to a CI
- Sends ticket summaries to an AI provider (Anthropic, OpenAI, Ollama, or Custom)
- The AI analyzes patterns: "80% of tickets for this server are disk space, 15% are service crashes, 5% are performance"
- Generates a YAML decision tree that handles the top issues
- Saves the runbook for review and use
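The pattern analysis in the third step is performed by the AI provider, but the kind of breakdown it is expected to produce can be approximated locally. The sketch below groups a hypothetical ticket export by a Category property (an assumed column name) and prints the share of each category; the ticket counts are invented to mirror the 80/15/5 example.

```powershell
# Hypothetical ticket export: 16 disk space, 3 service crash, 1 performance
$tickets = @()
$tickets += 1..16 | ForEach-Object { [pscustomobject]@{ Category = 'Disk Space' } }
$tickets += 1..3  | ForEach-Object { [pscustomobject]@{ Category = 'Service Crash' } }
$tickets += 1..1  | ForEach-Object { [pscustomobject]@{ Category = 'Performance' } }

# Share of tickets per category, most common first
$summary = $tickets |
    Group-Object -Property Category |
    Sort-Object -Property Count -Descending |
    ForEach-Object {
        [pscustomobject]@{
            Category = $_.Name
            Percent  = [math]::Round(100 * $_.Count / $tickets.Count)
        }
    }

$summary | ForEach-Object { '{0}% {1}' -f $_.Percent, $_.Category }
# 80% Disk Space
# 15% Service Crash
# 5% Performance
```

A breakdown like this is also a cheap sanity check on an AI-generated runbook: its top branches should correspond to the top categories.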
Supported AI providers:
- Anthropic (Claude) -- Best for complex decision trees
- OpenAI (GPT-4o) -- Good general purpose
- Ollama -- Local/private, no data leaves your network
- Custom -- Any OpenAI-compatible endpoint
Always review AI-generated runbooks before production use.
Steps marked requires_approval: true will pause execution and request approval:
- Console: Interactive prompt at the terminal
- Email: Sends an email with runbook details and blast radius assessment
- Teams: Posts an Adaptive Card to a Teams webhook
- Slack: Posts a Block Kit message to a Slack webhook
# Console approval (interactive)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'DC01' `
-RequireApproval -ApprovalMethod Console
# Teams webhook approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'SQL01' `
-RequireApproval -ApprovalMethod Teams `
-ApprovalContact 'https://outlook.office.com/webhook/...'
# Skip all approvals (use with caution)
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' -Force

Get-CIHealthScore aggregates multiple data sources into a single 0-100 score:
| Factor | Max Penalty | Source |
|---|---|---|
| Failed runbook executions | -25 | Execution logs |
| Escalated issues | -15 | Execution logs |
| Recurring failures (low success rate) | -20 | Learning system |
| Recent configuration changes | -10 | Infra-ChangeTracker |
| Open ITSM tickets | -15 | ITSM provider |
| Stopped auto-start services | -20 | Live CIM check |
| Expired certificates | -20 | Certificate-LifecycleMonitor |
| Expiring certificates (within 30 days) | -10 | Certificate-LifecycleMonitor |
Grading: A = 90-100, B = 80-89, C = 70-79, D = 60-69, F = below 60
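The maximum penalties in the table add up to 135, so a score built from them needs a floor. The sketch below shows one way the arithmetic could work: each factor is capped at its maximum, the total is subtracted from 100, and the result is clamped at 0 before grading. The factor key names, the clamping, and the Get-HealthScoreSketch name are assumptions; how Get-CIHealthScore scales each penalty internally is not described here.

```powershell
# Maximum penalty per factor, taken from the table above (key names are assumed)
$maxPenalty = [ordered]@{
    FailedExecutions    = 25; Escalations          = 15
    RecurringFailures   = 20; RecentChanges        = 10
    OpenTickets         = 15; StoppedServices      = 20
    ExpiredCertificates = 20; ExpiringCertificates = 10
}

function Get-HealthScoreSketch {
    param([hashtable]$Penalties = @{})

    $total = 0
    foreach ($factor in $Penalties.Keys) {
        if (-not $maxPenalty.Contains($factor)) { throw "Unknown factor '$factor'" }
        # Cap each factor at its maximum penalty.
        $total += [math]::Min([int]$Penalties[$factor], [int]$maxPenalty[$factor])
    }

    $score = [math]::Max(0, 100 - $total)    # never drop below zero
    $grade = if     ($score -ge 90) { 'A' }
             elseif ($score -ge 80) { 'B' }
             elseif ($score -ge 70) { 'C' }
             elseif ($score -ge 60) { 'D' }
             else                   { 'F' }

    [pscustomobject]@{ Score = $score; Grade = $grade }
}

# 10 points for failed executions plus 20 for stopped services: score 70, grade C
Get-HealthScoreSketch -Penalties @{ FailedExecutions = 10; StoppedServices = 20 }
```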
Get-CIHealthScore -ComputerName 'DC01' -IncludeRunbookHistory -IncludeChangeHistory -DaysBack 30 `
-OutputPath 'C:\Reports\dc01-health.html'

Get-ShiftHandoff generates a summary of everything that happened during your shift:
- Runbook executions: what ran, what succeeded, what failed
- Escalations: what was escalated and why
- Pending approvals: what is waiting for someone to approve
- Correlation alerts: patterns detected across multiple runbooks
- Next-shift notes: what the incoming team needs to know
# Generate and email
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html' `
-SendEmail -SmtpServer 'mail.contoso.com' `
-EmailTo 'nightshift@contoso.com' -EmailFrom 'runbooks@contoso.com'

Every time a runbook step executes, the engine records whether it succeeded or failed. Over time, this builds a knowledge base:
- Success rates per step: "Restarting the Search Indexer resolves high CPU 85% of the time"
- Failure patterns: "DNS cache flush fails on Server 2012 R2 boxes"
- Temporal patterns: "This runbook fires every Monday morning after the backup window"
The learning data is stored in $env:USERPROFILE\.runbookengine\learnings.json and can be queried through the execution history.
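A per-step success rate of the kind quoted above can be computed from a flat list of outcomes. The records below are made up, and their shape (Step and Succeeded properties) is an assumption; the actual layout of learnings.json may differ.

```powershell
# Hypothetical learning records, one per executed step
$learnings = @(
    [pscustomobject]@{ Step = 'restart_search_indexer'; Succeeded = $true }
    [pscustomobject]@{ Step = 'restart_search_indexer'; Succeeded = $true }
    [pscustomobject]@{ Step = 'restart_search_indexer'; Succeeded = $true }
    [pscustomobject]@{ Step = 'restart_search_indexer'; Succeeded = $false }
    [pscustomobject]@{ Step = 'flush_dns_cache';        Succeeded = $false }
)

# Success rate per step: successful attempts divided by all attempts
$rates = $learnings | Group-Object -Property Step | ForEach-Object {
    $ok = @($_.Group | Where-Object { $_.Succeeded }).Count
    [pscustomobject]@{
        Step        = $_.Name
        Attempts    = $_.Count
        SuccessRate = [math]::Round(100 * $ok / $_.Count)
    }
}

$rates | Sort-Object -Property SuccessRate -Descending
```

With this data, restart_search_indexer comes out at 75 percent over four attempts and flush_dns_cache at 0 percent over one. Reporting Attempts next to the rate matters, since a rate based on a single run says very little.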
The engine detects when multiple runbooks fire for related systems within a time window. Known patterns:
| Combination | Possible Root Cause |
|---|---|
| DNS + Replication | Domain Controller failure or network partition |
| High CPU + Memory Pressure | Resource exhaustion, possible runaway process |
| Disk Space + Backup Failure | Disk exhaustion causing backup failures |
| Service Recovery + High CPU | Service crash-restart loop |
| Certificate Expiry + Service Recovery | Expired certificate causing service failures |
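The detection can be sketched as a pairwise time-window check against a list of known combinations. This is a simplified illustration under several assumptions: the RunbookName and StartTime properties, the 30-minute default window, and the Find-RunbookCorrelation name are hypothetical, only two of the patterns from the table are included, and the check for whether the systems are actually related is left out.

```powershell
# Two of the known patterns from the table, keyed by template file name
$knownPatterns = @(
    @{ Runbooks = @('dns-resolution', 'replication-failure')
       Cause    = 'Domain Controller failure or network partition' }
    @{ Runbooks = @('disk-space', 'backup-failure')
       Cause    = 'Disk exhaustion causing backup failures' }
)

function Find-RunbookCorrelation {
    param([object[]]$Executions, [int]$WindowMinutes = 30)

    foreach ($pattern in $knownPatterns) {
        $first  = @($Executions | Where-Object { $_.RunbookName -eq $pattern.Runbooks[0] })
        $second = @($Executions | Where-Object { $_.RunbookName -eq $pattern.Runbooks[1] })

        # A pattern matches when any run of each runbook started within the window.
        $matched = $false
        foreach ($a in $first) {
            foreach ($b in $second) {
                $gap = [math]::Abs(($a.StartTime - $b.StartTime).TotalMinutes)
                if ($gap -le $WindowMinutes) { $matched = $true }
            }
        }
        if ($matched) {
            [pscustomobject]@{
                Runbooks      = $pattern.Runbooks -join ' + '
                PossibleCause = $pattern.Cause
            }
        }
    }
}

$t0 = [datetime]'2025-01-06T03:00:00'
$executions = @(
    [pscustomobject]@{ RunbookName = 'dns-resolution';      StartTime = $t0 }
    [pscustomobject]@{ RunbookName = 'replication-failure'; StartTime = $t0.AddMinutes(12) }
    [pscustomobject]@{ RunbookName = 'disk-space';          StartTime = $t0.AddHours(5) }
)
$correlations = @(Find-RunbookCorrelation -Executions $executions)
$correlations
```

Here only the DNS and replication pair is reported: the two runs started 12 minutes apart, while disk-space ran alone with no backup failure to pair with.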
| Function | Description |
|---|---|
| Invoke-Runbook | Execute a YAML runbook as a decision tree with approvals and verification |
| New-Runbook | Create a runbook from template or generate via AI from ticket history |
| Get-RunbookStatus | Query execution history with filtering by status, target, and time |
| Get-CIHealthScore | Calculate aggregate health score (0-100) for configuration items |
| Get-ShiftHandoff | Generate shift handoff report with activity, escalations, and notes |
- PowerShell 5.1 or later
- Windows operating system (for CIM/WMI and AD cmdlets)
- Optional: powershell-yaml module (for advanced YAML parsing; the built-in parser handles standard runbooks)
- Optional: ActiveDirectory module (for blast radius DC detection)
- Optional: AI provider API key (for AI-generated runbooks)
- Optional: Portfolio modules for cross-module integration (all integrations degrade gracefully)
- Issues and feature requests: GitHub Issues
- Portfolio site: larro1991.github.io
- Contributions: Pull requests welcome. Please include Pester tests for new features.
MIT License - see LICENSE for details.
Copyright (c) 2025 Larry Roberts