Infra-RunbookEngine

Automated runbook execution engine with YAML decision trees, AI-generated runbooks, blast radius checks, approval workflows, and cross-module integration for Windows infrastructure.

The Problem

Infrastructure teams face the same issues repeatedly -- high CPU, disk space, service crashes, replication failures -- and the remediation steps are well-known but inconsistently applied. Junior admins might skip blast radius checks. Night-shift operators might not know the right order of operations. Knowledge lives in people's heads, not in code.

Manual runbooks in wikis and Word documents go stale. They cannot make decisions, cannot check dependencies, and cannot learn from past outcomes. When an alert fires at 3 AM, you need an engine that can walk through a diagnostic decision tree, check if the target is a Domain Controller before restarting a service, ask for approval when the blast radius is high, verify the fix actually worked, and record what happened for the morning brief.

Infra-RunbookEngine turns static documentation into executable decision trees.

How It Works

Each runbook is a YAML file defining a decision tree. The engine reads the YAML, walks the tree step by step, and at each node decides what to do next based on conditions, integrations, and outcomes.

Alert Fires
    |
    v
[Read YAML Runbook] --> Validate parameters
    |
    v
[Check Maintenance Window] --> Skip if in window
    |
    v
[Step 1: Diagnose] --> Run PowerShell script, capture output
    |
    v
[Step 2: Decision] --> Evaluate condition
   / \
  /   \
True  False
 |      |
 v      v
[Fix] [Escalate]
 |
 v
[Blast Radius Check] --> Is this a DC? SQL Server? How many users?
 |
 v
[Approval Gate] --> Console / Email / Teams / Slack
 |
 v
[Execute Fix] --> Run remediation script
 |
 v
[Verify Fix] --> Wait, then check if the issue is resolved
 |
 v
[Record Learning] --> Track success/failure for next time
 |
 v
[Generate Report] --> HTML dashboard with full execution details
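
The walk shown above can be sketched in a few lines of PowerShell. This is a minimal illustration and not the engine's own code: the `Invoke-TreeWalk` name, the hashtable layout, the `run` script blocks, and the `next` key that chains non-decision steps are all assumptions made for this example.

```powershell
# Hypothetical sketch: walk a decision tree held in a hashtable of steps.
function Invoke-TreeWalk {
    param(
        [Parameter(Mandatory)] [System.Collections.IDictionary] $Steps,
        [Parameter(Mandatory)] [string] $StartId
    )

    $visited = New-Object System.Collections.Generic.List[string]
    $state   = @{}            # outputs captured from earlier steps, keyed by step id
    $current = $StartId

    while ($current) {
        if ($visited.Contains($current)) { throw "Cycle detected at step '$current'" }
        $visited.Add($current)

        $step = $Steps[$current]
        if (-not $step) { throw "Unknown step id '$current'" }

        switch ($step.action) {
            'decision' {
                # The condition is a script block that receives the shared state.
                $result  = & $step.condition $state
                $current = if ($result) { $step.if_true } else { $step.if_false }
            }
            default {
                # script / notify / escalate: run the body if there is one.
                if ($step.run) { $state[$current] = & $step.run $state }
                $current = $step.next     # $null ends the walk
            }
        }
    }
    return ,$visited.ToArray()
}

# Example tree: diagnose, branch on the result, then fix or escalate.
$steps = @{
    diagnose = @{ action = 'script'; run = { 95 }; next = 'evaluate' }
    evaluate = @{ action = 'decision'; condition = { param($s) $s['diagnose'] -gt 90 }
                  if_true = 'fix_it'; if_false = 'escalate' }
    fix_it   = @{ action = 'script'; run = { 'restarted' } }
    escalate = @{ action = 'escalate' }
}
$path = Invoke-TreeWalk -Steps $steps -StartId 'diagnose'
$path -join ' -> '   # diagnose -> evaluate -> fix_it
```

The cycle check is there because a hand-written tree can easily point `if_false` back at an earlier step; failing fast is safer than looping at 3 AM.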

Module Connections

This module is designed to work with the rest of the portfolio. Here is how the modules connect:

The Integration Map

+-------------------------------------------------------------+
|                    Infra-RunbookEngine                       |
|                   (Decision & Execution)                     |
+-------------------------------------------------------------+
|                                                             |
|  +------------------+    +------------------+              |
|  | ITSM-Insights    |--->| AI-Generated     |              |
|  | (Ticket History) |    | Runbooks         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Infra-Change     |--->| Context-Aware    |              |
|  | Tracker          |    | Decisions        |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Infra-LivingDoc  |--->| Expected State   |              |
|  | (Documentation)  |    | Comparison       |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Certificate-     |--->| Cert Expiry      |              |
|  | LifecycleMonitor |    | Runbooks         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Admin-Morning    |<---| Runbook Results  |              |
|  | Brief            |    | & Alerts         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | AD-Security      |--->| Security Context |              |
|  | Audit            |    | for Blast Radius |              |
|  +------------------+    +------------------+              |
|                                                             |
+-------------------------------------------------------------+

ITSM-Insights --> AI-Generated Runbooks

New-Runbook -FromTicketHistory can query a supported ITSM source (ServiceNow, Jira, or a CSV export) for all tickets related to a configuration item. The ticket data -- summaries, categories, resolutions -- is sent to an AI provider, which analyzes patterns and generates a YAML runbook targeting the most common issues. If you have been using ITSM-Insights to analyze ticket trends, the same data source can be fed directly into runbook generation. A server that keeps getting the same three types of tickets gets a runbook that handles all three, with a decision tree that distinguishes between them.

Infra-ChangeTracker --> Context-Aware Decisions

Runbook steps can use action: integration to call Get-ServerConfigChanges from the Infra-ChangeTracker module. This is used in the high-cpu template, for example: before deciding how to remediate high CPU, the runbook checks what changed on the server in the last 24 hours. If a new application was deployed yesterday, the runbook suggests investigating the change rather than blindly restarting services. This turns runbooks from "always do the same thing" into "decide based on context."

Infra-LivingDoc --> Expected State Comparison

Runbooks can query Infra-LivingDoc to compare the current state of a server against its documented expected state. If a service is stopped but the documentation says it should be running, the runbook knows this is an anomaly. If the documentation says the server is decommissioned, the runbook knows not to remediate. This prevents runbooks from fighting against intentional changes.

Certificate-LifecycleMonitor --> Cert Expiry Runbooks

The certificate-expiry template integrates directly with Certificate-LifecycleMonitor's Get-CertificateRenewalStatus function. When a certificate expiry alert triggers the runbook, it checks whether auto-renewal is already configured, identifies which services use the certificate (IIS, RDP, LDAPS), and decides whether to notify or escalate based on how close the expiry date is. The health score calculation also pulls certificate expiry data to factor into the overall CI health grade.

Admin-MorningBrief <-- Runbook Results

Every runbook execution is logged to $env:USERPROFILE\.runbookengine\executions\ as a JSON file. Get-ShiftHandoff collects these execution results -- what ran, what succeeded, what failed, what escalated, what is still pending approval -- and generates a summary that can be consumed by Admin-MorningBrief. The morning brief can show: "3 runbooks ran overnight, 2 completed successfully, 1 was escalated for manual review."
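
A rough idea of how such a summary could be built from the execution folder is sketched below. The JSON schema is not documented in this README, so the top-level `Status` property and the `Get-ExecutionSummary` helper are assumptions for illustration only.

```powershell
# Hypothetical helper: count execution records by status.
# Assumes each JSON file carries a top-level 'Status' property.
function Get-ExecutionSummary {
    param(
        [Parameter(Mandatory)] [string] $Path
    )

    $summary = @{}
    if (-not (Test-Path -LiteralPath $Path)) { return $summary }

    Get-ChildItem -LiteralPath $Path -Filter '*.json' -File | ForEach-Object {
        $record = Get-Content -LiteralPath $_.FullName -Raw | ConvertFrom-Json
        $status = [string]$record.Status
        if (-not $status) { $status = 'Unknown' }
        # A missing key reads as $null, which casts to 0.
        $summary[$status] = 1 + [int]$summary[$status]
    }
    return $summary
}
```

Pointed at `$env:USERPROFILE\.runbookengine\executions`, the returned hashtable holds one counter per status value found in the files, which is enough to phrase a line like the morning-brief example above.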

AD-SecurityAudit --> Blast Radius Checks

Before executing any remediation step marked requires_approval: true, the engine runs Test-BlastRadius. This function checks Active Directory (via Get-ADComputer and service principal names) to detect if the target is a Domain Controller, SQL Server, Exchange server, file server, or print server. It also checks connected user sessions. A restart on a file server with 50 connected users gets flagged as High blast radius and requires approval. This security context comes from the same AD infrastructure that AD-SecurityAudit monitors.

Standalone Usage

Every integration is optional. If you do not have Infra-ChangeTracker installed, integration steps that call it will be gracefully skipped. If you do not have Certificate-LifecycleMonitor, the health score simply omits certificate factors. The core engine -- YAML parsing, decision trees, approvals, verification, learning -- works entirely on its own.
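
One common way to get this kind of graceful skip in PowerShell is to look the function up with `Get-Command` before calling it. The sketch below shows that pattern; `Invoke-OptionalIntegration` is a hypothetical name, and the engine's real mechanism may differ.

```powershell
# Hypothetical helper: call a function from an optional module, or skip quietly.
function Invoke-OptionalIntegration {
    param(
        [Parameter(Mandatory)] [string] $FunctionName,
        [hashtable] $Parameters = @{}
    )

    # Returns nothing (instead of throwing) when the command cannot be found.
    $command = Get-Command -Name $FunctionName -ErrorAction SilentlyContinue
    if (-not $command) {
        Write-Verbose "Skipping integration: '$FunctionName' is not available."
        return [pscustomobject]@{ Skipped = $true; Result = $null }
    }

    # Splat the parameters onto the resolved command.
    $result = & $command @Parameters
    return [pscustomobject]@{ Skipped = $false; Result = $result }
}

# Demonstration with a built-in cmdlet and a name that does not exist.
(Invoke-OptionalIntegration -FunctionName 'Get-Date').Skipped              # False
(Invoke-OptionalIntegration -FunctionName 'Get-NoSuchIntegration').Skipped # True
```

Returning a `Skipped` flag rather than `$null` lets a later decision step tell "the integration found nothing" apart from "the integration is not installed".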

Quick Start

# Import the module
Import-Module .\Infra-RunbookEngine

# Preview what a runbook would do (dry run)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'SERVER01' -WhatIf

# Execute a runbook with console approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'WEB01' `
    -Parameters @{ ServiceName = 'W3SVC' } -RequireApproval -ApprovalMethod Console

# Execute with Force (skip approvals) and generate HTML report
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' `
    -Force -OutputPath 'C:\Reports'

# Create a custom runbook from a template
New-Runbook -Name 'my-cpu-check' -FromTemplate 'high-cpu' -OutputPath 'C:\Runbooks'

# Generate an AI-powered runbook from ticket history
New-Runbook -Name 'webserver-issues' -FromTicketHistory `
    -ITSMProvider ServiceNow -ITSMEndpoint 'https://instance.service-now.com' `
    -ITSMCredential $cred -CIName 'WEB01' -Provider OpenAI -ApiKey $apiKey

# Check health score
Get-CIHealthScore -ComputerName 'DC01','DC02' -IncludeRunbookHistory

# Generate shift handoff
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html'

# Check execution history
Get-RunbookStatus -Last 10
Get-RunbookStatus -ComputerName 'SERVER01' -Status Failed

Built-in Templates

Template	File	Description
High CPU	`high-cpu.yml`	Diagnose top CPU consumers, check for known offenders, query recent changes, restart or escalate
Disk Space	`disk-space.yml`	Check usage, find cleanable paths, remove temp/log files, identify large files, escalate if insufficient
Service Recovery	`service-recovery.yml`	Check service state, verify dependencies, attempt restart, check event logs for crash reasons
Certificate Expiry	`certificate-expiry.yml`	Scan for expiring certs, identify service bindings, check auto-renewal, escalate expired certs
DNS Resolution	`dns-resolution.yml`	Check DNS service, test resolution, check forwarders, flush cache, retry and escalate
Replication Failure	`replication-failure.yml`	Check repadmin status, identify failing partner, test network, verify time sync, force replication
Backup Failure	`backup-failure.yml`	Check backup service, verify disk space on target, test network, review logs, attempt re-run
Memory Pressure	`memory-pressure.yml`	Assess memory usage, detect memory leaks, check page file, restart safe services, escalate

Creating Custom Runbooks

Runbooks are YAML files with this structure:

name: My Custom Runbook
version: "1.0"
description: What this runbook does
trigger:
  metric: some_metric
  threshold: 90
  duration_minutes: 15

parameters:
  - name: ComputerName
    required: true
  - name: CustomParam
    default: some_value

steps:
  - id: diagnose
    action: script
    description: Run a diagnostic script
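    script: |
      # Invoke-Command works on PowerShell 5.1 and 7+ (Get-Process -ComputerName is 5.1-only).
      # Requires PowerShell remoting on the target.
      Invoke-Command -ComputerName $ComputerName -ScriptBlock {
        Get-Process | Sort-Object CPU -Descending | Select-Object -First 5
      }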
    outputs:
      - process_list

  - id: evaluate
    action: decision
    description: Check the results
    condition: "$process_list.Count -gt 0"
    if_true: fix_it
    if_false: escalate

  - id: fix_it
    action: script
    description: Apply the fix
    requires_approval: true
    blast_radius: single_service
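    script: |
      # Restart-Service has no -ComputerName parameter, so run it on the target via remoting.
      Invoke-Command -ComputerName $ComputerName -ScriptBlock {
        Restart-Service -Name MyService -Force
      }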
    verify:
      wait_seconds: 30
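      check: |
        # Evaluate on the target so the check returns a plain boolean on 5.1 and 7+.
        Invoke-Command -ComputerName $ComputerName -ScriptBlock {
          (Get-Service -Name MyService).Status -eq 'Running'
        }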

  - id: escalate
    action: escalate
    priority: high
    description: Cannot resolve automatically
    message: "Issue on $ComputerName requires manual intervention"

Step Actions

Action	Purpose	Key Fields
`script`	Run PowerShell code	`script`, `outputs`, `verify`, `requires_approval`, `blast_radius`
`decision`	Branch the decision tree	`condition`, `if_true`, `if_false`
`integration`	Call another module	`module`, `function`, `parameters`, `outputs`
`notify`	Send a notification	`message`, `include_data`
`escalate`	Escalate to human	`priority`, `message`

Blast Radius Levels

Level	Meaning	Approval
`single_service`	Affects one service on one server	Optional
`single_server`	Affects the entire server	Recommended
`multi_server`	Affects multiple servers	Required
`domain_wide`	Affects the entire domain	Required
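
The table above maps naturally onto a small lookup. The sketch below is illustrative only: `Test-ApprovalRequired`, the `-Strict` switch, and the fail-safe treatment of unknown levels are assumptions, and this README does not say how the engine combines these levels with `requires_approval`.

```powershell
# Hypothetical lookup mirroring the blast radius table.
$ApprovalPolicy = @{
    single_service = 'Optional'
    single_server  = 'Recommended'
    multi_server   = 'Required'
    domain_wide    = 'Required'
}

function Test-ApprovalRequired {
    param(
        [Parameter(Mandatory)] [string] $BlastRadius,
        [switch] $Strict      # treat 'Recommended' as required
    )

    $policy = $ApprovalPolicy[$BlastRadius]
    if (-not $policy)            { return $true }   # unknown level: fail safe
    if ($policy -eq 'Required')  { return $true }
    return [bool]($Strict -and $policy -eq 'Recommended')
}

Test-ApprovalRequired -BlastRadius 'domain_wide'            # True
Test-ApprovalRequired -BlastRadius 'single_service'         # False
Test-ApprovalRequired -BlastRadius 'single_server' -Strict  # True
```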

AI-Generated Runbooks

The New-Runbook -FromTicketHistory flow:

Queries your ITSM for all tickets related to a CI
Sends ticket summaries to an AI provider (Anthropic, OpenAI, Ollama, or Custom)
Has the AI analyze patterns, for example: "80% of tickets for this server are disk space, 15% are service crashes, 5% are performance"
Generates a YAML decision tree that handles the top issues
Saves the runbook for review and use

Supported AI providers:

Anthropic (Claude) -- Best for complex decision trees
OpenAI (GPT-4o) -- Good general purpose
Ollama -- Local/private, no data leaves your network
Custom -- Any OpenAI-compatible endpoint

Always review AI-generated runbooks before production use.

Approval Workflows

Steps marked requires_approval: true will pause execution and request approval:

Console: Interactive prompt at the terminal
Email: Sends an email with runbook details and blast radius assessment
Teams: Posts an Adaptive Card to a Teams webhook
Slack: Posts a Block Kit message to a Slack webhook

# Console approval (interactive)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'DC01' `
    -RequireApproval -ApprovalMethod Console

# Teams webhook approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'SQL01' `
    -RequireApproval -ApprovalMethod Teams `
    -ApprovalContact 'https://outlook.office.com/webhook/...'

# Skip all approvals (use with caution)
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' -Force

Health Scores

Get-CIHealthScore aggregates multiple data sources into a single 0-100 score:

Factor	Max Penalty	Source
Failed runbook executions	-25	Execution logs
Escalated issues	-15	Execution logs
Recurring failures (low success rate)	-20	Learning system
Recent configuration changes	-10	Infra-ChangeTracker
Open ITSM tickets	-15	ITSM provider
Stopped auto-start services	-20	Live CIM check
Expired certificates	-20	Certificate-LifecycleMonitor
Expiring certificates (within 30 days)	-10	Certificate-LifecycleMonitor

Grading: A = 90-100, B = 80-89, C = 70-79, D = 60-69, F = below 60

Get-CIHealthScore -ComputerName 'DC01' -IncludeRunbookHistory -IncludeChangeHistory -DaysBack 30 `
    -OutputPath 'C:\Reports\dc01-health.html'
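
The maximum penalties in the table add up to 135, so a score built by subtraction has to be floored somewhere. The sketch below assumes the score starts at 100, subtracts the penalties that apply, and is clamped at 0 before grading; the `Get-HealthGrade` helper and the clamping rule are assumptions, not the module's documented behavior.

```powershell
# Hypothetical scoring sketch: start at 100, subtract penalties, clamp at 0, grade.
function Get-HealthGrade {
    param(
        [int[]] $Penalties = @()    # positive numbers, one per factor that applies
    )

    $total = ($Penalties | Measure-Object -Sum).Sum
    $score = [Math]::Max(0, 100 - [int]$total)

    # Grade bands as listed above: A = 90-100 ... F = below 60.
    $grade = switch ($score) {
        { $_ -ge 90 } { 'A'; break }
        { $_ -ge 80 } { 'B'; break }
        { $_ -ge 70 } { 'C'; break }
        { $_ -ge 60 } { 'D'; break }
        default       { 'F' }
    }

    [pscustomobject]@{ Score = $score; Grade = $grade }
}

# Failed runbook executions (25) plus expiring certificates (10):
Get-HealthGrade -Penalties 25, 10    # Score 65, Grade D
```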

Shift Handoff

Get-ShiftHandoff generates a summary of everything that happened during your shift:

Runbook executions: what ran, what succeeded, what failed
Escalations: what was escalated and why
Pending approvals: what is waiting for someone to approve
Correlation alerts: patterns detected across multiple runbooks
Next-shift notes: what the incoming team needs to know

# Generate and email
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html' `
    -SendEmail -SmtpServer 'mail.contoso.com' `
    -EmailTo 'nightshift@contoso.com' -EmailFrom 'runbooks@contoso.com'

Learning Loop

Every time a runbook step executes, the engine records whether it succeeded or failed. Over time, this builds a knowledge base:

Success rates per step: "Restarting the Search Indexer resolves high CPU 85% of the time"
Failure patterns: "DNS cache flush fails on Server 2012 R2 boxes"
Temporal patterns: "This runbook fires every Monday morning after the backup window"

The learning data is stored in $env:USERPROFILE\.runbookengine\learnings.json and can be queried through the execution history.
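
As an illustration of how per-step success rates can be derived from recorded outcomes, here is a small sketch. The layout of learnings.json is not described in this README, so the record shape used here (`StepId` and `Succeeded` properties) and the `Get-StepSuccessRate` helper are hypothetical.

```powershell
# Hypothetical record shape: one object per step execution with StepId and Succeeded.
function Get-StepSuccessRate {
    param(
        [Parameter(Mandatory)] [object[]] $Records
    )

    $Records | Group-Object -Property StepId | ForEach-Object {
        $group = $_
        $wins  = @($group.Group | Where-Object { $_.Succeeded }).Count
        [pscustomobject]@{
            StepId      = $group.Name
            Runs        = $group.Count
            SuccessRate = [Math]::Round(100 * $wins / $group.Count, 1)
        }
    }
}

$records = @(
    [pscustomobject]@{ StepId = 'restart_indexer'; Succeeded = $true }
    [pscustomobject]@{ StepId = 'restart_indexer'; Succeeded = $true }
    [pscustomobject]@{ StepId = 'restart_indexer'; Succeeded = $false }
    [pscustomobject]@{ StepId = 'flush_dns';       Succeeded = $false }
)
Get-StepSuccessRate -Records $records
# restart_indexer: 3 runs, 66.7 percent; flush_dns: 1 run, 0 percent
```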

Cross-Runbook Correlation

The engine detects when multiple runbooks fire for related systems within a time window. Known patterns:

Combination	Possible Root Cause
DNS + Replication	Domain Controller failure or network partition
High CPU + Memory Pressure	Resource exhaustion, possible runaway process
Disk Space + Backup Failure	Disk exhaustion causing backup failures
Service Recovery + High CPU	Service crash-restart loop
Certificate Expiry + Service Recovery	Expired certificate causing service failures
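
The pairing logic can be sketched as a lookup over the table above plus a time-window test. This is a simplified illustration: the `Find-RunbookCorrelation` helper, the 30-minute default window, and the `Runbook`/`Time` record shape are assumptions, and the sketch ignores the "related systems" part of the check.

```powershell
# Hypothetical sketch: flag known runbook pairs that fired close together.
$KnownPatterns = @(
    @{ Pair = 'dns-resolution', 'replication-failure';  Cause = 'Domain Controller failure or network partition' }
    @{ Pair = 'high-cpu', 'memory-pressure';            Cause = 'Resource exhaustion, possible runaway process' }
    @{ Pair = 'disk-space', 'backup-failure';           Cause = 'Disk exhaustion causing backup failures' }
    @{ Pair = 'service-recovery', 'high-cpu';           Cause = 'Service crash-restart loop' }
    @{ Pair = 'certificate-expiry', 'service-recovery'; Cause = 'Expired certificate causing service failures' }
)

function Find-RunbookCorrelation {
    param(
        [Parameter(Mandatory)] [object[]] $Executions,   # objects with Runbook and Time
        [int] $WindowMinutes = 30
    )

    foreach ($pattern in $KnownPatterns) {
        $first  = @($Executions | Where-Object { $_.Runbook -eq $pattern.Pair[0] })
        $second = @($Executions | Where-Object { $_.Runbook -eq $pattern.Pair[1] })

        $hit = $false
        foreach ($a in $first) {
            foreach ($b in $second) {
                $gap = [Math]::Abs(($a.Time - $b.Time).TotalMinutes)
                if ($gap -le $WindowMinutes) { $hit = $true }
            }
        }

        if ($hit) {
            [pscustomobject]@{
                Runbooks      = $pattern.Pair -join ' + '
                PossibleCause = $pattern.Cause
            }
        }
    }
}

$now = Get-Date
$executions = @(
    [pscustomobject]@{ Runbook = 'disk-space';     Time = $now.AddMinutes(-20) }
    [pscustomobject]@{ Runbook = 'backup-failure'; Time = $now.AddMinutes(-5) }
    [pscustomobject]@{ Runbook = 'high-cpu';       Time = $now.AddHours(-6) }
)
Find-RunbookCorrelation -Executions $executions
# One match: disk-space + backup-failure
```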

Functions Reference

Function	Description
`Invoke-Runbook`	Execute a YAML runbook as a decision tree with approvals and verification
`New-Runbook`	Create a runbook from template or generate via AI from ticket history
`Get-RunbookStatus`	Query execution history with filtering by status, target, and time
`Get-CIHealthScore`	Calculate aggregate health score (0-100) for configuration items
`Get-ShiftHandoff`	Generate shift handoff report with activity, escalations, and notes

Requirements

PowerShell 5.1 or later
Windows operating system (for CIM/WMI and AD cmdlets)
Optional: powershell-yaml module (for advanced YAML parsing; built-in parser handles standard runbooks)
Optional: ActiveDirectory module (for blast radius DC detection)
Optional: AI provider API key (for AI-generated runbooks)
Optional: Portfolio modules for cross-module integration (all integrations degrade gracefully)

Feedback & Contributions

Issues and feature requests: GitHub Issues
Portfolio site: larro1991.github.io
Contributions: Pull requests welcome. Please include Pester tests for new features.

License

MIT License - see LICENSE for details.
