Infra-RunbookEngine

Automated runbook execution engine with YAML decision trees, AI-generated runbooks, blast radius checks, approval workflows, and cross-module integration for Windows infrastructure.

The Problem

Infrastructure teams face the same issues repeatedly -- high CPU, disk space, service crashes, replication failures -- and the remediation steps are well-known but inconsistently applied. Junior admins might skip blast radius checks. Night-shift operators might not know the right order of operations. Knowledge lives in people's heads, not in code.

Manual runbooks in wikis and Word documents go stale. They cannot make decisions, cannot check dependencies, and cannot learn from past outcomes. When an alert fires at 3 AM, you need an engine that can walk through a diagnostic decision tree, check if the target is a Domain Controller before restarting a service, ask for approval when the blast radius is high, verify the fix actually worked, and record what happened for the morning brief.

Infra-RunbookEngine turns static documentation into executable decision trees.

How It Works

Each runbook is a YAML file defining a decision tree. The engine reads the YAML, walks the tree step by step, and at each node decides what to do next based on conditions, integrations, and outcomes.

Alert Fires
    |
    v
[Read YAML Runbook] --> Validate parameters
    |
    v
[Check Maintenance Window] --> Skip if in window
    |
    v
[Step 1: Diagnose] --> Run PowerShell script, capture output
    |
    v
[Step 2: Decision] --> Evaluate condition
   / \
  /   \
True  False
 |      |
 v      v
[Fix] [Escalate]
 |
 v
[Blast Radius Check] --> Is this a DC? SQL Server? How many users?
 |
 v
[Approval Gate] --> Console / Email / Teams / Slack
 |
 v
[Execute Fix] --> Run remediation script
 |
 v
[Verify Fix] --> Wait, then check if the issue is resolved
 |
 v
[Record Learning] --> Track success/failure for next time
 |
 v
[Generate Report] --> HTML dashboard with full execution details
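
The flow above can be sketched as a small loop over step definitions. This is an illustration only, not the engine's actual implementation: the function name `Invoke-StepTree`, the in-memory hashtable layout, and the `next` field are assumptions. Only `id`, `action`, `condition`, `if_true`, and `if_false` mirror the runbook format described in this README.

```powershell
# Hypothetical sketch: walk a list of step hashtables as a decision tree.
function Invoke-StepTree {
    param(
        [Parameter(Mandatory)] [hashtable[]] $Steps,
        [Parameter(Mandatory)] [string] $StartId
    )

    # Index steps by id so decision branches can jump by name.
    $byId = @{}
    foreach ($step in $Steps) { $byId[$step.id] = $step }

    $state   = @{}    # step outputs keyed by step id
    $visited = New-Object System.Collections.Generic.List[string]
    $current = $StartId

    while ($current) {
        if (-not $byId.ContainsKey($current)) { throw "Unknown step id: $current" }
        if ($visited.Contains($current)) { throw "Loop detected at step: $current" }
        $visited.Add($current)
        $step = $byId[$current]

        switch ($step.action) {
            'script' {
                # Run the step and keep its output for later conditions.
                $state[$step.id] = & $step.script $state
                $current = $step.next    # 'next' is an assumed field
            }
            'decision' {
                $result  = [bool](& $step.condition $state)
                $current = if ($result) { $step.if_true } else { $step.if_false }
            }
            'escalate' {
                $state['escalated'] = $true
                $current = $null         # escalation ends the walk
            }
            default { throw "Unsupported action: $($step.action)" }
        }
    }

    [pscustomobject]@{ Path = $visited.ToArray(); State = $state }
}

# Usage: a diagnose -> decide -> fix-or-escalate tree with stand-in scripts.
$steps = @(
    @{ id = 'diagnose'; action = 'script'; script = { param($s) 95 }; next = 'evaluate' }
    @{ id = 'evaluate'; action = 'decision'; condition = { param($s) $s['diagnose'] -gt 90 }; if_true = 'fix'; if_false = 'escalate' }
    @{ id = 'fix'; action = 'script'; script = { param($s) 'restarted' } }
    @{ id = 'escalate'; action = 'escalate' }
)
$run = Invoke-StepTree -Steps $steps -StartId 'diagnose'
$run.Path -join ' -> '    # diagnose -> evaluate -> fix
```

Indexing steps by `id` keeps branching a simple lookup, and the visited list doubles as an execution trace.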

Module Connections

This module is designed to work with the rest of the portfolio. Here is how they connect:

The Integration Map

+-------------------------------------------------------------+
|                    Infra-RunbookEngine                       |
|                   (Decision & Execution)                     |
+-------------------------------------------------------------+
|                                                             |
|  +------------------+    +------------------+              |
|  | ITSM-Insights    |--->| AI-Generated     |              |
|  | (Ticket History) |    | Runbooks         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Infra-Change     |--->| Context-Aware    |              |
|  | Tracker          |    | Decisions        |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Infra-LivingDoc  |--->| Expected State   |              |
|  | (Documentation)  |    | Comparison       |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Certificate-     |--->| Cert Expiry      |              |
|  | LifecycleMonitor |    | Runbooks         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | Admin-Morning    |<---| Runbook Results  |              |
|  | Brief            |    | & Alerts         |              |
|  +------------------+    +------------------+              |
|                                                             |
|  +------------------+    +------------------+              |
|  | AD-Security      |--->| Security Context |              |
|  | Audit            |    | for Blast Radius |              |
|  +------------------+    +------------------+              |
|                                                             |
+-------------------------------------------------------------+

ITSM-Insights --> AI-Generated Runbooks

New-Runbook -FromTicketHistory can query a supported ITSM source (ServiceNow, Jira, or a CSV export) for all tickets related to a configuration item. The ticket data -- summaries, categories, resolutions -- is sent to an AI provider, which analyzes patterns and generates a YAML runbook targeting the most common issues. If you already use ITSM-Insights to analyze ticket trends, the same data source can feed runbook generation directly. A server that keeps getting the same three types of tickets gets one runbook that handles all three, with a decision tree that distinguishes between them.

Infra-ChangeTracker --> Context-Aware Decisions

Runbook steps can use action: integration to call Get-ServerConfigChanges from the Infra-ChangeTracker module. This is used in the high-cpu template, for example: before deciding how to remediate high CPU, the runbook checks what changed on the server in the last 24 hours. If a new application was deployed yesterday, the runbook suggests investigating the change rather than blindly restarting services. This turns runbooks from "always do the same thing" into "decide based on context."

Infra-LivingDoc --> Expected State Comparison

Runbooks can query Infra-LivingDoc to compare the current state of a server against its documented expected state. If a service is stopped but the documentation says it should be running, the runbook knows this is an anomaly. If the documentation says the server is decommissioned, the runbook knows not to remediate. This prevents runbooks from fighting against intentional changes.
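
The comparison described above might look like the following sketch. How Infra-LivingDoc actually exposes its data is not shown in this README, so the function name `Compare-ExpectedState` and the `Decommissioned` / `Services` hashtable shape are assumptions made for illustration.

```powershell
# Hypothetical sketch: decide whether to remediate by comparing observed
# service state with the documented expected state.
function Compare-ExpectedState {
    param(
        [Parameter(Mandatory)] [hashtable] $Expected,   # e.g. from documentation
        [Parameter(Mandatory)] [hashtable] $Actual      # e.g. from a live check
    )

    # A server documented as decommissioned should not be remediated.
    if ($Expected['Decommissioned']) {
        return [pscustomobject]@{
            Remediate = $false
            Reason    = 'Server is documented as decommissioned'
            Drift     = @()
        }
    }

    # Collect services whose live status differs from the documented one.
    $drift = @(foreach ($name in $Expected['Services'].Keys) {
        $want = $Expected['Services'][$name]
        $have = $Actual['Services'][$name]
        if ($have -ne $want) {
            [pscustomobject]@{ Service = $name; Expected = $want; Actual = $have }
        }
    })

    $hasDrift = $drift.Count -gt 0
    [pscustomobject]@{
        Remediate = $hasDrift
        Reason    = if ($hasDrift) { 'Live state differs from documentation' } else { 'Live state matches documentation' }
        Drift     = $drift
    }
}

# Usage: W3SVC should be running but is stopped, so it is reported as drift.
$expected = @{ Decommissioned = $false; Services = @{ W3SVC = 'Running'; Spooler = 'Stopped' } }
$actual   = @{ Services = @{ W3SVC = 'Stopped'; Spooler = 'Stopped' } }
$check = Compare-ExpectedState -Expected $expected -Actual $actual
```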

Certificate-LifecycleMonitor --> Cert Expiry Runbooks

The certificate-expiry template integrates directly with Certificate-LifecycleMonitor's Get-CertificateRenewalStatus function. When a certificate expiry alert triggers the runbook, it checks whether auto-renewal is already configured, identifies which services use the certificate (IIS, RDP, LDAPS), and decides whether to notify or escalate based on how close the expiry date is. The health score calculation also pulls certificate expiry data to factor into the overall CI health grade.

Admin-MorningBrief <-- Runbook Results

Every runbook execution is logged to $env:USERPROFILE\.runbookengine\executions\ as a JSON file. Get-ShiftHandoff collects these execution results -- what ran, what succeeded, what failed, what escalated, what is still pending approval -- and generates a summary that can be consumed by Admin-MorningBrief. The morning brief can show: "3 runbooks ran overnight, 2 completed successfully, 1 was escalated for manual review."
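
As a rough illustration of how such a summary could be built from the JSON files, the sketch below counts executions per status. The log schema is not documented here, so the `Status` property and its values are assumptions, and `Get-ExecutionSummary` is a hypothetical helper rather than part of the module.

```powershell
# Hypothetical sketch: count execution logs per status.
# The 'Status' property is an assumption about the log schema.
function Get-ExecutionSummary {
    param(
        [string] $Path = (Join-Path $env:USERPROFILE '.runbookengine\executions')
    )

    # No log folder yet means nothing has run.
    if (-not (Test-Path -LiteralPath $Path)) { return @{} }

    $counts = @{}
    Get-ChildItem -LiteralPath $Path -Filter '*.json' -File | ForEach-Object {
        $record = Get-Content -LiteralPath $_.FullName -Raw | ConvertFrom-Json
        $status = [string]$record.Status
        $counts[$status] = 1 + [int]$counts[$status]
    }
    $counts
}
```

Called without arguments it reads the default log folder; passing `-Path` points it at any other folder of execution logs.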

AD-SecurityAudit --> Blast Radius Checks

Before executing any remediation step marked requires_approval: true, the engine runs Test-BlastRadius. This function checks Active Directory (via Get-ADComputer and service principal names) to detect if the target is a Domain Controller, SQL Server, Exchange server, file server, or print server. It also checks connected user sessions. A restart on a file server with 50 connected users gets flagged as High blast radius and requires approval. This security context comes from the same AD infrastructure that AD-SecurityAudit monitors.
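
The rating logic can be pictured with a sketch like the one below. It is not the code behind Test-BlastRadius: the function name, the role labels, the Low/Medium/High scale, and the session threshold are all assumptions chosen to match the file server example above.

```powershell
# Hypothetical sketch: rate blast radius from detected roles and sessions.
# Role names, the rating scale, and the threshold are assumptions.
function Get-BlastRadiusRating {
    param(
        [string[]] $Roles = @(),          # e.g. 'DomainController', 'FileServer'
        [int] $ConnectedSessions = 0,
        [int] $SessionThreshold = 25      # assumed cut-off for "many users"
    )

    # Roles where any disruption is treated as high impact.
    $criticalRoles = 'DomainController', 'SqlServer', 'Exchange'
    $isCritical = @($Roles | Where-Object { $criticalRoles -contains $_ }).Count -gt 0

    if ($isCritical -or $ConnectedSessions -ge $SessionThreshold) { return 'High' }
    if ($Roles.Count -gt 0 -or $ConnectedSessions -gt 0) { return 'Medium' }
    'Low'
}

# Usage: a file server with 50 connected users.
Get-BlastRadiusRating -Roles 'FileServer' -ConnectedSessions 50    # High
```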

Standalone Usage

Every integration is optional. If you do not have Infra-ChangeTracker installed, integration steps that call it will be gracefully skipped. If you do not have Certificate-LifecycleMonitor, the health score simply omits certificate factors. The core engine -- YAML parsing, decision trees, approvals, verification, learning -- works entirely on its own.
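
One common way to get this kind of graceful skip in PowerShell is to probe for the command before calling it. The sketch below shows the idea; `Invoke-OptionalIntegration` is a hypothetical helper, and the `ComputerName` parameter passed to Get-ServerConfigChanges is an assumption about that function's signature.

```powershell
# Hypothetical sketch: call an optional integration only if it is available.
function Invoke-OptionalIntegration {
    param(
        [Parameter(Mandatory)] [string] $Function,
        [hashtable] $Parameters = @{}
    )

    $command = Get-Command -Name $Function -ErrorAction SilentlyContinue
    if (-not $command) {
        # Module not installed: report a skip instead of failing the runbook.
        return [pscustomobject]@{ Skipped = $true; Output = $null }
    }

    # Splat the supplied parameters onto the resolved command.
    [pscustomobject]@{ Skipped = $false; Output = (& $command @Parameters) }
}

# Usage: skipped unless Infra-ChangeTracker is installed.
# The ComputerName parameter name is an assumption.
$changes = Invoke-OptionalIntegration -Function 'Get-ServerConfigChanges' `
    -Parameters @{ ComputerName = 'SERVER01' }
if ($changes.Skipped) { 'Infra-ChangeTracker not available, continuing without change data' }
```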

Quick Start

# Import the module
Import-Module .\Infra-RunbookEngine

# Preview what a runbook would do (dry run)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'SERVER01' -WhatIf

# Execute a runbook with console approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'WEB01' `
    -Parameters @{ ServiceName = 'W3SVC' } -RequireApproval -ApprovalMethod Console

# Execute with Force (skip approvals) and generate HTML report
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' `
    -Force -OutputPath 'C:\Reports'

# Create a custom runbook from a template
New-Runbook -Name 'my-cpu-check' -FromTemplate 'high-cpu' -OutputPath 'C:\Runbooks'

# Generate an AI-powered runbook from ticket history
New-Runbook -Name 'webserver-issues' -FromTicketHistory `
    -ITSMProvider ServiceNow -ITSMEndpoint 'https://instance.service-now.com' `
    -ITSMCredential $cred -CIName 'WEB01' -Provider OpenAI -ApiKey $apiKey

# Check health score
Get-CIHealthScore -ComputerName 'DC01','DC02' -IncludeRunbookHistory

# Generate shift handoff
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html'

# Check execution history
Get-RunbookStatus -Last 10
Get-RunbookStatus -ComputerName 'SERVER01' -Status Failed

Built-in Templates

| Template | File | Description |
| --- | --- | --- |
| High CPU | high-cpu.yml | Diagnose top CPU consumers, check for known offenders, query recent changes, restart or escalate |
| Disk Space | disk-space.yml | Check usage, find cleanable paths, remove temp/log files, identify large files, escalate if insufficient |
| Service Recovery | service-recovery.yml | Check service state, verify dependencies, attempt restart, check event logs for crash reasons |
| Certificate Expiry | certificate-expiry.yml | Scan for expiring certs, identify service bindings, check auto-renewal, escalate expired certs |
| DNS Resolution | dns-resolution.yml | Check DNS service, test resolution, check forwarders, flush cache, retry and escalate |
| Replication Failure | replication-failure.yml | Check repadmin status, identify failing partner, test network, verify time sync, force replication |
| Backup Failure | backup-failure.yml | Check backup service, verify disk space on target, test network, review logs, attempt re-run |
| Memory Pressure | memory-pressure.yml | Assess memory usage, detect memory leaks, check page file, restart safe services, escalate |

Creating Custom Runbooks

Runbooks are YAML files with this structure:

name: My Custom Runbook
version: "1.0"
description: What this runbook does
trigger:
  metric: some_metric
  threshold: 90
  duration_minutes: 15

parameters:
  - name: ComputerName
    required: true
  - name: CustomParam
    default: some_value

steps:
  - id: diagnose
    action: script
    description: Run a diagnostic script
    script: |
      Get-Process -ComputerName $ComputerName | Select-Object -First 5
    outputs:
      - process_list

  - id: evaluate
    action: decision
    description: Check the results
    condition: "$process_list.Count -gt 0"
    if_true: fix_it
    if_false: escalate

  - id: fix_it
    action: script
    description: Apply the fix
    requires_approval: true
    blast_radius: single_service
    script: |
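      # Restart-Service cannot target a remote computer directly, so run it over remoting
      Invoke-Command -ComputerName $ComputerName -ScriptBlock { Restart-Service -Name MyService -Force }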
    verify:
      wait_seconds: 30
      check: |
        (Get-Service -Name MyService -ComputerName $ComputerName).Status -eq 'Running'

  - id: escalate
    action: escalate
    priority: high
    description: Cannot resolve automatically
    message: "Issue on $ComputerName requires manual intervention"
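
The parameters section above implies a validation pass before any step runs: apply defaults, then fail if a required value is still missing. A minimal sketch of that pass, assuming the definitions have already been parsed into hashtables (`Resolve-RunbookParameter` is a hypothetical name, not a module function):

```powershell
# Hypothetical sketch: apply defaults and enforce required parameters,
# mirroring the 'parameters' section of the runbook YAML.
function Resolve-RunbookParameter {
    param(
        [Parameter(Mandatory)] [hashtable[]] $Definitions,
        [hashtable] $Supplied = @{}
    )

    $resolved = @{}
    foreach ($def in $Definitions) {
        $name = $def['name']
        if ($Supplied.ContainsKey($name)) {
            $resolved[$name] = $Supplied[$name]       # caller's value wins
        } elseif ($def.ContainsKey('default')) {
            $resolved[$name] = $def['default']        # fall back to the default
        } elseif ($def['required']) {
            throw "Missing required parameter: $name"
        }
    }
    $resolved
}

# Usage with the two parameters from the example runbook.
$definitions = @(
    @{ 'name' = 'ComputerName'; 'required' = $true }
    @{ 'name' = 'CustomParam'; 'default' = 'some_value' }
)
$params = Resolve-RunbookParameter -Definitions $definitions -Supplied @{ ComputerName = 'SERVER01' }
$params['CustomParam']    # some_value
```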

Step Actions

| Action | Purpose | Key Fields |
| --- | --- | --- |
| script | Run PowerShell code | script, outputs, verify, requires_approval, blast_radius |
| decision | Branch the decision tree | condition, if_true, if_false |
| integration | Call another module | module, function, parameters, outputs |
| notify | Send a notification | message, include_data |
| escalate | Escalate to human | priority, message |

Blast Radius Levels

| Level | Meaning | Approval |
| --- | --- | --- |
| single_service | Affects one service on one server | Optional |
| single_server | Affects the entire server | Recommended |
| multi_server | Affects multiple servers | Required |
| domain_wide | Affects the entire domain | Required |
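
Expressed as code, the levels above amount to a lookup from blast_radius value to approval policy. The sketch below is illustrative; `Get-ApprovalPolicy` is a hypothetical helper, and rejecting unknown levels is an assumption about how a strict engine might behave.

```powershell
# Sketch: look up the approval policy for a blast radius level.
# The function name and the error on unknown levels are assumptions.
function Get-ApprovalPolicy {
    param([Parameter(Mandatory)] [string] $BlastRadius)

    $policy = @{
        single_service = 'Optional'
        single_server  = 'Recommended'
        multi_server   = 'Required'
        domain_wide    = 'Required'
    }

    if (-not $policy.ContainsKey($BlastRadius)) {
        throw "Unknown blast radius level: $BlastRadius"
    }
    $policy[$BlastRadius]
}

# Usage
Get-ApprovalPolicy -BlastRadius 'multi_server'    # Required
```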

AI-Generated Runbooks

The New-Runbook -FromTicketHistory flow:

  1. Queries your ITSM for all tickets related to a CI
  2. Sends ticket summaries to an AI provider (Anthropic, OpenAI, Ollama, or Custom)
  3. The AI analyzes patterns: "80% of tickets for this server are disk space, 15% are service crashes, 5% are performance"
  4. Generates a YAML decision tree that handles the top issues
  5. Saves the runbook for review and use
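
The pattern analysis in step 3 is done by the AI provider, but the underlying arithmetic is simple enough to sketch locally, for example to sanity-check what the model reports. The `Category` property and the `Get-TicketPattern` name are assumptions; real ticket objects depend on the ITSM source.

```powershell
# Hypothetical sketch: share of tickets per category.
# The 'Category' property is an assumption about the ticket objects.
function Get-TicketPattern {
    param([Parameter(Mandatory)] [object[]] $Tickets)

    $total = $Tickets.Count
    $Tickets | Group-Object -Property Category |
        Sort-Object -Property Count -Descending |
        ForEach-Object {
            [pscustomobject]@{
                Category = $_.Name
                Count    = $_.Count
                Percent  = [math]::Round([double](100 * $_.Count / $total), 1)
            }
        }
}

# Usage: 20 sample tickets split 16 / 3 / 1.
$tickets = @(1..16 | ForEach-Object { [pscustomobject]@{ Category = 'Disk Space' } }) +
           @(1..3 | ForEach-Object { [pscustomobject]@{ Category = 'Service Crash' } }) +
           @([pscustomobject]@{ Category = 'Performance' })
$pattern = @(Get-TicketPattern -Tickets $tickets)
```

With this sample data the three categories come out at 80, 15, and 5 percent.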

Supported AI providers:

  • Anthropic (Claude) -- Best for complex decision trees
  • OpenAI (GPT-4o) -- Good general purpose
  • Ollama -- Local/private, no data leaves your network
  • Custom -- Any OpenAI-compatible endpoint

Always review AI-generated runbooks before production use.

Approval Workflows

Steps marked requires_approval: true will pause execution and request approval:

  • Console: Interactive prompt at the terminal
  • Email: Sends an email with runbook details and blast radius assessment
  • Teams: Posts an Adaptive Card to a Teams webhook
  • Slack: Posts a Block Kit message to a Slack webhook

# Console approval (interactive)
Invoke-Runbook -RunbookName 'high-cpu' -ComputerName 'DC01' `
    -RequireApproval -ApprovalMethod Console

# Teams webhook approval
Invoke-Runbook -RunbookName 'service-recovery' -ComputerName 'SQL01' `
    -RequireApproval -ApprovalMethod Teams `
    -ApprovalContact 'https://outlook.office.com/webhook/...'

# Skip all approvals (use with caution)
Invoke-Runbook -RunbookName 'disk-space' -ComputerName 'FILE01' -Force

Health Scores

Get-CIHealthScore aggregates multiple data sources into a single 0-100 score:

| Factor | Max Penalty | Source |
| --- | --- | --- |
| Failed runbook executions | -25 | Execution logs |
| Escalated issues | -15 | Execution logs |
| Recurring failures (low success rate) | -20 | Learning system |
| Recent configuration changes | -10 | Infra-ChangeTracker |
| Open ITSM tickets | -15 | ITSM provider |
| Stopped auto-start services | -20 | Live CIM check |
| Expired certificates | -20 | Certificate-LifecycleMonitor |
| Expiring certificates (within 30 days) | -10 | Certificate-LifecycleMonitor |

Grading: A = 90-100, B = 80-89, C = 70-79, D = 60-69, F = below 60
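
To make the scoring concrete, here is a minimal sketch that subtracts penalties from 100 and maps the result to a grade. The grade bands follow the line above; clamping the score at 0 and the `Get-HealthGrade` name are assumptions, since the README does not say how the engine handles penalties that add up to more than 100.

```powershell
# Hypothetical sketch: subtract penalties from 100 and map to a grade.
# Clamping at 0 is an assumption; the grade bands follow the README.
function Get-HealthGrade {
    param([int[]] $Penalties = @())    # positive numbers, e.g. 25, 10

    $sum   = ($Penalties | Measure-Object -Sum).Sum
    $score = [math]::Max(0, 100 - [int]$sum)

    $grade = if ($score -ge 90) {
        'A'
    } elseif ($score -ge 80) {
        'B'
    } elseif ($score -ge 70) {
        'C'
    } elseif ($score -ge 60) {
        'D'
    } else {
        'F'
    }

    [pscustomobject]@{ Score = $score; Grade = $grade }
}

# Usage: failed executions (-25) plus recent changes (-10).
Get-HealthGrade -Penalties 25, 10    # Score 65, Grade D
```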

Get-CIHealthScore -ComputerName 'DC01' -IncludeRunbookHistory -IncludeChangeHistory -DaysBack 30 `
    -OutputPath 'C:\Reports\dc01-health.html'

Shift Handoff

Get-ShiftHandoff generates a summary of everything that happened during your shift:

  • Runbook executions: what ran, what succeeded, what failed
  • Escalations: what was escalated and why
  • Pending approvals: what is waiting for someone to approve
  • Correlation alerts: patterns detected across multiple runbooks
  • Next-shift notes: what the incoming team needs to know

# Generate and email
Get-ShiftHandoff -HoursBack 8 -OutputPath 'C:\Reports\handoff.html' `
    -SendEmail -SmtpServer 'mail.contoso.com' `
    -EmailTo 'nightshift@contoso.com' -EmailFrom 'runbooks@contoso.com'

Learning Loop

Every time a runbook step executes, the engine records whether it succeeded or failed. Over time, this builds a knowledge base:

  • Success rates per step: "Restarting the Search Indexer resolves high CPU 85% of the time"
  • Failure patterns: "DNS cache flush fails on Server 2012 R2 boxes"
  • Temporal patterns: "This runbook fires every Monday morning after the backup window"

The learning data is stored in $env:USERPROFILE\.runbookengine\learnings.json and can be queried through the execution history.
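
A success rate like the 85% figure above is a grouped count over recorded outcomes. The sketch below shows that calculation on in-memory records; the `StepId` and `Succeeded` properties are assumptions and do not describe the actual schema of learnings.json.

```powershell
# Hypothetical sketch: success rate per step from recorded outcomes.
# The record shape (StepId, Succeeded) is an assumption.
function Get-StepSuccessRate {
    param([Parameter(Mandatory)] [object[]] $Records)

    $Records | Group-Object -Property StepId | ForEach-Object {
        $wins = @($_.Group | Where-Object { $_.Succeeded }).Count
        [pscustomobject]@{
            StepId      = $_.Name
            Runs        = $_.Count
            SuccessRate = [math]::Round([double](100 * $wins / $_.Count), 1)
        }
    }
}

# Usage: 17 successes out of 20 runs of one step.
$records = @(1..17 | ForEach-Object { [pscustomobject]@{ StepId = 'restart-search'; Succeeded = $true } }) +
           @(1..3 | ForEach-Object { [pscustomobject]@{ StepId = 'restart-search'; Succeeded = $false } })
$rates = @(Get-StepSuccessRate -Records $records)
$rates[0].SuccessRate    # 85
```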

Cross-Runbook Correlation

The engine detects when multiple runbooks fire for related systems within a time window. Known patterns:

| Combination | Possible Root Cause |
| --- | --- |
| DNS + Replication | Domain Controller failure or network partition |
| High CPU + Memory Pressure | Resource exhaustion, possible runaway process |
| Disk Space + Backup Failure | Disk exhaustion causing backup failures |
| Service Recovery + High CPU | Service crash-restart loop |
| Certificate Expiry + Service Recovery | Expired certificate causing service failures |
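
The detection can be sketched as a pairwise check over recent executions. The pairs and root causes below come from the table above and the runbook names from the built-in templates; the function name, the `Runbook` / `Time` record shape, and the 30-minute default window are assumptions. For brevity the sketch ignores which systems the runbooks targeted.

```powershell
# Hypothetical sketch: flag known runbook pairs that fired within a window.
# The record shape (Runbook, Time) and the default window are assumptions.
function Find-RunbookCorrelation {
    param(
        [Parameter(Mandatory)] [object[]] $Executions,
        [int] $WindowMinutes = 30
    )

    $known = @(
        @{ Pair = 'dns-resolution', 'replication-failure'; Cause = 'Domain Controller failure or network partition' }
        @{ Pair = 'high-cpu', 'memory-pressure'; Cause = 'Resource exhaustion, possible runaway process' }
        @{ Pair = 'disk-space', 'backup-failure'; Cause = 'Disk exhaustion causing backup failures' }
        @{ Pair = 'service-recovery', 'high-cpu'; Cause = 'Service crash-restart loop' }
        @{ Pair = 'certificate-expiry', 'service-recovery'; Cause = 'Expired certificate causing service failures' }
    )

    foreach ($pattern in $known) {
        $first  = @($Executions | Where-Object { $_.Runbook -eq $pattern.Pair[0] })
        $second = @($Executions | Where-Object { $_.Runbook -eq $pattern.Pair[1] })

        # A pattern matches if any two executions fall inside the window.
        $hit = $false
        foreach ($a in $first) {
            foreach ($b in $second) {
                $gap = [math]::Abs(($a.Time - $b.Time).TotalMinutes)
                if ($gap -le $WindowMinutes) { $hit = $true }
            }
        }

        if ($hit) {
            [pscustomobject]@{ Runbooks = $pattern.Pair -join ' + '; Cause = $pattern.Cause }
        }
    }
}

# Usage: DNS and replication fire 10 minutes apart; backup fails 2 hours after disk space.
$start = Get-Date '2025-01-06T03:00:00'
$executions = @(
    [pscustomobject]@{ Runbook = 'dns-resolution'; Time = $start }
    [pscustomobject]@{ Runbook = 'replication-failure'; Time = $start.AddMinutes(10) }
    [pscustomobject]@{ Runbook = 'disk-space'; Time = $start }
    [pscustomobject]@{ Runbook = 'backup-failure'; Time = $start.AddMinutes(120) }
)
$alerts = @(Find-RunbookCorrelation -Executions $executions)
```

Here only the DNS and replication pair is reported, because the disk space and backup executions are further apart than the window.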

Functions Reference

| Function | Description |
| --- | --- |
| Invoke-Runbook | Execute a YAML runbook as a decision tree with approvals and verification |
| New-Runbook | Create a runbook from template or generate via AI from ticket history |
| Get-RunbookStatus | Query execution history with filtering by status, target, and time |
| Get-CIHealthScore | Calculate aggregate health score (0-100) for configuration items |
| Get-ShiftHandoff | Generate shift handoff report with activity, escalations, and notes |

Requirements

  • PowerShell 5.1 or later
  • Windows operating system (for CIM/WMI and AD cmdlets)
  • Optional: powershell-yaml module (for advanced YAML parsing; the built-in parser handles standard runbooks)
  • Optional: ActiveDirectory module (for blast radius DC detection)
  • Optional: AI provider API key (for AI-generated runbooks)
  • Optional: Portfolio modules for cross-module integration (all integrations degrade gracefully)

Feedback & Contributions

  • Issues and feature requests: GitHub Issues
  • Portfolio site: larro1991.github.io
  • Contributions: Pull requests welcome. Please include Pester tests for new features.

License

MIT License - see LICENSE for details.

Copyright (c) 2025 Larry Roberts
