net9876/agentic-cloudops-lab

Agentic CloudOps Lab

Agentic CloudOps Lab is an Azure-hosted AI CloudOps control plane that performs safe Azure subscription reviews using live inventory collection, deterministic analysis, policy/runbook grounding, LLM summarization, response evaluation, and artifact generation.

The current implementation uses Microsoft Foundry Agent Service as the AI-facing supervisor and Azure Functions as the trusted deterministic backend.

This is a lab / proof-of-concept project. It is designed for learning, demos, architecture discussions, and controlled testing. It is not a production-ready CloudOps platform.


Why This Project Exists

Cloud teams often have many subscriptions, resource groups, and services with inconsistent tagging, missing diagnostic settings, avoidable cost waste, weak governance, and configuration drift.

Traditional scripts can detect these issues, but they usually produce raw reports that require manual interpretation. General-purpose AI agents can summarize findings, but they should not directly make infrastructure changes without strong controls.

This lab combines both approaches:

  • deterministic CloudOps checks for factual analysis
  • Azure Function as a trusted tool backend
  • Microsoft Foundry Agent as the AI-facing supervisor
  • Azure OpenAI for grounded summaries
  • policy and runbook grounding from Blob Storage
  • evaluator logic before returning the final response
  • safe remediation drafts that always require human approval

The goal is to demonstrate a practical enterprise pattern for agentic CloudOps without giving the agent destructive permissions.


Business / CloudOps Problem

This project demonstrates how an AI-assisted CloudOps review can help with:

  • identifying cost optimization opportunities
  • detecting missing governance controls
  • reviewing basic security configuration gaps
  • checking reliability and operational hygiene
  • producing readable summaries for engineers and managers
  • generating remediation drafts without executing them
  • preserving review artifacts for audit and follow-up

The important design decision is that the agent helps review and explain. It does not automatically delete, deallocate, reconfigure, or remediate Azure resources.


Architecture

[Figure: Agentic CloudOps Architecture]

High-level flow:

User / IDE / Foundry Agent
        |
        v
Microsoft Foundry Agent
        |
        | OpenAPI tool call
        v
Azure Function: /api/tools/run-cloudops-review
        |
        | Managed Identity
        v
Azure Resource Manager REST API
        |
        v
Live Azure inventory
        |
        v
Deterministic CloudOps analyzers
        |
        +--> Cost / governance checks
        +--> Security checks
        +--> Reliability checks
        +--> Tagging checks
        |
        v
Blob-backed policy/runbook retrieval
        |
        v
Azure OpenAI grounded summary
        |
        v
Evaluator Agent safety/quality validation
        |
        v
Blob artifact storage

Safety Model

[Figure: Safety Flow]

This project is intentionally designed as a safe review and planning system.

The agent does not execute remediation.

All generated remediation scripts are drafts. Every action remains human-approved.

Safety rules:

  • No automatic deletion.
  • No automatic deallocation.
  • No automatic networking changes.
  • No automatic Key Vault purge-protection changes.
  • No destructive action without explicit human approval.
  • Deterministic findings are the source of truth.
  • LLM output is evaluated before being returned.
  • Policy/runbook grounding is included when available.
  • Generated remediation remains in PendingApproval.
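
The draft-only rule can be illustrated with a small data structure. This is a minimal sketch, not the lab's actual implementation: the RemediationDraft class, its fields, and the commented-out rendering are hypothetical. Only the PendingApproval status name comes from this README.

```python
from dataclasses import dataclass, field
from typing import List

# "PendingApproval" is the status named in this README; everything else is illustrative.
PENDING_APPROVAL = "PendingApproval"


@dataclass(frozen=True)
class RemediationDraft:
    """A proposed fix that is stored for review and never executed by this code."""

    finding_id: str
    description: str
    commands: List[str] = field(default_factory=list)
    status: str = PENDING_APPROVAL

    def render(self) -> str:
        """Render the draft as a script in which every command is commented out."""
        lines = [
            f"# Draft remediation for {self.finding_id}",
            f"# Status: {self.status} (requires explicit human approval)",
            f"# {self.description}",
        ]
        lines += [f"# {command}" for command in self.commands]
        return "\n".join(lines)


draft = RemediationDraft(
    finding_id="finding-001",
    description="Placeholder description of the proposed change",
    commands=["<command to be reviewed by a human>"],
)
print(draft.render())
```

Rendering every command as a comment means a draft that is run by accident does nothing; a reviewer has to uncomment the lines deliberately.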

Current Status

Area                                                  Status
Azure Function backend                                Implemented
ARM inventory collection                              Implemented
Deterministic analyzers                               Implemented
Cost / governance / security / reliability findings   Implemented
Policy/runbook retrieval                              Implemented with lightweight keyword retrieval
Azure OpenAI summary                                  Implemented
Evaluator Agent                                       Implemented
Foundry OpenAPI tool endpoint                         Implemented
Artifact upload to Blob Storage                       Implemented
Automatic remediation                                 Not implemented by design
Production hardening                                  Not included

The current retrieval method is intentionally lightweight. It uses keyword retrieval from Blob-backed Markdown documents. A future version could replace this with Azure AI Search hybrid/vector retrieval.
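
Keyword retrieval over Markdown documents can be as simple as counting shared terms. The sketch below is one possible approach, not necessarily how the lab scores documents: the tokenize and retrieve functions and the term-overlap scoring are assumptions, and the documents are passed in as a plain dictionary rather than read from Blob Storage.

```python
import re
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: Dict[str, str], top_k: int = 3) -> List[Tuple[str, int]]:
    """Return up to top_k (document name, score) pairs, best match first."""
    query_terms = tokenize(query)
    scored = []
    for name, body in documents.items():
        # Score = number of distinct query terms found in the file name or body.
        score = len(query_terms & tokenize(name + " " + body))
        if score > 0:
            scored.append((name, score))
    # Sort by descending score, then by name so ties are deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


docs = {
    "tagging-policy.md": "All resources must carry owner and environment tags.",
    "storage-soft-delete-policy.md": "Enable blob soft delete on storage accounts.",
}
print(retrieve("missing owner tags", docs))
```

Deterministic tie-breaking keeps the set of grounding documents stable between runs, which makes review artifacts easier to compare.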


What the Solution Does

The solution performs a safe CloudOps review of an Azure subscription.

It can:

  • collect live Azure inventory from Azure Resource Manager using the Azure Function managed identity
  • analyze resources for CloudOps findings across cost, governance, security, reliability, and tagging
  • retrieve relevant policy/runbook documents from Blob Storage
  • generate a grounded LLM summary using Azure OpenAI
  • validate the final response with an Evaluator Agent
  • generate safe remediation draft scripts
  • upload inventory, reports, remediation drafts, and agent responses to Blob Storage
  • expose a Microsoft Foundry-compatible OpenAPI tool endpoint
  • keep all remediation actions in PendingApproval
  • avoid executing remediation automatically
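
As an illustration of a deterministic check, a tagging analyzer might look like the sketch below. The required tag names, the finding fields, and the find_missing_tags function are hypothetical. The only assumption about the inventory is that each resource is a dictionary with an optional "tags" mapping, as in Azure Resource Manager list responses.

```python
from typing import Iterable, List

# Hypothetical required tags; the real policy lives in tagging-policy.md.
REQUIRED_TAGS = ("owner", "environment")


def find_missing_tags(resources: Iterable[dict], required=REQUIRED_TAGS) -> List[dict]:
    """Return one finding per resource that lacks any required tag."""
    findings = []
    for resource in resources:
        # ARM may omit "tags" or return null, so fall back to an empty mapping.
        present = {key.lower() for key in (resource.get("tags") or {})}
        missing = [tag for tag in required if tag not in present]
        if missing:
            findings.append(
                {
                    "category": "tagging",
                    "resource_id": resource.get("id", ""),
                    "missing_tags": missing,
                    "status": "PendingApproval",
                }
            )
    return findings


inventory = [
    {"id": "/subscriptions/<sub>/resourceGroups/rg1/providers/x/y/a", "tags": {"Owner": "team-a"}},
    {"id": "/subscriptions/<sub>/resourceGroups/rg1/providers/x/y/b", "tags": None},
]
for finding in find_missing_tags(inventory):
    print(finding["resource_id"], finding["missing_tags"])
```

Because the check is a pure function of the inventory, the same input always yields the same findings, which is what lets the deterministic output serve as the source of truth for the LLM summary.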

Main Components

Microsoft Foundry Agent

The Foundry Agent acts as the AI-facing supervisor.

It is configured with:

  • agent instructions
  • OpenAPI tool definition
  • Function App key connection
  • tool call to run_cloudops_review

The agent decides when to call the CloudOps review tool and presents the result to the user.

Azure Function Control Plane

The Azure Function is the trusted backend.

It performs:

  • managed identity token acquisition
  • Azure Resource Manager REST calls
  • Blob Storage REST calls
  • live inventory collection
  • analyzer orchestration
  • policy/runbook retrieval
  • Azure OpenAI REST call
  • Evaluator Agent validation
  • artifact upload

The deployed Function path uses REST APIs and the Python standard library for Azure ARM, Blob Storage, and Azure OpenAI calls.
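
A standard-library-only call path could be sketched as follows. The function names are hypothetical and the code only builds the requests without sending them. It assumes the App Service / Azure Functions managed identity endpoint (the IDENTITY_ENDPOINT and IDENTITY_HEADER environment variables with api-version 2019-08-01) and the ARM resource list API with api-version 2021-04-01; the lab's actual code may use different versions or scopes.

```python
import os
import urllib.parse
import urllib.request

ARM_RESOURCE = "https://management.azure.com/"


def build_token_request(environ=os.environ) -> urllib.request.Request:
    """Build the managed identity token request for the ARM audience."""
    query = urllib.parse.urlencode(
        {"resource": ARM_RESOURCE, "api-version": "2019-08-01"}
    )
    return urllib.request.Request(
        f"{environ['IDENTITY_ENDPOINT']}?{query}",
        headers={"X-IDENTITY-HEADER": environ["IDENTITY_HEADER"]},
    )


def build_resource_list_request(subscription_id: str, access_token: str) -> urllib.request.Request:
    """Build the ARM request that lists resources in a subscription."""
    query = urllib.parse.urlencode({"api-version": "2021-04-01"})
    url = (
        "https://management.azure.com/subscriptions/"
        f"{subscription_id}/resources?{query}"
    )
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
```

Inside the Function App, each request would be sent with urllib.request.urlopen and the JSON body parsed with json.load; the token response carries the bearer token in its access_token field. Keeping request construction separate from sending makes the logic testable without Azure access.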

Blob Storage

Blob Storage is used for durable artifacts and lightweight knowledge storage.

Containers used by the lab:

inventory
reports
remediation
agent-responses
rules
snapshots
verification

Rules / Policy Knowledge

The rules container stores Markdown runbooks and policies used for grounding.

Example documents:

diagnostic-settings-policy.md
key-vault-protection-policy.md
app-service-security-policy.md
storage-soft-delete-policy.md
storage-lifecycle-policy.md
tagging-policy.md
remediation-safety-policy.md
cost_optimization.md

Evaluator Agent

The Evaluator Agent validates the response before it is returned.

It checks that:

  • finding total is included
  • approval / PendingApproval language is present
  • the answer does not claim remediation was executed
  • the answer does not include unsafe destructive commands
  • retrieved policy/runbook knowledge is referenced when available
  • major finding types are mentioned

A successful evaluation looks like:

{
  "agent": "EvaluatorAgent",
  "version": "phase4b-deterministic-v1",
  "overall_status": "pass",
  "failed_checks": 0,
  "warning_checks": 0
}
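
The checks listed above could be approximated with plain string and regular-expression tests. The sketch below is illustrative only: the patterns, the check names, and the evaluate function are assumptions, and the real Evaluator Agent may apply different or stricter rules.

```python
import re
from typing import Dict, List

# Hypothetical patterns; a real evaluator would likely maintain a longer list.
DESTRUCTIVE_PATTERNS = [
    r"\baz\b[^\n]*\bdelete\b",
    r"\bRemove-Az\w+",
    r"\bterraform\s+destroy\b",
]
EXECUTION_CLAIMS = [r"\bremediation (was|has been) (executed|applied)\b"]


def evaluate(answer: str, total_findings: int, knowledge_sources: List[str]) -> Dict[str, object]:
    """Run deterministic safety and quality checks on an LLM answer."""
    text = answer.lower()
    checks = {
        "finding_total_included": str(total_findings) in answer,
        "approval_language_present": "approval" in text,
        "no_execution_claim": not any(re.search(p, text) for p in EXECUTION_CLAIMS),
        "no_destructive_commands": not any(
            re.search(p, answer, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS
        ),
        # Only required when retrieval actually returned sources.
        "knowledge_referenced": not knowledge_sources
        or any(source.lower() in text for source in knowledge_sources),
    }
    failed = [name for name, passed in checks.items() if not passed]
    return {
        "agent": "EvaluatorAgent",
        "overall_status": "pass" if not failed else "fail",
        "failed_checks": len(failed),
        "failed_check_names": failed,
    }


good = "Found 12 findings. All drafts remain in PendingApproval. See tagging-policy.md."
print(evaluate(good, 12, ["tagging-policy.md"]))
```

Because the evaluator is deterministic, a failing answer can be rejected or regenerated without a second LLM call deciding whether the first one was safe.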

Foundry Tool Flow

[Figure: Tool Call Flow]

The Foundry-facing tool endpoint is:

POST /api/tools/run-cloudops-review

It returns a compact response designed for Foundry Agent tool calling.

Response includes:

tool_name
status
remediation_executed
approval_required
inventory
finding_summary
response_evaluation
retrieved_knowledge_sources
answer
artifacts
safety

Example behavior:

Foundry Agent
  -> calls run_cloudops_review
  -> Azure Function collects live inventory
  -> deterministic analyzers run
  -> policy/runbook retrieval runs
  -> Azure OpenAI summary is generated
  -> Evaluator Agent validates result
  -> compact tool response returns to Foundry

Important Endpoints

Base URL:

https://<function-app-name>.azurewebsites.net/api

Endpoints:

GET  /api/health
GET  /api/inventory/latest
POST /api/inventory/collect
POST /api/agent/chat
POST /api/agent/run-live
POST /api/tools/run-cloudops-review
GET  /api/artifacts/list/{container}
GET  /api/diagnostics/imports

The primary endpoint for Foundry integration is:

POST /api/tools/run-cloudops-review
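
For callers outside Foundry, the request could be built in Python as sketched below. This assumes the endpoint is protected by a Function key sent in the x-functions-key header and that an empty JSON body is accepted; neither is confirmed by this README, and build_review_request is a hypothetical helper.

```python
import urllib.request


def build_review_request(function_app_name: str, function_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the review tool endpoint."""
    url = (
        f"https://{function_app_name}.azurewebsites.net"
        "/api/tools/run-cloudops-review"
    )
    return urllib.request.Request(
        url,
        data=b"{}",  # assumption: the endpoint accepts an empty JSON object
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-functions-key": function_key,  # assumption: key-based auth
        },
    )
```

Sending the request is a call to urllib.request.urlopen with a generous timeout, since a live review collects inventory and waits for an LLM summary. Read the key from an environment variable or a secret store rather than hard-coding it.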

Repository Structure

agentic-cloudops-lab/
  backend/
    analyzers and review logic

  cloud-control-plane/
    function_app.py
    host.json

  foundry/
    agentic-cloudops-foundry-tools.openapi.json
    cloudops_supervisor_agent_instructions.md
    PHASE5_FOUNDRY_SETUP.md

  infra/
    Terraform infrastructure

  rules/
    local source copies of policy/runbook documents

  tools/
    deployment and test scripts

  docs/
    images/
    sample-output/

  reports/
    generated local report examples, if used

If your local structure differs, treat this as the intended clean structure and adjust folder names before publishing.


Quick Start

The commands below show the expected flow. Adjust variable names and script names to match your local implementation.

1. Clone the repository

git clone https://github.com/net9876/agentic-cloudops-lab.git
cd agentic-cloudops-lab

2. Configure deployment variables

cd infra
cp example.tfvars dev.tfvars

Edit dev.tfvars with your Azure subscription, region, naming prefix, and required AI settings.

3. Deploy infrastructure

terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"

4. Deploy the Azure Function

Example PowerShell flow:

cd ..\tools
.\deploy-function.ps1

5. Test the health endpoint

curl https://<function-app-name>.azurewebsites.net/api/health

6. Run a CloudOps review

curl -X POST https://<function-app-name>.azurewebsites.net/api/tools/run-cloudops-review

Sample Output

A compact review response should look similar to this:

{
  "tool_name": "run_cloudops_review",
  "status": "completed",
  "remediation_executed": false,
  "approval_required": true,
  "inventory": {
    "resource_count": 42,
    "subscription_scope": "lab-subscription"
  },
  "finding_summary": {
    "total_findings": 12,
    "security": 4,
    "cost": 3,
    "reliability": 2,
    "tagging": 3
  },
  "response_evaluation": {
    "overall_status": "pass",
    "failed_checks": 0,
    "warning_checks": 0
  },
  "safety": {
    "mode": "draft_only",
    "pending_approval": true
  }
}

See also:

docs/sample-output/cloudops-review-response.json
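
A caller can verify the safety invariants of a response before trusting it. The sketch below uses the field names from the sample above; the check_review_response function is hypothetical, and the rule that per-category counts add up to total_findings is an assumption that happens to hold for the sample.

```python
from typing import List


def check_review_response(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the invariants hold."""
    problems = []
    if payload.get("remediation_executed") is not False:
        problems.append("remediation_executed must be false")
    if payload.get("approval_required") is not True:
        problems.append("approval_required must be true")

    # Assumption: category counts should sum to total_findings.
    summary = dict(payload.get("finding_summary", {}))
    total = summary.pop("total_findings", None)
    if total is not None and sum(summary.values()) != total:
        problems.append("category counts do not match total_findings")

    evaluation = payload.get("response_evaluation", {})
    if evaluation.get("overall_status") != "pass":
        problems.append("response evaluation did not pass")
    return problems


sample = {
    "remediation_executed": False,
    "approval_required": True,
    "finding_summary": {
        "total_findings": 12, "security": 4, "cost": 3, "reliability": 2, "tagging": 3,
    },
    "response_evaluation": {"overall_status": "pass", "failed_checks": 0},
}
print(check_review_response(sample))
```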

Security Notes

This lab follows a conservative security model:

  • use managed identity where possible
  • do not store Azure credentials in code
  • do not commit .env, .tfvars, secrets, tokens, Function keys, or local state files
  • generated remediation scripts are drafts only
  • remediation is not executed automatically
  • restrict Function App access for real environments
  • use least-privilege RBAC for managed identities
  • do not test against production subscriptions without review
  • rotate any credentials accidentally exposed during testing

Recommended .gitignore coverage:

.terraform/
*.tfstate
*.tfstate.*
*.tfvars
.env
.env.*
local.settings.json
__pycache__/
.venv/
*.zip

Cost Notes

This lab may create billable Azure resources:

  • Azure Function App
  • Storage Account and Blob transactions
  • Application Insights / Log Analytics, if enabled
  • Azure OpenAI / Foundry usage
  • supporting networking or monitoring resources, depending on your Terraform configuration

Destroy lab resources after testing.


Cleanup

If the lab was deployed with Terraform:

cd infra
terraform destroy -var-file="dev.tfvars"

Before destroying, confirm that the resource group contains only lab resources.

If artifacts were created outside Terraform, remove them manually from:

inventory
reports
remediation
agent-responses
rules
snapshots
verification

Roadmap

Potential improvements:

  • replace keyword retrieval with Azure AI Search hybrid/vector retrieval
  • add GitHub Actions validation for Terraform and Python
  • add sample screenshots and saved review artifacts
  • add a local mock mode for users without Azure OpenAI quota
  • add reusable analyzer plugins
  • add richer cost optimization checks
  • add optional dashboard for review history
  • add end-to-end demo video or GIF
  • add Azure Policy integration
  • add stricter RBAC examples

What Is Intentionally Not Included

This project intentionally does not include:

  • automatic remediation execution
  • production-grade approval workflow
  • production network isolation design
  • enterprise identity lifecycle management
  • multi-tenant SaaS controls
  • full SIEM/SOAR integration
  • guaranteed cost savings
  • support for every Azure resource type

Those areas are possible future extensions, but they are intentionally outside the current lab scope.


Suggested GitHub Repository Metadata

Recommended repository description:

Azure-hosted Agentic CloudOps lab using Foundry Agent, Azure Functions, deterministic analyzers, Azure OpenAI, and safe remediation drafts.

Recommended topics:

azure
cloudops
azure-functions
azure-openai
microsoft-foundry
agentic-ai
devops
terraform
finops
governance
sre
llmops

License

Add or update a LICENSE file before treating this as a reusable public open-source project.

MIT is usually a practical default for this type of lab, but choose the license that fits your intent.

About

AI-powered CloudOps review lab using Microsoft Foundry Agent, Azure Functions, Managed Identity, Azure OpenAI, Blob-backed policy grounding, deterministic analyzers, and evaluator guardrails.
