Agentic CloudOps Lab

Agentic CloudOps Lab is an Azure-hosted AI CloudOps control plane that performs safe Azure subscription reviews using live inventory collection, deterministic analysis, policy/runbook grounding, LLM summarization, response evaluation, and artifact generation.

The current implementation uses Microsoft Foundry Agent Service as the AI-facing supervisor and Azure Functions as the trusted deterministic backend.

This is a lab / proof-of-concept project. It is designed for learning, demos, architecture discussions, and controlled testing. It is not a production-ready CloudOps platform.

Why This Project Exists

Cloud teams often have many subscriptions, resource groups, and services with inconsistent tagging, missing diagnostic settings, avoidable cost waste, weak governance, and configuration drift.

Traditional scripts can detect these issues, but they usually produce raw reports that require manual interpretation. General-purpose AI agents can summarize findings, but they should not directly make infrastructure changes without strong controls.

This lab combines both approaches:

deterministic CloudOps checks for factual analysis
Azure Function as a trusted tool backend
Microsoft Foundry Agent as the AI-facing supervisor
Azure OpenAI for grounded summaries
policy and runbook grounding from Blob Storage
evaluator logic before returning the final response
safe remediation drafts that always require human approval

The goal is to demonstrate a practical enterprise pattern for agentic CloudOps without giving the agent destructive permissions.

Business / CloudOps Problem

This project demonstrates how an AI-assisted CloudOps review can help with:

identifying cost optimization opportunities
detecting missing governance controls
reviewing basic security configuration gaps
checking reliability and operational hygiene
producing readable summaries for engineers and managers
generating remediation drafts without executing them
preserving review artifacts for audit and follow-up

The important design decision is that the agent helps review and explain. It does not automatically delete, deallocate, reconfigure, or remediate Azure resources.

Architecture

High-level flow:

User / IDE / Foundry Agent
        |
        v
Microsoft Foundry Agent
        |
        | OpenAPI tool call
        v
Azure Function: /api/tools/run-cloudops-review
        |
        | Managed Identity
        v
Azure Resource Manager REST API
        |
        v
Live Azure inventory
        |
        v
Deterministic CloudOps analyzers
        |
        +--> Cost / governance checks
        +--> Security checks
        +--> Reliability checks
        +--> Tagging checks
        |
        v
Blob-backed policy/runbook retrieval
        |
        v
Azure OpenAI grounded summary
        |
        v
Evaluator Agent safety/quality validation
        |
        v
Blob artifact storage

Safety Model

This project is intentionally designed as a safe review and planning system.

The agent does not execute remediation.

All generated remediation scripts are drafts. Every action remains human-approved.

Safety rules:

No automatic deletion.
No automatic deallocation.
No automatic networking changes.
No automatic Key Vault purge-protection changes.
No destructive action without explicit human approval.
Deterministic findings are the source of truth.
LLM output is evaluated before being returned.
Policy/runbook grounding is included when available.
Generated remediation remains in PendingApproval.
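
One way to make the draft-only rule hard to bypass is to represent every remediation as an immutable record that starts in PendingApproval. The sketch below is illustrative only: the `RemediationDraft` class, its field names, and the banner text are assumptions, not the lab's actual data model.

```python
from dataclasses import dataclass

PENDING_APPROVAL = "PendingApproval"
# Hypothetical banner; the lab's real draft header may differ.
DRAFT_BANNER = "# DRAFT: requires human approval before use\n"


@dataclass(frozen=True)
class RemediationDraft:
    """A remediation script that has been written but never executed."""
    finding_id: str
    script: str
    status: str = PENDING_APPROVAL
    executed: bool = False


def draft_remediation(finding_id: str, script: str) -> RemediationDraft:
    """Wrap a generated script as a draft. Nothing here runs the script."""
    return RemediationDraft(finding_id=finding_id, script=DRAFT_BANNER + script)


draft = draft_remediation("F-001", "# add the missing 'owner' tag here\n")
print(draft.status, draft.executed)
```

Because the dataclass is frozen, code that tries to flip `executed` or `status` in place raises an error, so an approval step would have to create a new record explicitly.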

Current Status

Area	Status
Azure Function backend	Implemented
ARM inventory collection	Implemented
Deterministic analyzers	Implemented
Cost / governance / security / reliability findings	Implemented
Policy/runbook retrieval	Implemented with lightweight keyword retrieval
Azure OpenAI summary	Implemented
Evaluator Agent	Implemented
Foundry OpenAPI tool endpoint	Implemented
Artifact upload to Blob Storage	Implemented
Automatic remediation	Not implemented by design
Production hardening	Not included

The current retrieval method is intentionally lightweight. It uses keyword retrieval from Blob-backed Markdown documents. A future version could replace this with Azure AI Search hybrid/vector retrieval.
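
As a rough illustration of keyword retrieval over Markdown documents, the sketch below ranks documents by how many query terms they share. The function names, the scoring rule, and the in-memory `documents` dictionary are assumptions made for this example; the lab's retrieval code may score documents differently and reads them from Blob Storage.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: dict[str, str], top_k: int = 3) -> list[str]:
    """Return up to top_k document names ranked by shared-term count."""
    query_terms = tokenize(query)
    scored = []
    for name, body in documents.items():
        overlap = len(query_terms & tokenize(name + " " + body))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


documents = {
    "tagging-policy.md": "All resources must have owner and environment tags.",
    "key-vault-protection-policy.md": "Key Vault must enable purge protection and soft delete.",
}
print(retrieve("key vault purge protection missing", documents))
```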

What the Solution Does

The solution performs a safe CloudOps review of an Azure subscription.

It can:

collect live Azure inventory from Azure Resource Manager using the Azure Function managed identity
analyze resources for CloudOps findings across cost, governance, security, reliability, and tagging
retrieve relevant policy/runbook documents from Blob Storage
generate a grounded LLM summary using Azure OpenAI
validate the final response with an Evaluator Agent
generate safe remediation draft scripts
upload inventory, reports, remediation drafts, and agent responses to Blob Storage
expose a Microsoft Foundry-compatible OpenAPI tool endpoint
keep all remediation actions in PendingApproval
avoid executing remediation automatically
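
To show what a deterministic check can look like, here is a minimal sketch of a tagging analyzer. The required tag names, the finding fields, and the function name are hypothetical; only the `id` and `tags` properties follow the shape of resources returned by Azure Resource Manager.

```python
# Hypothetical tag policy; the lab's tagging-policy.md may require other tags.
REQUIRED_TAGS = ("owner", "environment")


def find_missing_tags(resources, required=REQUIRED_TAGS):
    """Return one finding per resource that lacks any required tag."""
    findings = []
    for resource in resources:
        # ARM may omit "tags" or return null, so fall back to an empty dict.
        present = {key.lower() for key in (resource.get("tags") or {})}
        missing = [tag for tag in required if tag not in present]
        if missing:
            findings.append({
                "category": "tagging",
                "resource_id": resource.get("id", ""),
                "missing_tags": missing,
                "status": "PendingApproval",
            })
    return findings


inventory = [
    {"id": "/subscriptions/xxx/resourceGroups/rg-lab/providers/Microsoft.Storage/storageAccounts/stlab",
     "tags": {"Owner": "cloudops"}},
    {"id": "/subscriptions/xxx/resourceGroups/rg-lab/providers/Microsoft.Web/sites/func-lab",
     "tags": {"owner": "cloudops", "environment": "lab"}},
]
for finding in find_missing_tags(inventory):
    print(finding["resource_id"], finding["missing_tags"])
```

The same input always yields the same findings, which is what lets the LLM summary be checked against them afterwards.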

Main Components

Microsoft Foundry Agent

The Foundry Agent acts as the AI-facing supervisor.

It is configured with:

agent instructions
OpenAPI tool definition
Function App key connection
tool call to run_cloudops_review

The agent decides when to call the CloudOps review tool and presents the result to the user.

Azure Function Control Plane

The Azure Function is the trusted backend.

It performs:

managed identity token acquisition
Azure Resource Manager REST calls
Blob Storage REST calls
live inventory collection
analyzer orchestration
policy/runbook retrieval
Azure OpenAI REST call
Evaluator Agent validation
artifact upload

The deployed Function code path uses REST APIs and the Python standard library for Azure Resource Manager, Blob Storage, and Azure OpenAI calls.
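
The managed identity and ARM steps can be sketched with `urllib` alone. This is not the code in function_app.py: the helper names are made up, and it assumes the standard App Service / Functions managed identity endpoint (`IDENTITY_ENDPOINT` and `IDENTITY_HEADER` environment variables, api-version 2019-08-01) and the ARM "list resources" call with api-version 2021-04-01.

```python
import json
import os
import urllib.parse
import urllib.request

ARM_RESOURCE = "https://management.azure.com/"


def build_token_request(environ=os.environ) -> urllib.request.Request:
    """Build the managed identity token request for the ARM audience."""
    query = urllib.parse.urlencode(
        {"resource": ARM_RESOURCE, "api-version": "2019-08-01"}
    )
    return urllib.request.Request(
        f"{environ['IDENTITY_ENDPOINT']}?{query}",
        headers={"X-IDENTITY-HEADER": environ["IDENTITY_HEADER"]},
    )


def build_list_resources_request(subscription_id: str, access_token: str) -> urllib.request.Request:
    """Build the ARM request that lists resources in one subscription."""
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        "/resources?api-version=2021-04-01"
    )
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


def fetch_json(request: urllib.request.Request, timeout: int = 30) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Inside the Function App, `fetch_json(build_token_request())["access_token"]` would supply the bearer token for the second request. Outside Azure the identity variables are not set, so the token request can only be built there by passing a dictionary explicitly.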

Blob Storage

Blob Storage is used for durable artifacts and lightweight knowledge storage.

Containers used by the lab:

inventory
reports
remediation
agent-responses
rules
snapshots
verification

Rules / Policy Knowledge

The rules container stores Markdown runbooks and policies used for grounding.

Example documents:

diagnostic-settings-policy.md
key-vault-protection-policy.md
app-service-security-policy.md
storage-soft-delete-policy.md
storage-lifecycle-policy.md
tagging-policy.md
remediation-safety-policy.md
cost_optimization.md

Evaluator Agent

The Evaluator Agent validates the response before it is returned.

It checks that:

finding total is included
approval / PendingApproval language is present
the answer does not claim remediation was executed
the answer does not include unsafe destructive commands
retrieved policy/runbook knowledge is referenced when available
major finding types are mentioned

A successful evaluation looks like:

{
  "agent": "EvaluatorAgent",
  "version": "phase4b-deterministic-v1",
  "overall_status": "pass",
  "failed_checks": 0,
  "warning_checks": 0
}
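
A deterministic evaluator of this kind could be sketched as follows. The regular expressions, check names, and the "fail" / "warn" status values are assumptions for illustration; only the "pass" status and the output field names come from the example above.

```python
import re

# Hypothetical patterns; a real evaluator would likely use a longer list.
DESTRUCTIVE_PATTERNS = (
    r"\baz\s[^\n]*\b(delete|purge)\b",
    r"\bRemove-Az\w+",
)
EXECUTION_CLAIM = r"remediation (was|has been) (executed|applied)"


def evaluate_answer(answer: str, total_findings: int, knowledge_sources: list[str]) -> dict:
    """Run rule-based checks on an LLM answer and summarize the result."""
    text = answer.lower()
    checks = {
        "finding_total_included": str(total_findings) in answer,
        "approval_language_present": "approval" in text,
        "no_execution_claim": re.search(EXECUTION_CLAIM, text) is None,
        "no_destructive_commands": not any(
            re.search(pattern, answer, re.IGNORECASE) for pattern in DESTRUCTIVE_PATTERNS
        ),
    }
    warnings = {}
    if knowledge_sources:
        # Grounding is only expected when retrieval returned something.
        warnings["knowledge_referenced"] = any(
            source.lower() in text for source in knowledge_sources
        )
    failed = sum(1 for ok in checks.values() if not ok)
    warned = sum(1 for ok in warnings.values() if not ok)
    status = "fail" if failed else ("warn" if warned else "pass")
    return {
        "agent": "EvaluatorAgent",
        "overall_status": status,
        "failed_checks": failed,
        "warning_checks": warned,
    }


answer = (
    "Found 12 findings. All remediation drafts remain in PendingApproval "
    "and require human approval. See tagging-policy.md."
)
print(evaluate_answer(answer, 12, ["tagging-policy.md"]))
```

Keeping the evaluator rule-based means a failed check can be traced to a specific pattern rather than to another model's judgment.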

Foundry Tool Flow

The Foundry-facing tool endpoint is:

POST /api/tools/run-cloudops-review

It returns a compact response designed for Foundry Agent tool calling.

The response includes:

tool_name
status
remediation_executed
approval_required
inventory
finding_summary
response_evaluation
retrieved_knowledge_sources
answer
artifacts
safety

Example behavior:

Foundry Agent
  -> calls run_cloudops_review
  -> Azure Function collects live inventory
  -> deterministic analyzers run
  -> policy/runbook retrieval runs
  -> Azure OpenAI summary is generated
  -> Evaluator Agent validates result
  -> compact tool response returns to Foundry

Important Endpoints

Base URL:

https://<function-app-name>.azurewebsites.net

Endpoints:

GET  /api/health
GET  /api/inventory/latest
POST /api/inventory/collect
POST /api/agent/chat
POST /api/agent/run-live
POST /api/tools/run-cloudops-review
GET  /api/artifacts/list/{container}
GET  /api/diagnostics/imports

The primary endpoint for Foundry integration is:

POST /api/tools/run-cloudops-review
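
For callers outside Foundry, the request could be built with the standard library as in the sketch below. It assumes the endpoint is protected by a Function key sent in the `x-functions-key` header and that an empty JSON object is an acceptable body; neither detail is confirmed here, so check the OpenAPI definition in the foundry folder for the actual contract.

```python
import json
import urllib.request


def build_review_request(function_app_name: str, function_key: str, payload=None) -> urllib.request.Request:
    """Build a POST request for the run-cloudops-review tool endpoint."""
    url = f"https://{function_app_name}.azurewebsites.net/api/tools/run-cloudops-review"
    # Assumption: the endpoint accepts a JSON body; {} is used when none is given.
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-functions-key": function_key,
        },
    )


request = build_review_request("my-function-app", "<function-key>")
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` sends the call. Read the key from an environment variable or secret store rather than hard-coding it.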

Repository Structure

agentic-cloudops-lab/
  backend/
    analyzers and review logic

  cloud-control-plane/
    function_app.py
    host.json

  foundry/
    agentic-cloudops-foundry-tools.openapi.json
    cloudops_supervisor_agent_instructions.md
    PHASE5_FOUNDRY_SETUP.md

  infra/
    Terraform infrastructure

  rules/
    local source copies of policy/runbook documents

  tools/
    deployment and test scripts

  docs/
    images/
    sample-output/

  reports/
    generated local report examples, if used

If your local structure differs, treat this as the intended clean structure and adjust folder names before publishing.

Quick Start

The commands below show the expected flow. Adjust variable names and script names to match your local implementation.

1. Clone the repository

git clone https://github.com/net9876/agentic-cloudops-lab.git
cd agentic-cloudops-lab

2. Configure deployment variables

cd infra
cp example.tfvars dev.tfvars

Edit dev.tfvars with your Azure subscription, region, naming prefix, and required AI settings.

3. Deploy infrastructure

terraform init
terraform plan -var-file="dev.tfvars"
terraform apply -var-file="dev.tfvars"

4. Deploy the Azure Function

Example PowerShell flow:

cd ..\tools
.\deploy-function.ps1

5. Test the health endpoint

curl https://<function-app-name>.azurewebsites.net/api/health

6. Run a CloudOps review

curl -X POST https://<function-app-name>.azurewebsites.net/api/tools/run-cloudops-review

Sample Output

A compact review response should look similar to this:

{
  "tool_name": "run_cloudops_review",
  "status": "completed",
  "remediation_executed": false,
  "approval_required": true,
  "inventory": {
    "resource_count": 42,
    "subscription_scope": "lab-subscription"
  },
  "finding_summary": {
    "total_findings": 12,
    "security": 4,
    "cost": 3,
    "reliability": 2,
    "tagging": 3
  },
  "response_evaluation": {
    "overall_status": "pass",
    "failed_checks": 0,
    "warning_checks": 0
  },
  "safety": {
    "mode": "draft_only",
    "pending_approval": true
  }
}
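
A caller can verify the safety fields before trusting a response. The sketch below checks only fields that appear in the sample above; the function name and the decision to treat a missing field as a problem are assumptions.

```python
def check_safety_invariants(response: dict) -> list[str]:
    """Return a list of problems; an empty list means the response looks safe."""
    problems = []
    if response.get("remediation_executed") is not False:
        problems.append("remediation_executed must be false")
    if response.get("approval_required") is not True:
        problems.append("approval_required must be true")
    if response.get("safety", {}).get("pending_approval") is not True:
        problems.append("safety.pending_approval must be true")
    if response.get("response_evaluation", {}).get("overall_status") != "pass":
        problems.append("response_evaluation.overall_status must be pass")
    return problems


sample = {
    "tool_name": "run_cloudops_review",
    "status": "completed",
    "remediation_executed": False,
    "approval_required": True,
    "response_evaluation": {"overall_status": "pass", "failed_checks": 0, "warning_checks": 0},
    "safety": {"mode": "draft_only", "pending_approval": True},
}
print(check_safety_invariants(sample))
```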

Security Notes

This lab follows a conservative security model:

use managed identity where possible
do not store Azure credentials in code
do not commit .env, .tfvars, secrets, tokens, Function keys, or local state files
generated remediation scripts are drafts only
remediation is not executed automatically
restrict Function App access for real environments
use least-privilege RBAC for managed identities
do not test against production subscriptions without review
rotate any credentials accidentally exposed during testing

Recommended .gitignore coverage:

.terraform/
*.tfstate
*.tfstate.*
*.tfvars
.env
.env.*
local.settings.json
__pycache__/
.venv/
*.zip

Cost Notes

This lab may create billable Azure resources:

Azure Function App
Storage Account and Blob transactions
Application Insights / Log Analytics, if enabled
Azure OpenAI / Foundry usage
supporting networking or monitoring resources, depending on your Terraform configuration

Destroy lab resources after testing.

Cleanup

If the lab was deployed with Terraform:

cd infra
terraform destroy -var-file="dev.tfvars"

Before destroying, confirm that the resource group contains only lab resources.

If artifacts were created outside Terraform, remove them manually from these Blob Storage containers:

inventory
reports
remediation
agent-responses
rules
snapshots
verification

Roadmap

Potential improvements:

replace keyword retrieval with Azure AI Search hybrid/vector retrieval
add GitHub Actions validation for Terraform and Python
add sample screenshots and saved review artifacts
add a local mock mode for users without Azure OpenAI quota
add reusable analyzer plugins
add richer cost optimization checks
add optional dashboard for review history
add end-to-end demo video or GIF
add Azure Policy integration
add stricter RBAC examples

What Is Intentionally Not Included

This project intentionally does not include:

automatic remediation execution
production-grade approval workflow
production network isolation design
enterprise identity lifecycle management
multi-tenant SaaS controls
full SIEM/SOAR integration
guaranteed cost savings
support for every Azure resource type

Those areas are possible future extensions, but they are intentionally outside the current lab scope.

Suggested GitHub Repository Metadata

Recommended repository description:

Azure-hosted Agentic CloudOps lab using Foundry Agent, Azure Functions, deterministic analyzers, Azure OpenAI, and safe remediation drafts.

License

Add or update a LICENSE file before treating this as a reusable public open-source project.

MIT is usually a practical default for this type of lab, but choose the license that fits your intent.
