feat(iot-ops): bump AIO component versions and harden schema-registry RBAC #471

Open
bindsi wants to merge 4 commits into main from feature/aio-2604

@bindsi (Member) commented May 5, 2026

feat(iot-ops): bump AIO component versions and harden schema-registry RBAC

IMPORTANT: Before submitting, please remove all sensitive data, secrets, tokens, or confidential information. Ensure you've redacted any NDA-covered information, IP addresses, resource names, or security-related details that shouldn't be publicly disclosed.

Description

Brings the Azure IoT Operations stack to the versions supported by az iot ops 2.4.0, restores first-class blueprint parameters that had been temporarily downgraded to var workarounds, fixes a 403 AuthorizationPermissionMismatch during schema upload, and unblocks the K3s VM bootstrap on azure-cli 2.67+.

Specifically:

  • Bumps cert-manager 0.10.2 → 0.11.0, secret-sync-controller 1.3.0 → 1.4.0, and the iotoperations extension 1.3.38 → 1.3.70 in both Bicep and Terraform defaults.
  • Re-enables trustIssuerSettings, shouldDeployAioDeploymentScripts, shouldEnableOtelCollector, and shouldEnableOpcUaSimulator as real params in blueprints/full-single-node-cluster/bicep/main.bicep (DeploymentScripts now supports az CLI 2.71+).
  • Sets a deterministic OPC UA securityPki.applicationUri (per-cluster URN) on the AIO instance configuration in both Bicep and Terraform.
  • Adds a Storage Blob Data Contributor role assignment scoped to the schemas container in the schema-registry Terraform module, with a configurable blob_data_contributor_principal_id and a 30s RBAC propagation wait. Removes the implicit dependency on the data-lake module's broader role grant.
  • Switches az login --identity --client-id to az login --identity --username in the K3s bootstrap scripts (the --client-id flag was removed in azure-cli 2.67).
  • Adds docs/getting-started/upgrade-aio.md and cross-links it from the general-user getting-started guide.

Related Issue

Fixes #473

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Blueprint modification or addition
  • Component modification or addition
  • Documentation update
  • CI/CD pipeline change
  • Other (please describe):

Note: shouldEnableOtelCollector now defaults to true (previously hard-coded false). Callers that relied on OTel being off must explicitly set this to false.

Implementation Details

Component version bumps

| Component              | File                                                      | From   | To     |
|------------------------|-----------------------------------------------------------|--------|--------|
| cert-manager           | src/100-edge/109-arc-extensions/bicep/types.bicep         | 0.10.2 | 0.11.0 |
| cert-manager           | src/100-edge/109-arc-extensions/terraform/variables.tf    | 0.10.2 | 0.11.0 |
| secret-sync-controller | src/100-edge/110-iot-ops/bicep/types.bicep                | 1.3.0  | 1.4.0  |
| secret-sync-controller | src/100-edge/110-iot-ops/terraform/variables.init.tf      | 1.3.0  | 1.4.0  |
| iotoperations          | src/100-edge/110-iot-ops/bicep/types.bicep                | 1.3.38 | 1.3.70 |
| iotoperations          | src/100-edge/110-iot-ops/terraform/variables.instance.tf  | 1.3.38 | 1.3.70 |

Blueprint parameter restoration

blueprints/full-single-node-cluster/bicep/main.bicep reverts the temporary var workarounds that were introduced when DeploymentScripts was pinned to az CLI < 2.71. The four affected parameters are now first-class params with default values, and iotOpsTypes is imported to provide the TrustIssuerConfig type.

Schema-registry RBAC

src/000-cloud/030-data/terraform/modules/schema-registry/main.tf adds:

  • data "azurerm_client_config" "current" to resolve the deploying principal.
  • azurerm_role_assignment.schema_container_blob_data_contributor granting Storage Blob Data Contributor on the schemas container, with principal_id = coalesce(var.blob_data_contributor_principal_id, data.azurerm_client_config.current.object_id).
  • time_sleep.wait_for_rbac_propagation (30s) extended to depend on the new role assignment.

variables.core.tf adds the optional blob_data_contributor_principal_id (string, default null).

This removes the implicit reliance on the data-lake module granting Storage Blob Data Owner at the storage-account scope and keeps the schema-registry module self-contained.
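
The `principal_id` fallback relies on Terraform's `coalesce()`, which returns the first argument that is neither null nor an empty string. The selection logic can be sketched in Python as follows. This is an illustration only: the function names and the example IDs are hypothetical and are not part of the module.

```python
from typing import Optional


def coalesce(*values: Optional[str]) -> str:
    """Mimic Terraform's coalesce(): first argument that is neither None nor ""."""
    for value in values:
        if value is not None and value != "":
            return value
    raise ValueError("coalesce: no non-null, non-empty argument")


def resolve_principal_id(override: Optional[str], deployer_object_id: str) -> str:
    """Pick the principal that receives the container-scoped role.

    `override` stands in for var.blob_data_contributor_principal_id and
    `deployer_object_id` for data.azurerm_client_config.current.object_id.
    """
    return coalesce(override, deployer_object_id)


# Hypothetical IDs, for illustration only.
print(resolve_principal_id(None, "deployer-object-id"))        # default path
print(resolve_principal_id("sp-object-id", "deployer-object-id"))  # override path
```

With the variable left at its `null` default, the deploying principal gets the role; setting it redirects the grant without touching the rest of the module.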

MQTT broker securityPki.applicationUri

Both iot-ops-instance modules now emit a per-cluster securityPki.applicationUri of the form urn:microsoft.com:aio:opc:ua:broker:<5-char-cluster-hash>, ensuring the OPC UA broker advertises a unique application URI:

  • Bicep: take(uniqueString(arcConnectedCluster.id), 5)
  • Terraform: substr(sha256(var.arc_connected_cluster_id), 0, 5)
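
For reference, the Terraform expression can be mirrored in Python roughly as below. This sketch covers only the Terraform side: Bicep's `uniqueString()` uses a different hash, so it is not reproduced here. The cluster ID is a made-up placeholder and `application_uri` is a hypothetical helper name.

```python
import hashlib

# Hypothetical Arc connected cluster resource ID, for illustration only.
EXAMPLE_CLUSTER_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-example"
    "/providers/Microsoft.Kubernetes/connectedClusters/example-cluster"
)

URN_PREFIX = "urn:microsoft.com:aio:opc:ua:broker:"


def application_uri(cluster_id: str) -> str:
    """Equivalent of substr(sha256(cluster_id), 0, 5) appended to the URN prefix."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).hexdigest()
    return URN_PREFIX + digest[:5]


print(application_uri(EXAMPLE_CLUSTER_ID))
```

The suffix is stable for a given cluster ID, so repeated applies do not change the advertised URI.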

K3s VM bootstrap (az login syntax)

src/100-edge/100-cncf-cluster/scripts/deploy-script-secrets.sh and k3s-device-setup.sh switch from az login --identity --client-id "$CLIENT_ID" to az login --identity --username "$CLIENT_ID" (with --allow-no-subscriptions on the latter). The --client-id flag was removed in azure-cli 2.67 and broke the linux-cluster-server-setup Arc extension on fresh runners.

Documentation

  • docs/getting-started/upgrade-aio.md is new and walks through az iot ops upgrade, Terraform refresh-only reconciliation (terraform apply -refresh-only plus a plan grep), and Bicep stateless reapply, with a troubleshooting section.
  • docs/getting-started/general-user.md cross-links the new guide from "Next Steps" and "Additional Resources". Docusaurus picks the new file up automatically via the autogenerated sidebar.

Testing Performed

  • Terraform plan/apply — full apply against the test subscription succeeded after the RBAC fix (previously failed with 403 on schema upload).
  • Blueprint deployment test
  • Unit tests
  • Integration tests
  • Bug fix includes regression test (see Test Policy)
  • Manual validation — az iot ops upgrade verified on the live cluster against the new component versions; K3s VM bootstrap script confirmed working with azure-cli 2.67+.
  • Other: npm run tf-validate and terraform-docs regenerated cleanly.

Validation Steps

  1. From the repo root, run npm run tf-validate and confirm the 030-data schema-registry module validates.
  2. Confirm regenerated docs are up to date: terraform-docs --config .terraform-docs.yml src/000-cloud/030-data/terraform/modules/schema-registry shows no diff.
  3. Apply blueprints/full-single-node-cluster/terraform against a clean subscription as the deploying principal — schema upload should succeed without manual role grants. Re-run with blob_data_contributor_principal_id set to a service principal id to confirm the override path.
  4. Build the Bicep blueprint: az bicep build -f blueprints/full-single-node-cluster/bicep/main.bicep — confirm the four restored parameters (trustIssuerSettings, shouldDeployAioDeploymentScripts, shouldEnableOtelCollector, shouldEnableOpcUaSimulator) appear and can be overridden.
  5. Trigger a fresh K3s VM Arc bootstrap and confirm linux-cluster-server-setup succeeds (azure-cli 2.67+ runner).
  6. Follow docs/getting-started/upgrade-aio.md end-to-end against an existing AIO instance to verify the upgrade path.

Checklist

  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have run terraform fmt on all Terraform code
  • I have run terraform validate on all Terraform code
  • I have run az bicep format on all Bicep code
  • I have run az bicep build to validate all Bicep code
  • I have checked for any sensitive data/tokens that should not be committed
  • Lint checks pass (run applicable linters for changed file types)

Security Review

  • No credentials, secrets, or tokens are hardcoded or logged
  • RBAC and identity changes follow least-privilege principles — new role is scoped to a single blob container, not the storage account, and is configurable per-deployment.
  • No new network exposure or public endpoints introduced without justification
  • Dependency additions or updates have been reviewed for known vulnerabilities — version bumps follow upstream az iot ops 2.4.0 supported matrix.
  • Container image changes use pinned digests or SHA references — N/A; only AIO extension version strings updated.

Additional Notes

  • Behavioral default change: shouldEnableOtelCollector defaults to true (matching upstream defaults) where it was hard-coded false while DeploymentScripts was pinned. Existing pipelines that don't want OTel must pass shouldEnableOtelCollector: false.
  • Existing deployments upgrading to the bumped versions should follow docs/getting-started/upgrade-aio.md (terraform apply -refresh-only after az iot ops upgrade) to reconcile state without recreating resources.
  • The new blob_data_contributor_principal_id variable is optional; CI deployments running as the deploying user/SP need no changes.

Screenshots (if applicable)

bindsi and others added 2 commits May 5, 2026 09:56
- add trust issuer settings and deployment script parameters
- bump version for cert-manager and secret sync controller
- enhance MQTT broker configurations with new application URI
- update README and variable files for consistency

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
…butor role

- add role assignment for blob data contributor on schemas container
- update README and variables to reflect new role and its purpose
- create upgrade guide for Azure IoT Operations

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
bindsi and others added 2 commits May 5, 2026 16:17
… README files for consistency

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>

github-actions Bot commented May 5, 2026

📚 Documentation Health Report

Generated on: 2026-05-05 15:24:51 UTC

📈 Documentation Statistics

| Category                      | File Count |
|-------------------------------|------------|
| Main Documentation            | 218        |
| Infrastructure Components     | 196        |
| Blueprints                    | 39         |
| GitHub Resources              | 43         |
| AI Assistant Guides (Copilot) | 17         |
| Total                         | 513        |

🏗️ Three-Tree Architecture Status

  • ✅ Bicep Documentation Tree: Auto-generated navigation
  • ✅ Terraform Documentation Tree: Auto-generated navigation
  • ✅ README Documentation Tree: Manual README organization

🔍 Quality Metrics

  • Frontmatter Validation: success
  • Link Validation: success

This report is automatically generated by the Documentation Automation workflow.

@katriendg (Collaborator) left a comment

PR Review — AIO 2604 Release Support

Thanks for the thorough work here — the version bumps, schema-registry RBAC hardening, securityPki.applicationUri, az login --username migration, and the new upgrade doc are all well-implemented and consistent between Terraform and Bicep. Appreciate the additional fixes for the 403 and the azure-cli 2.67+ breakage. 🙌

Below are suggestions to make this fully thorough before merge.


1. PR Title — Suggest referencing AIO 2604

The branch is feature/aio-2604 and the description references az iot ops 2.4.0, but the title doesn't convey the AIO 2604 release or the matching version 1.3.70. Suggest something like:

feat(iot-ops): upgrade AIO 2604 release (1.3.70), harden schema-registry RBAC

2. docs/getting-started/upgrade-aio.md — Add version matrix and official supported-versions link

The upgrade guide doesn't correlate the az iot ops CLI version with the AIO release or extension versions. Users need this to know which CLI extension maps to which component versions.

Suggested addition (after "The reconciliation steps differ…"):

## Version matrix

This repository currently targets the **AIO 2604** release. The table below maps `az iot ops` CLI versions to the component versions pinned in edge-ai:

| CLI extension (`azure-iot-ops`) | AIO release | cert-manager | secret-sync-controller | iotOperations |
|---------------------------------|-------------|--------------|------------------------|---------------|
| 2.4.0                           | 2604        | 0.11.0       | 1.4.0                  | 1.3.70        |

For the full upstream compatibility matrix, see [Supported versions — Azure IoT Operations](https://learn.microsoft.com/en-us/azure/iot-operations/deploy-iot-ops/howto-upgrade?tabs=portal#supported-versions).

Also add a References section at the end:

## References

- [Supported versions — Azure IoT Operations](https://learn.microsoft.com/en-us/azure/iot-operations/deploy-iot-ops/howto-upgrade?tabs=portal#supported-versions)
- [Upgrade Azure IoT Operations — Official guide](https://learn.microsoft.com/en-us/azure/iot-operations/deploy-iot-ops/howto-upgrade)

3. docs/getting-started/upgrade-aio.md — Document behavior for pinned edge-ai releases

The doc assumes users always pull the latest main. In practice, teams pin to a tagged release. The reconciliation steps should call out what happens when the user is on a pinned release with older version defaults than what az iot ops upgrade installed.

Suggested callout after Terraform step 3:

Pinned releases: If your team pins to a specific edge-ai release tag rather than main, the version defaults in that release may be older than what az iot ops upgrade installed. In that case, after -refresh-only, terraform plan will show no diff for the AIO extensions (state matches Azure). However, if you later move to a newer edge-ai release with higher version pins, the next apply will attempt to upgrade again. To stay aligned, either upgrade edge-ai to the release that matches the AIO versions you upgraded to, or override the version variables in your terraform.tfvars.

And similarly after Bicep step 2:

Pinned releases: If your team pins to a specific edge-ai release tag, ensure the version parameters passed to the blueprint match or exceed what az iot ops upgrade installed. If they are lower, the next deployment will attempt to downgrade the extensions. Override the version parameters explicitly or upgrade to an edge-ai release that includes the newer defaults.
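
To illustrate the downgrade risk described in the two callouts, here is a small Python sketch that flags components whose pinned version is lower than the installed one. The helper names are hypothetical, and the two version sets are simply the "From" and "To" values from this PR used as example input. Versions are compared numerically, part by part, because plain string comparison misorders values such as 0.10.2 and 0.9.0.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.3.70' into (1, 3, 70) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_downgrades(pinned: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Return components whose pinned version is lower than the installed one."""
    return sorted(
        name
        for name, version in pinned.items()
        if name in installed
        and parse_version(version) < parse_version(installed[name])
    )


# Example input: an older pinned release versus what the upgrade installed.
pinned_defaults = {
    "cert-manager": "0.10.2",
    "secret-sync-controller": "1.3.0",
    "iotoperations": "1.3.38",
}
installed_versions = {
    "cert-manager": "0.11.0",
    "secret-sync-controller": "1.4.0",
    "iotoperations": "1.3.70",
}

print(find_downgrades(pinned_defaults, installed_versions))
```

Any component listed in the result would need an explicit version override, or a move to a newer edge-ai release, before the next deployment.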


4. blueprints/full-single-node-cluster/bicep/main.bicep — Stale comments (Lines 157, 170, 176, 180)

Four comments reading // Currently disable setting shouldDeployAioDeploymentScripts, remove when DeploymentScripts supports AZ CLI 2.71+ (post May 4) remain but the code below them is now un-disabled (they are real params). These should be removed — they contradict the current implementation.


5. blueprints/only-edge-iot-ops/bicep/main.bicep + blueprints/full-multi-node-cluster/bicep/main.bicep — Incomplete param restoration

The full-single-node-cluster blueprint correctly converts the temporary var workarounds back to first-class params. However, the same workarounds still exist in these two blueprints:

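| Blueprint                               | Still uses var workaround                                                                             |
|-----------------------------------------|-------------------------------------------------------------------------------------------------------|
| only-edge-iot-ops/bicep/main.bicep      | trustIssuerSettings, shouldDeployAioDeploymentScripts, shouldEnableOtelCollector, shouldEnableOpcUaSimulator |
| full-multi-node-cluster/bicep/main.bicep | trustIssuerSettings, shouldDeployAioDeploymentScripts, shouldEnableOtelCollector, shouldEnableOpcUaSimulator |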

Since the "post May 4" condition (DeploymentScripts supports AZ CLI 2.71+) is now met, these should be restored consistently. Without this, only full-single-node-cluster users can override these params while other blueprint consumers remain locked to hard-coded values.

Apply the same pattern: remove the var lines, uncomment/restore as params with matching defaults, and add the iotOpsTypes import where needed.


✅ What Looks Good

  • Version bumps consistent across TF/Bicep: cert-manager 0.11.0, secret-sync 1.4.0, iotoperations 1.3.70
  • securityPki.applicationUri — both produce urn:microsoft.com:aio:opc:ua:broker:<5-char-hash>
  • blob_data_contributor_principal_id — optional with coalesce() fallback, container-scoped (least-privilege)
  • az login --identity --username — complete across all scripts, no --client-id references remain
  • upgrade-aio.md — comprehensive with TF/Bicep reconciliation and troubleshooting
  • No old version references (0.10.2, 1.3.0, 1.3.38) found anywhere in the codebase

⚠️ Minor Awareness Items (non-blocking)

  1. Behavioral default change: shouldEnableOtelCollector now defaults to true. Existing pipelines relying on OTel being off will need shouldEnableOtelCollector: false. PR description documents this.
  2. Hash algorithm difference: Bicep uniqueString() vs Terraform sha256() will produce different 5-char applicationUri suffixes for the same cluster ID. Acceptable for single-framework deployments but worth noting for cross-framework scenarios.

@katriendg (Collaborator) commented

Also just noticed: shouldn't this be CLI 2.5 rather than 2.4 as documented? https://github.com/Azure/azure-iot-ops-cli-extension/releases/tag/v2.5.0

Development

Successfully merging this pull request may close these issues.

Bump AIO component versions and fix schema-registry RBAC + K3s bootstrap

2 participants