Skip to content

Fix azurebackup protect/vault-create: MSI default, terminal status, VM timeout#2470

Merged
shrja-ms merged 8 commits into
microsoft:mainfrom
shrja-ms:user/azurebackup-protect-bugfixes
Apr 28, 2026
Merged

Fix azurebackup protect/vault-create: MSI default, terminal status, VM timeout#2470
shrja-ms merged 8 commits into
microsoft:mainfrom
shrja-ms:user/azurebackup-protect-bugfixes

Conversation

@shrja-ms
Copy link
Copy Markdown
Contributor

@shrja-ms shrja-ms commented Apr 23, 2026

Summary

Three independent bug fixes for the azurebackup tool, surfaced and validated through end-to-end manual testing against a live Azure subscription across every supported workload (IaaS VM, SQL-in-VM, Azure Files, PostgreSQL Flex, Azure Disk, Blob, ADLS, ElasticSAN, AKS).

Bugs fixed

Bug #1 — DPP vault created without Managed Identity → every subsequent protect call fails

azurebackup vault create --vault-type dpp used to create Data Protection backup vaults with no Managed Identity. Every later azurebackup protecteditem protect against that vault then failed server-side with VaultMSIUnauthorized, because the backend needs the vault's MSI to read the datasource. This diverged from both the Portal and az dataprotection backup-vault create (both of which default MSI on).

Fix: new DPP vaults are now created with a System-Assigned Managed Identity by default. RSV path is unaffected. Users who need to add MSI to an existing DPP vault can run azurebackup vault update --identity-type SystemAssigned.

Bug #2protect returned a synthetic "Accepted" and an unusable jobId

azurebackup protecteditem protect returned status: "Accepted" with a base64-encoded operation GUID as jobId. Callers could not resolve that id via azurebackup job get, so the protect outcome was opaque — callers could not tell whether protection actually succeeded, failed, or was still running.

Fix: surface the real terminal outcome:

  • RSV path: after submitting the protect request, poll the underlying ConfigureBackup backup job via BackupJobResource.GetAsync (every 10 s, budget ~12 min) and return its real terminal status (Completed, CompletedWithWarnings, Failed, Cancelled, or InProgress if the poll budget is exceeded), the real jobId, and any error message extracted from the job's ErrorDetails (AOT-safe typed switch across IaasVmBackupJob, IaasVmBackupJobV2, StorageBackupJob, WorkloadBackupJob).
  • DPP path: CreateOrUpdateAsync(WaitUntil.Completed, …), then GET the BackupInstance and return the actual protectionStatus (e.g. ProtectionConfigured). DPP responses no longer carry a misleading jobId — DPP protection is not a backup job; for on-demand backup / restore monitoring, azurebackup job get / list remain the right commands.
  • ProtectResult record extended with optional ProtectionStatus and ErrorMessage.

Bug #3 — IaaS VM protect times out on freshly created VMs

The RSV VM branch used to call RefreshProtectionContainerAsync and then loop up to 36×5 s (= 180 s) waiting for the VM's protection container to appear in RSV before submitting the protect request. Freshly created VMs failed this pre-check and the command surfaced a timeout, even though az backup protection enable-for-vm works fine on the same VM.

Fix: removed the pre-discovery loop. The Recovery Services backend registers the container automatically as part of accepting the protect request — the same behaviour az backup protection enable-for-vm relies on.

Code changes (9 files, +543 / -69)

File Change
tools/.../src/Services/DppBackupOperations.cs MSI-on-create; WaitUntil.Completed + read-back of BackupInstance; ProtectResult carries ProtectionStatus / ErrorMessage; removed ExtractJobIdFromOperation
tools/.../src/Services/RsvBackupOperations.cs Removed VM pre-discovery loop; new BuildRsvProtectResultAsync / WaitForJobAsync / ExtractJobErrorMessage / FirstNonEmpty helpers (AOT-safe typed switch)
tools/.../src/Models/ProtectResult.cs Record extended with ProtectionStatus and ErrorMessage
tools/.../src/Commands/Vault/VaultCreateCommand.cs Description mentions DPP MSI default
tools/.../src/Commands/ProtectedItem/ProtectedItemProtectCommand.cs Description explains terminal-status semantics for RSV and DPP
tools/.../tests/...UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cs 5 new tests for terminal-status surfacing
tools/.../tests/...LiveTests/AzureBackupCommandTests.cs Extended VaultCreate_CreatesDppVault_Successfully to assert identityType=SystemAssigned; new ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E
tools/.../tests/test-resources.bicep ${baseName}-disk + Disk Backup Reader / Disk Snapshot Contributor role assignments for the DPP vault MSI
servers/Azure.Mcp.Server/CHANGELOG.md, docs/azmcp-commands.md Release notes + command doc updates

Test coverage

  • Unit: 334/334 pass in Azure.Mcp.Tools.AzureBackup.UnitTests (+5 new tests covering DPP Succeeded / DPP Failed with error / RSV terminal job / RSV failed with error / RSV polling-budget expiry).
  • Live: extended VaultCreate_CreatesDppVault_Successfully to assert MSI presence; new ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E runs the full flow through MCP (create disk policy, protect disk, assert protectionStatus on success or errorMessage on failure, assert no jobId for DPP).
  • Build: full solution dotnet build Microsoft.Mcp.slnx — 0 warnings, 0 errors.
  • Format: dotnet format --verify-no-changes clean.
  • Spell: Invoke-Cspell.ps1 clean on all newly-added lines (pre-existing unrelated warnings on main remain untouched).

Manual E2E verification (subscription f0c630e0-…, vault azbackup-mcp-bv-20260423)

Workload Path Result Notes
DPP vault create new vault identityType=SystemAssigned, principalId populated Bug #1
IaaS VM (fresh VM) RSV Completed in ~95 s Bug #3 — no 180 s pre-poll
SQL-in-VM (master) RSV Completed in 112 s Full E2E: policy via MCP → az backup container register (workload prerequisite) → protectableitem list via MCP → protect via MCP
Azure Files (AFS) RSV Failed surfaced with real jobId + error message Bug #2 — status truth
Azure Disk (fresh) DPP SucceededProtectionConfigured Bug #1 + Bug #2
PostgreSQL Flex (existing) DPP ProtectionConfigured (no regression) Re-verified
Blob (existing) DPP Failed surfaced correctly (UserErrorDppDatasourceAlreadyProtected) Bug #2 — errors now readable
ADLS DPP Failed + UserErrorUnsupportedStorageAccountType (HNS) Service-side; fix correctly reports the truth
ElasticSAN DPP Failed + PluginUnknownError Service-side; fix correctly reports the truth

Invoking Livetests

Copilot submitted PRs are not trustworthy by default. Users with write access to the repo need to validate the contents of this PR before leaving a comment with the text /azp run mcp - pullrequest - live. This will trigger the necessary livetest workflows to complete required validation.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes several end-to-end issues in the azurebackup tool by (1) ensuring new DPP vaults are usable immediately (MSI enabled), (2) returning real/terminal outcomes from protecteditem protect instead of a synthetic “Accepted”, and (3) removing a VM container pre-discovery loop that caused timeouts on newly created VMs.

Changes:

  • Enable System-Assigned Managed Identity by default when creating DPP backup vaults.
  • Improve protecteditem protect outcome reporting: RSV polls ConfigureBackup to terminal status; DPP waits for completion and reads back the backup instance protection status.
  • Extend unit/live tests and test resources to validate the new behavior; update docs/changelog accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tools/Azure.Mcp.Tools.AzureBackup/tests/test-resources.bicep Adds a managed disk + RBAC assignments needed to E2E test DPP disk protection via vault MSI.
tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cs Adds unit tests asserting DPP omits jobId and surfaces protectionStatus / errors; RSV surfaces terminal job status.
tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.LiveTests/AzureBackupCommandTests.cs Adds live assertions for DPP vault MSI default and a new DPP disk protect E2E test.
tools/Azure.Mcp.Tools.AzureBackup/src/Services/RsvBackupOperations.cs Removes VM container pre-discovery loop; adds polling helpers to return terminal RSV job outcome + error details.
tools/Azure.Mcp.Tools.AzureBackup/src/Services/DppBackupOperations.cs Defaults DPP vault identity to SystemAssigned; switches protect to WaitUntil.Completed + read-back of backup instance protection status.
tools/Azure.Mcp.Tools.AzureBackup/src/Models/ProtectResult.cs Extends protect response model to include optional ProtectionStatus and ErrorMessage.
tools/Azure.Mcp.Tools.AzureBackup/src/Commands/Vault/VaultCreateCommand.cs Updates command description to document DPP MSI default behavior.
servers/Azure.Mcp.Server/docs/azmcp-commands.md Updates command reference text for vault create and protecteditem protect semantics.
servers/Azure.Mcp.Server/CHANGELOG.md Adds bug-fix release notes for the MSI default, terminal status surfacing, and VM timeout fix.

Comment thread tools/Azure.Mcp.Tools.AzureBackup/src/Services/RsvBackupOperations.cs Outdated
Three independent bug fixes for azurebackup tooling, surfaced via
end-to-end testing against a live subscription.

Bug microsoft#1 — `azurebackup vault create --vault-type dpp` created Data
Protection backup vaults without a Managed Identity. Every subsequent
`protecteditem protect` against the new vault then failed server-side
with `VaultMSIUnauthorized`. Fixed by defaulting new DPP vaults to a
System-Assigned Managed Identity, matching Portal and CLI behaviour.

Bug microsoft#2 — `azurebackup protecteditem protect` returned a synthetic
"Accepted" response with a base64-encoded job ID that callers could
not resolve. Fixed to surface the true terminal outcome:
 - RSV: poll the underlying ConfigureBackup job (max 12 min) and
   return its real status (`Completed`, `CompletedWithWarnings`,
   `Failed`, `Cancelled`, or `InProgress` if polling times out) with
   the real job ID and error details (AOT-safe typed switch).
 - DPP: wait for the LRO to complete, read back the BackupInstance,
   and return the actual `protectionStatus`. DPP responses no longer
   carry a misleading `jobId` — DPP protection is not a backup job.
 - `ProtectResult` record extended with optional `ProtectionStatus`
   and `ErrorMessage` fields.

Bug microsoft#3 — `azurebackup protecteditem protect` for IaaS VM performed a
pre-emptive 180 s container-discovery poll that timed out on freshly
created VMs. Removed — the Recovery Services backend registers the
container as part of accepting the protect request, matching
`az backup protection enable-for-vm`.

Tests:
 - Unit: 334/334 pass; +5 new tests for terminal-status surfacing
   (DPP Succeeded/Failed, RSV Completed/Failed/InProgress).
 - Live: extended `VaultCreate_CreatesDppVault_Successfully` to assert
   System-Assigned MSI; added
   `ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E` which
   creates a disk policy, protects a disk via MCP, and asserts
   `protectionStatus` on success or `errorMessage` on failure with no
   `jobId` for DPP.
 - Live infra: added `${baseName}-disk` (4 GiB Standard_LRS) and the
   Disk Backup Reader / Disk Snapshot Contributor role assignments
   needed by the DPP vault MSI.

Manual E2E verification (live sub `f0c630e0-…`):
 - DPP vault create: identity = SystemAssigned, principalId populated.
 - RSV VM (fresh) protect: Completed in ~95 s (no 180 s pre-poll).
 - RSV SQL-in-VM (master) protect: Completed in 112 s.
 - RSV AFS protect: Failed surfaced with real job id + error message.
 - DPP Disk (fresh) protect: Succeeded → ProtectionConfigured.
 - DPP ADLS/ESAN: Failed surfaced with service-side error message
   (HNS unsupported, plugin error) — fix correctly reports the truth.

Build: full solution 0 warnings, 0 errors.
Spell: clean for all newly-added lines (pre-existing unrelated
warnings on main remain untouched).
@shrja-ms shrja-ms force-pushed the user/azurebackup-protect-bugfixes branch from 94726dd to 043d001 Compare April 23, 2026 09:31
- Fix VaultCreate_CreatesDppVault assertion: vault_get returns 'vaults' array, not a singular 'vault' property.

- Add GeneralRegexSanitizers for @microsoft.com UPNs (createdBy/lastModifiedBy in x-ms-arm-resource-system-data) and the recording tenant id (x-ms-operation-identifier and async cert URLs).

- Re-record VaultCreate_CreatesDppVault_Successfully and ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E and update assets.json tag.
Copy link
Copy Markdown
Contributor

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three solid bug fixes. The MSI default matches Portal/CLI behavior, terminal status replaces the opaque Accepted, and removing the pre-discovery loop for VM protection is a clean subtraction-only fix.

Test coverage is thorough - unit tests cover all new status paths (DPP success/failure, RSV completed/failed/in-progress) and the E2E test validates the full DPP disk protection flow including MSI dependency.

One observation on the DPP protect error attribution below.

Split the single try-catch in DppBackupOperations.ProtectItemAsync so that
a transient failure on the backup instance re-read (GetAsync) no longer
misattributes the status as 'Failed' when the CreateOrUpdateAsync
protection operation actually succeeded. The re-read is now wrapped in
its own try-catch that swallows transient errors while still surfacing
the correct 'Succeeded' status to the caller.
@shrja-ms shrja-ms requested a review from jongio April 27, 2026 11:03
Copy link
Copy Markdown
Contributor

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addresses my earlier feedback on the DPP protect error attribution. The split into two separate try-catch blocks is the right fix - protection failures route through the first catch and return "Failed", while transient re-read failures are absorbed in the second catch with "Succeeded" and a null protectionStatus.

Control flow is sound: operation is declared before the first try block, and the first catch returns early, so the compiler enforces that operation is definitely assigned before the re-read block executes. The second catch correctly narrows to RequestFailedException since that's what the Azure SDK throws for service errors.

CI is red on native builds (linux_x64, linux_arm64, macos_x64, windows_x64) and Build Analyze. These don't look caused by this 1-file change (a try-catch restructure doesn't affect AOT), but they're blocking the PR. Worth investigating before merge.

@anannya03
Copy link
Copy Markdown
Contributor

There're some unit test failures causing some of the build failures- /mnt/vss/_work/1/s/tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cs(154,9): error CS0103: The name '_backupService' does not exist in the current context [/mnt/vss/_work/1/s/tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/Azure.Mcp.Tools.AzureBackup.UnitTests.csproj]

Please fix them

Comment thread tools/Azure.Mcp.Tools.AzureBackup/src/Commands/Vault/VaultCreateCommand.cs Outdated
@shrja-ms shrja-ms requested review from anannya03 and jongio April 28, 2026 07:48
Comment thread docs/bug-bash/policy-create-followup-emails.md Outdated
@github-project-automation github-project-automation Bot moved this from Untriaged to In Progress in Azure MCP Server Apr 28, 2026
Comment thread fix-disk-test.ps1 Outdated
Comment thread tmp-assets-main.json Outdated
@shrja-ms shrja-ms requested a review from anannya03 April 28, 2026 11:11
Comment thread tools/Azure.Mcp.Tools.AzureBackup/src/Services/DppBackupOperations.cs Outdated
Comment thread tools/Azure.Mcp.Tools.AzureBackup/src/Services/RsvBackupOperations.cs Outdated
Comment thread tools/Azure.Mcp.Tools.AzureBackup/tests/test-resources.bicep Outdated
@shrja-ms shrja-ms force-pushed the user/azurebackup-protect-bugfixes branch from 582b725 to 8f6d73e Compare April 28, 2026 11:25
@shrja-ms shrja-ms requested a review from anannya03 April 28, 2026 11:28
@shrja-ms shrja-ms enabled auto-merge (squash) April 28, 2026 11:34
@shrja-ms shrja-ms merged commit 89e000c into microsoft:main Apr 28, 2026
16 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Azure MCP Server Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants