Fix azurebackup protect/vault-create: MSI default, terminal status, VM timeout#2470
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes several end-to-end issues in the azurebackup tool by (1) ensuring new DPP vaults are usable immediately (MSI enabled), (2) returning real/terminal outcomes from protecteditem protect instead of a synthetic “Accepted”, and (3) removing a VM container pre-discovery loop that caused timeouts on newly created VMs.
Changes:
- Enable System-Assigned Managed Identity by default when creating DPP backup vaults.
- Improve
protecteditem protectoutcome reporting: RSV pollsConfigureBackupto terminal status; DPP waits for completion and reads back the backup instance protection status. - Extend unit/live tests and test resources to validate the new behavior; update docs/changelog accordingly.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tools/Azure.Mcp.Tools.AzureBackup/tests/test-resources.bicep | Adds a managed disk + RBAC assignments needed to E2E test DPP disk protection via vault MSI. |
| tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cs | Adds unit tests asserting DPP omits jobId and surfaces protectionStatus / errors; RSV surfaces terminal job status. |
| tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.LiveTests/AzureBackupCommandTests.cs | Adds live assertions for DPP vault MSI default and a new DPP disk protect E2E test. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Services/RsvBackupOperations.cs | Removes VM container pre-discovery loop; adds polling helpers to return terminal RSV job outcome + error details. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Services/DppBackupOperations.cs | Defaults DPP vault identity to SystemAssigned; switches protect to WaitUntil.Completed + read-back of backup instance protection status. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Models/ProtectResult.cs | Extends protect response model to include optional ProtectionStatus and ErrorMessage. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Commands/Vault/VaultCreateCommand.cs | Updates command description to document DPP MSI default behavior. |
| servers/Azure.Mcp.Server/docs/azmcp-commands.md | Updates command reference text for vault create and protecteditem protect semantics. |
| servers/Azure.Mcp.Server/CHANGELOG.md | Adds bug-fix release notes for the MSI default, terminal status surfacing, and VM timeout fix. |
Three independent bug fixes for azurebackup tooling, surfaced via end-to-end testing against a live subscription. Bug microsoft#1 — `azurebackup vault create --vault-type dpp` created Data Protection backup vaults without a Managed Identity. Every subsequent `protecteditem protect` against the new vault then failed server-side with `VaultMSIUnauthorized`. Fixed by defaulting new DPP vaults to a System-Assigned Managed Identity, matching Portal and CLI behaviour. Bug microsoft#2 — `azurebackup protecteditem protect` returned a synthetic "Accepted" response with a base64-encoded job ID that callers could not resolve. Fixed to surface the true terminal outcome: - RSV: poll the underlying ConfigureBackup job (max 12 min) and return its real status (`Completed`, `CompletedWithWarnings`, `Failed`, `Cancelled`, or `InProgress` if polling times out) with the real job ID and error details (AOT-safe typed switch). - DPP: wait for the LRO to complete, read back the BackupInstance, and return the actual `protectionStatus`. DPP responses no longer carry a misleading `jobId` — DPP protection is not a backup job. - `ProtectResult` record extended with optional `ProtectionStatus` and `ErrorMessage` fields. Bug microsoft#3 — `azurebackup protecteditem protect` for IaaS VM performed a pre-emptive 180 s container-discovery poll that timed out on freshly created VMs. Removed — the Recovery Services backend registers the container as part of accepting the protect request, matching `az backup protection enable-for-vm`. Tests: - Unit: 334/334 pass; +5 new tests for terminal-status surfacing (DPP Succeeded/Failed, RSV Completed/Failed/InProgress). - Live: extended `VaultCreate_CreatesDppVault_Successfully` to assert System-Assigned MSI; added `ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E` which creates a disk policy, protects a disk via MCP, and asserts `protectionStatus` on success or `errorMessage` on failure with no `jobId` for DPP. - Live infra: added `${baseName}-disk` (4 GiB Standard_LRS) and the Disk Backup Reader / Disk Snapshot Contributor role assignments needed by the DPP vault MSI. Manual E2E verification (live sub `f0c630e0-…`): - DPP vault create: identity = SystemAssigned, principalId populated. - RSV VM (fresh) protect: Completed in ~95 s (no 180 s pre-poll). - RSV SQL-in-VM (master) protect: Completed in 112 s. - RSV AFS protect: Failed surfaced with real job id + error message. - DPP Disk (fresh) protect: Succeeded → ProtectionConfigured. - DPP ADLS/ESAN: Failed surfaced with service-side error message (HNS unsupported, plugin error) — fix correctly reports the truth. Build: full solution 0 warnings, 0 errors. Spell: clean for all newly-added lines (pre-existing unrelated warnings on main remain untouched).
94726dd to
043d001
Compare
This reverts commit ce9e17c.
- Fix VaultCreate_CreatesDppVault assertion: vault_get returns 'vaults' array, not a singular 'vault' property. - Add GeneralRegexSanitizers for @microsoft.com UPNs (createdBy/lastModifiedBy in x-ms-arm-resource-system-data) and the recording tenant id (x-ms-operation-identifier and async cert URLs). - Re-record VaultCreate_CreatesDppVault_Successfully and ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E and update assets.json tag.
jongio
left a comment
There was a problem hiding this comment.
Three solid bug fixes. The MSI default matches Portal/CLI behavior, terminal status replaces the opaque Accepted, and removing the pre-discovery loop for VM protection is a clean subtraction-only fix.
Test coverage is thorough - unit tests cover all new status paths (DPP success/failure, RSV completed/failed/in-progress) and the E2E test validates the full DPP disk protection flow including MSI dependency.
One observation on the DPP protect error attribution below.
Split the single try-catch in DppBackupOperations.ProtectItemAsync so that a transient failure on the backup instance re-read (GetAsync) no longer misattributes the status as 'Failed' when the CreateOrUpdateAsync protection operation actually succeeded. The re-read is now wrapped in its own try-catch that swallows transient errors while still surfacing the correct 'Succeeded' status to the caller.
jongio
left a comment
There was a problem hiding this comment.
Addresses my earlier feedback on the DPP protect error attribution. The split into two separate try-catch blocks is the right fix - protection failures route through the first catch and return "Failed", while transient re-read failures are absorbed in the second catch with "Succeeded" and a null protectionStatus.
Control flow is sound: operation is declared before the first try block, and the first catch returns early, so the compiler enforces that operation is definitely assigned before the re-read block executes. The second catch correctly narrows to RequestFailedException since that's what the Azure SDK throws for service errors.
CI is red on native builds (linux_x64, linux_arm64, macos_x64, windows_x64) and Build Analyze. These don't look caused by this 1-file change (a try-catch restructure doesn't affect AOT), but they're blocking the PR. Worth investigating before merge.
|
There're some unit test failures causing some of the build failures- /mnt/vss/_work/1/s/tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cs(154,9): error CS0103: The name '_backupService' does not exist in the current context [/mnt/vss/_work/1/s/tools/Azure.Mcp.Tools.AzureBackup/tests/Azure.Mcp.Tools.AzureBackup.UnitTests/Azure.Mcp.Tools.AzureBackup.UnitTests.csproj] Please fix them |
582b725 to
8f6d73e
Compare
Summary
Three independent bug fixes for the
azurebackuptool, surfaced and validated through end-to-end manual testing against a live Azure subscription across every supported workload (IaaS VM, SQL-in-VM, Azure Files, PostgreSQL Flex, Azure Disk, Blob, ADLS, ElasticSAN, AKS).Bugs fixed
Bug #1 — DPP vault created without Managed Identity → every subsequent
protectcall failsazurebackup vault create --vault-type dppused to create Data Protection backup vaults with no Managed Identity. Every laterazurebackup protecteditem protectagainst that vault then failed server-side withVaultMSIUnauthorized, because the backend needs the vault's MSI to read the datasource. This diverged from both the Portal andaz dataprotection backup-vault create(both of which default MSI on).Fix: new DPP vaults are now created with a System-Assigned Managed Identity by default. RSV path is unaffected. Users who need to add MSI to an existing DPP vault can run
azurebackup vault update --identity-type SystemAssigned.Bug #2 —
protectreturned a synthetic "Accepted" and an unusablejobIdazurebackup protecteditem protectreturnedstatus: "Accepted"with a base64-encoded operation GUID asjobId. Callers could not resolve that id viaazurebackup job get, so the protect outcome was opaque — callers could not tell whether protection actually succeeded, failed, or was still running.Fix: surface the real terminal outcome:
ConfigureBackupbackup job viaBackupJobResource.GetAsync(every 10 s, budget ~12 min) and return its real terminal status (Completed,CompletedWithWarnings,Failed,Cancelled, orInProgressif the poll budget is exceeded), the realjobId, and any error message extracted from the job'sErrorDetails(AOT-safe typedswitchacrossIaasVmBackupJob,IaasVmBackupJobV2,StorageBackupJob,WorkloadBackupJob).CreateOrUpdateAsync(WaitUntil.Completed, …), thenGETthe BackupInstance and return the actualprotectionStatus(e.g.ProtectionConfigured). DPP responses no longer carry a misleadingjobId— DPP protection is not a backup job; for on-demand backup / restore monitoring,azurebackup job get/listremain the right commands.ProtectResultrecord extended with optionalProtectionStatusandErrorMessage.Bug #3 — IaaS VM protect times out on freshly created VMs
The RSV VM branch used to call
RefreshProtectionContainerAsyncand then loop up to 36×5 s (= 180 s) waiting for the VM's protection container to appear in RSV before submitting the protect request. Freshly created VMs failed this pre-check and the command surfaced a timeout, even thoughaz backup protection enable-for-vmworks fine on the same VM.Fix: removed the pre-discovery loop. The Recovery Services backend registers the container automatically as part of accepting the protect request — the same behaviour
az backup protection enable-for-vmrelies on.Code changes (9 files, +543 / -69)
tools/.../src/Services/DppBackupOperations.csWaitUntil.Completed+ read-back ofBackupInstance;ProtectResultcarriesProtectionStatus/ErrorMessage; removedExtractJobIdFromOperationtools/.../src/Services/RsvBackupOperations.csBuildRsvProtectResultAsync/WaitForJobAsync/ExtractJobErrorMessage/FirstNonEmptyhelpers (AOT-safe typedswitch)tools/.../src/Models/ProtectResult.csProtectionStatusandErrorMessagetools/.../src/Commands/Vault/VaultCreateCommand.cstools/.../src/Commands/ProtectedItem/ProtectedItemProtectCommand.cstools/.../tests/...UnitTests/ProtectedItem/ProtectedItemProtectCommandTests.cstools/.../tests/...LiveTests/AzureBackupCommandTests.csVaultCreate_CreatesDppVault_Successfullyto assertidentityType=SystemAssigned; newProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2Etools/.../tests/test-resources.bicep${baseName}-disk+ Disk Backup Reader / Disk Snapshot Contributor role assignments for the DPP vault MSIservers/Azure.Mcp.Server/CHANGELOG.md,docs/azmcp-commands.mdTest coverage
Azure.Mcp.Tools.AzureBackup.UnitTests(+5 new tests covering DPP Succeeded / DPP Failed with error / RSV terminal job / RSV failed with error / RSV polling-budget expiry).VaultCreate_CreatesDppVault_Successfullyto assert MSI presence; newProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2Eruns the full flow through MCP (create disk policy, protect disk, assertprotectionStatuson success orerrorMessageon failure, assert nojobIdfor DPP).dotnet build Microsoft.Mcp.slnx— 0 warnings, 0 errors.dotnet format --verify-no-changesclean.Invoke-Cspell.ps1clean on all newly-added lines (pre-existing unrelated warnings onmainremain untouched).Manual E2E verification (subscription
f0c630e0-…, vaultazbackup-mcp-bv-20260423)identityType=SystemAssigned, principalId populatedCompletedin ~95 smaster)Completedin 112 saz backup container register(workload prerequisite) →protectableitem listvia MCP → protect via MCPFailedsurfaced with realjobId+ error messageSucceeded→ProtectionConfiguredProtectionConfigured(no regression)Failedsurfaced correctly (UserErrorDppDatasourceAlreadyProtected)Failed+UserErrorUnsupportedStorageAccountType(HNS)Failed+PluginUnknownErrorInvoking Livetests
Copilot submitted PRs are not trustworthy by default. Users with
writeaccess to the repo need to validate the contents of this PR before leaving a comment with the text/azp run mcp - pullrequest - live. This will trigger the necessary livetest workflows to complete required validation.