azurebackup: Fix 5 telemetry-triaged bugs (BUG-3,4,5,6,8)#2518
Conversation
There was a problem hiding this comment.
Pull request overview
This PR targets production failures in the Azure Backup tool (azurebackup_*) identified via telemetry triage, primarily by hardening input handling and surfacing actionable errors for known RSV deprecated-API scenarios.
Changes:
- Expanded workload-type normalization for
protectableitem_listfilters. - Added error-code-specific handling/messages for RSV soft-delete and CRR enable flows.
- Improved resilience in RSV protected item lookup (null-safe name matching) and added subscription format validation helper.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tools/Azure.Mcp.Tools.AzureBackup/src/Services/RsvBackupOperations.cs | Adds subscription format validation helper, extends workload-type normalization, and makes protected-item name matching null-safe. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Commands/Governance/GovernanceSoftDeleteCommand.cs | Adds handling for BMSUserErrorSoftDeleteUseVaultApi with user guidance. |
| tools/Azure.Mcp.Tools.AzureBackup/src/Commands/DisasterRecovery/DisasterRecoveryEnableCrrCommand.cs | Adds guidance for BMSUserErrorRedundancySettingsUseVaultApi in error messaging. |
6b00cfd to
73a95d6
Compare
83faa44 to
04561e8
Compare
jongio
left a comment
There was a problem hiding this comment.
Nice bug-fix batch - the Vault PATCH migrations and workload normalization look correct. Three items to address:
-
Bare
catchinGetBackupStatusAsyncswallows all exceptions including fatal ones. The adjacent try-catch correctly filters withwhen (ex is FormatException or ArgumentException or UriFormatException)- match that pattern here. -
Comment in
ConfigureCrossRegionRestoreAsyncsays "Try Vault PATCH API first" but the code tries the legacyBackupResourceConfigAPI first. The comment has the order backwards. -
ValidateSubscriptionFormatis missing fromConfigureSoftDeleteAsync. Every other mutating RSV method now calls it (GetVaultAsync, ListProtectedItemsAsync, ListPoliciesAsync, ListJobsAsync, UpdateVaultAsync, ConfigureCrossRegionRestoreAsync, ListProtectableItemsAsync). ConfigureSoftDeleteAsync is the odd one out - a non-GUID subscription will hit a confusing ARM error instead of the clear validation message.
Also, the PR description says BUG-5 is a "Full Vault PATCH migration. No workarounds" but ConfigureCrossRegionRestoreAsync actually uses a legacy-first/Vault-PATCH-fallback pattern. Consider updating the description to match the actual (reasonable) dual-path approach.
CI is failing on 4 native build targets - worth checking if it's related to these changes.
04561e8 to
08232ca
Compare
08232ca to
67b83dd
Compare
jongio
left a comment
There was a problem hiding this comment.
Incremental review after new push. Two of my three earlier findings are addressed:
- Misleading comment (ConfigureCrossRegionRestoreAsync): Fixed. Comment now correctly describes the legacy-first / Vault-PATCH-fallback flow.
- Missing ValidateSubscriptionFormat (ConfigureSoftDeleteAsync): Fixed. Added alongside the full migration to Vault PATCH API.
- catch (Exception) in AzureBackupService.cs: Now
catch (Exception)with a comment instead of barecatch. Still catches all exception types (including OOM), but the scope is narrow (single property access) and the null-fallback is intentional. Not a blocker.
ConfigureSoftDeleteAsync now uses Vault PATCH directly, which matches the BUG-5 description. The CRR early-return check for already-enabled state is a nice defensive addition.
Clean follow-up. Nothing new to flag.
67b83dd to
e0cb3bf
Compare
|
/azp run mcp - pullrequest - live |
|
Azure Pipelines successfully started running 1 pipeline(s). |
jongio
left a comment
There was a problem hiding this comment.
Third pass after rebase onto main. All three of my earlier threads are resolved:
- catch(Exception) - narrowed from bare catch, scoped to ResourceType.ToString(), comment explains intent. Fine.
- CRR comment - corrected to "Try legacy BackupResourceConfig API first."
- ValidateSubscriptionFormat in ConfigureSoftDeleteAsync - added. Now present in all 19 public methods that accept subscription, per @anannya03's follow-up.
Also noting the new ThrowIfNullOrWhiteSpace guard on immutabilityState with a dedicated unit test - solid addition.
The dual-path CRR approach (legacy first, Vault PATCH fallback on BMSUserErrorRedundancySettingsUseVaultApi, early-return when already enabled) looks correct. The DPP hardening in ProtectItemAsync and GetProtectedItemAsync follows the same defensive pattern as the RSV fixes.
No new issues from this pass.
When Cross-Region Restore is already enabled, the RSV legacy API, RSV vault PATCH, and DPP vault PATCH paths can return CloudInternalError. Treat this as success (no-op) so 'enable CRR' is safely idempotent. Also re-recorded DisasterRecoveryEnableCrr_DppVault and DisasterRecoveryEnableCrr_RsvVault live tests; updated assets.json tag to Azure.Mcp.Tools.AzureBackup.LiveTests_b3cc52d5b9.
…rror
CloudInternalError is a generic 500 from the Azure Backup service whose error message ('Microsoft Azure Backup encountered an internal error') does not distinguish 'CRR already enabled' from real platform failures. Catching it broadly risks masking legitimate failures.
Replace the broad catches with a deterministic pre-check:
- DPP: GET vault, return Succeeded early if FeatureSettings.CrossRegionRestoreState == Enabled. Remove the try/catch around UpdateAsync so genuine errors propagate.
- RSV: pre-check already exists for the same property on RedundancySettings; remove the two redundant CloudInternalError catches around the legacy BackupResourceConfig API and the Vault PATCH fallback.
The pre-check refactor adds a vault GET before the PATCH for DPP. Updates assets.json to Azure.Mcp.Tools.AzureBackup.LiveTests_d7cddf4a06 with the new recording.
DPP CreateOrUpdate API rejects updates to existing policies with UserErrorBMSUpdatePolicyNotSupported (400). Catch this specific error and return success, making policy create idempotent. Fixes live test ProtectedItemProtect_DppVault_DiskProtection_Succeeds_E2E which fails when the test policy already exists from a prior run.
These tests depend on mutable infrastructure state (existing policies, soft-deleted backup instances) and cannot reliably replay from recorded HTTP sessions. Mark with [LiveTestOnly] so they skip in Playback mode and only run during live test runs.
- Protect test: use GUID suffix in policy name so each run creates a fresh policy, avoiding UserErrorBMSUpdatePolicyNotSupported collisions - Undelete test: skip gracefully via Assert.Skip when no soft-deleted backup instance is available (consumed by prior run or not set up)
jongio
left a comment
There was a problem hiding this comment.
Fourth pass - 6 new commits since my last review, all DPP-focused:
- CreatePolicyAsync idempotency: Catches
UserErrorBMSUpdatePolicyNotSupported(400) so existing policies don't fail the create path. Narrow catch, correct error code. - CRR pre-check for DPP vaults: Reads vault state first and returns early if CRR is already enabled, avoiding the indistinguishable
CloudInternalError. Same pattern as the RSV CRR early-return. - LiveTestOnly on protect/undelete tests: Properly gates tests that mutate external state.
- GUID-suffixed policy names: Prevents collisions between concurrent test runs.
- Resilient undelete assertion: Gracefully skips via
Assert.Skipwhen no soft-deleted item exists instead of hard-failing.
All changes are defensive hardening and test resilience. No new issues.
|
/azp run mcp - pullrequest - live |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Changes look good to me |
… methodology - BUG-1 (backup_status ArgumentNullException) is fixed in microsoft#2518 (code included in merged PR, though PR title only lists BUG-3,4,5,6,8) - Fix unit test count methodology: count [Fact] + [InlineData] for runnable test cases, not [Fact] + [Theory] for method count - Add Release column to bug tracker table - Add note about verifying PR diff vs title for bug coverage
…workflow (#2570) * azurebackup: Add weekly telemetry report skill Add a Copilot skill under tools/Azure.Mcp.Tools.AzureBackup/skills/ that automates generation of weekly telemetry reports by: - Running KQL queries against the ddazureclients Kusto cluster - Classifying errors into Customer (4xx) / Azure Service / MCP Tool Bug - Correlating with merged PRs, releases, and known bug IDs - Producing an Outlook-compatible HTML report Includes: - SKILL.md with full procedure and error classification decision tree - references/kql-queries.md with all 10 production KQL queries - assets/report-template.html with Outlook-compatible HTML template * azurebackup: Add 'add-tool' skill for new command workflow Add a Copilot skill that guides the full lifecycle of adding a new Azure Backup MCP command: - Phase 1: Implementation (options, service, command, registration) - Phase 2: Input validation checklist - Phase 3: Unit tests with required test patterns - Phase 4: Live tests with Record/Playback workflow - Phase 5: CI validation (build, format, spell check) - Phase 6: ToolDescriptionEvaluator scoring - Phase 7: Documentation updates - Phase 8: PR submission checklist * azurebackup: Fix telemetry report skill - BUG-1 status and test count methodology - BUG-1 (backup_status ArgumentNullException) is fixed in #2518 (code included in merged PR, though PR title only lists BUG-3,4,5,6,8) - Fix unit test count methodology: count [Fact] + [InlineData] for runnable test cases, not [Fact] + [Theory] for method count - Add Release column to bug tracker table - Add note about verifying PR diff vs title for bug coverage * azurebackup: Address all 12 PR review comments on skills Telemetry report skill (SKILL.md + kql-queries.md): - Add case-sensitivity note for Kusto bag key access (#6) - Fix week-over-week query range to include current partial week (#7) - Add InvalidOperationException to MCP Tool Bug classification (#8) - Add missing KQL queries 11-13 for WorkloadType, DatasourceType, OperationScope custom dimensions (#9) - Fix live test count to use [Fact] only, avoiding double-count with [LiveTestOnly] on same method (#5) - Update decision tree to include InvalidOperationException Add-tool skill (SKILL.md): - Fix broken relative link to new-command.md (5 levels up, not 3) (#1) - Fix live test snippet to use [Fact] + CallToolAsync pattern, not non-existent [RecordedTest]/RunCommandAndAssert (#2, #11) - Add missing Update-AzCommandsMetadata.ps1 step after docs update (#3) - Add AOT/native build verification step (Build-Local.ps1 -BuildNative) (#4) - Fix vault/vaultType to reference AzureBackupOptionDefinitions, not OptionDefinitions.Common (#10) - Expand git add scope to include README.md and eng/vscode/README.md (#12) - Add commands metadata regeneration and AOT build to PR checklist * azurebackup: Convert relative links to absolute paths for CI link checker The link verification step requires absolute paths from repo root, not relative paths. Convert all ./references/ and ../../../ links to /tools/Azure.Mcp.Tools.AzureBackup/skills/... format. * azurebackup: Use full GitHub URLs for CI link checker Link checker requires https://github.com/microsoft/mcp/blob/main/... format, not repo-root-relative paths. * azurebackup: Add InvalidOperationException to Query 1, replace Power BI URLs with placeholder - Query 1 (primary per-tool health) was missing InvalidOperationException in the McpToolBug case, causing inconsistency with Queries 2 and 4 - Replace hardcoded Power BI dashboard URLs in report template with {{POWERBI_DASHBOARD_URL}} placeholder (public repo, internal GUIDs)
…unit tests (#2621) * azurebackup: Fix 4 telemetry-triaged bugs (BUG-1,3,6,NEW-2) BUG-1 (P0): Add null guard inside MapArmResourceTypeToBackupDataSourceType - Accept string? parameter with IsNullOrEmpty check - Prevents ArgumentNullException from BackupDataSourceType.op_Implicit(null) - Still hit on beta.8/beta.9 despite PR #2518 caller-side guard BUG-3 (P1): Relax ArmResourceTypeRegex to allow nested resource types - Changed pattern from single-segment to multi-segment: (/[A-Za-z0-9]+)+ - Allows types like Microsoft.Sql/servers/databases - Previously rejected valid ARM types with ArgumentException BUG-6 (P0): Remove ValidateSubscriptionFormat and all 19 call sites - Method rejected subscription names (non-GUID) with FormatException - Violated MCP convention that --subscription accepts names and IDs - Subscription resolution handled by ISubscriptionService in base class NEW-2 (P1): Add FormatException handling in DppBackupOperations.ListJobsAsync - Azure SDK throws FormatException during job deserialization when XmlConvert.ToTimeSpan encounters non-standard ISO 8601 durations - GetJobAsync's fallback to ListJobsAsync was circular (both crash) - Now catches FormatException during enumeration, returns partial results * azurebackup: Fix telemetry tags not emitting (null value = no-op) Activity.AddTag(key, null) is a no-op in .NET - the tag is silently dropped. Since most users don't pass --vault-type (auto-detect), NormalizeVaultType returned null, and AddVaultTags became a no-op. Fix: Return 'auto' instead of null for unspecified vault types, and 'unspecified' for workload types. This ensures the tag is always emitted to the Activity and exported to Kusto/App Insights. * azurebackup: Add unit tests for AzureBackupTelemetryTags 17 new tests covering: - NormalizeVaultType: null/empty -> 'auto', case normalization - NormalizeWorkloadType: null/empty -> 'unspecified', case normalization - AddVaultTags: null activity safety, null vaultType -> 'auto' tag - AddVaultAndWorkloadTags: both tags emitted with correct values - Tag constants: verify azurebackup/* prefix
…unit tests (microsoft#2621) * azurebackup: Fix 4 telemetry-triaged bugs (BUG-1,3,6,NEW-2) BUG-1 (P0): Add null guard inside MapArmResourceTypeToBackupDataSourceType - Accept string? parameter with IsNullOrEmpty check - Prevents ArgumentNullException from BackupDataSourceType.op_Implicit(null) - Still hit on beta.8/beta.9 despite PR microsoft#2518 caller-side guard BUG-3 (P1): Relax ArmResourceTypeRegex to allow nested resource types - Changed pattern from single-segment to multi-segment: (/[A-Za-z0-9]+)+ - Allows types like Microsoft.Sql/servers/databases - Previously rejected valid ARM types with ArgumentException BUG-6 (P0): Remove ValidateSubscriptionFormat and all 19 call sites - Method rejected subscription names (non-GUID) with FormatException - Violated MCP convention that --subscription accepts names and IDs - Subscription resolution handled by ISubscriptionService in base class NEW-2 (P1): Add FormatException handling in DppBackupOperations.ListJobsAsync - Azure SDK throws FormatException during job deserialization when XmlConvert.ToTimeSpan encounters non-standard ISO 8601 durations - GetJobAsync's fallback to ListJobsAsync was circular (both crash) - Now catches FormatException during enumeration, returns partial results * azurebackup: Fix telemetry tags not emitting (null value = no-op) Activity.AddTag(key, null) is a no-op in .NET - the tag is silently dropped. Since most users don't pass --vault-type (auto-detect), NormalizeVaultType returned null, and AddVaultTags became a no-op. Fix: Return 'auto' instead of null for unspecified vault types, and 'unspecified' for workload types. This ensures the tag is always emitted to the Activity and exported to Kusto/App Insights. * azurebackup: Add unit tests for AzureBackupTelemetryTags 17 new tests covering: - NormalizeVaultType: null/empty -> 'auto', case normalization - NormalizeWorkloadType: null/empty -> 'unspecified', case normalization - AddVaultTags: null activity safety, null vaultType -> 'auto' tag - AddVaultAndWorkloadTags: both tags emitted with correct values - Tag constants: verify azurebackup/* prefix
…elemetry skill (#2659) * azurebackup: Fix KQL classification order and update bug tracker in telemetry skill The KQL case() expression in kql-queries.md checked StatusCode 400-499 before exception type, which masked MCP tool bugs (ArgumentNullException, ArgumentException, FormatException) as Customer errors when HandleException mapped them to HTTP 400. This caused the weekly telemetry report to undercount MCP bugs by ~33x (1 vs 34 in the May 11-17 report). Changes: - kql-queries.md: Reorder case() in queries 1, 2, and 4 to check MCP bug exception types (FormatException, ArgumentNullException, ArgumentException, InvalidOperationException) BEFORE StatusCode 400-499 - SKILL.md: Update decision tree to match the classification table (check exception type before status code), add explanatory note about HandleException mapping ArgumentNullException to HTTP 400 - SKILL.md: Update known bug table with current state: - BUG-1: Mark #2518 as partial fix, add #2621 as defense-in-depth - BUG-3: Mark #2518 as partial, add #2621 relaxed regex - Add NEW-2 (job_get FormatException in DPP SDK, azure-sdk-for-net#59306) - Update notes about incomplete #2518 fixes confirmed on beta.8/beta.9 * Address PR review: fix decision tree nesting and use 'pending' for unreleased versions - Add missing NO branch under StatusCode check in decision tree for proper nesting (Copilot review comment 1) - Replace 'beta.11 (pending)' with 'pending' in bug tracker Release column to avoid presuming version number (Copilot review comment 2)
Summary
Fixes 6 telemetry-triaged bugs + 3 proactive DPP hardening fixes identified via Kusto telemetry (cluster
ddazureclients/AzureDevExp, functiongetAzureMcpEvents_ToolCalls). These bugs caused 75% overall success rate (39 failures / 156 calls across 13 users in 30 days). All 5 tools with 100% failure rate are now fixed.Bugs Fixed
BUG-1:
backup_status— 57% failure rate (4/7 calls)GetBackupStatusAsync: try-catch onnew ResourceIdentifier(datasourceId)+ null-safeResourceTypeaccess. PreventsArgumentNullExceptionwhen ARM resource type is null/unsupported.BUG-3:
protectableitem_list— 100% failure rate (3/3 calls)NormalizeWorkloadTypeForFilter: Added VM/IaaSVM/VirtualMachine, FileShare/AzureFileShare/AFS, SAPAse/SAPAseDatabase/ASE/Sybase. Mapped toSAPAseDatabase(matchingRsvDatasourceRegistry.ApiWorkloadType). Error message lists all accepted tokens with aliases.BUG-4:
governance_soft-delete— 100% failure rate (3/3 calls)Full Vault PATCH migration:
ConfigureSoftDeleteAsyncmigrated from deprecatedBackupResourceVaultConfigAPI toRecoveryServicesSoftDeleteSettings/RecoveryServicesSoftDeleteState(SDK 1.2.0 types). No workarounds.BUG-5:
disasterrecovery_enable-crr— 100% failure rate (2/2 calls)Full Vault PATCH migration:
ConfigureCrossRegionRestoreAsyncmigrated from deprecatedBackupResourceConfigAPI toVaultPropertiesRedundancySettings.CrossRegionRestore(SDK 1.2.0 types). No workarounds.BUG-6:
vault_get/protecteditem_get— FormatException crashesValidateSubscriptionFormat: reject non-GUID subscription IDs withArgumentExceptionfor proper 400 status. Applied to 7 public methods: GetVaultAsync, ListProtectedItemsAsync, GetPolicyAsync, GetJobAsync, UpdateVaultAsync, ConfigureSoftDeleteAsync, ListProtectableItemsAsync.BUG-8:
recoverypoint_get— NullReferenceExceptionNull-safety in
ResolveProtectedItemContainerAsyncandMatchesFriendlyNamefor items with nullNameproperty.Proactive DPP Hardening (3 additional fixes)
GetProtectedItemAsync—i.Name.Equals()without null checkProtectItemAsync— fail-fastArgumentExceptionwhenResourceIdentifierorResourceTypeis unparseable and--datasource-typeis not providedResourceTypeusage is guardedLive Test Fixes
Assert.Equal("Succeeded", ...)Files Changed (4 files)
AzureBackupService.csGetBackupStatusAsyncRsvBackupOperations.csDppBackupOperations.csAzureBackupCommandTests.csTest Results
protectableitem_list3/3 rank Update README.md #1,governance_soft-delete2/3 rank Update README.md #1Invoking Livetests
Copilot submitted PRs are not trustworthy by default. Users with
writeaccess to the repo need to validate the contents of this PR before leaving a comment with the text/azp run mcp - pullrequest - live. This will trigger the necessary livetest workflows to complete required validation.