Skip to content

Migrate: Improved Error Handling on Failed SetSlotRange#1653

Merged
vazois merged 8 commits intomicrosoft:mainfrom
kevinmichaelbowersox:users/kbowersox/verbose-migration-handling
Apr 3, 2026
Merged

Migrate: Improved Error Handling on Failed SetSlotRange#1653
vazois merged 8 commits intomicrosoft:mainfrom
kevinmichaelbowersox:users/kbowersox/verbose-migration-handling

Conversation

@kevinmichaelbowersox
Copy link
Copy Markdown
Member

@kevinmichaelbowersox kevinmichaelbowersox commented Mar 31, 2026

Improve error handling in TrySetSlotRanges during slot migration

Summary

Refactors MigrateSession.TrySetSlotRanges to replace .ContinueWith(...).WaitAsync().Result async pattern with a more traceable and debuggable implementation. Adds explicit timeout/cancellation handling and ensures MigrateState.FAIL is consistently set on all error paths.

Motivation

The previous implementation used ContinueWith(TaskContinuationOptions.OnlyOnRanToCompletion) chained with .WaitAsync().Result. This pattern had several issues:

  • Silent failures: If the task faulted or was canceled, the OnlyOnRanToCompletion continuation never ran, and the resulting TaskCanceledException was caught by the generic catch (Exception) block — which did not set Status = MigrateState.FAIL, leaving the migration in an indeterminate state.
  • Poor diagnostics: The generic catch logged "An error occurred" with no context about which slots were affected or whether the failure was a timeout vs. an unexpected error.
  • Unnecessary indirection: Wrapping the result check in a continuation added complexity without benefit.

Changes

MigrateSession.cs

  • Replace .ContinueWith(...).WaitAsync(_timeout, _cts.Token).Result with direct task.WaitAsync(_timeout, _cts.Token).GetAwaiter().GetResult()
  • Add dedicated catch (TaskCanceledException) handler for timeout/cancellation scenarios
  • Add catch (AggregateException aex) when (aex.InnerException is TaskCanceledException) for wrapped timeout exceptions
  • Set Status = MigrateState.FAIL in all error paths (the old generic catch missed this)
  • Add structured log messages that include the affected slot ranges and timeout values
  • Add trace logging before the SETSLOTRANGE call for better migration observability

ClusterMigrateTests.cs

  • Add ClusterMigrateSetSlotRangeResilience test that exercises the TrySetSlotRanges code path through a full slot migration:
    • Sets up a 2-node cluster and creates 50 keys in a slot
    • Migrates the slot from source to target (internally calls TrySetSlotRanges for IMPORTING and NODE states)
    • Verifies key count and data integrity on the target node
    • Verifies slot ownership transferred correctly

Testing

  • New test ClusterMigrateSetSlotRangeResilience passes
  • Existing migration tests remain green

Fixes #1655 1655

Copilot AI review requested due to automatic review settings March 31, 2026 18:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors cluster slot-migration error handling by simplifying how TrySetSlotRanges waits for SetSlotRange completion, adding more contextual logging, and ensuring migration failures consistently transition to MigrateState.FAIL. It also adds a new cluster migration test intended to exercise the TrySetSlotRanges code path during a full slot migration.

Changes:

  • Refactor MigrateSession.TrySetSlotRanges to avoid ContinueWith(...).WaitAsync().Result and improve logging and failure-state handling.
  • Add dedicated cancellation/timeout handling around the SetSlotRange wait.
  • Add a new end-to-end cluster migration test that migrates a slot and validates key/data/ownership on the destination.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
libs/cluster/Server/Migration/MigrateSession.cs Refactors TrySetSlotRanges waiting/error handling and augments logs; ensures FAIL status on more error paths.
test/Garnet.test.cluster/ClusterMigrateTests.cs Adds a new migration resilience test validating data/ownership after slot migration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@vazois vazois self-requested a review March 31, 2026 18:55
@vazois vazois merged commit 144477c into microsoft:main Apr 3, 2026
47 of 49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slot migration fails silently when SetSlotRange times out or faults

3 participants