Migrate: Improved Error Handling on Failed SetSlotRange #1653
Merged
vazois merged 8 commits into microsoft:main on Apr 3, 2026
Pull request overview
This PR refactors cluster slot-migration error handling by simplifying how TrySetSlotRanges waits for SetSlotRange completion, adding more contextual logging, and ensuring migration failures consistently transition to MigrateState.FAIL. It also adds a new cluster migration test intended to exercise the TrySetSlotRanges code path during a full slot migration.
Changes:
- Refactor `MigrateSession.TrySetSlotRanges` to avoid `ContinueWith(...).WaitAsync().Result` and improve logging and failure-state handling.
- Add dedicated cancellation/timeout handling around the `SetSlotRange` wait.
- Add a new end-to-end cluster migration test that migrates a slot and validates key/data/ownership on the destination.
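The waiting pattern described in this list can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `MigrateState` enum here is a local stand-in, and `WaitForCompletion`, the `Task<string>` shape, and the `"OK"` response check are all hypothetical names and assumptions chosen for the example. It relies only on `Task.WaitAsync(TimeSpan, CancellationToken)` (.NET 6 and later), which signals an elapsed timeout with `TimeoutException` and a canceled token with an `OperationCanceledException`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the session state; not Garnet's actual type.
public enum MigrateState { SUCCESS, FAIL }

public static class SetSlotRangeWaitSketch
{
    // Blocks on the task with a timeout and a cancellation token, and maps
    // every failure mode to FAIL instead of leaving the state untouched.
    // The "OK" response string is an assumption made for this sketch.
    public static MigrateState WaitForCompletion(
        Task<string> task, TimeSpan timeout, CancellationToken token)
    {
        try
        {
            // GetAwaiter().GetResult() rethrows the original exception
            // rather than wrapping it in an AggregateException.
            string response = task.WaitAsync(timeout, token).GetAwaiter().GetResult();
            return response == "OK" ? MigrateState.SUCCESS : MigrateState.FAIL;
        }
        catch (OperationCanceledException)
        {
            // The token was canceled (TaskCanceledException derives from this type).
            return MigrateState.FAIL;
        }
        catch (TimeoutException)
        {
            // WaitAsync reports an elapsed timeout with TimeoutException.
            return MigrateState.FAIL;
        }
        catch (Exception)
        {
            // Any other failure of the underlying task.
            return MigrateState.FAIL;
        }
    }
}
```

Keeping one handler per failure mode lets each path log its own context (timeout vs. cancellation vs. unexpected error) while still ending in the same failure state.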
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| libs/cluster/Server/Migration/MigrateSession.cs | Refactors TrySetSlotRanges waiting/error handling and augments logs; ensures FAIL status on more error paths. |
| test/Garnet.test.cluster/ClusterMigrateTests.cs | Adds a new migration resilience test validating data/ownership after slot migration. |
vazois reviewed on Mar 31, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
vazois approved these changes on Apr 2, 2026
Improve error handling in `TrySetSlotRanges` during slot migration
Summary
Refactors `MigrateSession.TrySetSlotRanges` to replace the `.ContinueWith(...).WaitAsync().Result` async pattern with a more traceable and debuggable implementation. Adds explicit timeout/cancellation handling and ensures `MigrateState.FAIL` is consistently set on all error paths.
Motivation
The previous implementation used `ContinueWith(TaskContinuationOptions.OnlyOnRanToCompletion)` chained with `.WaitAsync().Result`. This pattern had several issues:
- When the `SetSlotRange` task did not run to completion, the `OnlyOnRanToCompletion` continuation never ran, and the resulting `TaskCanceledException` was caught by the generic `catch (Exception)` block, which did not set `Status = MigrateState.FAIL`, leaving the migration in an indeterminate state.
- Failures were logged only as "An error occurred", with no context about which slots were affected or whether the failure was a timeout vs. an unexpected error.
Changes
MigrateSession.cs
- Replaced `.ContinueWith(...).WaitAsync(_timeout, _cts.Token).Result` with direct `task.WaitAsync(_timeout, _cts.Token).GetAwaiter().GetResult()`
- Added a `catch (TaskCanceledException)` handler for timeout/cancellation scenarios
- Added `catch (AggregateException aex) when (aex.InnerException is TaskCanceledException)` for wrapped timeout exceptions
- Set `Status = MigrateState.FAIL` in all error paths (the old generic `catch` missed this)
- Added logging around the `SETSLOTRANGE` call for better migration observability
ClusterMigrateTests.cs
- Added a `ClusterMigrateSetSlotRangeResilience` test that exercises the `TrySetSlotRanges` code path through a full slot migration (the migration triggers `TrySetSlotRanges` for the IMPORTING and NODE states)
Testing
- `ClusterMigrateSetSlotRangeResilience` passes
Fixes #1655
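The behavior described under Motivation can be reproduced in isolation. The following is a small standalone sketch, not code from the PR; `ContinuationDemo` and both method names are hypothetical. It shows that blocking on an `OnlyOnRanToCompletion` continuation of a faulted task surfaces a `TaskCanceledException` wrapped in an `AggregateException` (hiding the original error), while a direct `GetAwaiter().GetResult()` rethrows the original exception.

```csharp
using System;
using System.Threading.Tasks;

public static class ContinuationDemo
{
    // Old pattern: the continuation is canceled because its antecedent
    // faulted, so .Result reports a cancellation, not the real error.
    public static string ObserveOldPattern()
    {
        Task<int> failing = Task.FromException<int>(new InvalidOperationException("boom"));
        Task<int> continuation = failing.ContinueWith(
            t => t.Result, TaskContinuationOptions.OnlyOnRanToCompletion);
        try
        {
            _ = continuation.Result;
            return "none";
        }
        catch (AggregateException aex)
        {
            // Name of the wrapped exception type.
            return aex.InnerException?.GetType().Name ?? "null";
        }
    }

    // Direct wait: the original exception is rethrown unwrapped.
    public static string ObserveDirectWait()
    {
        Task<int> failing = Task.FromException<int>(new InvalidOperationException("boom"));
        try
        {
            _ = failing.GetAwaiter().GetResult();
            return "none";
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }
}
```

Here `ObserveOldPattern()` returns `"TaskCanceledException"` and `ObserveDirectWait()` returns `"InvalidOperationException"`, which is why the direct wait makes failures easier to classify and log.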