Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 32 additions & 7 deletions nexus/tests/integration_tests/crucible_replacements.rs
Original file line number Diff line number Diff line change
Expand Up @@ -546,9 +546,12 @@ mod region_replacement {
//
// If `wait_for_request_state` has the same expected start and end
// state (as it does above), it's possible to exit that function
// having not yet started the saga yet, and this requires an
// additional `wait_for_condition` to wait for the expected recorded
// step.
// having not yet started the saga yet. To check the saga has
// started, we'll wait for the replacement request step to be
// created by it. If the saga is still in progress, the step may be
// recorded while the saga is still `Driving`, so we wait for the
// saga to return to `Running` as evidence the step has been fully
// driven.

let most_recent_step = wait_for_condition(
|| {
Expand All @@ -565,12 +568,34 @@ mod region_replacement {
.await
.unwrap()
{
Some(step) => Ok(step),
Some(step) => {
// The saga has either started or completed. To
// tell if it's completed and we can move on,
// check on the replacement state again. If
// we're not `Running`, the saga is still in
// progress.
let state = datastore
.get_region_replacement_request_by_id(
&opctx,
replacement_request_id,
)
.await
.unwrap()
.replacement_state;

if state == RegionReplacementState::Running {
Ok(step)
} else {
// The replacement step is in progress, but
// not done yet. We're still waiting, but
// probably not for long.
Err(CondCheckError::<()>::NotYet)
Comment on lines +589 to +592
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having said a whole lot of words, the real critical point would to be able to see this somehow. I tried to coax evidence of that:

  • added a panic!("race hit!") here locally
  • ran the test as while true; do pbind -e 1 -- cargo nextest run test_delete_volume_region_replacement_state_replacement_done; done
  • in another shell, ran pbind -e 1 bash -c 'while true; do :; done

while this extremely slowed things down (.. more than the 50% I'd expected), I have yet to see the panic get hit.

the possibility that simply calling get_region_replacement_request_by_id adds enough time to consistently resolve the race is not lost on me...

}
}

None => {
// The saga either has not started yet or is
// still running - see the comment before this
// check for more info.
// The saga has not started, so we're not done
// waiting.
Err(CondCheckError::<()>::NotYet)
}
}
Expand Down
Loading