test_repair.sh needs to support new async upstairs #466

Closed
leftwo opened this issue Sep 30, 2022 · 8 comments

leftwo commented Sep 30, 2022

    Note: I've just seen
    Error: Initial volume verify failed: Error in range 500 -> 600
    Exit on verify fail, loop: 0, choice: 0
    Check /tmp/verify_out.txt for details

    locally from test_repair.sh, 👎

Originally posted by @jmpesp in #464 (comment)

leftwo commented Sep 30, 2022

I'm seeing the same thing from a stock workspace on main.

Looking into this test now and trying to determine if it is broken, or if the new async stuff did something.

leftwo self-assigned this Sep 30, 2022

jmpesp commented Oct 1, 2022

It's almost certainly the use of crucible::join_all in repair_workload - as of the async PR we don't make guarantees about submission order anymore, and the futures list in that function contains a mix of flushes, reads, and writes.
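
For illustration, a minimal sketch of that pattern (not Crucible's code; the io_op helper, labels, and latencies are made up, and it assumes the tokio and futures crates): when a mixed list of IO futures is handed to join_all, nothing orders the flush after the writes pushed before it.

    // Toy example, not Crucible code: io_op stands in for a guest write,
    // read, or flush, and the sleep mimics variable IO latency.
    use futures::future::join_all;
    use std::time::Duration;

    async fn io_op(label: &'static str, latency_ms: u64) {
        tokio::time::sleep(Duration::from_millis(latency_ms)).await;
        // Completion order follows latency, not the order the futures
        // were pushed onto the list.
        println!("completed: {label}");
    }

    #[tokio::main]
    async fn main() {
        let futures = vec![
            io_op("write block 0", 30),
            io_op("write block 1", 20),
            io_op("flush", 5), // pushed last, but typically finishes first
        ];
        // join_all polls everything concurrently; nothing guarantees the
        // flush is ordered after the writes that precede it in the list.
        join_all(futures).await;
    }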

jmpesp commented Oct 1, 2022

Interestingly, if I dump when there's an error, I see all green:

[Screenshot: Screenshot_2022-09-30_20-38-27, dump output showing all regions green]

leftwo commented Oct 1, 2022

Yeah, after the repair is done it will be all green. It's just the wrong green.

leftwo commented Oct 1, 2022

Yeah, if the work is not actually submitted to the upstairs side when we call

let future = guest.flush(None);

then this test will not give us valid results, because it basically creates a bunch of work, then waits for all the acks to come back. Before, we did have order; now we don't.

I can see that the flush does not happen when the test expects it to, so this test will need a re-write.

The old behavior submitted a job (meaning its order was locked in) even if you did not block_wait on that job.
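
A minimal sketch of that point (generic Rust with an assumed tokio dependency, not the Crucible guest API): creating the future does nothing by itself, and the work only starts once it is awaited or otherwise polled.

    use std::time::Duration;

    // Stand-in for something like guest.flush(None); the body is made up.
    async fn submit_flush() {
        println!("flush submitted");
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("flush acked");
    }

    #[tokio::main]
    async fn main() {
        // Nothing is printed here: the future is only constructed.
        let future = submit_flush();

        println!("other work happens while the flush sits unsubmitted");

        // Only now does "flush submitted" appear, followed by the ack.
        future.await;
    }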

leftwo commented Oct 1, 2022

So, with the new async behavior, we should:

  1. Turn this test off for now in CI
  2. Come up with a test of a similar nature

The idea/goal for the test_repair.sh test is to have a bunch of IO going to the downstairs, interrupt things, and then verify that a repair will "fix" whatever does not match between the downstairs. Because during good times the three downstairs all work at the same speed, it's hard to produce a case where they actually have different data. To help create an out-of-date downstairs, we can start one of the three downstairs with "--lossy", which reduces the rate at which it completes IO and helps create the scenario we are looking to test.

Here is the rough high level test flow. The test starts with all downstairs up and does an initial fill to all blocks.
Then, the loop begins:

  • Restart one downstairs at random with --lossy.
  • Do some mix of IOs, and be sure the last few are not a flush.
  • Wait for the ACK on all IOs.
  • Hard stop the upstairs.

At this point the result should be one downstairs that, because of --lossy, did not complete all the work assigned to it. The test then continues:

  • Restart that downstairs without --lossy.
  • Start the upstairs (which will do a repair).
  • Read and verify the whole volume, making sure that every block contains data from at least the last flush.

The key here is to know in what order the jobs were completed, and to use that completion order to determine:

  • What the last write + flush was.
  • What the final write was.

The data for each block should then fall in that range (see the sketch after this list).
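
A rough sketch of that check (hypothetical names, not crutest's actual verify code): the value observed in a block after repair has to be at least as new as the last write covered by a flush, and no newer than the final write issued.

    // Illustrative only: observed, last_flushed_write, and final_write are
    // hypothetical per-block write counters, not crutest fields.
    fn block_is_valid(observed: u64, last_flushed_write: u64, final_write: u64) -> bool {
        observed >= last_flushed_write && observed <= final_write
    }

    fn main() {
        // Say write 7 was the last one covered by a flush and write 9 was the final write.
        assert!(block_is_valid(8, 7, 9));  // repaired to an unflushed-but-issued write: ok
        assert!(!block_is_valid(5, 7, 9)); // older than the last flush: repair failed
    }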

I think the only way crutest can do this, without a refactor of the IO logging, is if it waits on the ack from each IO before issuing the next. It still might produce the outcome we desire as lossy is pretty slow.
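
One possible shape for that (a hypothetical sketch assuming tokio, not the actual crutest change): await each IO's ack before issuing the next, so the logged order is also the completion order.

    use std::time::Duration;

    // Toy stand-in for one guest IO; labels and latencies are made up.
    async fn io_op(label: &'static str, latency_ms: u64) {
        tokio::time::sleep(Duration::from_millis(latency_ms)).await;
        println!("acked: {label}");
    }

    #[tokio::main]
    async fn main() {
        // Each IO is submitted and acked before the next one is created,
        // trading concurrency for a known order.
        for (label, latency) in [("write block 0", 30), ("write block 1", 20), ("flush", 5)] {
            io_op(label, latency).await;
        }
    }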

leftwo commented Oct 2, 2022

Proposed fix
In testing now.

leftwo changed the title from "test_repair.sh is failing" to "test_repair.sh needs to support new async upstairs" Oct 3, 2022

leftwo commented Oct 3, 2022

Fixed with #467.

leftwo closed this as completed Oct 3, 2022