test_repair.sh needs to support new async upstairs #466

Closed
leftwo opened this issue Sep 30, 2022 · 8 comments

leftwo commented Sep 30, 2022

    Note: I've just seen
    Error: Initial volume verify failed: Error in range 500 -> 600
    Exit on verify fail, loop: 0, choice: 0
    Check /tmp/verify_out.txt for details

    locally from test_repair.sh, 👎

Originally posted by @jmpesp in #464 (comment)

leftwo commented Sep 30, 2022

I'm seeing the same thing from a stock workspace on main.

Looking into this test now and trying to determine if it is broken, or if the new async stuff did something.

leftwo self-assigned this Sep 30, 2022

jmpesp commented Oct 1, 2022

It's almost certainly the use of crucible::join_all in repair_workload - as of the async PR we don't make guarantees about submission order anymore, and the futures list in that function contains a mix of flushes, reads, and writes.
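
For illustration, a minimal sketch of that pattern (not Crucible's code; the io_op helper, labels, and latencies are made up, and it assumes the tokio and futures crates): when a mixed list of IO futures is handed to join_all, nothing orders the flush after the writes pushed before it.

    // Toy example, not Crucible code: io_op stands in for a guest write,
    // read, or flush, and the sleep mimics variable IO latency.
    use futures::future::join_all;
    use std::time::Duration;

    async fn io_op(label: &'static str, latency_ms: u64) {
        tokio::time::sleep(Duration::from_millis(latency_ms)).await;
        // Completion order follows latency, not the order the futures
        // were pushed onto the list.
        println!("completed: {label}");
    }

    #[tokio::main]
    async fn main() {
        let futures = vec![
            io_op("write block 0", 30),
            io_op("write block 1", 20),
            io_op("flush", 5), // pushed last, but typically finishes first
        ];
        // join_all polls everything concurrently; nothing guarantees the
        // flush is ordered after the writes that precede it in the list.
        join_all(futures).await;
    }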

jmpesp commented Oct 1, 2022

Interestingly, if I dump when there's an error, I see all green:

[Screenshot: Screenshot_2022-09-30_20-38-27, dump output showing all regions green]

leftwo commented Oct 1, 2022

Yeah, after the repair is done it will be all green. It's just the wrong green.

leftwo commented Oct 1, 2022

Yeah, if the work is not actually submitted to the upstairs side when we call

let future = guest.flush(None);

then this test will not give us valid results, because it basically creates a bunch of work, then waits for all the acks to come back. Before, we did have order; now we don't.

I can see that the flush does not happen when the test expects it to, so this test will need a re-write.

The old behavior submitted a job (meaning its order was locked in) even if you did not block_wait on that job.
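
A minimal sketch of that point (generic Rust with an assumed tokio dependency, not the Crucible guest API): creating the future does nothing by itself, and the work only starts once it is awaited or otherwise polled.

    use std::time::Duration;

    // Stand-in for something like guest.flush(None); the body is made up.
    async fn submit_flush() {
        println!("flush submitted");
        tokio::time::sleep(Duration::from_millis(10)).await;
        println!("flush acked");
    }

    #[tokio::main]
    async fn main() {
        // Nothing is printed here: the future is only constructed.
        let future = submit_flush();

        println!("other work happens while the flush sits unsubmitted");

        // Only now does "flush submitted" appear, followed by the ack.
        future.await;
    }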

leftwo commented Oct 1, 2022

So, with the new async behavior, we should:

  1. Turn this test off for now in CI
  2. Come up with a test of a similar nature

The idea/goal for the test_repair.sh test is to have a bunch of IO going to the downstairs, interrupt things, and then verify that a repair will "fix" whatever does not match between the downstairs. Because during good times the three downstairs all work at the same speed, it's hard to produce a case where they actually have different data. To help create an out-of-date downstairs, we can start one of the three downstairs with "--lossy", which reduces the rate at which it completes IO and helps create the scenario we are looking to test.

Here is the rough high level test flow. The test starts with all downstairs up and does an initial fill to all blocks.
Then, the loop begins:

  • Restart one downstairs at random with --lossy.
  • Do some mix of IOs, and be sure the last few are not a flush.
  • Wait for the ACK on all IOs.
  • Hard stop the upstairs.

At this point the result should be one downstairs that, because of --lossy, did not complete all the work assigned to it. The test then continues:

  • Restart that downstairs without --lossy.
  • Start the upstairs (which will do a repair).
  • Read and verify the whole volume, making sure that every block contains data from at least the last flush.

The key here is to know in what order the jobs were completed, and to use that completion order to determine:

  • What the last write + flush was.
  • What the final write was.

The data for each block should then fall in that range (see the sketch after this list).
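
A rough sketch of that check (hypothetical names, not crutest's actual verify code): the value observed in a block after repair has to be at least as new as the last write covered by a flush, and no newer than the final write issued.

    // Illustrative only: observed, last_flushed_write, and final_write are
    // hypothetical per-block write counters, not crutest fields.
    fn block_is_valid(observed: u64, last_flushed_write: u64, final_write: u64) -> bool {
        observed >= last_flushed_write && observed <= final_write
    }

    fn main() {
        // Say write 7 was the last one covered by a flush and write 9 was the final write.
        assert!(block_is_valid(8, 7, 9));  // repaired to an unflushed-but-issued write: ok
        assert!(!block_is_valid(5, 7, 9)); // older than the last flush: repair failed
    }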

I think the only way crutest can do this, without a refactor of the IO logging, is if it waits on the ack from each IO before issuing the next. It still might produce the outcome we desire as lossy is pretty slow.
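
One possible shape for that (a hypothetical sketch assuming tokio, not the actual crutest change): await each IO's ack before issuing the next, so the logged order is also the completion order.

    use std::time::Duration;

    // Toy stand-in for one guest IO; labels and latencies are made up.
    async fn io_op(label: &'static str, latency_ms: u64) {
        tokio::time::sleep(Duration::from_millis(latency_ms)).await;
        println!("acked: {label}");
    }

    #[tokio::main]
    async fn main() {
        // Each IO is submitted and acked before the next one is created,
        // trading concurrency for a known order.
        for (label, latency) in [("write block 0", 30), ("write block 1", 20), ("flush", 5)] {
            io_op(label, latency).await;
        }
    }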

leftwo commented Oct 2, 2022

Proposed fix
In testing now.

leftwo changed the title from "test_repair.sh is failing" to "test_repair.sh needs to support new async upstairs" Oct 3, 2022

leftwo commented Oct 3, 2022

Fixed with #467.

leftwo closed this as completed Oct 3, 2022