Fog 393 view server fixes #11
in the future, we should use this function in ingest instead of duping this logic
it's not used in this revision though, I think; I ended up not using it. we could remove it for now
Looks great so far! It is the type of code that is prone to off-by-one/edge case issues so the more unit tests we can get, the more confident we can be in it. I haven't noticed anything obviously wrong with it.
fog/view/server/src/db_fetcher.rs
Outdated
```rust
sleep(Duration::from_secs(1)); // Supposedly enough time for at least some blocks to get picked up.

assert!(db_fetcher.get_pending_fetched_records().is_empty());

// Report a missing block range.
db.report_missed_block_range(&BlockRange::new(20, 30))
// TODO The last scanned block index gets updated even though we have a hole.
```
@garbageslam I wonder what your thoughts on this are. There is a weird case where if somehow some blocks were skipped, `last_scanned_block` and `highest_known_block_index` would still go up. I think this is fine, but wanted to check with you since this differs from the previous behavior, I think.
i think this is okay -- i think ingest only actually scans consecutive sequences.
if we wanted to change this, we could make the database enforce that consecutive sequences are scanned.
i don't think this is the wrong behavior for the view server to have in this scenario, i think if we want to change this, we need to change it in the database
maybe it's bad because, if there are gaps like this, reporting the key as lost won't fix it?
idk, i can think of a few ways to try to fix this in the database if it actually happened, but i don't think it will actually happen because i think ingest asserts that it always processes consecutive blocks
fog/view/server/src/db_fetcher.rs
Outdated
```rust
assert!(missing_block_ranges.is_empty());

// Report a missing block range. This should have no effect.
db.report_missed_block_range(&BlockRange::new(30, 40))
```
The current version doesn't skip blocks that overlap with missed block ranges; instead, only the pubkey expiry controls when scanning stops for a given key.
I want to check my understanding, @garbageslam is this flow correct?
- We have an active ingress key with pubkey_expiry = 100, and it has scanned all the way to block 50.
- Everything crashes and by the time we restarted, the network moved to block 75.
- A new ingress key is created with start=75
- We report missed block range 50-100, since we did advertise that we will scan until block 100 but failed
- We set retired=true for the first ingress key, and this causes the view server to stop trying to load blocks from it
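The flow above can be sketched in Rust. Everything here (the `IngressKeyRecord` fields, the `missed_range` helper, and the exact boundary conventions) is a hypothetical simplification for illustration, not the real recovery DB schema:

```rust
// Hypothetical, simplified model of the crash-recovery flow above.
// Field names and boundary conventions are assumptions, not the real schema.
#[derive(Debug)]
struct IngressKeyRecord {
    start_block: u64,
    pubkey_expiry: u64,        // we advertised scanning up to (exclusive) here
    last_scanned: Option<u64>, // highest block actually scanned with this key
    retired: bool,
}

/// Blocks this key promised to scan but never will, once it stops scanning.
fn missed_range(key: &IngressKeyRecord) -> Option<(u64, u64)> {
    if !key.retired {
        return None; // still active: it may yet scan the remaining blocks
    }
    let next_unscanned = key.last_scanned.map(|b| b + 1).unwrap_or(key.start_block);
    if next_unscanned < key.pubkey_expiry {
        Some((next_unscanned, key.pubkey_expiry)) // half-open [start, end)
    } else {
        None
    }
}

fn main() {
    // Key scanned through block 50 with advertised expiry 100, then the crash.
    let mut key = IngressKeyRecord {
        start_block: 0,
        pubkey_expiry: 100,
        last_scanned: Some(50),
        retired: false,
    };
    assert_eq!(missed_range(&key), None); // nothing missed while still active
    key.retired = true;
    assert_eq!(missed_range(&key), Some((51, 100))); // report the tail as missed
}
```

Note that whether the reported range starts at 50 or 51 depends on whether `last_scanned` is inclusive, which is an assumption of this sketch.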
Hm, I'm looking at https://github.com/garbageslam/fog/blob/1d924ff2a0663b432fbee854915c65ee02fb3246/fog/view/server/src/block_tracker.rs#L30-L32 and it looks like it will keep returning block 51 as the next one even if we are retired, because it is lower than the pubkey expiry. @garbageslam what are your thoughts on that?
i think this behavior is correct. the reason is, if the new ingress key starts at 75, Bob may have an RNG and some transactions that were scanned by that ingest, say at blocks 77, 88, 99.
He may find those transactions by view key scanning, but if he later gets a TxOut in block 124, that will be the 4th output of the RNG. But from the point of view of the fog view protocol, he is still on the first output of his RNG. So even though those blocks are reported missing (because of a different key), fog view still needs to scan them, or else Bob will get blocked on rng value 4 and never find the TxOut in block 124.
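The sequencing constraint described here can be illustrated with a toy model. This is not the real fog view protocol or kex rng; the `matched_outputs` helper and string search keys are invented for illustration only:

```rust
// Toy model of why fog view must still scan "missed" blocks for a key:
// a client's RNG outputs are strictly sequential, so a TxOut that is the
// 4th output can only be found after the first 3 are matched.
// Invented for illustration; not the real fog view protocol.

/// Walk events in block order, matching each expected rng output in sequence.
/// Returns how many outputs were matched.
fn matched_outputs(expected: &[&str], events: &[(u64, &str)]) -> usize {
    let mut next = 0;
    for (_block, val) in events {
        if next < expected.len() && *val == expected[next] {
            next += 1;
        }
    }
    next
}

fn main() {
    // Bob's RNG will produce these search keys, in order.
    let expected = ["out1", "out2", "out3", "out4"];
    // Blocks 77, 88, 99 hold outputs 1-3; block 124 holds output 4.
    let all_blocks = [(77, "out1"), (88, "out2"), (99, "out3"), (124, "out4")];
    assert_eq!(matched_outputs(&expected, &all_blocks), 4);

    // If fog view skipped blocks below 100 (reported missed for another key),
    // Bob stays stuck on his first rng output and never finds block 124.
    let skipped: Vec<_> = all_blocks.iter().cloned().filter(|(b, _)| *b >= 100).collect();
    assert_eq!(matched_outputs(&expected, &skipped), 0);
}
```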
@garbageslam some notes about tests:
@eranrund as I'm working through these tests, I'm hitting some situations where neither choice really seems ideal. In this test for instance, the code previously had "expected_state" = empty, but I think it should be "we need to load block 123", because retired doesn't mean "lost" or "missing"; it just means the operator is trying to shut down this key. But we already published it, so it has been used, and we have to scan with it or report missing blocks.

As I'm working through these tests, I'm thinking that the best thing may be to fold into this PR a change I talked about and wanted to stage for later: make the ingress keys have a boolean flag "lost" in addition to "retired", and make this the only pathway to report missed blocks to the DB. When a key is lost, the range "last_scanned" -> "pubkey_expiry" is missed. Lost means there are no longer any blocks coming for this key. Then, next_blocks would take this field into account.

I think this would simplify the logic in the block tracker, so it may make it easier to develop this PR, develop a good body of tests for it, and build confidence in the change. But it also increases the scope of the PR. That's the route I'm thinking to take right now. LMK what you think
@garbageslam I think that sounds reasonable. What would be the 1-2 line description of the difference between "retired" and "lost"?

With regards to next_blocks() returning the start_block: the idea of it returning last_scanned_block and not start_block was to not have the view server (inside db_fetcher) try to fetch a block that was not scanned yet. If it defaults to the start_block, then it will repeatedly try to load it, even if we know (by last scanned block) that it never got written, which is suboptimal.
when retired is set, it means we are trying to retire: we are no longer publishing reports for the key, but we still may have the key and we may still be scanning the last few blocks with it. things may get retired during "normal operation" of the system, like if we do an enclave upgrade. retired doesn't mean the users need to download blocks. lost means, any blocks we were supposed to scan that we didn't scan are missed and need to be downloaded by the user. in both cases they are only set manually
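These two flags can be sketched as predicates. The names here are illustrative assumptions; the real flags live on the ingress key records in the recovery db:

```rust
// Illustrative model of the "retired" vs "lost" semantics described above.
// Names are assumptions, not the real schema.
struct KeyStatus {
    retired: bool, // winding down: no longer publishing reports for this key
    lost: bool,    // key is gone: no further blocks will ever be scanned with it
}

/// A retired key may still scan its last few blocks; a lost key never will.
fn may_still_scan(s: &KeyStatus) -> bool {
    !s.lost
}

/// Only a lost key turns its unscanned, promised blocks into missed blocks
/// that users must download themselves.
fn owes_missed_blocks(s: &KeyStatus) -> bool {
    s.lost
}

fn main() {
    let retiring = KeyStatus { retired: true, lost: false };
    assert!(may_still_scan(&retiring));
    assert!(!owes_missed_blocks(&retiring)); // retired alone: users download nothing

    let lost = KeyStatus { retired: false, lost: true };
    assert!(!may_still_scan(&lost));
    assert!(owes_missed_blocks(&lost));
}
```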
that is more than 1 line, i will work on it :)
I see, thank you for the explanation. So would it be correct to say that we are retiring while still scanning?
yes exactly
Maybe not in this PR, but it seems to me that we should move from two boolean columns (retired and lost) to a single state column.
I think overall this looks excellent. My biggest concern after reading this is not having enough confidence that some odd cases behave properly, specifically:
- How things behave when there are overlapping ingests with separate keys. I think this should never happen because that would also mess up the report? However nothing is going to enforce that as far as I can tell.
- How would things behave if there is a period of time during which no ingest keys are scanning?
```rust
// If the next block index we are checking doesn't exist yet, then we definitely
// can't advance the highest processed block count.
// This breaks the loop if the ingress_keys set is empty.
if highest_known_block_count < next_block_count {
```
I believe `highest_known_block_count` does not advance when there are missing ranges/gaps. What happens if there's a gap? E.g. key 1 processed blocks 0-100 and then crashed, the network then did 100 blocks with no key being active, and key 2 only started at block 200 because it took us time to restart ingest for whatever reason. Wouldn't this cause the view server to never advance past 100, since no key provides 101?
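The stall can be reproduced with a toy version of the tracker loop. This is a deliberate oversimplification of block_tracker.rs: here "highest processed" just counts consecutive block indices from zero:

```rust
// Toy reproduction of the stall: the highest processed block count only
// advances while some key provides the next consecutive block index.
// Deliberate simplification of the real block tracker.
fn highest_processed(processed: &[u64]) -> u64 {
    let mut count = 0u64;
    // Stop at the first block index nobody has processed.
    while processed.contains(&count) {
        count += 1;
    }
    count
}

fn main() {
    // Key 1 processed blocks 0..=100 and crashed; key 2 started at block 200.
    let mut blocks: Vec<u64> = (0..=100).collect();
    blocks.extend(200..=250);
    // Nothing provides block 101, so the count never advances past it,
    // even though blocks 200..=250 are sitting in the database.
    assert_eq!(highest_processed(&blocks), 101);
}
```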
i think i see what you are saying
in an earlier version, i was using rec.last_scanned_block to decide the highest_known_block_count, but i cut that because there were comments in other tests suggesting it might cause problems, and i wasn't sure it was needed.
but i think if we don't use that at least for highest_processed_block_count, it's wrong for the reason you say,
and probably it should use that in next_blocks, for the optimization that you mentioned in a different comment.
going to think about this, and start by writing tests that would trigger the bad scenario and confirm that it is an issue. then make this change and confirm that it fixes the test, if that is indeed the case.
- Make view server use ingress key data instead of ingestable ranges
  The view server will use this instead of, or in addition to, the ingestable ranges. The point of this is to update the block tracker so that it doesn't mind if some blocks were never scanned, as long as we were never obligated to scan them. This largely rips the ingestable ranges out of the view server, and simplifies some logic along the way. It also has to rip out an assumption that a block always contains ETxOutRecords, which isn't necessarily true.
- Fix recovery db interface issue (option vec vs. vec)
  This allows us to distinguish "the block has not been scanned yet" from "the block did not produce any ETxOutRecords", which is important.
- Get passing fog conformance tests (!)
- fix three node cluster tests building
- Remove highest_known_block_index argument to block_tracker since it can be inferred
  Instead, make the block tracker infer this value.
- Add "lost" status to ingress keys, remove "highest" arg to block tracker
  Tests still need to be changed. The next step I think should happen is, block tracker should infer missed block ranges from the ingress key records (by looking at which ones are lost).
- Removed "missed blocks" argument to block tracker, and plumbing to pass it to that object
  This list is removed now, since missed blocks only occur when an ingress key is lost in the new model.
- fixes to block tracker test, small simplification in block tracker logic
- Fixes to logic around lost keys and last scanned block, also trying to fix view server tests
- still get passing view server tests
- fix fog sql unit test
- Update fog/recovery_db_iface/src/types.rs (Co-authored-by: Eran Rundstein <eran@rundste.in>)
- Update fog/recovery_db_iface/src/lib.rs (Co-authored-by: Eran Rundstein <eran@rundste.in>)
- Update fog/test_infra/src/db_tests.rs (Co-authored-by: Eran Rundstein <eran@rundste.in>)
- fix a panic in SQL code, extend db test case
- add test coverage on "covers_block_index" function
- fix tests, logging, fix conformance tests
- fix clippy
Force-pushed from 410a8fe to 82ad53a
squashed and rebased 28 commits (!)
this is now green after rebase, and i have kicked a deploy. i'm going to go back and iterate on unit tests now
I'm not sure I'm in a good place to give detailed commentary on the actual work here, but I did notice some low-hanging fruit: the displaydoc 1.7 -> 2.0 change seems to have been reverted, and the `hex2bin` usage is unnecessary, since `hex` does the same thing (and can be built in no-std, AFAIK).
this failed in deploy like this:

```
activate-fog-ingest:
fog-ingest1:
fog-ingest2:
```
I have tried to make it so that it no longer tests for equality of URIs, but tests for equality of responder ids. So hopefully that is simpler, and the debugging output should be better.
i think what happened is i messed up the submodule when i rebased
this passed deploy
@eranrund thank you for the detailed review comments! I think I have addressed these concerns with four new unit tests, in the most recent commit.
I believe the test …

I think that your analysis is right: if there is a gap, highest_known_block_index won't cross the gap, and can only get larger if new blocks are reported as processed after the gap. However, the thing that controls if new blocks get processed is the … I agree that it might be slightly better if we use the max of …

In regards to point (1): I think this is now covered effectively by the tests.

I think the short answer is, if there are two keys, then either of them can block us. And the calculation of whether either of them is blocking us is made independently. It's not blocking us if the calculation in …
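That "either key can block us" calculation can be sketched as a minimum over per-key allowances. This is a hypothetical simplification, not the real block tracker code:

```rust
// Sketch: with several keys, the highest processed block count is bounded by
// the smallest "next unprocessed" index among them. Each key is evaluated
// independently, and either one can hold the server back. Hypothetical model.
fn highest_processed_count(next_unprocessed_per_key: &[u64]) -> u64 {
    next_unprocessed_per_key.iter().copied().min().unwrap_or(0)
}

fn main() {
    // Key A has processed through block 80 (next is 81); key B through 60.
    assert_eq!(highest_processed_count(&[81, 61]), 61); // key B blocks us
    // With no keys at all, nothing has been processed.
    assert_eq!(highest_processed_count(&[]), 0);
}
```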
fortunately it doesn't seem to affect the test outcome, i think because in this case, the last scanned block value didn't matter for whether the server can make progress
Thank you for so thoroughly addressing my comments regarding testing. Aside from one discovery in one of the tests this looks good to me.
fog/view/server/src/db_fetcher.rs
Outdated
```rust
    retired: false,
    lost: false,
},
last_scanned_block: Some(49),
```
This was initially confusing to me, but the 49 value comes from MAX(block_index) on the processed blocks db table, which makes sense since we did add blocks [40-50) above. I would document this for future us.
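The 49 can be checked directly: last_scanned_block mirrors MAX(block_index) over the processed-blocks table, and the test added the half-open range [40, 50). A minimal sketch, with an invented helper name:

```rust
// Minimal sketch (helper name invented): last_scanned_block behaves like
// MAX(block_index) over the processed-blocks table.
fn last_scanned_block(processed_block_indices: &[u64]) -> Option<u64> {
    processed_block_indices.iter().copied().max()
}

fn main() {
    // The test added blocks [40, 50), i.e. indices 40..=49.
    let added: Vec<u64> = (40..50).collect();
    assert_eq!(last_scanned_block(&added), Some(49));
    // No rows yet: nothing scanned.
    assert_eq!(last_scanned_block(&[]), None);
}
```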
done
fog/view/server/src/db_fetcher.rs
Outdated
```rust
// Retire our key at block 45, and provide blocks 30-39 (we previously provided
// 40-49)
// We should only get block data for blocks 30-34.
```
Where's the 34 coming from?
i think it's a typo, i think it's actually 44, and then it's because that's the last block before expiry at 45. but i'm still confirming that
I reached a similar conclusion.
i have updated the comment, i think it was a typo
Co-authored-by: Eran Rundstein <eran@rundste.in>
Co-authored-by: Eran Rundstein <eran@rundste.in>
thank you for the detailed review! i think those are addressed now, please let me know if you see anything else
i will try re-reading the PR again also
🙏
This is supposed to make it so that view server does not get stuck on blocks that fog never promised to scan.
The main change is in the fog-view block tracker. To support the new idea, we make it not use ingestable ranges, but track ingress keys directly, and load blocks from database based on ingress key rather than ingest invocation id.
There are a couple of additional places where we try to unwind assumptions it was making, like that every block leads to at least one ETxOutRecord, which isn't strictly speaking true. There is a code path in the ingest enclave where a non-conforming client can send data that leads to no ETxOutRecord being produced, if the plaintext of the encrypted fog hint isn't a valid ristretto point.
Because it changes the DB API, a lot of unit tests need to be fixed up after this change. A bunch more need to be added for the new behavior as well.
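The Option-vs-Vec distinction this change introduces can be sketched like this; the types and function are stand-ins, not the real recovery db interface:

```rust
// Stand-in types: returning Option<Vec<...>> lets callers distinguish
// "block not scanned yet" (None) from "scanned, but produced no records"
// (Some(empty)), which can happen when a fog hint plaintext isn't valid.
#[derive(Debug, PartialEq, Clone)]
struct ETxOutRecord;

fn get_block_records(
    scanned: &[(u64, Vec<ETxOutRecord>)],
    block_index: u64,
) -> Option<Vec<ETxOutRecord>> {
    scanned
        .iter()
        .find(|(b, _)| *b == block_index)
        .map(|(_, recs)| recs.clone())
}

fn main() {
    let scanned = vec![
        (7u64, vec![ETxOutRecord]), // block 7: scanned, one record
        (8u64, vec![]),             // block 8: scanned, zero records
    ];
    assert_eq!(get_block_records(&scanned, 8), Some(vec![])); // scanned, empty
    assert_eq!(get_block_records(&scanned, 9), None);         // not scanned yet
}
```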