
Fog 393 view server fixes #11

Merged
merged 14 commits into mobilecoinfoundation:master from fog-393-view-server-fixes on Mar 25, 2021

Conversation


@cbeck88 cbeck88 commented Mar 11, 2021

This is supposed to make it so that the view server does not get stuck on blocks that fog never promised to scan.

The main change is in the fog-view block tracker. To support the new idea, it no longer uses ingestable ranges; instead it tracks ingress keys directly, and loads blocks from the database based on the ingress key rather than the ingest invocation id.

There are a couple of additional places where we unwind assumptions the code was making, such as that every block leads to at least one ETxOutRecord, which isn't strictly true. There is a code path in the ingest enclave where a non-conforming client can send data that produces no ETxOutRecord, if the plaintext of the encrypted fog hint isn't a valid Ristretto point.

Because it changes the DB API, a lot of unit tests need to be fixed up after this change. A bunch more need to be added for the new behavior as well.
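
For orientation, here is a minimal sketch of the shape this gives the DB lookup. The trait and type names below are hypothetical stand-ins, not the real recovery-db API; the point is that block data is looked up per (ingress key, block index) rather than per ingest invocation id, and that "not scanned yet" is distinct from "scanned but produced zero ETxOutRecords".

// Hypothetical stand-ins, not the real fog recovery-db API.
type IngressKey = [u8; 32]; // stand-in for CompressedRistrettoPublic
type BlockIndex = u64;

/// Stand-in for the records one scanned block contributes to the view store.
pub struct BlockResults {
    pub e_tx_out_records: Vec<Vec<u8>>, // may legitimately be empty
}

/// Hypothetical trait sketching the query the view server's db_fetcher makes.
pub trait RecoveryDbSketch {
    /// None: this ingress key has not scanned this block yet.
    /// Some(empty vec): the block was scanned but produced no ETxOutRecords
    /// (e.g. a non-conforming fog hint); the tracker should still advance.
    fn get_block_results(&self, key: &IngressKey, index: BlockIndex) -> Option<BlockResults>;
}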

@cbeck88 cbeck88 requested review from eranrund, sugargoat and a team March 11, 2021 16:36
Contributor Author

In the future, we should use this function in ingest instead of duplicating this logic.

It's not actually used in this revision, though; I ended up not using it. We could remove it for now.

@eranrund eranrund left a comment

Looks great so far! It is the type of code that is prone to off-by-one/edge-case issues, so the more unit tests we can get, the more confident we can be in it. I haven't noticed anything obviously wrong with it.

}

sleep(Duration::from_secs(1)); // Supposedly enough time for at least some blocks to get picked up.

assert!(db_fetcher.get_pending_fetched_records().is_empty());

// Report a missing block range.
db.report_missed_block_range(&BlockRange::new(20, 30))
// TODO The last scanned block index gets updated even though we have a hole.
Contributor

@garbageslam I wonder what your thoughts on this are. There is a weird case where, if somehow some blocks were skipped, last_scanned_block and highest_known_block_index would still go up. I think this is fine, but I wanted to check with you, since I think this differs from the previous behavior.

Contributor Author
@cbeck88 cbeck88 Mar 25, 2021

I think this is okay -- I think ingest only actually scans consecutive sequences.

If we wanted to change this, we could make the database enforce that consecutive sequences are scanned.

I don't think this is the wrong behavior for the view server to have in this scenario; I think if we want to change this, we need to change it in the database.

Contributor Author

Maybe it's bad because, if there are gaps like this, reporting the key as lost won't fix it?

Contributor Author

idk, I can think of a few ways to try to fix this in the database if it actually happened, but I don't think it will actually happen, because I think ingest asserts that it always processes consecutive blocks.

assert!(missing_block_ranges.is_empty());

// Report a missing block range. This should have no effect.
db.report_missed_block_range(&BlockRange::new(30, 40))
Contributor

The current version doesn't skip blocks that overlap with missed block ranges; instead, only the pubkey expiry controls when scanning stops for a given key.

I want to check my understanding, @garbageslam is this flow correct?

  1. We have an active ingress key with pubkey_expiry = 100, and it has scanned all the way to block 50.
  2. Everything crashes and by the time we restarted, the network moved to block 75.
  3. A new ingress key is created with start=75
  4. We report missed block range 50-100, since we did advertise that we will scan until block 100 but failed
  5. We set retired=true for the first ingress key, and this causes the view server to stop trying to load blocks from it

Contributor

Hm, I'm looking at https://github.com/garbageslam/fog/blob/1d924ff2a0663b432fbee854915c65ee02fb3246/fog/view/server/src/block_tracker.rs#L30-L32 and it looks like it will keep returning block 51 as the next one even if we are retired, because it is lower than the pubkey expiry. @garbageslam what are your thoughts on that?

Contributor Author

I think this behavior is correct -- the reason is, if the new ingress key starts at 75, Bob may have an RNG and some transactions that were scanned by that ingest, say at blocks 77, 88, 99.
He may find those transactions by view-key scanning, but if he later gets a TxOut in block 124, that will be the 4th output of the RNG. But from the point of view of the fog view protocol, he is still on the first output of his RNG. So even though those blocks are reported missing (because of a different key), fog view still needs to scan them, or else Bob will get blocked on RNG value 4 and not find the TxOut in block 124.

@eranrund
Contributor

@garbageslam some notes about tests:

  • db_fetcher appears to have a flaky test basic_single_ingress_key
  • all the highest_fully_processed_block_ tests in block_fetcher.rs need to be reviewed. The addition of highest_known_block_index means there are more cases to check for.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

@eranrund as I'm working through these tests, I'm hitting some situations where neither choice really seems ideal.

In this test, for instance, next_blocks_single_key_retired_hasnt_scanned:

        let mut block_tracker = BlockTracker::new(logger);

        let key = CompressedRistrettoPublic::from_random(&mut rng); 
        let rec = IngressPublicKeyRecord {
            key,
            status: IngressPublicKeyStatus {
                start_block: 123,
                pubkey_expiry: 173,
                retired: true,
            },
            last_scanned_block: None,
        };

        // This is the expected state because, even though the key is retired,
        // we promised to scan from 123 to 173. So the next thing we need is 123,
        // unless those blocks are declared missed.
        let expected_state = HashMap::from_iter(vec![(key, 123)]);

The code previously had expected_state = empty, but I think it should be that we need to load block 123.

That's because retired doesn't mean "lost" or "missing"; it just means the operator is trying to shut down this key. But we already published it, so it has been used, and we have to scan with it or report missing blocks.

As I'm working through these tests, I'm thinking that the best thing may be to fold into this PR a change I talked about and wanted to stage for later -- make the ingress keys have a boolean flag "lost" in addition to "retired", and make this the only pathway to report missed blocks to the DB. When a key is lost, the range "last_scanned" -> "pubkey_expiry" is missed. Lost means there are no longer any blocks coming for this key. Then, next_blocks would take this field into account.

I think this would simplify the logic in the block tracker, so it may make it easier to develop this PR, build a good body of tests for it, and build confidence in the change. It does increase the scope of the PR, but that's the route I'm thinking of taking right now. LMK what you think.
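
To make that concrete, here is a minimal self-contained sketch of the proposed rule, using the field names from the test snippet above; this is a sketch of the idea, not the code in the PR.

/// Status fields as sketched in this thread; `lost` is the new flag being proposed.
struct IngressKeyStatus {
    start_block: u64,
    pubkey_expiry: u64, // first block index we did NOT promise to scan
    retired: bool,
    lost: bool,
}

struct IngressKeyRecord {
    status: IngressKeyStatus,
    last_scanned_block: Option<u64>, // None if the key never scanned anything
}

/// If (and only if) a key is lost, the blocks we promised but never scanned become
/// missed: from the first unscanned block up to (but excluding) pubkey_expiry.
fn missed_range_if_lost(rec: &IngressKeyRecord) -> Option<std::ops::Range<u64>> {
    if !rec.status.lost {
        return None;
    }
    let first_unscanned = rec
        .last_scanned_block
        .map(|b| b + 1)
        .unwrap_or(rec.status.start_block);
    if first_unscanned < rec.status.pubkey_expiry {
        Some(first_unscanned..rec.status.pubkey_expiry)
    } else {
        None
    }
}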

@eranrund
Contributor

@garbageslam I think that sounds reasonable. What would be the 1-2 line description of retired in that case? When do we set it to true and what does that represent?

With regard to next_blocks() returning the start_block: the idea of it returning last_scanned_block and not start_block was to keep the view server (inside db_fetcher) from trying to fetch a block that was not scanned yet. If it defaults to the start_block, then it will repeatedly try to load it, even if we know (by the last scanned block) that it never got written, which is suboptimal.
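
A tiny sketch of the tradeoff described here (hypothetical helper, not the actual next_blocks): starting from last_scanned_block + 1 when it is known avoids polling a block the key has not written yet, while falling back to start_block covers a key that has not scanned anything.

/// Hypothetical helper illustrating the choice discussed above.
fn next_block_to_load(start_block: u64, last_scanned_block: Option<u64>) -> u64 {
    match last_scanned_block {
        // The key has scanned something: the next interesting block is the one after it.
        Some(last) => last + 1,
        // Nothing scanned yet: we can only start from the first block we promised to scan.
        None => start_block,
    }
}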

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

When retired is set, it means we are trying to retire -- we are no longer publishing reports for the key. But we may still have the key, and we may still be scanning the last few blocks with it. Things may get retired during "normal operation" of the system, like if we do an enclave upgrade. Retired doesn't mean the users need to download blocks.

Lost means any blocks we were supposed to scan but didn't scan are missed and need to be downloaded by the user.
Lost shouldn't happen during normal operation of the system; it happens if all the replicas crash.

In both cases they are only set manually.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

that is more than 1 line, i will work on it :)

@eranrund
Contributor

I see, thank you for the explanation. So would it be correct to say that we are retiring while last_scanned_block < (pubkey_expiry - 1) and are actually retired when last_scanned_block == (pubkey_expiry - 1)? This is based on pubkey_expiry being the first block index that would not get scanned.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

yes exactly
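
In code, the distinction just agreed on might look like this minimal sketch (assumed names; it relies on pubkey_expiry being the first block index that does not get scanned):

/// A retired key is still "retiring" until it has scanned block pubkey_expiry - 1.
fn is_fully_retired(retired: bool, last_scanned_block: Option<u64>, pubkey_expiry: u64) -> bool {
    retired && last_scanned_block.map_or(false, |last| last + 1 >= pubkey_expiry)
}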

@eranrund
Contributor

Maybe not in this PR, but it seems to me that we should move from two boolean columns (retired and lost) to a status column that could be one of the following: pending (maybe? before anything got scanned; maybe not needed), active, lost, retiring, retired.
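
A sketch of what that single status column could look like as a Rust enum (hypothetical; the PR as written keeps the two booleans):

/// Possible single-status replacement for the `retired` / `lost` booleans.
enum IngressKeyState {
    /// Key exists but has not scanned anything yet (possibly unnecessary).
    Pending,
    /// Scanning normally; reports are being published for this key.
    Active,
    /// `retired` is set, but blocks up to pubkey_expiry are still being scanned.
    Retiring,
    /// Scanned through pubkey_expiry - 1; nothing more is expected from this key.
    Retired,
    /// The key is gone; unscanned blocks up to pubkey_expiry are reported missed.
    Lost,
}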

@cbeck88 cbeck88 marked this pull request as ready for review March 24, 2021 16:48
@eranrund eranrund left a comment

I think overall this looks excellent. My biggest concern after reading this is not having enough confidence that some odd cases behave properly, specifically:

  1. How things behave when there are overlapping ingests with separate keys. I think this should never happen because that would also mess up the report? However nothing is going to enforce that as far as I can tell.
  2. How would things behave if there is a period of time during which no ingest keys are scanning.

// If the next block index we are checking doesn't exist yet, then we definitely
// can't advance the highest processed block count.
// This breaks the loop if ingress_keys set is empty.
if highest_known_block_count < next_block_count {
Contributor

I believe highest_known_block_count does not advance when there are missing ranges/gaps. What happens if there's a gap? E.g., key 1 processed blocks 0-100 and then crashed, the network then did 100 blocks with no key being active, and key 2 only started at block 200 because it took us time to restart ingest for whatever reason. Wouldn't this cause the view server to never advance past 100, since no key provides 101?

Contributor Author

I think I see what you are saying.

In an earlier version, I was using rec.last_scanned_block to decide the highest_known_block_count; then I cut that because there were comments in other tests suggesting it might cause problems, and I wasn't sure it was needed.

But I think if we don't use that, at least for highest_processed_block_count, it's wrong for the reason you say, and it probably should also be used in next_blocks, for the optimization that you mentioned in a different comment.

I'm going to think about this, and start by writing tests that would trigger the bad scenario and confirm that it is an issue, then make this change and confirm that it fixes the test if that is indeed the case.
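
For illustration, a self-contained sketch of the idea being considered (hypothetical signature; the real tracker keeps more state than this): let the "highest known block" also account for last_scanned_block, so a gap between keys cannot pin it down.

/// Hypothetical: infer the highest known block count from processed blocks plus
/// whatever any ingress key reports having scanned.
fn infer_highest_known_block_count(
    highest_processed_block_count: u64,
    last_scanned_blocks: &[Option<u64>], // one entry per ingress key record
) -> u64 {
    let highest_scanned_count = last_scanned_blocks
        .iter()
        .filter_map(|last| last.map(|b| b + 1)) // block index -> block count
        .max()
        .unwrap_or(0);
    highest_processed_block_count.max(highest_scanned_count)
}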

The view server will do this, instead of, or in addition to, the
ingestable ranges

Make view server use ingress key data instead of ingestable ranges

The point of this is to update the block tracker so that it doesn't
mind if some blocks were never scanned, as long as we were never
obligated to scan them.

This largely rips the ingestable ranges out of the view server,
and simplifies some logic along the way. It also has to rip out
an assumption that a block always contains ETxOutRecords which
isn't necessarily true.

Fix recovery db interface issue (Option<Vec> vs. Vec)

- This allows us to distinguish "the block has not been scanned yet"
  from "the block did not produce any ETxOutRecords", which is important
- Get passing fog conformance tests (!)

fix three node cluster tests building

Remove highest_known_block_index argument to block_tracker since it can be inferred

Instead, make the block tracker infer this value.

Add "lost" status to ingress keys, remove "highest" arg to block tracker

Tests still need to be changed

The next step I think should happen is, block tracker should infer
missed block ranges from the ingress key records (by looking at which
ones are lost)

Removed "missed blocks" argument to block tracker, and plumbing to pass it to that object

This list is removed now since missed blocks only occur when an ingress key is lost
in the new model.

fixes to block tracker test, small simplification in block tracker logic

Fixes to logic around lost keys and last scanned block,

also trying to fix view server tests still

get passing view server tests

fix fog sql unit test

Update fog/recovery_db_iface/src/types.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

Update fog/recovery_db_iface/src/lib.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

Update fog/test_infra/src/db_tests.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

fix a panic in SQL code, extend db test case

add test coverage on "covers_block_index" function

fix tests, logging, fix conformance tests

fix clippy
@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

squashed and rebased 28 commits (!)

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

This is now green after rebase, and I have kicked a deploy. I'm going to go back and iterate on unit tests now.

@jcape jcape left a comment

I'm not sure I'm in a good place to give detailed commentary on the actual work here, but I did notice some low-hanging fruit: the displaydoc 1.7 -> 2.0 change seems to have been reverted, and the hex2bin usage is unnecessary, since hex does the same thing (and can be built in non-std, AFAIK).

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

this failed in deploy like this:

activate-fog-ingest:

+ /usr/local/bin/fog_ingest_client --uri insecure-fog-ingest://fog-ingest1.diogenes.svc.cluster.local:3226 --retry-seconds 30 activate
2021-03-25 03:01:34.236118362 UTC WARN Creating insecure gRPC connection to fog-ingest1.diogenes.svc.cluster.local:3226, mc.fog.cxn: insecure-fog-ingest://fog-ingest1.diogenes.svc.cluster.local:3226/, mc.module: mc_util_grpc::grpcio_extensions, mc.src: mobilecoin/util/grpc/src/grpcio_extensions.rs:46
thread 'main' panicked at 'rpc failed: Operation { error: Grpc(RpcFailure(RpcStatus { status: 9-FAILED_PRECONDITION, details: Some("activate: Peer backup error: Failed to set peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/ peers list") })), total_delay: 30s, tries: 301 }', fog/ingest/client/src/main.rs:96:43

fog-ingest1:

2021-03-25 02:45:37.132532252 UTC ERRO Tried to send our peer list to peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/, but despite successful status the peer list is still wrong! Expected: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/], Found: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/], mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.module: fog_ingest_server::controller, mc.src: fog/ingest/server/src/controller.rs:1305
2021-03-25 02:45:37.132715153 UTC ERRO activate: Peer backup error: Failed to set peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/ peers list, rpc_request_id: 111, rpc_client_id: 9044d643dc4fe0ab9e1151a4e4398f334060789e0c8472cb6624842b3d297bbb, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.module: mc_util_grpc, mc.src: mobilecoin/util/grpc/src/lib.rs:153

fog-ingest2:

2021-03-25 03:09:48.841313215 UTC WARN The new set of peers did not contain a URI with our responder-id. We added our URI to the set: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/] <-- insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/, mc.local_node_id: fog-ingest2.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest2.diogenes.svc.cluster.local:8090, mc.module: fog_ingest_server::controller, mc.src: fog/ingest/server/src/controller.rs:823

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

I have tried to make it so that it no longer tests for equality of URIs, but instead tests for equality of responder ids. So hopefully that is simpler, and the debugging output should be better.

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

i think what happened is i messed up the submodule when i rebased

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

this passed deploy

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

@eranrund thank you for detailed review comments!

I think I have addressed these concerns with four new unit tests, in the most recent commit:

My biggest concern after reading this is not having enough confidence that some odd cases behave properly, specifically:

  1. How things behave when there are overlapping ingests with separate keys. I think this should never happen because that would also mess up the report? However nothing is going to enforce that as far as I can tell.
  2. How would things behave if there is a period of time during which no ingest keys are scanning.

and

I believe highest_known_block_count does not advance when there are missing ranges/gaps. What happens if there's a gap? E.g., key 1 processed blocks 0-100 and then crashed, the network then did 100 blocks with no key being active, and key 2 only started at block 200 because it took us time to restart ingest for whatever reason. Wouldn't this cause the view server to never advance past 100, since no key provides 101?

and

This test relates to my comment above with how the view server advances to the next block. I think if the ingress keys are not all consecutive and somehow there is a gap then the view server will stop progressing (unless I misunderstood something).

I believe the tests highest_fully_processed_block_tracks_retired_key_followed_by_gap and highest_fully_processed_block_tracks_retired_key_concurrent_with_active_both_lost both speak to what happens in point (2) and to these last two comments.

I think your analysis is right that, if there is a gap, highest_known_block_index won't cross the gap, and it can only get larger if new blocks are reported as processed after the gap.

However, the thing that controls whether new blocks get processed is the next_blocks function, and next_blocks is not limited by the highest processed block count. As long as there is a new key with a start_block after the gap, next_blocks will say we should load the start block for that key. And once we do load that block and report it processed, highest_known_block_count will get larger, and then highest_processed_block_count can get larger, and so cross the gap. This is what happens in the tests right now.

I agree that it might be slightly better if we use the max of last_scanned_block as well when determining where to cap the highest processed block count, because those blocks are also "known", and it makes it easier to see that the thing won't get stuck, which is good. However, if it's not a bug right now, I'd like to do that in a later ticket, because I think this version works as is.

In regards to point (1): I think this is now covered effectively by the tests:

  • highest_fully_processed_block_tracks_multiple_recs
  • highest_fully_processed_block_tracks_multiple_recs2
  • highest_fully_processed_block_tracks_retired_key_concurrent_with_active
  • and to some extent highest_fully_processed_block_tracks_retired_key_concurrent_with_active_both_lost

I think the short answer is: if there are two keys, then either of them can block us, and the calculation of whether each of them is blocking us is made independently. A key is not blocking us if the calculation in covers_block_index says that key doesn't need to provide the next block; otherwise it is, and we have to wait for it.
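
Here is a self-contained sketch of my reading of that calculation (assumed field names; the real covers_block_index in block_tracker.rs may differ in details):

struct KeyRecordSketch {
    start_block: u64,
    pubkey_expiry: u64, // first block index this key never promised to scan
    lost: bool,
    last_scanned_block: Option<u64>,
}

/// Does this key still owe us block `index`? If yes, the view server must wait
/// for it; if no key owes the block, processing can move past it.
fn covers_block_index(rec: &KeyRecordSketch, index: u64) -> bool {
    // We only ever owed blocks in [start_block, pubkey_expiry).
    let promised = rec.start_block <= index && index < rec.pubkey_expiry;
    // A lost key only accounts for blocks it actually scanned; the rest are
    // reported missed and downloaded by clients directly.
    let still_expected = !rec.lost || rec.last_scanned_block.map_or(false, |last| index <= last);
    promised && still_expected
}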

Fortunately it doesn't seem to affect the test outcome, I think, because in this case the last scanned block value didn't matter for whether the server can make progress.
@eranrund eranrund left a comment

Thank you for so thoroughly addressing my comments regarding testing. Aside from one discovery in one of the tests, this looks good to me.

retired: false,
lost: false,
},
last_scanned_block: Some(49),
Contributor

This was initially confusing to me, but the 49 value comes from MAX(block_index) on the processed blocks db table, which makes sense since we did add blocks [40-50) above. I would document this for future us.

Contributor Author

done


// Retire our key at block 45, and provide blocks 30-39 (we previously provided
// 40-49)
// We should only get block data for blocks 30-34.
Contributor

Where's the 34 coming from?

Contributor Author

I think it's a typo; I think it's actually 44, and that's because that's the last block before expiry at 45. But I'm still confirming that.

Contributor

I reached a similar conclusion.

Contributor Author

I have updated the comment; I think it was a typo.

cbeck88 and others added 3 commits March 25, 2021 20:57
Co-authored-by: Eran Rundstein <eran@rundste.in>
Co-authored-by: Eran Rundstein <eran@rundste.in>
@cbeck88 cbeck88 requested a review from eranrund March 25, 2021 19:06
@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

Thank you for the detailed review! I think those are addressed now; please let me know if you see anything else.

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

i will try re-reading the PR again also

@eranrund eranrund left a comment

🙏

@cbeck88 cbeck88 merged commit 78386df into mobilecoinfoundation:master Mar 25, 2021
@cbeck88 cbeck88 deleted the fog-393-view-server-fixes branch March 25, 2021 19:42