
Fog 393 view server fixes #11

Merged
merged 14 commits into mobilecoinfoundation:master from fog-393-view-server-fixes on Mar 25, 2021

Conversation


@cbeck88 cbeck88 commented Mar 11, 2021

This is supposed to make it so that the view server does not get stuck on blocks that fog never promised to scan.

The main change is in the fog-view block tracker. To support the new idea, it no longer uses ingestable ranges; instead it tracks ingress keys directly, and loads blocks from the database based on the ingress key rather than the ingest invocation id.

There are a couple of additional places where we unwind assumptions the code was making, such as that every block leads to at least one ETxOutRecord, which isn't strictly true. There is a code path in the ingest enclave where a non-conforming client can send data that produces no ETxOutRecord, if the plaintext of the encrypted fog hint isn't a valid Ristretto point.

Because it changes the DB API, a lot of unit tests need to be fixed up after this change. A bunch more need to be added for the new behavior as well.
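
For orientation, here is a minimal sketch of the shape this gives the DB lookup. The trait and type names below are hypothetical stand-ins, not the real recovery-db API; the point is that block data is looked up per (ingress key, block index) rather than per ingest invocation id, and that "not scanned yet" is distinct from "scanned but produced zero ETxOutRecords".

// Hypothetical stand-ins, not the real fog recovery-db API.
type IngressKey = [u8; 32]; // stand-in for CompressedRistrettoPublic
type BlockIndex = u64;

/// Stand-in for the records one scanned block contributes to the view store.
pub struct BlockResults {
    pub e_tx_out_records: Vec<Vec<u8>>, // may legitimately be empty
}

/// Hypothetical trait sketching the query the view server's db_fetcher makes.
pub trait RecoveryDbSketch {
    /// None: this ingress key has not scanned this block yet.
    /// Some(empty vec): the block was scanned but produced no ETxOutRecords
    /// (e.g. a non-conforming fog hint); the tracker should still advance.
    fn get_block_results(&self, key: &IngressKey, index: BlockIndex) -> Option<BlockResults>;
}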

@cbeck88 cbeck88 requested review from eranrund, sugargoat and a team March 11, 2021 16:36
Contributor Author

In the future, we should use this function in ingest instead of duplicating this logic.

It's not actually used in this revision, though; I ended up not using it. We could remove it for now.

@eranrund eranrund left a comment

Looks great so far! It is the type of code that is prone to off-by-one/edge-case issues, so the more unit tests we can get, the more confident we can be in it. I haven't noticed anything obviously wrong with it.

}

sleep(Duration::from_secs(1)); // Supposedly enough time for at least some blocks to get picked up.

assert!(db_fetcher.get_pending_fetched_records().is_empty());

// Report a missing block range.
db.report_missed_block_range(&BlockRange::new(20, 30))
// TODO The last scanned block index gets updated even though we have a hole.
Contributor

@garbageslam I wonder what your thoughts on this are. There is a weird case where, if somehow some blocks were skipped, last_scanned_block and highest_known_block_index would still go up. I think this is fine, but I wanted to check with you, since I think this differs from the previous behavior.

Contributor Author
@cbeck88 cbeck88 Mar 25, 2021

I think this is okay -- I think ingest only actually scans consecutive sequences.

If we wanted to change this, we could make the database enforce that consecutive sequences are scanned.

I don't think this is the wrong behavior for the view server to have in this scenario; I think if we want to change this, we need to change it in the database.

Contributor Author

Maybe it's bad because, if there are gaps like this, reporting the key as lost won't fix it?

Contributor Author

idk, I can think of a few ways to try to fix this in the database if it actually happened, but I don't think it will actually happen, because I think ingest asserts that it always processes consecutive blocks.

assert!(missing_block_ranges.is_empty());

// Report a missing block range. This should have no effect.
db.report_missed_block_range(&BlockRange::new(30, 40))
Contributor

The current version doesn't skip blocks that overlap with missed block ranges; instead, only the pubkey expiry controls when scanning stops for a given key.

I want to check my understanding, @garbageslam is this flow correct?

  1. We have an active ingress key with pubkey_expiry = 100, and it has scanned all the way to block 50.
  2. Everything crashes and by the time we restarted, the network moved to block 75.
  3. A new ingress key is created with start=75
  4. We report missed block range 50-100, since we did advertise that we will scan until block 100 but failed
  5. We set retired=true for the first ingress key, and this causes the view server to stop trying to load blocks from it

Contributor

Hm, I'm looking at https://github.com/garbageslam/fog/blob/1d924ff2a0663b432fbee854915c65ee02fb3246/fog/view/server/src/block_tracker.rs#L30-L32 and it looks like it will keep returning block 51 as the next one even if we are retired, because it is lower than the pubkey expiry. @garbageslam what are your thoughts on that?

Contributor Author

I think this behavior is correct -- the reason is, if the new ingress key starts at 75, Bob may have an RNG and some transactions that were scanned by that ingest, say at blocks 77, 88, 99.
He may find those transactions by view-key scanning, but if he later gets a TxOut in block 124, that will be the 4th output of the RNG. But from the point of view of the fog view protocol, he is still on the first output of his RNG. So even though those blocks are reported missing (because of a different key), fog view still needs to scan them, or else Bob will get blocked on RNG value 4 and not find the TxOut in block 124.

@eranrund
Contributor

@garbageslam some notes about tests:

  • db_fetcher appears to have a flaky test basic_single_ingress_key
  • all the highest_fully_processed_block_ tests in block_fetcher.rs need to be reviewed. The addition of highest_known_block_index means there are more cases to check for.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

@eranrund as I'm working through these tests, I'm hitting some situations where neither choice really seems ideal.

In this test, for instance, next_blocks_single_key_retired_hasnt_scanned:

        let mut block_tracker = BlockTracker::new(logger);

        let key = CompressedRistrettoPublic::from_random(&mut rng); 
        let rec = IngressPublicKeyRecord {
            key,
            status: IngressPublicKeyStatus {
                start_block: 123,
                pubkey_expiry: 173,
                retired: true,
            },
            last_scanned_block: None,
        };

        // This is the expected state because, even though the key is retired,
        // we promised to scan from 123 to 173. So the next thing we need is 123,
        // unless those blocks are declared missed.
        let expected_state = HashMap::from_iter(vec![(key, 123)]);

The code previously had expected_state = empty, but I think it should be that we need to load block 123.

That's because retired doesn't mean "lost" or "missing"; it just means the operator is trying to shut down this key. But we already published it, so it has been used, and we have to scan with it or report missing blocks.

As I'm working through these tests, I'm thinking that the best thing may be to fold into this PR a change I talked about and wanted to stage for later -- make the ingress keys have a boolean flag "lost" in addition to "retired", and make this the only pathway to report missed blocks to the DB. When a key is lost, the range "last_scanned" -> "pubkey_expiry" is missed. Lost means there are no longer any blocks coming for this key. Then, next_blocks would take this field into account.

I think this would simplify the logic in the block tracker, so it may make it easier to develop this PR, build a good body of tests for it, and build confidence in the change. It does increase the scope of the PR, but that's the route I'm thinking of taking right now. LMK what you think.
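
To make that concrete, here is a minimal self-contained sketch of the proposed rule, using the field names from the test snippet above; this is a sketch of the idea, not the code in the PR.

/// Status fields as sketched in this thread; `lost` is the new flag being proposed.
struct IngressKeyStatus {
    start_block: u64,
    pubkey_expiry: u64, // first block index we did NOT promise to scan
    retired: bool,
    lost: bool,
}

struct IngressKeyRecord {
    status: IngressKeyStatus,
    last_scanned_block: Option<u64>, // None if the key never scanned anything
}

/// If (and only if) a key is lost, the blocks we promised but never scanned become
/// missed: from the first unscanned block up to (but excluding) pubkey_expiry.
fn missed_range_if_lost(rec: &IngressKeyRecord) -> Option<std::ops::Range<u64>> {
    if !rec.status.lost {
        return None;
    }
    let first_unscanned = rec
        .last_scanned_block
        .map(|b| b + 1)
        .unwrap_or(rec.status.start_block);
    if first_unscanned < rec.status.pubkey_expiry {
        Some(first_unscanned..rec.status.pubkey_expiry)
    } else {
        None
    }
}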

@eranrund
Contributor

@garbageslam I think that sounds reasonable. What would be the 1-2 line description of retired in that case? When do we set it to true and what does that represent?

With regard to next_blocks() returning the start_block: the idea of it returning last_scanned_block and not start_block was to keep the view server (inside db_fetcher) from trying to fetch a block that was not scanned yet. If it defaults to the start_block, then it will repeatedly try to load it, even if we know (by the last scanned block) that it never got written, which is suboptimal.
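
A tiny sketch of the tradeoff described here (hypothetical helper, not the actual next_blocks): starting from last_scanned_block + 1 when it is known avoids polling a block the key has not written yet, while falling back to start_block covers a key that has not scanned anything.

/// Hypothetical helper illustrating the choice discussed above.
fn next_block_to_load(start_block: u64, last_scanned_block: Option<u64>) -> u64 {
    match last_scanned_block {
        // The key has scanned something: the next interesting block is the one after it.
        Some(last) => last + 1,
        // Nothing scanned yet: we can only start from the first block we promised to scan.
        None => start_block,
    }
}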

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

When retired is set, it means we are trying to retire -- we are no longer publishing reports for the key. But we may still have the key, and we may still be scanning the last few blocks with it. Things may get retired during "normal operation" of the system, like if we do an enclave upgrade. Retired doesn't mean the users need to download blocks.

Lost means any blocks we were supposed to scan but didn't scan are missed and need to be downloaded by the user.
Lost shouldn't happen during normal operation of the system; it happens if all the replicas crash.

In both cases they are only set manually.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

that is more than 1 line, i will work on it :)

@eranrund
Contributor

I see, thank you for the explanation. So would it be correct to say that we are retiring while last_scanned_block < (pubkey_expiry - 1) and are actually retired when last_scanned_block == (pubkey_expiry - 1)? This is based on pubkey_expiry being the first block index that would not get scanned.

@cbeck88
Contributor Author

cbeck88 commented Mar 12, 2021

yes exactly
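
In code, the distinction just agreed on might look like this minimal sketch (assumed names; it relies on pubkey_expiry being the first block index that does not get scanned):

/// A retired key is still "retiring" until it has scanned block pubkey_expiry - 1.
fn is_fully_retired(retired: bool, last_scanned_block: Option<u64>, pubkey_expiry: u64) -> bool {
    retired && last_scanned_block.map_or(false, |last| last + 1 >= pubkey_expiry)
}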

@eranrund
Contributor

Maybe not in this PR, but it seems to me that we should move from two boolean columns (retired and lost) to a status column that could be one of the following: pending (maybe? before anything got scanned; maybe not needed), active, lost, retiring, retired.
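
A sketch of what that single status column could look like as a Rust enum (hypothetical; the PR as written keeps the two booleans):

/// Possible single-status replacement for the `retired` / `lost` booleans.
enum IngressKeyState {
    /// Key exists but has not scanned anything yet (possibly unnecessary).
    Pending,
    /// Scanning normally; reports are being published for this key.
    Active,
    /// `retired` is set, but blocks up to pubkey_expiry are still being scanned.
    Retiring,
    /// Scanned through pubkey_expiry - 1; nothing more is expected from this key.
    Retired,
    /// The key is gone; unscanned blocks up to pubkey_expiry are reported missed.
    Lost,
}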

@cbeck88 cbeck88 marked this pull request as ready for review March 24, 2021 16:48
@eranrund eranrund left a comment

I think overall this looks excellent. My biggest concern after reading this is not having enough confidence that some odd cases behave properly, specifically:

  1. How things behave when there are overlapping ingests with separate keys. I think this should never happen because that would also mess up the report? However nothing is going to enforce that as far as I can tell.
  2. How would things behave if there is a period of time during which no ingest keys are scanning.

// If the next block index we are checking doesn't exist yet, then we definitely
// can't advance the highest processed block count.
// This breaks the loop if ingress_keys set is empty.
if highest_known_block_count < next_block_count {
Contributor

I believe highest_known_block_count does not advance when there are missing ranges/gaps. What happens if there's a gap? E.g., key 1 processed blocks 0-100 and then crashed, the network then did 100 blocks with no key being active, and key 2 only started at block 200 because it took us time to restart ingest for whatever reason. Wouldn't this cause the view server to never advance past 100, since no key provides 101?

Contributor Author

I think I see what you are saying.

In an earlier version, I was using rec.last_scanned_block to decide the highest_known_block_count; then I cut that because there were comments in other tests suggesting it might cause problems, and I wasn't sure it was needed.

But I think if we don't use that, at least for highest_processed_block_count, it's wrong for the reason you say, and it probably should also be used in next_blocks, for the optimization that you mentioned in a different comment.

I'm going to think about this, and start by writing tests that would trigger the bad scenario and confirm that it is an issue, then make this change and confirm that it fixes the test if that is indeed the case.
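
For illustration, a self-contained sketch of the idea being considered (hypothetical signature; the real tracker keeps more state than this): let the "highest known block" also account for last_scanned_block, so a gap between keys cannot pin it down.

/// Hypothetical: infer the highest known block count from processed blocks plus
/// whatever any ingress key reports having scanned.
fn infer_highest_known_block_count(
    highest_processed_block_count: u64,
    last_scanned_blocks: &[Option<u64>], // one entry per ingress key record
) -> u64 {
    let highest_scanned_count = last_scanned_blocks
        .iter()
        .filter_map(|last| last.map(|b| b + 1)) // block index -> block count
        .max()
        .unwrap_or(0);
    highest_processed_block_count.max(highest_scanned_count)
}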

The view server will do this, instead of, or in addition to, the
ingestable ranges

Make view server use ingress key data instead of ingestable ranges

The point of this is to update the block tracker so that it doesn't
mind if some blocks were never scanned, as long as we were never
obligated to scan them.

This largely rips the ingestable ranges out of the view server,
and simplifies some logic along the way. It also has to rip out
an assumption that a block always contains ETxOutRecords which
isn't necessarily true.

Fix recovery db interface issue (Option<Vec> vs. Vec)

- This allows us to distinguish "the block has not been scanned yet"
  from "the block did not produce any ETxOutRecords", which is important
- Get passing fog conformance tests (!)

fix three node cluster tests building

Remove highest_known_block_index argument to block_tracker since it can be inferred

Instead, make the block tracker infer this value.

Add "lost" status to ingress keys, remove "highest" arg to block tracker

Tests still need to be changed

The next step I think should happen is, block tracker should infer
missed block ranges from the ingress key records (by looking at which
ones are lost)

Removed "missed blocks" argument to block tracker, and plumbing to pass it to that object

This list is removed now since missed blocks only occur when an ingress key is lost
in the new model.

fixes to block tracker test, small simplification in block tracker logic

Fixes to logic around lost keys and last scanned block,

also trying to fix view server tests still

get passing view server tests

fix fog sql unit test

Update fog/recovery_db_iface/src/types.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

Update fog/recovery_db_iface/src/lib.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

Update fog/test_infra/src/db_tests.rs

Co-authored-by: Eran Rundstein <eran@rundste.in>

fix a panic in SQL code, extend db test case

add test coverage on "covers_block_index" function

fix tests, logging, fix conformance tests

fix clippy
@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

squashed and rebased 28 commits (!)

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

This is now green after rebase, and I have kicked a deploy. I'm going to go back and iterate on unit tests now.

@jcape jcape left a comment

I'm not sure I'm in a good place to give detailed commentary on the actual work here, but I did notice some low-hanging fruit: the displaydoc 1.7 -> 2.0 change seems to have been reverted, and the hex2bin usage is unnecessary, since hex does the same thing (and can be built in non-std, AFAIK).

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

this failed in deploy like this:

activate-fog-ingest:

+ /usr/local/bin/fog_ingest_client --uri insecure-fog-ingest://fog-ingest1.diogenes.svc.cluster.local:3226 --retry-seconds 30 activate
2021-03-25 03:01:34.236118362 UTC WARN Creating insecure gRPC connection to fog-ingest1.diogenes.svc.cluster.local:3226, mc.fog.cxn: insecure-fog-ingest://fog-ingest1.diogenes.svc.cluster.local:3226/, mc.module: mc_util_grpc::grpcio_extensions, mc.src: mobilecoin/util/grpc/src/grpcio_extensions.rs:46
thread 'main' panicked at 'rpc failed: Operation { error: Grpc(RpcFailure(RpcStatus { status: 9-FAILED_PRECONDITION, details: Some("activate: Peer backup error: Failed to set peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/ peers list") })), total_delay: 30s, tries: 301 }', fog/ingest/client/src/main.rs:96:43

fog-ingest1:

2021-03-25 02:45:37.132532252 UTC ERRO Tried to send our peer list to peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/, but despite successful status the peer list is still wrong! Expected: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/], Found: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/], mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.module: fog_ingest_server::controller, mc.src: fog/ingest/server/src/controller.rs:1305
2021-03-25 02:45:37.132715153 UTC ERRO activate: Peer backup error: Failed to set peer insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/ peers list, rpc_request_id: 111, rpc_client_id: 9044d643dc4fe0ab9e1151a4e4398f334060789e0c8472cb6624842b3d297bbb, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest1.diogenes.svc.cluster.local:8090, mc.module: mc_util_grpc, mc.src: mobilecoin/util/grpc/src/lib.rs:153

fog-ingest2:

2021-03-25 03:09:48.841313215 UTC WARN The new set of peers did not contain a URI with our responder-id. We added our URI to the set: [insecure-igp://fog-ingest1.diogenes.svc.cluster.local:8090/, insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/] <-- insecure-igp://fog-ingest2.diogenes.svc.cluster.local:8090/, mc.local_node_id: fog-ingest2.diogenes.svc.cluster.local:8090, mc.app: fog_ingest_server, mc.local_node_id: fog-ingest2.diogenes.svc.cluster.local:8090, mc.module: fog_ingest_server::controller, mc.src: fog/ingest/server/src/controller.rs:823

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

I have tried to make it so that it no longer tests for equality of URIs, but instead tests for equality of responder ids. So hopefully that is simpler, and the debugging output should be better.

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

i think what happened is i messed up the submodule when i rebased

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

this passed deploy

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

@eranrund thank you for detailed review comments!

I think I have addressed these concerns with four new unit tests, in the most recent commit:

My biggest concern after reading this is not having enough confidence that some odd cases behave properly, specifically:

  1. How things behave when there are overlapping ingests with separate keys. I think this should never happen because that would also mess up the report? However nothing is going to enforce that as far as I can tell.
  2. How would things behave if there is a period of time during which no ingest keys are scanning.

and

I believe highest_known_block_count does not advance when there are missing ranges/gaps. What happens if there's a gap? E.g., key 1 processed blocks 0-100 and then crashed, the network then did 100 blocks with no key being active, and key 2 only started at block 200 because it took us time to restart ingest for whatever reason. Wouldn't this cause the view server to never advance past 100, since no key provides 101?

and

This test relates to my comment above with how the view server advances to the next block. I think if the ingress keys are not all consecutive and somehow there is a gap then the view server will stop progressing (unless I misunderstood something).

I believe the tests highest_fully_processed_block_tracks_retired_key_followed_by_gap and highest_fully_processed_block_tracks_retired_key_concurrent_with_active_both_lost both speak to what happens in point (2) and to these last two comments.

I think your analysis is right that, if there is a gap, highest_known_block_index won't cross the gap, and it can only get larger if new blocks are reported as processed after the gap.

However, the thing that controls whether new blocks get processed is the next_blocks function, and next_blocks is not limited by the highest processed block count. As long as there is a new key with a start_block after the gap, next_blocks will say we should load the start block for that key. And once we do load that block and report it processed, highest_known_block_count will get larger, and then highest_processed_block_count can get larger, and so cross the gap. This is what happens in the tests right now.

I agree that it might be slightly better if we use the max of last_scanned_block as well when determining where to cap the highest processed block count, because those blocks are also "known", and it makes it easier to see that the thing won't get stuck, which is good. However, if it's not a bug right now, I'd like to do that in a later ticket, because I think this version works as is.

In regards to point (1): I think this is now covered effectively by the tests:

  • highest_fully_processed_block_tracks_multiple_recs
  • highest_fully_processed_block_tracks_multiple_recs2
  • highest_fully_processed_block_tracks_retired_key_concurrent_with_active
  • and to some extent highest_fully_processed_block_tracks_retired_key_concurrent_with_active_both_lost

I think the short answer is: if there are two keys, then either of them can block us, and the calculation of whether each of them is blocking us is made independently. A key is not blocking us if the calculation in covers_block_index says that key doesn't need to provide the next block; otherwise it is, and we have to wait for it.
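
Here is a self-contained sketch of my reading of that calculation (assumed field names; the real covers_block_index in block_tracker.rs may differ in details):

struct KeyRecordSketch {
    start_block: u64,
    pubkey_expiry: u64, // first block index this key never promised to scan
    lost: bool,
    last_scanned_block: Option<u64>,
}

/// Does this key still owe us block `index`? If yes, the view server must wait
/// for it; if no key owes the block, processing can move past it.
fn covers_block_index(rec: &KeyRecordSketch, index: u64) -> bool {
    // We only ever owed blocks in [start_block, pubkey_expiry).
    let promised = rec.start_block <= index && index < rec.pubkey_expiry;
    // A lost key only accounts for blocks it actually scanned; the rest are
    // reported missed and downloaded by clients directly.
    let still_expected = !rec.lost || rec.last_scanned_block.map_or(false, |last| index <= last);
    promised && still_expected
}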

Fortunately it doesn't seem to affect the test outcome, I think, because in this case the last scanned block value didn't matter for whether the server can make progress.
@eranrund eranrund left a comment

Thank you for so thoroughly addressing my comments regarding testing. Aside from one discovery in one of the tests, this looks good to me.

retired: false,
lost: false,
},
last_scanned_block: Some(49),
Contributor

This was initially confusing to me, but the 49 value comes from MAX(block_index) on the processed blocks db table, which makes sense since we did add blocks [40-50) above. I would document this for future us.

Contributor Author

done


// Retire our key at block 45, and provide blocks 30-39 (we previously provided
// 40-49)
// We should only get block data for blocks 30-34.
Contributor

Where's the 34 coming from?

Contributor Author

I think it's a typo; I think it's actually 44, and that's because that's the last block before expiry at 45. But I'm still confirming that.

Contributor

I reached a similar conclusion.

Contributor Author

I have updated the comment; I think it was a typo.

cbeck88 and others added 3 commits March 25, 2021 20:57
Co-authored-by: Eran Rundstein <eran@rundste.in>
Co-authored-by: Eran Rundstein <eran@rundste.in>
@cbeck88 cbeck88 requested a review from eranrund March 25, 2021 19:06
@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

Thank you for the detailed review! I think those are addressed now; please let me know if you see anything else.

@cbeck88
Contributor Author

cbeck88 commented Mar 25, 2021

i will try re-reading the PR again also

@eranrund eranrund left a comment

🙏

@cbeck88 cbeck88 merged commit 78386df into mobilecoinfoundation:master Mar 25, 2021
@cbeck88 cbeck88 deleted the fog-393-view-server-fixes branch March 25, 2021 19:42