fix FOG-180: enclave report cache updated too frequently #116

Merged

cbeck88 merged 8 commits into release-1.1.1 from fix-fog-180-report-cache-frequency on Aug 4, 2021

Conversation

@cbeck88 (Contributor, Author) commented Jul 29, 2021

This commit makes the ingest server update the enclave report cache
only when necessary.

That is:
(1) At least once before anything else happens
(2) Whenever the ingress key has changed
(3) Whenever we notice that the report doesn't match the key
    we think we are publishing to the database.

Additionally, a background thread is created that updates it
at least as frequently as consensus does (every 18 hours).

Before this commit, we updated it at least once per block.

There are some questions about whether DNS failures when looking
up IAS were causing instability in the service.
This commit doesn't speak directly to that, but it
can't hurt to hit IAS less often than we were doing.

@@ -1028,11 +1071,34 @@ where
ingress_public_key: &CompressedRistrettoPublic,
state: &mut MutexGuard<IngestControllerState>,
) -> Result<IngressPublicKeyStatus, Error> {
self.update_enclave_report_cache()?;
@cbeck88 (Contributor, Author) commented Jul 29, 2021

Removing this line -- or rather, making it conditional on the current report not matching the key we are supposed to be publishing -- is the main point of this PR. It means we no longer update the report on every block, which is too often.
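
To illustrate the intent, here is a minimal sketch of that conditional, retry-once logic. `ReportSource`, `cached_report_key`, and `refresh` are hypothetical stand-ins for the controller's report-cache handle, not the actual fog-ingest API:

```rust
/// Hypothetical stand-in for the controller's report-cache handle;
/// the real fog-ingest types and method names differ.
trait ReportSource {
    /// Ingress key found in the currently cached report.
    fn cached_report_key(&self) -> [u8; 32];
    /// Refresh the report cache (in the real server this talks to IAS).
    fn refresh(&mut self);
}

#[derive(Debug)]
enum Error {
    PublishReport,
}

/// Get a report key matching `expected`, refreshing the cache at most
/// once, instead of unconditionally refreshing on every call.
fn key_for_publishing<S: ReportSource>(
    source: &mut S,
    expected: &[u8; 32],
) -> Result<[u8; 32], Error> {
    let key = source.cached_report_key();
    if &key == expected {
        return Ok(key); // Fast path: no IAS round trip.
    }
    source.refresh(); // Slow path: refresh once and re-check.
    let key = source.cached_report_key();
    if &key == expected {
        Ok(key)
    } else {
        Err(Error::PublishReport) // Refresh didn't help; surface an error.
    }
}
```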

/// The report cache worker is a thread responsible for periodically calling
/// update report cache. This is a separate thread so that it can be on a
/// time-based schedule, so it will happen even if there are few blocks.
pub struct ReportCacheWorker {
@cbeck88 (Contributor, Author) commented Jul 29, 2021

This worker object is added for fog-ingest, and it is not exactly the same as the one in mc_sgx_report_cache_untrusted.

It can't be, because in consensus the "enclave identity" bytes never change, while in fog ingest they do: sometimes the ingress key changes, and then the bytes in the report have to change.

(1) Sometimes the operator calls "new_keys"
(2) Sometimes the active node tells us "hey, this is the key"
(3) Sometimes we reach out to a remote node and say "hey, give me your key" (if the operator asks us to)

Before this PR, we would get an updated report whenever we were about to take an action involving reports or attestation.

After this PR, we update the report (see the sketch after this list):

  • once when the server starts
  • every 18 hours according to the background thread, same as consensus
  • after actions 1, 2, and 3 listed above
  • before some miscellaneous actions: activating the server and syncing keys from a remote. These two happen very rarely (with a human in the loop), so it should be fine, and keeping them prevents us from having to refactor tests.
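
As a rough sketch of such a worker -- the trait, the spawn helper, and the stop-flag plumbing are illustrative assumptions, not the real fog-ingest code:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical controller interface; the real controller is the
/// ingest controller with its update_enclave_report_cache entrypoint.
trait Controller: Send + Sync + 'static {
    fn update_enclave_report_cache(&self);
}

/// Periodically refresh the report cache on a time-based schedule,
/// so it happens even if few blocks arrive.
fn spawn_report_cache_worker<C: Controller>(
    controller: Arc<C>,
    period: Duration, // e.g. 18 hours, same as consensus
    stop: Arc<AtomicBool>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut last_update = Instant::now();
        while !stop.load(Ordering::SeqCst) {
            if last_update.elapsed() >= period {
                controller.update_enclave_report_cache();
                last_update = Instant::now();
            }
            // Short sleeps keep shutdown (and tests) responsive.
            thread::sleep(Duration::from_secs(1));
        }
    })
}
```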

Contributor:
Thank you for the explanation here!

@cbeck88 (Contributor, Author):

I guess there is one more case:

  • if we ask for a report to publish and it doesn't match the key we think we have, then we update the report and try again (the retry-once path sketched above).

Contributor:
Maybe instead of duplicating it we should change mc_sgx_report_cache_untrusted to accept a callback for getting the report data? Not something I feel strongly about, since in this case the implementation is very small.

@cbeck88 (Contributor, Author):

Yeah -- I think another way is, we could add a constructor to that thing that takes Arc<Mutex<ReportCache>>, so that other things can also trigger the update?

It will be easier to develop changes like this after fog is merged into mobilecoin.

Contributor:

Related question: why did you make a new ReportCacheWorker instead of just using the existing ReportCacheThread? I see that you based the ReportCacheWorker on a PeerCheckupWorker -- what does a PeerCheckupWorker do?

@cbeck88 (Contributor, Author) commented Jul 30, 2021

So what actually happened here is that I looked at what ReportCacheThread does and copied it here, but made it just call controller.update_enclave_report_cache, because I still want the Controller to be able to trigger an update sometimes, and so I want the Controller to own the cache.

The philosophy I'm trying to use here is that there's one object that establishes invariants, owns most of the state, and has entrypoints where functionality can be called -- something that is not itself threaded and is potentially unit-testable. That's the controller. All the threads, grpc routes, worker threads, etc. are outside that object and call into it when they need something done. That object does have to hold mutexes, but it can be done in a thread-safe way, and it means that to figure out the high-level behavior of the server -- including whether there is any way it can deadlock -- we mostly just have to read controller.rs. There's some logic in worker.rs related to timings and such that's relevant, but anything that changes the state of the system or talks to the enclave goes through the controller at some point.

But yeah, what actually happened code-wise is that I read the ReportCacheThread code in the mc-sgx-report-cache crate, which almost works, except it owns the ReportCache and can't share it. Then I realized its implementation is slightly better than PeerCheckupWorker's, so I copied its implementation for both PeerCheckupWorker and ReportCacheWorker and then fixed up the logic.


controller.peer_checkup();
std::thread::sleep(peer_checkup_period);
@cbeck88 (Contributor, Author):

This change is tangential to the goal of the PR, but I realized that this thread was sleeping for 60 seconds at a time, which probably makes the tests take at least 60 seconds. If we make it sleep only 1 second at a time, same as the consensus thread, the unit tests should run a lot faster.
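
A sketch of the finer-grained sleep this describes, assuming a shared stop flag (the helper name is illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Sleep for `total`, waking about once per second to check `stop`.
/// Unlike a single 60-second sleep, this lets tests and shutdown
/// proceed within roughly a second of the stop flag being set.
fn sleep_checking_stop(total: Duration, stop: &AtomicBool) {
    let deadline = Instant::now() + total;
    while !stop.load(Ordering::SeqCst) {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break;
        }
        std::thread::sleep(remaining.min(Duration::from_secs(1)));
    }
}
```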

Contributor:
Good find

self.write_state_file_inner(&mut state);
let result = self.get_ingest_summary_inner(&mut state);

// Don't hold the state mutex while we are talking to IAS
@cbeck88 (Contributor, Author):

I don't know that this is really critical, but it seems like a good idea and can't hurt.

// This means that the caller is wrong about what the
// current ingress public key is, and we don't have anything we can publish.
log::error!(self.logger, "Report doesn't contain the expected public key even after report refresh: {:?} != {:?}", found_key, ingress_public_key);
return Err(Error::PublishReport);
Contributor:

Making a note that we will want the error message to carry enough information to take action -- what is the node operator expected to do to resolve this scenario?

@cbeck88 (Contributor, Author):

I have no idea how this would happen; we would have to investigate to know what is wrong. It would probably mean that some cache involved in creating the reports is in a bad state that is hard to reproduce.

If it doesn't resolve itself, then maybe we could try killing the server and making a backup node active instead.

@sugargoat (Contributor) left a comment

LGTM - would request a review from either @eranrund or @jcape, and for @joekottke or @jgreat to approve, confirming they understand how they should handle the error scenario.

@@ -222,8 +259,7 @@ where
pub fn new_keys(&self) -> Result<IngestSummary, Error> {
let mut state = self.get_state();
self.new_keys_inner(&mut state)?;
self.write_state_file_inner(&mut state);
@cbeck88 (Contributor, Author):

We don't need to write the state file here, because new_keys_inner is doing that.

@cbeck88 (Contributor, Author) commented Jul 29, 2021

This passed deployment.

@eranrund (Contributor) left a comment

LGTM, although I am wondering if we should move all report refreshing to the worker thread, and add a way to signal it to force a refresh now rather than waiting for the internal period to elapse. What would be the implications of moving the report refreshing to happen asynchronously that way? The advantage would be that all IAS interaction (network calls) happens from a single thread -- so it doesn't block anything, and we are not doing two refreshes concurrently.


@cbeck88 (Contributor, Author) commented Jul 29, 2021

> LGTM, although I am wondering if we should move all report refreshing to the worker thread, and add a way to signal it to force a refresh now rather than waiting for the internal period to elapse. What would be the implications of moving the report refreshing to happen asynchronously that way?

So James and I talked about this -- the thing is, sometimes it is important for the main thread to be able to trigger a refresh and then block until that refresh has completed, because it makes the semantics of all of this easier.

All these places where we "reach out to a peer and make an attested connection" fail with "no report" if the report cache hasn't been populated, and then the server crashes. Because this line: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-0f4f3e64562ef1cbd4d4d712b8f5fcb08ef44066474f57a7005059a348293064R191
blocks the main thread until the report cache update succeeds, I know that none of the workers will start doing things until the cache has been populated at least once.

Another thing is, it makes it easier to reason about code like this: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-1bd75fa7d4e53379b9fe76ed5de3edd920c698f1d188a08ae34d2ba9fb414247R1103

Right now, I know that if we hit this log message, it means that:
(1) the server is in the active state
(2) the enclave told us earlier that the key was one thing, then it gave us a report that didn't match that
(3) even calling update_enclave_report_cache didn't fix that

So I can reason that either (1) the enclave key is somehow being allowed to change while the server is in the active state (big bug), or (2) the report cache mechanism is messed up somehow, and the caches either inside or outside the enclave are not being flushed when we trigger the update (big bug).

If that line merely triggered an asynchronous update that didn't block the worker thread, then I couldn't reason that way, because it might just be that the update hadn't finished yet.

> The advantage would be that all IAS interaction (network calls) happens from a single thread -- so it doesn't block anything, and we are not doing two refreshes concurrently.

It's worth drilling in on this -- in the current implementation, the controller holds Arc<Mutex<ReportCache>>:

report_cache: Arc<Mutex<ReportCache<IngestSgxEnclave, R>>>,

This serializes all calls to update_enclave_report_cache no matter which thread triggers it. The mutex is only locked while that function runs, and the function doesn't lock any other mutexes, so there's no risk of deadlock.
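
A condensed sketch of that sharing pattern -- the ReportCache body here is a stand-in for the mc-sgx-report-cache type, not its real implementation:

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for mc-sgx-report-cache's ReportCache.
struct ReportCache {
    refresh_count: u64,
}

impl ReportCache {
    /// In the real server this re-attests and refreshes the cached report.
    fn update(&mut self) {
        self.refresh_count += 1;
    }
}

struct Controller {
    /// Shared handle: the controller, the report-cache worker, and any
    /// other thread can trigger an update through the same mutex.
    report_cache: Arc<Mutex<ReportCache>>,
}

impl Controller {
    /// The mutex is held only for the duration of the update itself,
    /// and update() takes no other locks, so updates are serialized
    /// across threads with no deadlock risk.
    fn update_enclave_report_cache(&self) {
        self.report_cache.lock().expect("mutex poisoned").update();
    }
}
```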

@eranrund (Contributor):
I see. Thank you for the detailed explanation. The ability to block does make sense in the cases you've mentioned, and since the report_cache mutex serializes the updates, I am satisfied.

@queenbdubs left a comment

At one point, we (I + someone... either you/sara/eran?) considered also keeping track of the last block where the report cache was updated. Would this still be useful to do? I imagine it would help with debugging, but I'm not sure.

@cbeck88 (Contributor, Author) commented Jul 30, 2021

@queenbdubs I think we should make a ticket to follow up on that. It sounds like a good idea, but I'm not sure we should try to do it at this time in the release branch.

@eranrund (Contributor) commented Aug 3, 2021

Is anything blocking this from getting merged?

@cbeck88 (Contributor, Author) commented Aug 3, 2021

@eranrund Sara wanted ops to sign off that they understand the change, and that they understand what to do if they see this error about the report not matching the key in a loop. We are in a meeting about this right now.

@cbeck88 (Contributor, Author) commented Aug 3, 2021

TODO:

  • add a metric for the pubkey_expiry of the last report we published
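
A sketch of what such a metric might look like with the prometheus crate; the metric name and helper function are assumptions, and the actual implementation in the follow-up commit may differ:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    /// Hypothetical gauge: pubkey_expiry (a block index) of the last
    /// report we successfully published to the database.
    static ref LAST_PUBLISHED_PUBKEY_EXPIRY: IntGauge = register_int_gauge!(
        "fog_ingest_last_published_report_pubkey_expiry",
        "pubkey_expiry of the last published enclave report"
    )
    .unwrap();
}

/// Call after each successful publish so alerts can fire if the
/// expiry stops advancing.
fn record_published_report(pubkey_expiry: u64) {
    LAST_PUBLISHED_PUBKEY_EXPIRY.set(pubkey_expiry as i64);
}
```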

@jcape (Contributor) left a comment

Some minor nitpicks that will simplify the dependencies of this (i.e. no need for sgx-types) -- could you also add an issue in mobilecoin for what needs to happen in ReportCache?

I'd also recommend a fog ticket for refactoring the enclave method to return bool.

@jcape (Contributor) left a comment

LGTM!

Commit: "Add metric for last published report" -- this was requested in discussions with ops, to make better alerts.
@cbeck88 cbeck88 merged commit a615c62 into release-1.1.1 Aug 4, 2021
@cbeck88 cbeck88 deleted the fix-fog-180-report-cache-frequency branch August 4, 2021 20:05
cbeck88 added a commit that referenced this pull request Aug 4, 2021
* fix FOG-180: enclave report cache updated too frequently


* fixups to previous (mostly around exactly when mutexes are released)

* Move key-extract function to fog-api crate, add test for new_keys call

* add some comments and debug logs

* fix module name

* add missing copyright notice

* simplify fog report parse code to avoid sgx-types dependency (thanks james)

* Add metric for last published report

This is requested in discussions with ops, to make better alerts

Conflicts:
    Cargo.lock
cbeck88 added a commit that referenced this pull request Aug 9, 2021
* fix FOG-180: enclave report cache updated too frequently (#116)


* Make fog-report-cli use the new code for parsing keys out of fog reports

* remove unnecessary dependency