fix FOG-180: enclave report cache updated too frequently #116

Merged

cbeck88 merged 8 commits into release-1.1.1 from fix-fog-180-report-cache-frequency on Aug 4, 2021

Conversation

@cbeck88 (Contributor, Author) commented Jul 29, 2021

This commit makes the ingest server update the enclave report cache
only when necessary.

That is:
(1) At least once before anything else happens
(2) Whenever the ingress key has changed
(3) Whenever we notice that the report doesn't match the key
    we think we are publishing to the database.

Additionally, a background thread is created that updates it
at least as frequently as consensus does (every 18 hours).

Before this commit, we updated it at least once per block.

There are some questions about whether DNS failures when looking
up IAS were causing instability in the service.
This commit doesn't speak directly to that, but it
can't hurt to hit IAS less often than we were doing.

@@ -1028,11 +1071,34 @@ where
ingress_public_key: &CompressedRistrettoPublic,
state: &mut MutexGuard<IngestControllerState>,
) -> Result<IngressPublicKeyStatus, Error> {
self.update_enclave_report_cache()?;
@cbeck88 (Contributor, Author) commented Jul 29, 2021

Removing this line -- or rather, making it conditional on the current report not matching the key we are supposed to be publishing -- is the main point of this PR. It means we no longer update the report on every block, which is too often.
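
To illustrate the intent, here is a minimal sketch of that conditional, retry-once logic. `ReportSource`, `cached_report_key`, and `refresh` are hypothetical stand-ins for the controller's report-cache handle, not the actual fog-ingest API:

```rust
/// Hypothetical stand-in for the controller's report-cache handle;
/// the real fog-ingest types and method names differ.
trait ReportSource {
    /// Ingress key found in the currently cached report.
    fn cached_report_key(&self) -> [u8; 32];
    /// Refresh the report cache (in the real server this talks to IAS).
    fn refresh(&mut self);
}

#[derive(Debug)]
enum Error {
    PublishReport,
}

/// Get a report key matching `expected`, refreshing the cache at most
/// once, instead of unconditionally refreshing on every call.
fn key_for_publishing<S: ReportSource>(
    source: &mut S,
    expected: &[u8; 32],
) -> Result<[u8; 32], Error> {
    let key = source.cached_report_key();
    if &key == expected {
        return Ok(key); // Fast path: no IAS round trip.
    }
    source.refresh(); // Slow path: refresh once and re-check.
    let key = source.cached_report_key();
    if &key == expected {
        Ok(key)
    } else {
        Err(Error::PublishReport) // Refresh didn't help; surface an error.
    }
}
```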

/// The report cache worker is a thread responsible for periodically calling
/// update report cache. This is a separate thread so that it can be on a
/// time-based schedule, so it will happen even if there are few blocks.
pub struct ReportCacheWorker {
@cbeck88 (Contributor, Author) commented Jul 29, 2021

This worker object is added for fog-ingest, and it is not exactly the same as the one in mc_sgx_report_cache_untrusted.

It can't be, because in consensus the "enclave identity" bytes never change, while in fog ingest they do: sometimes the ingress key changes, and then the bytes in the report have to change.

(1) Sometimes the operator calls "new_keys"
(2) Sometimes the active node tells us "hey, this is the key"
(3) Sometimes we reach out to a remote node and say "hey, give me your key" (if the operator asks us to)

Before this PR, we would get an updated report whenever we were about to take an action involving reports or attestation.

After this PR, we update the report (see the sketch after this list):

  • once when the server starts
  • every 18 hours according to the background thread, same as consensus
  • after actions 1, 2, and 3 listed above
  • before some miscellaneous actions: activating the server and syncing keys from a remote. These two happen very rarely (with a human in the loop), so it should be fine, and keeping them prevents us from having to refactor tests.
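
As a rough sketch of such a worker -- the trait, the spawn helper, and the stop-flag plumbing are illustrative assumptions, not the real fog-ingest code:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical controller interface; the real controller is the
/// ingest controller with its update_enclave_report_cache entrypoint.
trait Controller: Send + Sync + 'static {
    fn update_enclave_report_cache(&self);
}

/// Periodically refresh the report cache on a time-based schedule,
/// so it happens even if few blocks arrive.
fn spawn_report_cache_worker<C: Controller>(
    controller: Arc<C>,
    period: Duration, // e.g. 18 hours, same as consensus
    stop: Arc<AtomicBool>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut last_update = Instant::now();
        while !stop.load(Ordering::SeqCst) {
            if last_update.elapsed() >= period {
                controller.update_enclave_report_cache();
                last_update = Instant::now();
            }
            // Short sleeps keep shutdown (and tests) responsive.
            thread::sleep(Duration::from_secs(1));
        }
    })
}
```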

Contributor:
Thank you for the explanation here!

@cbeck88 (Contributor, Author):

I guess there is one more case:

  • if we ask for a report to publish and it doesn't match the key we think we have, then we update the report and try again (the retry-once path sketched above).

Contributor:
Maybe instead of duplicating it we should change mc_sgx_report_cache_untrusted to accept a callback for getting the report data? Not something I feel strongly about, since in this case the implementation is very small.

@cbeck88 (Contributor, Author):

Yeah -- I think another way is, we could add a constructor to that thing that takes Arc<Mutex<ReportCache>>, so that other things can also trigger the update?

It will be easier to develop changes like this after fog is merged into mobilecoin.

Contributor:

Related question: why did you make a new ReportCacheWorker instead of just using the existing ReportCacheThread? I see that you based the ReportCacheWorker on a PeerCheckupWorker -- what does a PeerCheckupWorker do?

@cbeck88 (Contributor, Author) commented Jul 30, 2021

So what actually happened here is that I looked at what ReportCacheThread does and copied it here, but made it just call controller.update_enclave_report_cache, because I still want the Controller to be able to trigger an update sometimes, and so I want the Controller to own the cache.

The philosophy I'm trying to use here is that there's one object that establishes invariants, owns most of the state, and has entrypoints where functionality can be called -- something that is not itself threaded and is potentially unit-testable. That's the controller. All the threads, grpc routes, worker threads, etc. are outside that object and call into it when they need something done. That object does have to hold mutexes, but it can be done in a thread-safe way, and it means that to figure out the high-level behavior of the server -- including whether there is any way it can deadlock -- we mostly just have to read controller.rs. There's some logic in worker.rs related to timings and such that's relevant, but anything that changes the state of the system or talks to the enclave goes through the controller at some point.

But yeah, what actually happened code-wise is that I read the ReportCacheThread code in the mc-sgx-report-cache crate, which almost works, except it owns the ReportCache and can't share it. Then I realized its implementation is slightly better than PeerCheckupWorker's, so I copied its implementation for both PeerCheckupWorker and ReportCacheWorker and then fixed up the logic.


controller.peer_checkup();
std::thread::sleep(peer_checkup_period);
@cbeck88 (Contributor, Author):

This change is tangential to the goal of the PR, but I realized that this thread was sleeping for 60 seconds at a time, which probably makes the tests take at least 60 seconds. If we make it sleep only 1 second at a time, same as the consensus thread, the unit tests should run a lot faster.
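
A sketch of the finer-grained sleep this describes, assuming a shared stop flag (the helper name is illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Sleep for `total`, waking about once per second to check `stop`.
/// Unlike a single 60-second sleep, this lets tests and shutdown
/// proceed within roughly a second of the stop flag being set.
fn sleep_checking_stop(total: Duration, stop: &AtomicBool) {
    let deadline = Instant::now() + total;
    while !stop.load(Ordering::SeqCst) {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break;
        }
        std::thread::sleep(remaining.min(Duration::from_secs(1)));
    }
}
```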

Contributor:
Good find

self.write_state_file_inner(&mut state);
let result = self.get_ingest_summary_inner(&mut state);

// Don't hold the state mutex while we are talking to IAS
@cbeck88 (Contributor, Author):

I don't know that this is really critical, but it seems like a good idea and can't hurt.

// This means that the caller is wrong about what the
// current ingress public key is, and we don't have anything we can publish.
log::error!(self.logger, "Report doesn't contain the expected public key even after report refresh: {:?} != {:?}", found_key, ingress_public_key);
return Err(Error::PublishReport);
Contributor:

Making a note that we will want the error message to carry enough information to take action -- what is the node operator expected to do to resolve this scenario?

@cbeck88 (Contributor, Author):

I have no idea how this would happen; we would have to investigate to know what is wrong. It would probably mean that some cache involved in creating the reports is in a bad state that is hard to reproduce.

If it doesn't resolve itself, then maybe we could try killing the server and making a backup node active instead.

@sugargoat (Contributor) left a comment

LGTM - would request a review from either @eranrund or @jcape, and for @joekottke or @jgreat to approve, confirming they understand how they should handle the error scenario.

@@ -222,8 +259,7 @@ where
pub fn new_keys(&self) -> Result<IngestSummary, Error> {
let mut state = self.get_state();
self.new_keys_inner(&mut state)?;
self.write_state_file_inner(&mut state);
@cbeck88 (Contributor, Author):

We don't need to write the state file here, because new_keys_inner is doing that.

@cbeck88 (Contributor, Author) commented Jul 29, 2021

This passed deployment.

@eranrund (Contributor) left a comment

LGTM, although I am wondering if we should move all report refreshing to the worker thread, and add a way to signal it to force a refresh now rather than waiting for the internal period to elapse. What would be the implications of moving the report refreshing to happen asynchronously that way? The advantage would be that all IAS interaction (network calls) happens from a single thread -- so it doesn't block anything, and we are not doing two refreshes concurrently.


@cbeck88 (Contributor, Author) commented Jul 29, 2021

> LGTM, although I am wondering if we should move all report refreshing to the worker thread, and add a way to signal it to force a refresh now rather than waiting for the internal period to elapse. What would be the implications of moving the report refreshing to happen asynchronously that way?

So James and I talked about this -- the thing is, sometimes it is important for the main thread to be able to trigger a refresh and then block until that refresh has completed, because it makes the semantics of all of this easier.

All these places where we "reach out to a peer and make an attested connection" fail with "no report" if the report cache hasn't been populated, and then the server crashes. Because this line: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-0f4f3e64562ef1cbd4d4d712b8f5fcb08ef44066474f57a7005059a348293064R191
blocks the main thread until the report cache update succeeds, I know that none of the workers will start doing things until the cache has been populated at least once.

Another thing is, it makes it easier to reason about code like this: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-1bd75fa7d4e53379b9fe76ed5de3edd920c698f1d188a08ae34d2ba9fb414247R1103

Right now, I know that if we hit this log message, it means that:
(1) the server is in the active state
(2) the enclave told us earlier that the key was one thing, then it gave us a report that didn't match that
(3) even calling update_enclave_report_cache didn't fix that

So I can reason that either (1) the enclave key is somehow being allowed to change while the server is in the active state (big bug), or (2) the report cache mechanism is messed up somehow, and the caches either inside or outside the enclave are not being flushed when we trigger the update (big bug).

If that line merely triggered an asynchronous update that didn't block the worker thread, then I couldn't reason that way, because it might just be that the update hadn't finished yet.

> The advantage would be that all IAS interaction (network calls) happens from a single thread -- so it doesn't block anything, and we are not doing two refreshes concurrently.

It's worth drilling in on this -- in the current implementation, the controller holds Arc<Mutex<ReportCache>>:

report_cache: Arc<Mutex<ReportCache<IngestSgxEnclave, R>>>,

This serializes all calls to update_enclave_report_cache no matter which thread triggers it. The mutex is only locked while that function runs, and the function doesn't lock any other mutexes, so there's no risk of deadlock.
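
A condensed sketch of that sharing pattern -- the ReportCache body here is a stand-in for the mc-sgx-report-cache type, not its real implementation:

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for mc-sgx-report-cache's ReportCache.
struct ReportCache {
    refresh_count: u64,
}

impl ReportCache {
    /// In the real server this re-attests and refreshes the cached report.
    fn update(&mut self) {
        self.refresh_count += 1;
    }
}

struct Controller {
    /// Shared handle: the controller, the report-cache worker, and any
    /// other thread can trigger an update through the same mutex.
    report_cache: Arc<Mutex<ReportCache>>,
}

impl Controller {
    /// The mutex is held only for the duration of the update itself,
    /// and update() takes no other locks, so updates are serialized
    /// across threads with no deadlock risk.
    fn update_enclave_report_cache(&self) {
        self.report_cache.lock().expect("mutex poisoned").update();
    }
}
```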

@eranrund (Contributor):
I see. Thank you for the detailed explanation. The ability to block does make sense in the cases you've mentioned, and since the report_cache mutex serializes the updates, I am satisfied.

@queenbdubs left a comment

At one point, we (I + someone... either you/sara/eran?) considered also keeping track of the last block where the report cache was updated. Would this still be useful to do? I imagine it would help with debugging, but I'm not sure.

@cbeck88 (Contributor, Author) commented Jul 30, 2021

@queenbdubs I think we should make a ticket to follow up on that. It sounds like a good idea, but I'm not sure we should try to do it at this time in the release branch.

@eranrund (Contributor) commented Aug 3, 2021

Is anything blocking this from getting merged?

@cbeck88 (Contributor, Author) commented Aug 3, 2021

@eranrund Sara wanted ops to sign off that they understand the change, and that they understand what to do if they see this error about the report not matching the key in a loop. We are in a meeting about this right now.

@cbeck88 (Contributor, Author) commented Aug 3, 2021

TODO:

  • add a metric for the pubkey_expiry of the last report we published
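
A sketch of what such a metric might look like with the prometheus crate; the metric name and helper function are assumptions, and the actual implementation in the follow-up commit may differ:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    /// Hypothetical gauge: pubkey_expiry (a block index) of the last
    /// report we successfully published to the database.
    static ref LAST_PUBLISHED_PUBKEY_EXPIRY: IntGauge = register_int_gauge!(
        "fog_ingest_last_published_report_pubkey_expiry",
        "pubkey_expiry of the last published enclave report"
    )
    .unwrap();
}

/// Call after each successful publish so alerts can fire if the
/// expiry stops advancing.
fn record_published_report(pubkey_expiry: u64) {
    LAST_PUBLISHED_PUBKEY_EXPIRY.set(pubkey_expiry as i64);
}
```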

@jcape (Contributor) left a comment

Some minor nitpicks that will simplify the dependencies of this (i.e. no need for sgx-types) -- could you also add an issue in mobilecoin for what needs to happen in ReportCache?

I'd also recommend a fog ticket for refactoring the enclave method to return bool.

@jcape (Contributor) left a comment

LGTM!

Commit: "Add metric for last published report" -- this was requested in discussions with ops, to make better alerts.
@cbeck88 cbeck88 merged commit a615c62 into release-1.1.1 Aug 4, 2021
@cbeck88 cbeck88 deleted the fix-fog-180-report-cache-frequency branch August 4, 2021 20:05
cbeck88 added a commit that referenced this pull request Aug 4, 2021
* fix FOG-180: enclave report cache updated too frequently


* fixups to previous (mostly around exactly when mutexes are released)

* Move key-extract function to fog-api crate, add test for new_keys call

* add some comments and debug logs

* fix module name

* add missing copyright notice

* simplify fog report parse code to avoid sgx-types dependency (thanks james)

* Add metric for last published report

This is requested in discussions with ops, to make better alerts

Conflicts:
    Cargo.lock
cbeck88 added a commit that referenced this pull request Aug 9, 2021
* fix FOG-180: enclave report cache updated too frequently (#116)


* Make fog-report-cli use the new code for parsing keys out of fog reports

* remove unnecessary dependency