fix FOG-180: enclave report cache updated too frequently #116
Conversation
This commit attempts to make it so that the ingest server updates the enclave report cache only when necessary. That is:
(1) At least once before anything happens.
(2) Whenever the ingress key has changed.
(3) Whenever we notice that the report doesn't match the key we think we are publishing to the database.
Additionally, a background thread is created that updates it at least as frequently as consensus does (every 18 hours). Before this commit, we updated it at least once per block.
There are some questions about whether DNS failures when looking up IAS were causing instability in the service. This commit doesn't speak directly to that, but it can't hurt to hit IAS less often than we were doing.
@@ -1028,11 +1071,34 @@ where
        ingress_public_key: &CompressedRistrettoPublic,
        state: &mut MutexGuard<IngestControllerState>,
    ) -> Result<IngressPublicKeyStatus, Error> {
        self.update_enclave_report_cache()?;
removing this line, or rather, making it conditional on the current report not matching the key we are supposed to be publishing, is the main point of this PR -- it will mean that we don't update the report on every block, which is too often
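The conditional refresh described above can be sketched roughly as follows. This is a minimal illustration, not the real fog-ingest code: `IngressKey`, `ReportCache`, `refresh`, and `refresh_if_stale` are hypothetical stand-ins for the enclave and IAS interaction.

```rust
// Hypothetical model: the cache knows which ingress key its report contains.
#[derive(Clone, PartialEq, Debug)]
struct IngressKey(u64);

struct ReportCache {
    key_in_report: IngressKey,
    refresh_count: u32,
}

impl ReportCache {
    // Stands in for the expensive path: re-attest with IAS and rebuild the report.
    fn refresh(&mut self, current_key: &IngressKey) {
        self.key_in_report = current_key.clone();
        self.refresh_count += 1;
    }

    // The change this PR makes: refresh only when the cached report no longer
    // matches the key we are publishing, instead of unconditionally per block.
    fn refresh_if_stale(&mut self, current_key: &IngressKey) {
        if self.key_in_report != *current_key {
            self.refresh(current_key);
        }
    }
}
```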
/// The report cache worker is a thread responsible for periodically calling
/// update report cache. This is a separate thread so that it can be on a
/// time-based schedule, so it will happen even if there are few blocks.
pub struct ReportCacheWorker {
This worker object is added for fog-ingest, and it is not exactly the same as the one in mc_sgx_report_cache_untrusted. It can't be, because while in consensus the "enclave identity" bytes never change, in fog ingest they do: sometimes the ingress key changes, and then the bytes in the report have to change.
(1) Sometimes, the operator calls "new_keys"
(2) Sometimes, the active node tells us "hey this is the key"
(3) Sometimes, we reach out to a remote node and say "hey give me your key" (if the operator asks us to)
Before this PR, we would get an updated report whenever we were about to take an action involving reports or attestation.
After this PR, we update the report:
- once when the server starts
- every 18 hours via a background thread, same as consensus
- after actions 1, 2, 3 listed above
- before some miscellaneous actions: activating the server, syncing a key from a remote node. These two things happen very rarely (with a human in the loop), so it should be fine, and keeping these prevents us from having to refactor tests.
Thank you for the explanation here!
i guess there is one more case:
- if we ask for a report to publish, and it doesn't match the key we think we have, then we update the report and try again.
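That "refresh and re-check once" case might look roughly like the sketch below. All names here (`publish_report`, `cached_key`, `enclave_key`) are illustrative, not the actual fog API; the real code talks to the enclave and the report cache rather than plain integers.

```rust
#[derive(Debug, PartialEq)]
enum PublishError {
    ReportKeyMismatch,
}

// `cached_key`: key currently baked into the cached report.
// `enclave_key`: key a fresh report from the enclave would contain.
// `expected`: key the caller believes we are publishing.
fn publish_report(
    cached_key: &mut u64,
    enclave_key: u64,
    expected: u64,
) -> Result<u64, PublishError> {
    if *cached_key != expected {
        // Mismatch: refresh the cache from the enclave, then check once more.
        *cached_key = enclave_key;
        if *cached_key != expected {
            // Still wrong even after a refresh: nothing we can publish.
            return Err(PublishError::ReportKeyMismatch);
        }
    }
    Ok(*cached_key)
}
```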
Maybe instead of duplicating it we should change mc_sgx_report_cache_untrusted
to accept a callback for getting the report data? Not something I feel strongly about, since in this case the implementation is very small.
Yeah -- I think another way is, we could make an additional constructor of that thing that takes Arc<Mutex<ReportCache>>, so that other things can also trigger the update?
It will be easier to develop changes like this after fog is merged into mobilecoin.
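The `Arc<Mutex<...>>` idea would look something like this in miniature. `Cache` and `shared_cache_demo` are hypothetical stand-ins for the real report cache and server wiring:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct Cache {
    updates: u32,
}

impl Cache {
    fn update(&mut self) {
        self.updates += 1;
    }
}

fn shared_cache_demo() -> u32 {
    let cache = Arc::new(Mutex::new(Cache { updates: 0 }));

    // The background worker holds one handle and updates on its schedule...
    let worker_cache = Arc::clone(&cache);
    let worker = thread::spawn(move || {
        worker_cache.lock().unwrap().update();
    });

    // ...while the controller can trigger an update on demand with another.
    cache.lock().unwrap().update();

    worker.join().unwrap();
    let updates = cache.lock().unwrap().updates;
    updates
}
```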
Related question... why did you make a new ReportCacheWorker instead of just using the existing ReportCacheThread? I see that you based the ReportCacheWorker on a PeerCheckupWorker -- what does a PeerCheckupWorker do?
So what happened here actually was, I looked at what ReportCacheThread does, and I copied it here, but made it just call controller.update_enclave_report_cache, because I still want the Controller to be able to trigger updates sometimes, and so I want the Controller to own the cache.

The philosophy I'm trying to use here is: there's one object that establishes invariants, owns most of the state, and has entrypoints where functionality can be called -- an object that is not threaded and is potentially unit-testable. That's the controller. All the threads, grpc routes, worker threads, etc. are outside that object, and call into it when they need something done. That object does have to have mutexes, but it can be done in a thread-safe way, and it means that to figure out the high-level behavior of the server, including whether there is any way it can deadlock, we mostly just have to read controller.rs. There's some logic in worker.rs related to timings and such that's relevant, but anything that changes the state of the system or talks to the enclave goes through the controller at some point.

But yeah, what actually happened code-wise is I read the ReportCacheThread code in the mc-sgx-report-cache crate, which almost works, except it owns the ReportCache and can't share it. Then I realized its implementation is slightly better than PeerCheckupWorker's, so I copied its implementation for PeerCheckupWorker and ReportCacheWorker and then fixed up the logic.
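The controller philosophy described above can be illustrated with a small sketch. This is not the real controller.rs; `IngestState` and `process_block` are hypothetical, but the shape is the same: one object owns the state behind a mutex and exposes entrypoints that threads call into.

```rust
use std::sync::Mutex;

struct IngestState {
    blocks_processed: u64,
}

pub struct Controller {
    // All mutable state lives here, behind one lock.
    state: Mutex<IngestState>,
}

impl Controller {
    pub fn new() -> Self {
        Controller {
            state: Mutex::new(IngestState { blocks_processed: 0 }),
        }
    }

    // An entrypoint: take the lock, uphold the invariants, release. Callable
    // from grpc handlers and worker threads alike, and unit-testable without
    // spawning any of them.
    pub fn process_block(&self) -> u64 {
        let mut state = self.state.lock().unwrap();
        state.blocks_processed += 1;
        state.blocks_processed
    }
}
```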
controller.peer_checkup();
std::thread::sleep(peer_checkup_period);
This change was tangential to the goal of the PR, but I realized that this thread is sleeping for like 60 seconds. That probably makes the tests take at least 60 seconds, so if we make this only sleep 1 second at a time, same as the consensus thread, the unit tests should run a lot faster.
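The change can be sketched like this: sleep in short ticks and check a stop flag between them, so tests and shutdown never have to wait out the full period. The function name is hypothetical, and the durations are shortened for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

// Sleep `total` in increments of `tick`, bailing out early if `stop` is set.
// Returns how many ticks were actually slept.
fn sleep_in_ticks(total: Duration, tick: Duration, stop: &AtomicBool) -> u32 {
    let mut slept = Duration::from_secs(0);
    let mut ticks = 0;
    while slept < total {
        if stop.load(Ordering::SeqCst) {
            break; // shut down promptly instead of finishing the long sleep
        }
        std::thread::sleep(tick);
        slept += tick;
        ticks += 1;
    }
    ticks
}
```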
Good find
self.write_state_file_inner(&mut state);
let result = self.get_ingest_summary_inner(&mut state);

// Don't hold the state mutex while we are talking to IAS
I don't know that this is really critical, but it seems like a good idea and can't hurt.
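The pattern of not holding the state mutex across the IAS round trip looks roughly like the sketch below; `slow_ias_call` and `refresh_without_holding_lock` are hypothetical stand-ins for the network call and the controller method.

```rust
use std::sync::Mutex;

// Stand-in for the IAS network round trip.
fn slow_ias_call(input: u64) -> u64 {
    input * 2
}

fn refresh_without_holding_lock(state: &Mutex<u64>) -> u64 {
    // Copy out what we need, then release the lock before the slow call.
    let input = {
        let guard = state.lock().unwrap();
        *guard
    }; // guard is dropped here, so other threads can take the state mutex

    let result = slow_ias_call(input);

    // Re-take the lock only long enough to store the result.
    *state.lock().unwrap() = result;
    result
}
```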
// This means that the caller is wrong about what the
// current ingress public key is, and we don't have anything we can publish.
log::error!(self.logger, "Report doesn't contain the expected public key even after report refresh: {:?} != {:?}", found_key, ingress_public_key);
return Err(Error::PublishReport);
Making a note that we will want to have enough information to take action in the error message - what is the node operator expected to do to resolve this scenario?
I have no idea how this would happen; we would have to investigate to know what is wrong. It would probably mean that some cache involved in creating the reports is in a bad state that is hard to reproduce.
If it doesn't resolve itself, then maybe we could try killing the server and making a backup node active instead.
LGTM - would request a review from either @eranrund or @jcape, and for @joekottke or @jgreat to approve for understanding how they should handle the error scenario
@@ -222,8 +259,7 @@ where
    pub fn new_keys(&self) -> Result<IngestSummary, Error> {
        let mut state = self.get_state();
        self.new_keys_inner(&mut state)?;
        self.write_state_file_inner(&mut state);
We don't need to write the state file here, because new_keys_inner is doing that.
This passed deployment.
LGTM, although I am wondering if we should move all report refreshing to the worker thread, and add a way to signal it to force a refresh now rather than waiting for the internal period to elapse. What would be the implications of moving the report refreshing to happen asynchronously that way?
The advantage of doing that would be that all IAS interaction (network calls) happens from a single thread - so it doesn't block anything, and also we are not doing two refreshes concurrently.
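One way to get that "single worker plus a force-refresh signal" shape is a `Condvar` with a timeout, sketched here with hypothetical names (`RefreshSignal`, `request_refresh`, `wait_for_cycle`); this is an illustration of the suggestion, not code from the PR.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

struct RefreshSignal {
    pending: Mutex<bool>,
    cvar: Condvar,
}

impl RefreshSignal {
    fn new() -> Self {
        RefreshSignal {
            pending: Mutex::new(false),
            cvar: Condvar::new(),
        }
    }

    // Any thread can call this to request an immediate refresh.
    fn request_refresh(&self) {
        *self.pending.lock().unwrap() = true;
        self.cvar.notify_one();
    }

    // The worker calls this each cycle: wait up to `period`, then report
    // whether a forced refresh was requested (vs. an ordinary periodic wake).
    fn wait_for_cycle(&self, period: Duration) -> bool {
        let guard = self.pending.lock().unwrap();
        let (mut pending, _timed_out) = self.cvar.wait_timeout(guard, period).unwrap();
        let forced = *pending;
        *pending = false;
        forced
    }
}
```

With this shape, all IAS interaction stays on the worker thread, and callers that need a refresh just signal and (if they care) wait for the next cycle to complete.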
/// The report cache worker is a thread responsible for periodically calling
/// update report cache. This is a separate thread so that it can be on a
/// time-based schedule, so it will happen even if there are few blocks.
pub struct ReportCacheWorker {
Maybe instead of duplicating it we should change mc_sgx_report_cache_untrusted
to accept a callback for getting the report data? Not something I feel strongly about, since in this case the implementation is very small.
So James and I talked about this -- the thing is, sometimes it is important for the main thread to be able to trigger a refresh and then block until that refresh is completed, because it makes the semantics of all of this easier. All the lines where we "reach out to a peer and make an attested connection" fail with "no report" if the report cache hasn't been populated, and then the server crashes, because of this line: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-0f4f3e64562ef1cbd4d4d712b8f5fcb08ef44066474f57a7005059a348293064R191

Another thing is, it makes it easier to reason about code like this: https://github.com/mobilecoinfoundation/fog/pull/116/files#diff-1bd75fa7d4e53379b9fe76ed5de3edd920c698f1d188a08ae34d2ba9fb414247R1103

Right now, I know that if we hit this log message, it means the refresh has already completed, so I can reason that either (1) the enclave key is somehow being allowed to change while the server is in the active state (big bug), or (2) the report cache mechanism is messed up somehow, and the caches either in the enclave or outside of it are not being flushed when we trigger the update (big bug). If that line merely triggered an asynchronous update that doesn't block the worker thread, then I can't reason in that way, because it may just be that the update hasn't finished yet.
It's worth drilling in on this -- in the current implementation, see fog/fog/ingest/server/src/controller.rs, line 70 in 5a776bd. So this serializes all calls to
I see. Thank you for the detailed explanation. The ability to block does make sense in the cases you've mentioned, and since the
At one point, we (I + someone...either you/sara/eran?) considered also keeping track of the last block where the report cache was updated. Would this still be useful to do? I imagine it would be useful for debugging, but I'm not sure.
@queenbdubs I think we should make a ticket to follow up on that. It sounds like a good idea, but I'm not sure if we should try to do it at this time in the release branch.
Is anything blocking this from getting merged?
@eranrund Sara wanted ops to sign off that they understand the change, and that they understand what should happen if they see this error about the report not matching the key in a loop or something. We are having a meeting about this right now.
TODO:
Some minor nitpicks that will simplify the dependencies of this (i.e. no need for sgx-types) -- could you also add an issue in mobilecoin for what needs to happen in ReportCache?
I'd also recommend a fog ticket for refactoring the enclave method to return bool.
LGTM!
This is requested in discussions with ops, to make better alerts
* fix FOG-180: enclave report cache updated too frequently
* fixups to previous (mostly around exactly when mutexes are released)
* Move key-extract function to fog-api crate, add test for new_keys call
* add some comments and debug logs
* fix module name
* add missing copyright notice
* simplify fog report parse code to avoid sgx-types dependency (thanks james)
* Add metric for last published report (requested in discussions with ops, to make better alerts)

Conflicts: Cargo.lock
* fix FOG-180: enclave report cache updated too frequently (#116)
* Make fog-report-cli use the new code for parsing keys out of fog reports
* remove unnecessary dependency