-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
initial change to consolidate get watcher timestamp #123
Conversation
Any reason to not have this in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits, but my main request would be to place this as a method in WatcherDB
.
Ok((ts, res)) => match res { | ||
TimestampResultCode::WatcherBehind => { | ||
if watcher_behind_timer.elapsed() > watcher_timeout { | ||
log::warn!(logger, "watcher is still behind on block index = {} after waiting {} seconds, ingest will be blocked", block_index, watcher_timeout.as_secs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ingest
no longer applies here. Applies to other places below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please let me know if below is the change you are requesting:
log::warn!(logger, "watcher is still behind on block index = {} after waiting {} seconds, caller will be blocked", block_index, watcher_timeout.as_secs());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not have this in the
mobilecoin
repository? This is not fog-specific.
Since this heavily relies onmc-watcher
, maybe it should live inside that crate?
Ping on this.
/// proceed. But bringing the server down is costly from ops POV because | ||
/// we will lose all the user rng's. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user rngs do not apply here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please confirm that below is a better replacement for the comment
/// we will lose all the blocks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will not be losing any blocks. We have no way of knowing what effect broken database invariants has on the caller since we don't know what the caller is doing.
I would change But bringing the server down is costly from ops POV because /// we will lose all the user rng's.
to something like Generally when an invariant is violated we would panic, but this code is used in services that are expensive to restart (such as the ingest enclave).
It's not super easy to see from the diff what the differences are here -- can you document what the change is in which server, and what you think the consequences of that might be? |
Soundtrack of this PR: link to song that really fits the mood of this PR
Motivation
Currently the watcher database provides an API for getting block signatures and from there we can get timestamps for a block. However, sometimes the watcher fails or gets stuck. Both fog ingest and fog ledger have code for handling these errors and retries. Because the clients need a timestamp on every piece of data in order to function, failing to get a timestamp blocks the service, and immediately escalates to a P0 issue. We now have the code for getting the timestamps from the watcher in two places — in the fog-ingest worker thread and in the fog ledger worker thread, and specifically a function "get_watcher_timestamp" .
fog/fog/ingest/server/src/worker.rs
Line 151 in 5a776bd
that is duplicated in both places.
In this PR
It would be better to have a crate like "fog-timestamps" to provides this function, and reconcile any differences between the two versions, and test this. So that we know it works the same way everywhere, and this will make it easier to debug if fog gets blocked on this in production (which has happened before, and will happen again).
< Ticket status, e.g. "fixes #issue number" >
https://app.asana.com/0/1200353042931237/1200688378834228/f
Future Work