Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial change to consolidate get watcher timestamp #123

Closed
wants to merge 1 commit into from

Conversation

tubi70
Copy link
Contributor

@tubi70 tubi70 commented Aug 12, 2021

Soundtrack of this PR: link to song that really fits the mood of this PR

Motivation

Currently the watcher database provides an API for getting block signatures and from there we can get timestamps for a block. However, sometimes the watcher fails or gets stuck. Both fog ingest and fog ledger have code for handling these errors and retries. Because the clients need a timestamp on every piece of data in order to function, failing to get a timestamp blocks the service, and immediately escalates to a P0 issue. We now have the code for getting the timestamps from the watcher in two places — in the fog-ingest worker thread and in the fog ledger worker thread, and specifically a function "get_watcher_timestamp" .

fn get_watcher_timestamp(

that is duplicated in both places.

In this PR

It would be better to have a crate like "fog-timestamps" to provides this function, and reconcile any differences between the two versions, and test this. So that we know it works the same way everywhere, and this will make it easier to debug if fog gets blocked on this in production (which has happened before, and will happen again).

< Ticket status, e.g. "fixes #issue number" >
https://app.asana.com/0/1200353042931237/1200688378834228/f

Future Work

  • < Out of scope non-goals for this PR >
  • < These should be links to tickets. If the tickets do not exist, make them. >

@tubi70 tubi70 requested a review from cbeck88 August 12, 2021 12:37
@eranrund
Copy link
Contributor

Any reason to not have this in the mobilecoin repository? This is not fog-specific.
Since this heavily relies on mc-watcher, maybe it should live inside that crate?

Copy link
Contributor

@eranrund eranrund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but my main request would be to place this as a method in WatcherDB.

Ok((ts, res)) => match res {
TimestampResultCode::WatcherBehind => {
if watcher_behind_timer.elapsed() > watcher_timeout {
log::warn!(logger, "watcher is still behind on block index = {} after waiting {} seconds, ingest will be blocked", block_index, watcher_timeout.as_secs());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ingest no longer applies here. Applies to other places below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please let me know if below is the change you are requesting:

log::warn!(logger, "watcher is still behind on block index = {} after waiting {} seconds, caller will be blocked", block_index, watcher_timeout.as_secs());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to not have this in the mobilecoin repository? This is not fog-specific.
Since this heavily relies on mc-watcher, maybe it should live inside that crate?

Ping on this.

Comment on lines +13 to +14
/// proceed. But bringing the server down is costly from ops POV because
/// we will lose all the user rng's.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user rngs do not apply here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please confirm that below is a better replacement for the comment

/// we will lose all the blocks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will not be losing any blocks. We have no way of knowing what effect broken database invariants has on the caller since we don't know what the caller is doing.
I would change But bringing the server down is costly from ops POV because /// we will lose all the user rng's. to something like Generally when an invariant is violated we would panic, but this code is used in services that are expensive to restart (such as the ingest enclave).

@cbeck88
Copy link
Contributor

cbeck88 commented Aug 16, 2021

It's not super easy to see from the diff what the differences are here -- can you document what the change is in which server, and what you think the consequences of that might be?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants