Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement time travel for NEAR testing infrastructure #3661

Closed
wants to merge 17 commits into from

Conversation

pczarn
Copy link

@pczarn pczarn commented Nov 28, 2020

Implements near/bounties#31

Document all occurences of Utc::now

  • Inspect whether they need to be replaced with a time proxy.

confirm with us the correctness of each workitem before moving on to the next one

Please review and confirm the correctness of point one:

  • Inspect all the occurrences of Utc::now and documenting which ones need to be replaced with a proxy.
  • Design a time-proxy that will return Utc::now without the adversarial feature flag, but otherwise enables manipulating time via an RPC end point
  • Implement the proxy, add the meta-test described above.

@pczarn
Copy link
Author

pczarn commented Nov 28, 2020

Upon basic inspection, the occurences of Instant::now are all related to timing or testing.

Copy link
Collaborator

@bowenwang1996 bowenwang1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the time_proxy module? Also I don't think we need to have a Utc submodule there

@@ -260,6 +260,7 @@ impl SealsManager {
};

let chosen = Self::get_random_part_ords(candidates);
// `sent` is not accessed, so we don't use the time proxy.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not used right now, but will be enabled after #3657 is merged

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix this catch

@pczarn
Copy link
Author

pczarn commented Dec 3, 2020

The time_proxy module is to-be-done as the second step. I believe I will invoke this stuff by doing use utils::time_proxy::UtcProxy, then calling UtcProxy::now, what do you think?

Alas, I caught a cold (this is a cold season after all) so this task will need to wait a few days.

@pczarn
Copy link
Author

pczarn commented Jan 12, 2021

I'm back to coding. I'll pick up this task shortly. Please excuse the delay.

@pczarn
Copy link
Author

pczarn commented Jan 24, 2021

I have the second sub-task compiling, and the last sub-task as a draft.

Could you think of more tests utilizing time travel? If we spend effort implementing the feature that enables testing, we might as well add more testing.

@pczarn
Copy link
Author

pczarn commented Mar 23, 2021

Without time travel, the test completes in 22 seconds. With time travel enabled, the test completes in 50 seconds half of the time, and gets stuck the other half of the time. The core simply does not work well with time changes.

edit: it's possible that my proxy replacements are wrong or insufficient or incongruent in the core.

edit: one proxy replacement was incorrect.

@pczarn pczarn force-pushed the adversarial-time-travel branch 3 times, most recently from d0e2886 to aeb233c Compare March 31, 2021 13:59
@pczarn pczarn changed the title [WIP] Implement time travel for NEAR testing infrastructure Implement time travel for NEAR testing infrastructure Mar 31, 2021
@pczarn
Copy link
Author

pczarn commented Mar 31, 2021

@bowenwang1996 @SkidanovAlex @chefsale @mfornet @pmnoxx please review, it's now passing the tests and checks.

@pczarn pczarn force-pushed the adversarial-time-travel branch 2 times, most recently from 317b39a to 9b19c1b Compare April 6, 2021 15:49
@pczarn pczarn force-pushed the adversarial-time-travel branch 2 times, most recently from c6560db to 6803ace Compare April 8, 2021 09:22
* Tests for another use case as well:
not stopping time, but instead slowing it down by 70%,
and making sure that on the same (relatively long)
period of time 70% fewer blocks are produced.
@SkidanovAlex
Copy link
Collaborator

Hi, @pczarn, the changes look good to me.
Let's remove the proxify map, I think it overly complicates the machinery (and make the time machine always affect all the locations when enabled).

Removes code that is not necessary for typical operation.
Makes time travel affect all proxied locations
when enabled.

You may want to revert this commit in case you'd wish
to hunt for incorrect proxifications using
`scripts/find_time_travel_setting.py`.
@pczarn
Copy link
Author

pczarn commented Apr 15, 2021

done in the latest 2 commits @SkidanovAlex

@pczarn
Copy link
Author

pczarn commented Apr 28, 2021

@SkidanovAlex please allocate some time for a review

print(f'block {h} {fr}->{to}')
return True

# consensus_config = {"consensus": {"min_num_peers": 0}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be removed?

@SkidanovAlex
Copy link
Collaborator

The code looks good!
Thanks @pczarn!

@bowenwang1996 let's not merge until I figure out the issue with cluster.py.

@bowenwang1996 bowenwang1996 added the S-donotmerge Status: do not merge this PR label May 17, 2021
@@ -154,7 +154,7 @@ fn end_count_instructions() -> u64 {
}

fn start_count_time() -> Consumed {
Consumed::Instant(Instant::now())
Consumed::Instant(Instant::system_time())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this change? We do not need system time here, we need a way to count monotonic ticks between two moments.
Instant::now does exactly what we need, i.e.

A measurement of a monotonically nondecreasing clock. Opaque and useful only with Duration.

While description of SystemTime

A measurement of the system clock, useful for talking to external entities like the file system or other processes.

is not behavior we're looking for here.

Copy link
Author

@pczarn pczarn Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you are misled by the fact that the change has no impact. The call still goes through to Instant::now as before. Still, I wanted to introduce this name to bring attention to the fact that we decided not to go through a proxy.

A more descriptive name such as system_now or unproxied_now could be used here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olonho does my comment resolve your doubts?

Copy link
Contributor

@olonho olonho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to use monotonic clock when measuring system elapsed time in estimator.

@stale
Copy link

stale bot commented Jul 6, 2021

This PR has been automatically marked as stale because it has not had recent activity in the 2 weeks.
It will be closed in 3 days if no further activity occurs.
Thank you for your contributions.

@stale
Copy link

stale bot commented Nov 12, 2021

This PR has been automatically marked as stale because it has not had recent activity in the 2 weeks.
It will be closed in 3 days if no further activity occurs.
Thank you for your contributions.

@stale stale bot added S-stale and removed S-stale labels Nov 12, 2021
@pczarn
Copy link
Author

pczarn commented Nov 13, 2021

@SkidanovAlex what's the status of this PR?

@bowenwang1996
Copy link
Collaborator

bowenwang1996 commented Nov 15, 2021

Hey @pczarn, due to the fact that this PR is pretty outdated, we decided to implement it from scratch instead of rebasing this PR on master. See (#5123) for details. Did you get the payment?

@pczarn
Copy link
Author

pczarn commented Nov 17, 2021

@bowenwang1996 yes I did receive the payment for this PR. This PR is a good chance for me to learn since I can look into somebody's reimplementation. I can already see some things that could have been better in this PR.

@pmnoxx
Copy link
Contributor

pmnoxx commented Nov 17, 2021

@pczarn I think that overall this is a good PR.However, in order to ship it safely a few steps would have been required.

In the ideal scenario we could have done it like this:

  • (introduce one source of time): We have multiple ways to get time. SystemTime, chrono:Utc, Instant.
    The instant has Instant::elapsed() method, which can get current time by itself, so replacing Instant with a wrapper is a must.
    However, having multiple types of wrappers in the code base for multiple types introduces unnecessary complexity.
    The best approach would be to start with introducing new wrapper for that, that can be used everywhere, take a look at feat: Create near-clock crate for mocking time - pt 1 #5177.

  • Implement a prototype for mocking - the structs are in (Mocking the system clock #5123). However, the mocking is not complete, it doesn't take care of Instant::elapsed and System time.

  • Then do the replacement step by step, ideally separately for each crate or group of crates. Like (chain/network ; chain/json* / chain/client; core/) - Ideally up to 10 files a a time, to make it easy to review

  • Add mocking, just like you did, and replace now(), elapsed() methods.

  • Also rust runs tests in parallel, so we have to handle it somehow. We decided on propagating TimeFactory throughout the code. And that will be a big change as well.

@bowenwang1996
Copy link
Collaborator

@bowenwang1996 yes I did receive the payment for this PR. This PR is a good chance for me to learn since I can look into somebody's reimplementation. I can already see some things that could have been better in this PR.

Thanks for your contribution! If you see places for improvements, feel free to submit new PRs.

@pmnoxx pmnoxx mentioned this pull request Nov 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-donotmerge Status: do not merge this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants