Use tokio for scheduler requests and local process execution #5846

Merged

stuhood merged 5 commits into pantsbuild:master from twitter:stuhood/tokio on May 23, 2018

Conversation

stuhood (Member) commented May 19, 2018

Problem

As described in #5763, we're currently sharing a thread pool between filesystem access and process execution, and we'd like to be able to control the number of concurrent filesystem accesses and running processes independently.

Additionally (as mentioned in comments there), the proliferation of context switching between various threadpools is likely to bog us down as we add more resources to which we'd like to control (optionally blocking) access.

Solution

Introduce a dependency on tokio at the root of engine requests in the Scheduler, which ensures that all Futures not run explicitly on a CpuPool are running on a tokio Runtime.

Also add a dependency on tokio-process for local process execution, and remove usage of the filesystem pool for process execution.
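For illustration, here is a minimal sketch (not this PR's code; it assumes the tokio-process 0.2-era `CommandExt` API and uses `echo` as a stand-in command) of driving a local process on a tokio Runtime: the child's completion surfaces as a Future polled by the Runtime, rather than by a thread blocked in a pool.

```rust
extern crate futures;
extern crate tokio;
extern crate tokio_process;

use std::process::Command;

use futures::Future;
use tokio_process::CommandExt;

fn main() {
  // output_async() returns a Future of the process's Output, so no pool
  // thread blocks while the child runs.
  let child = Command::new("echo")
    .arg("hello")
    .output_async()
    .map(|output| println!("exited with: {}", output.status))
    .map_err(|e| panic!("failed to run process: {}", e));

  // Drive the Future to completion on the tokio Runtime.
  tokio::run(child);
}
```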

Finally, add AsyncSemaphore to control access to resources without blocking or context switching.
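To make that concrete: the PR's actual implementation isn't shown in this thread, so the following is a simplified, assumption-level sketch of how such a primitive can be built on futures 0.1 task notification. It deliberately omits waiter deduplication and fairness, which a production version needs.

```rust
extern crate futures;

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

use futures::task::{self, Task};
use futures::{Async, Future, Poll};

struct Inner {
  available: usize,
  // Tasks parked while waiting for a permit. A real implementation must
  // avoid queuing the same task twice, and should wake waiters fairly.
  waiters: VecDeque<Task>,
}

#[derive(Clone)]
pub struct AsyncSemaphore {
  inner: Arc<Mutex<Inner>>,
}

struct PermitFuture {
  inner: Arc<Mutex<Inner>>,
}

impl Future for PermitFuture {
  type Item = ();
  type Error = ();

  fn poll(&mut self) -> Poll<(), ()> {
    let mut inner = self.inner.lock().unwrap();
    if inner.available > 0 {
      // A permit is free: take it and resolve immediately.
      inner.available -= 1;
      Ok(Async::Ready(()))
    } else {
      // No permit: park this task to be notified on release. No thread
      // blocks, and no context switch to another pool occurs.
      inner.waiters.push_back(task::current());
      Ok(Async::NotReady)
    }
  }
}

impl AsyncSemaphore {
  pub fn new(permits: usize) -> AsyncSemaphore {
    AsyncSemaphore {
      inner: Arc::new(Mutex::new(Inner {
        available: permits,
        waiters: VecDeque::new(),
      })),
    }
  }

  ///
  /// Run the Future produced by `f` once a permit is available, releasing
  /// the permit (and waking one waiter) when it completes, pass or fail.
  ///
  pub fn with_acquired<F, B, T, E>(&self, f: F) -> Box<dyn Future<Item = T, Error = E> + Send>
  where
    F: FnOnce() -> B + Send + 'static,
    B: Future<Item = T, Error = E> + Send + 'static,
    T: Send + 'static,
    E: Send + 'static,
  {
    let inner = self.inner.clone();
    Box::new(
      PermitFuture { inner: self.inner.clone() }
        .map_err(|()| unreachable!())
        .and_then(move |()| {
          f().then(move |res| {
            // Return the permit and notify one parked waiter.
            let mut inner = inner.lock().unwrap();
            inner.available += 1;
            if let Some(task) = inner.waiters.pop_front() {
              task.notify();
            }
            res
          })
        }),
    )
  }
}
```

Usage then looks like `semaphore.with_acquired(|| run_process(req))`, where `run_process` is a hypothetical Future-returning function: the wrapped work only starts once a permit is held.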

Result

Fixes #5763 and unblocks usage of tokio-timeout immediately, and tokio-fs once tokio-rs/tokio#369 is resolved.

@stuhood stuhood requested review from jsirois, illicitonion and dotordogh May 19, 2018

stuhood (Member) commented May 19, 2018

Each commit should be useful to review independently.

@stuhood stuhood force-pushed the twitter:stuhood/tokio branch 3 times, most recently from 6af0a52 to ac7a7e8 May 19, 2018

dotordogh (Contributor) left a comment

Cool! That AsyncSemaphore lib is interesting! Thanks for including me!

@@ -24,10 +24,10 @@ where
// Sadly there is no way to accept an Fn() -> T because it's not Sized, so we need to accept an

dotordogh (Contributor) commented May 20, 2018

This doesn't seem relevant anymore...

stuhood (Member) commented May 21, 2018

Blocked by #5850.

EDIT: Rebased onto #5850.

kwlzn (Member) approved these changes May 21, 2018

lgtm, afaict

authors = [ "Pants Build <pantsbuild@gmail.com>" ]

[dependencies]
futures = "0.1"

kwlzn (Member) commented May 21, 2018

is the version regression from 0.1.16 -> 0.1 intended here? weren't there changes that were upstreamed in later versions of futures that we relied on?

stuhood (Member) commented May 21, 2018

In cargo-land, 0.1 is "anything that is compatible with 0.1"... so 0.1.*, approximately. The contents of the lockfile pin down a particular version, so making this less specific is safe from a build-reproducibility standpoint.
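(For reference, per Cargo's documented semver rules rather than anything stated in this thread: a bare version requirement is an implicit caret requirement, so the two forms below differ only in the minimum they enforce.)

```toml
[dependencies]
# "0.1" means "^0.1": any semver-compatible 0.1.x, i.e. >=0.1.0, <0.2.0.
futures = "0.1"

# "0.1.16" means "^0.1.16": >=0.1.16, <0.2.0. This still lets the resolver
# unify on a later 0.1.x (e.g. 0.1.21) while enforcing the 0.1.16 minimum.
# futures = "0.1.16"
```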

@stuhood stuhood force-pushed the twitter:stuhood/tokio branch from ac7a7e8 to d41a44e May 21, 2018

illicitonion (Contributor) left a comment

This looks great, thanks! Much cleaner :)

@@ -25,6 +28,7 @@ pub struct Core {
pub rule_graph: RuleGraph,
pub types: Types,
pub fs_pool: Arc<ResettablePool>,
pub runtime: Resettable<Arc<Runtime>>,

illicitonion (Contributor) commented May 22, 2018

Can you add a comment here explaining when runtime should be used (and that we should transition more to it over time)?
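Something along these lines (hypothetical wording, not text from the PR) would address the request:

```rust
///
/// The tokio Runtime at the root of all engine requests: Futures not run
/// explicitly on a CpuPool execute here, and new code should prefer it, as
/// we intend to migrate more work onto the Runtime over time.
///
pub runtime: Resettable<Arc<Runtime>>,
```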

///
/// You should generally prefer to use `with_acquired`, as it is less error prone.
///
fn acquire(&self) -> PermitFuture {

illicitonion (Contributor) commented May 22, 2018

Maybe inline this? Can be extracted again if actually required.

(I'm generally wary of standalone methods which tell you not to use them, particularly if they're actually unused :))

stuhood (Member) commented May 23, 2018

Mm, yep. It started out public before I added with_acquired.

// Unblock thread1 and confirm that thread2 acquires.
unblock_thread1.send(()).unwrap();
acquired_thread2
.recv_timeout(Duration::from_secs(5))

illicitonion (Contributor) commented May 22, 2018

This timeout should be the same as the recv timeout above - they're testing the same effect.

Ideally they'd both be more in the microsecond range - second-long blocking on an available semaphore isn't great, and even 500ms is pretty huge, but I guess there's a trade-off to be made for test reliability... Maybe 100ms is a reasonable compromise?

stuhood (Member) commented May 23, 2018

The 5 second timeouts are what I refer to as "negative timeouts" (ie, we expect them not to be reached), and the 500 millisecond timeout is a "positive timeout" because we expect it to be reached. Setting negative timeouts nice and high helps to reduce flakiness.

Will lower the positive timeout to 100ms though.
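To illustrate the distinction stuhood describes (a sketch with hypothetical names, not the PR's test code):

```rust
use std::sync::mpsc;
use std::time::Duration;

fn expect_acquired(acquired_thread2: &mpsc::Receiver<()>) {
  // Negative timeout: acquisition is expected to happen, so the generous 5s
  // bound is only ever paid in full when the test is already failing.
  acquired_thread2
    .recv_timeout(Duration::from_secs(5))
    .expect("thread2 should have acquired the semaphore");
}

fn expect_still_blocked(acquired_thread2: &mpsc::Receiver<()>) {
  // Positive timeout: the timeout is expected to be reached (thread2 should
  // stay blocked), so every run pays the full wait; keep it short.
  assert!(
    acquired_thread2
      .recv_timeout(Duration::from_millis(100))
      .is_err()
  );
}
```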

illicitonion (Contributor) commented May 23, 2018

Yeah, I get the thinking here, but I still think they should be the same. My reasoning goes:

If the test would be flaky with the 5s timeout lowered to 500ms, that means that sometimes it takes more than 500ms for us to wake up the thread.

Which means that waiting 500ms isn't waiting long enough to verify that we won't wake up the thread, because we know that sometimes it takes longer than that.

Which means that our check which is trying to say "we never wake up the thread while the semaphore should be unavailable" needs to use a timeout bigger than 500ms to check that we never wake up the thread, because we expect it to sometimes take more than 500ms (which is why the test is flaky).

So we need the values to be the same, because the amount we need to wait to convincingly guarantee that the thread will be woken up is the same amount of time we need to wait to convincingly guarantee that the thread won't be woken up.

stuhood (Member) commented May 23, 2018

> Which means that waiting 500ms isn't waiting long enough to verify that we won't wake up the thread, because we know that sometimes it takes longer than that.

Well, we don't know what the worst case is, no. But precisely identifying the worst case is hard, and thus tightly bounding the worst case from both the top and the bottom of its potential range is hard. And so choosing one timeout for both cases is the hardest of all.

illicitonion (Contributor) commented May 29, 2018

> Which means that waiting 500ms isn't waiting long enough to verify that we won't wake up the thread, because we know that sometimes it takes longer than that.
>
> Well, we don't know what the worst case is, no. But precisely identifying the worst case is hard, and thus tightly bounding the worst case from both the top and the bottom of its potential range is hard. And so choosing one timeout for both cases is the hardest of all.

The problem I'm trying to highlight is that the two bounds are aiming to supply different guarantees.

The first wait is a correctness wait; the higher the value, the more confidence we have that the test is actually correct (but the slower our tests run). Decreasing this value decreases our confidence that the thread is never woken up, because maybe it would be if we just waited another millisecond.

The second wait is a deflaking wait; the higher the value, the less likely the test is to spuriously fail. The lower the value, while the test consistently passes, the tighter a bound we give on how fast the semaphore can be re-acquired.

The reason I think these should be the same is that we want the first wait to be as low as we're confident lowering it to (so that our tests run fast) [1]; the way that we get this confidence is by lowering the second wait (to the same value). If we see flakiness with the second wait, we need to increase both values - the second wait consistently succeeding shows that we waited long enough to have confidence that the first wait was actually meaningful. Otherwise nothing tells us that the first wait was a long enough wait to be confident in it.

1: We also want the second wait to be as low as we're confident lowering it to, because we actually do want to guarantee that our semaphore can be acquired quickly - if the test is flaky with a 1s timeout, that's genuinely a bug in the code - we need to be able to acquire a semaphore quicker than that.

@@ -74,3 +78,31 @@ pub trait CommandRunner: Send + Sync {

fn reset_prefork(&self);
}

///
/// A CommandRunner wrapper that limits the number of concurrent requests.

illicitonion (Contributor) commented May 22, 2018

Forward-looking: I'm not sure how we should handle this for the remote::CommandRunner case - the remote::CommandRunner has its own threadpool for the grpc environment, which has a fixed size, and having multiple concurrency limits doesn't seem ideal.

Then again, maybe just switching to https://github.com/tower-rs/tower-grpc at some point and using tokio to drive grpc is the answer.

stuhood (Member) commented May 23, 2018

Right... I fully expect us to switch to tokio for grpc at some point.

@@ -19,6 +19,8 @@ resettable = { path = "../resettable" }
sha2 = "0.6.0"
tempdir = "0.3.5"
futures-timer = "0.1"
# TODO: A master sha that is likely to be released as 0.2.1.

illicitonion (Contributor) commented May 22, 2018

Can you comment here on what specifically we need which isn't in 0.2.0?

@@ -66,10 +68,12 @@ boxfuture = { path = "boxfuture" }
enum_primitive = "0.1.1"
fnv = "1.0.5"
fs = { path = "fs" }
futures = "0.1.16"
futures = "0.1"

illicitonion (Contributor) commented May 22, 2018

We actually have a hard dep on 0.1.16 as a minimum; I remember bumping it at some point. I just tried forcing 0.1.15 and the build failed, whereas forcing 0.1.16 was successful.

I've been meaning to write a little utility or something which tries to build a crate actually forcing all deps to their minimum specified compatible version...

stuhood (Member) commented May 23, 2018

I needed to remove this constraint in order to not see a failure where tokio selected one version (0.1.21) and our explicitly selected version interfered. But I think it should work with a "caret requirement"? https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html

ity (Member) approved these changes May 22, 2018

lgtm!

@stuhood stuhood merged commit a6eaef7 into pantsbuild:master May 23, 2018

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

@stuhood stuhood deleted the twitter:stuhood/tokio branch May 23, 2018
