Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable concurrent runs for pantsd in v2 #7654

Open
ity opened this issue May 2, 2019 · 17 comments
Open

Re-enable concurrent runs for pantsd in v2 #7654

ity opened this issue May 2, 2019 · 17 comments

Comments

@ity
Copy link
Contributor

ity commented May 2, 2019

pantsd currently prevents concurrent runs because of global mutable singletons that were pervasive in v1. In v2, there are none of these left, and so only concurrent access to stdio needs to be managed.

We would like to allow for concurrent runs under pantsd, without the use of the PANTS_CONCURRENT=True flag (ie, that flag would become a noop and then be deprecated).


A sketch of what this will likely involve, in at least four PRs:

  1. Move to the rust nailgun client, to avoid needing to implement any client-side logic twice. Fixed in Re-land the port of Pants' nailgun client to Rust #11147.

  2. cancellation needs to move to heartbeat-based

    • Add a "canceled" bool/sync primitive (probably a watch) to Session, and propagate it in from the read half of the connection closing.
    • Client closes the write half of the socket, then waits for the server to close the other half (might require additional support in nails).
    • Cancellation bool consumed in all relevant places:
      • InteractiveProcess: should move to spawning the process and then asynchronously waiting for completion.
      • Graph: should cancel ongoing work if the client/Session that started it goes away (and let any existing clients restart it).
    • Document this new behavior of the pantsd lifecycle somewhere.
  3. remaining singletons need to be located and fixed

    • All access to stdio in both the rust code and the python code should be replaced with access to Session-specific files, and the sys.std* file handles should be closed, poisoned, or replaced with synthetic thread-local files (á la).
    • All remaining usages of Subsystem.global_instance should be removed.
  4. allow concurrent runs in DaemonPantsRunner

    • Make PantsDaemonCore a container for multiple Schedulers (rather than the current singular Scheduler) to allow for concurrent access with different options. Also, consider porting the concurrent Scheduler management to Rust?
    • Remove the DaemonPantsRunner._one_run_at_a_time lock.
@ity ity added this to TODO in Pants Daemon via automation May 3, 2019
@blorente blorente added the pantsd label May 8, 2019
stuhood added a commit that referenced this issue May 23, 2020
### Problem

Pants' runtime overhead is influenced by multiple factors:
1) packaging overhead in virtualenvs and pexes
2) import time for python code
3) time to walk the filesystem and (re)fingerprint inputs
4) time to run `@rule`s with unchanged inputs

Pantsd helps to address all of these issues, and has just (re-)reached maturity.

### Solution

Prepare to enable `pantsd` by default by deprecating not setting the `enable_pantsd` flag. Also, set the flag for pantsbuild/pants.

### Result

noop time when running in a virtualenv (such as in [github.com/pantsbuild/example-python](https://github.com/pantsbuild/example-python)) drops from `~2.4s` to `~1.6s`. Followup work (moving fingerprinting to the server, porting the client to rust) is expected to push this below `500ms`.

There are a collection of known issues that might impact users tracked in https://github.com/pantsbuild/pants/projects/5, some of which we will address via dogfooding. Others are matters of expectation: "I wouldn't expect that to work".

One of the most significant issues though is #7654: we should consider making a push before `1.29.0` to remove use of global mutable state to allow for concurrent pants runs with pantsd under v2.

Fixes #4438.
@stuhood stuhood modified the milestones: 1.29.x, 1.30.x May 29, 2020
@stuhood stuhood moved this from TODO to In Progress in Pants Daemon Jun 2, 2020
@stuhood stuhood self-assigned this Jun 2, 2020
@stuhood stuhood moved this from In Progress to TODO in Pants Daemon Jun 3, 2020
stuhood added a commit that referenced this issue Jun 3, 2020
### Problem

The `PantsDaemon` class was playing double duty as both the entrypoint for the pantsd server, and as a launcher for the client. We will soon need to make significant changes there in order to support #8200 and #7654.

### Solution

Extract a `PantsDaemonProcessManager` base class for process metadata reads/writes and a `PantsDaemonClient` subclass to consume the metadata to decide whether to launch the server. This is 100% code moves... no logic changed.

### Result

The `PantsDaemon` class is now exclusively a server, and `PantsDaemonClient` is now exclusively a client.

[ci skip-rust-tests]
[ci skip-jvm-tests]
@stuhood stuhood moved this from TODO to In Progress in Pants Daemon Jun 4, 2020
@stuhood stuhood moved this from In Progress to TODO in Pants Daemon Jun 12, 2020
stuhood added a commit that referenced this issue Jun 15, 2020
### Problem

Currently we restart pantsd for most configuration changes, and exclude a small set of bootstrap options (by marking them `daemon=False`) that should not trigger a restart. But in most cases, restarting is heavy-handed. We'd like to be able to keep more and more of our state alive over time as we continue to remove global mutable state (in order to allow us to tackle #7654, among other things).

Additionally, the pantsd client currently implements the fingerprinting that decides when the server should restart, which blocks moving the pantsd client to rust. We'd like the client to only need to interact with a small set of options to simplify its implementation.

### Solution

Move the nailgun server out of the `PantsService` model and directly into the `PantsDaemon` class. Share a `PantsDaemonCore` between the daemon and `DaemonPantsRunner` that encapsulates the current `Scheduler` and all live services. Finally, have the `PantsDaemonCore` implement fingerprinting to decide when to reinitialize/recreate the `Scheduler` (without restarting) and trim down the options that trigger a restart (`daemon=True`) to only those that are used to start the daemon itself (rather than to create the `Scheduler`).

### Result

`pantsd` will stay up through the vast majority of options changes (with the exception of a handful of "micro-bootstrap" options), and will instead reinitialize the `Scheduler` for bootstrap options changes with some useful output when it does so.

Example:
```
$ ./pants help
23:26:22 [INFO] initializing pantsd...
23:26:24 [INFO] pantsd initialized.
Pants 1.30.0.dev0 https://pypi.org/pypi/pantsbuild.pants/1.30.0.dev0

Usage:
<snip>

$ ./pants --no-v1 help
23:26:31 [INFO] initialization options changed: reinitializing pantsd...
23:26:32 [INFO] pantsd initialized.
Pants 1.30.0.dev0 https://pypi.org/pypi/pantsbuild.pants/1.30.0.dev0

Usage:
<snip>
```

This prepares to port the client to rust, and unblocks a fix for #8200, by having the `PantsDaemon` class tear down the nailgun server cleanly in the foreground if any services exit. Fixes #6114, fixes #7573, and fixes #10041.
@stuhood stuhood removed this from the 1.30.x milestone Jun 17, 2020
@stuhood
Copy link
Sponsor Member

stuhood commented Jun 26, 2020

Got another request for this. Possibly worth tackling pre-2.0 in case there are API changes necessary? Unclear.

stuhood added a commit to stuhood/pants that referenced this issue Jul 7, 2020
…k in future versions (via pantsbuild#7654) that implementation will not require nesting to validate: only concurrent runs.
stuhood added a commit that referenced this issue Jul 7, 2020
### Problem

Since #8265, we have been running `PantsRunIntegrationTest`s from loose sources in the repository, which are included via `src/python/pants/testutil:int-test`'s dependency on `src/python/pants/bin:pants_local_binary`. That change thus removed the need for the `pants.pex` from our wrapper scripts.

### Solution

Remove the "`pants.pex` for integration tests" mechanism. Post #8625, pants is run from the PYTHONPATH of the test target, which will automatically include either loose sources or a `pants_requirement` target, depending on whether the test is run in or out of the pantsbuild/pants repo.

Additionally, remove one set of tests that was attempting to test that nested runs of pants do not deadlock. #7654 will eventually allow that issue to be resolved by allowing the concurrent run rather than via any sort of automatic disabling of `pantsd`, and that is much easier to test via concurrent runs _without_ nesting.

[ci skip-rust-tests]
@stuhood stuhood moved this from TODO to In Progress in Pants Daemon Jul 7, 2020
stuhood added a commit that referenced this issue Jul 11, 2020
…#10320)

### Problem

To prepare for #7654, we need to remove dependence on signal handling between the pantsd client and server. Signals do not provide enough information to decide which run to abort, and in general are not particularly subtle.

### Solution

Rather than signals, we will instead rely on nailgun connection liveness via the "Heartbeat" extension to the nailgun protocol (implemented in [nails 0.6.0](stuhood/nails@fb72699...070fe03)), which allows for requiring regular heartbeats from the client, and aborting a run if they do not arrive.

In addition to heartbeats, `nails` now also makes guarantees about how a `Nail` will be notified that the connection has closed (namely that the input stream will remain open until the connection dies).

### Result

Followup changes (including #10004) will be able to take advantage of these new signals to cancel ongoing work without killing the pantsd server.
@stuhood stuhood moved this from In Progress to TODO in Pants Daemon Jul 14, 2020
@stuhood
Copy link
Sponsor Member

stuhood commented Feb 17, 2021

More progress on this. A draft of the plugin resolution change is out at #11568. Unfortunately, I'm out for the next 24 hours, and will have some planning work to do next week: am hoping to get both #11568 and #11536 green before the end of the week, but if not this might slip a bit further.

Sorry for all of the delays!

stuhood added a commit that referenced this issue Feb 22, 2021
### Problem

#11536 moves from using POSIX-level replacement of the `stdio` file descriptors `{0, 1, 2}`, to replacing `sys.std*` with a thread-local implementation. Unfortunately, Python `subprocess`/`Popen` APIs hardcode `{0, 1, 2}` rather than actually inspecting `sys.std*.fileno()`, and so usages of those APIs that use `std*=None` (i.e. "inherit" mode), will inherit intentionally dead/closed file handles under `pantsd`.

PEX uses inherit mode when spawning `pip` (in order to pass through `stderr` to the parent process), and this runs afoul of the above behavior. Pants has used PEX as a library for a very long time, but 95% of usecases have now migrated to using PEX as a binary, with wrappers exposed via the `@rule` API. The `PluginResolver` that Pants uses to resolve its own code is the last usage of PEX as a library. 

### Solution

In a series of commits, introduce a "bootstrap" `Scheduler` that is able to resolve plugins after an `OptionsBootstrapper` has been created, but before creating the `BuildConfiguration` (which contains plugin information).

### Result

Although Pants has some stray references to PEX APIs, it no longer uses PEX as a library to resolve dependencies. #11536 and #7654 are unblocked.

In future, if the options required to construct a minimal scheduler can be further pruned, the bootstrap `Scheduler` might also be able to be used to create the `OptionsBootstrapper`, which would allow for addressing #10360.
@stuhood
Copy link
Sponsor Member

stuhood commented Feb 27, 2021

Hey folks! In the homestretch on #11536, with two test failures and a console-width-reporting issue to look into. Very optimistic that it will land next week. Once it does, I'm optimistic that the final PR for this change will be significantly smaller. Thanks for your patience.

stuhood added a commit that referenced this issue Mar 1, 2021
)

### Problem

In order to support concurrent `pantsd` clients, work spawned by each client (even into background threads) must interact with the relevant `stdio` file handles. And when a client disconnects, spawned work should be able to continue to log to the `pantsd.log`.

### Solution

Extract the thread/task-local aspects of our `logging` crate into a `stdio` crate that provides:
1) a `Console`-aware `logging` destination
2) exclusive access to a `Console` while a UI or `InteractiveProcess` is running
3) Python-level `sys.std*` replacements

### Result

No user-visible impact, but one of the largest remaining blockers for #7654 is removed.

[ci skip-build-wheels]
@stuhood
Copy link
Sponsor Member

stuhood commented Mar 3, 2021

#11536 landed yesterday, and appears stable so far. I've drafted the change to enable concurrent runs, and so far it looks good! Very optimistic that this will be able to land before the end of the week.

stuhood added a commit that referenced this issue Mar 9, 2021
### Problem

In order to allow concurrent runs of Pants in #11639 (part of #7654), all global mutable singletons must be either 1) thread-local, 2) replaced with explicitly passed values. `os.environ` and `sys.argv` are two of the last cases we're concerned with.

### Solution

Although we could hypothetically redefine `os.environ` and `os.getenv` as thread-local, they seem to be a better fit to be explicitly passed. So this change:
1. Renames (and moves) `PantsEnvironment` to `CompleteEnvironment`
2. Adds `Environment` and `EnvironmentRequest`, which filter the `CompleteEnvironment` to a smaller subset, and should be used more frequently by `@rules`.
3. Fixes a few cases of implicit `os.environ` usage in the `PythonSetup` subsystem by explicitly selecting interpreter-selection env vars (`HOME`/`PATH`/`PYENV_ROOT`)

Finally, we empty the environment for `pantsd` runs to enforce explicit usage of `CompleteEnvironment`/`Environment`. 

### Result

The last (?) global mutable variable blocking #7654 is removed. 

[ci skip-build-wheels]
@stuhood
Copy link
Sponsor Member

stuhood commented Jun 17, 2021

So, unfortunately, #10827 was a bit too optimistic about being able to support concurrent sessions with SessionValues. The issue is that although the SessionValues can be consumed by uncacheable nodes, we cannot know ahead of time which SessionValue value the transitive dependents of a node in the graph will end up transitively observing. This inhibits concurrency: we'd effectively have to go back to including the OptionsBootstrapper in node identities all the way down to where it was consumed.

But I think that I have a new solution that would replace SessionValues (argh, flip-flop... sorry) based on revisiting #6845 via #11269. Effectively: implicit conversion means that we fully convert to the (transitive) argument type of a @rule before invoking it. In practice, that should mean that at the root of the graph, we would start by executing implicit conversions from the OptionsBootstrapper to any relevant subsystems that are computed from it, and those would end up being the Params passed down into the graph. AFAICT, this accomplishes #6845 ("param minimization") using the new requirement from #11269: that all @rule arguments must be hashable.

So, in short: this ticket is blocked on #11269. I have a few days progress on #11269, and I'm optimistic that it will be simpler than our existing rule graph construction, as well as unlocking things like this.

@Eric-Arellano
Copy link
Contributor

Thanks @stuhood. Makes sense. That sounds good to focus on #11269.

@stuhood
Copy link
Sponsor Member

stuhood commented Feb 22, 2022

In order for the bsp builtin goal from #13260 to not prevent other Pants runs while it is open, we'll likely want to move the current locking onto Scheduler.execute instead, as that will allow some interleaving of runs.

@Moortiii
Copy link
Contributor

Is the end goal for this issue solely that simultaneous invocations of pants would act as if each invocation was ran with PANTS_CONCURRENT=True or would it also bring other improvements such as increased concurrent performance?

@stuhood
Copy link
Sponsor Member

stuhood commented Sep 11, 2023

Is the end goal for this issue solely that simultaneous invocations of pants would act as if each invocation was ran with PANTS_CONCURRENT=True

Hey @Moortiii : yes, that is the goal. It would not be expected to impact the performance of un-contended runs.

@chris-stetter
Copy link

This would be great for use case 1 as discussed in #20642 (comment). We currently set --concurrent for long running tasks so we can still run other pants commands. This is not optimal of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Status: No status
Pants Daemon
  
In Progress
Development

Successfully merging a pull request may close this issue.

7 participants