Fix `find_binary` rule to not use the cache #10751

Eric-Arellano · 2020-09-09T02:31:48Z

Because this rule depends entirely on the state of the external environment, it's more correct to not cache it. For example, if a user uninstalls a program or installs a new one, Pants needs to detect that automatically.

This does not fix every issue. The original motivation for this PR is Pyenv, where we want pyenv global 2.7.15 to cause invalidation for most Pex processes. Unfortunately, this does not fix it because Pyenv uses a single folder .shims/, so the output of this rule is simply path/to/.shims and things stay the same. But, this PR still improves the situation.
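The limitation described above can be illustrated with a small, self-contained sketch. It is not the Pants rule: `find_binary` here is a hypothetical stand-in that returns the directory containing the binary, and the `shims` and `version` files are a toy layout assumed only for illustration. The point is that switching the selected version leaves the discovered path unchanged.

```python
import os
import stat
import tempfile
from typing import List, Optional


def find_binary(name: str, search_path: List[str]) -> Optional[str]:
    """Hypothetical stand-in: return the first directory holding an executable `name`."""
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return directory
    return None


with tempfile.TemporaryDirectory() as root:
    # Toy layout (assumption): one shim directory plus a file naming the active version.
    shims = os.path.join(root, "shims")
    os.mkdir(shims)
    shim = os.path.join(shims, "python")
    with open(shim, "w") as f:
        f.write("#!/bin/sh\n# a real shim would dispatch to the selected interpreter\n")
    os.chmod(shim, os.stat(shim).st_mode | stat.S_IXUSR)

    version_file = os.path.join(root, "version")

    def set_global(version: str) -> None:
        # Stands in for switching the globally selected interpreter.
        with open(version_file, "w") as f:
            f.write(version)

    set_global("3.8.5")
    before = find_binary("python", [shims])
    set_global("2.7.15")
    after = find_binary("python", [shims])

    # The rule's output is identical before and after the switch, so nothing
    # keyed on that output would be invalidated.
    same_output = before == after
```

Under these assumptions, rerunning the lookup on every run does not help by itself: the value it produces would have to reflect the selected interpreter for the switch to be noticed.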

[ci skip-rust]
[ci skip-build-wheels]


benjyw · 2020-09-09T02:57:46Z

src/python/pants/engine/internals/uuid.py

@@ -4,6 +4,7 @@
 import random
 import uuid
 from dataclasses import dataclass, field
+from uuid import UUID as UUID


Because it's convenient to have centralized import statements. This is exporting it.

We do this type of centralized import in a couple places, like rules.py exporting Get and MultiGet.

Sure, but those are our own internal types, so we can choose where to make them publicly available. This is a python stdlib type, and it's more confusing than convenient to export it as-if it's an internal type.

benjyw

Unclear to me that this is really the behavior we want.

benjyw · 2020-09-09T02:59:54Z

src/python/pants/backend/python/goals/pytest_runner.py

@@ -42,7 +41,7 @@
 from pants.core.util_rules.source_files import SourceFiles, SourceFilesRequest
 from pants.engine.addresses import Addresses
 from pants.engine.fs import AddPrefix, Digest, DigestSubset, MergeDigests, PathGlobs, Snapshot
-from pants.engine.internals.uuid import UUIDRequest
+from pants.engine.internals.uuid import UUID, UUIDRequest


This just seems confusing and adds cognitive load: "What is this seemingly-custom UUID class? Oh it's actually just a superfluous alias for the well-known Python stdlib class."

benjyw · 2020-09-09T03:01:20Z

src/python/pants/engine/process.py

+
+    # We get a UUID so that we ignore the cache every time, as this script depends on the state of
+    # the external environment.
+    script_digest, uuid = await MultiGet(


IIRC this will cause every rule that depends on this product to rerun every time. Is that really what we want?

I think if the end result of the rule is the same, those downstream rules won't be invalidated. It only means that this Param will always be re-evaluated. This is the same as how the new TestEnvironment type works.

@gshuflin can you please confirm this is correct?

I believe that the uncacheable UUID Get here will only cause this rule to have to rerun, rather than invalidating every downstream rule.

Sure, but every upstream rule will have to rerun. Is that what we want? What uses find_binary?

> but every upstream rule will have to rerun. Is that what we want? What uses find_binary?

If your Python interpreters changed? Yes, absolutely, I think we should rerun every dependent rule.

For example, if you removed Python 3.7 from your machine and installed Python 3.8, that should invalidate your Python processes. It's not guaranteed that your tests would still pass.

Otherwise, the only way to get Pants to re-search for your interpreters is to purge lmdb_store, or modify --python-setup-interpreter-search-paths. There's no other way to get Pants to re-evaluate where your interpreters are.

--

Atm, find_binary is only used for finding Python. In the example plugin repo, it's used to find zip and bash.

> it will rerun every single time

This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.

This is the same as with the --test-extra-env feature Greg added. It re-evaluates os.environ on every single Pants run, yet we are still able to get caching.

What happens if a rule await Gets on this rule (or something that depends on it)? The entire body of that rule has to rerun every time as well, no? Since this rule might be run conditionally.

A not-so-side-note: Fixing things for the Pants client side is great, but we never have this problem for remoting, since we effectively have a hash of the remote execution image as a key in Process executions via remote_execution_extra_platform_properties - "container-image=...". So if the fix affects both sides, it's good to keep in mind that we could probably strive to do better in the future and not impact remoting with an evolution of the current client-side fix.

> This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.

This is correct. Uncacheable nodes are not completely free, but they are cheaper than actually re-running the logic above the uncacheable node.

An uncacheable node runs once per "session" (typically one pants run), and everything that depends on an uncacheable node is effectively always marked "dirty". Dirty nodes do not re-run unless their dependencies (recursively, all the way down to the uncacheable node) have changed output values, represented using an integer generation value. That rust-only recursion on integer graph ids and integer generation values (called "cleaning") is much cheaper than actually running the nodes.
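The behavior described above can be sketched with a toy model. This is not the Pants engine: the `Node` class and its fields are hypothetical, and the real "cleaning" happens in Rust on integer graph ids. The sketch only shows the idea that an uncacheable node reruns once per session while its dependents rerun only when its generation value changes.

```python
from typing import Callable, Optional, Sequence, Tuple


class Node:
    """Toy graph node; every name here is hypothetical, not a Pants engine type."""

    def __init__(
        self,
        compute: Callable[..., object],
        deps: Sequence["Node"] = (),
        cacheable: bool = True,
    ) -> None:
        self.compute = compute
        self.deps = tuple(deps)
        self.cacheable = cacheable
        self.value: object = None
        self.generation = 0  # bumped only when the output value changes
        self.seen: Optional[Tuple[int, ...]] = None  # dep generations at last run
        self.last_session: Optional[int] = None
        self.runs = 0

    def _run(self) -> None:
        self.runs += 1
        new_value = self.compute(*(dep.value for dep in self.deps))
        if self.generation == 0 or new_value != self.value:
            self.generation += 1
        self.value = new_value

    def get(self, session: int) -> object:
        # Bring dependencies up to date first.
        for dep in self.deps:
            dep.get(session)
        if not self.cacheable:
            # Uncacheable: always rerun, but only once per session.
            if self.last_session != session:
                self._run()
                self.last_session = session
            return self.value
        # "Cleaning": compare integer generations instead of rerunning the body.
        current = tuple(dep.generation for dep in self.deps)
        if current != self.seen:
            self._run()
            self.seen = current
        return self.value


environment = {"python": "/usr/bin/python3.7"}
find_binary = Node(lambda: environment["python"], cacheable=False)
run_tests = Node(lambda path: f"ran tests with {path}", deps=[find_binary])

run_tests.get(session=1)  # both nodes run
run_tests.get(session=2)  # find_binary reruns, same output, run_tests stays clean
environment["python"] = "/usr/bin/python3.8"
result = run_tests.get(session=3)  # output changed, so run_tests reruns
```

In this model `find_binary` runs three times across the three sessions, while `run_tests` runs only twice: once initially and once after the interpreter path changes.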

BUT: two things.

1. It's possible that this patch will hit the kinds of issues that @gshuflin ran into on "Failed processes memoized under pantsd" (#10129), although we've been using uncacheable nodes for test --force, so perhaps not.

2. This might be better modeled (eventually?) as a native operation of the CommandRunners, as I originally suggested on "Harden PATH setting in pex runs" (#9760); see the follow-up ticket I created after it landed: "Make PATH scanning/filtering a native operation" (#10526). As @jsirois mentioned above, the behavior in the remote and local cases is potentially different.

coveralls · 2020-09-09T03:03:44Z

Coverage remained the same at 0.0% when pulling f9ac86c on Eric-Arellano:uncachable-find-binary into 7f76dbc on pantsbuild:master.

jsirois · 2020-09-12T22:59:12Z

@Eric-Arellano here's an alternative PR - it looks like I may need to use this mechanism a third time to straighten out --use-first...: #10768.

jsirois · 2020-09-12T22:59:49Z

> The original motivation for this PR is Pyenv, where we want pyenv global 2.7.15 to cause invalidation for most Pex processes. Unfortunately, this does not fix it because Pyenv uses a single folder .shims/, so the output of this rule is simply path/to/.shims and things stay the same.

Is there an issue detailing this problem? I couldn't find one.

Eric-Arellano · 2020-09-13T01:57:05Z

> Is there an issue detailing this problem? I couldn't find one.

No, I only realized the issue this week when I was iterating on MyPy handling of interpreter constraints and found that using pyenv global to activate and deactivate MyPy had no impact.

jsirois · 2020-09-13T16:34:01Z

> ... and found that using pyenv global to activate and deactivate MyPy had no impact.

That should be fixed by #10770.

Eric-Arellano · 2020-09-13T17:32:00Z

Superseded by #10770 and #10768. Thanks John!

Eric-Arellano requested review from stuhood, benjyw and tdyas September 9, 2020 02:31

benjyw reviewed Sep 9, 2020

Eric-Arellano requested a review from gshuflin September 9, 2020 04:47

jsirois mentioned this pull request Sep 12, 2020

--use-first-matching-interpreter shebang doesn't play nicely with macOS Python 3.7 install #10648

Closed

Eric-Arellano closed this Sep 13, 2020

Eric-Arellano deleted the uncachable-find-binary branch September 13, 2020 17:32
