Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix find_binary rule to not use the cache #10751

Closed

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented Sep 9, 2020

Because this rule depends entirely on the state of the external environment, it's more correct to not cache it. For example, if a user uninstalls a program or installs a new one, Pants needs to detect that automatically.

This does not fix every issue. The original motivation for this PR is Pyenv, where we want pyenv global 2.7.15 to cause invalidation for most Pex processes. Unfortunately, this does not fix it because Pyenv uses a single folder .shims/, so the output of this rule is simply path/to/.shims and things stay the same. But, this PR still improves the situation.

[ci skip-rust]
[ci skip-build-wheels]

[ci skip-rust]
[ci skip-build-wheels]
@@ -4,6 +4,7 @@
import random
import uuid
from dataclasses import dataclass, field
from uuid import UUID as UUID
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's convenient to have centralized import statements. This is exporting it.

We do this type of centralized import in a couple places, like rules.py exporting Get and MultiGet.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but those are our own internal types, so we can choose where to make them publicly available. This is a python stdlib type, and it's more confusing than convenient to export it as-if it's an internal type.

Copy link
Sponsor Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear to me that this is really the behavior we want.

@@ -42,7 +41,7 @@
from pants.core.util_rules.source_files import SourceFiles, SourceFilesRequest
from pants.engine.addresses import Addresses
from pants.engine.fs import AddPrefix, Digest, DigestSubset, MergeDigests, PathGlobs, Snapshot
from pants.engine.internals.uuid import UUIDRequest
from pants.engine.internals.uuid import UUID, UUIDRequest
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just seems confusing and adds cognitive load: "What is this seemingly-custom UUID class? Oh it's actually just a superfluous alias for the well-known Python stdlib class."


# We get a UUID so that we ignore the cache every time, as this script depends on the state of
# the external environment.
script_digest, uuid = await MultiGet(
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this will cause every rule that depends on this product to rerun every time. Is that really what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the end result of the rule is the same, those downstream rules won't be invalidated. It only means that this Param will always be re-evaluated. This is the same with how the new TestEnvironment type works.

@gshuflin can you please confirm this is correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the uncacheable UUID Get here will only cause this rule to have to rerun, rather than invalidating every downstream rule.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but every upstream rule will have to rerun. Is that what we want? What uses find_binary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but every upstream rule will have to rerun. Is that what we want? What uses find_binary?

If your Python interpreters changed? Yes, absolutely, I think we should rerun every dependent rule.

For example, if you removed Python 3.7 from your machine and installed Python 3.8, that should invalidate your Python processes. It's not guaranteed that your tests would still pass.

Otherwise, the only way to get Pants to re-search for your interpreters is to purge lmdb_store, or modify --python-setup-interpreter-search-paths. There's no other way to get Pants to re-evaluate where your interpreters are.

--

Atm, find_binary is only used for finding Python. In the example plugin repo, it's used to find zip and bash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will rerun every single time

This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.

This is the same with Greg's new --test-extra-env feature he added. That re-evaluated os.environ every single Pants run, yet we are still able to get caching.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a rule await Gets on this rule (or something that depends on it)? The entire body of that rule has to rerun every time as well, no? Since this rule might be run conditionally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A not-so-side-note: Fixing things for the Pants client side is great, but we never have this problem for remoting since we effectively have a hash of the remote executuion image as a key in Procss executions via remote_execution_extra_platform_properties - "container-image=...". So if the fix affects both sides its good to keep in mind we could probably strive to do better in the future and not impact remoting somehow with an evolution of the current client side fix.

Copy link
Sponsor Member

@stuhood stuhood Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.

This is correct. Uncacheable nodes are not completely free, but they are cheaper than actually re-running the logic above the uncacheable node.

An uncacheable node runs once per "session" (typically one pants run), and everything that depends on an uncacheable node is effectively always marked "dirty". Dirty nodes do not re-run unless their dependencies (recursively, all the way down to the uncacheable node) have changed output values, represented using an integer generation value. That rust-only recursion on integer graph ids and integer generation values (called "cleaning") is much cheaper than actually running the nodes.

BUT: two things.

  1. It's possible that this patch will experience the kinds of issues that @gshuflin ran into on Failed processes memoized under pantsd #10129 (...although we've been using uncacheable nodes for test --force, so perhaps not)
  2. This might be better modeled (eventually?) as a native operation of the CommandRunners as I suggested originally on Harden PATH setting in pex runs #9760: see the followup ticket I created after Harden PATH setting in pex runs #9760 landed: Make PATH scanning/filtering a native operation #10526. As @jsirois mentioned above, the behavior in the remote and local cases is potentially different.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@coveralls
Copy link

Coverage Status

Coverage remained the same at 0.0% when pulling f9ac86c on Eric-Arellano:uncachable-find-binary into 7f76dbc on pantsbuild:master.

@jsirois
Copy link
Member

jsirois commented Sep 12, 2020

@Eric-Arellano here's an alternative PR - it looks like I may need to use this mechanism a third time to straighten out --use-first...: #10768.

@jsirois
Copy link
Member

jsirois commented Sep 12, 2020

The original motivation for this PR is Pyenv, where we want pyenv global 2.7.15 to cause invalidation for most Pex processes. Unfortunately, this does not fix it because Pyenv uses a single folder .shims/, so the output of this rule is simply path/to/.shims and things stay the same.

Is there an issue detailing this problem? I couldn't find one.

@Eric-Arellano
Copy link
Contributor Author

Is there an issue detailing this problem? I couldn't find one.

No, I only realized the issue this week when I was iterating on MyPy handling of interpreter constraints and found that using pyenv global to active and deactivate MyPy had no impact.

@jsirois
Copy link
Member

jsirois commented Sep 13, 2020

... and found that using pyenv global to active and deactivate MyPy had no impact.

That should be fixed by #10770.

@Eric-Arellano
Copy link
Contributor Author

Superseded by #10770 and $10768. Thanks John!

@Eric-Arellano Eric-Arellano deleted the uncachable-find-binary branch September 13, 2020 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants