paths: allow finding paths between addresses that expand to multiple targets #19482

AlexTereshenkov · 2023-07-17T15:04:29Z

The paths goal currently expects a single from and a single to address. This works great for finding all paths in the dependency graph between two individual targets, e.g. a python_source or a python_requirement. Attempting to find all paths between multiple targets in a directory, however, would fail with an error:

$ pants paths --from=src/python/pants/backend/adhoc --to=src/python/pants/backend/project_info
...
pants.build_graph.address.ResolveError: Expected a single target, but was given multiple targets.

Did you mean one of these?

  * src/python/pants/backend/adhoc/__init__.py
  * src/python/pants/backend/adhoc/adhoc_tool.py
  * src/python/pants/backend/adhoc/run_system_binary.py
  * src/python/pants/backend/adhoc/target_types.py

This can be solved by running the paths goal multiple times, of course, by listing the targets in a directory and then using paths goal in a for loop (Bash/Python). This is very expensive, however, as making a Pants call is known to have an overhead (pantsd/scheduler/etc). It's a whole lot cheaper to get all the paths in a single Pants call.

Being able to find all paths between multiple targets would help answering the following questions:

How is this Python module connected to a Python package, if at all?

One can find out if the module is accessing the package by doing dependencies --transitive for the module of interest and see if there are any modules from the package of concern. If there's a single module listed, then doing just another manual paths invocation is fine. However, if there are multiple modules listed, you'd have to run paths for each of them which is, again, very expensive.

How tightly are two Python packages connected?

For example, you may want to estimate the refactoring efforts (to make two packages not depend on each other) and for this you need to see how many paths exist between two packages. Currently, obtaining the dependency graph for individual files in one shot (as {moduleA: [deps], moduleB: [deps]}) is not possible with Pants (I've solved this with a plugin), so you need to run dependencies, transitively for every file of interest, which again has an overhead (one file - one Pants call). Running pants paths --from=src/libraryA:: --to=src/appB:: would provide all the necessary information in one shot.

Dependency graph analytics. Dumping all the paths in the repo, e.g. between sources and tests, one can find the source modules that has most paths leading to tests (so perhaps needs to be split?), find too long paths (code architecture smell if a module leads to another module through 10+ modules steps?), and many more.

After the changes made, these commands produce identical results (i.e. it doesn't matter if a directory is passed or an address (with single or double colon notation):

$ pants paths --from=src/python/pants/backend/adhoc --to=src/python/pants/backend/project_info > all.json 
$ pants paths --from=src/python/pants/backend/adhoc:: --to=src/python/pants/backend/project_info:: > all-recursive.json
$ diff all.json all-recursive.json; echo $?
0

Running this command consistently takes about 45-50s with nested for loop (the final suggested implementation).

pants --no-pantsd --no-local-cache paths --to=src/python/pants/engine/target.py --from=src/python/pants/backend:: > all.json

I've experimented using MultiGet with multiple rules (please see commit 2 and 3), but the performance was identical. I therefore suggest to keep the for loops in place for simplicity purposes. Of course happy to switch to a full or partial rule-based implementation, if desired.

The paths are still printed in the ASC order based on the length, however, they are grouped. I think if a user dumps multiple paths, the order doesn't have any meaning any longer so we don't have to sort them in any particular way and should keep the current behavior in place.

An example of the output:

$ ./pants_from_sources paths --from=tests:: --to=cheeseshop/repository::
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/repository.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/repository.py",
    "cheeseshop/repository/package.py",
    "cheeseshop/repository/parsing/casts.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/casts.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/casts.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ]
]

As a little stress test experiment, I've run:


$ pants paths --from=src/python/pants/backend:: --to=src/python/pants/engine:: > huge.json 
(took 11 mins)

$ jq length huge.json                                                                                                                                           
1704925

kaos

lgtm.

I liked the version with the logic in a dedicated rule, it made the goal rule easier to see what was going on end-to-end imo, but no big deal either way I think.

AlexTereshenkov · 2023-07-18T09:32:06Z

lgtm.

Thanks for taking a look, Andreas, very kind of you!

I liked the version with the logic in a dedicated rule, it made the goal rule easier to see what was going on end-to-end imo, but no big deal either way I think.

Would you prefer to have

(A)
no for loops at all (with two rules, one to build root->destinations pairs and another one to find all paths for each root-destination)

OR

(B) one for loop (with one rule that will find all paths for each root-destination)

?

I personally believe it's best readable in (B) with a single for loop and one rule like this:

    for root in from_tgts:
        spec_paths = await MultiGet(  # noqa: PNT30: keep iterating in for loop
            Get(
                SpecsPaths,
                RootDestinationPair,
                RootDestinationPair(destination=destination, root=root),
            )
            for destination in to_tgts
        )
        for spec_path in spec_paths:
            all_spec_paths.extend(spec_path.paths)

kaos · 2023-07-18T14:53:42Z

(A) feels perhaps more efficient, as we do want to inline await Gets as much as possible rather than calling them in a loop (hence the PNT30 lint check).

…iple targets

…iple targets - add a rule to get paths between individual targets

…iple targets - add multiple rules to find paths

…iple targets

AlexTereshenkov · 2023-07-19T08:05:53Z

(A) feels perhaps more efficient, as we do want to inline await Gets as much as possible rather than calling them in a loop (hence the PNT30 lint check).

Alrighty, this is done then! Thank you so much for taking a look!

AlexTereshenkov added the category:new feature label Jul 17, 2023

AlexTereshenkov force-pushed the backend/paths-goal-supports-multiple-targets-in-address branch from f7554a9 to 333f90e Compare July 17, 2023 15:12

AlexTereshenkov requested review from stuhood and kaos July 17, 2023 15:13

AlexTereshenkov marked this pull request as ready for review July 17, 2023 15:51

kaos approved these changes Jul 17, 2023

View reviewed changes

AlexTereshenkov added 5 commits July 19, 2023 09:03

paths goal: allow finding paths between addresses that expand to mult…

ed49acb

…iple targets

paths goal: allow finding paths between addresses that expand to mult…

3db365f

…iple targets - add a rule to get paths between individual targets

paths goal: allow finding paths between addresses that expand to mult…

89b42ec

…iple targets - add multiple rules to find paths

paths goal: allow finding paths between addresses that expand to mult…

ed38de4

…iple targets

paths goal: allow finding paths between addresses that expand to mult…

afeb676

…iple targets

AlexTereshenkov force-pushed the backend/paths-goal-supports-multiple-targets-in-address branch from 333f90e to afeb676 Compare July 19, 2023 08:04

AlexTereshenkov merged commit cd67679 into pantsbuild:main Jul 19, 2023
24 checks passed

AlexTereshenkov deleted the backend/paths-goal-supports-multiple-targets-in-address branch July 19, 2023 08:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

paths: allow finding paths between addresses that expand to multiple targets #19482

paths: allow finding paths between addresses that expand to multiple targets #19482

AlexTereshenkov commented Jul 17, 2023 •

edited

kaos left a comment

AlexTereshenkov commented Jul 18, 2023

kaos commented Jul 18, 2023

AlexTereshenkov commented Jul 19, 2023

paths: allow finding paths between addresses that expand to multiple targets #19482

paths: allow finding paths between addresses that expand to multiple targets #19482

Conversation

AlexTereshenkov commented Jul 17, 2023 • edited

kaos left a comment

Choose a reason for hiding this comment

AlexTereshenkov commented Jul 18, 2023

kaos commented Jul 18, 2023

AlexTereshenkov commented Jul 19, 2023

AlexTereshenkov commented Jul 17, 2023 •

edited