Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

paths: allow finding paths between addresses that expand to multiple targets #19482

Conversation

AlexTereshenkov
Copy link
Member

@AlexTereshenkov AlexTereshenkov commented Jul 17, 2023

The paths goal currently expects a single from and a single to address. This works great for finding all paths in the dependency graph between two individual targets, e.g. a python_source or a python_requirement. Attempting to find all paths between multiple targets in a directory, however, would fail with an error:

$ pants paths --from=src/python/pants/backend/adhoc --to=src/python/pants/backend/project_info
...
pants.build_graph.address.ResolveError: Expected a single target, but was given multiple targets.

Did you mean one of these?

  * src/python/pants/backend/adhoc/__init__.py
  * src/python/pants/backend/adhoc/adhoc_tool.py
  * src/python/pants/backend/adhoc/run_system_binary.py
  * src/python/pants/backend/adhoc/target_types.py

This can be solved by running the paths goal multiple times, of course, by listing the targets in a directory and then using paths goal in a for loop (Bash/Python). This is very expensive, however, as making a Pants call is known to have an overhead (pantsd/scheduler/etc). It's a whole lot cheaper to get all the paths in a single Pants call.


Being able to find all paths between multiple targets would help answering the following questions:

  1. How is this Python module connected to a Python package, if at all?

One can find out if the module is accessing the package by doing dependencies --transitive for the module of interest and see if there are any modules from the package of concern. If there's a single module listed, then doing just another manual paths invocation is fine. However, if there are multiple modules listed, you'd have to run paths for each of them which is, again, very expensive.

  1. How tightly are two Python packages connected?

For example, you may want to estimate the refactoring efforts (to make two packages not depend on each other) and for this you need to see how many paths exist between two packages. Currently, obtaining the dependency graph for individual files in one shot (as {moduleA: [deps], moduleB: [deps]}) is not possible with Pants (I've solved this with a plugin), so you need to run dependencies, transitively for every file of interest, which again has an overhead (one file - one Pants call). Running pants paths --from=src/libraryA:: --to=src/appB:: would provide all the necessary information in one shot.

  1. Dependency graph analytics. Dumping all the paths in the repo, e.g. between sources and tests, one can find the source modules that has most paths leading to tests (so perhaps needs to be split?), find too long paths (code architecture smell if a module leads to another module through 10+ modules steps?), and many more.

After the changes made, these commands produce identical results (i.e. it doesn't matter if a directory is passed or an address (with single or double colon notation):

$ pants paths --from=src/python/pants/backend/adhoc --to=src/python/pants/backend/project_info > all.json 
$ pants paths --from=src/python/pants/backend/adhoc:: --to=src/python/pants/backend/project_info:: > all-recursive.json
$ diff all.json all-recursive.json; echo $?
0

Running this command consistently takes about 45-50s with nested for loop (the final suggested implementation).

pants --no-pantsd --no-local-cache paths --to=src/python/pants/engine/target.py --from=src/python/pants/backend:: > all.json

I've experimented using MultiGet with multiple rules (please see commit 2 and 3), but the performance was identical. I therefore suggest to keep the for loops in place for simplicity purposes. Of course happy to switch to a full or partial rule-based implementation, if desired.


The paths are still printed in the ASC order based on the length, however, they are grouped. I think if a user dumps multiple paths, the order doesn't have any meaning any longer so we don't have to sort them in any particular way and should keep the current behavior in place.

An example of the output:

$ ./pants_from_sources paths --from=tests:: --to=cheeseshop/repository::
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/repository.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/test_repository.py:tests",
    "cheeseshop/repository/repository.py",
    "cheeseshop/repository/package.py",
    "cheeseshop/repository/parsing/casts.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/casts.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/exceptions.py"
  ],
  [
    "tests/repository/parsing/test_casts.py:tests",
    "cheeseshop/repository/parsing/casts.py",
    "cheeseshop/repository/parsing/exceptions.py"
  ]
]

As a little stress test experiment, I've run:


$ pants paths --from=src/python/pants/backend:: --to=src/python/pants/engine:: > huge.json 
(took 11 mins)

$ jq length huge.json                                                                                                                                           
1704925

@AlexTereshenkov AlexTereshenkov force-pushed the backend/paths-goal-supports-multiple-targets-in-address branch from f7554a9 to 333f90e Compare July 17, 2023 15:12
@AlexTereshenkov AlexTereshenkov marked this pull request as ready for review July 17, 2023 15:51
Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

I liked the version with the logic in a dedicated rule, it made the goal rule easier to see what was going on end-to-end imo, but no big deal either way I think.

@AlexTereshenkov
Copy link
Member Author

lgtm.

Thanks for taking a look, Andreas, very kind of you!

I liked the version with the logic in a dedicated rule, it made the goal rule easier to see what was going on end-to-end imo, but no big deal either way I think.

Would you prefer to have

(A)
no for loops at all (with two rules, one to build root->destinations pairs and another one to find all paths for each root-destination)

OR

(B) one for loop (with one rule that will find all paths for each root-destination)

?

I personally believe it's best readable in (B) with a single for loop and one rule like this:

    for root in from_tgts:
        spec_paths = await MultiGet(  # noqa: PNT30: keep iterating in for loop
            Get(
                SpecsPaths,
                RootDestinationPair,
                RootDestinationPair(destination=destination, root=root),
            )
            for destination in to_tgts
        )
        for spec_path in spec_paths:
            all_spec_paths.extend(spec_path.paths)

@kaos
Copy link
Member

kaos commented Jul 18, 2023

(A) feels perhaps more efficient, as we do want to inline await Gets as much as possible rather than calling them in a loop (hence the PNT30 lint check).

@AlexTereshenkov AlexTereshenkov force-pushed the backend/paths-goal-supports-multiple-targets-in-address branch from 333f90e to afeb676 Compare July 19, 2023 08:04
@AlexTereshenkov
Copy link
Member Author

(A) feels perhaps more efficient, as we do want to inline await Gets as much as possible rather than calling them in a loop (hence the PNT30 lint check).

Alrighty, this is done then! Thank you so much for taking a look!

@AlexTereshenkov AlexTereshenkov merged commit cd67679 into pantsbuild:main Jul 19, 2023
24 checks passed
@AlexTereshenkov AlexTereshenkov deleted the backend/paths-goal-supports-multiple-targets-in-address branch July 19, 2023 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants