Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add targets to re-wrap source files in different SourcesField types. #17877

Merged
merged 11 commits into from Jan 5, 2023

Conversation

chrisjrn
Copy link
Contributor

@chrisjrn chrisjrn commented Dec 23, 2022

This is primarily intended to be used with experimental_shell_command/experimental_run_in_sandbox, but will also work with any target that produces a SourcesField/FileSourceField that is not applicable for use by dependent rules -- e.g. if your experimental_run_in_sandbox produces Python code that you want included as source in a Python distribution.

The implementation is hacky at the moment, but it mostly solves a problem.

Fixes #15493.

@chrisjrn chrisjrn marked this pull request as ready for review December 23, 2022 20:27
@chrisjrn
Copy link
Contributor Author

chrisjrn commented Dec 23, 2022

(Still needs tests, but want to air the idea before committing further :))

@thejcannon
Copy link
Member

I don't feel comfortable with this without @stuhood 's approval.

I see the pieces and it could work in certain situations for sure, but this has me uneasy... Specifically:

  • What target source fields would we do this for? All of them? A subset (which subset)?
  • What does this mean for formatters/fixers/linters? Theoretically their FieldSets only require a PythonSourceField.

That's not to say I don't love the idea, and the possibilities. But I worry about how much we're signing ourselves up for and wan to make sure it's well thought out.

@chrisjrn
Copy link
Contributor Author

chrisjrn commented Dec 27, 2022

@thejcannon

What target source fields would we do this for? All of them? A subset (which subset)?

If I got my way, we'd support every source type that we can package or compile

What does this mean for formatters/fixers/linters? Theoretically their FieldSets only require a PythonSourceField.

The implications for formatting/linting etc are the same as for any other form of non-exporting codegen (since that's what this is). There's nothing fundamentally new here.

@thejcannon
Copy link
Member

The implications for formatting/linting etc are the same as for any other form of non-exporting codegen (since that's what this is). There's nothing fundamentally new here.

Ahh yes very good point.

import logging
from dataclasses import dataclass
from typing import Iterable, Union

Copy link
Sponsor Member

@stuhood stuhood Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a UX perspective, it feels like this should draw inspiration from http_source instead: https://www.pantsbuild.org/v2.15/docs/reference-resource#codesourcecode ... such that the syntax was more like:

experimental_shell_command(
  name='cmd',
  ..,
)

python_sources(
  sources=output_sources(":cmd"),
)

I know that @thejcannon also took advantage of codegen for the purposes of http_source, but if we need to deepen the support for alternative sources fields types (because due to its implementation, http_source is currently limited to resource and file) that seems like it would still be worth doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, see I knew roping Stu in was the right call. I like this syntax as well (bikeshed on the name).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've been mulling this over for a while, and having looked at http_source, I can see how it works, and could apply here. I'm happy to implement it if the discussion lands on implementing it in this way.

I've re-written this comment a few times, to make sure that my understanding of how things work lines up with what's actually going onside Pants. Some of the objections I initially had turned out to not be entirely correct.

I think there's some complication however, that justifies the conceptual separation of our sources target generators and a corresponding multiple source wrapper targets. Count me as a medium no against adding the new value type.

Most x_sources file types (which produce x_source targets) are TargetFilesGenerators, which currently produce GeneratedTargets representing one target per file matched by the generator. Eventually, the MultipleSourcesField on the target is used in the GeneratedTargets-producing rule to resolve a SourcesPaths, which does not consider codegen.

It would be possible to amend that -> GeneratedTargets rule to act as a no-op if the MultipleSourcesField's value were an output_sources. I presume it would be possible to implement a GeneratedSources rule for each of the MultipleSourcesFields that we end up wanting to support. Rules would need to be registered per x_sources target type, since codegen requires a single rule per (InputSourceField, OutputSourceField) pair. This can't be as neat a solution as the TargetFilesGenerator approach is for generating targets for actual files.

The conceptual gap here is the x_sources Target would be a Target Generator when operating on source files, but a plain old Target that resolves to several (codegenned) source files when using an output_sources value. This makes it difficult to reason about the behaviour of this target type, and similarly difficult to document: This would significantly complicate the documentation for each of these x_sources targets. It would need to explain that the target behaves significantly differently depending on whether you're using it for real files or to wrap generated files from elsewhere.

It also looks as though the behaviour of moved_fields and copied_fields would also be somewhat difficult to reason about -- they'd be neither moved, nor copied if the target is used directly? It seems like parametrize also behaves weirdly here.

So I think an implementation using the existing target generators with a new output_sources value type would (at the moment) comfortably land in the "difficult to explain" bucket. We are doing something significantly different to what the target types currently do.

On the other hand, the current implementation produces a bunch of target types (not ideal), but they're easier to reason about, and they're all grouped in help under experimental_wrap_as, which should make them easy-ish to spot in the help.

This is experimental, we can change the API later

Currently all of the wrapper types will be marked experimental, which would allow us time to deal with some of the above items where possible. With that in mind, we could land this as currently proposed, and take the time to solve the above technical blockers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically I would assume this acts more like a single python_source.
Perhaps we can encode that requirement? 1 source and one target.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thejcannon I don't think that's a good restriction to make here -- you could have an esc that generates a pile of files for use as resources, and it may not be possible to enumerate them all. Alternatively, you could have a process that generates a number of source files (not unreasonable in JVM land), and you need to include all of them. If you had 1 source per target, you'd end up with a lot of individual target definitions, and you'd need to manually map the dependencies between them.

Copy link
Contributor Author

@chrisjrn chrisjrn Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming my technical understanding of GenerateSourcesRequest and TargetFilesGenerator.

Per my previous comment here, I think that needing to know what files will be generated ahead of time is of limited utility and creates a maintenance burden.

Currently, an x_sources target generator needs to be written once, and will keep up to date as more x files are added to a directory. Using a TargetFilesGenerator that subsets the outputs from another target would mean the BUILD file would need to be updated whenever the contents of the target would change. Admittedly, this would guarantee the best dependency/cache performance, however it would come at the expense of a maintenance burden of the build file, and the build errors that arise due to files not being captured from the outputs would be hard to debug.

On the other hand, the implementation in the PR can glob the outputs of another target, so for cases where all of the outputs are needed, the BUILD file only needs to be written once, and remains straightforward. If you would benefit from more fine-grained targets, then you can can add new targets for each subset of files. The user has the option of choosing easy maintenance or fine-grained caching.

In my view, dependency inference is a nice-to-have for this use case, but not essential: in the experimental_shell_command/experimental_run_in_sandbox use cases, users have to specify dependencies manually.

There's also a firm red flag arising from enabling dependency inference for esc use cases: Java's dependency inference code in particular would need to run the underlying esc process to fulfil the SourceFilesRequests in its first-party symbol mapping. That would mean every esc that fulfils a _sources target would need to run before dependency inference can be performed, regardless of whether that dependency is necessary to fulfil the goal.

My preference here is for this idea to be wrapped up in an experimental API with minimal surface area, so that we can get a better idea about how this is being used, and then proceed with a deeper change in machinery once we've got a more solid understanding of how it'll be used.

Copy link
Sponsor Member

@stuhood stuhood Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, an x_sources target generator needs to be written once, and will keep up to date as more x files are added to a directory. Using a TargetFilesGenerator that subsets the outputs from another target would mean the BUILD file would need to be updated whenever the contents of the target would change. Admittedly, this would guarantee the best dependency/cache performance, however it would come at the expense of a maintenance burden of the build file, and the build errors that arise due to files not being captured from the outputs would be hard to debug.

I am not proposing that this integration would require explicitly listing the files to generate targets for, as I don't think that that buys you a significant difference in performance. Basically: if you required that all files be explicitly listed, ./pants list would be able to skip generating code to generate targets, but ./pants dependencies would need to generate code in order to compute dependencies for inference... and making list faster isn't worth the API difference probably.

In my view, dependency inference is a nice-to-have for this use case, but not essential: in the experimental_shell_command/experimental_run_in_sandbox use cases, users have to specify dependencies manually.

To be clear: this is about both the dependencies of the wrapped_as_ target, and dependencies on the wrapped_as_ target.


I might be fine with landing the wrappers as experimental, as long as we agree that they are even less stable than experimental_shell_command itself, and should probably stabilize at a different time.

But I suspect that not having a stable API for this use case is a useful reminder that codegen itself still has some rough edges which might end up impacting the implementation (and maybe API) of experimental_shell_command in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ./pants dependencies would need to generate code in order to compute dependencies for inference

Yup, that's the red flag I had above, you would need to run the processes in order to compute dependencies, and that would have an impact on performance: all source-providing targets would need to be evaluated before the dependency inference would take place. This could have a significant impact on performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be fine with landing the wrappers as experimental, as long as we agree that they are even less stable than experimental_shell_command itself, and should probably stabilize at a different time.

Yes, I agree entirely.

To be clear, I think there's value in supporting alternative values for SourceFields, but at the moment, but it's probably a bigger design task with riskier scope than we can solve in a PR discussion.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ./pants dependencies would need to generate code in order to compute dependencies for inference

Yup, that's the red flag I had above, you would need to run the processes in order to compute dependencies, and that would have an impact on performance: all source-providing targets would need to be evaluated before the dependency inference would take place. This could have a significant impact on performance.

Only those actually used in this position: not all codegen targets. But yea.

I could imagine this usecase as differentiated from how we handle "native" codegen: native codegen has a higher implementation difficulty bar, because you need to implement both dependency inference rules and the codegen itself.

To be clear, I think there's value in supporting alternative values for SourceFields, but at the moment, but it's probably a bigger design task with riskier scope than we can solve in a PR discussion.

I disagree that a PR is necessarily the wrong place to have this discussion. To reduce the need for this type of discussion on PRs it can be useful to ensure that design/UX are discussed before a PR is started... but in many cases, some of that ends up happening during code review.

@thejcannon
Copy link
Member

Part of my difficulty wrapping my head around this might be squelched with a simple real-world use-case. Can you help me understand the problem we're trying to solve and how were falling short today?

Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless @thejcannon has further comments, I'm fine with this, as long as it is marked:

  1. as even more experimental than experimental_shell_command (which we expect to stabilize sooner)
  2. as not supporting dependency inference

import logging
from dataclasses import dataclass
from typing import Iterable, Union

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ./pants dependencies would need to generate code in order to compute dependencies for inference

Yup, that's the red flag I had above, you would need to run the processes in order to compute dependencies, and that would have an impact on performance: all source-providing targets would need to be evaluated before the dependency inference would take place. This could have a significant impact on performance.

Only those actually used in this position: not all codegen targets. But yea.

I could imagine this usecase as differentiated from how we handle "native" codegen: native codegen has a higher implementation difficulty bar, because you need to implement both dependency inference rules and the codegen itself.

To be clear, I think there's value in supporting alternative values for SourceFields, but at the moment, but it's probably a bigger design task with riskier scope than we can solve in a PR discussion.

I disagree that a PR is necessarily the wrong place to have this discussion. To reduce the need for this type of discussion on PRs it can be useful to ensure that design/UX are discussed before a PR is started... but in many cases, some of that ends up happening during code review.

@thejcannon
Copy link
Member

Yeah, nothing blocking, just trying to navigate this feature.

I should say too, the idea is really exciting and excellent. That's part of why I'm so active on this PR. Your changes in this space have really exploded what's possible in an easy way. I just want to make sure the UX is as gooey and flexible as possible.

@chrisjrn
Copy link
Contributor Author

chrisjrn commented Jan 5, 2023

Thanks all! I appreciate that this is a reasonably contentious decision, so thanks for helping flesh out the alternatives.

Before I merge:

  1. This will stabilise separately to experimental_shell_command and friends
  2. Documentation will be added, indicating that these must be specified manually as dependencies
  3. Wrappers will be added for remaining first-party sources types*

After I merge:

  1. Create ticket, linking back to this PR, discussing the proposal for new SourceField values

  • Noting that this will specifically not work for codegen source types, which is a great argument in favour of adding these additional SourceField values at a later date.

@chrisjrn chrisjrn force-pushed the chrisjrn/generic-file-exports-2 branch from 798064d to 0ef05d3 Compare January 5, 2023 18:02
@chrisjrn
Copy link
Contributor Author

chrisjrn commented Jan 5, 2023

@thejcannon @stuhood see 0ef05d3 for the commit that addresses Stu's remaining concerns

@chrisjrn
Copy link
Contributor Author

chrisjrn commented Jan 5, 2023

See #17926 for the discussion of creating new value types for source(s) fields.

@chrisjrn chrisjrn merged commit 2c65434 into pantsbuild:main Jan 5, 2023
Copy link
Member

@thejcannon thejcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aww shucks this was still pending publishing 😓

@@ -32,10 +33,19 @@
tests_analysis,
third_party_pkg,
)
from pants.core.util_rules.wrap_source import wrap_source_rule_and_target

wrap_golang = wrap_source_rule_and_target(GoPackageSourcesField, "kotlin_sources")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "go" but the string says "kotlin" 👀

@chrisjrn
Copy link
Contributor Author

chrisjrn commented Jan 6, 2023 via email

cczona pushed a commit to cczona/pants that referenced this pull request Jan 9, 2023
pantsbuild#17877)

This is primarily intended to be used with `experimental_shell_command`/`experimental_run_in_sandbox`, but will also work with _any_ target that produces a `SourcesField`/`FileSourceField` that is not applicable for use by dependent rules -- e.g. if your `experimental_run_in_sandbox` produces Python code that you want included as source in a Python distribution. 

The implementation is hacky at the moment, but it mostly solves a problem.

Fixes pantsbuild#15493.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

experimental_shell_command should be able to gen resources
5 participants