Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract a generalized V2 rule to inject `__init__.py` files #7722

Merged
merged 18 commits into from May 16, 2019

Conversation

Projects
None yet
5 participants
@Eric-Arellano
Copy link
Contributor

commented May 14, 2019

Problem

Python requires packages to have an __init__.py file to be recognized as a proper module for the sake of imports.

#7696 does this for Pytest, but inlines the logic, even though it will likely be helpful for other Python rules as well.

Further, because this logic was originally written before being able to from Digest->Snapshot thanks to #7725, we had to use FilesContent to grab the paths of the digest. This would mean that every single source file would be materialized and persisted to memory, resulting in extremely high memory usage (found as part of investigation in #7741). There is no need for the actual content, just the paths, so this is a huge inefficiency.

Will close #7715.

Solution

Generalize into @rule(InitInjectedDigest, [Snapshot]), where InitInjectedDigest is a thin wrapper around a Digest.

We take a Snapshot because we need the file paths to work properly. This contrasts with earlier using FileContents to get the same paths. A Snapshot is much more light weight.

We return a Digest because that is the minimum information necessary to work properly, and the caller of the rule can then convert that Digest back into a Snapshot.

Result

It will now be easier for other Python rules to work with Python packages.

The unnecessary memory usage is now fixed. The V2 Pytest runner now has a space complexity of O(t + t*e + b), rather than O(t + t*e + s), where t is # targets, e is # env vars, b is # BUILD files, and s is # source files.

@Eric-Arellano Eric-Arellano force-pushed the Eric-Arellano:v2-inject-init branch from 7100f1a to 8f2aafe May 14, 2019

@Eric-Arellano Eric-Arellano force-pushed the Eric-Arellano:v2-inject-init branch from 8f2aafe to ab35242 May 14, 2019

Eric-Arellano added some commits May 14, 2019

Get rule working! Just means returned digest isn't merged
There's an ambiguity when using MergedDirectories. We can get around this by just returning the __init__ digest and leaving the caller to merge.

@Eric-Arellano Eric-Arellano changed the title WIP: Generalize V2 rule to inject `__init__.py` files Generalize V2 rule to inject `__init__.py` files May 16, 2019

@Eric-Arellano Eric-Arellano changed the title Generalize V2 rule to inject `__init__.py` files Refactor to have a new generalized V2 rule to inject `__init__.py` files May 16, 2019

@Eric-Arellano Eric-Arellano marked this pull request as ready for review May 16, 2019

@Eric-Arellano Eric-Arellano changed the title Refactor to have a new generalized V2 rule to inject `__init__.py` files Refactor for a new generalized V2 rule to inject `__init__.py` files May 16, 2019

@Eric-Arellano Eric-Arellano changed the title Refactor for a new generalized V2 rule to inject `__init__.py` files Extract a generalized V2 rule to inject `__init__.py` files May 16, 2019

@stuhood
Copy link
Member

left a comment

Thanks!

return super(TestInjectInit, cls).rules() + [RootRule(Snapshot)]

def assert_result(self, input_snapshot, expected_digest):
injected_digest = assert_single_element(

This comment has been minimized.

Copy link
@stuhood

stuhood May 16, 2019

Member

This assert_single_element call can be replaced by a single comma. Just saying! =D

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano May 16, 2019

Author Contributor

Hah true. I'm personally rooting for us to add scheduler.single_product_request() and converge on that.

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

Yes, it's a bit of an anti-pattern when doing tuple destructuring, I think, but I didn't realize this at first so there's lots of bad examples.

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

And @Eric-Arellano I would really like to encourage task parallelism if possible (requesting multiple things at once) and using tuples by default encourages that with a really neat syntax!

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano May 16, 2019

Author Contributor

using tuples by default encourages that with a really neat syntax!

To clarify, this means that I should use the tuple syntax of injected_digest, = blah(), right?

Why is this the case that using tuple syntax would increase parallelism? Is the idea that it's a good practice, or will it actually result in that here?

This comment has been minimized.

Copy link
@illicitonion

illicitonion May 16, 2019

Contributor

I'm happy with this as-is.

I agree with some other folks that the foo, = bar syntax is pretty subtle and can be confusing, and we should probably avoid it.

In test code, I'd advocate for introducing:

class TestBase
  def single_scheduler_product_request(self, product, subject):
    return assert_single_element(self.scheduler.product_request(product, [subject]))

which IIRC we used to have before, but I can't find?

I'm mildly against scheduler.single_product_request as it may discourage parallelism in non-test code.

This comment has been minimized.

Copy link
@blorente

blorente May 16, 2019

Contributor

To add to the noise, I also find the foo, = bar syntax very confusing and easy to miss, the seconds we save when typing longer accessors are all lost when we spend an extra hour or two going on fruitless debug tangents when we miss the comma and assume things that are not true. Rant over.

I'm in favour of the helper method being in TestBase vs the scheduler, since I agree that we should make it slightly painful to request only one product. This may avoid problems like this in the future (not v2-specific, but it's the same UX reasoning).

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano May 16, 2019

Author Contributor

I'm mildly against scheduler.single_product_request as it may discourage parallelism in non-test code.

Had no idea that'd be a thing. Good to know, and agreed this can live in TestBase. I can put up a PR later today to add this and refactor every current usage of the alternatives.

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

I also find the foo, = bar syntax very confusing and easy to miss

Ok, I hadn't realized this. It might make sense to have an explicit single product request method if we make it clear in the docstring how to request in parallel (since tuple destructuring with more than one element should be less ambiguous, I think?). Not sure.

But also, syntax highlighting saves lives!

the seconds we save when typing longer accessors

If anyone isn't using an editor that allows them to autocomplete python symbols, we could consider adding documentation on how to do that in their preferred editor perhaps!

@cosmicexplorer
Copy link
Contributor

left a comment

This is great, and I'm quite happy with this code! I would however really like to clamp down on uses of absolute paths to binaries, especially in remoteable process executions. It's used in test code but shouldn't (imho) really spread to production code. I propose one alternative (create an intermediate rule to encapsulate the workaround) which shouldn't be super hard to implement, and may also end up being useful for other rules as well (which is another reason why the v2 engine is so great!).

return super(TestInjectInit, cls).rules() + [RootRule(Snapshot)]

def assert_result(self, input_snapshot, expected_digest):
injected_digest = assert_single_element(

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

Yes, it's a bit of an anti-pattern when doing tuple destructuring, I think, but I didn't realize this at first so there's lots of bad examples.

else:
# TODO(7718): add a builtin rule for FilesContent->Snapshot.
touch_init_request = ExecuteProcessRequest(
argv=("/usr/bin/touch",) + missing_init_files,

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

I would really like us to not use any absolute paths to files in ExecuteProcessRequests in production code without a very clear TODO to fix it. Alternatively we can introduce an intermediate e.g. @rule(FallibleExecuteProcessResult, [TouchFileRequest]) which wraps the creation of the command line, where TouchFileRequest is some new datatype. We're pretty good about not doing this in v1 python code -- it's more important here, since those files won't necessarily be on the remote host if executed remotely. I would like to amortize the process of figuring out which rules need special filesystem/etc resources instead of having to fix them later.

This comment has been minimized.

Copy link
@stuhood

stuhood May 16, 2019

Member

There is a very near term plan to fix that (in the comment), and @Eric-Arellano has discussed it with two reviewers, so I think that it is fine in this case.

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

It's not clear what the significance of the proposed builtin rule is to the line below the comment. If it could include something about not having absolute paths to files in this comment, I would be satisfied. I don't see why that's something we don't want to do here.

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2019

I would really like us to not use any absolute paths to files in ExecuteProcessRequests in production code without a very clear TODO to fix it.

We do have a clear path forward for no longer relying on /usr/bin/touch. See #7718. @illicitonion and I were planning on pairing on this next week.

Alternatively we can introduce an intermediate e.g. @rule(FallibleExecuteProcessResult, [TouchFileRequest])

I strongly prefer not adding this intermediate type, given that Daniel and I next week plan to add a much more robust builtin Rust rule to do the same thing. This was already difficult to get everything working, and I do not want to spend much time on something that is only temporary.

I'd like to please keep following the "progress is better than perfection" approach we talked about it in #7696 (review). Because we let ourselves land that in an imperfect state, I was unblocked from several other PRs, including this memory over-usage investigation. Already, we've implemented 2 of the 4 followup PRs we agreed to. The 3rd followup is what we're discussing, and the 4th is blocked by #7710.

Tangibly, this PR is necessary for the memory over-usage work, because it removes a use of FilesContent, which appears to be the main contributor to the memory outage.

@illicitonion
Copy link
Contributor

left a comment

Looks great :) Happy to merge as-is.

return super(TestInjectInit, cls).rules() + [RootRule(Snapshot)]

def assert_result(self, input_snapshot, expected_digest):
injected_digest = assert_single_element(

This comment has been minimized.

Copy link
@illicitonion

illicitonion May 16, 2019

Contributor

I'm happy with this as-is.

I agree with some other folks that the foo, = bar syntax is pretty subtle and can be confusing, and we should probably avoid it.

In test code, I'd advocate for introducing:

class TestBase
  def single_scheduler_product_request(self, product, subject):
    return assert_single_element(self.scheduler.product_request(product, [subject]))

which IIRC we used to have before, but I can't find?

I'm mildly against scheduler.single_product_request as it may discourage parallelism in non-test code.

@blorente
Copy link
Contributor

left a comment

This is awesome :)

return super(TestInjectInit, cls).rules() + [RootRule(Snapshot)]

def assert_result(self, input_snapshot, expected_digest):
injected_digest = assert_single_element(

This comment has been minimized.

Copy link
@blorente

blorente May 16, 2019

Contributor

To add to the noise, I also find the foo, = bar syntax very confusing and easy to miss, the seconds we save when typing longer accessors are all lost when we spend an extra hour or two going on fruitless debug tangents when we miss the comma and assume things that are not true. Rant over.

I'm in favour of the helper method being in TestBase vs the scheduler, since I agree that we should make it slightly painful to request only one product. This may avoid problems like this in the future (not v2-specific, but it's the same UX reasoning).

from pants.util.objects import datatype


class InitInjectedDigest(datatype([('directory_digest', Digest)])): pass

This comment has been minimized.

Copy link
@blorente

blorente May 16, 2019

Contributor

A very small nit (I feel like I post a lot of those, people should stop listening to them at some point.
This represents the digest of an init file that has been injected, so I think InjectedInitDigest is a slightly better name.

Feel free to ignore.

Also, a comment at the top would be nice:

Suggested change
class InitInjectedDigest(datatype([('directory_digest', Digest)])): pass
# This represents the digest of an init file that has been injected into a module.
class InitInjectedDigest(datatype([('directory_digest', Digest)])): pass

This comment has been minimized.

Copy link
@Eric-Arellano

Eric-Arellano May 16, 2019

Author Contributor

Great suggestion! Once we close #7710 and can return the merged original digest + injected digest, we should rename this back to InitInjectedDigest. For now, InjectedInitDigest is a much better name.

I feel like I post a lot of those, people should stop listening to them at some point.

Please don't! They are always really helpful changes. The details, like what to name things, matter with programming.

@cosmicexplorer
Copy link
Contributor

left a comment

I think that there's real value to attempting to discuss and understand the novel v2 rulesets we're making, and especially to look, from the start, to find opportunities to make composable rulesets that others can hook into and to avoid repeated logic. For example, yesterday we found that we had missed such an existing composable ruleset in the native backend in #7735, and may have been about to write that logic all over again. I would really like to not have absolute paths to files sprouting without TODOs attached to them, and I think we should do that here and link to issues or a project if necessary so it can be clear to others what the timeline of these changes are and what upcoming work is going to invalidate the temporary workaround of using an absolute path (/usr/bin/touch in this case).

Separately, if there is a PR which is blocking others and therefore is under pressure to review quickly, it would be extremely useful to me if that could be noted somehow in the PR title, label, or description (a label named hot for example?) so I could know what type of feedback is appropriate to provide.

else:
# TODO(7718): add a builtin rule for FilesContent->Snapshot.
touch_init_request = ExecuteProcessRequest(
argv=("/usr/bin/touch",) + missing_init_files,

This comment has been minimized.

Copy link
@cosmicexplorer

cosmicexplorer May 16, 2019

Contributor

It's not clear what the significance of the proposed builtin rule is to the line below the comment. If it could include something about not having absolute paths to files in this comment, I would be satisfied. I don't see why that's something we don't want to do here.

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2019

I think that there's real value to attempting to discuss and understand the novel v2 rulesets we're making, and especially to look, from the start, to find opportunities to make composable rulesets that others can hook into and to avoid repeated logic.

I agree! And I appreciate that you raised the concern originally in #7696. When writing this the first time, I did not know that it's bad to have to rely on touch.

My main point is that we can incrementally improve these rules via followup PRs (so long as we actually address the TODOs, which so far we have been doing). Which leads to the next point about not communicating this well enough:

Separately, if there is a PR which is blocking others and therefore is under pressure to review quickly, it would be extremely useful to me if that could be noted somehow in the PR title, label, or description (a label named hot for example?) so I could know what type of feedback is appropriate to provide.

Acknowledged and agreed. It would have been a lot less confusing all around if I communicated better the incremental plan to get this landed as the first improvement, and then to follow it up with the next step once Daniel lands #7739. I'm sorry for the confusion this caused!

Thanks for the reviews everyone and to @illicitonion for landing #7725!

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2019

Merging because I just made a comment change and the prior two runs were green.

@Eric-Arellano Eric-Arellano merged commit 6f99a14 into pantsbuild:master May 16, 2019

1 check failed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details

@Eric-Arellano Eric-Arellano deleted the Eric-Arellano:v2-inject-init branch May 16, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.