Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce new --no-compile flag to not include .pyc in built Pex due to its non-determinism #718

Merged
merged 6 commits into from May 2, 2019

Conversation

Eric-Arellano
Copy link
Contributor

@Eric-Arellano Eric-Arellano commented May 1, 2019

Problem

Until Python 3.7 via https://www.python.org/dev/peps/pep-0552/, .pyc files were never reproducible because they included the timestamp. See the discussion at #716 (comment).

We are aiming to ensure reproducible builds a la #716 so that we can safely remote PEXes in CI. Reproducible builds also play an important role in security, per https://reproducible-builds.org.

Solution

Allow users to toggle on and off .pyc files via the new --compile / --no-compile flag.

We currently default to including .pyc, because this has been the precedent for 9 years. In a future release, e.g. 1.70, we will change the default to not include .pyc so that builds are reproducible by default, but people can still opt into including the files if they'd like via the flag.

Alternative flag considered: --reproducible

Originally, we were going to use --reproducible to toggle on multiple behaviors like not including .pyc and using a hardcoded timestamp instead of system time. However, after deciding that in a future release we are going to default to reproducible, it doesn't make sense for people to ever call --not-reproducible, but it does make sense to opt out of specific behaviors like wanting to use --include-pyc. So, we use behavior-specific flags instead of a universal flag.

Alternative solution: ensure our own .pyc timestamp

It may be possible to still include .pyc and have their timestamp be deterministic #718 (comment). This is not pursued for now because it is more complex than necessary, especially because currently we do not ship .pyc files if multiple platforms are used in the PEX. But it could be a followup PR if deemed worth it.

Result

When --no-compile is set, PEXes will no longer include .pyc. This results in all of the reproducible build acceptance tests passing the exploded pex portion of their tests, excluding the bdist_wheel test.

Impact on performance

Not using .pyc has a slight speedup of ~0.2 seconds to the creation of simple PEXes.

  • time python -m pex -o normal.pex --compile averaged around 0.49 seconds
  • time python -m pex -o repro.pex --no-compile averaged around 0.31 seconds

Not using .pyc has a slight slowdown of ~0.06 seconds to the startup time of simple PEXes.

  • time ./normal.pex --version averaged around 0.26 seconds
  • time ./repro.pex --version averaged around 0.32 seconds

python -m pex -m pydoc -o pydoc.pex likewise ran 0.07 seconds slower when not including .pyc.

Note that the inclusion of .pyc only impacts startup time, not actual runtime performance. See https://stackoverflow.com/questions/2998215/if-python-is-interpreted-what-are-pyc-files for what .pyc files do / why they exist.

Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding so many of you to the review because this PR has two very important API decisions to make:

  1. What to call the flag that toggles reproducible build behavior.
  2. Do we default to reproducible builds?

Writing up explanations of both and would love your feedback on either or both. Thanks!

@cosmicexplorer
Copy link
Contributor

Is it possible to pyc-compile files in the pex upon first use? Would that make this significantly slower? If it isn't too much slower to pyc-compile at first use, we may be able to expect a slight speedup for process invocation requests which somehow invoke some single pex multiple times.

@cosmicexplorer
Copy link
Contributor

  1. What to call the flag that toggles reproducible build behavior.

I actually think we could consider this an option which is mutually exclusive with -o, and call it --hermetic-build or --hermetic-file-output or something. I don't understand what a "reproducible" option value would mean if we're not creating a pex file, so making it specifically into an output file argument seems to be a good way to scope that for now.

  1. Do we default to reproducible builds?

See above. We may not need to even make -o and --hermetic-build mutually exclusive -- pex could output both a non-hermetic pex to the -o argument and a hermetic pex to the --hermetic-build argument in one command!

Copy link
Member

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

group.add_option(
'--reproducible', '--not-reproducible',
dest='reproducible',
default=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Premature optimization may be the root of all evil, but the optimization is established with a ~9 year precedent. I think it is unwise to change this default of long standing - there are real non-Pants pex customers out there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the optimization is established with a ~9 year precedent.

Agreed. I support sticking with precedent and defaulting to not reproducible, particularly because --reproducible means either they will have to define SOURCE_DATE_EPOCH or we hardcode it with some arbritary value, per https://github.com/pantsbuild/pex/pull/718/files#r279998538.

cc @illicitonion today, who was curious if it would be feasible to default to --reproducible, if you have thoughts on this all.

Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to pyc-compile files in the pex upon first use?

I don't think so, without changing our assumptions of Pex being a zipfile and allowing mutations. I talked to John over Slack whether Pex ever mutates its .pex file after a run, and it sounds like no it does not, which is in general a great thing. Even if we wanted it to modify the pex file to introduce .pyc, I'm not 100% sure we could without significant ceremony.

Beyond the feasibility of doing this, I don't think we would want to introduce .pyc after its built, because then the Pex would no longer be reproducible after a single run, which isn't very valuable. The whole purpose of this project is to ensure that they are always reproducible.

pex/bin/pex.py Outdated
@@ -291,6 +291,17 @@ def configure_clp_pex_options(parser):
'No value (alias for prefer, for backwards compatibility). '
'[Default: %default]')

group.add_option(
'--reproducible', '--not-reproducible',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not convinced myself that we want to call the flag this, specifically because of timestamps.

Due to the way zipfiles work, we cannot have a reproducible build without somehow hardcoding the timestamp. Per https://reproducible-builds.org/docs/source-date-epoch/, the canonical way to do this is via an env var SOURCE_DATE_EPOCH. If we don't use this env var, then we would have to ourselves hardcode some arbitrary value for what timestamp the zip file should use, which seems very wrong.

So, my hesitation is that unless you specify SOURCE_DATE_EPOCH, the build will not be truly reproducible. I don't want the flag to make a false promise.

--

Just thought of a good workaround. If they specify --reproducible, then we eagerly check that SOURCE_DATE_EPOCH is set in their env vars. If not, we fail and request they set this, then run again.

I advocate for this approach.

--

If we don't go with this workaround and don't like --reproducible, then we could rename this to something like --pyc-files-included / --no-pyc-files-included or --byte-code-included / --no-byte-code-included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsirois

As to names for things - I'm never right. I'll leave that debate to you all.

Fair, but could you weigh in please on the SOURCE_DATE_EPOCH idea of them having to supply it via env var vs. us hardcoding an arbitrary value? If we indeed are going to ask users to define SOURCE_DATE_EPOCH, which per Reproducible Builds site is the canonical way to do this, then I think both decisions are easy to make.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think things that are necessary for reproducible builds should be specified in options scoped specifically to reproducible builds. If a value named SOURCE_DATE_EPOCH is required, we could make an option --reproducible-source-date-epoch=<epoch> which would be required if --reproducible-build=<output pex file name> was provided, and otherwise ignored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, but could you weigh in please on the SOURCE_DATE_EPOCH idea...

Honoring SOURCE_DATE_EPOCH sounds sane, forcing it to be specified does not make sense to me. Why can't we use 0 as the default? It may be arbitrary, but that's the whole point - the value is arbitrary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honoring SOURCE_DATE_EPOCH sounds sane, forcing it to be specified does not make sense to me. Why can't we use 0 as the default? It may be arbitrary, but that's the whole point - the value is arbitrary.

That sounds good to me.

If not reproducible build, use default system time.
If reproducible build, first check SOURCE_DATE_EPOCH then fallback to 0.

--

Danny, in general I would agree but I don't think we want to use --reproducible-source-date-epoch because SOURCE_DATE_EPOCH is an env var used by dozens of tools, like cmake and gcc. We don't want to implement our own custom version.

--

So, with that established that we would not require SOURCE_DATE_EPOCH to be defined, then I would advocate using --reproducible, followed by Danny's suggestion to have an exclusively mutual options. Note that when running pex --help, --reproducible, --not-reproducible appears under this section:

PEX output options:
    Tailor the behavior of the emitted .pex file if -o is specified.

Indeed it doesn't make since to use --reproducible if -o foo.pex is left off, but our docs already make that clear. Keeping this as a flag that modifies -o, rather than introducing a new command, fits in with how we currently approach --zip-safe, --always-write-cache, --ignore-errors, and --inherit-path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping this as a flag that modifies -o, rather than introducing a new command, fits in with how we currently approach --zip-safe, --always-write-cache, --ignore-errors, and --inherit-path.

Ok, that sounds a lot more correct -- let's do that.

group.add_option(
'--reproducible', '--not-reproducible',
dest='reproducible',
default=True,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next important question. Do we default to reproducible or not?

See the PR description for my argument for why I think we should default to reproducible, along with analysis on performance hit for doing this: #718.

However, this may be closely related to the first question. If we go with my proposal to call the flag --reproducible and to eagerly check that they have SOURCE_DATE_EPOCH defined, then we certainly should not make --reproducible the default - that would be a horrible UX to fail the first time they try to run ./pex -o test.pex and then to instruct them to have to set the mysterious sounding SOURCE_DATE_EPOCH.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think making this a filename option similar to -o (but writing a hermetic pex instead) is a great way to avoid this question at all because it doesn't require picking a default.

@jsirois
Copy link
Member

jsirois commented May 1, 2019

As to all the .pyc hand-wringing: this was / is a broken feature anyhow. We do not bytecode compile for all platforms in a multiplatform pex for example. It only really works for very simple pexes today. If someone cares about perf across platforms on repeated use they can mark --not-zipsafe and get correct pyc for the current platform on 1st run and cached use of that later with no unzipping overhead.

@cosmicexplorer
Copy link
Contributor

then the Pex would no longer be reproducible after a single run, which isn't very valuable

To me, it's not yet clear how reproducible pexes are intended to be produced and consumed, so I'm not making too many assumptions about what parts must be reproducible. I wouldn't want to rule out a way to create a non-hermetic pex (which may have faster startup) if only a hermetic one is available, for example.

@jsirois
Copy link
Member

jsirois commented May 1, 2019

As to names for things - I'm never right. I'll leave that debate to you all.

@jsirois
Copy link
Member

jsirois commented May 1, 2019

For fullness of context going forward - the pyc thing can be managed if we want to - I think. pex/compiler.py could make a copy of the source, set times on the file and compile that instead of the original. If the times it sets are coordinated with the fake time fed to the zip entries we create when we fix that timestamp problem, i think you'd get reproduceable pyc even under python2.7.

@jsirois
Copy link
Member

jsirois commented May 1, 2019

Beyond the feasibility of doing this, I don't think we would want to introduce .pyc after its built, because then the Pex would no longer be reproducible after a single run, which isn't very valuable. The whole purpose of this project is to ensure that they are always reproducible.

Be careful. The whole point of the project is to ensure built pexes are reproduceable, not that pex runs are reproduceable. Build time vs run time in pex is a serious tripping point in comprehension. I've learned this lesson multiple times. In this project we only care about build-time.

Copy link
Contributor Author

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fullness of context going forward - the pyc thing can be managed if we want to - I think. pex/compiler.py could make a copy of the source, set times on the file and compile that instead of the original. If the times it sets are coordinated with the fake time fed to the zip entries we create when we fix that timestamp problem, i think you'd get reproduceable pyc even under python2.7.

Hm, I do think that this is right. Notably, the PEP https://www.python.org/dev/peps/pep-0552/#rationale suggests in workaround #3 something like this. I didn't see how we could leverage it in our case, but I think your suggestion would work.

So 3rd decision now, how much do we value .pyc being included for reproducible builds? If we want to pursue this workaround, we should probably not merge this and instead go right to the custom .pyc timestamp approach.

pex/bin/pex.py Outdated
@@ -291,6 +291,17 @@ def configure_clp_pex_options(parser):
'No value (alias for prefer, for backwards compatibility). '
'[Default: %default]')

group.add_option(
'--reproducible', '--not-reproducible',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honoring SOURCE_DATE_EPOCH sounds sane, forcing it to be specified does not make sense to me. Why can't we use 0 as the default? It may be arbitrary, but that's the whole point - the value is arbitrary.

That sounds good to me.

If not reproducible build, use default system time.
If reproducible build, first check SOURCE_DATE_EPOCH then fallback to 0.

--

Danny, in general I would agree but I don't think we want to use --reproducible-source-date-epoch because SOURCE_DATE_EPOCH is an env var used by dozens of tools, like cmake and gcc. We don't want to implement our own custom version.

--

So, with that established that we would not require SOURCE_DATE_EPOCH to be defined, then I would advocate using --reproducible, followed by Danny's suggestion to have an exclusively mutual options. Note that when running pex --help, --reproducible, --not-reproducible appears under this section:

PEX output options:
    Tailor the behavior of the emitted .pex file if -o is specified.

Indeed it doesn't make since to use --reproducible if -o foo.pex is left off, but our docs already make that clear. Keeping this as a flag that modifies -o, rather than introducing a new command, fits in with how we currently approach --zip-safe, --always-write-cache, --ignore-errors, and --inherit-path.

@Eric-Arellano Eric-Arellano changed the title Default to not including .pyc files in built PEXes for reproducible builds & introduce --not-reproducible flag Introduce new --reproducible flag that will won't include .pyc files due to their nondeterminism May 1, 2019
@jsirois
Copy link
Member

jsirois commented May 1, 2019

So 3rd decision now, how much do we value .pyc being included for reproducible builds? If we want to pursue this workaround, we should probably not merge this and instead go right to the custom .pyc timestamp approach.

Since .pyc is basically broken for pex (the multiplatform / multi-python issue), I'd save changing how we create .pyc for the restricted case we support for future work. Avoid scope creep here. I was going to comment earlier but refrained - though now it's relevant: I think the --reproducable option help is too detailed. Do not get into how we make it reproducable, just that we do. Maybe warn that there may be a small perf impact at runtime - but no more. This buys flexibility in firming up how we acheive reproduceable.

That riff though brings to mind a point to consider - what guaranty do we want to provide wrt pex version and reproducable. If today we ship reproducable omitting pyc and tomorrow generating pyc, then today output != tomorrow output. Perhaps just guaranteeing for a given pex version the output will be reproducable and no more would suffice. Variants could include - for a given major version, etc. In this decision, Pants is the only customer / initial customer - so Pants needs can hold sway.

If no option is given, then it defaults to not being reproducible.
The how is an implementation detail we shouldn't restrict ourselves to.
@Eric-Arellano
Copy link
Contributor Author

I think it's safe to assume that we should not default to reproducible, unless someone wants to push back. I updated the PR title and description to reflect this.

--

Avoid scope creep here.

Agreed. Thanks for stopping me from letting this slip in.

I think the --reproducable option help is too detailed.

Fixed.

what guaranty do we want to provide wrt pex version and reproducable.

Hm, @stuhood or anyone else from Twitter? I don't understand Twitter's use case well enough yet to comment on this. I would bias towards the simplest for us devs as possible, meaning no guarantee.

Wouldn't a guarantee mean that we couldn't change any Pex code during a major version, because the .bootstrap folder would change? If so, that seems like a non-starter.

@Eric-Arellano Eric-Arellano changed the title Introduce new --reproducible flag that will won't include .pyc files due to their nondeterminism Introduce new --reproducible flag that won't include .pyc files due to their nondeterminism May 1, 2019
@cosmicexplorer
Copy link
Contributor

what guaranty do we want to provide wrt pex version and reproducable.
Hm, @stuhood or anyone else from Twitter? I don't understand Twitter's use case well enough yet to comment on this. I would bias towards the simplest for us devs as possible, meaning no guarantee.

I'm not an authority, but I don't see any specific use case for Twitter that would conflict with wanting to allow development as quickly and simply as possible upstream. One way to orchestrate a transition from unstable -> stable reproducibility could be:

  1. pex output with --reproducible is reproducible for a particular installation of pex on a host
  2. pex output with --reproducible for a given major.minor.patch version is reproducible across hosts on at least one platform
  3. pex output with --reproducible for a given major.minor.patch version is reproducible across hosts on all supported platforms

After that, we could reach for --reproducible being guaranteed for longer periods, up to minor versions then major versions, depending on what demands users have.

@cosmicexplorer
Copy link
Contributor

Wouldn't a guarantee mean that we couldn't change any Pex code during a major version, because the .bootstrap folder would change? If so, that seems like a non-starter.

Or, perhaps, it could seem like removing the .bootstrap folder is the next challenge for the implementors of the --reproducible flag!

@illicitonion
Copy link
Contributor

I think it's safe to assume that we should not default to reproducible, unless someone wants to push back. I updated the PR title and description to reflect this.

I pretty strongly disagree here.

I think the way I'd phrase this is things should be reproducible by default, and we should expose flags which enable specific violations with clear names and purposes. Because I think reproducibility should be the default, I don't think we want a --reproducible flag at all; but specific opt-outs. People should reach for the specific opt-out that they want if they need it, rather than opting out of reproducibility wholesale.

So specifically this change would introduce a flag named --prefer-startup-speed-to-reproducibility (default=False) which would embed .pyc files in, if we think those 60ms are necessary for people to be able to opt in to. If in the future, we make reproducible .pyc files, we can just deprecate that one flag, and people will have reproducible builds, which is a net positive.

For timestamps, we would introduce a flag named --use-system-timestamp (default=False) which would allow timestamp setting.

For most changes (e.g. ordering of files), there wouldn't be a flag, because there would just be a string reproducibility improvement with no reason to opt out of it.

what guaranty do we want to provide wrt pex version and reproducable.

Hm, @stuhood or anyone else from Twitter? I don't understand Twitter's use case well enough yet to comment on this. I would bias towards the simplest for us devs as possible, meaning no guarantee.

Wouldn't a guarantee mean that we couldn't change any Pex code during a major version, because the .bootstrap folder would change? If so, that seems like a non-starter.

My opinion:
We should absolutely guarantee reproducibility within the same exact version of pex.

We should aim for reproducibility across as many versions as we can, but it's ok to violate if we need to.

That sounds good to me.

If not reproducible build, use default system time.
If reproducible build, first check SOURCE_DATE_EPOCH then fallback to 0.

We shouldn't actually use 0 for this for a few fun reasons... Some tools get confused at 0, and in particular the DOS epoch starts later.

Buck uses 1985-02-01 (and has some comments explaining why, which are kind of JVM-specific, but probably still worth heeding): https://github.com/pantsbuild/pants/blob/master/3rdparty/jvm/org/pantsbuild/buck/util/zip/ZipConstants.java

Bazel uses the DOS epoch (1980-01-01): https://github.com/bazelbuild/bazel/blob/baefeabefc64bb30c293a3dfa9e45a0017b7696b/src/java_tools/singlejar/java/com/google/devtools/build/zip/CentralDirectoryFileHeader.java#L112 and https://github.com/bazelbuild/bazel/blob/baefeabefc64bb30c293a3dfa9e45a0017b7696b/src/java_tools/singlejar/java/com/google/devtools/build/zip/ZipUtil.java#L33-L34

@jsirois
Copy link
Member

jsirois commented May 1, 2019

I think it's safe to assume that we should not default to reproducible, unless someone wants to push back. I updated the PR title and description to reflect this.

I pretty strongly disagree here.

@illicitonion was the context that pex has included .pyc for ~9 years evident? I agree that pex would ideally default to reproducible but on this one I'm hesitant to switch without a deprecation period - say flip at pex 1.7.0 or 2.0.0. There is an ecosystem of non-Pants pex users out there and although start-up time is likely not on most of their minds - who knows.

@illicitonion
Copy link
Contributor

I think it's safe to assume that we should not default to reproducible, unless someone wants to push back. I updated the PR title and description to reflect this.

I pretty strongly disagree here.

@illicitonion was the context that pex has included .pyc for ~9 years evident? I agree that pex would ideally default to reproducible but on this one I'm hesitant to switch without a deprecation period - say flip at pex 1.7.0 or 2.0.0. There is an ecosystem of non-Pants pex users out there and although start-up time is likely not on most of their minds - who knows.

Yeah, the context was evident; I just suspect that realistically no one is going to actually notice the 60ms of extra startup overhead. (And we have a flag they can set to get rid of it, in case they do).

That said, I'd be ok with having the flag default to the existing behaviour until some specified version; though I don't know how we'd communicate the deprecation as we try not to interfere with stdout/stderr in pex. Does pex actually have a deprecation policy? The last breaking change I remember was https://github.com/pantsbuild/pex/blob/master/CHANGES.rst#163 which we just did in a patch release and noted in the release notes...

@jsirois
Copy link
Member

jsirois commented May 1, 2019

Yeah, the context was evident; I just suspect that realistically no one is going to actually notice the 60ms of extra startup overhead. (And we have a flag they can set to get rid of it, in case they do).

PEX does not have a deprecation policy yet: #584

That said I have been trying to be reasonable about versions. 1.5 -> 1.6 encoded vendored setuptools / wheel; although this was unlikely to effect end users visibly. The 1.6.3 thing was a bit different since the breaking change was to a Twitter-added feature mainly used by Twitter - we had ~full knowledge of the (non) use case of using two instances of the flag in the same command line. This case feels different since the .pyc is a feature implicitly used by everyone. I don't feel comfortable guessing how many ms of startup overhead are important to users out in the wild, so I'll gladly take you up on your openness to flip the default. I'd be happy with doing this in a 1.7.0 release.

@illicitonion
Copy link
Contributor

Yeah, the context was evident; I just suspect that realistically no one is going to actually notice the 60ms of extra startup overhead. (And we have a flag they can set to get rid of it, in case they do).

PEX does not have a deprecation policy yet: #584

That said I have been trying to be reasonable about versions. 1.5 -> 1.6 encoded vendored setuptools / wheel; although this was unlikely to effect end users visibly. The 1.6.3 thing was a bit different since the breaking change was to a Twitter-added feature mainly used by Twitter - we had ~full knowledge of the (non) use case of using two instances of the flag in the same command line. This case feels different since the .pyc is a feature implicitly used by everyone. I don't feel comfortable guessing how many ms of startup overhead are important to users out in the wild, so I'll gladly take you up on your openness to flip the default. I'd be happy with doing this in a 1.7.0 release.

Sounds good :)

@Eric-Arellano
Copy link
Contributor Author

(Afk so pardon the short response)

This sound like a great way forward, specifically to move away from a general —reproducible flag and to instead have flags for the specific trade offs.

It appears so far via experimenting on the acceptance tests that we only need to make trade offs with pyc and with the time stamp, along with maybe something related to dependencies. Any changes like to data structures can be done safely without needing a flag. With this pyc flag, we could remove it if we figure out the deterministic timestamp.

Sounds good on default until 1.17.

@stuhood
Copy link

stuhood commented May 1, 2019

After that, we could reach for --reproducible being guaranteed for longer periods, up to minor versions then major versions, depending on what demands users have.

We should aim for reproducibility across as many versions as we can, but it's ok to violate if we need to.

To respond to both of these, I definitely do not care whether pex is reproducible across versions. Only for exactly the same inputs, including pex itself. So IMO this has absolutely nothing to do with deprecation policies.

@stuhood
Copy link

stuhood commented May 1, 2019

I would also suggest maybe opening a new PR if this has changed fundamentally from where it started? I haven't reviewed yet, so don't know.

… will switch in a future version

Per Daniel's sugggestion and John's feedback, we decided two things:
1) We should default to current behavior until at least 1.70, e.g. we should still include .pyc. At that specified version, we can make the switch so that we are reproducible by default.
2) Behavior specific flags work better than a universal `--reproducible` / `--not-reproducible`. Especially since we will eventually default to `--reproducible`, it's highly unlikely anyone will want to call `--not-reproducible`, but it is likely they will want to turn off specific behaviors like wanting to still include `.pyc`.

So, this renames the flag to `--include-pyc` / `--no-include-pyc`, with a default to including pyc.
@Eric-Arellano Eric-Arellano changed the title Introduce new --reproducible flag that won't include .pyc files due to their nondeterminism Introduce new --no-include-pyc flag to not include in built Pex due to their non-determinism May 1, 2019
@Eric-Arellano Eric-Arellano changed the title Introduce new --no-include-pyc flag to not include in built Pex due to their non-determinism Introduce new --no-include-pyc flag to not include .pyc in built Pex due to its non-determinism May 1, 2019
@Eric-Arellano
Copy link
Contributor Author

Ready for a fresh round of reviews.

To summarize what we've decided so far:

  1. Default to current behavior until 1.70, at which point we default to reproducible builds.
  2. Use behavior specific flags rather than a general --reproducible. Once we change to reproducible by default, it doesn't make sense to ever use --not-reproducible, but it does make sense to opt out of specific behavior e.g. to use --include-timestamp.
  3. Flag should likely be called --include-pyc / --no-include-pyc. Talked to a few of you over Slack that this is more clear and less wordy than --prefer-startup-speed-to-reproducibility.

pex/bin/pex.py Outdated
action='callback',
callback=parse_bool,
help='Including .pyc files will result in slightly faster startup performance, but at the '
'expense of making the generated pex not be reproducible, meaning that if you were to '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be softened to maybe. If 3.7+ and they have exported SOURCE_DATE_EPOCH, then, today without changes to Pex, they get reproducible .pyc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Although didn't explicitly mention this 3.7+ case because that seems pretty niche. Let me know how the new one is.

pex/bin/pex.py Outdated
default=True,
action='callback',
callback=parse_bool,
help='Including .pyc files will result in slightly faster startup performance, but at the '
Copy link
Member

@jsirois jsirois May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if / how to include in help without writing a book, but this is only true for PEX sources, not anything added via requirements. So, for a requirements only pex, that's just the __main__.py and .bootstrap/.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow how I should update for this. Do requirements ever have .pyc in them? It looks from a simple test of pex six -o t.pex --no-compile; unzip t.pex -d t that the six files is just the wheel itself and that there is no .pyc regardless of --compile or --no-compile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I meant and I have no great suggestion. The basic worry is that a new PEX user sees --compile and expects using that option will get all bytecode compiled. It won't. For the perhaps typical non-Pants use case of requirements only, it will even dissapointingly compile almost nothing.

pex/bin/pex.py Outdated Show resolved Hide resolved
Consistency with Pip is a great thing.
@Eric-Arellano Eric-Arellano changed the title Introduce new --no-include-pyc flag to not include .pyc in built Pex due to its non-determinism Introduce new --no-compile flag to not include .pyc in built Pex due to its non-determinism May 2, 2019
@Eric-Arellano Eric-Arellano merged commit ec14916 into pex-tool:master May 2, 2019
@Eric-Arellano Eric-Arellano deleted the no-pyc branch May 2, 2019 04:01
@illicitonion
Copy link
Contributor

After that, we could reach for --reproducible being guaranteed for longer periods, up to minor versions then major versions, depending on what demands users have.

We should aim for reproducibility across as many versions as we can, but it's ok to violate if we need to.

To respond to both of these, I definitely do not care whether pex is reproducible across versions. Only for exactly the same inputs, including pex itself. So IMO this has absolutely nothing to do with deprecation policies.

I care insofar as the more reproducible it happens to be the better. In the limit, if we embedded the version of pex used to make it in every pex file, we would get lower cache hit rates for no good reason. This would be bad, but not the end of the world.

The more versions we can be reproducible across, the better. But we shouldn't make life harder for ourselves to meet this aim.

@jsirois jsirois mentioned this pull request May 7, 2019
11 tasks
@ngortheone
Copy link

I desperately need a way too pass --no-use-system-time --no-compile options from pants.
I can't find a relevant docs on how to do that. Please help.

@Eric-Arellano
Copy link
Contributor Author

Hi @ngortheone, we have not yet hooked this up to Pants and are still upgrading Pants to the newest Pex release (released last night).

Could you please share a bit more about your use case? For example, are you running ./pants binary? That's helpful for us to know in our prioritization that there others who want this feature.

@ngortheone
Copy link

@Eric-Arellano
Here is the gist of my setup. I have a lot of aws lambdas that I build like this:

python_binary(
  name='release-trigger',
  source='release_trigger.py',
  dependencies=[
    ':lib',
  ],
  platforms=[
    'linux-x86_64',
    'current',
  ],
  no_cache=True,
)

python_awslambda(
  name='release-trigger-lambda',
  binary=':release-trigger',
  handler='release_trigger:entry_point',
  no_cache=True,
)

I call pants bundle :release-trigger-lambda which calls binary

I need a way to build a deterministic package, because currently all pyc files have different sha every time I build.

Separately, it would be great to have global settings for python_binary, where I could specify things like:

  platforms=[
    'linux-x86_64',
    'current',
  ],
  no_cache=True,

once, without having to repeat it in every python_binary

@ngortheone
Copy link

I see that options are not currently wired
https://github.com/pantsbuild/pants/blob/master/src/python/pants/backend/python/targets/python_binary.py#L57

It seems that the way it is coded is not scalable, as you need to update it each time a new options is added. There should be a more generic way of passing any cli options down to pex.

Is there a workaround I can use in the meanwhile?

@Eric-Arellano
Copy link
Contributor Author

Thanks @ngortheone.

We're currently working to land Pex 1.6.7.

Originally, we were planning to wait until Pex 1.7.0 is released to let the new default behavior kick in. However, I audited just now the changes that we need to make and there aren't many, so once we start using Pex 1.6.7 in Pants I will open a PR to default to Pants generating reproducible PEXes.

@ngortheone
Copy link

Thanks! Can't wait

stuhood pushed a commit to pantsbuild/pants that referenced this pull request May 16, 2019
…7734)

Pex 1.6.7 added support for reproducible Pex builds (pex-tool/pex#716), meaning that the output of `pex -o 1.pex` will be byte for byte identical to `pex -o 2.pex`. This specifically means that the PEX will use a deterministic timestamp of January 1, 1980 and that the built PEX will not include `.pyc`, which is inherently non-reproducible.

In Pex 1.7.0, Pex will default to this reproducible behavior. Originally we were going to wait for that to land to avoid making any changes to Pex. However, a [user has a strong desire](pex-tool/pex#718 (comment)) for reproducible PEX builds so we are turning on the behavior earlier.

Note that we do not introduce any flags to turn off reproducible builds, because we expect that all users will want this behavior. If we get feedback that users do not, e.g. that they want to include `.pyc` files, then we will add options to turn this behavior off.
@Eric-Arellano Eric-Arellano mentioned this pull request Nov 14, 2019
2 tasks
jsirois added a commit to jsirois/pex that referenced this pull request Nov 14, 2019
This follows up on the decision to switch Pex defaults to do so made
in pex-tool#718.
jsirois added a commit that referenced this pull request Nov 15, 2019
This follows up on the decision to switch Pex defaults to do so made
in #718.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants