pex resolver can end up with truncated reads of binary dists, causing an opaque untranslateable error #475

Open
kwlzn opened this issue May 14, 2018 · 10 comments

@kwlzn
Contributor

kwlzn commented May 14, 2018

based on an issue we've seen internally when downloading large (~600MB) wheel files, it looks like pex does not currently check a downloaded payload against its expected Content-Length header.

this usually manifests as an untranslateable error:

Exception caught: (<class 'pex.resolver.Untranslateable'>)
  File ".bootstrap/_pex/pex.py", line 367, in execute
    self._wrap_coverage(self._wrap_profiling, self._execute)
  File ".bootstrap/_pex/pex.py", line 293, in _wrap_coverage
    runner(*args)
  File ".bootstrap/_pex/pex.py", line 325, in _wrap_profiling
    runner(*args)
  File ".bootstrap/_pex/pex.py", line 410, in _execute
    return self.execute_entry(self._pex_info.entry_point)
  File ".bootstrap/_pex/pex.py", line 468, in execute_entry
    return runner(entry_point)
  File ".bootstrap/_pex/pex.py", line 486, in execute_pkg_resources
    return runner()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/pants_loader.py", line 69, in main
    PantsLoader.run()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/pants_loader.py", line 65, in run
    cls.load_and_execute(entrypoint)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/pants_loader.py", line 58, in load_and_execute
    entrypoint_main()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/pants_exe.py", line 39, in main
    PantsRunner(exiter, start_time=start_time).run()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/pants_runner.py", line 53, in run
    return runner.run()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/local_pants_runner.py", line 49, in run
    self._run()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/local_pants_runner.py", line 97, in _run
    goal_runner_result = goal_runner.run()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/goal_runner.py", line 269, in run
    result = self._execute_engine()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/bin/goal_runner.py", line 257, in _execute_engine
    result = engine.execute(self._context, self._goals)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/engine/legacy_engine.py", line 26, in execute
    self.attempt(context, goals)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/engine/round_engine.py", line 233, in attempt
    goal_executor.attempt(explain)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/engine/round_engine.py", line 49, in attempt
    task.execute()
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/backend/python/tasks/resolve_requirements.py", line 28, in execute
    pex = self.resolve_requirements(interpreter, self.context.targets(has_python_requirements))
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/backend/python/tasks/resolve_requirements_task_base.py", line 66, in resolve_requirements
    dump_requirement_libs(builder, interpreter, req_libs, self.context.log, platforms=maybe_platforms)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/backend/python/tasks/pex_build_util.py", line 153, in dump_requirement_libs
    dump_requirements(builder, interpreter, reqs, log, platforms)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/backend/python/tasks/pex_build_util.py", line 177, in dump_requirements
    distributions = _resolve_multi(interpreter, deduped_reqs, platforms, find_links)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl.a91145c8b1d46c3c1f7d4171563217e0cf70df1a/pantsbuild.pants-1.6.0rc2+1df5132b-cp27-none-linux_x86_64.whl/pants/backend/python/tasks/pex_build_util.py", line 220, in _resolve_multi
    pkg_blacklist=python_setup.resolver_blacklist)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pex-1.3.2-py2.py3-none-any.whl.098aaa2b0489957ad9856370c8ba2c9e44ff3e75/pex-1.3.2-py2.py3-none-any.whl/pex/resolver.py", line 398, in resolve
    return resolver.resolve(resolvables_from_iterable(requirements, builder))
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pex-1.3.2-py2.py3-none-any.whl.098aaa2b0489957ad9856370c8ba2c9e44ff3e75/pex-1.3.2-py2.py3-none-any.whl/pex/resolver.py", line 220, in resolve
    dist = self.build(package, resolvable.options)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pex-1.3.2-py2.py3-none-any.whl.098aaa2b0489957ad9856370c8ba2c9e44ff3e75/pex-1.3.2-py2.py3-none-any.whl/pex/resolver.py", line 293, in build
    dist = super(CachingResolver, self).build(package, options)
  File "/var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pex/install/pex-1.3.2-py2.py3-none-any.whl.098aaa2b0489957ad9856370c8ba2c9e44ff3e75/pex-1.3.2-py2.py3-none-any.whl/pex/resolver.py", line 178, in build
    raise Untranslateable('Package %s is not translateable by %s' % (package, translator))

Exception message: Package WheelPackage(u'file:///var/lib/mesos/slaves/8062d90b-712d-4410-8b3b-38e0af892326-S1968/frameworks/201104070004-0000002563-0000/executors/thermos-scoot-service-staging-scoot-worker-396-aa90d631-e2e9-4035-bc03-c91b0ebf302b/runs/7785048d-0be9-4021-8398-c022bf85d444/sandbox/workspace/source/.pants.d/python-setup/resolved_requirements/CPython-2.7.13/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl') is not translateable by ChainedTranslator(WheelTranslator, EggTranslator, SourceTranslator)

where the locally downloaded copy of torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl is truncated and much smaller than the size given in the Content-Length header - and thus not a valid zip archive.

to raise better errors, we should explicitly check the downloaded size against Content-Length so that incomplete reads fail with a helpful message.
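
A minimal sketch of the kind of check described above, assuming the fetch code has the raw Content-Length header value and the path of the downloaded file at hand. `TruncatedDownloadError` and `check_download_size` are hypothetical names for illustration, not part of pex.

```python
import os


class TruncatedDownloadError(Exception):
    """Hypothetical error type for illustration; not part of pex."""


def check_download_size(path, content_length):
    """Compare the on-disk size of `path` with the advertised Content-Length.

    `content_length` is the raw header value (a string), or None if the
    server did not send the header.
    """
    if content_length is None:
        return  # nothing to compare against
    expected = int(content_length)
    actual = os.path.getsize(path)
    if actual != expected:
        raise TruncatedDownloadError(
            "%s: expected %d bytes per Content-Length, got %d" % (path, expected, actual)
        )
```

With a check like this, the failure would name the file and the byte counts at download time instead of surfacing later as `Untranslateable`.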

@kwlzn kwlzn added the bug label May 14, 2018
@stuhood

stuhood commented May 16, 2018

So, two potential sketches of how this might be implemented:

  1. Changing Context.open to return a tuple of file-like and length (where the length would be captured from the content-length header via Requests and Urllib), and then validating the length in callers... particularly Context.fetch.
  2. Adding a (required?) validator function that Context.fetch uses to decide whether to move the downloaded content to its final location. For all of sdists/whls/eggs, this might look like opening the archive and ensuring that we can extract the entire archive into /dev/null, or something.

@kwlzn : Thoughts? The former would be sufficient to catch the truncation case, but the latter could potentially catch some additional corruption beyond truncation... although it can't detect all corruption.
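
As a rough illustration of the second option, a validator could try to read every member of the archive and discard the bytes. The sketch below uses only the Python 3 standard library; `is_valid_dist` is a hypothetical name, and how it would plug into Context.fetch is left open.

```python
import tarfile
import zipfile
import zlib


def is_valid_dist(path):
    """Return True if the archive at `path` can be read end to end.

    Hypothetical helper; unknown file extensions are treated as invalid here.
    """
    try:
        if path.endswith((".whl", ".egg", ".zip")):
            with zipfile.ZipFile(path) as zf:
                # testzip() reads every member and returns the first bad name, or None.
                return zf.testzip() is None
        if path.endswith((".tar.gz", ".tgz", ".tar.bz2", ".tar")):
            with tarfile.open(path) as tf:
                for member in tf:
                    if member.isfile():
                        extracted = tf.extractfile(member)
                        while extracted.read(1 << 20):
                            pass  # discard the bytes; only readability matters
            return True
    except (zipfile.BadZipFile, tarfile.TarError, zlib.error, EOFError, OSError):
        return False
    return False
```

A truncated wheel typically fails as soon as `zipfile.ZipFile` is constructed, because the central directory lives at the end of the file.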

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

#1 sounds 👌 to me - I think at a minimum, we should validate Content-Length to safeguard against short reads/broken servers.

#2 is interesting, but I'd be worried about the potential cost of that added validation for every artifact fetched (esp given some wheels like pytorch can be upwards of 600MB). and it may be overkill - since pex will already blow up on these bad artifacts w/ a clear signal of which file is broken - just later in the process as part of their use. I think that's probably sufficient, IMHO, as long as we're at least reading the correct payload from the server.

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

related: psf/requests#1938

The bigger problem you're going to have is that we transparently decompress gzipped and deflated data. This means that if you're served such a response your content length checking will fail.
...
There are two forms of decoding here: decoding from compressed data to bytes, and then decoding from bytes to unicode. The first is done in urllib3, the second in Requests. Because the first is done in urllib3, Requests never sees the gzipped bytes.
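
The mismatch described in that quote can be reproduced with the standard library alone. This is an illustrative sketch in Python 3 with a made-up payload: Content-Length counts the encoded bytes on the wire, while the caller sees the decoded body.

```python
import gzip

body = b"pex " * 1000                 # what the caller sees after decoding
wire_bytes = gzip.compress(body)      # what a server using Content-Encoding: gzip would send

# Content-Length describes the encoded (on-the-wire) bytes, not the decoded body.
content_length = len(wire_bytes)
decoded_length = len(gzip.decompress(wire_bytes))

# Comparing the two directly reports a mismatch even though nothing was truncated.
print(content_length == decoded_length)
```

A length check would therefore need to count bytes before content decoding happens, for example by reading the raw urllib3 response with `decode_content=False` rather than the decoded stream.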

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

also, a sane solution to just #1 would permit a retry strategy - which would be good to add too, I think.

@stuhood

stuhood commented May 17, 2018

Hm. Your comments about the complexities of #1 have me leaning toward #2... it's almost impossible to get wrong. Since once something is cached we never need to re-validate it, I don't think the performance overhead will be that significant (scanning an archive is very cheap... much cheaper than scanning loose files).

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

to me:

#1 is concerned with "am I reading the proper payload from the server"
#2 is concerned with "is the payload I read from the server, whether properly retrieved or not, a valid dist"

it seems entirely possible that a given server can serve up an invalid dist as its proper payload (i.e. we could not clearly indicate a connection/fetch issue based on a validation failure). this should not be a retryable event - and as we can see pex already blows up on it, just post-fetch. it's not clear to me what problem we would be solving by just moving that validation to fetch-time vs resolve-time.

but if the reason the dist is malformed is because the server didn't send all of it, I think that's certainly a retryable event (and/or at least one we can give a much better error message about) and ultimately the core problem I think we want to solve for here. I'm not sure there's a better way to do that in the framework of HTTP other than to find some reliable way to check the Content-Length in urllib{,3}?
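
The distinction drawn here, retry a short transfer but not a malformed dist, could look roughly like the following. All names (`TruncatedFetch`, `InvalidDist`, `fetch_with_retries`) are hypothetical, and `fetch` stands in for whatever callable actually performs the download.

```python
class TruncatedFetch(Exception):
    """Hypothetical: fewer bytes arrived than Content-Length promised (retryable)."""


class InvalidDist(Exception):
    """Hypothetical: all bytes arrived but the archive is malformed (not retryable)."""


def fetch_with_retries(fetch, url, attempts=3):
    """Call `fetch(url)`, retrying only when the transfer itself was cut short."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except TruncatedFetch as e:
            last_error = e  # transport problem: worth another attempt
    # Every attempt was truncated; surface the last failure to the caller.
    raise last_error
```

`InvalidDist` is deliberately not caught, so a server that faithfully serves a broken archive fails on the first attempt.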

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

#58 is also a thing, in case that helps simplify (then we'd only need to solve this for requests/urllib3).

@stuhood

stuhood commented May 17, 2018

this should not be a retryable event - and as we can see pex already blows up on it, just post-fetch.

When you get an invalid dist, it's effectively impossible to know why... it could have been corrupted anywhere between being created on some machine, being uploaded to a server, and finally when it is downloaded from the server. Because #1 only addresses the download, it doesn't catch as much of the problem.

In addition to doing #2, it would be great if there were checksums in place on the server side, but I don't know whether such a standard exists for python dists.

it's not clear to me what problem we would be solving by just moving that validation to fetch-time vs resolve-time.

It prevents putting something broken in the cache, and allows you to indicate to the user which thing from which url was broken. So it solves the "opaque error" issue in the description, and prevents it from recurring when you retry.

@kwlzn
Contributor Author

kwlzn commented May 17, 2018

When you get an invalid dist, it's effectively impossible to know why...

the specific problem we saw that led to this ticket was a truncated read that was only exposed by checking Content-Length. when we repro'd with curl, we noticed it was returning an (unchecked) CURLE_PARTIAL_FILE (18) exit code.

so I think there's a clear precedent for checking this header as a basic means of size validation (and ultimately, that's why Content-Length exists in the HTTP spec) - it just happens to be unfortunate that it's not as simple as flipping on a flag in requests in particular. it seemed like the urllib3 guys were amenable to an upstream change to make that easier tho?

(fwiw, I hadn't meant to dissuade us from going this route with my comments, but was just pointing out some potential stumbling blocks.)

In addition to doing #2, it would be great if there were checksums in place on the server side, but I don't know whether such a standard exists for python dists.

the idea behind "findlinks" repos is that they're plain old dumb web servers that you scrape A href links from, so there's definitely no standard where server side checksums would fit in there. a true pypi style "index server" may tho, but that wouldn't benefit us currently. but: there is existing client-side checksum support for payload integrity checking, we just don't take advantage of it today.
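
One form such client-side checking can take, sketched under the assumption that the scraped link carries a `#<algorithm>=<hexdigest>` URL fragment (the convention used by PEP 503 style indexes, e.g. `#sha256=...`). `verify_link_hash` is a hypothetical helper and says nothing about how pex itself handles these fragments.

```python
import hashlib
from urllib.parse import urlparse


def verify_link_hash(link_url, path):
    """Check `path` against a `#<algorithm>=<hexdigest>` fragment on `link_url`.

    Returns None when the link carries no hash, otherwise True or False.
    """
    fragment = urlparse(link_url).fragment
    if "=" not in fragment:
        return None
    algorithm, expected = fragment.split("=", 1)
    digest = hashlib.new(algorithm)  # raises ValueError for unknown algorithms
    with open(path, "rb") as fp:
        # Hash in 1 MiB chunks so large wheels are not loaded into memory at once.
        for chunk in iter(lambda: fp.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

A hash mismatch catches truncation as well as other corruption, but only for links that actually publish a digest.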

It prevents putting something broken in the cache, and allows you to indicate to the user which thing from which url was broken. So it solves the "opaque error" issue in the description, and prevents it from recurring when you retry.

yeah, I could see how that could be generally useful and it would solve the "opaque error" problem to a degree (it would tell you the source of the problem, but not the reason). perhaps the validation could be done inline as we're fetching via e.g. StreamFilelike?
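
Inline validation during the fetch might amount to a thin wrapper that counts bytes as they stream past and complains at end of stream. The class below is a hypothetical stand-in written for illustration; it does not reflect the actual StreamFilelike interface.

```python
import io


class LengthCheckingReader(object):
    """Hypothetical wrapper: counts bytes read and raises at EOF if short."""

    def __init__(self, fp, expected_length):
        self._fp = fp
        self._expected = expected_length
        self._seen = 0

    def read(self, size=-1):
        chunk = self._fp.read(size)
        self._seen += len(chunk)
        # EOF is reached when an unbounded read returns, or a bounded read comes back empty.
        at_eof = size is None or size < 0 or (size > 0 and not chunk)
        if at_eof and self._seen != self._expected:
            raise IOError("short read: got %d of %d bytes" % (self._seen, self._expected))
        return chunk


# Usage: a 3-byte stream that was advertised as 6 bytes.
reader = LengthCheckingReader(io.BytesIO(b"abc"), expected_length=6)
try:
    while reader.read(2):
        pass
except IOError as e:
    print(e)
```

Because the check fires only at end of stream, callers reading in bounded chunks need to keep reading until they get an empty chunk.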

this still doesn't solve for inline retries on a badly behaving server/network tho - we'd still end up with intermittent failures in CI for these cases. either way, the "detect a fetch vs content problem" + retry aspect still feels important to me for robustness.
