pex resolver can end up with truncated reads of binary dists, causing an opaque untranslateable error #475
Comments
So, two potential sketches of how this might be implemented:
@kwlzn: Thoughts? The former would be sufficient to catch the truncation case, but the latter could potentially catch some additional corruption beyond truncation... although it can't detect all corruption.
#1 sounds 👌 to me - I think at a minimum, we should validate that. #2 is interesting, but I'd be worried about the potential cost of that added validation for every artifact fetched (esp. given some wheels like pytorch can be upwards of 600MB). and it may be overkill - since pex will already blow up on these bad artifacts w/ a clear signal of which file is broken - just later in the process as part of their use. I think that's probably sufficient, IMHO, as long as we're at least reading the correct payload from the server.
related: psf/requests#1938
also, a sane solution to just #1 would permit a retry strategy - which would be good to add too, I think.
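A minimal sketch of the retry idea floated above, assuming a hypothetical `fetch()` callable that raises a (likewise hypothetical) `TruncatedDownload` when fewer bytes arrive than `Content-Length` promised — this is illustrative, not pex's actual API:

```python
# Hypothetical sketch: retry a fetch only when the failure is a truncated
# read, which is the retryable case; other errors propagate immediately.
import time


class TruncatedDownload(Exception):
    """Raised when a download delivers fewer bytes than Content-Length."""


def fetch_with_retries(fetch, url, attempts=3, backoff=0.5):
    """Call fetch(url), retrying only on truncation."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except TruncatedDownload:
            if attempt == attempts:
                raise  # retries exhausted; surface the truncation error
            time.sleep(backoff * attempt)  # simple linear backoff
```

Keeping the retry scoped to the truncation exception matches the distinction drawn later in this thread: a short read is likely transient, while a well-formed-but-corrupt dist is not.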
Hm. Your comments about the complexities of #1 have me leaning toward #2... it's almost impossible to get wrong. Since once something is cached we never need to re-validate it, I don't think the performance overhead will be that significant (scanning an archive is very cheap... much cheaper than scanning loose files).
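For context, the archive scan described here can be done with the stdlib: `zipfile.ZipFile.testzip()` reads every member and checks its CRC, returning the first corrupt member name or `None`. A sketch (the function name is illustrative):

```python
# Hypothetical sketch of option #2: accept a fetched wheel only if it is a
# well-formed zip whose member CRCs all check out (wheels are zip archives).
import io
import zipfile


def is_valid_wheel(payload):
    """Return True if payload is an intact zip archive, False otherwise."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        # A truncated download usually loses the end-of-central-directory
        # record, so ZipFile() refuses to open it at all.
        return False
```

This catches truncation (the central directory lives at the end of a zip) and also some mid-file corruption via the CRCs, though, as noted above, it can't detect all corruption.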
to me: #1 is concerned with "am I reading the proper payload from the server". it seems entirely possible that a given server can serve up an invalid dist as its proper payload (i.e. we could not clearly indicate a connection/fetch issue based on a validation failure). this should not be a retryable event - and as we can see, pex already blows up on it, just post-fetch. it's not clear to me what problem we would be solving by just moving that validation from resolve-time to fetch-time. but if the reason the dist is malformed is because the server didn't send all of it, I think that's certainly a retryable event (and/or at least one we can give a much better error message about), and ultimately the core problem I think we want to solve for here. I'm not sure there's a better way to do that in the framework of HTTP other than to find some reliable way to check the Content-Length in …
#58 is also a thing, in case that helps simplify (then we'd only need to solve this for requests/urllib3). |
When you get an invalid dist, it's effectively impossible to know why... it could have been corrupted anywhere between being created on some machine, being uploaded to a server, and finally being downloaded from the server. Because #1 only addresses the download, it doesn't catch as much of the problem. In addition to doing #2, it would be great if there were checksums in place on the server side, but I don't know whether such a standard exists for python dists.
It prevents putting something broken in the cache, and allows you to indicate to the user which thing from which url was broken. So it solves the "opaque error" issue in the description, and prevents it from recurring when you retry.
the specific problem we saw that led to this ticket was a truncated read that was only exposed by checking Content-Length. when we repro'd with curl, we noticed it was returning an (unchecked) …, so I think there's a clear precedent for checking this header as a basic means of size validation (and ultimately, that's why …).

(fwiw, I hadn't meant to dissuade us from going this route with my comments, but was just pointing out some potential stumbling blocks.)
the idea behind "find-links" repos is that they're plain old dumb web servers that you scrape.
yeah, I could see how that could be generally useful, and it would solve the "opaque error" problem to a degree (it would tell you the source of the problem, but not the reason). perhaps the validation could be done inline as we're fetching via e.g. … this still doesn't solve for inline retries on a badly behaving server/network though - we'd still end up with intermittent failures in CI for these cases. either way, the "detect a fetch vs content problem" + retry aspect still feels important to me for robustness.
based on an issue we've seen internally when downloading large (~600MB) wheel files, it looks like pex does not currently check a downloaded payload against its expected `Content-Length` header.

this usually manifests as an untranslateable error, where the locally downloaded copy of `torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl` is truncated and much smaller than the `Content-Length` header - and thus not a valid zip archive.

we should be explicitly checking this so that we can raise more helpful errors in the case of incomplete reads.