Spurious failure extracting zip archive #171

Closed
alexcrichton opened this issue Aug 9, 2017 · 4 comments

@alexcrichton
Contributor

We've run into Invalid checksum errors a few times when working on rust-lang/rust, for example at https://ci.appveyor.com/project/rust-lang/rust/build/1.0.4224/job/ow4l9bb15wy56sht. This string apparently appears in the zip crate and comes from an invalid crc32 checksum.

How that actually managed to happen I'm not entirely sure! It could be a corrupt entry in the cache or a failed download, and if the download failed, I'm not sure why it wasn't caught sooner...
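
For reference, a minimal sketch of where that check lives when extracting with the zip crate (illustrative only, not sccache's code): reading an entry streams the bytes through zip's CRC32 check, and a mismatch surfaces as a read error whose message is "Invalid checksum".

```rust
use std::fs::File;
use std::io;

// Illustrative only: iterate over a downloaded cache archive and read each
// entry, which is roughly what io::copy does when extracting.
fn read_all_entries(path: &str) -> zip::result::ZipResult<()> {
    let file = File::open(path)?;
    let mut archive = zip::ZipArchive::new(file)?;
    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        // Reading goes through zip's Crc32Reader; if the stored CRC32 doesn't
        // match the bytes actually read, the read fails with "Invalid checksum".
        io::copy(&mut entry, &mut io::sink())?;
    }
    Ok(())
}
```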

@luser
Contributor

luser commented Aug 9, 2017

OK, so that error originates from inside zip's Crc32Reader::read, and that struct is used inside the ZipFileReader members, so presumably it's failing inside the io::copy in CacheRead::get_object.

...but yeah, I'm not really sure how we'd get an invalid zip file here unless the HTTP download failed somehow? I wonder if we could hash the zip file and store that digest as a header when storing in S3, and compare on download? Seems like something that shouldn't happen, in any event.
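
Something along these lines, as an illustrative sketch only (the SHA-256 choice, the sha2/hex crates, and the metadata key are assumptions, not sccache's actual code):

```rust
use sha2::{Digest, Sha256};

// Hypothetical: digest we would attach as user-defined object metadata
// (e.g. an x-amz-meta-* header) when uploading the cache entry to S3.
fn cache_entry_digest(zip_bytes: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(zip_bytes);
    hex::encode(hasher.finalize())
}

// Hypothetical check on download: recompute the digest of the bytes we
// actually received and compare it with the digest stored at upload time.
fn verify_download(zip_bytes: &[u8], stored_digest: &str) -> Result<(), String> {
    let actual = cache_entry_digest(zip_bytes);
    if actual == stored_digest {
        Ok(())
    } else {
        Err(format!(
            "cache entry digest mismatch: expected {}, got {}",
            stored_digest, actual
        ))
    }
}
```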

@alexcrichton
Contributor Author

Oh for some reason I thought that's what happened already but apparently not!

I think we uploaded a valid zip archive because we're not getting a 100% failure rate on MSVC right now. Presumably they're all getting the same cached value, and later builds succeed after one fails. In that sense I think this is a download failure of some form. As to what kind of download failure... unsure!

We're checking for an is_success HTTP status and validating that we read all the bytes, but we only validate the latter if there's a Content-Length header.
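
Roughly this kind of check, as a sketch (the function and its signature are illustrative, not sccache's actual code):

```rust
// Compare the number of body bytes actually read against the Content-Length
// header, if one was present. Without the header (e.g. chunked transfer) a
// truncated body is indistinguishable from a complete one at this point.
fn check_body_length(content_length: Option<u64>, bytes_read: u64) -> Result<(), String> {
    match content_length {
        Some(expected) if expected != bytes_read => Err(format!(
            "short read: expected {} bytes, got {}",
            expected, bytes_read
        )),
        _ => Ok(()),
    }
}
```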

I'm not sure if S3 could serve us invalid content?

In any case, one thing we could do is detect a failed extraction of the archive and just count it as a cache miss, maybe? That may be difficult to thread through.
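
A rough sketch of that fallback (hypothetical names, not sccache's actual cache types):

```rust
use std::io::{Cursor, Read};

// Hypothetical outcome type; sccache's real cache API looks different.
enum CacheLookup {
    Hit(Vec<u8>),
    Miss,
}

// Try to extract the first entry of a freshly downloaded cache archive.
// Any failure to open or read it (truncated download, bad CRC, ...) is
// reported as a miss, so the caller falls back to a real compile instead
// of dying with a fatal error.
fn lookup(downloaded: Vec<u8>) -> CacheLookup {
    let mut archive = match zip::ZipArchive::new(Cursor::new(downloaded)) {
        Ok(a) => a,
        Err(_) => return CacheLookup::Miss, // unreadable archive => treat as miss
    };
    let mut out = Vec::new();
    let read = archive
        .by_index(0)
        .and_then(|mut entry| entry.read_to_end(&mut out).map_err(Into::into));
    match read {
        Ok(_) => CacheLookup::Hit(out),
        Err(_) => CacheLookup::Miss, // e.g. "Invalid checksum" => treat as miss
    }
}
```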

@ezyang

ezyang commented Feb 13, 2019

We've seen this error too, on a different project: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-devtoolset7-rocmrpm-centos7.5-build/2203//console

08:12:20 [ 72%] Building CXX object caffe2/CMakeFiles/caffe2.dir/sgd/learning_rate_adaption_op.cc.o
08:12:21 sccache: encountered fatal error
08:12:21 sccache: error : Invalid checksum
08:12:21 sccache:  cause: Invalid checksum
08:12:21 make[2]: *** [caffe2/CMakeFiles/caffe2.dir/build.make:15872: caffe2/CMakeFiles/caffe2.dir/sgd/gftrl_op.cc.o] Error 254
08:12:21 make[2]: *** Waiting for unfinished jobs....
08:12:21 make[1]: *** [CMakeFiles/Makefile2:3664: caffe2/CMakeFiles/caffe2.dir/all] Error 2

It's durable (the failure persists across rebuilds), so it definitely looks like there is something corrupted inside the cache.

@sylvestre
Collaborator

I think it doesn't happen anymore

sylvestre closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 19, 2024.