Max concurrent jobs with sccache #248

Closed
yf225 opened this issue May 8, 2018 · 9 comments

yf225 commented May 8, 2018

When using yf225@336584a to compile CUDA source files with make -j4 on a 4-core machine, sccache seems to give the following error:

19:04:53 [ 14%] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THC/ATen_generated_THCStorage.cu.o
19:04:53 [ 14%] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THC/ATen_generated_THCReduceApplyUtils.cu.o
19:04:53 [ 14%] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THC/ATen_generated_THCSleep.cu.o
19:04:53 [ 14%] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THC/ATen_generated_THCBlas.cu.o
19:15:13 error: failed to execute compile
19:15:13 caused by: error reading compile response from server
19:15:13 caused by: Failed to read response header
19:15:13 caused by: failed to fill whole buffer
19:15:13 error: failed to execute compile
19:15:13 caused by: error reading compile response from server
19:15:13 caused by: Failed to read response header
19:15:13 caused by: failed to fill whole buffer
19:15:17 error: failed to execute compile
19:15:17 caused by: error reading compile response from server
19:15:17 caused by: Failed to read response header
19:15:17 caused by: failed to fill whole buffer
19:15:17 error: failed to execute compile
19:15:17 caused by: error reading compile response from server
19:15:17 caused by: Failed to read response header
19:15:17 caused by: failed to fill whole buffer

while make -j2 works fine on the same 4-core machine. Interestingly, make -j8 works fine on a 16-core machine. This issue doesn't happen when building C++ source files. Any idea what might be causing it? Thanks!

yf225 commented May 8, 2018

Update: setting the number of concurrent jobs to $(nproc)-1 fixes the issue.

ezyang commented May 8, 2018

I'm not sure I would necessarily close this issue: the server shouldn't deadlock even if you have more processes than CPUs.

luser commented Aug 28, 2018

Can you reliably reproduce this? If so, getting a log out of the sccache server would help diagnose what's happening here. I agree that it should not fail this way. You can get logs by first ensuring that no sccache server is running (sccache --stop-server), then starting a server with logging enabled:
RUST_LOG=sccache=trace SCCACHE_ERROR_LOG=/tmp/sccache.log sccache --start-server

Marwes commented Oct 12, 2018

Found at least one cause for this error that is pretty easy to reproduce.

sccache does not do a graceful shutdown when the idle timeout is reached, so any compile requests in flight get aborted, and this is the error that gets printed. I'd say the fix for that is to keep track of whether there are in-flight requests and only start the idle timeout when the server is actually idle?

It is possible that there are other problems that cause the same error, but that is at least one candidate.
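
To illustrate the approach suggested above, here is a minimal, std-only Rust sketch. It is not sccache's actual code; the IdleTracker type, the timings, and the overall structure are made up for illustration. The point is that the idle clock only advances while no compile requests are in flight, so an idle shutdown can never cut off an active compile.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Tracks how many compile requests are currently being served.
struct IdleTracker {
    in_flight: AtomicUsize,
}

impl IdleTracker {
    fn start_request(&self) {
        self.in_flight.fetch_add(1, Ordering::SeqCst);
    }
    fn finish_request(&self) {
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
    fn is_idle(&self) -> bool {
        self.in_flight.load(Ordering::SeqCst) == 0
    }
}

fn main() {
    let tracker = Arc::new(IdleTracker { in_flight: AtomicUsize::new(0) });
    let idle_timeout = Duration::from_secs(5);

    // Simulated compile request that outlives the idle timeout: it holds the
    // in-flight count up for its whole duration, so the watchdog below will
    // not shut the "server" down underneath it.
    let worker = {
        let t = Arc::clone(&tracker);
        thread::spawn(move || {
            t.start_request();
            thread::sleep(Duration::from_secs(8)); // pretend this is a compile
            t.finish_request();
        })
    };

    // Idle watchdog: the idle clock resets whenever a request is in flight,
    // so shutdown only happens after a full idle_timeout with zero requests.
    let mut idle_since = Instant::now();
    loop {
        thread::sleep(Duration::from_millis(100));
        if !tracker.is_idle() {
            idle_since = Instant::now();
        } else if idle_since.elapsed() >= idle_timeout {
            println!("idle for {:?} with no requests in flight; shutting down", idle_timeout);
            break;
        }
    }

    worker.join().unwrap();
}
```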

luser commented Oct 12, 2018

Ah yes, that's #204

Marwes commented Oct 12, 2018

I have observed a deadlock as well that gives this same error. I haven't been able to reproduce it with logs enabled yet, though.

Marwes commented Oct 15, 2018

I believe the deadlock may be alexcrichton/tokio-process#42, so hopefully updating to tokio-process 0.2.5 should fix it. I included the update in #304 since a cargo update would conflict with that PR, but it could be fixed independently.

luser commented Dec 10, 2018

This should be fixed since we merged #304. If it reoccurs, please let me know!

luser closed this as completed Dec 10, 2018

ezyang commented Dec 11, 2018

cc @yf225 @bddppq

gchanan added a commit to pytorch/pytorch that referenced this issue Apr 3, 2020
ROCm changes seem to cause sccache failures, see e.g. #35734
and mozilla/sccache#248.
gchanan added a commit to pytorch/pytorch that referenced this issue Apr 3, 2020
ROCm changes seem to cause sccache failures, see e.g. #35734
and mozilla/sccache#248.

Differential Revision: [D20849176](https://our.internmc.facebook.com/intern/diff/D20849176)
gchanan added a commit to pytorch/pytorch that referenced this issue Apr 3, 2020
ROCm changes seem to cause sccache failures, see e.g. #35734
and mozilla/sccache#248.

Differential Revision: [D20848317](https://our.internmc.facebook.com/intern/diff/D20848317)
gchanan added a commit to gchanan/pytorch that referenced this issue Apr 3, 2020
ROCm changes seem to cause sccache failures, see e.g. pytorch#35734
and mozilla/sccache#248.

ghstack-source-id: faf0d04f95a9fbf7bd3a690fd4e660ff4ab66b88
Pull Request resolved: pytorch#35979
gchanan added a commit to pytorch/pytorch that referenced this issue Apr 3, 2020
ROCm changes seem to cause sccache failures, see e.g. #35734
and mozilla/sccache#248.

Differential Revision: [D20849254](https://our.internmc.facebook.com/intern/diff/D20849254)