third_party build error when using CUDA Dockerfile #331

Closed
lifefeel opened this issue Mar 21, 2022 · 2 comments

Comments

@lifefeel

When I try to build the Docker image from dockerfile/cuda11.1.1.dockerfile, I get the following error:

~/superbenchmark main !1 ?2 ❯ docker buildx build \             
  --platform linux/amd64 --cache-to type=inline,mode=max \
  --tag superbench-dev --file dockerfile/cuda11.1.1.dockerfile .
[+] Building 665.5s (16/18)
 => [internal] load build definition from cuda11.1.1.dockerfile                                                0.0s
 => => transferring dockerfile: 4.00kB                                                                         0.0s
 => [internal] load .dockerignore                                                                              0.0s
 => => transferring context: 35B                                                                               0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:20.12-py3                                              1.4s
 => [internal] load build context                                                                              0.6s
 => => transferring context: 788.47kB                                                                          0.5s
 => [ 1/14] FROM nvcr.io/nvidia/pytorch:20.12-py3@sha256:cc14c0cf580989bb1ff39fa78ca697b77a8860b17acead4a60b853bb45499f8d  0.0s
 => CACHED [ 2/14] RUN apt-get update &&     apt-get install -y --no-install-recommends     autoconf     auto  0.0s
 => [ 3/14] RUN cd /tmp &&     wget https://download.docker.com/linux/static/stable/x86_64/docker-20.10.8.tgz  9.5s
 => [ 4/14] RUN mkdir -p /root/.ssh &&     touch /root/.ssh/authorized_keys &&     mkdir -p /var/run/sshd &&   0.6s
 => [ 5/14] RUN cd /tmp &&     wget -q http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.3.0/MLNX_OFED_LIN  277.4s
 => [ 6/14] RUN cd /opt &&     wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.  62.9s
 => [ 7/14] RUN cd /tmp &&     git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins.git &&     cd n  22.1s
 => [ 8/14] RUN cd /tmp &&     git clone -b v2.10.3-1 https://github.com/NVIDIA/nccl.git &&     cd nccl &&   264.6s
 => [ 9/14] RUN cd /tmp &&     mkdir -p mlc &&     cd mlc &&     wget --user-agent="Mozilla/5.0 (X11; Fedora;  0.8s
 => [10/14] WORKDIR /opt/superbench                                                                            0.1s
 => [11/14] ADD third_party third_party                                                                        0.1s
 => ERROR [12/14] RUN make -j 40 -C third_party cuda                                                          25.8s
------
 > [12/14] RUN make -j 40 -C third_party cuda:
#0 0.415 make: Entering directory '/opt/superbench/third_party'
#0 0.415 mkdir -p /opt/superbench/bin
#0 0.418 mkdir -p /opt/superbench/lib
#0 0.445 if [ -d cuda-samples ]; then rm -rf cuda-samples; fi
#0 0.445 bash -c "source /opt/hpcx/hpcx-init.sh && hpcx_load && make CC=mpicc -C GPCNET all && hpcx_unload"
#0 0.465 git clone -b v11.1 https://github.com/NVIDIA/cuda-samples.git ./cuda-samples
#0 0.468 Cloning into './cuda-samples'...
#0 0.493 make[1]: Entering directory '/opt/superbench/third_party'
#0 0.493 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 0.495 make[1]: Leaving directory '/opt/superbench/third_party/GPCNET'
#0 0.495 make[1]: *** No rule to make target 'all'.  Stop.
#0 0.496 make: *** [Makefile:98: gpcnet] Error 2
#0 0.496 make: *** Waiting for unfinished jobs....
#0 20.08 Note: switching to 'c4e2869a2becb4b6d9ce5f64914406bf5e239662'.
#0 20.08
#0 20.08 You are in 'detached HEAD' state. You can look around, make experimental
#0 20.08 changes and commit them, and you can discard any commits you make in this
#0 20.08 state without impacting any branches by switching back to a branch.
#0 20.08
#0 20.08 If you want to create a new branch to retain commits you create, you may
#0 20.08 do so (now or later) by using -c with the switch command. Example:
#0 20.08
#0 20.08   git switch -c <new-branch-name>
#0 20.08
#0 20.08 Or undo this operation with:
#0 20.08
#0 20.08   git switch -
#0 20.08
#0 20.08 Turn off this advice by setting config variable advice.detachedHead to false
#0 20.08
#0 20.56 cd ./cuda-samples/Samples/bandwidthTest && make clean && make TARGET_ARCH=x86_64 SMS="70 75 80 86"
#0 20.59 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.59 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.61 rm -f bandwidthTest bandwidthTest.o
#0 20.62 rm -rf ../../bin/x86_64/linux/release/bandwidthTest
#0 20.62 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.62 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.62 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.65 /usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest.o -c bandwidthTest.cu
#0 25.31 /usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest bandwidthTest.o
#0 25.62 mkdir -p ../../bin/x86_64/linux/release
#0 25.62 cp bandwidthTest ../../bin/x86_64/linux/release
#0 25.63 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 25.63 cp -v ./cuda-samples/Samples/bandwidthTest/bandwidthTest /opt/superbench/bin/
#0 25.63 './cuda-samples/Samples/bandwidthTest/bandwidthTest' -> '/opt/superbench/bin/bandwidthTest'
#0 25.63 make: Leaving directory '/opt/superbench/third_party'
------
error: failed to solve: executor failed running [/bin/sh -c make -j ${NUM_MAKE_JOBS} -C third_party cuda]: exit code: 2

I cloned the latest main branch; the commit hash is a9634ef.

The problem is in step 12 of the docker build.

Please help. Thanks.

@abuccts
Member

abuccts commented Mar 21, 2022

Hi @lifefeel,

It seems you didn't clone the submodules before building the image. Could you try

 git submodule update --init --recursive -j 16

and check the output of ls third_party/*?

If every subdirectory has contents, then retry docker buildx build ....
Hope it works for you.
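
The ls check above can also be scripted. The following is only a sketch, assuming that every subdirectory of third_party is expected to be non-empty once the submodules are initialized; the function name find_empty_submodule_dirs is hypothetical and not part of superbenchmark. An empty GPCNET directory would be consistent with the "No rule to make target 'all'" message in the log.

```shell
#!/usr/bin/env bash
# Sketch only: list subdirectories of third_party that are empty, which
# usually means the corresponding submodule was never initialized.
# The function name and its default argument are hypothetical.
find_empty_submodule_dirs() {
    local root="${1:-third_party}"
    local dir
    for dir in "$root"/*/; do
        # Skip the unexpanded glob when "$root" has no subdirectories.
        [ -d "$dir" ] || continue
        # Print the directory (without the trailing slash) if it has no entries.
        if [ -z "$(ls -A "$dir")" ]; then
            echo "${dir%/}"
        fi
    done
}
```

Run find_empty_submodule_dirs third_party from the repository root. If it prints any paths, run the git submodule update command above before retrying the build.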

@lifefeel
Author

Thank you for the quick reply. It works well!

Closing the issue.
