./insmod.sh fails #41

Closed
moravveji opened this issue Jul 6, 2018 · 13 comments
@moravveji

Dear,

We have several GPU nodes (Skylake processors with four P100 cards per node), and I would like to test whether RDMA is available on these nodes.
When I try to build gdrcopy, I get the following error message:
mknod: ‘/dev/gdrdrv’: Operation not permitted
Here is the specification of the host:

$> uname -a
Linux r23g34 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

In fact, no file exists at /dev/gdrdrv on our current system. Do you have any idea what is wrong here?

Thanks
Ehsan

@drossetti
Member

Ehsan, insmod.sh requires that the user issuing the command have sudo privileges.
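
The privilege requirement can be checked before the script reaches mknod. The following is a minimal sketch, not part of gdrcopy: the function names `is_root_uid` and `check_can_create_device` are hypothetical, and the only assumption is that creating a device node needs an effective user ID of 0.

```shell
# Return success only when the given numeric user ID is 0 (root).
is_root_uid() {
    [ "$1" -eq 0 ]
}

# Hypothetical guard: report the problem up front instead of letting
# mknod fail later with "Operation not permitted".
check_can_create_device() {
    if is_root_uid "$1"; then
        echo "ok: running as root"
    else
        echo "not root (uid $1): re-run with sudo"
        return 1
    fi
}
```

A possible use is `check_can_create_device "$(id -u)" && ./insmod.sh`, which only runs the script when the current user is root.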

@moravveji
Author

moravveji commented Jul 12, 2018

I definitely have root permission. Let me copy-paste what I get when running "make":
gdrcopy$> sudo ./build.sh &> log.txt

And the tail of the log.txt reads:

`make PREFIX=/easybuild/work/gdrcopy/install CUDA=/software/CUDA/9.1.85 all install
echo "GDRAPI_ARCH=X86"
GDRAPI_ARCH=X86

cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -o gdrapi.o gdrapi.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -mavx -o memcpy_avx.o memcpy_avx.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -msse -o memcpy_sse.o memcpy_sse.c
cc -O2 -fPIC -I /software/CUDA/9.1.85/include -I gdrdrv/ -I /software/CUDA/9.1.85/include -D GDRAPI_ARCH=X86 -c -msse4.1 -o memcpy_sse41.o memcpy_sse41.c
cc -shared -Wl,-soname,libgdrapi.so.1 -o libgdrapi.so.1.2 gdrapi.o memcpy_avx.o memcpy_sse.o memcpy_sse41.o
ldconfig -n /easybuild/work/gdrcopy/gdrcopy
ln -sf libgdrapi.so.1.2 libgdrapi.so.1
ln -sf libgdrapi.so.1 libgdrapi.so
cd gdrdrv;
make
find: ‘/usr/src/nvidia-’: No such file or directory
dirname: missing operand
Try 'dirname --help' for more information.
make[1]: Entering directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=NVIDIA_DRIVER_MISSING. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-693.17.1.el7.x86_64'
find: ‘/usr/src/nvidia-’: No such file or directory
dirname: missing operand
Try 'dirname --help' for more information.
CC [M] /easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.o
/easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.c:48:20: fatal error: nv-p2p.h: No such file or directory
#include "nv-p2p.h"
^
compilation terminated.
make[3]: *** [/easybuild/work/gdrcopy/gdrcopy/gdrdrv/nv-p2p-dummy.o] Error 1
make[2]: *** [_module_/easybuild/work/gdrcopy/gdrcopy/gdrdrv] Error 2
make[2]: Leaving directory `/usr/src/kernels/3.10.0-693.17.1.el7.x86_64'
make[1]: *** [module] Error 2
make[1]: Leaving directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
make: *** [driver] Error 2
`

I am building against CUDA/9.1.85.
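
The log above suggests that the build looks for the NVIDIA driver sources with find and dirname, and falls back to NVIDIA_DRIVER_MISSING when nothing is found, which then makes the include of nv-p2p.h fail. A minimal sketch of that lookup is shown here so it can be run by hand. It is an approximation inferred from the log, not the project's actual Makefile rule; the function name `find_nvidia_src_dir` is hypothetical, and it searches a whole root directory passed as an argument.

```shell
# Print the directory that contains nv-p2p.h under the given root,
# or NVIDIA_DRIVER_MISSING when no such header is found.
find_nvidia_src_dir() {
    # Errors from a missing root are discarded; sort keeps the pick stable.
    header=$(find "$1" -name nv-p2p.h 2>/dev/null | sort | head -n 1)
    if [ -n "$header" ]; then
        dirname "$header"
    else
        echo "NVIDIA_DRIVER_MISSING"
    fi
}
```

Running `find_nvidia_src_dir /usr/src` on the build host shows whether the driver sources are installed at all. If they live somewhere else, note that variables given on the make command line override assignments inside a makefile, so passing the directory explicitly may be an option.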

@moravveji
Author

I made some progress with the previous errors, and now I get a new error:
insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Unknown symbol in module

@drossetti
Member

Hard to tell.
Are you building and installing on the same machine?
There should be a detailed error in the kernel log. You could use 'dmesg' to dump that log and copy the relevant lines here.
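
Pulling the relevant lines out of the kernel log could look like the sketch below. The function name `filter_module_errors` is hypothetical, and the match patterns are an assumption about how the kernel words these messages, so the filter is deliberately loose.

```shell
# Read kernel log text on stdin and print only the lines that mention
# the gdrdrv module or an unresolved symbol (case-insensitive).
# "|| true" keeps the exit status at 0 when nothing matches.
filter_module_errors() {
    grep -i -E 'gdrdrv|unknown symbol' || true
}
```

A typical call right after the failed insmod would be `dmesg | filter_module_errors | tail -n 20`.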

@moravveji
Author

Alright ... I'm coming back to this ticket, because I need gdrcopy for a CUDA-aware OpenMPI. I am attaching the redirected stderr/stdout from building gdrcopy, together with the very simple build script I am using.
In brief, I have two complaints now: one about NVIDIA_SRC_DIR, and the other about CONFIG_RETPOLINE during the "make" step. In fact, I am not sure how to set these so that they propagate properly to make.

Furthermore, what is expected to be inside NVIDIA_SRC_DIR?
What do you see on your platform?

gdrcopy.zip

@moravveji
Author

I would like to draw your attention to this ticket. My installation of CUDA-aware MPI is blocked on compiling gdrcopy. Could you please take a look at my error logs, and also at the questions I raised above?
Thanks a lot.
E.

@drossetti
Member

Ehsan,
thank you for trying gdrcopy.
The excerpt from your build log, copied below, is clear enough:

  1. NVIDIA_SRC_DIR is set automatically, based on the local install directory of your GPU driver
  2. CONFIG_RETPOLINE is apparently not supported by your host compiler. I am not an expert, but I don't believe you are supposed to tweak the compiler command line for a kernel module. Either your Linux kernel automatically detects and enables retpoline or not.
make[1]: Entering directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-418.40.04/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
make[2]: Entering directory `/usr/src/kernels/3.10.0-957.10.1.el7.x86_64'
arch/x86/Makefile:166: *** CONFIG_RETPOLINE=y, but not supported by the compiler. Compiler update recommended..  Stop.
make[2]: Leaving directory `/usr/src/kernels/3.10.0-957.10.1.el7.x86_64'
make[1]: *** [module] Error 2
make[1]: Leaving directory `/easybuild/work/gdrcopy/gdrcopy/gdrdrv'
make: *** [driver] Error 2

@moravveji
Author

Thanks Davide for your message; it brought some activity back to this ticket.
My problem is that, whether or not I set the two env vars NVIDIA_SRC_DIR and/or CONFIG_RETPOLINE, my build always crashes at the same location and throws the same error message. That makes me wonder whether I am doing it right.
Do you have any idea why my build crashes, and how to resolve this?

@drossetti
Member

drossetti commented May 1, 2019

That kernel module build error is discussed on the net, e.g. on RH/CentOS forums/bugzilla.
For example see https://bugzilla.redhat.com/show_bug.cgi?id=1566297#c12
I think you might have updated the kernel but not the gcc RPM.

@moravveji
Author

Thanks Davide for the hint. For some reason, when I use the GCC/6.4.0 module on our compute nodes (where the rpm -q gcc command gives gcc-4.8.5-36.el7_6.1.x86_64), the installation keeps failing! However, when I purge the GCC module and stick to the system gcc, it builds flawlessly.
I still cannot comprehend why gdrcopy builds with an older GCC but not with a newer one!

@drossetti
Member

drossetti commented May 3, 2019

BTW gdrdrv is a kernel module, which takes advantage of the Linux kernel build system, i.e. it does not have its own build system.
It looks like retpoline support is in gcc 7.3 or 8.x, but not in 6.x.
Most probably RH backported retpoline support onto their gcc 4.8.5 branch.
Closing, as this is a local customer server issue.
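
Whether a given compiler accepts the retpoline options can be probed directly, which makes the difference between the system gcc and a module-provided gcc visible. This is a minimal sketch: the function name `cc_supports_flags` is hypothetical, and the flag list is the one the upstream x86 kernel Makefile probes for gcc. It is an assumption that the RHEL 3.10 kernel in this thread tests the same flags.

```shell
# Return success when compiler "$1" accepts the remaining flags.
# An empty C translation unit is read from /dev/null and compiled into
# a temporary object file that is removed afterwards.
cc_supports_flags() {
    cc_bin=$1
    shift
    probe_out=$(mktemp) || return 1
    if "$cc_bin" -Werror "$@" -c -x c /dev/null -o "$probe_out" >/dev/null 2>&1; then
        probe_status=0
    else
        probe_status=1
    fi
    rm -f "$probe_out"
    return "$probe_status"
}

# Assumption: flags taken from the upstream kernel's gcc retpoline probe.
RETPOLINE_FLAGS="-mindirect-branch=thunk-extern -mindirect-branch-register"
```

For example, `cc_supports_flags gcc $RETPOLINE_FLAGS && echo supported` can be run once with the GCC module loaded and once without it, to compare the two compilers.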

@zhuanwancaishi

Dear, how do you fix the problem "insmod: ERROR: could not insert module gdrdrv/gdrdrv.ko: Unknown symbol in module"?

@pakmarkthub
Collaborator

Hi @zhuanwancaishi ,

There are multiple possibilities:

  1. Was the NVIDIA driver (nvidia.ko) loaded before you tried insmod.sh?
  2. When you compiled gdrdrv, a message should have been printed. Did it pick the correct NVIDIA driver and the Linux kernel version you are running?
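
Both checks can be sketched as small shell helpers. The function names `module_is_loaded` and `vermagic_matches` are hypothetical. The sketch assumes the usual /proc/modules layout (each line begins with the module name) and that the first field of a module's vermagic string is the kernel release it was built for. Note that earlier in this thread the host reported kernel 3.10.0-693.21.1 while the build entered /usr/src/kernels/3.10.0-693.17.1, which is the kind of mismatch the second helper is meant to catch.

```shell
# Return success when module "$1" is listed in a /proc/modules-style
# file ("$2", defaulting to /proc/modules). The trailing space in the
# pattern keeps "nvidia" from matching "nvidia_uvm".
module_is_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Return success when the first field of vermagic string "$1"
# equals the kernel release "$2".
vermagic_matches() {
    [ "${1%% *}" = "$2" ]
}
```

Possible usage: `module_is_loaded nvidia || echo "load nvidia.ko first"`, and `vermagic_matches "$(modinfo -F vermagic gdrdrv/gdrdrv.ko)" "$(uname -r)" || echo "module was built for a different kernel"`.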
