some make problems in fedora20 - incl a way to get it to work, somehow #1

Closed
standfest opened this issue Mar 17, 2014 · 12 comments

@standfest
Contributor

hi,
in order to make it, it was necessary to edit io.cpp and add "#include <cstdlib>", because the implicit dependency through iostream was removed with gcc 4.3.
furthermore i commented out the line with setDevice because of the error "undefined reference to `setDevice'", which allowed me to build it. BUT now i have problems with CUDA. if i try the gpu kernel, i get the following error:
$somoclu -x 100 -y 200 file folder -e 20 -k 1
-->
nVectors: 417 nVectorsPerRank: 417 nDimensions: 0
Epoch: 0 Radius: 50
** On entry to SGEMM parameter number 8 had an illegal value
!!!! kernel execution error.
Aborted
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unload of CUDA runtime failed
Aborted (core dumped)

would you have any suggestions?
thanks a lot!
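
The message "On entry to SGEMM parameter number 8 had an illegal value" uses the reference BLAS convention of reporting the position of the first argument that fails validation. The sketch below mirrors those checks so the number can be mapped back to an argument. It assumes cuBLAS numbers the arguments the same way as reference BLAS, and it only handles uppercase transpose flags. The function names are made up for illustration, and how Somoclu actually calls SGEMM is not shown in this thread.

```cpp
// Sketch of the argument checks done by the reference BLAS SGEMM routine.
// Returns the 1-based position of the first illegal argument, or 0 if all pass.
// Positions follow the BLAS argument order:
//   1 TRANSA, 2 TRANSB, 3 M, 4 N, 5 K, 6 ALPHA, 7 A, 8 LDA,
//   9 B, 10 LDB, 11 BETA, 12 C, 13 LDC
// All names here are hypothetical helpers, not part of any library.
constexpr bool isOp(char t) { return t == 'N' || t == 'T' || t == 'C'; }
constexpr int atLeastOne(int v) { return v > 1 ? v : 1; }

constexpr int firstIllegalSgemmArg(char transa, char transb, int m, int n, int k,
                                   int lda, int ldb, int ldc) {
    return !isOp(transa) ? 1
         : !isOp(transb) ? 2
         : m < 0 ? 3
         : n < 0 ? 4
         : k < 0 ? 5
         // A has m rows if it is not transposed, k rows otherwise.
         : lda < atLeastOne(transa == 'N' ? m : k) ? 8
         // B has k rows if it is not transposed, n rows otherwise.
         : ldb < atLeastOne(transb == 'N' ? k : n) ? 10
         : ldc < atLeastOne(m) ? 13
         : 0;
}
```

Under that numbering, parameter 8 is LDA, the leading dimension of the first input matrix. A leading dimension of 0 would trigger it, which might be related to the "nDimensions: 0" line in the log above, but that connection is only a guess.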

@standfest
Contributor Author

ps: i found that In CUDA 5.0 and CUDA 5.5, the CUBLAS routine SGEMM() for operations NN and NT can give wrong results on Kepler Architecture SM35 when the following conditions are met :
4 * ldc * n >= 2^32 and m >= 256
where m, n, and ldc are respectively the number of rows, the number of columns, and the leading dimension of the resulting matrix C. [http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/, http://nvlabs.github.io/moderngpu/performance.html] (btw i run an old fx4800) i do not know, maybe this input is helping.
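
The quoted condition can be checked for a given problem size with a few lines of code. This is a minimal sketch of the inequality exactly as quoted above; the function name is made up for illustration, and 64-bit arithmetic is used so that the product does not overflow before the comparison.

```cpp
#include <cstdint>

// Returns true when the result matrix C (m rows, n columns, leading
// dimension ldc) meets the condition quoted from the release notes:
//   4 * ldc * n >= 2^32  and  m >= 256
// The function name is hypothetical; only the inequality comes from the notes.
constexpr bool meetsQuotedSgemmCondition(std::uint64_t m, std::uint64_t n,
                                         std::uint64_t ldc) {
    return 4 * ldc * n >= (std::uint64_t{1} << 32) && m >= 256;
}
```

For the sizes in this thread (417 vectors and a 100 x 200 map, that is 20000 nodes), any combination of 417 and 20000 for ldc and n gives at most 4 * 20000 * 20000 = 1.6e9, which is below 2^32 (about 4.29e9), so the quoted condition would not be met regardless of which dimension is which.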

@peterwittek
Owner

The cstdlib dependency was added, thank you for spotting the problem.

Both CUDA 5.0 and 5.5 work fine on Fermi architecture. Unfortunately I
do not have access to Kepler hardware, so I am unable to do any kind of
testing. The configure.in file has the following lines:

if test `nvcc --version|grep release|awk '{print $5}'|cut -d. -f1` -ge 5; then
GENCODE_SM30="-gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35"
fi

If you remove these lines, only Compute Capability 2.0 code will be
generated by NVCC, which might just work on Kepler. It is certainly not
an optimal solution, but let me know if it works.

Thanks again.

@standfest
Contributor Author

thanks for your response. digging further into the problem (and changing my hardware back to a tesla c2070) i still struggle with this linking problem while compiling:

make -C src all
make[1]: Entering directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'
/usr/local/cuda//bin/nvcc -DHAVE_CONFIG_H -I/usr/local/cuda//include -use_fast_math -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -Xcompiler "-O3 -fPIC -fopenmp" -I/usr/lib64/openmpi//include -I. -I.. -o denseGpuKernels.cu.co -c ./denseGpuKernels.cu
/usr/lib64/openmpi/bin//mpic++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp -L/usr/local/cuda//lib64 -L/usr/lib64/openmpi//lib -o somoclu sparseCpuKernels.o io.o denseCpuKernels.o mapDistanceFunctions.o training.o denseGpuKernels.cu.co somoclu.o  -lcudart -lcublas -lmpi
somoclu.o: In function `main':
somoclu.cpp:(.text.startup+0x56c): undefined reference to `setDevice'
collect2: error: ld returned 1 exit status
make[1]: *** [somoclu] Error 1
make[1]: Leaving directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'
make: *** [all] Error 2

do you have any ideas what to do? maybe my configure output helps:

./configure --with-mpi-compilers=/usr/lib64/openmpi/bin/ --with-mpi=/usr/lib64/openmpi/ --with-cuda=/usr/local/cuda/
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for gcc option to support OpenMP... -fopenmp
checking for nvcc... yes
checking MPI C++ compiler in /usr/lib64/openmpi/bin/... /usr/lib64/openmpi/bin//mpic++
checking MPI directory... /usr/lib64/openmpi/
checking how to run the C++ preprocessor... g++ -E
checking whether special compile flag for MPICH is required... no
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating config.h
config.status: config.h is unchanged
-------------------------------------------------

 Somoclu Version 1.2

 Prefix: /usr/local.
 Compiler: /usr/lib64/openmpi/bin//mpic++ -O3 -fPIC -fopenmp -I/usr/lib64/openmpi//include -L/usr/lib64/openmpi//lib -lmpi

 Package features:
   OpenMP enabled: yes
   MPI enabled: yes
   CUDA enabled: yes

 Now type 'make [<target>]'
   where the optional <target> is:
     all                - build all binaries
     install            - install everything

--------------------------------------------------

@peterwittek
Owner

There was a logical flaw with the preprocessor statements in the
setDevice function when CUDA was enabled but MPI was not. It was
corrected. This function is necessary even in a single-GPU configuration,
as it contains a cudaSetDevice call, without which the GPU context may
or may not be initialized. Could you try the update?
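
The kind of guard structure described here can be sketched as follows. This is not the actual Somoclu source: the macro names CUDA and HAVE_MPI, the helper deviceForRank, and the rank-to-device mapping are all assumptions made for illustration, and a printf stands in for the real cudaSetDevice call so the sketch compiles without the CUDA toolkit. The point is only that the function definition depends on the CUDA guard alone, while the MPI guard is nested inside it.

```cpp
#include <cstdio>

// Assumed macro names; the real project may use different ones.
#define CUDA 1
// HAVE_MPI is deliberately left undefined to model the "CUDA without MPI" build.

// Hypothetical helper: map a process rank to a device index, or -1 if
// there are no devices.
constexpr int deviceForRank(int rank, int nDevices) {
    return nDevices > 0 ? rank % nDevices : -1;
}

#ifdef CUDA
// Defined whenever CUDA is enabled, regardless of MPI, so the linker always
// finds the symbol that main() refers to.
inline int setDevice(int rank, int nDevices) {
#ifndef HAVE_MPI
    rank = 0;  // single process: there is only one rank
#endif
    const int device = deviceForRank(rank, nDevices);
    // A real build would call cudaSetDevice(device) here; printing stands in for it.
    std::printf("selecting device %d\n", device);
    return device;
}
#endif
```

If the whole definition were instead wrapped in a guard that also required MPI, a CUDA-only build would compile the call in main() but not the function, which would produce an "undefined reference to `setDevice'" link error like the one reported above.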

@standfest
Contributor Author

Thanks, now it is compiling without complaining - but with a persistent linking flaw:

somoclu -x 400 -y 300 file folder -e 20 -k 1
somoclu: error while loading shared libraries: libcudart.so.5.5: cannot open shared object file: No such file or directory

if i set

export LD_LIBRARY_PATH=/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin

i cannot find libmpi.so.1 and vice versa. Maybe including the path in the ‘-rpath’ linker option could help - sadly i am a c++ noob and so far all my approaches in modifying the makefile fail. Any hints?

@peterwittek
Owner

As for the MPI dependency, do not worry about it, unless you have more
than one GPU or more than one node. Just disable MPI with the configure
script.

Not finding the CUDA libraries is more troubling. You said earlier that
you tried both CUDA 5.0 and 5.5. Just to double check a trivial error,
is 5.5 the version sitting in /usr/local/cuda?

If the error persists, please post again the parameters for the configure
script.

Thanks and apologies for the delay.

@standfest
Contributor Author

originally i had a symbolic link called CUDA pointing to CUDA-5.5, but now i deleted it and renamed CUDA-5.5 to CUDA. additionally i set the --without-mpi flag and was able to compile and run without the linking flaw - as long as i set

export LD_LIBRARY_PATH=/usr/local/cuda/lib64
export PATH=$PATH:/usr/local/cuda/bin

unfortunately i get this when testing it with -k 1 (0 and 2 are working, but there is a long waiting period after the final training iteration - i cannot imagine saving the data is taking so long, or is it?)

somoclu -x 400 -y 300 file folder -e 20 -k 1
nVectors: 417 nVectorsPerRank: 417 nDimensions: 0 
Epoch: 0 Radius: 200
 ** On entry to SGEMM  parameter number 8 had an illegal value
!!!! kernel execution error.
Aborted
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  unload of CUDA runtime failed
Aborted (core dumped)

so back to square one. at least here my log:

$ ./configure --without-mpi
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for gcc option to support OpenMP... -fopenmp
checking for nvcc... yes
./configure: line 3263: nvcc: command not found
./configure: line 3263: test: -ge: unary operator expected
configure: creating ./config.status
config.status: creating Makefile
config.status: creating src/Makefile
config.status: creating config.h
config.status: config.h is unchanged
-------------------------------------------------

 Somoclu Version 1.2

 Prefix: /usr/local.
 Compiler: g++ -O3 -fPIC -fopenmp   

 Package features:
   OpenMP enabled: yes
   MPI enabled: no
   CUDA enabled: yes

 Now type 'make [<target>]'
   where the optional <target> is:
     all                - build all binaries
     install            - install everything

--------------------------------------------------
$ make
make -C src all
make[1]: Entering directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o sparseCpuKernels.o -c ./sparseCpuKernels.cpp
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o io.o -c ./io.cpp
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o denseCpuKernels.o -c ./denseCpuKernels.cpp
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o mapDistanceFunctions.o -c ./mapDistanceFunctions.cpp
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o training.o -c ./training.cpp
/usr/local/cuda/bin/nvcc -DHAVE_CONFIG_H -I/usr/local/cuda/include -use_fast_math -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20  -Xcompiler "-O3 -fPIC -fopenmp"  -I. -I.. -o denseGpuKernels.cu.co -c ./denseGpuKernels.cu
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp  -I. -I.. -o somoclu.o -c ./somoclu.cpp
g++ -DHAVE_CONFIG_H -O3 -fPIC -fopenmp -L/usr/local/cuda/lib64  -o somoclu sparseCpuKernels.o io.o denseCpuKernels.o mapDistanceFunctions.o training.o denseGpuKernels.cu.co somoclu.o  -lcudart -lcublas 
make[1]: Leaving directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'
$ sudo make install
make -C src install
make[1]: Entering directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'
/usr/bin/install -c -d /usr/local/bin
/usr/bin/install -c -m 0755 somoclu \
 /usr/local/bin
make[1]: Leaving directory `/home/standfem/Downloads/peterwittek-somoclu-f9336f2/src'

thank you for thinking about it!

@peterwittek
Owner

I want to find out what goes wrong here. I will install a Fedora on my
laptop over the weekend to reproduce your error. Playing with a live
Fedora 20 distribution today, I noticed that it is an incredible pain to
get the proprietary driver and CUDA working. Since deviceQuery works for
you, I assume that CUDA is otherwise operational.

@peterwittek
Owner

I cannot reproduce the problem. I started with a plain vanilla Fedora 20 install. Then I followed these instructions to get the proprietary driver working:

http://www.if-not-true-then-false.com/2014/fedora-20-nvidia-guide/

yum update kernel* selinux-policy*
reboot
yum localinstall --nogpgcheck http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm http://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
yum install akmod-nvidia xorg-x11-drv-nvidia-libs kernel-devel acpid

Genuinely weird stuff was going on with the initramfs, but eventually os-detect on Arch Linux figured out the correct boot configuration, blacklisted nouveau, and I had the Nvidia driver working:

lsmod|grep nvidia
nvidia              10686781  44 
drm                   283937  4 nvidia
i2c_core               38476  4 drm,i2c_i801,nvidia,videodev

Then I followed the instructions here:

http://fedoraproject.org/wiki/Cuda

I installed the prerequisites, also adding git, automake, and perl-Env:

yum install wget make gcc-c++ freeglut-devel libXi-devel libXmu-devel mesa-libGLU-devel git perl-Env automake

Then I switched over to these instructions for CUDA 5.5:

http://hobiger.org/blog/2013/12/19/fedora-20-and-cuda/

issuing the command

sh cuda_5.5.22_linux_64.run -override

I accepted the EULA, said yes to attempting the install on an unsupported configuration, did not install the drivers, said yes to installing the toolkit with /opt/cuda as the path, and also installed the CUDA samples to the default location ($HOME/NVIDIA_CUDA-5.5_Samples).

After compiling deviceQuery, it complained that the driver did not support this CUDA version. I downloaded the latest driver and installed it:

systemctl stop gdm
sh NVIDIA-Linux-x86_64-331.49.run
reboot

After this, deviceQuery reported my GPU, an old 330M with Compute Capability 1.2.

I cloned and compiled the git version of Somoclu:

git clone https://github.com/peterwittek/somoclu
cd somoclu
./autogen.sh
./configure --without-mpi
make -s
export  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda/lib64
src/somoclu -k 1 data/rgbs.txt data/gpu_test

A memory deallocation glitch crept in yesterday; I fixed it. Otherwise, it runs without problems, so I do not know what could be the issue on your machine.

@standfest
Contributor Author

thanks for trying. i will look into my machine the day after tomorrow, maybe i'm going to reset it. i will update you on any findings.

@standfest
Contributor Author

i finally found the time to redo the whole installation, and now it worked quite well. the only thing not found right away was libcudart.so.6 (apparently others have this problem too http://stackoverflow.com/questions/10808958/why-cant-libcudart-so-4-be-found-when-compiling-the-cuda-samples-under-ubuntu ) but the following line helped:

sudo ldconfig /usr/local/cuda/lib64

thank you again for all your help and of course for your library,
cheers
matthias

@peterwittek
Owner

I am glad it finally works.

Peter
