(yet another) nvidia crash on ubuntu #2590

Open
carnicer opened this issue Apr 20, 2020 · 8 comments

Comments

@carnicer

I have a (now almost idle) pretty cool workstation available that runs Ubuntu.

The machine has 24 cores and an NVIDIA 340, if I am not mistaken (I know nothing about GPUs/CUDA/OpenCL etc.).

I have just downloaded and built leela, but the leelaz executable crashes. I have googled a bit and it seems to be an NVIDIA driver version issue. I checked issues #1360 and #1363 in this repo, and as I am writing this GitHub tells me there are 2 more. But none of them involves my NVIDIA model (340).

I managed to update the NVIDIA drivers (which was not that easy, since the nvidia package was configured to be on hold, and I also had to reboot and update other nvidia-related packages).

The latest version available in my Ubuntu PPA repositories for NVIDIA 340 is 340.107. According to the NVIDIA releases page, it was released on June 6, 2018, which is after the referenced issues claim a bug was fixed.

But this does not fix the problem.

I see that there is a newer NVIDIA driver release (340.108) from December 2019, but it is not yet available in my APT repo (which I updated following the instructions in this project's README).

Do you think it will fix the issue? Has anyone been using nvidia 340 and which driver are you using?

My leelaz output follows. Is there any way to capture logs or debug?

carnicer@pepinu:~/github/leela-zero/build$ ./leelaz
Using OpenCL batch size of 5
Using 10 thread(s).
RNG seed: 5836405442933166735
Leela Zero 0.17  Copyright (C) 2017-2019  Gian-Carlo Pascutto and contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see the COPYING file for details.

BLAS Core: built-in Eigen 3.3.7 library.
Detecting residual layers...v1...256 channels...40 blocks.
Initializing OpenCL (autodetecting precision).
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.1 CUDA 6.5.51
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Quadro K2200
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 340.107
Device speed:  1124 MHz
Device cores:  5 CU
Device score:  1111
Selected platform: NVIDIA CUDA
Selected device: Quadro K2200
with OpenCL 1.1 capability.
Half precision compute support: No.
Tensor Core support: No.
Detected 1 OpenCL platforms.
Platform version: OpenCL 1.1 CUDA 6.5.51
Platform profile: FULL_PROFILE
Platform name:    NVIDIA CUDA
Platform vendor:  NVIDIA Corporation
Device ID:     0
Device name:   Quadro K2200
Device type:   GPU
Device vendor: NVIDIA Corporation
Device driver: 340.107
Device speed:  1124 MHz
Device cores:  5 CU
Device score:  1111
Selected platform: NVIDIA CUDA
Selected device: Quadro K2200
with OpenCL 1.1 capability.
Half precision compute support: No.
Tensor Core support: No.
Loaded existing SGEMM tuning.
Wavefront/Warp size: 32
Max workgroup size: 1024
Max workgroup dimensions: 1024 1024 64
terminate called after throwing an instance of 'cl::Error'
  what():  clCreateBuffer
Aborted (core dumped)
carnicer@pepinu:~/github/leela-zero/build$
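
On the question of capturing logs: one generic option, independent of anything leelaz itself provides, is to run the binary from a small wrapper that writes stdout and stderr to a single file. The following is only a minimal sketch; `run_and_log` is a hypothetical helper name, and the command and log path are whatever you pass in.

```python
import subprocess


def run_and_log(command, log_path):
    """Run `command`, write its combined stdout/stderr to `log_path`,
    and return the process exit code."""
    with open(log_path, "w", encoding="utf-8") as log_file:
        completed = subprocess.run(
            command,
            stdout=log_file,           # normal output goes to the log
            stderr=subprocess.STDOUT,  # error output is merged into the same file
            check=False,               # a crash should not raise, just return a code
        )
    return completed.returncode
```

For example, `run_and_log(["./leelaz"], "leelaz.log")` (both names are placeholders) would leave the startup text and the `cl::Error` message in one file that can be attached to the issue. On POSIX systems a negative return code means the process was terminated by a signal, which is what an abort like the one above should produce.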

@carnicer
Author

I have tried to install the latest version (340.108), although I am not sure that it would fix the problem.

I have downloaded it from the NVIDIA 64-bit download page. The installation appears to work, but clinfo still reports the apt version (340.107), and ./leelaz still crashes at startup.

I have tried many workarounds, such as uninstalling the apt-get packages (this almost breaks my installation).

I have tried running the nvidia installer with several options, including:

sudo ./NVIDIA-Linux-x86_64-340.108.run --add-this-kernel --dkms -s
sudo ./NVIDIA-Linux-x86_64-340.108.run --dkms -s

I have seen that this version has a precompiled Ubuntu package, but it belongs to an Ubuntu release newer than mine, and I am not allowed to install it even if I edit the apt sources. My edits are correct: the new version is found.

None of this has worked. Needless to say, I have rebooted after every attempt. I am always getting this:

carnicer@pepinu:~/nvidia$ clinfo | grep Driver
  Driver Version                                  340.107
carnicer@pepinu:~/nvidia$ 

Any ideas?
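
A second data point, besides clinfo, is the version of the kernel module that is actually loaded. The sketch below assumes the NVIDIA kernel module exposes a text file at `/proc/driver/nvidia/version` with a line starting with `NVRM version:` that contains the version number; both the path and the line format are assumptions to verify on the machine, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed location of the NVIDIA kernel module's version file on Linux.
DEFAULT_VERSION_FILE = "/proc/driver/nvidia/version"

# A dotted number such as 340.107 that is not part of a longer word.
_VERSION_TOKEN = re.compile(r"(?<![\w.])\d+\.\d+(?:\.\d+)?(?![\w.])")


def parse_module_version(text):
    """Return the first dotted version number on the 'NVRM version:' line,
    or None if no such line or number is found."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            match = _VERSION_TOKEN.search(line)
            return match.group(0) if match else None
    return None


def loaded_module_version(path=DEFAULT_VERSION_FILE):
    """Read the version file if it exists; return None when it is missing."""
    version_file = Path(path)
    if not version_file.is_file():
        return None
    text = version_file.read_text(encoding="utf-8", errors="replace")
    return parse_module_version(text)
```

If `loaded_module_version()` still returns 340.107 after running the 340.108 installer and rebooting, that would suggest the new kernel module was never loaded, which narrows the problem to the installation rather than to clinfo.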

@CGLemon

CGLemon commented Apr 22, 2020

I guess it is because of your OpenCL version. Updating your driver to one that supports OpenCL 1.2 may fix it.

@carnicer
Author

Thanks CGLemon.

I had already tried that, and this is what I get: some kind of package version 2.2.8, which may be something related to Ubuntu.

At the end of the clinfo output I noticed a message suggesting that I could upgrade my OpenCL to version 1.2 and even 2.1. How do I do that?

The project README says OpenCL 1.1 should be enough, and it has no reference to OpenCL 1.2 or to how to install it.

carnicer@pepinu:~$ sudo -E apt-get install opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev ocl-icd-libopencl1
Reading package lists... Done
Building dependency tree
Reading state information... Done
ocl-icd-libopencl1 is already the newest version (2.2.8-1).
ocl-icd-opencl-dev is already the newest version (2.2.8-1).
opencl-headers is already the newest version (2.0~svn32091-2).
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.
carnicer@pepinu:~$
carnicer@pepinu:~$ dpkg --get-selections | grep -i opencl
nvidia-opencl-icd-340                           install
ocl-icd-libopencl1:amd64                        install
ocl-icd-opencl-dev:amd64                        install
opencl-headers                                  install
unity-scope-openclipart                         install
carnicer@pepinu:~$ dpkg --get-selections | grep -i ocl
geoclue                                         install
geoclue-ubuntu-geoip                            install
libgeoclue0:amd64                               install
ocl-icd-libopencl1:amd64                        install
ocl-icd-opencl-dev:amd64                        install
carnicer@pepinu:~$
carnicer@pepinu:~$ clinfo | grep OpenCL
  Platform Version                                OpenCL 1.1 CUDA 6.5.51
  Device Version                                  OpenCL 1.1 CUDA
  Device OpenCL C Version                         OpenCL C 1.1
    Run OpenCL kernels                            Yes
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Profile                              OpenCL 1.2
        NOTE:   your OpenCL library declares to support OpenCL 1.2,
                but it seems to support up to OpenCL 2.1 too.
carnicer@pepinu:~$
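
The clinfo output above mixes two different things: the version implemented by the GPU driver (Platform/Device Version, here provided by nvidia-opencl-icd-340) and the version understood by the ICD loader (ocl-icd-libopencl1), which only dispatches calls to whatever driver is installed. The note about 1.2 and 2.1 refers to the loader, so it does not change what the device supports. A minimal sketch to pull the two apart from saved clinfo text follows; `opencl_versions` is a hypothetical helper, and the field names are taken from the output above.

```python
import re

# Field names as they appear in the clinfo output shown above.
FIELDS = ("Platform Version", "Device Version", "ICD loader Profile")


def opencl_versions(clinfo_text):
    """Return {field name: 'major.minor'} for the version lines in clinfo output.
    Only the first occurrence of each field is kept."""
    found = {}
    for line in clinfo_text.splitlines():
        stripped = line.strip()
        for field in FIELDS:
            if stripped.startswith(field) and field not in found:
                match = re.search(r"OpenCL (\d+\.\d+)", stripped)
                if match:
                    found[field] = match.group(1)
    return found
```

Applied to the output above, this gives 1.1 for both Platform Version and Device Version and 1.2 for ICD loader Profile. Installing or upgrading the loader packages can only change the last value; the first two come from the driver.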

@carnicer
Author

I found this official Intel GitHub page with instructions, and I followed the Ubuntu section:

carnicer@pepinu:~$ sudo -E add-apt-repository ppa:intel-opencl/intel-opencl
carnicer@pepinu:~$ sudo -E apt update
carnicer@pepinu:~$ sudo -E apt install intel-opencl-icd

This has installed some new packages.

But clinfo still shows OpenCL 1.1 and leelaz still crashes.

@CGLemon

CGLemon commented Apr 23, 2020

Although OpenCL 1.1 is the minimum version for leelaz, the 'core dumped' may be caused by the wrong version. OpenCL 1.2 is fine.
Update your NVIDIA driver to update OpenCL. The latest driver includes OpenCL 1.2.
The 'apt-get install' command just gets the header files and binary files. It can't update the OpenCL version.

@carnicer
Author

Thanks CGLemon.

It looks like it is not possible to use leela with this version of Ubuntu. I should upgrade the distro, but I am not allowed to do that, and I am already afraid that I have broken too many things (the OpenCL package was locked).

Too bad.

If I have the time I will try to debug the code. Would that be useful? Does anybody know why it is crashing with a supposedly supported OpenCL/NVIDIA version combination?

@GosseRomkes

Did you try to run Leela with a (very) small network as delivered by the stable 25.1 version?

@carnicer
Author

carnicer commented May 27, 2020 via email
