OpenCL KataGo is crashing during self-tuning (Intel Integrated Graphics sometimes buggy) #78

Open
featurecat opened this issue Oct 8, 2019 · 7 comments


@featurecat
Contributor

In the Lizzie repository, a user is having trouble starting KataGo. The crash happens during self-tuning on the official OpenCL release of KataGo. Here is the issue, which includes screenshots of the command line while it is running: featurecat/lizzie#633

@lightvector
Owner

I'll leave this open just for visibility, but just to post an update here for anyone seeing this thread - Intel Integrated Graphics has caused some issues in the past, not just for KataGo but for some other projects too. At least one older version's OpenCL implementation is buggy/incomplete.

In other cases, it might well be something that I'm missing. For example, OpenCL exposes various queryable limits in its API; perhaps KataGo is not respecting one of those limits, and one might expect those limits to be lower for integrated graphics than for discrete graphics cards. But without the ability to reproduce it locally myself, or a user who is technically very experienced and capable of doing some code diving and serious debugging, I don't see a good way to make progress here.

So for now - if you're trying to run KataGo using OpenCL on Intel Integrated Graphics - there is some chance it won't work, although for some users I think it actually has worked too. If you are encountering such an error in exactly this case, and you are experienced at debugging and willing to try compiling KataGo yourself and to edit the code or test things out, let me know.
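The queryable limits mentioned above include values that an OpenCL host program can read with `clGetDeviceInfo`, such as `CL_DEVICE_MAX_WORK_GROUP_SIZE`, `CL_DEVICE_MAX_WORK_ITEM_SIZES` and `CL_DEVICE_LOCAL_MEM_SIZE`. As a rough sketch of what "respecting a limit" could mean, the Python below checks a proposed kernel launch against such values. The `DeviceLimits` class, the `check_launch` function and the example numbers are all hypothetical and for illustration only; this is not KataGo's code, and it does not claim to identify the cause of the crash.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass(frozen=True)
class DeviceLimits:
    """Hypothetical holder for values a host program would query from the device."""
    max_work_group_size: int              # CL_DEVICE_MAX_WORK_GROUP_SIZE
    max_work_item_sizes: Tuple[int, ...]  # CL_DEVICE_MAX_WORK_ITEM_SIZES
    local_mem_size: int                   # CL_DEVICE_LOCAL_MEM_SIZE, in bytes


def check_launch(limits: DeviceLimits,
                 local_size: Tuple[int, ...],
                 local_mem_bytes: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means no limit is exceeded."""
    problems: List[str] = []

    # The launch may not use more dimensions than the device reports.
    if len(local_size) > len(limits.max_work_item_sizes):
        problems.append("more work-group dimensions than the device supports")

    # Each dimension of the work-group has its own cap.
    for dim, (size, cap) in enumerate(zip(local_size, limits.max_work_item_sizes)):
        if size > cap:
            problems.append(f"dimension {dim}: {size} exceeds per-dimension cap {cap}")

    # The total number of work-items in one work-group is capped separately.
    total = prod(local_size)
    if total > limits.max_work_group_size:
        problems.append(f"work-group has {total} items, cap is {limits.max_work_group_size}")

    # Local (shared) memory requested by the kernel must fit on the device.
    if local_mem_bytes > limits.local_mem_size:
        problems.append(f"needs {local_mem_bytes} bytes of local memory, "
                        f"device has {limits.local_mem_size}")

    return problems


# Made-up numbers for illustration, not measured on any real device.
EXAMPLE_LIMITS = DeviceLimits(
    max_work_group_size=256,
    max_work_item_sizes=(256, 256, 256),
    local_mem_size=32768,
)

if __name__ == "__main__":
    for issue in check_launch(EXAMPLE_LIMITS, (32, 32), 4096):
        print(issue)
```

A tuner that tries many kernel configurations could skip any candidate for which a check like this returns a non-empty list, instead of submitting it to the driver.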

@Drachengo

I recently ran into a similar problem with integrated graphics on my install of KataGo. I believe I have found a workaround; it seems that it will heavily impact performance, but it will at least work. Every time I attempted to run it, it would crash while trying to tune. This seems to come from some bug or incompatibility with the graphics driver; whether it is between some of your code and the card/driver or on Intel's side, I don't know. But if you run genconfig and make sure to select only the CPU device itself (probably device 1, as it was in my case, if you are on CPU/integrated), the tuning phase will not crash and will record its results appropriately. I can continue to look into this, but I am not good at lower-level code or C++ and know very little about drivers; it is all far outside of what I normally do. But I can give it a try if you'd like.

tl;dr Integrated graphics seem to cause a large issue, as you know. Running genconfig and setting the devices to the CPU only (not the integrated graphics) prevents the crashing during tuning, although it will probably have an effect on performance. I can continue to look into this issue if you'd like, but it isn't what I'm good at, so it may take a while and I could quite honestly come up empty-handed.

Thank you for your great work!

@lightvector changed the title from "OpenCL KataGo is crashing during self-tuning, on a windows10 machine" to "OpenCL KataGo is crashing during self-tuning (Intel Integrated Graphics sometimes buggy)" on Apr 17, 2020
@lightvector
Owner

Changed title just to be clearer for people browsing issues.

@SyxP

SyxP commented May 3, 2020

I am using an AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) GPU and it seems to be stuck in tuning: it produces no further output after the last line displayed below. In fact, I cannot even kill it via Ctrl+C while it is "testing configurations".

I am using the OpenCL build, but it is built from source with the parameters cmake . -DBUILD_MCTS=1 -DUSE_BACKEND=OPENCL, and I am on Arch Linux. Do let me know if you need more information.

2020-05-03 14:30:28+0800: Loading model and initializing benchmark...

Running quick initial benchmark at 16 threads!
2020-05-03 14:30:28+0800: nnRandSeed0 = 8513347526626713128
2020-05-03 14:30:28+0800: After dedups: nnModelFile0 = /usr/share/katago/networks/weights-b30.bin.gz useFP16 auto useNHWC auto
2020-05-03 14:30:29+0800: Found OpenCL Platform 0: Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)
2020-05-03 14:30:29+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)
2020-05-03 14:30:29+0800: Using OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) OpenCL 1.1 Mesa 20.0.6
2020-05-03 14:30:29+0800: No existing tuning parameters found or parseable or valid at: /home/syx/.katago/opencltuning/tune6_gpuAMDRAVENDRM3360568arch11LLVM1000_x19_y19_c320_mv8.txt
2020-05-03 14:30:29+0800: Performing autotuning
2020-05-03 14:30:29+0800: Found OpenCL Platform 0: Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)
2020-05-03 14:30:29+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)
2020-05-03 14:30:29+0800: Using OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) OpenCL 1.1 Mesa 20.0.6
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
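The log above already says which OpenCL platform and device were picked up, which is often the first thing worth checking in a report like this. A small sketch for pulling that out of a saved log, assuming the `Found OpenCL Platform N: ...` and `Found OpenCL Device N: ... (score S)` line shapes seen in this output; the function name and the `uses_mesa_clover` flag are made up for illustration, and the flag is just a substring check on the platform name.

```python
import re
from typing import Any, Dict

# Patterns modeled on the log lines in this thread; other versions may print differently.
PLATFORM_RE = re.compile(r"Found OpenCL Platform (\d+): (.+)$")
DEVICE_RE = re.compile(r"Found OpenCL Device (\d+): (.+) \(score (\d+)\)$")


def summarize_opencl_log(log_text: str) -> Dict[str, Any]:
    """Collect platform and device names by index from startup log text."""
    platforms: Dict[int, str] = {}
    devices: Dict[int, str] = {}

    for raw_line in log_text.splitlines():
        line = raw_line.strip()

        match = PLATFORM_RE.search(line)
        if match:
            platforms[int(match.group(1))] = match.group(2)
            continue

        match = DEVICE_RE.search(line)
        if match:
            devices[int(match.group(1))] = match.group(2)

    return {
        "platforms": platforms,
        "devices": devices,
        # Heuristic only: the platform in this report identifies itself as "Clover (Mesa)".
        "uses_mesa_clover": any("Clover" in name for name in platforms.values()),
    }


if __name__ == "__main__":
    sample = (
        "2020-05-03 14:30:29+0800: Found OpenCL Platform 0: "
        "Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)\n"
        "2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN "
        "(DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)\n"
    )
    print(summarize_opencl_log(sample))
```

Repeated lines (the log prints the platform and device twice, once before and once during autotuning) simply overwrite the same index, so the summary stays the same.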

@lightvector
Owner

lightvector commented May 3, 2020

@SyxP - I think this is not related to Intel Integrated Graphics, so it might have been better in a separate issue, though I guess this is also fine. Anyway, the error you encountered is a different known issue, and in your case it is possibly fixable.

I just now pushed a section in the main readme about things like this. Take a look at the entry on OpenCL Mesa there.

Hope that helps! :)

@SyxP

SyxP commented May 3, 2020

Thanks, this was indeed the issue, and it now works!

@ivorget

ivorget commented Dec 16, 2020

I had (and still have) this Intel GPU tuning crash with the old v1.2, but the changes made in the OpenCL code since then seem to have fixed it: both tuning and games are working fine on v1.7.0.
Running on an Atom x5-z8350 with Intel HD 400 and the latest driver.


5 participants