[WIP] Defer data preparation to native code. Use sparse input tensors. #1
Conversation
…lescing as it's faster without.
Segfault on running:
Likely backtrace:
Okay, the stream is destroyed while the async lambda is still running.
Also, on Linux/GCC the build seems to be unoptimized by default, and the compiler complains about the cdecl attribute:
Maybe add:
$ git diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index b973f06..7e0dbdb 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,6 +2,13 @@ cmake_minimum_required(VERSION 3.0)
 project(data_loader)
 
+if(NOT CMAKE_BUILD_TYPE)
+  set(CMAKE_BUILD_TYPE Release)
+endif()
+
+set(CMAKE_CXX_FLAGS_DEBUG "-g")
+set(CMAKE_CXX_FLAGS_RELEASE "-O3")
+
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED 17)
This yields line numbers for the segfault, but the trace is different:
…m. Make the future batch a part of the stream.
I think the crash is fixed now; it could have produced many different backtraces. As for the CMake changes, I'll change Release to RelWithDebInfo, but I'll refrain from adding GCC/Clang-specific flags, as this is handled by a properly set up CMAKE_BUILD_TYPE.
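For reference, a minimal sketch of the fix as a Python analogue of the C++ change (the class and method names here are hypothetical, not the repo's actual API): the stream owns the future for the batch being prepared, so tearing the stream down waits for the in-flight work instead of racing with it.

import concurrent.futures

class Stream:
    def __init__(self):
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self._future = None  # the in-flight batch is now part of the stream

    def _prepare_batch(self):
        # stand-in for the native batch preparation work
        return [0] * 8192

    def _request_next(self):
        # schedule preparation of the next batch asynchronously
        self._future = self._executor.submit(self._prepare_batch)

    def next_batch(self):
        if self._future is None:
            self._request_next()
        batch = self._future.result()  # wait for the async work to finish
        self._request_next()           # immediately start preparing the next one
        return batch

    def close(self):
        # waiting here is what prevents the "destroyed while the async
        # lambda is still running" crash: no task can outlive the stream
        self._executor.shutdown(wait=True)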
At first sight, no segfaults with training. Speed is low, though. Adding …
Workers won't work; it has to be 0. This is because of (1) the sparse tensors, and (2) we would need to handle it explicitly in our code. 36 it/s is 600k positions/s, nice.
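As a sanity check on the numbers: at an assumed batch size of 16384, 36 it/s works out to roughly 590k positions/s, consistent with the 600k figure. A hedged sketch of the wiring, assuming a hypothetical wrapper around the native stream (not the repo's actual API):

import torch

class NativeBatchDataset(torch.utils.data.IterableDataset):
    def __init__(self, stream):
        self.stream = stream  # the ctypes-backed native stream object

    def __iter__(self):
        # batches arrive fully formed from native code
        while True:
            yield self.stream.next_batch()

# The stream already yields whole batches, so batch_size=None disables
# the DataLoader's own batching. num_workers must stay 0: worker
# processes cannot share the native stream handle, and the sparse
# tensors would need explicit handling to cross the worker boundary.
# loader = torch.utils.data.DataLoader(NativeBatchDataset(stream),
#                                      batch_size=None, num_workers=0)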
I've added the flags explicitly. It's weird that they are not added automatically. They are ignored by MSVC, so no guarding is needed.
Do epochs still finish? It is at iteration 67000 and still at epoch 0.
Epochs won't finish. Normally with PyTorch Lightning an epoch ends when the dataloader finishes, but since there's only one data loader I made it cycle (infinitely) through the data. This is until the proper training data setup is done. IMO epochs shouldn't be tied to file size anyway, because then it's hard to control their length.
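One way to get epoch boundaries back without tying them to file size, sketched under the same hypothetical-wrapper assumption as above: cap the number of batches per epoch, so Lightning sees the dataloader end even though the underlying stream cycles forever.

import torch

class EpochSizedDataset(torch.utils.data.IterableDataset):
    def __init__(self, stream, batches_per_epoch):
        self.stream = stream
        self.batches_per_epoch = batches_per_epoch  # chosen freely, not by file size

    def __iter__(self):
        # the native stream never raises StopIteration; this iterator does,
        # after a fixed number of batches, which ends the Lightning epoch
        for _ in range(self.batches_per_epoch):
            yield self.stream.next_batch()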
If you clean up, maybe using something like:
diff --git a/external_nnue_data.py b/external_nnue_data.py
index 1102500..20c50df 100644
--- a/external_nnue_data.py
+++ b/external_nnue_data.py
@@ -2,7 +2,10 @@ import numpy as np
 import os
 import ctypes
 
-dll = ctypes.CDLL('c:/dev/nnue-pytorch/data_loader.dll')
+try:
+    dll = ctypes.CDLL('c:/dev/nnue-pytorch/data_loader.dll')
+except:
+    dll = ctypes.cdll.LoadLibrary('./libdata_loader.so')
would be enough to make it run on both Linux and Windows out of the box?
@vondele I glob for either .dll or .so; I think that should be enough, and it doesn't rely on exceptions from ctypes. Overall I cleaned this up. I removed the old data loading, as it's not used anymore. halfkp.py now only contains constants. Tensor storage is decoupled from the feature set; the feature set is identified by a string name and can be chosen dynamically when creating a data stream. Renamed a lot of things. The test (demo) is cleaned up. @glinscott this should be ready for a final review and merge.
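A hedged sketch of that glob-based lookup (the search path and library base name are assumptions, not the repo's exact code): it finds whichever of data_loader.dll or libdata_loader.so was built, with no reliance on ctypes exceptions.

import ctypes
import glob
import os

def load_data_loader_lib(search_dir='.'):
    # matches data_loader.dll on Windows and libdata_loader.so on Linux
    candidates = (glob.glob(os.path.join(search_dir, '*data_loader.dll'))
                  + glob.glob(os.path.join(search_dir, '*data_loader.so')))
    if not candidates:
        raise FileNotFoundError('no data_loader shared library found in ' + search_dir)
    return ctypes.CDLL(os.path.abspath(candidates[0]))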
@Sopel97 doesn't compile for me:
Okay, it seems like GCC is unable to resolve some dependent names. I fixed it.
Yes, it compiles now, but it doesn't run:
The lib is named: …
Okay, fixed now. |
OK, runs... |
Can you try experimenting a bit with the batch size and see what is best for you? For me 8192 seems optimal; maybe twice that is slightly faster, and with more I run out of memory.
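For anyone repeating this, a quick way to measure throughput per batch size; make_loader here is a hypothetical factory that builds the data loader for a given batch size.

import time

def positions_per_second(make_loader, batch_size, n_batches=100):
    loader = iter(make_loader(batch_size))
    next(loader)  # warm-up batch, excludes one-time startup cost
    start = time.perf_counter()
    for _ in range(n_batches):
        next(loader)
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed

# e.g. compare a few candidate sizes:
# for size in (4096, 8192, 16384, 32768):
#     print(size, positions_per_second(make_loader, size))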
This will of course depend a bit on the GPU. I wonder what the bottleneck will be on a real GPU server (say 4xV100 or similar). |
So it looks like 4×8192 works best on your system. Thanks for testing this.
That's also at about 87% GPU utilization, so it's probably limited by the GPU (an RTX 2070 Super) in this case.
@@ -1,132 +0,0 @@
import chess
Can we leave this one around for environments where we can't compile the C++ version?
Alternatively, we could ship binaries...
You know what, I'll bring back the old data loader tomorrow and make it sparse, just so it's there, because you have a point. I think it's a matter for another PR though.
Sure, that's fine! Let's get this rolling, it's a gigantic improvement :).
This is some incredible work, thank you so much! Just a few small comments.
Co-authored-by: Gary Linscott <glinscott@gmail.com>
I'll leave this here for now, because it needs a lot of cleanup before merging. It would be nice if it were tested on different platforms. To get it to run on Linux, one needs to change the name of the lib loaded by ctypes from *.dll to *.so.
This introduces a module written in C++ that loads batched training data from a file (currently it cycles through the file, because StopIteration is broken). It's compiled with CMake into a shared library and used from Python through ctypes. A thin Python layer on top mimics the native structures and converts the raw arrays to tensors.
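A hedged sketch of that thin layer (the structure name and fields are illustrative, not the actual ABI): the native side fills flat index/value arrays for the active features, and Python wraps them as a sparse COO tensor.

import ctypes
import numpy as np
import torch

class SparseBatch(ctypes.Structure):
    _fields_ = [
        ('size', ctypes.c_int),                      # number of positions
        ('num_active', ctypes.c_int),                # total active features
        ('indices', ctypes.POINTER(ctypes.c_int)),   # 2 x num_active: (position, feature)
        ('values', ctypes.POINTER(ctypes.c_float)),  # one value per active feature
    ]

    def to_sparse_tensor(self, num_features):
        idx = np.ctypeslib.as_array(self.indices, shape=(2, self.num_active))
        val = np.ctypeslib.as_array(self.values, shape=(self.num_active,))
        # copy so the tensor owns its memory after the native batch is freed
        return torch.sparse_coo_tensor(
            torch.from_numpy(idx.copy()).long(),
            torch.from_numpy(val.copy()),
            (self.size, num_features))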
With this I'm reaching 80-85k positions/s on my GPU, compared to the previous (master) 1.6k/s.