
Problem with training #52

Closed
marcocalignano opened this issue Nov 14, 2017 · 6 comments

@marcocalignano
Member

marcocalignano commented Nov 14, 2017

I am trying to train the net:

marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.0
Found 0 chunks
marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.1
Found 0 chunks
marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.2
Found 0 chunks

What am I doing wrong?

@roy7
Collaborator

roy7 commented Nov 14, 2017

Try

gzip train.0
parse.py train

I think parse.py will check for the chunk numbers and expect the .gz extension on its own.
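
A minimal sketch of how that chunk lookup presumably works (the function name and glob pattern here are assumptions for illustration, not the actual parse.py code):

import glob

def get_chunks(prefix):
    # "build/train" would match build/train.0.gz, build/train.1.gz, ...
    return glob.glob(prefix + '.*.gz')

chunks = get_chunks('build/train')
print('Found {} chunks'.format(len(chunks)))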

@marcocalignano
Member Author

Next problem I have:

name: GeForce GTX 760 major: 3 minor: 0 memoryClockRate(GHz): 1.0845
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.92GiB

and I get this:

2017-11-14 21:29:51.688804: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 227.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

2017-11-14 21:29:51.692699: W tensorflow/stream_executor/cuda/cuda_dnn.cc:2223]

2017-11-14 21:29:51.693036: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 227.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

@ssj-gz

ssj-gz commented Nov 14, 2017

I had to halve the BATCH_SIZE in training/tf/parse.py (set it to 128) to avoid (I think?) that error on my GTX 950. I may have had to lower per_process_gpu_memory_fraction a little, too.
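
For reference, a sketch of those two knobs using the TensorFlow 1.x API that parse.py targets; the exact values and where they live in parse.py may differ, so treat this as illustrative only:

import tensorflow as tf

BATCH_SIZE = 128  # halved from 256 to fit a ~2 GiB card

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.Session(config=config)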

@marcocalignano
Member Author

Thanks, it works now! Even if 224.638 pos/s is, I guess, not that good.

@gcp
Member

gcp commented Nov 15, 2017

I may have had to lower per_process_gpu_memory_fraction a little, too.

You can probably use larger batches if you set it higher, not lower. TensorFlow defaults to using 100% of GPU RAM but this is annoying if you want to run leelaz at the same time, so I changed the default to 75%.

If you lower the batch size, you should lower the learning rate in MomentumOptimizer a bit, by the square root of the factor you lowered the batch size by.
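
As a concrete example of that rule of thumb (the numbers here are illustrative, not the values in parse.py):

import math

old_batch, new_batch = 256.0, 128.0
old_lr = 0.05
new_lr = old_lr * math.sqrt(new_batch / old_batch)  # ~0.035

# e.g. tf.train.MomentumOptimizer(learning_rate=new_lr, momentum=0.9)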

@roy7
Collaborator

roy7 commented Nov 25, 2017

@marcocalignano Is this issue ready to close? :)
