
Problem with training #52

Closed
marcocalignano opened this issue Nov 14, 2017 · 6 comments

@marcocalignano
Member

marcocalignano commented Nov 14, 2017

I am trying to train the net:

marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.0
Found 0 chunks
marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.1
Found 0 chunks
marcuzzo@marcuzzo: ~/workspace/leela-zero[next]$ training/tf/parse.py build/train.2
Found 0 chunks

What am I doing wrong?

@roy7
Collaborator

roy7 commented Nov 14, 2017

Try

gzip train.0
parse.py train

I think parse.py will check for the chunk numbers and expect the .gz extension on its own.
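
A minimal sketch of how that chunk lookup presumably works (the function name and glob pattern here are assumptions for illustration, not the actual parse.py code):

import glob

def get_chunks(prefix):
    # "build/train" would match build/train.0.gz, build/train.1.gz, ...
    return glob.glob(prefix + '.*.gz')

chunks = get_chunks('build/train')
print('Found {} chunks'.format(len(chunks)))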

@marcocalignano
Member Author

Next problem I have:

name: GeForce GTX 760 major: 3 minor: 0 memoryClockRate(GHz): 1.0845
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.92GiB

and I get this:

2017-11-14 21:29:51.688804: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 227.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

2017-11-14 21:29:51.692699: W tensorflow/stream_executor/cuda/cuda_dnn.cc:2223]

2017-11-14 21:29:51.693036: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 227.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

@ssj-gz

ssj-gz commented Nov 14, 2017

I had to halve the BATCH_SIZE in training/tf/parse.py (set it to 128) to avoid (I think?) that error on my GTX 950. I may have had to lower per_process_gpu_memory_fraction a little, too.
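
For reference, a sketch of those two knobs using the TensorFlow 1.x API that parse.py targets; the exact values and where they live in parse.py may differ, so treat this as illustrative only:

import tensorflow as tf

BATCH_SIZE = 128  # halved from 256 to fit a ~2 GiB card

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.75)
config = tf.ConfigProto(gpu_options=gpu_options)
session = tf.Session(config=config)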

@marcocalignano
Member Author

Thanks, it works now! Even if 224.638 pos/s is, I guess, not that good.

@gcp
Member

gcp commented Nov 15, 2017

I may have had to lower per_process_gpu_memory_fraction a little, too.

You can probably use larger batches if you set it higher, not lower. TensorFlow defaults to using 100% of GPU RAM but this is annoying if you want to run leelaz at the same time, so I changed the default to 75%.

If you lower the batch size, you should lower the learning rate in MomentumOptimizer a bit, by the square root of the factor you lowered the batch size by.
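
As a concrete example of that rule of thumb (the numbers here are illustrative, not the values in parse.py):

import math

old_batch, new_batch = 256.0, 128.0
old_lr = 0.05
new_lr = old_lr * math.sqrt(new_batch / old_batch)  # ~0.035

# e.g. tf.train.MomentumOptimizer(learning_rate=new_lr, momentum=0.9)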

@roy7
Collaborator

roy7 commented Nov 25, 2017

@marcocalignano Is this issue ready to close? :)
