resource exhausted error #10

Open
takuseno opened this issue Feb 20, 2018 · 12 comments

Comments

@takuseno
Member

takuseno commented Feb 20, 2018

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,24487,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: deepq/DND/lookup/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](deepq/DND/lookup/Tile/input, deepq/DND/lookup/Tile/multiples)]]
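As a quick sanity check on the error above, the size of the tensor can be worked out from its shape. This is only a sketch: it assumes "type float" means 4-byte float32 (the node is listed with T=DT_FLOAT), and the variable names are illustrative.

```python
# Size of the tensor named in the OOM message: shape [32, 24487, 512], type float.
shape = (32, 24487, 512)
bytes_per_float32 = 4  # assumption: DT_FLOAT is a 4-byte float

num_elements = 1
for dim in shape:
    num_elements *= dim

size_bytes = num_elements * bytes_per_float32
size_gib = size_bytes / 2**30  # GiB, the unit used by the allocator log

print(f"{size_bytes} bytes = {size_gib:.2f} GiB")
```

So the single Tile output is about 1.49 GiB, which is the same figure the allocator reports failing to allocate.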

@takuseno
Member Author

2018-02-21 01:03:53.047199: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.49GiB. Current allocation summary follows.
2018-02-21 01:03:53.047267: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (256): Total Chunks: 64, Chunks in use: 63. 16.0KiB allocated for chunks. 15.8KiB in use in bin. 3.5KiB client-requested in use in bin.
2018-02-21 01:03:53.047305: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (512): Total Chunks: 1, Chunks in use: 1. 512B allocated for chunks. 512B in use in bin. 384B client-requested in use in bin.
2018-02-21 01:03:53.047327: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2018-02-21 01:03:53.047347: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2048): Total Chunks: 5, Chunks in use: 5. 10.0KiB allocated for chunks. 10.0KiB in use in bin. 10.0KiB client-requested in use in bin.
2018-02-21 01:03:53.047366: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047385: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047403: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16384): Total Chunks: 1, Chunks in use: 0. 22.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047425: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (32768): Total Chunks: 5, Chunks in use: 5. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin.
2018-02-21 01:03:53.047447: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (65536): Total Chunks: 3, Chunks in use: 2. 288.0KiB allocated for chunks. 192.0KiB in use in bin. 191.7KiB client-requested in use in bin.
2018-02-21 01:03:53.047471: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (131072): Total Chunks: 11, Chunks in use: 11. 1.47MiB allocated for chunks. 1.47MiB in use in bin. 1.42MiB client-requested in use in bin.
2018-02-21 01:03:53.047492: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (262144): Total Chunks: 7, Chunks in use: 7. 2.67MiB allocated for chunks. 2.67MiB in use in bin. 2.67MiB client-requested in use in bin.
2018-02-21 01:03:53.047511: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (524288): Total Chunks: 2, Chunks in use: 2. 1.89MiB allocated for chunks. 1.89MiB in use in bin. 1.89MiB client-requested in use in bin.
2018-02-21 01:03:53.047535: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (1048576): Total Chunks: 1, Chunks in use: 1. 1.72MiB allocated for chunks. 1.72MiB in use in bin. 1.72MiB client-requested in use in bin.
2018-02-21 01:03:53.047556: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (2097152): Total Chunks: 2, Chunks in use: 1. 5.89MiB allocated for chunks. 3.45MiB in use in bin. 3.45MiB client-requested in use in bin.
2018-02-21 01:03:53.047576: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 5.17MiB allocated for chunks. 5.17MiB in use in bin. 3.12MiB client-requested in use in bin.
2018-02-21 01:03:53.047601: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (8388608): Total Chunks: 5, Chunks in use: 5. 75.62MiB allocated for chunks. 75.62MiB in use in bin. 75.62MiB client-requested in use in bin.
2018-02-21 01:03:53.047620: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047638: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047657: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-02-21 01:03:53.047679: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (134217728): Total Chunks: 4, Chunks in use: 4. 781.25MiB allocated for chunks. 781.25MiB in use in bin. 781.25MiB client-requested in use in bin.
2018-02-21 01:03:53.047700: I tensorflow/core/common_runtime/bfc_allocator.cc:628] Bin (268435456): Total Chunks: 6, Chunks in use: 5. 8.98GiB allocated for chunks. 7.48GiB in use in bin. 7.48GiB client-requested in use in bin.
2018-02-21 01:03:53.047721: I tensorflow/core/common_runtime/bfc_allocator.cc:644] Bin for 1.49GiB was 256.00MiB, Chunk State:
2018-02-21 01:03:53.047746: I tensorflow/core/common_runtime/bfc_allocator.cc:650] Size: 1.49GiB | Requested Size: 6.2KiB | in_use: 0, prev: Size: 1.51GiB | Requested Size: 1.51GiB | in_use: 1

@takuseno
Member Author

Try running with config.gpu_options.allow_growth = True.
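For reference, a minimal sketch of how that option is usually set, assuming the TensorFlow 1.x session API (the thread does not show how the session is created in this repository, so the surrounding lines are an assumption):

```python
import tensorflow as tf  # assumption: TensorFlow 1.x

# Ask the GPU allocator to grow its memory pool on demand instead of
# reserving nearly all GPU memory when the session starts.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)
```

Note that allow_growth only changes how memory is reserved. It does not reduce how much memory the graph needs, so a tensor that does not fit on the device will still fail to allocate.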

@smatsumori
Member

nec/dnd.py

Line 44 in 4f2e3d9

tiled_keys = tf.tile([keys], [tf.shape(h)[0], 1, 1])

The h that comes in has shape (keysize, batchsize), right? And the batch is expanded along the last axis.
In that case, tiling by h.shape[0] seems wrong to me.

@smatsumori
Member

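Thinking about the memory size:
keysize * float32 (bits) * batchsize * dndsize * capacity
512 * 32 * 4 * 4 * 5 * 10 ** 5 bits ≈ 16GB
so in principle it is the exponent (the capacity term) that dominates. Is parallelizing this cleverly the quickest fix?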
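The estimate above can be reproduced with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and it assumes the factor 32 is the width of a float32 in bits, so the product is in bits and is divided by 8 to get bytes.

```python
def dnd_memory_bytes(keysize, bits_per_value, batchsize, dndsize, capacity):
    """Rough memory estimate for tiled DND keys, in bytes (hypothetical helper)."""
    total_bits = keysize * bits_per_value * batchsize * dndsize * capacity
    return total_bits // 8  # 8 bits per byte

estimate = dnd_memory_bytes(
    keysize=512,
    bits_per_value=32,   # float32
    batchsize=4,
    dndsize=4,
    capacity=5 * 10**5,
)
print(f"{estimate / 10**9:.1f} GB")
```

Every factor enters linearly, but capacity is several orders of magnitude larger than the others, which is why it dominates the total.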

@takuseno
Member Author

takuseno commented Apr 3, 2018

Even though it is called a loop, while_loop should parallelize automatically.

@smatsumori
Member

smatsumori commented Apr 3, 2018

I see. What should the while_loop iterate over, then?
The key index, or the batch?

Also, broadcasting is apparently more memory-efficient than tiling explicitly,
so why are we writing it the latter way?
tensorflow/tensorflow#1934
I'll give this a quick try as well.
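To illustrate the difference between explicit tiling and broadcasting, here is a small NumPy sketch (NumPy rather than TensorFlow so it runs anywhere; the shapes, names, and the squared-distance computation are assumptions for illustration, not the code in nec/dnd.py):

```python
import numpy as np

batch, capacity, keysize = 4, 1000, 8  # small illustrative sizes

rng = np.random.default_rng(0)
h = rng.standard_normal((batch, keysize)).astype(np.float32)
keys = rng.standard_normal((capacity, keysize)).astype(np.float32)

# Explicit tiling: materializes `batch` full copies of keys,
# giving an array of shape (batch, capacity, keysize).
tiled_keys = np.tile(keys[None], (batch, 1, 1))
dist_tiled = np.sum((tiled_keys - h[:, None, :]) ** 2, axis=2)

# Broadcasting: keys[None] has shape (1, capacity, keysize) and is a view,
# so no tiled copy of keys is ever created.
dist_bcast = np.sum((keys[None] - h[:, None, :]) ** 2, axis=2)

print(tiled_keys.nbytes, keys.nbytes)       # tiled copy is batch times larger
print(np.allclose(dist_tiled, dist_bcast))  # same result either way
```

Broadcasting avoids the tiled copy of the keys, but the intermediate difference still has shape (batch, capacity, keysize), so the saving is limited to that copy rather than the whole computation.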

Got "Dst tensor is not initialized."
It seems this is raised when GPU memory is full.
aymericdamien/TensorFlow-Examples#38

@takuseno
Member Author

takuseno commented Apr 4, 2018

Can this actually be broadcast?

@smatsumori
Member

Perhaps. I'll try it tonight.
I wonder whether assigning the DNDs to separate GPUs with tf.device would solve it.

@takuseno
Member Author

takuseno commented Apr 5, 2018

I hope transferring data between GPUs doesn't take a long time.

@smatsumori
Member

smatsumori commented Apr 10, 2018

Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

References

https://stackoverflow.com/questions/35892412/tensorflow-dense-gradient-explanation

Now running with half tiled and half broadcast.

  • check if it works

@jlindsey15

I'm unable to run the model on any environment but CartPole due to ResourceExhausted errors. Any tips?

@smatsumori
Member

@jlindsey15 Thank you for the comment! If you have multiple GPUs, splitting the DNDs across the devices should solve the issue.
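One way to picture the split is a round-robin mapping from DND index to device name. This is a sketch only: the helper below is hypothetical and is not part of this repository, and it assumes one DND per action and device names of the form "/device:GPU:N".

```python
def assign_dnd_devices(num_dnds, num_gpus):
    """Return one device name per DND, cycling through the GPUs (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["/device:GPU:%d" % (i % num_gpus) for i in range(num_dnds)]

# Example: 4 DNDs spread over 2 GPUs.
devices = assign_dnd_devices(num_dnds=4, num_gpus=2)
for index, device in enumerate(devices):
    # In TensorFlow, each DND would be built inside `with tf.device(device):`.
    print(index, device)
```

Each GPU then holds only its share of the DND memory, at the cost of moving the lookup results between devices.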


3 participants