add batched fields for better GPU usage #11

Merged · 17 commits · May 28, 2020

Conversation

marius311 (Owner)

This adds "batched" Flat fields, which behave just like normal fields but store a "batch" of several fields at once. The motivation is to speed up GPU code, which currently uses only about 20% of a Tesla V100 because it spends much of its time waiting on the results of FFT kernels. Since we can't make the individual FFTs any faster, the solution is to do more of them in parallel. However, we only get the full speedup if the field data is actually batched together so the kernels can efficiently sweep through all of it at once, hence batched fields.

You can create batched fields in several ways:

```julia
FlatMap(rand(128,128,4))  # a batch of 4 128×128 fields
batch([f,f,f,f])          # put four non-batched fields `f` together in a single batch-4 field
batch(f,4)                # same as above, but without actually copying the data four times
```
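
For intuition on why batching the data helps, here's a minimal standalone sketch (plain FFTW on the CPU, not CMBLensing internals; on the GPU the same pattern maps onto batched CUFFT kernels) showing a single plan transforming a whole stack of maps in one call:

```julia
using FFTW

# A "batch" of 4 maps stored contiguously as a single 128×128×4 array.
maps = rand(Float32, 128, 128, 4)

# One plan over the first two dimensions transforms all 4 maps at once,
# instead of launching 4 separate transforms and waiting on each.
F  = plan_rfft(maps, 1:2)
ks = F * maps   # size (65, 128, 4): one rFFT per map in the batch
```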

Pretty much all code works transparently on batched fields just as it does on normal fields, but does everything in batches. This includes everything up to posterior gradients, sampling, etc. The only difference is that operations which return a scalar now return a BatchedReal (which can itself be added, multiplied, etc., as if it were a Real):

```julia
julia> dot(FlatMap(rand(128,128,4)), FlatMap(rand(128,128,4)))
BatchedFloat64[4114.738009871811, 4105.554880583013, 4109.014277485975, 4109.830420691169]
```
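
For illustration only, here's a hypothetical mock-up (not the package's actual BatchedReal implementation) of the idea: a thin wrapper around a vector of per-batch values, with arithmetic applied elementwise so it composes like a single Real:

```julia
# Hypothetical sketch; names and details differ from the real code.
struct MockBatchedReal{T<:Real} <: Real
    vals::Vector{T}
end

Base.:+(a::MockBatchedReal, b::MockBatchedReal) = MockBatchedReal(a.vals .+ b.vals)
Base.:*(a::MockBatchedReal, b::MockBatchedReal) = MockBatchedReal(a.vals .* b.vals)
Base.:*(a::MockBatchedReal, b::Real)            = MockBatchedReal(a.vals .* b)

x = MockBatchedReal([4114.7, 4105.6, 4109.0, 4109.8])
x + x     # elementwise: MockBatchedReal([8229.4, 8211.2, 8218.0, 8219.6])
x * 0.5   # scales every entry in the batch
```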

On a Tesla V100, some timings for a 256×256 Float32 mixed posterior gradient:

| Batchsize | Time   | Speedup per gradient |
|----------:|-------:|---------------------:|
| 1         | 120 ms | 1.0×                 |
| 2         | 192 ms | 1.3×                 |
| 4         | 196 ms | 2.4×                 |
| 8         | 217 ms | 4.4×                 |
| 16        | 345 ms | 5.6×                 |
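
To spell out how the speedup column follows from the timings: one batched call of size B produces B gradients, so the per-gradient speedup is t(1) divided by t(B)/B. For example:

```julia
# Per-gradient speedup relative to batchsize 1, from the timings above (ms).
t = Dict(1 => 120, 2 => 192, 4 => 196, 8 => 217, 16 => 345)
speedup(B) = t[1] / (t[B] / B)

speedup(8)    # ≈ 4.4, matching the table
speedup(16)   # ≈ 5.6
```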

The sweet spot for per-gradient speedup, while not hurting overall sequential runtime too much, looks to be somewhere near a batchsize of 10, where we get roughly a 5× speedup.
