
GPU memory use #1748

Closed
Ttl opened this issue Aug 19, 2018 · 4 comments


@Ttl
Member

Ttl commented Aug 19, 2018

The next branch seems to require quite a lot of GPU memory because of the bigger Winograd transformation and precision autodetection.

Winograd F(4x4, 3x3) expands each 3x3 filter to a 6x6 tile, a four-times increase in memory (36 coefficients instead of 9). At startup, precision autodetection allocates both the single- and half-precision nets at the same time, so we need about six times the memory of the untransformed weights (4x for single precision plus 2x for half). With the 40b network I measured about 1.5 GB of peak GPU memory use at startup and 960 MB in normal play with single precision.
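
Rough math for the residual-tower filter storage alone (assuming the usual 40 blocks x 256 filters; input/output layers and activation buffers come on top of this, which is why the measured numbers are higher):

```cpp
// Back-of-the-envelope memory math for the transformed filters only.
// The 40x256 shape and two 3x3 convolutions per block are assumptions
// for illustration; everything outside the residual tower is ignored.
#include <cstdio>

int main() {
    const long long blocks = 40, channels = 256;
    const long long weights = 2 * blocks * channels * channels * 3 * 3;
    const double MB = 1024.0 * 1024.0;

    const double fp32_plain    = weights * 4 / MB;                // 3x3 tiles
    const double fp32_winograd = weights * 4 * (36.0 / 9.0) / MB; // 6x6 tiles
    const double fp16_winograd = fp32_winograd / 2;

    std::printf("untransformed fp32:    %6.0f MB\n", fp32_plain);
    std::printf("Winograd fp32 (4x):    %6.0f MB\n", fp32_winograd);
    // Autodetection keeps both precisions resident: 4x + 2x = 6x.
    std::printf("fp32 + fp16 peak (6x): %6.0f MB\n",
                fp32_winograd + fp16_winograd);
}
```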

How much memory use is excessive? I think there are a few older cards that don't have this much memory. Colab is also limited to 512 MB.

I have a branch that does the filter transformation at runtime, allowing the filters to be stored untransformed in GPU memory and dropping memory usage to about one quarter. It carries about a 25% performance hit at batch size 1, so I wouldn't want to turn it on by default. Finding out how much free GPU memory there is doesn't seem to be easy with OpenCL either.
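
Core OpenCL only reports the total global memory; querying free memory needs vendor extensions, e.g. AMD's cl_amd_device_attribute_query, and NVIDIA's OpenCL has no equivalent as far as I can tell. A minimal sketch, error handling omitted:

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>  // CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, where available
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Total global memory: available everywhere, but says nothing about
    // how much of it is currently free.
    cl_ulong total = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(total), &total, nullptr);
    std::printf("total global memory: %llu MB\n",
                (unsigned long long)(total >> 20));

#ifdef CL_DEVICE_GLOBAL_FREE_MEMORY_AMD
    // AMD-only extension; reported in KB.
    cl_ulong free_kb[2] = {};
    if (clGetDeviceInfo(device, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD,
                        sizeof(free_kb), free_kb, nullptr) == CL_SUCCESS) {
        std::printf("free global memory:  %llu MB\n",
                    (unsigned long long)(free_kb[0] >> 10));
    }
#endif
}
```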

@gcp
Member

gcp commented Aug 19, 2018

> At startup, precision autodetection allocates both the single- and half-precision nets at the same time.

That should be fixable and a fast win wrt peak usage?

> 960 MB in normal play with single precision.
> How much memory use is excessive? I think there are a few older cards that don't have this much memory.

Requiring a 1 GB card to run the 40b network would be acceptable, IMHO. If I understand correctly, fp16 storage would mean a 512 MB card might just be enough to run it too?
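(Halving the 960 MB single-precision figure above gives roughly 480 MB, which would just squeeze under the 512 MB Colab limit mentioned earlier.)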

That said, I noticed my 4GB RX560 card gets a huge penalty with the 40b networks, even though in theory it should fit. Going to fp16 storage solves it. So performance might be a bit unexpected around the, eh, edges.

@ihavnoid
Member

ihavnoid commented Aug 20, 2018

> That should be fixable and a fast win wrt peak usage?

The downside is that if we create one at a time and end up picking the 'wrong' one, we have to recreate the net. I will test how long it takes to create an OpenCL net.

@ihavnoid
Member

A quick test shows that we add roughly a second initializing the 20b net and pushing the weights to the newly created net on my i7-8700 + GTX 1080. It is probably much slower on older machines, but maybe that's better than simply crashing. The time seems to be mostly spent pushing the weights from CPU to GPU, but maybe there is some chance of optimization there, too.
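
One generic OpenCL trick worth trying is staging the upload through a pinned host buffer rather than writing from ordinary pageable memory. A sketch of the idea only (the helper and its surroundings are made up, error handling omitted), not something measured here:

```cpp
#include <CL/cl.h>
#include <cstring>
#include <vector>

// Hypothetical helper: upload `weights` into the device buffer `dst` via a
// pinned staging buffer. Assumes `ctx` and `queue` are already set up.
void upload_weights(cl_context ctx, cl_command_queue queue,
                    cl_mem dst, const std::vector<float>& weights) {
    const size_t bytes = weights.size() * sizeof(float);

    // Pinned, host-visible staging buffer: transfers from it are typically
    // faster than from pageable host memory.
    cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes,
                                    nullptr, nullptr);
    void* mapped = clEnqueueMapBuffer(queue, staging, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, nullptr, nullptr, nullptr);
    std::memcpy(mapped, weights.data(), bytes);
    clEnqueueUnmapMemObject(queue, staging, mapped, 0, nullptr, nullptr);

    // Device-side copy out of the pinned region into the weight buffer.
    clEnqueueCopyBuffer(queue, staging, dst, 0, 0, bytes,
                        0, nullptr, nullptr);
    clFinish(queue);
    clReleaseMemObject(staging);
}
```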

@gcp
Member

gcp commented Oct 23, 2018

Going to close this. We did a round of optimizations. Let's file new issues for further work.

gcp closed this as completed Oct 23, 2018