
GPU memory use #1748

Closed
Ttl opened this issue Aug 19, 2018 · 4 comments


@Ttl
Member

Ttl commented Aug 19, 2018

The next branch seems to require quite a lot of GPU memory because of the bigger Winograd transformation and precision autodetection.

Winograd F(4x4, 3x3) expands each 3x3 filter to a 6x6 tile, a four-times increase in memory (36 coefficients instead of 9). At startup, precision autodetection allocates both the single- and half-precision nets at the same time, so we need about six times the memory of the untransformed weights (4x for single precision plus 2x for half). With the 40b network I measured about 1.5 GB of peak GPU memory use at startup and 960 MB in normal play with single precision.
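
Rough math for the residual-tower filter storage alone (assuming the usual 40 blocks x 256 filters; input/output layers and activation buffers come on top of this, which is why the measured numbers are higher):

```cpp
// Back-of-the-envelope memory math for the transformed filters only.
// The 40x256 shape and two 3x3 convolutions per block are assumptions
// for illustration; everything outside the residual tower is ignored.
#include <cstdio>

int main() {
    const long long blocks = 40, channels = 256;
    const long long weights = 2 * blocks * channels * channels * 3 * 3;
    const double MB = 1024.0 * 1024.0;

    const double fp32_plain    = weights * 4 / MB;                // 3x3 tiles
    const double fp32_winograd = weights * 4 * (36.0 / 9.0) / MB; // 6x6 tiles
    const double fp16_winograd = fp32_winograd / 2;

    std::printf("untransformed fp32:    %6.0f MB\n", fp32_plain);
    std::printf("Winograd fp32 (4x):    %6.0f MB\n", fp32_winograd);
    // Autodetection keeps both precisions resident: 4x + 2x = 6x.
    std::printf("fp32 + fp16 peak (6x): %6.0f MB\n",
                fp32_winograd + fp16_winograd);
}
```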

How much memory use is excessive? I think there are a few older cards that don't have this much memory. Colab is also limited to 512 MB.

I have a branch that does the filter transformation at runtime, allowing the filters to be stored untransformed in GPU memory and dropping memory usage to about one quarter. It carries about a 25% performance hit at batch size 1, so I wouldn't want to turn it on by default. Finding out how much free GPU memory there is doesn't seem to be easy with OpenCL either.
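
Core OpenCL only reports the total global memory; querying free memory needs vendor extensions, e.g. AMD's cl_amd_device_attribute_query, and NVIDIA's OpenCL has no equivalent as far as I can tell. A minimal sketch, error handling omitted:

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>  // CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, where available
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // Total global memory: available everywhere, but says nothing about
    // how much of it is currently free.
    cl_ulong total = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(total), &total, nullptr);
    std::printf("total global memory: %llu MB\n",
                (unsigned long long)(total >> 20));

#ifdef CL_DEVICE_GLOBAL_FREE_MEMORY_AMD
    // AMD-only extension; reported in KB.
    cl_ulong free_kb[2] = {};
    if (clGetDeviceInfo(device, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD,
                        sizeof(free_kb), free_kb, nullptr) == CL_SUCCESS) {
        std::printf("free global memory:  %llu MB\n",
                    (unsigned long long)(free_kb[0] >> 10));
    }
#endif
}
```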

@gcp
Member

gcp commented Aug 19, 2018

> At startup, precision autodetection allocates both the single- and half-precision nets at the same time.

That should be fixable and a fast win wrt peak usage?

> 960 MB in normal play with single precision.
> How much memory use is excessive? I think there are a few older cards that don't have this much memory.

Requiring a 1 GB card to run the 40b network would be acceptable, IMHO. If I understand correctly, fp16 storage would mean a 512 MB card might just be enough to run it too?
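(Halving the 960 MB single-precision figure above gives roughly 480 MB, which would just squeeze under the 512 MB Colab limit mentioned earlier.)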

That said, I noticed my 4GB RX560 card gets a huge penalty with the 40b networks, even though in theory it should fit. Going to fp16 storage solves it. So performance might be a bit unexpected around the, eh, edges.

@ihavnoid
Member

ihavnoid commented Aug 20, 2018

> That should be fixable and a fast win wrt peak usage?

The downside is that if we create one at a time and end up picking the 'wrong' one, we have to recreate the net. I will test how long it takes to create an OpenCL net.

@ihavnoid
Member

A quick test shows that we add roughly a second initializing the 20b net and pushing the weights to the newly created net on my i7-8700 + GTX 1080. It is probably much slower on older machines, but maybe that's better than simply crashing. The time seems to be mostly spent pushing the weights from CPU to GPU, but maybe there is some chance of optimization there, too.
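
One generic OpenCL trick worth trying is staging the upload through a pinned host buffer rather than writing from ordinary pageable memory. A sketch of the idea only (the helper and its surroundings are made up, error handling omitted), not something measured here:

```cpp
#include <CL/cl.h>
#include <cstring>
#include <vector>

// Hypothetical helper: upload `weights` into the device buffer `dst` via a
// pinned staging buffer. Assumes `ctx` and `queue` are already set up.
void upload_weights(cl_context ctx, cl_command_queue queue,
                    cl_mem dst, const std::vector<float>& weights) {
    const size_t bytes = weights.size() * sizeof(float);

    // Pinned, host-visible staging buffer: transfers from it are typically
    // faster than from pageable host memory.
    cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes,
                                    nullptr, nullptr);
    void* mapped = clEnqueueMapBuffer(queue, staging, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, nullptr, nullptr, nullptr);
    std::memcpy(mapped, weights.data(), bytes);
    clEnqueueUnmapMemObject(queue, staging, mapped, 0, nullptr, nullptr);

    // Device-side copy out of the pinned region into the weight buffer.
    clEnqueueCopyBuffer(queue, staging, dst, 0, 0, bytes,
                        0, nullptr, nullptr);
    clFinish(queue);
    clReleaseMemObject(staging);
}
```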

@gcp
Member

gcp commented Oct 23, 2018

Going to close this. We did a round of optimizations. Let's file new issues for further work.

gcp closed this as completed Oct 23, 2018