GPU memory use #1748
Comments
That should be fixable and a fast win wrt peak usage?
Requiring a 1GB card to run the 40b network would be acceptable, IMHO. If I understand correctly, fp16 storage would mean a 512M card might just be enough to run it too? That said, I noticed my 4GB RX560 card gets a huge penalty with the 40b networks, even though in theory it should fit. Going to fp16 storage solves it. So performance might be a bit unexpected around the, eh, edges.
The downside is that if we create one net at a time and end up picking the 'wrong' one, we have to recreate the net. I will test how long it takes to create an OpenCL net.
A quick test shows that we add roughly a second initializing the 20b nets and pushing the weights to the newly created net on my i7-8700 + GTX 1080. It is probably much slower on older machines, but maybe that's better than simply crashing. The time seems to be spent mostly on pushing the weights from CPU to GPU, but maybe there is some chance of optimization there, too.
Going to close this. We did a round of optimizations. Let's file new issues for further work.
The next branch seems to require quite a lot of GPU memory because of the bigger Winograd transformations and precision autodetection.
Winograd F(4x4, 3x3) expands filters from 3x3 to 6x6, a fourfold increase in memory. At startup, with precision autodetection, both the single- and half-precision copies are allocated at the same time. As a result we need about six times more memory than is required to store the weights. Using a 40b network I measured about 1.5 GB peak GPU memory use at startup and 960 MB in normal play with single precision.
How much memory use is excessive? I think there are a few older cards that don't have this much memory. Colab is also limited to 512 MB.
I have a branch that does the filter transformation at runtime, allowing the filters to be stored untransformed in GPU memory and dropping memory usage to about one quarter. It has about a 25% performance hit with a batch size of 1, so I wouldn't want to turn it on by default. Finding out how much free GPU memory there is doesn't seem to be very easy with OpenCL either.