Tried this on my laptop and on a K520, and the results were:
unroll + matmult on the CPU is a bit faster than direct CPU convolution; I suppose this is because the memory access patterns are better
unroll + clblas was faster still
the most naive convolutional OpenCL kernel, i.e. one using no unrolling or GEMM at all, was the fastest
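For reference, the "unroll + matmult" idea is the standard im2col trick: copy each filter-sized patch of the input into a row of a matrix, then do the whole convolution as one matrix multiply. Here is a minimal single-channel NumPy sketch of that idea (my own illustration, not DeepCL's actual code; it ignores batches, multiple input planes, and padding):

```python
import numpy as np

def conv_direct(image, filt):
    # Naive valid-mode 2-D convolution (really cross-correlation, as in most
    # deep-learning code): slide the filter and take a dot product per position.
    H, W = image.shape
    k = filt.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * filt)
    return out

def conv_im2col(image, filt):
    # "Unroll" every k x k patch into one row, then a single matrix multiply.
    H, W = image.shape
    k = filt.shape[0]
    oh, ow = H - k + 1, W - k + 1
    cols = np.empty((oh * ow, k * k))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i+k, j:j+k].ravel()
    return (cols @ filt.ravel()).reshape(oh, ow)

img = np.arange(36, dtype=float).reshape(6, 6)
f = np.ones((3, 3))
assert np.allclose(conv_direct(img, f), conv_im2col(img, f))
```

The unrolled matrix duplicates each input pixel up to filtersize² times, which is the memory cost the GEMM approach pays in exchange for a cache-friendly multiply.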
For batchsize=128, inputplanes=32, inputsize=128, numfilters=32, filtersize=5, on a K520, I got:
convolution + cpu: 318s
unrolled + cpu: 218s
unrolled + clblas: invalid command queue
no unrolling, propagate1: 2s
The matrices are apparently a bit too big for unroll + clblas, so I tried a smaller batch size:
batchsize=16, inputplanes=32, inputsize=128, numfilters=32, filtersize=5:
convolution + cpu: 39s
unroll + cpu: 26s
unroll + clblas GEMM: 2.2s
propagate1: 0.27s
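The "too big" explanation is easy to sanity-check with back-of-envelope arithmetic. This sketch assumes float32 storage and an unpadded ("valid") convolution, so outputsize = inputsize - filtersize + 1; those assumptions are mine and may not match DeepCL's actual layout:

```python
def unrolled_bytes(batchsize, inputplanes, inputsize, filtersize):
    # Size of the im2col ("unrolled") input matrix: one row per output
    # position per image, one column per filter weight, 4 bytes per float32.
    outputsize = inputsize - filtersize + 1
    rows = batchsize * outputsize * outputsize
    cols = inputplanes * filtersize * filtersize
    return rows * cols * 4

print(unrolled_bytes(128, 32, 128, 5) / 2**30)  # -> about 5.9 GiB
print(unrolled_bytes(16, 32, 128, 5) / 2**30)   # -> about 0.73 GiB
```

Roughly 5.9 GiB at batchsize=128 would exceed the ~4 GB available per GPU on a K520, which is consistent with the "invalid command queue" failure above, while the batchsize=16 case fits comfortably.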
Note that propagate1 is DeepCL's most generic, least optimized kernel. It doesn't use local memory (which is why it's generic and works on pretty much anything, unless it runs out of GPU global memory). Kernels that use local memory are around 3-10 times faster than propagate1.
Overall, the current conclusion is that unroll + clblas GEMM doesn't seem promising.
=> closing issue.
For background: following this article, http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/ (discussed at http://www.reddit.com/r/MachineLearning/comments/338lfs/why_gemm_is_at_the_heart_of_deep_learning/ ), I decided to try this, in case it gave an easy way to speed up DeepCL for large image sizes.
My verdict? Not useful :-(