
Unstable performance #31

Closed

dfdx opened this issue Sep 28, 2015 · 4 comments

Comments

@dfdx

dfdx commented Sep 28, 2015

Disclaimer: I used matrix multiplication from CUBLAS.jl as an example operation, since CUDArt.jl doesn't provide anything like it, so the results may be biased by that choice. Anyway, I'll be glad to see any pointers.

With a random CudaArray and an identity matrix created like this:

const A = rand(1024, 256)
const Im = eye(256, 256)
const d_A = CudaArray(A)
const d_Im = CudaArray(Im)

I do several performance tests like this:

CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)

If you are not familiar with BLAS (or just don't like cryptic names), this call multiplies d_A by the identity matrix d_Im and stores the result back in d_A. When I run the same test on the CPU, I always get very similar, consistent results. But on the GPU the benchmarks give wildly different results:

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.018428 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.238584 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   2.931953 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   3.775450 seconds (25.00 k allocations: 937.500 KB)

# after some time
 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.020394 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.812287 seconds (25.00 k allocations: 937.500 KB)

So the first call is really fast, but all subsequent calls take ~200x longer. If you wait for a while (say, 10 seconds), multiplication becomes fast again, but only for a single test, and then it slows down again.

Is this expected behavior? Am I using CudaArrays correctly at all?

@timholy
Contributor

timholy commented Sep 29, 2015

This is probably the exact same asynchronous/synchronous issue that explains #30; in other words, it's an artifact of how you're testing it.
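
To make that concrete, here is a minimal sketch of timing the loop so that it measures the actual GPU work rather than just the launch overhead (a sketch only, reusing the setup and device_synchronize() from CUDArt.jl shown elsewhere in this thread; CUBLAS.gemm! returns as soon as the kernel is queued, so an unsynchronized @time mostly measures how quickly calls can be enqueued):

using CUDArt, CUBLAS

# same setup as in the original post
A    = rand(1024, 256)
Im   = eye(256, 256)
d_A  = CudaArray(A)
d_Im = CudaArray(Im)

device_synchronize()          # make sure no earlier work is still in flight
@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    device_synchronize()      # wait for the queued kernels to actually finish
end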

@dfdx
Author

dfdx commented Sep 29, 2015

Indeed, adding device_synchronize() fixed the differences in timings. However, I see that it also made the GPU computation about 3 times slower than the CPU:

# CPU 
 julia> @time for i=1:1000 gemm!('N', 'N', 1., A, Im, 0., A) end
   1.972331 seconds

# GPU
 julia> device_synchronize(); @time (for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end; device_synchronize())
   5.733445 seconds (25.00 k allocations: 937.500 KB)

Is this expected too?

@timholy
Contributor

timholy commented Sep 30, 2015

Yes. (Not necessarily the precise numerical factor, but the general slowdown.) You're inhibiting the GPU from getting started on the next job as resources get freed from the last one.
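
As a hedged illustration of that general point (this contrast is not from the thread itself; it just reuses the arrays and device_synchronize() from the posts above): synchronizing after every call keeps the GPU idle between launches, while synchronizing once per batch only pays the wait at the end.

# Serialized: the host waits for each gemm! to finish before launching the next,
# so launch overhead can never overlap with kernels that are still running.
@time for i = 1:1000
    CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    device_synchronize()
end

# Pipelined: kernels are queued back-to-back and we only wait once at the end.
@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    device_synchronize()
end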

@dfdx
Author

dfdx commented Sep 30, 2015

That makes sense. Thank you!

@dfdx dfdx closed this as completed Sep 30, 2015