
Unstable performance #31

Closed

dfdx opened this issue Sep 28, 2015 · 4 comments

Comments

@dfdx

dfdx commented Sep 28, 2015

Disclaimer: I used matrix multiplication from CUBLAS.jl as an example operation, since CUDArt.jl doesn't provide anything like it, so the results may be biased by that choice. Anyway, I'll be glad to see any pointers.

With a random CudaArray and an identity matrix created like this:

const A = rand(1024, 256)
const Im = eye(256, 256)
const d_A = CudaArray(A)
const d_Im = CudaArray(Im)

I do several performance tests like this:

CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)

If you are not familiar with BLAS (or just don't like cryptic names), this call multiplies d_A by the identity matrix d_Im and stores the result back in d_A. When I run the same test on the CPU, I always get very similar, consistent results. But on the GPU the benchmarks give wildly different results:

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.018428 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.238584 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   2.931953 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   3.775450 seconds (25.00 k allocations: 937.500 KB)

# after some time
 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.020394 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.812287 seconds (25.00 k allocations: 937.500 KB)

So the first call is really fast, but all subsequent calls take ~200x longer. If you wait for a while (say, 10 seconds), multiplication becomes fast again, but only for a single test, and then it slows down again.

Is this expected behavior? Am I using CudaArrays correctly at all?

@timholy
Contributor

timholy commented Sep 29, 2015

This is probably the exact same asynchronous/synchronous issue that explains #30; in other words, it's an artifact of how you're testing it.
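
To make that concrete, here is a minimal sketch of timing the loop so that it measures the actual GPU work rather than just the launch overhead (a sketch only, reusing the setup and device_synchronize() from CUDArt.jl shown elsewhere in this thread; CUBLAS.gemm! returns as soon as the kernel is queued, so an unsynchronized @time mostly measures how quickly calls can be enqueued):

using CUDArt, CUBLAS

# same setup as in the original post
A    = rand(1024, 256)
Im   = eye(256, 256)
d_A  = CudaArray(A)
d_Im = CudaArray(Im)

device_synchronize()          # make sure no earlier work is still in flight
@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    device_synchronize()      # wait for the queued kernels to actually finish
end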

@dfdx
Author

dfdx commented Sep 29, 2015

Indeed, adding device_synchronize() fixed the differences in timings. However, I see that it also made the GPU computation about 3 times slower than the CPU:

# CPU 
 julia> @time for i=1:1000 gemm!('N', 'N', 1., A, Im, 0., A) end
   1.972331 seconds

# GPU
 julia> device_synchronize(); @time (for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end; device_synchronize())
   5.733445 seconds (25.00 k allocations: 937.500 KB)

Is this expected too?

@timholy
Contributor

timholy commented Sep 30, 2015

Yes. (Not necessarily the precise numerical factor, but the general slowdown.) You're inhibiting the GPU from getting started on the next job as resources get freed from the last one.
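
As a hedged illustration of that general point (this contrast is not from the thread itself; it just reuses the arrays and device_synchronize() from the posts above): synchronizing after every call keeps the GPU idle between launches, while synchronizing once per batch only pays the wait at the end.

# Serialized: the host waits for each gemm! to finish before launching the next,
# so launch overhead can never overlap with kernels that are still running.
@time for i = 1:1000
    CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    device_synchronize()
end

# Pipelined: kernels are queued back-to-back and we only wait once at the end.
@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    device_synchronize()
end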

@dfdx
Author

dfdx commented Sep 30, 2015

That makes sense. Thank you!

@dfdx dfdx closed this as completed Sep 30, 2015