-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
help diagnosing a GPU performance issue #4647
Comments
@leofang for viz |
@leofang also got me started with the CUDA code! :) |
I timed this with |
The main difference I can see is that the kernel in raw cuda is using way more texture operations. |
I've attached profiles made with |
Can you please make the kernel as simple as you can while still demonstrating a performance difference, e.g. by doing things like removing the innermost while loop, and then attach PTX for each of the two cases? |
Yes! I'll try a few variants to see if I can make simpler. IDK how to get the PTX out of cupy. Any tips? |
There's a function called |
Alright. Here is the best I could do. Removing much else made the differences go away. The timings (gpu-dev) [mrbecker@cp1-p scripts]$ python perf_testing.py 500000
cupy+CUDA events: 1935.0027262369792 ms
cupy+CUDA wall : 2632.060448328654 ms
numba events: 5319.1396484375 ms
numba wall : 7121.022303899129 ms The code import sys
import time
import cupy as cp
import numpy as np
from numba import cuda
if len(sys.argv) > 2:
kind = sys.argv[2]
else:
kind = 'both'
# parameters
blocks = 512
threads = 512
if len(sys.argv) > 1:
npoints = int(sys.argv[1])
else:
npoints = 100_000
n1 = npoints
n2 = npoints
# make the data
DEFAULT_RBINS_SQUARED = (np.logspace(
np.log10(0.1/1e3), np.log10(40/1e3), 20)**2).astype(np.float32)
rng = np.random.RandomState(seed=42)
x1, y1, z1, w1 = rng.uniform(size=(4, n1)).astype(np.float32)
x2, y2, z2, w2 = rng.uniform(size=(4, n1)).astype(np.float32)
# array init
result = np.zeros_like(DEFAULT_RBINS_SQUARED)[:-1].astype(np.float32)
if kind in ['both', 'cupy']:
source_code = """\
extern "C"{
__global__ void brute_force_pairs_kernel(
float* x1, float* y1, float* z1, float* w1,
float* x2, float* y2, float* z2, float* w2,
float* rbins_squared, float* result,
int n1, int n2, int nbins) {
size_t start = threadIdx.x + blockIdx.x * blockDim.x;
size_t stride = blockDim.x * gridDim.x;
float g = 0;
for (size_t i = start; i < n1; i += stride) {
float px = x1[i];
float py = y1[i];
float pz = z1[i];
float pw = w1[i];
for (size_t j = 0; j < n2; j++) {
float qx = x2[j];
float qy = y2[j];
float qz = z2[j];
float qw = w2[j];
float dx = px - qx;
float dy = py - qy;
float dz = pz - qz;
float wprod = pw * qw;
float dsq = dx * dx + dy * dy + dz * dz;
g += (dsq * wprod);
}
}
result[0] = g;
}
}
"""
# get the PTX
from cupy.cuda.compiler import compile_using_nvrtc
with open('cupy_ptx.txt', 'w') as fp:
fp.write(compile_using_nvrtc(source_code))
# compile and load CUDA kernel using CuPy
brute_force_pairs_kernel = cp.RawKernel(
source_code, 'brute_force_pairs_kernel')
d_x1 = cp.asarray(x1, dtype=cp.float32)
d_y1 = cp.asarray(y1, dtype=cp.float32)
d_z1 = cp.asarray(z1, dtype=cp.float32)
d_w1 = cp.asarray(w1, dtype=cp.float32)
d_x2 = cp.asarray(x2, dtype=cp.float32)
d_y2 = cp.asarray(y2, dtype=cp.float32)
d_z2 = cp.asarray(z2, dtype=cp.float32)
d_w2 = cp.asarray(w2, dtype=cp.float32)
d_rbins_squared = cp.asarray(DEFAULT_RBINS_SQUARED, dtype=cp.float32)
d_result_cp = cp.asarray(result, dtype=cp.float32)
# for GPU timing using CuPy
start = cp.cuda.Event()
end = cp.cuda.Event()
timing_cp = 0
timing_cp_wall = 0
# running the kernel using CuPy's functionality
for i in range(4):
if i > 0: # warm-up not needed if using RawModule
start.record()
_s = time.time()
brute_force_pairs_kernel(
(blocks,), (threads,),
(d_x1, d_y1, d_z1, d_w1,
d_x2, d_y2, d_z2, d_w2,
d_rbins_squared, d_result_cp,
cp.int32(d_x1.shape[0]),
cp.int32(d_x2.shape[0]),
cp.int32(d_rbins_squared.shape[0]))
)
if i > 0: # warm-up not needed if using RawModule
end.record()
end.synchronize()
_e = time.time()
timing_cp += cp.cuda.get_elapsed_time(start, end)
timing_cp_wall += (_e - _s)
print('cupy+CUDA events:', timing_cp/3, 'ms')
print('cupy+CUDA wall :', timing_cp_wall/3*1000, 'ms')
d_result_cp = d_result_cp.copy()
if kind in ['both', 'numba']:
# for GPU timing using Numba
@cuda.jit
def count_weighted_pairs_3d_cuda(
x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result):
start = cuda.grid(1)
stride = cuda.gridsize(1)
n1 = x1.shape[0]
n2 = x2.shape[0]
g = 0
for i in range(start, n1, stride):
px = x1[i]
py = y1[i]
pz = z1[i]
pw = w1[i]
for j in range(n2):
qx = x2[j]
qy = y2[j]
qz = z2[j]
qw = w2[j]
dx = px-qx
dy = py-qy
dz = pz-qz
wprod = pw*qw
dsq = dx*dx + dy*dy + dz*dz
g += (dsq * wprod)
result[0] = g
start = cuda.event()
end = cuda.event()
timing_nb = 0
timing_nb_wall = 0
d_x1 = cuda.to_device(x1.astype(np.float32))
d_y1 = cuda.to_device(y1.astype(np.float32))
d_z1 = cuda.to_device(z1.astype(np.float32))
d_w1 = cuda.to_device(w1.astype(np.float32))
d_x2 = cuda.to_device(x2.astype(np.float32))
d_y2 = cuda.to_device(y2.astype(np.float32))
d_z2 = cuda.to_device(z2.astype(np.float32))
d_w2 = cuda.to_device(w2.astype(np.float32))
d_rbins_squared = cuda.to_device(
DEFAULT_RBINS_SQUARED.astype(np.float32))
d_result_nb = cuda.device_array_like(result.astype(np.float32))
# running the Numba jit kernel
for i in range(4):
if i > 0:
start.record()
_s = time.time()
count_weighted_pairs_3d_cuda[blocks, threads](
d_x1, d_y1, d_z1, d_w1,
d_x2, d_y2, d_z2, d_w2,
d_rbins_squared, d_result_nb)
if i > 0:
end.record()
end.synchronize()
_e = time.time()
timing_nb += cuda.event_elapsed_time(start, end)
timing_nb_wall += (_e - _s)
print('numba events:', timing_nb/3, 'ms')
print('numba wall :', timing_nb_wall/3*1000, 'ms')
# print(count_weighted_pairs_3d_cuda.inspect_types())
with open('numba_ptx.txt', 'w') as fp:
fp.write(
list(count_weighted_pairs_3d_cuda.definitions.values())[0].ptx)
if kind in ['both'] and npoints <= 10:
# check that the CUDA kernel agrees with the Numba kernel
assert cp.allclose(d_result_cp, d_result_nb, rtol=5E-4) The PTX is attached: Archive.zip |
In the Numba generated PTX there is the notably suspicious: mul.lo.s64 %rd91, %rd146, %rd38;
add.s64 %rd92, %rd91, %rd36;
mul.lo.s64 %rd93, %rd146, %rd40;
add.s64 %rd94, %rd93, %rd39;
mul.lo.s64 %rd95, %rd146, %rd42;
add.s64 %rd96, %rd95, %rd41;
mul.lo.s64 %rd97, %rd146, %rd44;
add.s64 %rd98, %rd97, %rd43;
ld.f32 %f41, [%rd92];
sub.f32 %f42, %f1, %f41;
ld.f32 %f43, [%rd94];
sub.f32 %f44, %f2, %f43;
ld.f32 %f45, [%rd96];
sub.f32 %f46, %f3, %f45;
ld.f32 %f47, [%rd98]; which is basically doing multiply adds to compute the value of the induction variable @cuda.jit(max_registers=256)
def count_weighted_pairs_3d_cuda(
x1, y1, z1, w1, x2, y2, z2, w2, rbins_squared, result):
start = types.uint32(cuda.grid(1))
stride = types.uint32(cuda.gridsize(1))
n1 = types.uint32(x1.shape[types.uint32(0)])
n2 = types.uint32(x2.shape[types.uint32(0)])
g = types.float32(0)
i = types.uint32(start)
while i < n1:
px = x1[i]
py = y1[i]
pz = z1[i]
pw = w1[i]
i = types.uint32(i + stride)
j = types.uint32(0)
while j < n2:
qx = x2[j]
qy = y2[j]
qz = z2[j]
qw = w2[j]
dx = px-qx
dy = py-qy
dz = pz-qz
wprod = pw*qw
dsq = dx*dx + dy*dy + dz*dz
g += (dsq * wprod)
j = types.uint32(j + types.uint32(1))
result[0] = g gives me:
which is getting a lot closer. The offending PTX region now looks like: add.s64 %rd108, %rd35, %rd145;
add.s64 %rd109, %rd34, %rd146;
add.s64 %rd110, %rd33, %rd147;
ld.f32 %f53, [%rd141];
sub.f32 %f54, %f2, %f53;
ld.f32 %f55, [%rd108];
sub.f32 %f56, %f3, %f55;
ld.f32 %f57, [%rd109];
sub.f32 %f58, %f4, %f57;
ld.f32 %f59, [%rd110]; |
Thanks @stuartarchibald! The CUDA code is using |
The use of |
Hi @stuartarchibald! What is the path to getting this issue addressed? Is this something within numba, or is it in LLVM? |
@beckermr It's something that needs to happen in Numba for the CUDA target. The most guaranteed way to get this to would would be to write a Numba compiler transform pass to switch out ranges for while loops. I also checked, this is just impacting the |
That sounds like I might be able to handle it. Can you point me to the section of code and give me some hints? I can try my hand at a PR and learn something new! |
Sounds great, though this may get tricky so please shout if you get stuck :) The 0.46 release has an updated API for creating new compilers, compiler pipelines and compiler passes. The 2nd half of the 0.46 release developers demo notebook, you can see it on binder here, has working examples. Further, there's a lot of documentation on customising the compiler here. Given that this transform is just operating on Numba IR, it'd probably be easier and quicker to just develop it on a standard Again, if you get stuck/need a hand, the Numba core developers are also available on gitter.im real time chat. Thanks! |
Thanks, @stuartarchibald, for such a detailed instruction. Just curious,
Doesn't this imply that @beckermr could simply steal the CPU side's pass code and make it work on GPU? Could you point us to where the CPU counterpart is? |
In this case, I think Stuart means that the LLVM target for the CPU is able to optimize away any inefficiencies in the generate code, whereas NVVM is not. The optimization passes in the two compiler libraries don't do the same thing (which is not surprising), so we'll need to work around that by generating different code for the GPU. |
Right ok. I’m definitely having trouble understanding the llvm ir but I’ll keep at it. More help or hints would be awesome. |
I've started to experiment with the test code above, and at first I didn't get the same mul/add pattern in the output PTX. I was able to get that output by checking out commits from early October when this issue was made, and I narrowed it down to this commit that fixes the "bug": 599ecce Here are the first 10 lines of the loop body before the fix:
And after the fix:
Looking at the diff from that PR, it seems like using arrays with stride 0 as function arguments would result in the old behavior, and I confirmed this by creating inputs like this:
instead of this:
And then I see mul/adds in the numba PTX but not the cupy PTX, and a significant performance degradation:
So this issue is still present although it now seems a lot more niche. I'm not sure how useful it will be to fix it, but for our class project it's still a good starting point. |
Yes, definitely happy to see progress on this! |
@stuartarchibald Do you think a compiler pass would be feasible / desirable to generate code that is specialized for input array stride? I'm finding in experiments that there is significant slowdown for non-contiguous arrays, since their stride doesn't get hard-coded. We are already specializing for the two cases of stride == datatype_width (contiguous) and stride != datatype_width (non-contiguous), so it doesn't seem outlandish to further specialize the non-contiguous cases by stride size. I imagine we'd need to
Once stride is a constant in numba IR, NVVM should handle hard-coding stride into the PTX. PS - I looked into how cupy handles this to see if we could use their solution, and I noticed that cupy's output PTX hardcodes offsets equal to the size of the datatype (i.e. not aware of stride at all). Then if you pass a stride-0 array it reads the wrong (potentially out of bounds) elements. I made an issue about it, maybe the result will be interesting for numba as well. |
@noahstier I think that this is worth exploring. If the stride is encoded in the Array type then it would become a compile time constant and permissible for use in e.g. looping. The downside may be that more compilation happens as a result of the variation in |
Cool! We'll continue to investigate. Re more compilation, we will want to do some performance analysis which should include compile times. So it's on our radar. One thing we should look into is what tests the compiler can run to determine if the optimization is worthwhile. I.e. it could specialize only for arrays of a given size, only up to a predefined number of specializations, etc. |
Great, thanks. My concern isn't so much the time for an individual compilation, more that there's a huge amount of potential variants for stride, so compiling for each one could get costly opposed to the |
I need to re-run the test code above with #6112 to see if it makes any difference here - making the indices unsigned removes simplifies the computation of some indices in PTX. |
Reporting a bug
the change log (https://github.com/numba/numba/blob/master/CHANGE_LOG).
to write one see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
I am coding a simple algorithm in numba and raw CUDA C using
cupy
. I am seeing a 3x performance difference, which is a lot.The code below outputs on my local P100:
Here is the code:
The text was updated successfully, but these errors were encountered: