
Cache numba.cuda functions on repeated deserialization #4590

Open
mrocklin opened this issue Sep 19, 2019 · 2 comments
Labels: CUDA, feature_request, performance - run time

Comments

mrocklin commented Sep 19, 2019

As of #3026 , Numba kindly returns the same function when an equivalent bytestring is deserialized many times. This is great for systems like Dask, which may send around the same numba function many times.

Currently, it looks like this isn't being done for numba.cuda functions, which ends up being a bottleneck in Dask + Numba GPU workloads.
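
For reference, a minimal sketch of the CPU-side behavior #3026 added (assuming a post-#3026 Numba install; the assert shows the memoization being requested here for CUDA):

```python
import pickle

from numba import njit


@njit
def add(x, y):
    return x + y


add(1, 2)  # trigger compilation

payload = pickle.dumps(add)
# Repeated deserialization of an equivalent bytestring returns the
# same live dispatcher object instead of rebuilding/recompiling it.
assert pickle.loads(payload) is pickle.loads(payload)
```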

cc @seibert

seibert commented Sep 19, 2019

Notes to myself: looking more closely, CUDA functions are serialized differently from the CPU Dispatcher objects. The function cache needs to hang off of numba.cuda.compiler.CUDAKernel and related classes. (Grep for __reduce__ to find them all.)
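
To illustrate the gap (all names below are hypothetical stand-ins, not Numba's actual internals): a __reduce__ whose rebuild function never consults a cache hands back a distinct object on every loads() call, which is where the repeated-deserialization cost comes from:

```python
import pickle


class FakeCUDAKernel:
    """Hypothetical stand-in for a CUDA-side class; the real classes
    carry much more state, but the serialization hook is the same."""

    def __init__(self, name, argtypes):
        self.name = name
        self.argtypes = argtypes

    def __reduce__(self):
        # No memo is consulted on rebuild, so every loads() constructs
        # (and, for a real kernel, recompiles) a fresh object.
        return _rebuild, (self.name, self.argtypes)


def _rebuild(name, argtypes):
    return FakeCUDAKernel(name, argtypes)


k = FakeCUDAKernel("axpy", ("float32[:]", "float32[:]"))
payload = pickle.dumps(k)
assert pickle.loads(payload) is not pickle.loads(payload)  # distinct objects
```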

seibert commented Oct 8, 2019

@sklam: There are 5 different kinds of CUDA objects which can be serialized (all in numba.cuda.compiler):

  • DeviceFunctionTemplate
  • DeviceFunction
  • CUDAKernelBase
  • CachedCUFunction
  • CUDAKernel

Given the range here, I'm wondering if I need to figure out some kind of metaclass that provides the common behavior of:

  • generating a new UUID when the object is initialized;
  • creating a class-level _memo (a weak dict mapping UUID -> object) and _recent (a fixed-size deque that keeps recently deserialized objects alive in _memo);
  • providing a generic implementation of __reduce__ and rebuild that prefers the cached version of the function.

Given the above, I think we'll need to use this metaclass (or mixin; not sure how meta we need to get) with seven different classes: two on the CPU and five on the GPU.
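
A rough sketch of what that shared behavior could look like as a plain mixin, following the three bullets above; every name here (SerialCacheMixin, _register, _states, _from_states, the deque size) is illustrative rather than Numba's actual API:

```python
import pickle
import uuid
import weakref
from collections import deque


def _rebuild(cls, uid, states):
    """Module-level so pickle can reference it by name."""
    obj = cls._memo.get(uid)
    if obj is None:                    # miss: reconstruct from state
        obj = cls._from_states(states)
        obj._uuid = uid                # (tolerates the extra uuid _register made)
        cls._memo[uid] = obj
    cls._recent.append(obj)            # strong ref keeps the hit alive
    return obj


class SerialCacheMixin:
    """Each concrete subclass gets its own memo: a weak dict keyed by
    UUID, plus a fixed-size deque of strong references that keeps
    recently deserialized objects alive in the weak dict."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._memo = weakref.WeakValueDictionary()  # uuid -> live object
        cls._recent = deque(maxlen=32)             # pins recent memo entries

    def _register(self):
        # Generate a fresh UUID at construction time and enter the memo.
        self._uuid = str(uuid.uuid4())
        type(self)._memo[self._uuid] = self

    def __reduce__(self):
        return _rebuild, (type(self), self._uuid, self._states())


class Kernel(SerialCacheMixin):
    def __init__(self, name):
        self.name = name
        self._register()

    def _states(self):
        return {"name": self.name}

    @classmethod
    def _from_states(cls, states):
        return cls(states["name"])


k = Kernel("axpy")
payload = pickle.dumps(k)
assert pickle.loads(payload) is pickle.loads(payload)  # memo hit
```

In this sketch, __init_subclass__ gives each subclass its own _memo/_recent, so a plain mixin covers the per-class bookkeeping without metaclass machinery.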

@stuartarchibald added the feature_request and performance - run time labels and removed the needtriage label · Sep 11, 2020