
Add GPU tests for DLPack support #59

Closed · wants to merge 9 commits

Conversation

leofang (Member) commented Jul 1, 2021

DO NOT MERGE. To be continued tomorrow...

With Open MPI 4.1.1, I am seeing MPI_ERR_TRUNCATE in testAllgatherv3, so I skip those tests for now:

$ mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -e "test_cco_nb_vec" -e "test_cco_vec"

Tomorrow I will continue investigating the remaining errors, which are only in test_msgspec.py.

@@ -151,7 +152,8 @@ cdef inline Py_ssize_t dlpack_get_itemsize(const DLTensor *dltensor) nogil:
#------------------------------------------------------------------------------

cdef int Py_CheckDLPackBuffer(object obj):
    try: return <bint>hasattr(obj, '__dlpack__')
    # we check __dlpack_device__ to avoid potential side effects
dalcinl (Member):

Side effects? No way! The __dlpack__ attribute is a method! What side effect could possibly come from looking up a method WITHOUT calling it? I think someone elsewhere suggested this approach. I don't like it and don't see the point. Are CPU-only libraries forced to expose this __dlpack_device__ method?

leofang (Member Author):

> Side effects? No way! The __dlpack__ attribute is a method!

You are right. This is what happens when you have a long day babysitting and try to work at midnight... 😞 I forgot they are callables, not Python properties.

> Are CPU-only libraries forced to expose this __dlpack_device__ method?

Yes. __dlpack__ and __dlpack_device__ must come as a pair. Libraries that implement from_dlpack() are expected to check __dlpack_device__ first and act accordingly.
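
As a concrete illustration (editorial sketch, not code from this PR), a minimal CPU-only exporter can satisfy the pairing requirement by delegating to NumPy, assuming a NumPy version that implements the DLPack protocol (1.22+); the class name is made up:

import numpy as np

class CPUExporter:
    """Toy CPU-only exporter; both methods delegate to the wrapped ndarray."""

    def __init__(self, shape, dtype=np.float64):
        self.array = np.zeros(shape, dtype=dtype)

    def __dlpack__(self, stream=None):
        # CPU exporters have no streams, but must still accept the argument.
        return self.array.__dlpack__()

    def __dlpack_device__(self):
        # Returns (device_type, device_id); NumPy reports (kDLCPU, 0).
        return self.array.__dlpack_device__()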

dalcinl (Member):

OK, but then my original code is still valid, as I'm looking up methods first and calling them afterwards. And I would very much prefer to implement things such that __dlpack_device__ is not mandatory.

leofang (Member Author):

> I would very much prefer to implement things such that __dlpack_device__ is not mandatory.

Unfortunately, there's no agreed-upon default for the behavior of a missing __dlpack_device__ 😅 For example, for CuPy, defaulting to kDLCPU does not make sense.

dalcinl (Member):

I think you should look at my code more carefully. It is true that if __dlpack_device__ is not available, I set the device_type variable to kDLCPU. However, the only effect of that is calling dlpack() to get the capsule instead of dlpack(stream=-1). The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly. I'm going to keep insisting: the canonical way to check for DLPack support should be to look up the __dlpack__ attribute, as I did originally.

leofang (Member Author):

Could you check commit 4acb6bd that I just force-pushed? I reverted most of the changes to asdlpack.pxi and now check both __dlpack_device__ and __dlpack__, which I must insist on.

> However, the only effect of that is calling dlpack() to get the capsule instead of dlpack(stream=-1). The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly.

I disagree that it "would still work properly". For GPU buffers, the possibility of passing stream=None incurs extra overhead to either set up stream ordering or synchronize. This has to be avoided to the maximum extent possible, as we already ask users to do this themselves.

__dlpack_device__ is the first function we query to obtain device info (data-apis/array-api#142 (comment); see also my pseudocode in #56 (comment), which is followed by everyone: NumPy, CuPy, PyTorch, etc.), and it could potentially include more info in the future (say, the API/ABI version) than a 2-tuple. If this is bothersome/ambiguous to you, I can send a PR to make it clear in the API standard, but if checking both methods can eliminate this corner case, I don't see why we wouldn't want to do it.
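
For reference, a rough Python sketch of the consumer-side order described here (editorial illustration, not mpi4py's actual Cython code; kDLCPU = 1 comes from dlpack.h):

kDLCPU = 1  # device-type enum value from dlpack.h

def get_dlpack_capsule(obj):
    # Both methods must be present; otherwise the object is not a DLPack exporter.
    if not (hasattr(obj, '__dlpack__') and hasattr(obj, '__dlpack_device__')):
        raise TypeError("object does not support the DLPack protocol")
    device_type, _ = obj.__dlpack_device__()
    if device_type == kDLCPU:
        return obj.__dlpack__()
    # Non-CPU buffer: stream=-1 tells the exporter that no stream ordering is
    # needed, since the consumer already asks its users to synchronize themselves.
    return obj.__dlpack__(stream=-1)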

dalcinl (Member):

I think you are failing to see my point, probably because your brain is locked up in all the mess that went into putting the DLPack protocol together. Please consider the following pseudocode with the logic to get the capsule. This pseudocode is equivalent to my original implementation, without all the exception handling:

dlpack = obj.__dlpack__  # required attribute
if hasattr(obj, '__dlpack_device__'): # optional attribute
    device_type, _ = obj.__dlpack_device__()
    if device_type == kDLCPU:
        capsule = dlpack()
    else:
        capsule = dlpack(stream=-1)
else:
    capsule = dlpack()

This implementation makes the __dlpack_device__ attribute optional: optional for the exporter, not for mpi4py (the consumer). mpi4py does look for __dlpack_device__, and if the buffer is GPU, mpi4py does call obj.__dlpack__(stream=-1) properly. But if, for whatever reason, the exporter library decides not to expose a __dlpack_device__ method, my code still works and requests the DLPack capsule by calling obj.__dlpack__() without the stream argument. If the exporter decides not to expose __dlpack_device__, well, they should be prepared for the plain call to obj.__dlpack__(). Why might some exporter not expose __dlpack_device__? Perhaps because the exporter only deals with CPU buffers, and then __dlpack_device__ is noisy overhead. Or perhaps the exporter handles GPU buffers always within the default stream, or always within a custom private stream, and there is no point anyway in exposing __dlpack_device__. Remember that the DLTensor metadata also contains the device type and id.

In short, I truly believe that the Python dlpack specification should NOT make __dlpack_device__ mandatory. If they decide to do so anyway, well, I'm still free to treat __dlpack_device__ as optional.

Can you please explain to me what is wrong with my pseudocode above? If an exporter exposes __dlpack_device__, I honor it and handle the GPU path appropriately using the stream argument. Otherwise, I request the capsule without a stream argument. In short, if an exporter handles GPUs and multiple streams, then it is the exporter's responsibility to expose the __dlpack_device__ method, and mpi4py will honor it. Is my logic flawed? If yes, where? I'm just making __dlpack_device__ optional rather than raising AttributeError.

(Two additional review threads on src/mpi4py/MPI/asdlpack.pxi were marked outdated and resolved.)
leofang (Member Author) commented Jul 1, 2021

Now this passes with CuPy v9.1. I will test against CuPy's master branch later (which supports the new DLPack protocol).

mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -f -v

dalcinl (Member) commented Jul 1, 2021

@leofang I have the feeling that you jumped over the branch and started making changes without actually testing things with CuPy using the original code. Did you really get any failures that actually required changing my implementation? Am I violating any standard recommendation (if all this mess can be called a standard)?

leofang (Member Author) commented Jul 1, 2021

> @leofang I have the feeling that you jumped over the branch and started making changes without actually testing things with CuPy using the original code.

You lost me, bro. I thought we wanted to add DLPack tests for GPU arrays? I did run the test suite on your branch first, and it worked fine.

> Did you really get any failures that actually required changing my implementation?
> Am I violating any standard recommendation (if all this mess can be called a standard)?

Assuming you're referring to __dlpack_device__, then yes, please see my #59 (comment).

leofang changed the title from "[WIP] Add GPU tests for DLPack support" to "Add GPU tests for DLPack support" on Jul 1, 2021
leofang marked this pull request as ready for review on July 1, 2021, 22:48
leofang (Member Author) commented Jul 1, 2021

> Now this passes with CuPy v9.1. I will test against CuPy's master branch later (which supports the new DLPack protocol).
>
> mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -f -v

Done. All green.

leofang marked this pull request as draft on July 1, 2021, 23:11
leofang (Member Author) commented Jul 1, 2021

Looking into something weird...

leofang marked this pull request as ready for review on July 2, 2021, 00:05
leofang (Member Author) commented Jul 2, 2021

> Looking into something weird...

Fixed my silliness. I hate WFH...

dalcinl (Member) commented Jul 2, 2021

> Am I violating any standard recommendation (if all this mess can be called a standard)?

> Assuming you're referring to __dlpack_device__, then yes, please see my #59 (comment).

Again, I'm handling __dlpack_device__ as if it were optional rather than mandatory. IMHO, the spec should be updated to work that way. It should be the exporter's choice whether to expose __dlpack_device__ or not.

Do you agree that __dlpack_device__ is just extra noise for CPU-only libraries? So why does the method have to be mandatory? It looks like the GPU folks proposing the spec want everyone to eat their own dog food.

While I can still support the spec, treating the __dlpack_device__ method as optional in mpi4py is our prerogative. Of course, unless a real example/use case is provided showing that this approach is flawed and would lead to actual bugs/issues.

dalcinl (Member) commented Jul 19, 2021

@leofang Back to this one... What's wrong with my original approach of not making __dlpack_device__ required and, if it is not present, assuming the exported buffer is CPU? IMHO, that should be part of the spec. The GPU folks should not force their own dog food on the CPU world. For CPU exporters, having to implement __dlpack_device__ is a nuisance.

dalcinl (Member) commented Jul 19, 2021

Let me explain my reasoning. I truly believe that some aspects of the DLPack spec are misguided and overreaching. I envision that some future tweaks may be added. So the less you "require" from the current spec, the better. Assuming a CPU buffer if __dlpack_device__ is not present is quite reasonable, IMHO.

leofang (Member Author) commented Jul 19, 2021

@dalcinl Sorry to be blunt, but I don't understand why you have to keep bugging me on this. I feel it's a waste of time discussing why __dlpack_device__ should not be mandatory. I already told you about the stream handling. Imagine I have a GPU-only library and I forget to implement __dlpack_device__ for whatever silly reason. Without my change your users would come and complain to you that mpi4py keeps synchronizing (which is not true, but they don't know/care). Why would you put yourself in such a situation instead of just throwing an error so that users know the support is incomplete?

If this is still not convincing then I think we're in a pickle and I don't know what else I can say. You're thinking purely from the CPU-only mindset, against the protocol design that wants to handle generalizability. Why is CPU so special among all the DLPack device types? If I were an ML person (luckily I am not) I'd argue GPU should be the default. Moreover, the cost to look up an attribute is really nothing compared to other operations in the whole HPC/ML workload, so can we please move on?

dalcinl (Member) commented Jul 19, 2021

> @dalcinl Sorry to be blunt, but I don't understand why you have to keep bugging me on this. I feel it's a waste of time discussing why __dlpack_device__ should not be mandatory.

Well, that's a consequence of a specification that is not top quality, with no current provisions for backward/forward compatibility. What's the next step? Require __dlpack_version__? And next require __dlpack_we_forgot_about_another_detail__?

> I already told you about the stream handling. Imagine I have a GPU-only library and I forget to implement __dlpack_device__ for whatever silly reason.

Shame on you if you are so careless. Not because of the mistake (we are humans and we all make mistakes); shame on you for not having minimal testing to catch your bugs.

> Without my change your users would come and complain to you that mpi4py keeps synchronizing (which is not true, but they don't know/care). Why would you put yourself in such a situation instead of just throwing an error so that users know the support is incomplete?

Do you mean like the error reports and complaints that I have received over 15 years about bugs in the backend MPI implementations?

> If this is still not convincing then I think we're in a pickle and I don't know what else I can say. You're thinking purely from the CPU-only mindset, against the protocol design that wants to handle generalizability.
> Why is CPU so special among all the DLPack device types? If I were an ML person (luckily I am not) I'd argue GPU should be the default.

I got confused with my own argument. In my original code, if the __dlpack_device__ method was not implemented, then obj.__dlpack__() was called without arguments, so that meant stream=None for GPU exporters. That would still be OK for a simple GPU exporter that does not handle/support streams other than the default.

> Moreover, the cost to look up an attribute is really nothing compared to other operations in the whole HPC/ML workload,

My complaints were never about the cost of attribute lookup. Moreover, my original code still has to pay the cost of the attribute lookup related to __dlpack_device__.

> so can we please move on?

Sure thing. If we cannot handle dissent, we can hardly expect to reach an agreement.

PS: I'll merge your branch into mine tomorrow. I still have to give a second thought to the handling of capsule ownership.

def __dlpack__(self, stream=None):
    cupy.cuda.get_current_stream().synchronize()
    if self.has_dlpack:
        return self.array.__dlpack__(stream=-1)
dalcinl (Member):

Why stream=-1 and not stream=stream? Is it because of the previous synchronize() call?

leofang (Member Author):

-1 comes from the Consumer, which is mpi4py. The previous synchronize() is just mimicking users doing the right thing (recall the change we did with as_raw() in #60).
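
For completeness, the matching __dlpack_device__ on this test helper could simply delegate to the wrapped CuPy array (an editorial sketch, assuming a CuPy version that implements the protocol, e.g. the master branch mentioned above):

def __dlpack_device__(self):
    # Delegate to the wrapped array: returns (device_type, device_id),
    # e.g. (kDLCUDA, 0) for a buffer on GPU 0.
    return self.array.__dlpack_device__()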

dalcinl closed this on Jul 20, 2021
dalcinl (Member) commented Jul 20, 2021

The fun continues in #68.

leofang deleted the dlpack_gpu_test branch on July 20, 2021, 14:18
leofang (Member Author) commented Jul 20, 2021

> Well, that's a consequence of a specification that is not top quality, with no current provisions for backward/forward compatibility. What's the next step? Require __dlpack_version__? And next require __dlpack_we_forgot_about_another_detail__?

Well, yeah it's still evolving... We do have the DLPACK_VERSION macro around for the time being.

By the way, in v0.6 the format changed from 060 to 60:
https://github.com/dmlc/dlpack/blob/277508879878e0a5b5b43599b1bea11f66eb3c6c/include/dlpack/dlpack.h#L15-L16

On the bright side, there are more people like us who care about the compatibility issue, given that DLPack is more widely accessible than CAI, so hopefully we'll come up with a solution soon.

> Shame on you if you are so careless. Not because of the mistake (we are humans and we all make mistakes); shame on you for not having minimal testing to catch your bugs.

Bite me 🥳

Seriously, for the sake of the community we need both attributes implemented. If an object from this crappy library were passed to NumPy/CuPy/PyTorch, an AttributeError would be raised. Even NumPy requires its presence, so everyone needs to play nice.

> Do you mean like the error reports and complaints that I have received over 15 years about bugs in the backend MPI implementations?

Exactly. All I wanted was to shield you from some of the nuisance.

> I got confused with my own argument. In my original code, if the __dlpack_device__ method was not implemented, then obj.__dlpack__() was called without arguments, so that meant stream=None for GPU exporters. That would still be OK for a simple GPU exporter that does not handle/support streams other than the default.

Yes, stream=None was designed for such exporters. But its semantics are different from stream=-1, which stands for "I couldn't care less about streams, just leave me alone", and that is really the one we should use.
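
To illustrate the distinction (purely a toy editorial sketch; _synchronize, _order_after, and _make_capsule are hypothetical stand-ins, not a real library API):

class ToyGPUExporter:
    """Toy GPU exporter showing how the stream argument is meant to be handled."""

    def __dlpack__(self, stream=None):
        if stream is None:
            # Stream-oblivious consumer: play it safe and synchronize.
            self._synchronize()
        elif stream == -1:
            # "Leave me alone": the consumer handles synchronization itself.
            pass
        else:
            # A real stream handle: establish stream order so the data is ready
            # for work the consumer enqueues on that stream.
            self._order_after(stream)
        return self._make_capsule()

    # No-op stubs so the sketch runs as-is.
    def _synchronize(self): pass
    def _order_after(self, stream): pass
    def _make_capsule(self): return object()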

> My complaints were never about the cost of attribute lookup.

Well, you did say it's a "noisy overhead" in #59 (comment) 😄

> Sure thing. If we cannot handle dissent, we can hardly expect to reach an agreement.

Thank you 🙏

> I still have to give a second thought to the handling of capsule ownership.

Oh? Is this still a concern?

dalcinl (Member) commented Jul 20, 2021

> Well, you did say it's a "noisy overhead" in #59 (comment)

I was referring to the fact that forcing CPU exporters (e.g. NumPy) to implement __dlpack_device__ is noisy overhead. That's why I still think __dlpack_device__ should be optional and, if missing, should imply device type kDLCPU. CPUs still run the world; it makes total sense to me as a default.

> I still have to give a second thought to the handling of capsule ownership.
>
> Oh? Is this still a concern?

Well, if I read the wording and the intention of the spec, then what we are doing is fine. But I'm just thinking about ways some exporters may decide to break my assumptions. Basically, as long as calling obj.__dlpack__() does not have side effects, in the sense of changing the internal state of obj between getting the capsule and executing the deleter, our current support is rock solid.

In short, our current code works this way:

  1. get the capsule
  2. run the deleter
  3. use the buffer pointer

If the only intention of DLPack is as a stateless protocol to communicate the buffer metadata, then we are fine.

However, if an exporter forcibly requires things to be done in the following order:

  1. get the capsule
  2. use the buffer pointer
  3. run the deleter

then we have a problem. In other words, if the deleter somehow invalidates the buffer pointer, then our implementation will not work. But my understanding of the spec is that this should not be the case. Still, some of the wording is a bit confusing, so I may be wrong. Or implementors of the spec may get confused and do things wrong.
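
To make the assumption concrete, here is a toy model (editorial sketch; no real DLPack capsules or C pointers involved, all names are made up) of why the current order is safe as long as the deleter only releases the exporter's bookkeeping and does not invalidate the underlying buffer:

import numpy as np

class ToyExporter:
    """Stands in for a DLPack exporter; the 'capsule' is just (buffer, deleter)."""

    def __init__(self):
        self.buf = np.arange(8, dtype=np.int64)
        self.exported = False

    def get_capsule(self):            # stands in for obj.__dlpack__()
        self.exported = True
        return self.buf, self._deleter

    def _deleter(self):               # stands in for the capsule deleter
        # Benign deleter: only clears bookkeeping; the buffer stays alive
        # because the exporter object itself still owns it.
        self.exported = False

obj = ToyExporter()
buf, deleter = obj.get_capsule()  # 1. get the capsule
deleter()                         # 2. run the deleter right away (current mpi4py order)
print(buf.sum())                  # 3. the buffer is still valid afterwards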
