
Add GPU tests for DLPack support #59

Closed · wants to merge 9 commits

Conversation

leofang (Member) commented Jul 1, 2021

DO NOT MERGE. To be continued tomorrow...

With Open MPI 4.1.1, I am seeing MPI_ERR_TRUNCATE in testAllgatherv3, so I skip those tests for now:

$ mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -e "test_cco_nb_vec" -e "test_cco_vec"

Tomorrow I will continue investigating the remaining errors, which are only in test_msgspec.py.

@@ -151,7 +152,8 @@ cdef inline Py_ssize_t dlpack_get_itemsize(const DLTensor *dltensor) nogil:
#------------------------------------------------------------------------------

cdef int Py_CheckDLPackBuffer(object obj):
    try: return <bint>hasattr(obj, '__dlpack__')
    # we check __dlpack_device__ to avoid potential side effects
dalcinl (Member):

Side effects? No way! The __dlpack__ attribute is a method! What side effect could possibly come from looking up a method WITHOUT calling it? I think someone elsewhere suggested this approach. I don't like it and don't see the point. Are CPU-only libraries forced to expose this __dlpack_device__ method?

leofang (Member Author):

> Side effects? No way! The __dlpack__ attribute is a method!

You are right. This is what happens when you have a long day babysitting and try to work at midnight... 😞 I forgot they are callables, not Python properties.

> Are CPU-only libraries forced to expose this __dlpack_device__ method?

Yes. __dlpack__ and __dlpack_device__ must come as a pair. Libraries that implement from_dlpack() are expected to check __dlpack_device__ first and act accordingly.
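
As a concrete illustration (editorial sketch, not code from this PR), a minimal CPU-only exporter can satisfy the pairing requirement by delegating to NumPy, assuming a NumPy version that implements the DLPack protocol (1.22+); the class name is made up:

import numpy as np

class CPUExporter:
    """Toy CPU-only exporter; both methods delegate to the wrapped ndarray."""

    def __init__(self, shape, dtype=np.float64):
        self.array = np.zeros(shape, dtype=dtype)

    def __dlpack__(self, stream=None):
        # CPU exporters have no streams, but must still accept the argument.
        return self.array.__dlpack__()

    def __dlpack_device__(self):
        # Returns (device_type, device_id); NumPy reports (kDLCPU, 0).
        return self.array.__dlpack_device__()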

dalcinl (Member):

OK, but then my original code is still valid, as I'm looking up methods first and calling them afterwards. And I would very much prefer to implement things such that __dlpack_device__ is not mandatory.

leofang (Member Author):

> I would very much prefer to implement things such that __dlpack_device__ is not mandatory.

Unfortunately, there's no agreed-upon default for the behavior of a missing __dlpack_device__ 😅 For example, for CuPy, defaulting to kDLCPU does not make sense.

dalcinl (Member):

I think you should look at my code more carefully. It is true that if __dlpack_device__ is not available, I set the device_type variable to kDLCPU. However, the only effect of that is calling dlpack() to get the capsule instead of dlpack(stream=-1). The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly. I'm going to keep insisting: the canonical way to check for DLPack support should be to look up the __dlpack__ attribute, as I did originally.

leofang (Member Author):

Could you check commit 4acb6bd that I just force-pushed? I reverted most of the changes to asdlpack.pxi and now check both __dlpack_device__ and __dlpack__, which I must insist on.

> However, the only effect of that is calling dlpack() to get the capsule instead of dlpack(stream=-1). The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly.

I disagree that it "would still work properly". For GPU buffers, the possibility of passing stream=None incurs extra overhead to either set up stream ordering or synchronize. This has to be avoided to the maximum extent possible, as we already ask users to do this themselves.

__dlpack_device__ is the first function we query to obtain device info (data-apis/array-api#142 (comment); see also my pseudocode in #56 (comment), which is followed by everyone: NumPy, CuPy, PyTorch, etc.), and it could potentially include more info in the future (say, the API/ABI version) than a 2-tuple. If this is bothersome/ambiguous to you, I can send a PR to make it clear in the API standard, but if checking both methods can eliminate this corner case, I don't see why we wouldn't want to do it.
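
For reference, a rough Python sketch of the consumer-side order described here (editorial illustration, not mpi4py's actual Cython code; kDLCPU = 1 comes from dlpack.h):

kDLCPU = 1  # device-type enum value from dlpack.h

def get_dlpack_capsule(obj):
    # Both methods must be present; otherwise the object is not a DLPack exporter.
    if not (hasattr(obj, '__dlpack__') and hasattr(obj, '__dlpack_device__')):
        raise TypeError("object does not support the DLPack protocol")
    device_type, _ = obj.__dlpack_device__()
    if device_type == kDLCPU:
        return obj.__dlpack__()
    # Non-CPU buffer: stream=-1 tells the exporter that no stream ordering is
    # needed, since the consumer already asks its users to synchronize themselves.
    return obj.__dlpack__(stream=-1)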

dalcinl (Member):

I think you are failing to see my point, probably because your brain is locked up in all the mess that went into putting the DLPack protocol together. Please consider the following pseudocode with the logic to get the capsule. This pseudocode is equivalent to my original implementation, without all the exception handling:

dlpack = obj.__dlpack__  # required attribute
if hasattr(obj, '__dlpack_device__'): # optional attribute
    device_type, _ = obj.__dlpack_device__()
    if device_type == kDLCPU:
        capsule = dlpack()
    else:
        capsule = dlpack(stream=-1)
else:
    capsule = dlpack()

This implementation makes the __dlpack_device__ attribute optional: optional for the exporter, not for mpi4py (the consumer). mpi4py does look for __dlpack_device__, and if the buffer is GPU, mpi4py does call obj.__dlpack__(stream=-1) properly. But if, for whatever reason, the exporter library decides not to expose a __dlpack_device__ method, my code still works and requests the DLPack capsule by calling obj.__dlpack__() without the stream argument. If the exporter decides not to expose __dlpack_device__, well, they should be prepared for the plain call to obj.__dlpack__(). Why might some exporter not expose __dlpack_device__? Perhaps because the exporter only deals with CPU buffers, and then __dlpack_device__ is noisy overhead. Or perhaps the exporter handles GPU buffers always within the default stream, or always within a custom private stream, and there is no point anyway in exposing __dlpack_device__. Remember that the DLTensor metadata also contains the device type and id.

In short, I truly believe that the Python dlpack specification should NOT make __dlpack_device__ mandatory. If they decide to do so anyway, well, I'm still free to treat __dlpack_device__ as optional.

Can you please explain to me what is wrong with my pseudocode above? If an exporter exposes __dlpack_device__, I honor it and handle the GPU path appropriately using the stream argument. Otherwise, I request the capsule without a stream argument. In short, if an exporter handles GPUs and multiple streams, then it is the exporter's responsibility to expose the __dlpack_device__ method, and mpi4py will honor it. Is my logic flawed? If yes, where? I'm just making __dlpack_device__ optional rather than raising AttributeError.

(Two additional review threads on src/mpi4py/MPI/asdlpack.pxi were marked outdated and resolved.)
leofang (Member Author) commented Jul 1, 2021

Now this passes with CuPy v9.1. I will test against CuPy's master branch later (which supports the new DLPack protocol).

mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -f -v

dalcinl (Member) commented Jul 1, 2021

@leofang I have the feeling that you jumped over the branch and started making changes without actually testing things with CuPy using the original code. Did you really get any failures that actually required changing my implementation? Am I violating any standard recommendation (if all this mess can be called a standard)?

leofang (Member Author) commented Jul 1, 2021

> @leofang I have the feeling that you jumped over the branch and started making changes without actually testing things with CuPy using the original code.

You lost me, bro. I thought we wanted to add DLPack tests for GPU arrays? I did run the test suite on your branch first, and it worked fine.

> Did you really get any failures that actually required changing my implementation?
> Am I violating any standard recommendation (if all this mess can be called a standard)?

Assuming you're referring to __dlpack_device__, then yes, please see my #59 (comment).

leofang changed the title from "[WIP] Add GPU tests for DLPack support" to "Add GPU tests for DLPack support" on Jul 1, 2021
leofang marked this pull request as ready for review on July 1, 2021, 22:48
leofang (Member Author) commented Jul 1, 2021

> Now this passes with CuPy v9.1. I will test against CuPy's master branch later (which supports the new DLPack protocol).
>
> mpirun --mca opal_cuda_support 1 -n 1 python test/runtests.py --no-numba --cupy -f -v

Done. All green.

leofang marked this pull request as draft on July 1, 2021, 23:11
leofang (Member Author) commented Jul 1, 2021

Looking into something weird...

leofang marked this pull request as ready for review on July 2, 2021, 00:05
leofang (Member Author) commented Jul 2, 2021

> Looking into something weird...

Fixed my silliness. I hate WFH...

dalcinl (Member) commented Jul 2, 2021

> Am I violating any standard recommendation (if all this mess can be called a standard)?

> Assuming you're referring to __dlpack_device__, then yes, please see my #59 (comment).

Again, I'm handling __dlpack_device__ as if it were optional rather than mandatory. IMHO, the spec should be updated to work that way. It should be the exporter's choice whether to expose __dlpack_device__ or not.

Do you agree that __dlpack_device__ is just extra noise for CPU-only libraries? So why does the method have to be mandatory? It looks like the GPU folks proposing the spec want everyone to eat their own dog food.

While I can still support the spec, treating the __dlpack_device__ method as optional in mpi4py is our prerogative. Of course, unless a real example/use case is provided showing that this approach is flawed and would lead to actual bugs/issues.

dalcinl (Member) commented Jul 19, 2021

@leofang Back to this one... What's wrong with my original approach of not making __dlpack_device__ required and, if it is not present, assuming the exported buffer is CPU? IMHO, that should be part of the spec. The GPU folks should not force their own dog food on the CPU world. For CPU exporters, having to implement __dlpack_device__ is a nuisance.

dalcinl (Member) commented Jul 19, 2021

Let me explain my reasoning. I truly believe that some aspects of the DLPack spec are misguided and overreaching. I envision that some future tweaks may be added. So the less you "require" from the current spec, the better. Assuming a CPU buffer if __dlpack_device__ is not present is quite reasonable, IMHO.

leofang (Member Author) commented Jul 19, 2021

@dalcinl Sorry to be blunt, but I don't understand why you have to keep bugging me on this. I feel it's a waste of time discussing why __dlpack_device__ should not be mandatory. I already told you about the stream handling. Imagine I have a GPU-only library and I forget to implement __dlpack_device__ for whatever silly reason. Without my change your users would come and complain to you that mpi4py keeps synchronizing (which is not true, but they don't know/care). Why would you put yourself in such a situation instead of just throwing an error so that users know the support is incomplete?

If this is still not convincing then I think we're in a pickle and I don't know what else I can say. You're thinking purely from the CPU-only mindset, against the protocol design that wants to handle generalizability. Why is CPU so special among all the DLPack device types? If I were an ML person (luckily I am not) I'd argue GPU should be the default. Moreover, the cost to look up an attribute is really nothing compared to other operations in the whole HPC/ML workload, so can we please move on?

dalcinl (Member) commented Jul 19, 2021

> @dalcinl Sorry to be blunt, but I don't understand why you have to keep bugging me on this. I feel it's a waste of time discussing why __dlpack_device__ should not be mandatory.

Well, that's a consequence of a specification that is not top quality, with no current provisions for backward/forward compatibility. What's the next step? Require __dlpack_version__? And next require __dlpack_we_forgot_about_another_detail__?

> I already told you about the stream handling. Imagine I have a GPU-only library and I forget to implement __dlpack_device__ for whatever silly reason.

Shame on you if you are so careless. Not because of the mistake (we are humans and we all make mistakes); shame on you for not having minimal testing to catch your bugs.

> Without my change your users would come and complain to you that mpi4py keeps synchronizing (which is not true, but they don't know/care). Why would you put yourself in such a situation instead of just throwing an error so that users know the support is incomplete?

Do you mean like the error reports and complaints that I have received over 15 years about bugs in the backend MPI implementations?

> If this is still not convincing then I think we're in a pickle and I don't know what else I can say. You're thinking purely from the CPU-only mindset, against the protocol design that wants to handle generalizability.
> Why is CPU so special among all the DLPack device types? If I were an ML person (luckily I am not) I'd argue GPU should be the default.

I got confused with my own argument. In my original code, if the __dlpack_device__ method was not implemented, then obj.__dlpack__() was called without arguments, so that meant stream=None for GPU exporters. That would still be OK for a simple GPU exporter that does not handle/support streams other than the default.

> Moreover, the cost to look up an attribute is really nothing compared to other operations in the whole HPC/ML workload,

My complaints were never about the cost of attribute lookup. Moreover, my original code still has to pay the cost of the attribute lookup related to __dlpack_device__.

> so can we please move on?

Sure thing. If we cannot handle dissent, we can hardly expect to reach an agreement.

PS: I'll merge your branch into mine tomorrow. I still have to give a second thought to the handling of capsule ownership.

def __dlpack__(self, stream=None):
    cupy.cuda.get_current_stream().synchronize()
    if self.has_dlpack:
        return self.array.__dlpack__(stream=-1)
dalcinl (Member):

Why stream=-1 and not stream=stream? Is it because of the previous synchronize() call?

leofang (Member Author):

-1 comes from the Consumer, which is mpi4py. The previous synchronize() is just mimicking users doing the right thing (recall the change we did with as_raw() in #60).
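
For completeness, the matching __dlpack_device__ on this test helper could simply delegate to the wrapped CuPy array (an editorial sketch, assuming a CuPy version that implements the protocol, e.g. the master branch mentioned above):

def __dlpack_device__(self):
    # Delegate to the wrapped array: returns (device_type, device_id),
    # e.g. (kDLCUDA, 0) for a buffer on GPU 0.
    return self.array.__dlpack_device__()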

dalcinl closed this on Jul 20, 2021
dalcinl (Member) commented Jul 20, 2021

The fun continues in #68.

leofang deleted the dlpack_gpu_test branch on July 20, 2021, 14:18
leofang (Member Author) commented Jul 20, 2021

> Well, that's a consequence of a specification that is not top quality, with no current provisions for backward/forward compatibility. What's the next step? Require __dlpack_version__? And next require __dlpack_we_forgot_about_another_detail__?

Well, yeah it's still evolving... We do have the DLPACK_VERSION macro around for the time being.

By the way, in v0.6 the format changed from 060 to 60:
https://github.com/dmlc/dlpack/blob/277508879878e0a5b5b43599b1bea11f66eb3c6c/include/dlpack/dlpack.h#L15-L16

On the bright side, there are more people like us who care about the compatibility issue, given that DLPack is more widely accessible than CAI, so hopefully we'll come up with a solution soon.

> Shame on you if you are so careless. Not because of the mistake (we are humans and we all make mistakes); shame on you for not having minimal testing to catch your bugs.

Bite me 🥳

Seriously, for the sake of the community we need both attributes implemented. If an object from this crappy library were passed to NumPy/CuPy/PyTorch, an AttributeError would be raised. Even NumPy requires its presence, so everyone needs to play nice.

> Do you mean like the error reports and complaints that I have received over 15 years about bugs in the backend MPI implementations?

Exactly. All I wanted was to shield you from some of the nuisance.

> I got confused with my own argument. In my original code, if the __dlpack_device__ method was not implemented, then obj.__dlpack__() was called without arguments, so that meant stream=None for GPU exporters. That would still be OK for a simple GPU exporter that does not handle/support streams other than the default.

Yes, stream=None was designed for such exporters. But its semantics are different from stream=-1, which stands for "I couldn't care less about streams, just leave me alone", and that is really the one we should use.
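
To illustrate the distinction (purely a toy editorial sketch; _synchronize, _order_after, and _make_capsule are hypothetical stand-ins, not a real library API):

class ToyGPUExporter:
    """Toy GPU exporter showing how the stream argument is meant to be handled."""

    def __dlpack__(self, stream=None):
        if stream is None:
            # Stream-oblivious consumer: play it safe and synchronize.
            self._synchronize()
        elif stream == -1:
            # "Leave me alone": the consumer handles synchronization itself.
            pass
        else:
            # A real stream handle: establish stream order so the data is ready
            # for work the consumer enqueues on that stream.
            self._order_after(stream)
        return self._make_capsule()

    # No-op stubs so the sketch runs as-is.
    def _synchronize(self): pass
    def _order_after(self, stream): pass
    def _make_capsule(self): return object()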

> My complaints were never about the cost of attribute lookup.

Well, you did say it's a "noisy overhead" in #59 (comment) 😄

> Sure thing. If we cannot handle dissent, we can hardly expect to reach an agreement.

Thank you 🙏

> I still have to give a second thought to the handling of capsule ownership.

Oh? Is this still a concern?

dalcinl (Member) commented Jul 20, 2021

> Well, you did say it's a "noisy overhead" in #59 (comment)

I was referring to the fact that forcing CPU exporters (e.g. NumPy) to implement __dlpack_device__ is noisy overhead. That's why I still think __dlpack_device__ should be optional and, if missing, should imply device type kDLCPU. CPUs still run the world; it makes total sense to me as a default.

> I still have to give a second thought to the handling of capsule ownership.
>
> Oh? Is this still a concern?

Well, if I read the wording and the intention of the spec, then what we are doing is fine. But I'm just thinking about ways some exporters may decide to break my assumptions. Basically, as long as calling obj.__dlpack__() does not have side effects, in the sense of changing the internal state of obj between getting the capsule and executing the deleter, our current support is rock solid.

In short, our current code works this way:

  1. get the capsule
  2. run the deleter
  3. use the buffer pointer

If the only intention of DLPack is as a stateless protocol to communicate the buffer metadata, then we are fine.

However, if an exporter forcibly requires things to be done in the following order:

  1. get the capsule
  2. use the buffer pointer
  3. run the deleter

then we have a problem. In other words, if the deleter somehow invalidates the buffer pointer, then our implementation will not work. But my understanding of the spec is that this should not be the case. Still, some of the wording is a bit confusing, so I may be wrong. Or implementors of the spec may get confused and do things wrong.
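
To make the assumption concrete, here is a toy model (editorial sketch; no real DLPack capsules or C pointers involved, all names are made up) of why the current order is safe as long as the deleter only releases the exporter's bookkeeping and does not invalidate the underlying buffer:

import numpy as np

class ToyExporter:
    """Stands in for a DLPack exporter; the 'capsule' is just (buffer, deleter)."""

    def __init__(self):
        self.buf = np.arange(8, dtype=np.int64)
        self.exported = False

    def get_capsule(self):            # stands in for obj.__dlpack__()
        self.exported = True
        return self.buf, self._deleter

    def _deleter(self):               # stands in for the capsule deleter
        # Benign deleter: only clears bookkeeping; the buffer stays alive
        # because the exporter object itself still owns it.
        self.exported = False

obj = ToyExporter()
buf, deleter = obj.get_capsule()  # 1. get the capsule
deleter()                         # 2. run the deleter right away (current mpi4py order)
print(buf.sum())                  # 3. the buffer is still valid afterwards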
