Add GPU tests for DLPack support #59
Conversation
src/mpi4py/MPI/asdlpack.pxi (Outdated)

```diff
@@ -151,7 +152,8 @@ cdef inline Py_ssize_t dlpack_get_itemsize(const DLTensor *dltensor) nogil:
 #------------------------------------------------------------------------------

 cdef int Py_CheckDLPackBuffer(object obj):
-    try: return <bint>hasattr(obj, '__dlpack__')
+    # we check __dlpack_device__ to avoid potential side effects
```
Side effects? No way! The `__dlpack__` attribute is a method! What side effect could possibly result from looking up a method WITHOUT calling it? I think someone elsewhere suggested this approach. I don't like it and don't see the point. Are CPU-only libraries forced to expose this `__dlpack_device__` method?
> Side effects? No way! The `__dlpack__` attribute is a method!

You are right. This is what happens when you have a long day babysitting and try to work at midnight... 😞 I forgot they are callables, not Python properties.

> Are CPU-only libraries forced to expose this `__dlpack_device__` method?

Yes. `__dlpack__` and `__dlpack_device__` must come as a pair. Libraries that implement `from_dlpack()` are expected to check `__dlpack_device__` first and act accordingly.
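As a concrete illustration of the pair requirement, here is a minimal sketch of a CPU-only exporter (the `CPUBuffer` class, its NumPy backing array, and the hard-coded device tuple are illustrative assumptions, not code from this PR):

```python
import numpy as np

kDLCPU = 1  # device type code from dlpack.h

class CPUBuffer:
    """Hypothetical CPU-only exporter implementing the DLPack pair."""
    def __init__(self, data):
        self._data = np.asarray(data)

    def __dlpack_device__(self):
        # consumers query this first to learn where the data lives
        return (kDLCPU, 0)

    def __dlpack__(self, stream=None):
        # delegate capsule creation to the backing NumPy array
        return self._data.__dlpack__()

buf = CPUBuffer([1.0, 2.0, 3.0])
arr = np.from_dlpack(buf)  # checks __dlpack_device__, then calls __dlpack__
print(arr)
```

Because both methods are present, a consumer such as `np.from_dlpack()` can accept the object without any special-casing.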
OK, but then my original code is still valid, as I'm looking up methods first and calling them afterwards. And I would very much prefer to implement things such that `__dlpack_device__` is not mandatory.
> I would very much prefer to implement things such that `__dlpack_device__` is not mandatory.

Unfortunately, there's no agreed-upon default behavior for a missing `__dlpack_device__` 😅 For example, for CuPy, defaulting to `kDLCPU` does not make sense.
I think you should look at my code more carefully. It is true that if `__dlpack_device__` is not available, I set the `device_type` variable to `kDLCPU`. However, the only effect of that is calling `dlpack()` to get the capsule, and not `dlpack(stream=-1)`. The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly. I'm going to keep insisting: the canonical way to check for DLPack support should be to look up the `__dlpack__` attribute, as I did originally.
Could you check commit 4acb6bd that I just force-pushed? I reverted most of the changes to `asdlpack.pxi` and now check both `__dlpack_device__` and `__dlpack__`, which I must insist on.

> However, the only effect of that is calling `dlpack()` to get the capsule, and not `dlpack(stream=-1)`. The exporter could still expose a GPU buffer (perhaps on the default stream), and my code would still work properly.

I disagree that it "would still work properly". For GPU buffers, the possibility of passing `stream=None` incurs extra overhead to either set up stream ordering or synchronize. This has to be avoided to the maximal extent allowed, as we already ask users to do this themselves.

`__dlpack_device__` is the first function we query to obtain device info (data-apis/array-api#142 (comment); see also my pseudo code in #56 (comment), which is followed by everyone: NumPy, CuPy, PyTorch, etc.), and it could potentially include more info in the future (say, the API/ABI version) than a 2-tuple. If this is bothering/ambiguous to you, I can send a PR to make it clear in the API standard, but if checking both methods can eliminate this corner case, I don't see why we wouldn't want to do it.
I think you are failing to see my point, probably because your brain is locked up in all the mess that involved putting the DLPack protocol together. Please consider the following pseudocode with the logic to get the capsule. It is equivalent to my original implementation, without all the exception handling:
```python
dlpack = obj.__dlpack__                    # required attribute
if hasattr(obj, '__dlpack_device__'):      # optional attribute
    device_type, _ = obj.__dlpack_device__()
    if device_type == kDLCPU:
        capsule = dlpack()
    else:
        capsule = dlpack(stream=-1)
else:
    capsule = dlpack()
```
This implementation makes the `__dlpack_device__` attribute optional. Optional for the exporter, not for mpi4py (the consumer). mpi4py does look for `__dlpack_device__`, and if the buffer is GPU, mpi4py does call `obj.__dlpack__(stream=-1)` properly. But if, for whatever reason, the exporter library decides not to expose a `__dlpack_device__` method, my code still works and requests the DLPack capsule by calling `obj.__dlpack__()` without the stream argument. If the exporter decides not to expose `__dlpack_device__`, well, they should be prepared for the plain call to `obj.__dlpack__()`. Why might some exporter not expose `__dlpack_device__`? Perhaps because the exporter only deals with CPU buffers, and then `__dlpack_device__` is noisy overhead. Or perhaps the exporter handles GPU buffers always within the default stream, or always within a custom private stream, and there is no point anyway in exposing `__dlpack_device__`. Remember that the `DLTensor` metadata also contains the device type and id.

In short, I truly believe that the Python DLPack specification should NOT make `__dlpack_device__` mandatory. If they decide to do so anyway, well, I'm still free to treat `__dlpack_device__` as optional.

Can you please explain to me what is wrong with my pseudocode above? If an exporter exposes `__dlpack_device__`, I honor it, and handle the GPU path appropriately using the `stream` argument. Otherwise, I request the capsule without a `stream` argument. In short, if an exporter handles GPUs and multiple streams, then it is the exporter's responsibility to expose the `__dlpack_device__` method, and mpi4py will honor it. Is my logic flawed? If yes, where? I'm just making `__dlpack_device__` optional rather than raising `AttributeError`.
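A runnable sketch of this consumer-side logic, treating `__dlpack_device__` as optional (the `get_capsule` name is invented for illustration, and a NumPy array serves as the exporter here):

```python
import numpy as np

kDLCPU = 1  # device type code from dlpack.h

def get_capsule(obj):
    """Request a DLPack capsule, treating __dlpack_device__ as optional."""
    dlpack = obj.__dlpack__                            # required attribute
    device = getattr(obj, '__dlpack_device__', None)   # optional attribute
    if device is not None:
        device_type, _ = device()
        if device_type != kDLCPU:
            # GPU path: tell the exporter no stream ordering is needed
            return dlpack(stream=-1)
    return dlpack()

capsule = get_capsule(np.arange(4))
print(type(capsule).__name__)  # PyCapsule
```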
Now this passes with CuPy v9.1. I will test against CuPy's master branch later (which supports the new DLPack protocol).
|
@leofang I have the feeling that you jumped over the branch and started making changes, without actually testing things with CuPy using the original code. Did you really get any failures that required changing my implementation? Am I violating any standard recommendation (if all this mess can be called a standard)?
You lost me bro, I thought we wanted to add DLPack tests for GPU arrays? I did run the test suite on your branch first and it worked fine.
Assuming you're referring to...
Done. All green.
Looking into something weird...
Fixed my silliness. I hate WFH...
Again, I'm handling the... Do you agree that... We... I can still support the spec treating the...
@leofang Back to this one... What's wrong with my original approach of not making...
Let me explain my reasoning. I truly believe that some aspects of the DLPack spec are misguided and overreaching. I envision some future tweaks may be added. So the less you "require" from the current spec, the better. Assuming a CPU buffer is...
@dalcinl Sorry to be blunt, but I don't understand why you have to keep bugging me on this. I feel it's a waste of time discussing why...

If this is still not convincing, then I think we're in a pickle and I don't know what else I can say. You're thinking purely from a CPU-only mindset, against a protocol design that aims for generality. Why is CPU so special among all the DLPack device types? If I were an ML person (luckily I am not), I'd argue GPU should be the default. Moreover, the cost of looking up an attribute is really nothing compared to other operations in the whole HPC/ML workload, so can we please move on?
Well, that's a consequence of a specification that is not top quality, with no current provisions for backward/forward compatibility. What's the next step? Require...
Shame on you if you are so careless. Not because of the mistake, because we are humans and we all make mistakes. Shame on you for not having minimal testing to catch your bugs.
Do you mean like the error reports and complaints that I have received over 15 years about bugs in the backend MPI implementations?
I got confused with my own argument. In my original code, if the...
My complaints were never about the cost of attribute lookup. Moreover, my original code still has to pay the cost of the attribute lookup related to...
Sure thing. If we cannot handle dissent, we can hardly expect to reach an agreement. PS: I'll merge your branch into mine tomorrow. I still have to give a second thought to the handling of capsule ownership.
```python
def __dlpack__(self, stream=None):
    cupy.cuda.get_current_stream().synchronize()
    if self.has_dlpack:
        return self.array.__dlpack__(stream=-1)
```
Why `stream=-1` and not `stream=stream`? Is it because of the previous `synchronize()` call?
`-1` comes from the Consumer, which is mpi4py. The previous `synchronize()` is just mimicking users doing the right job (recall the change we did with `as_raw()` in #60).
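To see who passes `stream=-1`, here is a hypothetical exporter that merely records the value a consumer hands to `__dlpack__` (the `FakeExporter` name and its NumPy backing are assumptions for illustration; a real GPU exporter would do stream ordering at that point):

```python
import numpy as np

class FakeExporter:
    """Hypothetical exporter that records the consumer-supplied stream."""
    def __init__(self, array):
        self.array = array
        self.last_stream = None

    def __dlpack_device__(self):
        return (1, 0)  # (kDLCPU, device id)

    def __dlpack__(self, stream=None):
        # a consumer passing stream=-1 signals "no stream ordering needed"
        self.last_stream = stream
        return self.array.__dlpack__()

exp = FakeExporter(np.arange(3))
capsule = exp.__dlpack__(stream=-1)  # the consumer's call, as mpi4py would make it
print(exp.last_stream)  # -1
```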
The fun continues in #68.
Well, yeah, it's still evolving... We do have the... By the way, in v0.6 the format is changed from... On the bright side, there are more people like us who care about the compatibility issue, given its wider accessibility compared to CAI's, so hopefully we'll come up with a solution soon.
Bite me 🥳 Seriously, for the sake of the community we need both attributes implemented. If an object from this crappy library is passed to NumPy/CuPy/PyTorch, an...
Exactly. All I wanted was to shield you from some of the nuisance.
Yes,
Well, you did say it's a "noisy overhead" in #59 (comment) 😄
Thank you 🙏
Oh? Is this still a concern?
I was referring to the fact that forcing CPU exporters (e.g. NumPy) to implement...
Well, if I read the wording and the intention of the spec, then what we are doing is fine. But I'm just thinking about ways some exporters may decide to break my assumptions. Basically, as long as calling...

In short, our current code works this way: ...
If the only intention of DLPack is as a stateless protocol to communicate the buffer metadata, then we are fine. However, if an exporter forcibly requires things to be done in the following order:

...then we have a problem. In other words, if the deleter somehow invalidates the buffer pointer, then our implementation will not work. My understanding of the spec is that this should not be the case, but some of the wording is a bit confusing, so I may be wrong. Or implementors of the spec can get confused and do things wrong.
DO NOT MERGE. To be continued tomorrow...
With Open MPI 4.1.1, I am seeing `MPI_ERR_TRUNCATE` in `testAllgatherv3`, so I skip them for now. Tomorrow I will continue investigating the only errors in `test_msgspec.py`.