
Support the new DLPack exchange protocol from Python Array API #56

Closed
leofang opened this issue Jun 4, 2021 · 7 comments · Fixed by #68

leofang self-assigned this Jun 4, 2021

dalcinl commented Jun 4, 2021

IMHO, DLPack has a minor but fatal spec flaw: they did not add a version entry to the DLTensor structure (and perhaps also DLManagedTensor). Without that, it is very hard to update the spec in a backward-incompatible way (for example, to support quad-precision complex numbers).


dalcinl commented Jun 4, 2021

Tentative and untested implementation in the feature/dlpack branch. I have had the code for some time, but it was missing sanity error checks and had no tests to validate it. But it is a start.

PS: If we manage to get this in, then we make a release.


dalcinl commented Jun 4, 2021

@leofang After we implement DLPack support, mpi4py should be listed in here.


leofang commented Jun 4, 2021

IMHO, DLPack has a minor but fatal spec flaw: they did not add a version entry to the DLTensor structure (and perhaps also DLManagedTensor). Without that, it is very hard to update the spec in a backward-incompatible way

Yes, this is being discussed in the meta-issue dmlc/dlpack#74. But at least we can use the DLPACK_VERSION macro from the DLPack header. No ABI-breaking change has been made so far.

(for example, to support quad-precision complex numbers).

It is interesting to see this brought up again! Someone also mentioned it in dmlc/dlpack#73, but does MPI 3.1 currently support quad precision?!

Tentative and untested implementation in the feature/dlpack branch. I have had the code for some time, but it was missing sanity error checks and had no tests to validate it. But it is a start.

Oh wow, that's quick... Thanks Lisandro! But I took a quick look and it does not meet the standard -- can I take over and polish it? We need to:

  1. Vendor a copy of the DLPack header: It's not explicitly mentioned in the documentation yet, but it is expected by the community. This also allows us to access DLPACK_VERSION and potentially C APIs added in the future (see the above-linked issue).
  2. Turn __dlpack__ and __dlpack_device__ into methods, as they are not attributes. In particular, we need to call __dlpack__(stream=-1), where -1 means don't do any synchronization or stream ordering. This is the backdoor clause I added to the spec for mpi4py.

PS: If we manage to get this in, then we make a release.

I was counting on you to say this 😄

After we implement DLPack support, mpi4py should be listed in here.

Sure, will do.


dalcinl commented Jun 4, 2021

DLPACK_VERSION is useless for consumers. You need the version in the DLTensor struct itself; that way, if there are ABI differences between two versions, consumers can decide whether to error out or handle the different layouts. When you define a data exchange format, adding a version tag to the exchanged data is a basic requirement that no one should object to. Unfortunately, it is a bit late to introduce it in DLPack without breaking the ABI. The version slot should be the very first entry in the DLTensor struct.
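
To make the point concrete, here is a hypothetical layout (not the actual DLPack header) with the version as the very first field, so a consumer can check compatibility before interpreting anything else:

#include <stdint.h>

/* Hypothetical, versioned variant -- NOT the actual DLPack struct. */
typedef struct {
    uint32_t version;   /* ABI version; first field, so its offset can never change */
    void    *data;
    /* ...everything after this point is free to change between versions... */
} VersionedDLTensor;

#define MY_SUPPORTED_ABI 1  /* made-up value: highest ABI this consumer understands */

/* Consumer side: the version is readable no matter what the rest looks like. */
static int tensor_is_understood(const void *exchanged) {
    uint32_t v = ((const VersionedDLTensor *)exchanged)->version;
    return v <= MY_SUPPORTED_ABI;  /* otherwise error out instead of misreading fields */
}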

BTW, why does DLPACK_VERSION use an octal integer literal to define the version number???

No, MPI does not support quad precision. But that's not my point. If DLPack ever wants to support quad complex, it has to break the ABI or introduce some new ugly special case. If they decide to break the ABI, for quad complex or any other reason, consumers have no way to know which version the producer is returning. I'll repeat it again: DLPACK_VERSION is useless for consumers.

About my code not being compliant, you should probably elaborate.

  1. I fail to see the benefit of vendoring header files, at least not at this point. What we vendor today may be outdated tomorrow. That's why I insist that DLPACK_VERSION is useless; what is really needed is a version slot in the DLTensor structure.
  2. Read more carefully: dlpack_attr is being called (with no arguments) to get the capsule. It would be trivial to rename the variable to dlpack_method if the name bothers you. NumPy does not seem to support passing stream=-1; it looks like it is hardwired to error if stream is not None.

I have no idea about the details involving stream=-1. AFAICT DLPack is an exchange format that supports both CPUs and GPUs. I was expecting that calling capsule = obj.__dlpack__() would have been enough to get either a CPU or GPU address and pass it down to GPU-aware MPI. In other words, I expect mpi4py support for DLPack to work with either CPU or GPU buffers. If we have to special-case DLPack and GPUs by passing a stream argument that some exporters like NumPy will not like, well, then things will not go smoothly and perhaps mpi4py should stay away from such a mess. Can you point me to any docs that clarify the semantics of stream=None vs stream=-1?


leofang commented Jun 4, 2021

Unfortunately, it is a bit late to introduce it in DLPack without breaking the ABI.

There is probably a way out by defining a helper C API in the DLPack header:

typedef struct {
    int abi_version;
    int api_version;
} dlpack_info;

static inline dlpack_info get_dlpack_info(void) {
    dlpack_info info;
    info.abi_version = SOME_MACRO;     // defined in the same header; bumped in a release that breaks the ABI
    info.api_version = ANOTHER_MACRO;  // or just DLPACK_VERSION
    return info;
}

This is what is under discussion (dmlc/dlpack#74) and part of the reason we want to vendor the header.

BTW, why does DLPACK_VERSION use an octal integer literal to define the version number???

I agree it's an awful choice but it's not something we get to pick 😇

No, MPI does not support quad precision. But that's not my point.

Oh OK...Thought I missed something...

If DLPack ever wants to support quad complex, it has to break the ABI or introduce some new ugly special case. If they decide to break the ABI, for quad complex or any other reason, consumers have no way to know which version the producer is returning. I'll repeat it again: DLPACK_VERSION is useless for consumers.

I will repeat again that it is under discussion 🙂 If you have time to join the game, please feel free to do so by first going through dmlc/dlpack#74 and the linked issues therein... I was trying to avoid dragging you into all these tedious discussions, but I would be happy to have you voice your thoughts. I am doing my best to keep the ball rolling; while I am not alone, it does take a lot of work to get everyone on board.

  1. I fail to see the benefit of vendoring header files, at least not at this point. What we vendor today may be outdated tomorrow. That's why I insist that DLPACK_VERSION is useless; what is really needed is a version slot in the DLTensor structure.

I mentioned one benefit above. But your argument can be turned against you: what you implement here can become outdated too, and the worst part is that it is hard to tell, because there is no accompanying DLPack header to diff against.

In its present form you need to at least add a comment saying which version (or commit) the Cython implementation is based on, like what I did for CuPy: https://github.com/cupy/cupy/blob/master/cupy/_core/include/cupy/dlpack/README.md. But having the header around is much easier to inspect and update.

I can point you to at least one place where people expect vendoring: dmlc/dlpack#69 (comment). If you're interested, you can take a look at the NumPy PR and the PyTorch PR. CuPy and cuDF have also done that. Everyone is vendoring a copy to reduce the maintenance overhead.

  2. Read more carefully: dlpack_attr is being called (with no arguments) to get the capsule. It would be trivial to rename the variable to dlpack_method if the name bothers you. NumPy does not seem to support passing stream=-1; it looks like it is hardwired to error if stream is not None.

I have no idea about the details involving stream=-1. AFAICT DLPack is an exchange format that supports both CPUs and GPUs. I was expecting that calling capsule = obj.__dlpack__() would have been enough to get either a CPU or GPU address and pass it down to GPU-aware MPI. In other words, I expect mpi4py support for DLPack to work with either CPU or GPU buffers. If we have to special-case DLPack and GPUs by passing a stream argument that some exporters like NumPy will not like, well, then things will not go smoothly and perhaps mpi4py should stay away from such a mess. Can you point me to any docs that clarify the semantics of stream=None vs stream=-1?

It is necessary for CPU libraries to use stream=None and GPU libraries to use stream=-1, because the semantics of stream=None on GPU are different: it refers to the legacy default stream. Additional information on this protocol is currently contained in the docstrings of __dlpack__ and __dlpack_device__, but it will be rewritten for better clarity soon.

If this helps: Recall that in CAI we used stream=None to signal "no need to sync whatsoever". It is equivalent to stream=-1 here in DLPack.

The special case you are referring to can be queried via __dlpack_device__. Please see how it is done in my CuPy PR; I expect this can be naturally generalized to handle CPU buffers as well, so I don't see any problem for mpi4py. If it helps, the pseudocode looks like this:

def from_dlpack(x):
    device_type, device_id = x.__dlpack_device__()
    if device_type == cpu:    # device types as defined by DLPack's DLDeviceType
        capsule = x.__dlpack__(stream=None)
    elif device_type == gpu:
        my_stream = get_the_stream_to_use()  # <-- for mpi4py this is -1
        capsule = x.__dlpack__(stream=my_stream)
    else:
        raise NotImplementedError
    return _capsule_to_my_array_or_buffer(capsule)

PS: I still need your answer to my question:

Can I take over and polish it?


dalcinl commented Jun 6, 2021

I force-pushed a few enhancements. I'm still not sure about whether the destructor should be called or not. Your CuPy code does not call the DLManagedTensor deleter; however, the spec seems to require it:

The producer must set the PyCapsule name to "dltensor" so that it can be inspected by name. The consumer must set the PyCapsule name to "used_dltensor", and call the deleter of the DLPackManagedTensor when it no longer needs the data.
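
If it helps, here is a minimal consumer-side sketch of that contract, using the Python C API and the DLManagedTensor layout from the DLPack header (untested, error handling omitted):

#include <Python.h>
#include "dlpack.h"  /* whichever header provides DLManagedTensor */

/* Take ownership of a capsule produced by obj.__dlpack__(). */
static DLManagedTensor *consume_capsule(PyObject *capsule) {
    if (!PyCapsule_IsValid(capsule, "dltensor"))
        return NULL;  /* already consumed ("used_dltensor") or not a DLPack capsule */
    DLManagedTensor *managed =
        (DLManagedTensor *) PyCapsule_GetPointer(capsule, "dltensor");
    /* Rename so nobody else tries to take ownership of the same tensor. */
    PyCapsule_SetName(capsule, "used_dltensor");
    return managed;
}

/* Later, once the data is no longer needed, call the producer's deleter. */
static void release_tensor(DLManagedTensor *managed) {
    if (managed && managed->deleter)
        managed->deleter(managed);
}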

PS: I found another minor issue in DLPack. When you care about ABI, you should avoid using enum types inside structs. Thus, that line should read, for example, unsigned device_type.
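
A generic illustration of the problem (made-up types, not the DLPack definitions): the size of an enum is implementation-defined, so two builds can disagree on the struct layout, whereas a plain integer field is pinned down by the platform ABI:

typedef enum { kCPU = 1, kGPU = 2 } DeviceKind;

/* Fragile: sizeof(DeviceKind) is implementation-defined, e.g. GCC/Clang
   with -fshort-enums shrink it, changing field offsets and sizeof. */
struct DeviceA {
    DeviceKind device_type;
    int        device_id;
};

/* Stable: an unsigned field has the size dictated by the platform ABI,
   independent of compiler flags; enum values can still be stored in it. */
struct DeviceB {
    unsigned device_type;  /* holds a DeviceKind value */
    int      device_id;
};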

PS: Another inconsistency: DLDataTypeCode uses U suffixes for the integer enum values; however, DLDeviceType does not.
