Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store xnd objects in plasma #4

Closed
saulshanabrook opened this issue Mar 5, 2018 · 31 comments
Closed

Store xnd objects in plasma #4

saulshanabrook opened this issue Mar 5, 2018 · 31 comments

Comments

@saulshanabrook
Copy link
Member

We should be able to store xnd objects in the Plasma store, similar to how you can store arrow memory. In the blog post introducing Apache Arrow's usage with Plasma, they outline the motivation for the store. Similar to arrow, it would be beneficial to store xnd objects, so that multiple processes on on one computer could all access them without copying or serialization.

@saulshanabrook
Copy link
Member Author

Originally I thought I could store an xnd object in Plasma without copying, but I no longer believe this to be the case. The create command returns a new memory buffer that Plasma creates and hands you a pointer to. It doesn't seem to support taking ownership over an existing piece of memory.

This does make sense, since Plasma stores immutable data.

@teoliphant
Copy link
Member

teoliphant commented Mar 5, 2018 via email

@saulshanabrook
Copy link
Member Author

saulshanabrook commented Mar 5, 2018

Yeah that makes sense.

So do we have to be able to serialize the XndObject to that contiguous buffer?

typedef struct {
    PyObject_HEAD
    MemoryBlockObject *mblock; /* owner of the primary type and memory block */
    PyObject *type;            /* owner of the current type */
    xnd_t xnd;                 /* typed view, does not own anything */
} XndObject;

In trying to get a sense of exactly where the memory is stored, is it possible to inspect these fields
in a Python REPL on an an xnd instance?

@skrah
Copy link
Member

skrah commented Mar 5, 2018 via email

@teoliphant
Copy link
Member

teoliphant commented Mar 5, 2018 via email

@saulshanabrook
Copy link
Member Author

So would the desired user flow be something like:

  1. Create plasma client connected to store
  2. Tell xnd to use this client
  3. Create some containers with xnd
  4. Retrieve their plasma ids
  5. Start another process
  6. Create plasma client again
  7. Get access to xnd containers using plasma ids

Right now, memory allocation in xnd uses the ndt_aligned_calloc function. This happens when creating a new xnd container, when creating a Ref, and when creating bytes.

If we could get a pointer back from Plasma's create function, then we could use that instead of ndt_aligned_calloc. If we are storying bytes or references, then this means storing multiple Plasma objects.

Is it right to assume that different plasma clients could have different pointer addresses for the same plasma object? Because we are using virtual memory?

If this is the case, then it might not be good to store the pointer for a Ref directly in the xnd data, but instead store the plasma ID instead (same with bytes).

@saulshanabrook
Copy link
Member Author

saulshanabrook commented Mar 5, 2018

I opened an issue asking if it was possible to get a pointer back from the plasma store, to write to apache/arrow#1706

@skrah
Copy link
Member

skrah commented Mar 6, 2018

Here is my current take on it.

Option 1:

I would not like to replace existing allocation functions. Instead, I'd like to proceed cautiously and create a separate pxnd (p for parallel) object.

We would only support pointer-free types. The pxnd object would have a datashape Python string object and a plasma object id and could therefore be pickled.

Creating this object would allocate a data block in plasma.

I assume that the picklable pxndobject would be sent via a pipe to consumer processes.

In such a process, the consumer would call something like pxnd.get(), which would use the existing MemoryBlockObject in the same way as the from_buffer function and return a regular xnd object.

That is to say, the memory in the xnd_master buffer is marked as external, but creation of the xnd object largely proceeds in the usual way. This is the same way the existing PEP-3118 imports are handled.

This all hinges on the possibility of getting a raw pointer from the plasma data store.

If Arrow headers are required for this, I'd like to check for these headers in ./configure and make the compilation of the pxnd object conditional on the presence of the headers.

In general, I'd like to keep the pxnd object in a separate section or file and be able to compile and use xnd without plasma support.

Option 2:

Only allow NumPy types and hope that the existing plasma functions take a PEP-3118 buffer.
In that case, we'd need to add PEP-3118 exporting facilities to xnd.

@teoliphant
Copy link
Member

teoliphant commented Mar 6, 2018 via email

@skrah
Copy link
Member

skrah commented Mar 6, 2018 via email

@saulshanabrook
Copy link
Member Author

The C++ plasma code looks like it can give you back a pointer to the buffer(https://arrow.apache.org/docs/cpp/classplasma_1_1_plasma_client.html#ac18ab9cc792c620a97a3dcb165e0ecd7), but I am not sure if it's possible for use that, since we are using C and they don't export anything to C.

The Cython wrapper might also be helpful, but I also don't think it's possible for us to use that in our C code, since they don't make anything public. I opened a thread on the arrow listserve to see if they have any guidance on using the plasma API from C.

@teoliphant
Copy link
Member

teoliphant commented Mar 7, 2018 via email

@wesm
Copy link

wesm commented Mar 7, 2018

adding @robertnishihara @pcmoritz @pitrou to go along with the mailing list thread. Plasma buffers implement the buffer protocol so it seems like you should be able to use that

@pitrou
Copy link

pitrou commented Mar 7, 2018

Is xnd able to use foreign buffers as backing memory? If not, is there a reason why? Basically Plasma needs to allocate the memory itself in a certain way, to be able to hand it to its various client processes.

(I'm assuming there may be limitations such as contiguity or restriction to certain well-behaved types)

@pitrou
Copy link

pitrou commented Mar 7, 2018

To be clear, my question is whether xnd supports the equivalent of np.frombuffer() to create xnd objects.

@wesm
Copy link

wesm commented Mar 7, 2018

There are some comments here about type metadata:

  1. We need to make sure that plasma can accept ndtypes descriptions for the
    memory

Plasma is unconcerned with the contents of the memory. So you will need to take responsibility for writing a serialized metadata header to prefix the xnd memory. We are using Flatbuffers for our serialized metadata for Arrow

@saulshanabrook
Copy link
Member Author

saulshanabrook commented Mar 7, 2018

Is xnd able to use foreign buffers as backing memory?

Yes. So we should be able to call the Plasma Python API from our C code that exposes our own Python API. By accessing that buffer in C, then we can get access to the pointer.

I will try that approach before moving forward on any kind of C wrapper around the Plasma C++ API.

@skrah
Copy link
Member

skrah commented Mar 7, 2018 via email

@teoliphant
Copy link
Member

The problem with using from_buffer is that PEP-3118 does not support full ndtypes. What I'd like to see is plasma support for any structure that can be described by libndtypes (i.e. datashape). We will need this eventually. It would be good to get collaborating early if possible.

@saulshanabrook
Copy link
Member Author

Does plasma set the format of the the buffer that is returned?

@wesm
Copy link

wesm commented Mar 7, 2018

No -- the Plasma data is just bytes. You'll need to define a serialized data header for xnd data. The way we do it for Arrow objects is:

  • 4 bytes for metadata size
  • serialized metadata (we use Flatbuffers)
  • message body

See https://github.com/apache/arrow/blob/master/format/IPC.md#encapsulated-message-format

@saulshanabrook
Copy link
Member Author

No -- the Plasma data is just bytes. You'll need to define a serialized data header for xnd data.

@wesm OK, that makes sense. Just to make sure I understand correctly, the plasma python API provides a PEP 3118 buffer but without the shape, format, strides, ndim, suboffets fields, so it's really just a pointer to the data. So you can write whatever data format you want, and any type information if you want to use that later to read the data. i.e. there isn't anything arrow specific in plasma.

Is xnd able to use foreign buffers as backing memory? If not, is there a reason why? Basically Plasma needs to allocate the memory itself in a certain way, to be able to hand it to its various client processes.

@pitrou Ideally, we would access Plasma with libxnd, which isn't specific to Python. That is more direct than going through the Python API and also other language bindings for libxnd will be able to use that integration.

@wesm
Copy link

wesm commented Mar 8, 2018

So you can write whatever data format you want, and any type information if you want to use that later to read the data. i.e. there isn't anything arrow specific in plasma.

Yep, that's right. We aren't forcing anyone to use any particular metadata representation. So you're free to define a metadata prefix that uses libndtypes, or anything you like

@saulshanabrook
Copy link
Member Author

I am closing this for now, since I moved the plasma integration to it's own repo. https://github.com/plures/pxnd If I hit any roadblocks there that I think need to be addressed on the Arrow side, I will open an issue on that repo and reach out on the mailing list.

@wesm
Copy link

wesm commented Mar 16, 2018

My two cents: I suspect that a more monorepo-like approach will make your life much easier rather than having to maintain interconnected C APIs amongst various Python packages with C libraries / C extensions. You can also make changes to multiple components at once without causing your CI setup to fall apart (i.e. if you have lots of repos, you may end up with multiple PRs related to a single change, each of which may cause the CI for any of the repos in the stack to break temporarily)

@skrah
Copy link
Member

skrah commented Mar 16, 2018

This is true. It worked because I was really the only developer up to now. :)

I would like to stabilize gumath a bit -- currently it's sometimes convenient to change ndtypes/xnd and worry later about gumath. But in a couple of months I think we can have something like a monorepo.

@skrah
Copy link
Member

skrah commented Mar 16, 2018

Also, if we want to add __truediv__ etc. to xnd objects instead of just using libgumath functions like truediv(), we have a circular relationship between xnd/gumath.

@pcmoritz
Copy link

pcmoritz commented Mar 16, 2018 via email

@skrah
Copy link
Member

skrah commented Mar 16, 2018

@pcmoritz Do you mean the part that is independent from xnd? I think that is a good idea, it would be very nice to have general plasma C bindings.

@pcmoritz
Copy link

pcmoritz commented Mar 16, 2018 via email

@saulshanabrook
Copy link
Member Author

saulshanabrook commented Mar 16, 2018

@wesm Yeah I am finding I might need to make a few changes to xnd as I am developing pxnd, so a monorepo might make sense eventually. It is nice to have the build separate so that xnd doesn't have to depend on Plasma and since authorship is so far split between them.

@pcmoritz Sure, I can do that soon. I am still iterating on them a lot to make sure they work correctly for us. It's my first time writing this sort of code, so I wanna see it working end to end before I believe it's correct. Once they are stable, I am happy to move them out, probably next week or the week after.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants