Pickling support by converting to numpy #238

Merged
merged 2 commits into from
Apr 17, 2023


agoscinski
Contributor

@agoscinski agoscinski commented Mar 29, 2023

First draft of pickling by converting everything to numpy arrays. The conversion to numpy arrays creates additional copies, but when numpy arrays and pickle protocol 5 are used, no spurious copies should be made for the values. I still need to check this. Also, the current test only works because it only handles numpy arrays. I am sure it will fail with torch tensors (gradients are not converted).


📚 Documentation preview 📚: https://equistore--238.org.readthedocs.build/en/238/

@github-actions

github-actions bot commented Mar 29, 2023

Here is a pre-built version of the code in this pull request: wheels.zip, you can install it locally by unzipping wheels.zip and using pip to install the file matching your system

@Luthaf
Member

Luthaf commented Mar 30, 2023

I'm not fond of having a different format than the one we already use for equistore.save/equistore.load, especially since that means maintaining the two of them when modifying the data structure (such as in #227).

Why did you go this route instead of using equistore.save/equistore.load and making a version of them adapted for pickle?

@agoscinski
Contributor Author

agoscinski commented Mar 30, 2023

I think this implementation is pretty close to the use_numpy route in save and load, and it should directly benefit from numpy arrays not being copied on pickling (even though the conversion to numpy will add a copy).

I am not sure how to support the other route (use_numpy=False), and I cannot foresee how much implementation work it would take to make it work. I think we need a byte object view of the TensorMap (like memoryview(np.array([1,2,3]))) to imitate the behaviour in save and load. As far as I understand for save and load we actually don't use npz format (use_numpy=False branch), we just write the byteobject into some file (not sure what crate::io::save exactly does, could not find doc page). But for pickling we have to provide the byteobject (best only a view) to the Pickler. I am not sure how we can do this. In numpy one can just create a byte view of the array (a view with memoryview(np.array([1,2,3])), or a new object with bytearray(np.array([1,2,3]))), so one can simply create a PickleBuffer from it. That prevents additional in-band copies in the pickling process. One can then do something like return TensorMap._from_ptr, (PickleBuffer(self),), None in __reduce_ex__. This would also need to be done for TensorBlock if we want to be able to pickle it.
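
The `__reduce_ex__` idea above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not equistore code: `Blob` is a hypothetical stand-in for an object that owns one contiguous byte buffer, and the data only travels out-of-band when the caller passes `buffer_callback` and `buffers` together with pickle protocol 5.

```python
import pickle
from pickle import PickleBuffer


class Blob:
    """Hypothetical stand-in for an object owning one contiguous byte buffer."""

    def __init__(self, data):
        # keep a view on the bytes-like object instead of copying it
        self.data = memoryview(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # PickleBuffer lets the pickler hand the memory over out-of-band
            return Blob, (PickleBuffer(self.data),)
        # PickleBuffer is not allowed with older protocols, fall back to a copy
        return Blob, (bytes(self.data),)


original = Blob(bytearray(b"tensor bytes"))

# the pickler appends each out-of-band buffer to this list
buffers = []
payload = pickle.dumps(original, protocol=5, buffer_callback=buffers.append)

# the same buffers have to be handed back, in order, when loading
restored = pickle.loads(payload, buffers=buffers)
```

Without `buffer_callback`, the same `__reduce_ex__` still works, but the bytes are written in-band into the pickle stream, which means one more copy.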

@Luthaf
Member

Luthaf commented Mar 30, 2023

As far as I understand for save and load we actually don't use npz format (use_numpy=False branch)

No, we do! Both branches create a NPZ file, which is just a ZIP file containing NPY data. The rust branch re-implements a writer/parser for NPY files (the equistore_core::io rust module)

not sure what crate::io::save exactly does, could not find doc page

This is the one: https://lab-cosmo.github.io/equistore/latest/reference/rust/equistore/io/fn.save.html. Here, crate expands to the current crate (i.e. package in Python terms), which is equistore.

But for pickling we have to provide the byteobject (best only a view) to the Pickler.

Yes, this is what we discussed last time. In my mind, the simplest way to do this would be to add an eqs_tensormap_save_buffer/eqs_tensormap_load_buffer function to the C API, looking like this:

eqs_status_t eqs_tensormap_save_buffer(eqs_tensormap_t* tensor, uint8_t** buffer, size_t* length);
eqs_status_t eqs_tensormap_load_buffer(const uint8_t* buffer, size_t length, create_array_callback);

and then the returned buffer can be passed to the Pickler directly!

The load function should be pretty easy to write, a simple replacement of https://github.com/lab-cosmo/equistore/blob/6c484592f3532bba97b67b8f88e71b3e47f7939e/equistore-core/src/c_api/io.rs#L105-L107

with something like

// the caller must guarantee that `buffer` points to `length` valid bytes
let buffer = unsafe { std::slice::from_raw_parts(buffer, length) };
let tensor = crate::io::load(buffer, create_array)?;

The save function is a bit harder, because we need to allocate the buffer, but it is not clear who should free the data. We could either allocate with malloc and make Python free the data with free when it is done, or pass an explicit allocator to the function. I'm happy to help find a permanent solution here, but a temporary one which would be backward compatible is to do this

def __setstate__(self, buffer):
    with open("tmp.npz", "wb") as fd:
        fd.write(buffer)

    # assigning to `self` only rebinds the local name, so the fields of the
    # loaded tensor have to be copied onto `self` afterwards
    tensor = equistore.io.load("tmp.npz")

def __getstate__(self):
    # save takes the path first and the tensor second
    equistore.io.save("tmp.npz", self)
    with open("tmp.npz", "rb") as fd:
        buffer = fd.read()

    return buffer
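
One way around the "can one reset self?" question is to use `__reduce__` with a module-level constructor instead of `__getstate__`/`__setstate__`, so that unpickling builds a fresh object. The sketch below is a variation on the temporary-file idea above, not the code from this PR: `Tensor`, `save`, `load` and `_tensor_from_bytes` are hypothetical stand-ins for `TensorMap` and `equistore.io.save`/`equistore.io.load`, and a private temporary directory replaces the fixed `tmp.npz` path.

```python
import os
import pickle
import tempfile


def save(path, tensor):
    # hypothetical stand-in for equistore.io.save
    with open(path, "wb") as fd:
        fd.write(tensor.values)


def load(path):
    # hypothetical stand-in for equistore.io.load
    with open(path, "rb") as fd:
        return Tensor(fd.read())


def _tensor_from_bytes(buffer):
    # write the bytes back to a private temporary file and load from it
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "tensor.npz")
        with open(path, "wb") as fd:
            fd.write(buffer)
        return load(path)


class Tensor:
    """Hypothetical stand-in for TensorMap."""

    def __init__(self, values):
        self.values = values

    def __reduce__(self):
        # serialize through a temporary file, then give the bytes to pickle
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "tensor.npz")
            save(path, self)
            with open(path, "rb") as fd:
                buffer = fd.read()
        # pickle calls _tensor_from_bytes(buffer) to rebuild the object
        return _tensor_from_bytes, (buffer,)


restored = pickle.loads(pickle.dumps(Tensor(b"\x01\x02\x03")))
```

Using a temporary directory per call also avoids two processes overwriting the same `tmp.npz` in the current working directory.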

@agoscinski
Contributor Author

First, thanks for the long comment, it makes a lot of things clearer to me. I think the naming confused me a lot: Python's binary buffer and a byte buffer are very different (I first explain below what I mean by this).

and then the returned buffer can be passed to the Pickler directly!

This part is not very clear to me. I cannot really create a buffer on the Python side, because I don't know how large the object is. Okay, I can transfer this information to the Python side, create a buffer there, and pass it to Rust, which then writes into the buffer.

size = lib.eqs_tensormap_size_as_buffer()
# from_buffer needs a writable object, so allocate a bytearray instead of bytes
buffer = bytearray(size)
lib.eqs_tensormap_save_buffer((ctypes.c_byte * size).from_buffer(buffer), size)

Or the Rust function could return a buffer (ptr + size) and on the Python side this can be wrapped directly with the bytes function and passed to the pickler
https://discuss.python.org/t/ctypes-how-to-convert-a-c-ubyte-array-to-python-bytes/13994/2
But both cases will result in copies of a byte object.
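
The copy in the second variant can be seen in a small ctypes experiment. This is only a sketch: the `native` array is a stand-in for memory owned by the Rust side, since the C API function returning pointer + size is hypothetical at this point of the discussion.

```python
import ctypes

# stand-in for memory owned by the native side (hypothetical), exposed to
# Python as a pointer and a size
native = (ctypes.c_uint8 * 4)(1, 2, 3, 4)
ptr = ctypes.cast(native, ctypes.POINTER(ctypes.c_uint8))
size = ctypes.sizeof(native)

# string_at copies `size` bytes starting at `ptr` into a new bytes object
copied = ctypes.string_at(ptr, size)

# changing the native memory afterwards does not affect the copy
native[0] = 42
```

After this runs, `copied` still holds the original four bytes, which shows that the bytes object handed to the pickler no longer shares memory with the native buffer.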

Preventing the byte buffer copy

What I understood from public resources, to avoid any copy one has to support the python buffer protocol (it's an interface to me, don't know why the naming), so python knows how to read from the rust object without copying from it (e.g. making a memoryview).
https://docs.python.org/3/c-api/buffer.html
https://github.com/python/cpython/blob/main/Include/pybuffer.h

Since this requires the CPython C API, I think we would need to implement another interface layer (C-API <-> CPython <-> Python).
http://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/#Adding-the-Buffer-Protocol

There is a crate, pyo3, that offers CPython bindings for Rust, but it also seems like a lot of work to restructure the classes in the Rust code which we wrap (is this the right word?) in Python.
https://docs.rs/pyo3/0.11.1/x86_64-apple-darwin/pyo3/class/buffer/index.html
Issue discussing the Python buffer protocol: PyO3/pyo3#1385

I guess performance-wise that one copy does not matter much, so I would start with your suggested temporary solution for this PR (just dumping to a file from the Rust side and reading it from the Python side) and then start a new PR with your suggestion to pass a byte buffer. What do you think?

@Luthaf
Member

Luthaf commented Apr 3, 2023

Another possibility to avoid a copy of the buffer seems to be the memoryview type. In particular, PyMemoryView_FromMemory might be doing what we need, and it seems like we can call it dynamically from ctypes (instead of linking to the CPython C-API): https://stackoverflow.com/a/72968176/4692076.

We still need to be able to pass pointer + length to Python, and Python needs to be able to free the pointer, so we could have a small wrapper class like this

class MallocBuffer:
    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size
    
    def view(self):
        assert self.ptr is not None
        return memoryview(...)  # use PyMemoryView_FromMemory
    
    def __del__(self):
        # free the pointer when we are done with it
        libc.free(self.ptr)
        self.ptr = None

The C API can then be a single function, returning pointer + size; where the pointer is allocated with malloc/realloc. I remember it was a bit problematic to access free from ctypes though, so we would want to check this.
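
The `PyMemoryView_FromMemory` call mentioned above can be tried from ctypes without linking against the CPython C API. This is a rough, CPython-only sketch: `view_memory` is a hypothetical helper name, and a ctypes array stands in for the malloc'ed buffer that the C API would return.

```python
import ctypes

PyBUF_READ = 0x100  # value of PyBUF_READ in CPython's C headers

# look up the function in the running interpreter instead of linking to it
_from_memory = ctypes.pythonapi.PyMemoryView_FromMemory
_from_memory.restype = ctypes.py_object
_from_memory.argtypes = (ctypes.c_void_p, ctypes.c_ssize_t, ctypes.c_int)


def view_memory(address, size):
    """Read-only memoryview over `size` bytes at `address`, without copying."""
    return _from_memory(address, size, PyBUF_READ)


# stand-in for a malloc'ed buffer returned by the C API (hypothetical)
storage = (ctypes.c_uint8 * 4)(1, 2, 3, 4)
view = view_memory(ctypes.addressof(storage), ctypes.sizeof(storage))

first = bytes(view)
storage[0] = 42  # the view sees the change because no copy was made
second = bytes(view)
```

The memoryview created this way does not own the memory and does not keep it alive, which is why a wrapper such as `MallocBuffer` has to outlive every view it hands out before freeing the pointer.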

Regarding the plan, I would go with something like

  • step 0: write to a temporary file & read it. This allows us to use pickle for other things in a backward compatible way
  • step 1: write the load function, which should be the easiest
  • step 2: write the save function

What I understood from public resources, to avoid any copy one has to support the python buffer protocol (it's an interface to me, don't know why the naming)

Python protocols are the same as interfaces in other languages.

@agoscinski agoscinski marked this pull request as ready for review April 5, 2023 19:35
@agoscinski
Contributor Author

Before the rebase it worked. With the new master, for the script

import pickle
import equistore.io

from utils import tensor_map

tm = tensor_map()

equistore.io.save('tmp.npz', tm)

with open('tmp.npz', 'rb') as f: b = f.read()

with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)

equistore.io.load('tmp.pikcle')

I get

Traceback (most recent call last):
  File "/home/alexgo/code/c-api-pickle/equistoresave.py", line 14, in <module>
    equistore.io.load('tmp.pikcle')
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/io.py", line 34, in load
    return load_custom_array(path, create_numpy_array)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/io.py", line 69, in load_custom_array
    return TensorMap._from_ptr(ptr)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/tensor.py", line 68, in _from_ptr
    _check_pointer(ptr)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/status.py", line 48, in _check_pointer
    raise EquistoreError(last_error())
equistore.status.EquistoreError: serialization format error: invalid header in file: expected a string, got ' '

@Luthaf
Member

Luthaf commented Apr 6, 2023

I don't expect this to work, the pickle format adds some more data around what we give it. Something like this should work though

import pickle
import equistore.io

from utils import tensor_map

tm = tensor_map()

equistore.io.save('tmp.npz', tm)

with open('tmp.npz', 'rb') as f: b = f.read()

with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)

with open('tmp.pikcle', 'rb') as f: c = pickle.load(f)

with open('tmp-2.npz', 'wb') as f: f.write(c)

equistore.io.load('tmp-2.npz')
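
The point that pickle adds data around the bytes it is given can be checked with the standard library alone. In this small sketch the `raw` payload is made up and only stands in for the content of an NPZ file.

```python
import pickle

# made-up payload standing in for the bytes of an NPZ file
raw = b"PK\x03\x04 pretend these are NPZ bytes"

# the pickle stream wraps the payload with opcodes and a length prefix
pickled = pickle.dumps(raw)

# only pickle.load/pickle.loads gives back the exact original bytes
recovered = pickle.loads(pickled)
```

Here `pickled` is longer than `raw` and is not a valid ZIP/NPZ file on its own, while `recovered` is byte-for-byte identical to `raw`.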

@agoscinski
Contributor Author

I don't expect this to work, the pickle format adds some more data around what we give it. Something like this should work though

I agree, but it worked before the rebase, so I assumed pickle does not add anything if it gets one binary object. I also tried what you sent, and it gave me the same error message. So I suspect the reason lies somewhere else.

@Luthaf
Member

Luthaf commented Apr 6, 2023

Yes, looks like I forgot to strip a string somewhere in #240. I'll update the code, sorry about that!

@Luthaf
Member

Luthaf commented Apr 11, 2023

Whoops, GitHub interpreted my "Fix " as close this PR ^_^ Sorry!

Member

@Luthaf Luthaf left a comment

Looks good overall!

@agoscinski agoscinski force-pushed the numpy-pickling branch 5 times, most recently from 0e2a744 to 17e8d63 Compare April 13, 2023 16:47
@agoscinski agoscinski requested a review from Luthaf April 13, 2023 17:26
@agoscinski
Contributor Author

I wanted to fix a warning but did not want to create a separate PR, so I kept 2 commits.

Member

@Luthaf Luthaf left a comment

Two small changes and it is good to go!

We use the equistore.io.load and save functions because we do not yet
have a function to pass byte objects to Rust. This is a temporary
implementation to support pickling until such a function exists on
the Rust side.
@Luthaf Luthaf merged commit 8e2e1f0 into master Apr 17, 2023
@Luthaf Luthaf deleted the numpy-pickling branch April 17, 2023 09:23