Pickling support by converting to numpy #238

agoscinski · 2023-03-29T17:47:03Z

First draft of pickling by converting everything to numpy arrays. This creates additional copies during the conversion to numpy arrays, but when numpy arrays and pickle protocol 5 are used, no spurious copies should be made for the values. I still need to check this. Also, the current test only works because it only handles numpy arrays. I am sure it will fail with torch tensors (gradients are not converted).

📚 Documentation preview 📚: https://equistore--238.org.readthedocs.build/en/238/

github-actions · 2023-03-29T18:02:05Z

Here is a pre-built version of the code in this pull request: wheels.zip, you can install it locally by unzipping wheels.zip and using pip to install the file matching your system

Luthaf · 2023-03-30T07:50:51Z

I'm not fond of having a different format than the one we already use for equistore.save/equistore.load, especially since that means maintaining the two of them when modifying the data structure (such as in #227).

Why did you go this route instead of using equistore.save/equistore.load and making a version of them adapted for pickle?

agoscinski · 2023-03-30T12:17:06Z

I think this implementation is pretty close to the use_numpy route in save and load, and it should directly benefit from numpy arrays not being copied on pickling (even though the conversion to numpy will add a copy).

I am not sure how to support the other route (use_numpy=False), and I cannot foresee how much implementation work it would take. I think we need a byte object view of the TensorMap (like memoryview(np.array([1,2,3]))) to imitate the behaviour in save and load. As far as I understand for save and load we actually don't use npz format (use_numpy=False branch), we just write the byteobject into some file (not sure what crate::io::save exactly does, could not find doc page). But for pickling we have to provide the byteobject (best only a view) to the Pickler. I am not sure how we can do this. In numpy one can just create a byte view of the array (a view with memoryview(np.array([1,2,3])) or an object with bytearray(np.array([1,2,3]))), so one can simply create a PickleBuffer from it. That prevents additional in-band copies in the pickling process. One can then do something like return TensorMap._from_ptr, (PickleBuffer(self),), None in __reduce_ex__. This would also need to be done for TensorBlock if we want to be able to pickle it.
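
As a minimal sketch of the __reduce_ex__ idea, using only the standard library: the ZeroCopyByteArray class below is a stand-in for TensorMap (it is not part of equistore) and follows the pattern shown in the Python pickle documentation. With protocol 5 and a buffer_callback, the data is handed to the pickler out-of-band as a PickleBuffer instead of being copied into the pickle stream.

```python
import pickle


class ZeroCopyByteArray(bytearray):
    """Stand-in for TensorMap: any object exposing its bytes as a buffer."""

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Hand a view of the data to the pickler, without copying it
            return type(self)._reconstruct, (pickle.PickleBuffer(self),), None
        # PickleBuffer is not allowed with protocols <= 4: copy instead
        return type(self)._reconstruct, (bytearray(self),)

    @classmethod
    def _reconstruct(cls, obj):
        with memoryview(obj) as m:
            # Get a handle on the object that owns the buffer
            owner = m.obj
            if type(owner) is cls:
                # The original object was passed through: return it as-is
                return owner
            return cls(owner)


original = ZeroCopyByteArray(b"abc")
buffers = []
# buffer_callback collects the PickleBuffer objects out-of-band
data = pickle.dumps(original, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(data, buffers=buffers)
```

Here restored is the very same object as original, because the buffer never left the process. When the buffers are sent elsewhere, the consumer decides how to transport them, which is where copies can be avoided.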

Luthaf · 2023-03-30T12:38:29Z

As far as I understand for save and load we actually don't use npz format (use_numpy=False branch)

No, we do! Both branches create a NPZ file, which is just a ZIP file containing NPY data. The rust branch re-implements a writer/parser for NPY files (the equistore_core::io rust module)
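
To illustrate that an NPZ file is just a ZIP archive of NPY entries, here is a small standard-library sketch that builds one in memory and reads the NPY header back. The helper names (npy_bytes, npz_bytes, read_npy_header) are made up for this example and are not equistore functions; the sketch only handles 1-D float64 data and version 1.0 of the NPY format.

```python
import io
import struct
import zipfile

MAGIC = b"\x93NUMPY"


def npy_bytes(values):
    """Encode a list of floats as a 1-D little-endian float64 NPY v1.0 blob."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad with spaces so magic + version + length + header is a multiple of 64
    unpadded = len(MAGIC) + 2 + 2 + len(header) + 1
    header += " " * (-unpadded % 64) + "\n"
    prefix = MAGIC + bytes([1, 0]) + struct.pack("<H", len(header))
    data = struct.pack("<%dd" % len(values), *values)
    return prefix + header.encode("latin1") + data


def npz_bytes(arrays):
    """Store each array as a `<name>.npy` entry of an in-memory ZIP archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as archive:
        for name, values in arrays.items():
            archive.writestr(name + ".npy", npy_bytes(values))
    return out.getvalue()


def read_npy_header(entry):
    """Return the textual header (a Python dict literal) of an NPY v1.0 blob."""
    assert entry[:6] == MAGIC
    (length,) = struct.unpack("<H", entry[8:10])
    return entry[10:10 + length].decode("latin1")


blob = npz_bytes({"values": [1.0, 2.0]})
with zipfile.ZipFile(io.BytesIO(blob)) as archive:
    names = archive.namelist()
    entry = archive.read("values.npy")
header = read_npy_header(entry)
```

Since everything goes through io.BytesIO, the same bytes could be handed to a pickler instead of being written to disk.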

not sure what crate::io::save exactly does, could not find doc page

This is the one: https://lab-cosmo.github.io/equistore/latest/reference/rust/equistore/io/fn.save.html (crate expands to the current crate, i.e. the package in Python terms, here equistore).

But for pickling we have to provide the byteobject (best only a view) to the Pickler.

Yes, this is what we discussed last time. In my mind, the simplest way to do this would be to add an eqs_tensormap_save_buffer/eqs_tensormap_load_buffer function to the C API, looking like this:

eqs_status_t eqs_tensormap_save_buffer(eqs_tensormap_t* tensor, uint8_t** buffer, size_t* length);
eqs_status_t eqs_tensormap_load_buffer(const uint8_t* buffer, size_t length, create_array_callback);

and then the returned buffer can be passed to the Pickler directly!

The load function should be pretty easy to write, a simple replacement of https://github.com/lab-cosmo/equistore/blob/6c484592f3532bba97b67b8f88e71b3e47f7939e/equistore-core/src/c_api/io.rs#L105-L107

with something like

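// `from_raw_parts` is unsafe: the caller must guarantee that `buffer`
// points to `length` valid bytes
let buffer = unsafe { std::slice::from_raw_parts(buffer, length) };
let tensor = crate::io::load(buffer, create_array)?;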

The save function is a bit harder, because we need to allocate memory for the buffer, and it is not clear who should free the data. We could either allocate with malloc and make Python free the data with free when it is done, or pass an explicit allocator to the function. I'm happy to help find a permanent solution here, but a temporary one which would be backward compatible is to do this:

def __setstate__(self, buffer):
    with open("tmp.npz", "wb") as fd:
        fd.write(buffer)

    # not sure if one can reset self? Assigning to `self` only rebinds the
    # local name, so otherwise we need to copy the fields
    self = equistore.io.load("tmp.npz")

def __getstate__(self):
    # save takes the path first and the tensor second
    equistore.io.save("tmp.npz", self)
    with open("tmp.npz", "rb") as fd:
        buffer = fd.read()

    return buffer
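
A self-contained sketch of this temporary-file round trip, under stated assumptions: Data is a hypothetical placeholder for TensorMap, and save/load are stand-ins for equistore.io.save/equistore.io.load that simply write and read raw bytes. It uses a temporary directory instead of a fixed "tmp.npz" path and copies the fields of the loaded object, since assigning to self inside __setstate__ has no effect on the caller.

```python
import os
import pickle
import tempfile


def save(path, obj):
    """Stand-in for equistore.io.save: write the payload bytes to `path`."""
    with open(path, "wb") as fd:
        fd.write(obj.payload)


def load(path):
    """Stand-in for equistore.io.load: build a new object from `path`."""
    with open(path, "rb") as fd:
        return Data(fd.read())


class Data:
    """Hypothetical placeholder for TensorMap."""

    def __init__(self, payload):
        self.payload = payload

    def __getstate__(self):
        # Serialize through a file, then hand the raw bytes to pickle
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "tmp.npz")
            save(path, self)
            with open(path, "rb") as fd:
                return fd.read()

    def __setstate__(self, buffer):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = os.path.join(tmpdir, "tmp.npz")
            with open(path, "wb") as fd:
                fd.write(buffer)
            loaded = load(path)
        # `self = loaded` would only rebind the local name: copy the fields
        self.__dict__.update(loaded.__dict__)


original = Data(b"\x00\x01\x02")
restored = pickle.loads(pickle.dumps(original))
```

For an object that owns a native pointer, copying the fields like this would also have to transfer ownership, so that the pointer is not freed twice when the temporary object is collected.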

agoscinski · 2023-04-02T18:09:56Z

First, thanks for the long comment; it makes a lot of things clearer to me. I think the naming caused a lot of confusion: Python's binary buffer and a byte buffer are very different (I explain below what I mean by this).

and then the returned buffer can be passed to the Pickler directly!

This part is not very clear to me. I cannot really create a buffer on the Python side, because I don't know how large the object is. Okay, I can transfer this information to the Python side, create a buffer, and pass it to Rust, which then writes into the buffer.

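# hypothetical C API: first query the size, then let Rust fill the buffer
size = lib.eqs_tensormap_size_as_buffer()
# from_buffer needs a writable object, so use bytearray instead of bytes
buffer = bytearray(size)
lib.eqs_tensormap_save_buffer((ctypes.c_byte * size).from_buffer(buffer), size)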

Or the Rust function could return a buffer (ptr + size) and on the Python side this can be wrapped directly with the bytes function and passed to the pickler
https://discuss.python.org/t/ctypes-how-to-convert-a-c-ubyte-array-to-python-bytes/13994/2
But both cases will result in copies of a byte object.
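
A small sketch of the second option, to show where the copy happens. The ctypes array below only simulates memory owned by the native library; ctypes.string_at is the standard-library call that turns pointer + size into a bytes object.

```python
import ctypes

# Pretend this memory was allocated and filled by the native library
native = (ctypes.c_ubyte * 4)(1, 2, 3, 4)
ptr = ctypes.cast(native, ctypes.POINTER(ctypes.c_ubyte))
size = len(native)

# string_at copies `size` bytes starting at `ptr` into a new bytes object
copied = ctypes.string_at(ptr, size)

# Changing the native memory afterwards does not affect the copy
native[0] = 255
```

The resulting bytes object is independent of the native memory, which is convenient for ownership but means the data exists twice for a while.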

Preventing the byte buffer copy

What I understood from public resources, to avoid any copy one has to support the python buffer protocol (it's an interface to me, don't know why the naming), so python knows how to read from the rust object without copying from it (e.g. making a memoryview).
https://docs.python.org/3/c-api/buffer.html
https://github.com/python/cpython/blob/main/Include/pybuffer.h
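
For reference, this is what the buffer protocol gives on the Python side, sketched with a bytearray as the object exporting its memory (an assumption for illustration; a Rust-backed object would have to export its buffer in a similar way).

```python
# bytearray implements the buffer protocol, so memoryview can wrap it
storage = bytearray(b"\x0a\x14\x1e\x28")

# The view shares memory with `storage`; nothing is copied here
view = memoryview(storage)

# A write through the owner is visible through the view, and vice versa
storage[0] = 99
seen_through_view = view[0]
view[1] = 7
seen_through_owner = storage[1]

# bytes(...) is the explicit copy: later writes no longer affect it
snapshot = bytes(view)
storage[2] = 0
```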

Since this requires CPython, this would require another layer of interface we need to implement I think (C-API <-> CPython <-> Python).
http://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/#Adding-the-Buffer-Protocol

There is a crate, pyo3, that offers CPython bindings for Rust, but this also seems like a lot of work to restructure the classes in the Rust code which we wrap (is this the right word?) in Python.
https://docs.rs/pyo3/0.11.1/x86_64-apple-darwin/pyo3/class/buffer/index.html
Issue talking about the python buffer protocol PyO3/pyo3#1385

I guess performance-wise that one copy does not matter much, so I would start with your suggested temporary solution for this PR (just dumping to a file from the Rust side and reading it from the Python side) and then start a new PR with your suggestion to pass a byte buffer. What do you think?

Luthaf · 2023-04-03T07:42:56Z

Another possibility to avoid a copy of the buffer seems to be the memoryview type. In particular, PyMemoryView_FromMemory might be doing what we need, and it seems like we can call it dynamically from ctypes (instead of linking to the CPython C-API): https://stackoverflow.com/a/72968176/4692076.

We still need to be able to pass pointer + length to Python, and Python needs to be able to free the pointer, so we could have a small wrapper class like this

class MallocBuffer:
    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size
    
    def view(self):
        assert self.ptr is not None
        return memoryview(...)  # use PyMemoryView_FromMemory
    
    def __del__(self):
        # free the pointer when we are done with it
        libc.free(self.ptr)
        self.ptr = None

The C API can then be a single function, returning pointer + size; where the pointer is allocated with malloc/realloc. I remember it was a bit problematic to access free from ctypes though, so we would want to check this.
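
A runnable sketch of such a wrapper, under a few assumptions: it is POSIX-only (malloc/free are looked up through ctypes.CDLL(None)), the view is read-only, and the malloc call at the end merely simulates the native library returning pointer + size. PyMemoryView_FromMemory is called through ctypes.pythonapi, so nothing is linked against the CPython C-API at build time.

```python
import ctypes

# POSIX assumption: symbols of the C library are visible through the
# handle of the running process
libc = ctypes.CDLL(None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

PyBUF_READ = 0x100  # value of the CPython constant with the same name

from_memory = ctypes.pythonapi.PyMemoryView_FromMemory
from_memory.restype = ctypes.py_object
from_memory.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t, ctypes.c_int]


class MallocBuffer:
    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size

    def view(self):
        assert self.ptr is not None
        # The view does not keep `self` alive: the caller must make sure the
        # MallocBuffer outlives every view created from it
        return from_memory(self.ptr, self.size, PyBUF_READ)

    def __del__(self):
        # free the pointer when we are done with it
        if self.ptr is not None:
            libc.free(self.ptr)
            self.ptr = None


# Stand-in for the native library returning pointer + size
size = 4
ptr = libc.malloc(size)
ctypes.memmove(ptr, b"abcd", size)

buffer = MallocBuffer(ptr, size)
content = bytes(buffer.view())
```

The main caveat is lifetime: a memoryview created this way knows nothing about the MallocBuffer, so freeing the pointer while a view is still in use would be a use-after-free.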

Regarding the plan, I would go with something like

step 0: write to a temporary file & read it. This allows us to use pickle for other things in a backward compatible way
step 1: write the load function, which should be the easiest
step 2: write the save function

What I understood from public resources, to avoid any copy one has to support the python buffer protocol (it's an interface to me, don't know why the naming)

Python protocols are the same as interfaces in other languages.

agoscinski · 2023-04-05T19:58:12Z

Before the rebase it worked. With the new master, for the script

import pickle
import equistore.io

from utils import tensor_map

tm = tensor_map()

equistore.io.save('tmp.npz', tm)

with open('tmp.npz', 'rb') as f: b = f.read()

with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)

equistore.io.load('tmp.pikcle')

I get

Traceback (most recent call last):
  File "/home/alexgo/code/c-api-pickle/equistoresave.py", line 14, in <module>
    equistore.io.load('tmp.pikcle')
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/io.py", line 34, in load
    return load_custom_array(path, create_numpy_array)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/io.py", line 69, in load_custom_array
    return TensorMap._from_ptr(ptr)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/tensor.py", line 68, in _from_ptr
    _check_pointer(ptr)
  File "/opt/anaconda3/lib/python3.10/site-packages/equistore/status.py", line 48, in _check_pointer
    raise EquistoreError(last_error())
equistore.status.EquistoreError: serialization format error: invalid header in file: expected a string, got ' '

Luthaf · 2023-04-06T07:46:48Z

I don't expect this to work, the pickle format adds some more data around what we give it. Something like this should work though

import pickle
import equistore.io

from utils import tensor_map

tm = tensor_map()

equistore.io.save('tmp.npz', tm)

with open('tmp.npz', 'rb') as f: b = f.read()

with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)

with open('tmp.pikcle', 'rb') as f: c = pickle.load(f)

with open('tmp-2.npz', 'wb') as f: f.write(c)

equistore.io.load('tmp-2.npz')
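
The extra data added by pickle can be seen with the standard library alone. In this sketch the payload is a made-up byte string standing in for the content of tmp.npz.

```python
import pickle

payload = b"PK\x03\x04 pretend this is the content of tmp.npz"
pickled = pickle.dumps(payload)

# The pickle stream starts with a protocol marker, not with the ZIP magic
starts_like_zip = pickled.startswith(b"PK")
# The payload is embedded inside the stream, surrounded by pickle opcodes
is_embedded = payload in pickled
# Only unpickling gives back the exact bytes that were written
roundtrip = pickle.loads(pickled)
```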

agoscinski · 2023-04-06T13:54:12Z

I don't expect this to work, the pickle format adds some more data around what we give it. Something like this should work though

I agree, but it worked before the rebase, so I assumed pickle does not add anything if it gets one binary object. I also tried what you sent, and it gave me the same error message. So I suspect the reason lies somewhere else.

Luthaf · 2023-04-06T13:56:09Z

Yes, looks like I forgot to strip a string somewhere in #240. I'll update the code, sorry about that!

Luthaf · 2023-04-11T12:52:45Z

Whoops, GitHub interpreted my "Fix " as "close this PR" ^_^ Sorry!

python/src/equistore/tensor.py

Luthaf

Looks good overall!

python/src/equistore/tensor.py

agoscinski · 2023-04-13T17:27:47Z

I wanted to fix a warning but did not want to create a separate PR, so I kept 2 commits.

Luthaf

Two small changes and it is good to go!

python/src/equistore/tensor.py

We use the equistore.io.load and save functions because we do not yet have a function to pass byte objects to Rust. This is a temporary implementation to support pickling until such a function exists on the Rust side.

agoscinski marked this pull request as ready for review April 5, 2023 19:35

Luthaf mentioned this pull request Apr 6, 2023

Skip whitespaces after opening { in npy header #249

Merged

PicoCentauri closed this in #249 Apr 11, 2023

Luthaf reopened this Apr 11, 2023

agoscinski force-pushed the numpy-pickling branch from 60981e5 to e41f4b7 Compare April 12, 2023 07:07

agoscinski mentioned this pull request Apr 12, 2023

Implement Python's pickle protocol for TensorMap #94

Closed

agoscinski force-pushed the numpy-pickling branch 3 times, most recently from 6dee9aa to ad777db Compare April 12, 2023 09:37

Luthaf reviewed Apr 12, 2023

agoscinski force-pushed the numpy-pickling branch from ad777db to 1f6c7f6 Compare April 12, 2023 11:40

agoscinski requested a review from Luthaf April 12, 2023 13:14

Luthaf reviewed Apr 12, 2023

agoscinski force-pushed the numpy-pickling branch 5 times, most recently from 0e2a744 to 17e8d63 Compare April 13, 2023 16:47

agoscinski requested a review from Luthaf April 13, 2023 17:26

Luthaf approved these changes Apr 14, 2023

agoscinski added 2 commits April 15, 2023 17:43

rudimentary pickle support by creating tmp files

a112220

We use the equistore.io.load and save functions because we do not yet have a function to pass byte objects to Rust. This is a temporary implementation to support pickling until such a function exists on the Rust side.

fixing warning message in io.py

e59fd9c

agoscinski force-pushed the numpy-pickling branch from 42ec800 to e59fd9c Compare April 15, 2023 15:43

agoscinski requested a review from Luthaf April 15, 2023 15:44

Luthaf approved these changes Apr 17, 2023

Luthaf merged commit 8e2e1f0 into master Apr 17, 2023

Luthaf deleted the numpy-pickling branch April 17, 2023 09:23
