-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pickling support by converting to numpy #238
Conversation
Here is a pre-built version of the code in this pull request: wheels.zip, you can install it locally by unzipping |
I'm not fond of having a different format than the one we already use for Why did you go this route instead of using |
I think this implementation is pretty close to the use_numpy route in save and load and it should directly benefit from the implementations of numpy arrays being not copied on pickling (even though the conversion to numpy will add a copy). I am not sure how to support the other route (use_numpy=False), and I cannot foresee how much implementation work it is to make it work. I think we need a byte object view of the TensorMap (like |
No, we do! Both branches create a NPZ file, which is just a ZIP file containing NPY data. The rust branch re-implements a writer/parser for NPY files (the
this is the one https://lab-cosmo.github.io/equistore/latest/reference/rust/equistore/io/fn.save.html,
Yes, this is what we discussed last time. In my mind, the simplest way to do this would be to add an eqs_status_t eqs_tensormap_save_buffer(eqs_tensormap_t* tensor, uint8_t** buffer, size_t* length);
eqs_status_t eqs_tensormap_load_buffer(const uint8_t* buffer, size_t length, create_array_callback); and then the returned buffer can be passed to the Pickler directly! The with something like buffer = std::slice::from_raw_parts(buffer, length);
let tensor = crate::io::load(buffer, create_array)?; The def __setstate__(self, buffer):
with open("tmp.npz", "wb") as fd:
fd.write(buffer)
# not sure if one can reset self? Otherwise we need to copy the fields
self = equistore.io.load("tmp.npz")
def __getstate__(self):
equistore.io.save("tmp.npz")
with open("tmp.npz", "rb") as fd:
buffer = fd.read()
return buffer |
First thanks for the long comment, it makes a lot of things clearer to me. I think the naming confused a lot: python's binary buffer and a byte buffer are very different (maybe first below what I mean with this). There are a
This part is not very clear to me. I cannot really create a buffer on the Python side, because I don't know how large the object is. Okay I can transfer this information to the Python side and create a buffer. size = lib.eqs_tensormap_size_as_buffer()
buffer = bytes(b"0"*size)
lib.eqs_tensormap_save_buffer( (ctypes.c_byte * size).from_buffer(buffer), size) Or the Rust function could return a buffer (ptr + size) and on the Python side this can be wrapped directly with the Preventing the byte buffer copyWhat I understood from public resources, to avoid any copy one has to support the python buffer protocol (it's an interface to me, don't know why the naming), so python knows how to read from the rust object without copying from it (e.g. making a memoryview). Since this requires CPython, this would require another layer of interface we need to implement I think (C-API <-> CPython <-> Python). There is a crate pyo3 that offers CPython bindings for Rust, but this seems also like a lot of work to restructure the classes in the Rust code which we wrap (is this the right word?) Python. I guess performance-wise that one copy does not matter much, so I would start with your suggested temporary solution for this PR (just dumping to a file from Rust side and reading it from python side) and then start a new PR with your suggestion to pass a byte buffer. What do you think |
Another possibility to avoid a copy of the buffer seems to be the memoryview type. In particular, PyMemoryView_FromMemory might be doing what we need, and it seems like we can call it dynamically from ctypes (instead of linking to the CPython C-API): https://stackoverflow.com/a/72968176/4692076. We still need to be able to pass pointer + length to Python, and Python needs to be able to free the pointer, so we could have a small wrapper class like this class MallocBuffer:
def __init__(self, ptr, size):
self.ptr = ptr
self.size = size
def view(self):
assert self.ptr is not None
return memoryview(...) # use PyMemoryView_FromMemory
def __del__(self):
# free the pointer when we are done with it
libc.free(self.ptr)
self.ptr = None The C API can then be a single function, returning pointer + size; where the pointer is allocated with Regarding the plan, I would go with something like
Python protocols are the same as interfaces in other languages. |
Before the rebase it worked, wtih the new master For the script import pickle
import equistore.io
from utils import tensor_map
tm = tensor_map()
equistore.io.save('tmp.npz', tm)
with open('tmp.npz', 'rb') as f: b = f.read()
with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)
equistore.io.load('tmp.pikcle') I get
|
I don't expect this to work, the pickle format adds some more data around what we give it. Something like this should work though import pickle
import equistore.io
from utils import tensor_map
tm = tensor_map()
equistore.io.save('tmp.npz', tm)
with open('tmp.npz', 'rb') as f: b = f.read()
with open('tmp.pikcle', 'wb') as f: pickle.dump(b, f)
with open('tmp.pikcle', 'rb') as f: c = pickle.load(f)
with open('tmp-2.npz', 'wb') as f: f.write(c)
equistore.io.load('tmp-2.npz') |
I agree, but it worked before the rebase, so I assumed pickle does not add anything if it gets one binary object. I also tried what you have sent, and this gave me the same error message. So I suspect it the reason lies somewhere else. |
Yes, looks like I forgot to strip a string somewhere in #240. I'll update the code, sorry about that! |
Whoops, github intepreted my "Fix " as close this PR ^_^ Sorry! |
60981e5
to
e41f4b7
Compare
6dee9aa
to
ad777db
Compare
ad777db
to
1f6c7f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall!
0e2a744
to
17e8d63
Compare
wanted to fix warning but did not want to create a separate PR, so I kept 2 commits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small changes and it is good to go!
We use the equistore.io.load and save functions because we do not have yet a function to pass byte objects to rust. This is a temporary implementation to support pickling until such an function exists on the rust side.
42ec800
to
e59fd9c
Compare
First draft of pickling by converting everything to numpy arrays. Creates additional copies for conversion to numpy array, but when numpy arrays and pickle protocol 5 is used, no spurious copies should be made for the values. Need to check this. Also current test only works, because it only handles numpy arrays. I am sure with torch tensor it will fail (gradients are not converted)
📚 Documentation preview 📚: https://equistore--238.org.readthedocs.build/en/238/