-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support one rust type serializing as many dtypes #15
base: master
Are you sure you want to change the base?
Conversation
Summary of all changesAdditions / Big changesThe new serialization traitsSerializable has been split up quite a bit, because not all operations can be done on all types. /// Trait that permits reading a type from an `.npy` file.
pub trait Deserialize: Sized {
/// Think of this as a `Fn(&[u8]) -> Self`, with bonuses.
///
/// Unfortunately, until rust supports existential associated types, actual closures
/// cannot be used here, and you must define something that manually implements [`TypeRead`].
type Reader: TypeRead<Value=Self>;
/// Get a function that deserializes a single data field at a time
///
/// The function receives a byte buffer containing at least
/// `dtype.num_bytes()` bytes.
///
/// # Errors
///
/// Returns `Err` if the `DType` is not compatible with `Self`.
fn reader(dtype: &DType) -> Result<Self::Reader, DTypeError>;
}
/// Trait that permits writing a type to an `.npy` file.
pub trait Serialize {
/// Think of this as some sort of `for<W: io::Write> Fn(W, &Self) -> io::Result<()>`.
///
/// Unfortunately, rust does not have generic closures, so you must manually define
/// your own implementor of the [`TypeWrite`] trait.
type Writer: TypeWrite<Value=Self>;
/// Get a function that serializes a single data field at a time.
///
/// # Errors
///
/// Returns `Err` if the `DType` is not compatible with `Self`.
fn writer(dtype: &DType) -> Result<Self::Writer, DTypeError>;
}
/// Subtrait of [`Serialize`] for types which have a reasonable default `DType`.
///
/// This opens up some simpler APIs for serialization. (e.g. [`::to_file`])
pub trait AutoSerialize: Serialize {
/// A suggested format for serialization.
///
/// The builtin implementations for primitive types generally prefer `|` endianness if possible,
/// else the machine endian format.
fn default_dtype() -> DType;
}
Where's
|
Results of the existing Before
After
Yeowch. This is probably due to the double indirection in the current integer/float impls, which likely prevents inlining. There's no longer any good reason for them to have this indirection since I've decided not to support promotions like Edit: Further benchmarksEdit: After fixing indirection
Edit: After adding dedicated Little Endian newtypes
The issue seems to be the fact that it no longer statically knows the strides. I have an idea for how to fix this: pub trait TypeRead {
type Value;
fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
} |
The For this reason I've decided against the inclusion of the fixed-endian wrapper types in this PR. Before:
After:
|
I removed a couple of things that can be backwards-compatibly added back later. What remains are basically the things I need for unit tests. |
This enables these derives to be qualified under `npy`, for disambiguation from `serde`: extern crate npy; #[derive(npy::Serialize, npy::Deserialize)] struct MyStruct { ... } This has a couple of downsides with regard to maintainence: * npy_derive can no longer be a dev-dependency because it must become an optional dependency. * Many tests and examples need the feature. We need to list all of these in Cargo.toml. * Because this crate is 2015 edition, as soon as we list *any* tests and examples, we must list *all* of them; including the ones that don't need the feature! --- This commit had to update `.travis.yml` to start using the feature. I took this opportunity to also add `--examples` (which the default script does not do) to ensure that examples build correctly.
The new derive macros will need this...
This had to wait until after the derives were added so that the tests could use the derives.
This is the single most important commit in the PR. All breaking changes to existing public APIs are contained in here. Serialize is completely removed. Examples and tests are not yet updated, so they are broken in this commit.
Fix a couple of things I missed while rewriting and reorganizing the commit history.
I tried a variety of things to optimize this function: * Replacing usage of get_unchecked with reuse of the remainder returned by read_one, so that the stride can be statically known rather than having to be looked up. (this is what optimized the old read benchmark) * Putting an assertion up front to prove that the data vector is long enough. But whatever I do, performance won't budge. In the f32 benchmark, a very hot bounds check still occurs on every read to ensure that the length of the data is at least 4 bytes. So I'm adding the benchmark, but leaving the function itself alone.
This is finished now! I completely rewrote the git history from scratch, organizing all of the changes into easily reviewable groups. You should be able to read each individual commit one by one and make sense of them. |
2015 edition crates do not use NLL yet in the latest stable compiler, so our derive macro must be conservative. Apparently, in the latest nightly, this was changed; 2015 edition will at some point use NLL in the future. This is why I did not notice the problem at first!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you for your heroic effort to improve and generalize this library! And I'm sorry I didn't have time to review your changes earlier.
I didn't review all the commits & changes just yet, but I'm quite sure I agree with all your architectural decisions. Your solution of using two-stage serialization to efficiently support all possible formats is particularly neat.
I commented on a few things I found while browsing the commits, but those are just nit-picks. I will try to finish the review so that we can merge everything ASAP.
Edit: And thanks for the detailed write-up, it really helped understand the changes. :-)
Co-Authored-By: Pavel Potocek <pavelpotocek@gmail.com>
It was an artefact of an old design. Out of paranoia, I added some assertions to the Serialize/Deserialize impls to make sure the endianness is valid. These are redundant since it is checked in TypeStr::from_str, but most other such properties are at least implicitly checked by the `_` arm in impls of `reader` and `writer` and I wanted to be safe.
I'm not really sure why I had it return a clone in the first place...
Two notes:
|
@potocpav have you had any time to finish the review? |
There's a couple of WTFs in this that I am aware about after cleaning them up on my fork. I can include those changes here if you want:
|
I'm also now having second thoughts about reading DateTimes/TimeDeltas as u64/i64 in this PR. Basically, I think that at some point I'll want to add back the widening conversions for integers (because integer arrays produced in python code will often have a type that is dynamically chosen between I really wish I could find a way to make this PR smaller... (the only bit that I think can really be pulled out is the |
Rats, I just realized DateTime should be an |
I don't want to yet commit to a specific API for serializing DateTime/TimeDelta, so that we can keep open the option of widening conversions.
Is there any progress on this, just cause I've come across this issue myself trying to deserialise u8 numpy arrays |
So it is now two years later; Pavel never responded again to this PR or the other, and I found myself needing
It can be found here: https://github.com/ExpHP/npyz Hopefully one day these additions can be merged upstream, so that they can finally be under the name |
This is now ready for review!
I have chosen to finally submit a PR now that
cargo test --all
succeeds, and the behavior of all of the existing examples has been verified. Some work still remains to be done.Overview
|Sn
and binary blobs|Vn
(asVec<u8>
)<M8[us]
and timedeltas<m8[us]
(asu64
andi64
)Serializable
had to be split into three traits to accomodate these features. More about this later.np.float16
,np.float128
, and unicode strings in the future.This is a large PR. However, as of Thursday June 6, I rewrote the commit history so that each commit can be easily reviewed. (there is no longer anything introduced in one commit that gets changed in another)
Closes #11.
Closes #12.
Closes #19.
I will add a comment which itemizes all of the changes to the public API and the reasoning behind them.