Support one rust type serializing as many dtypes #15

ExpHP · 2019-06-04T18:50:18Z

This is now ready for review!

I have chosen to finally submit a PR now that cargo test --all succeeds, and the behavior of all of the existing examples has been verified. Some work still remains to be done.

Overview

- Adds support for non-little endianness.
- Adds de/serialization of the following dtypes:
  - Byte strings |Sn and binary blobs |Vn (as Vec<u8>)
  - Datetimes <M8[us] and timedeltas <m8[us] (as u64 and i64)
- Serializable had to be split into three traits to accommodate these features. More about this later.
- Adds full parsing and validation of string descrs, hopefully making it easier to add support for complex numbers, bools, np.float16, np.float128, and unicode strings in the future.

This is a large PR. However, as of Thursday June 6, I rewrote the commit history so that each commit can be easily reviewed. (there is no longer anything introduced in one commit that gets changed in another)

Closes #11.
Closes #12.
Closes #19.

I will add a comment which itemizes all of the changes to the public API and the reasoning behind them.

ExpHP · 2019-06-04T20:54:22Z

Summary of all changes

Additions / Big changes

The new serialization traits

Serializable has been split up quite a bit, because not all operations can be done on all types.

/// Trait that permits reading a type from an `.npy` file.
pub trait Deserialize: Sized {
    /// Think of this as a `Fn(&[u8]) -> Self`, with bonuses.
    ///
    /// Unfortunately, until rust supports existential associated types, actual closures
    /// cannot be used here, and you must define something that manually implements [`TypeRead`].
    type Reader: TypeRead<Value=Self>;

    /// Get a function that deserializes a single data field at a time
    ///
    /// The function receives a byte buffer containing at least
    /// `dtype.num_bytes()` bytes.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn reader(dtype: &DType) -> Result<Self::Reader, DTypeError>;
}

/// Trait that permits writing a type to an `.npy` file.
pub trait Serialize {
    /// Think of this as some sort of `for<W: io::Write> Fn(W, &Self) -> io::Result<()>`.
    ///
    /// Unfortunately, rust does not have generic closures, so you must manually define
    /// your own implementor of the [`TypeWrite`] trait.
    type Writer: TypeWrite<Value=Self>;

    /// Get a function that serializes a single data field at a time.
    ///
    /// # Errors
    ///
    /// Returns `Err` if the `DType` is not compatible with `Self`.
    fn writer(dtype: &DType) -> Result<Self::Writer, DTypeError>;
}

/// Subtrait of [`Serialize`] for types which have a reasonable default `DType`.
///
/// This opens up some simpler APIs for serialization. (e.g. [`::to_file`])
pub trait AutoSerialize: Serialize {
    /// A suggested format for serialization.
    ///
    /// The builtin implementations for primitive types generally prefer `|` endianness if possible,
    /// else the machine endian format.
    fn default_dtype() -> DType;
}

Deserialize is implemented for primitive ints, floats, and Vec<u8>. (but not [u8], which is unsized)
Serialize is implemented for primitive ints, floats, Vec<u8>, [u8], and for things behind a variety of pointer types.
AutoSerialize is implemented for primitive ints and floats. (but not Vec<u8>/[u8], for which |Sn and |Vn both sound reasonable and both have disadvantages)

Where's `n_bytes`?

It's on DType now. There is no other way to possibly support e.g. |V42.

impl DType {
    pub fn num_bytes(&self) -> usize;
}
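
To illustrate why the size has to come from the dtype rather than from the Rust type: for descrs like |Sn and |Vn, the byte count is written into the type string itself. The sketch below is not the crate's `DType::num_bytes`; `item_size` is a hypothetical helper that assumes a simple "endianness char, kind char, decimal size" layout and rejects everything else.

```rust
/// Hypothetical helper (not part of the crate): returns the item size in bytes
/// encoded in a simple type string such as "<i4", "|S10", or "|V42".
pub fn item_size(descr: &str) -> Option<usize> {
    // Assumed layout: one endianness char, one kind char, then decimal digits.
    let mut chars = descr.chars();
    let endian = chars.next()?;
    let kind = chars.next()?;
    if !"<>|=".contains(endian) || !kind.is_ascii_alphabetic() {
        return None;
    }
    // Whatever is left must be a non-empty run of ASCII digits.
    let digits = chars.as_str();
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

A `Vec<u8>` read from |V42 and one read from |V7 have the same Rust type, so nothing on the type alone could report 42 versus 7.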

Worth noting is that, unfortunately, this means the compiler can no longer constant-fold the sizes for large records. To mitigate that, we now have...

Two-stage de/serialization

So, what's the deal with the Reader and Writer types? Basically, types now need to validate their DTypes and possibly do different things based on what they contain.

To ensure that this can be done efficiently, de/serialization now takes place in two stages:

1. DType validation. (and potentially caching useful info like offsets)
2. The actual reading/writing.

The first step is, of course, done by Serialize::writer and Deserialize::reader already seen above. The second step is done by these methods:

pub trait TypeRead {
    type Value;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

pub trait TypeWrite {
    type Value: ?Sized;

    fn write_one<W: io::Write>(&self, writer: W, value: &Self::Value) -> io::Result<()>
    where Self: Sized;
}

Needless to say, manually implementing these traits has become a fair bit of a chore now. Please see the updated roundtrip example.
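
For a feel of what a manual impl involves, here is a minimal sketch of the two stages for a 4-byte integer. It is not the crate's code: `Endian`, `I32Reader`, and `i32_reader` are hypothetical names, and a plain `&str` descr plus a `String` error stand in for `DType` and `DTypeError`.

```rust
/// Same shape as the `TypeRead` trait shown above.
pub trait TypeRead {
    type Value;
    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

/// Hypothetical stand-in for the information a reader caches after validation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Endian {
    Little,
    Big,
}

/// Hypothetical reader for 4-byte signed integers.
pub struct I32Reader {
    endian: Endian,
}

/// Stage 1: validate the descr once, up front.
pub fn i32_reader(descr: &str) -> Result<I32Reader, String> {
    match descr {
        "<i4" => Ok(I32Reader { endian: Endian::Little }),
        ">i4" => Ok(I32Reader { endian: Endian::Big }),
        other => Err(format!("cannot read i32 from dtype {:?}", other)),
    }
}

/// Stage 2: per element, a single branch plus a fixed-size copy.
impl TypeRead for I32Reader {
    type Value = i32;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (i32, &'a [u8]) {
        // Panics if fewer than 4 bytes remain.
        let (head, rest) = bytes.split_at(4);
        let mut buf = [0u8; 4];
        buf.copy_from_slice(head);
        let value = match self.endian {
            Endian::Little => i32::from_le_bytes(buf),
            Endian::Big => i32::from_be_bytes(buf),
        };
        (value, rest)
    }
}
```

All of the string matching happens once in `i32_reader`; the hot loop only ever calls `read_one`.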

The error type exposes a single public constructor for use by manual impls of the traits.

/// Indicates that a particular rust type does not support serialization or deserialization
/// as a given [`DType`].
#[derive(Debug, Clone)]
pub struct DTypeError(ErrorKind);

impl fmt::Display for DTypeError { ... }
impl std::error::Error for DTypeError { ... }

impl DTypeError {
    pub fn custom<S: AsRef<str>>(msg: S) -> Self;
}

One more trait: `TypeWriteDyn`

There's... this... thing.

pub trait TypeWriteDyn: TypeWrite {
    #[doc(hidden)]
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()>;
}

impl<T: TypeWrite> TypeWriteDyn for T { ... }

Long story short: dyn TypeWrite can't do anything, so you should use dyn TypeWriteDyn instead.

I'd remove this from the PR if I could to save it for later consideration... but currently some of the built-in impls use it. I need to do some benchmarking first.
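
A self-contained sketch of why the extra trait is needed: `write_one` is generic and bounded by `Self: Sized`, so it cannot be called through a trait object, while a non-generic method taking `&mut dyn io::Write` can. The trait shapes follow the signatures above; `U8Writer` and `write_through_trait_object` are hypothetical names used only for illustration, and the body of the blanket impl is my guess at the obvious forwarding.

```rust
use std::io;

pub trait TypeWrite {
    type Value: ?Sized;

    fn write_one<W: io::Write>(&self, writer: W, value: &Self::Value) -> io::Result<()>
    where
        Self: Sized;
}

pub trait TypeWriteDyn: TypeWrite {
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()>;
}

// Blanket impl: every sized `TypeWrite` gets the dyn-friendly method for free.
impl<T: TypeWrite> TypeWriteDyn for T {
    fn write_one_dyn(&self, writer: &mut dyn io::Write, value: &Self::Value) -> io::Result<()> {
        // `&mut dyn io::Write` itself implements `io::Write`, so it can be forwarded.
        self.write_one(writer, value)
    }
}

/// Hypothetical writer for single bytes.
pub struct U8Writer;

impl TypeWrite for U8Writer {
    type Value = u8;

    fn write_one<W: io::Write>(&self, mut writer: W, value: &u8) -> io::Result<()> {
        writer.write_all(&[*value])
    }
}

/// Hypothetical helper: writes every value through a boxed trait object.
pub fn write_through_trait_object(values: &[u8]) -> io::Result<Vec<u8>> {
    let writer: Box<dyn TypeWriteDyn<Value = u8>> = Box::new(U8Writer);
    let mut out = Vec::new();
    for v in values {
        writer.write_one_dyn(&mut out, v)?;
    }
    Ok(out)
}
```

With `Box<dyn TypeWrite<Value = u8>>` instead, the call to `write_one` would be rejected because of the `Self: Sized` bound.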

`OutFile::open_with_dtype`

The existing methods for writing files require AutoSerialize instead of Serialize. One new method, OutFile::open_with_dtype, was added for types that only implement the latter.

impl<Row: AutoSerialize> OutFile<Row> {
    /// Create a file, using the default format for the given type.
    pub fn open<P: AsRef<Path>>(path: P) -> io::Result<Self>;
}

impl<Row: Serialize> OutFile<Row> {
    pub fn open_with_dtype<P: AsRef<Path>>(dtype: &DType, path: P) -> io::Result<Self>;
}

pub fn to_file<S, T, P>(filename: P, data: T) -> ::std::io::Result<()> where
        P: AsRef<Path>,
        S: AutoSerialize,
        T: IntoIterator<Item=S>;

`DType::Plain.ty` changed from `String` -> `TypeStr`

The new TypeStr type is a fully parsed form of a stringlike descr. This is necessary so that reader and writer can easily match on various properties of the descr without having to look at string text.

/// Represents an Array Interface type-string.
///
/// This is more or less the `DType` of a scalar type.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct TypeStr {
    /* no public fields */
}

impl fmt::Display for TypeStr { ... }
impl str::FromStr for TypeStr {
    type Err = ParseTypeStrError;
    ...
}

Parsing one generally produces validation errors for anything not accepted by the np.dtype function.

It currently exposes no public API for inspection or manipulation.

The error is just a garden-variety error type (Debug, Clone, Display, Error). I purposefully didn't use nom because nom gives terrible error messages.
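
As a rough idea of what hand-written parsing with specific error messages can look like, here is a small sketch. It is not the crate's `TypeStr`: `SimpleTypeStr` and `ParseError` are hypothetical, and the accepted kind characters are an arbitrary subset chosen for the example.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical, much-simplified stand-in for a parsed type string.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleTypeStr {
    pub endian: char,
    pub kind: char,
    pub size: usize,
}

/// Hypothetical error type; each variant says exactly what was wrong.
#[derive(Debug, Clone, PartialEq)]
pub enum ParseError {
    TooShort,
    BadEndianness(char),
    BadKind(char),
    BadSize(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::TooShort => write!(f, "type string is too short"),
            ParseError::BadEndianness(c) => write!(f, "invalid endianness character {:?}", c),
            ParseError::BadKind(c) => write!(f, "unsupported kind character {:?}", c),
            ParseError::BadSize(s) => write!(f, "invalid size {:?}", s),
        }
    }
}

impl std::error::Error for ParseError {}

impl FromStr for SimpleTypeStr {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, ParseError> {
        let mut chars = s.chars();
        let endian = chars.next().ok_or(ParseError::TooShort)?;
        let kind = chars.next().ok_or(ParseError::TooShort)?;
        if !"<>|=".contains(endian) {
            return Err(ParseError::BadEndianness(endian));
        }
        // Only a handful of kinds are accepted in this sketch.
        if !"iufSV".contains(kind) {
            return Err(ParseError::BadKind(kind));
        }
        let digits = chars.as_str();
        if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
            return Err(ParseError::BadSize(digits.to_string()));
        }
        let size = digits
            .parse()
            .map_err(|_| ParseError::BadSize(digits.to_string()))?;
        Ok(SimpleTypeStr { endian, kind, size })
    }
}
```

Because every failure path names the offending character or substring, a caller can report something more useful than a generic "parse error".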

There is a helper for "upgrading" a TypeStr to a DType:

impl DType {
    /// Construct a scalar `DType`. (one which is not a nested array or record type)
    pub fn new_scalar(ty: TypeStr) -> Self;
}

`"derive"` feature

This is a replacement for #[macro_use] extern crate npy_derive. Basically, we don't want #[derive(Serialize)] to clash with serde or other crates, so we instead recommend the following setup to users:

[dependencies]
npy-rs = { version = "0.5", features = ["derive"] }

extern crate npy;

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct { a: i32, b: f64 }

Notice the above works even for 2015 edition crates. (the examples and doctests should attest to this!)

`NpyData::dtype`

impl<'a, T: Deserialize> NpyData<'a, T> {
    pub fn dtype(&self) -> DType;
}

This is necessary in order to be able to read an NPY file and then write a new one back that has the same format. I also wanted it for some of the tests.

Little Things

- travis now checks that the examples/ build (because I had to update .travis.yml to add the feature flag, and at that point... why not?)
- DType now derives Clone because I needed it in the derive macro.
- npy_derive now depends on proc_macro2, whose types appear in the public API of syn and quote
- unnecessary lifetime deleted from to_file

...and I think that's everything. (whew!)

Edit 1: Updated the signatures of TypeRead to reflect the new, faster API
Edit 2:

- Removed TypeRead::read_one_into (it can be added backwards compatibly later)
- Removed "Helpers for creating |Sn and |Vn" (it can be added backwards compatibly later)
- Added NpyData::dtype

ExpHP · 2019-06-04T21:22:56Z

Results of the existing bench.rs:

Before

running 2 tests
test read  ... bench:     112,028 ns/iter (+/- 6,674)
test write ... bench:     895,624 ns/iter (+/- 32,575)

After

running 2 tests
test read  ... bench:     463,360 ns/iter (+/- 153,967)   (~4x slowdown)
test write ... bench:   1,596,336 ns/iter (+/- 97,324)    (~2x slowdown)

Yeowch.

This is probably due to the double indirection in the current integer/float impls, which likely prevents inlining. There's no longer any good reason for them to have this indirection since I've decided not to support promotions like u32 -> u64, so we'll see how this improves with just a simple branch on endianness.

Edit: Further benchmarks

Edit: After fixing indirection

running 2 tests
test read  ... bench:     198,521 ns/iter (+/- 12,449)   (x1.75 slowdown)
test write ... bench:   1,092,576 ns/iter (+/- 61,045)   (x1.25 slowdown)

Edit: After adding dedicated Little Endian newtypes

running 2 tests
test le_read  ... bench:     168,137 ns/iter (+/- 9,882)
test le_write ... bench:     948,938 ns/iter (+/- 39,311)

The issue seems to be the fact that it no longer statically knows the strides. I have an idea for how to fix this: read_one will need to become something like

pub trait TypeRead {
    type Value;

    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

ExpHP · 2019-06-05T23:38:00Z

The fn read_one(&self, bytes: &[u8]) -> (Self::Value, &[u8]); fix was wildly successful at optimizing the reads for scalar data and derived structs. Now my local branch is actually faster than master at reading most types! (that is, if you trust the current reading benchmarks... which I really don't)

For this reason I've decided against the inclusion of the fixed-endian wrapper types in this PR.
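
The idea behind the fix, as I understand it, is that the reader itself advances the buffer, so the caller never needs to look up a stride at runtime. A minimal sketch under that assumption follows; `LeF32Reader` and `read_n` are hypothetical names and not the crate's internals.

```rust
pub trait TypeRead {
    type Value;
    fn read_one<'a>(&self, bytes: &'a [u8]) -> (Self::Value, &'a [u8]);
}

/// Hypothetical reader for little-endian f32; its stride (4) is a literal constant.
pub struct LeF32Reader;

impl TypeRead for LeF32Reader {
    type Value = f32;

    #[inline]
    fn read_one<'a>(&self, bytes: &'a [u8]) -> (f32, &'a [u8]) {
        // Panics if fewer than 4 bytes remain.
        let (head, rest) = bytes.split_at(4);
        let mut buf = [0u8; 4];
        buf.copy_from_slice(head);
        (f32::from_le_bytes(buf), rest)
    }
}

/// Hypothetical driver loop: it never asks for the item size, it just keeps
/// feeding the remainder back into the reader.
pub fn read_n<R: TypeRead>(reader: &R, mut bytes: &[u8], n: usize) -> Vec<R::Value> {
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let (value, rest) = reader.read_one(bytes);
        out.push(value);
        bytes = rest;
    }
    out
}
```

Since `read_n` is generic over the reader, the compiler can inline `read_one` and see the constant 4 directly.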

Before:

test array::read      ... bench:   1,325,069 ns/iter (+/- 43,806)
test array::write     ... bench:   4,602,994 ns/iter (+/- 614,125)
test one_field::read  ... bench:      95,717 ns/iter (+/- 4,101)
test one_field::write ... bench:     476,748 ns/iter (+/- 27,877)
test plain_f32::read  ... bench:      86,771 ns/iter (+/- 3,572)
test plain_f32::write ... bench:     561,184 ns/iter (+/- 9,845)
test simple::read     ... bench:     114,684 ns/iter (+/- 6,689)
test simple::write    ... bench:     869,041 ns/iter (+/- 50,057)

After:

test array::read         ... bench:   1,435,941 ns/iter (+/- 48,061)  (10% slowdown)
test array::write        ... bench:   4,567,738 ns/iter (+/- 160,708)
test one_field::read     ... bench:      56,118 ns/iter (+/- 1,440)   (50% speedup)
test one_field::write    ... bench:     477,967 ns/iter (+/- 28,492)
test plain_f32::read     ... bench:      55,995 ns/iter (+/- 3,302)   (50% speedup)
test plain_f32::write    ... bench:     477,693 ns/iter (+/- 12,268)  (15% speedup)
test simple::read        ... bench:      84,084 ns/iter (+/- 4,506)   (30% speedup)
test simple::write       ... bench:   1,091,575 ns/iter (+/- 18,112)  (25% slowdown)

ExpHP · 2019-06-06T15:25:48Z

I removed a couple of things that can be backwards-compatibly added back later. What remains are basically the things I need for unit tests.

This enables these derives to be qualified under `npy`, for disambiguation from `serde`:

extern crate npy;

#[derive(npy::Serialize, npy::Deserialize)]
struct MyStruct { ... }

This has a couple of downsides with regard to maintenance:

* npy_derive can no longer be a dev-dependency because it must become an optional dependency.
* Many tests and examples need the feature. We need to list all of these in Cargo.toml.
* Because this crate is 2015 edition, as soon as we list *any* tests and examples, we must list *all* of them, including the ones that don't need the feature!

---

This commit had to update `.travis.yml` to start using the feature. I took this opportunity to also add `--examples` (which the default script does not do) to ensure that examples build correctly.

I tried a variety of things to optimize this function:

* Replacing usage of get_unchecked with reuse of the remainder returned by read_one, so that the stride can be statically known rather than having to be looked up. (this is what optimized the old read benchmark)
* Putting an assertion up front to prove that the data vector is long enough.

But whatever I do, performance won't budge. In the f32 benchmark, a very hot bounds check still occurs on every read to ensure that the length of the data is at least 4 bytes. So I'm adding the benchmark, but leaving the function itself alone.

ExpHP · 2019-06-06T20:02:41Z

This is finished now!

I completely rewrote the git history from scratch, organizing all of the changes into easily reviewable groups. You should be able to read each individual commit one by one and make sense of them.

2015 edition crates do not yet use NLL on the latest stable compiler, so our derive macro must be conservative. Apparently this changed in the latest nightly, and 2015 edition will use NLL at some point. This is why I did not notice the problem at first!

potocpav

First of all, thank you for your heroic effort to improve and generalize this library! And I'm sorry I didn't have time to review your changes earlier.

I didn't review all the commits & changes just yet, but I'm quite sure I agree with all your architectural decisions. Your solution of using two-stage serialization to efficiently support all possible formats is particularly neat.

I commented on a few things I found while browsing the commits, but those are just nit-picks. I will try to finish the review so that we can merge everything ASAP.

Edit: And thanks for the detailed write-up, it really helped understand the changes. :-)

It was an artefact of an old design. Out of paranoia, I added some assertions to the Serialize/Deserialize impls to make sure the endianness is valid. These are redundant since it is checked in TypeStr::from_str, but most other such properties are at least implicitly checked by the `_` arm in impls of `reader` and `writer` and I wanted to be safe.

ExpHP · 2019-06-12T16:24:54Z

Two notes:

- I changed NpyData::dtype to return &DType instead of DType.
- After this is merged, I would like to submit another PR to add support for n-dimensional arrays before the next version release.
  - Basically, I'm currently prototyping support for shape.len() != 1 on my own fork, and the most reasonable API I was able to come up with makes open_with_dtype redundant (so I'd like to remove it before it becomes part of the published API).

ExpHP · 2019-06-21T15:24:49Z

@potocpav have you had any time to finish the review?

ExpHP · 2019-07-12T23:34:39Z

There's a couple of WTFs in this that I became aware of after cleaning them up on my fork. I can include those changes here if you want:

- Using ByteOrder only for NativeEndian seemed kind of weird and required dumb hacks for u8, so I replaced it with a trait (not publicly exposed) for reading primitives. This additionally made the macros not need to know method names or generate modules anymore.
- impl_integer_serializable! was needlessly complicated, and the recursive loop structure could be replaced with a $()* repetition. The loop was a leftover from a previous design (one where widening conversions were supported, rather than requiring the size to be an exact match).
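
For readers unfamiliar with the `$()*` form, here is a minimal sketch of a macro that implements a trait for a list of integer types with a single repetition instead of recursion. `ReadLe` and `impl_read_le!` are hypothetical and only loosely modeled on the private trait described above; they are not the fork's actual code.

```rust
/// Hypothetical trait standing in for a private primitive-reading trait.
pub trait ReadLe: Sized {
    /// Reads `Self` from the first `size_of::<Self>()` bytes, little-endian.
    fn read_le(bytes: &[u8]) -> Self;
}

// One `$( ... )*` repetition expands to one impl per listed type.
macro_rules! impl_read_le {
    ($($ty:ty),* $(,)?) => {
        $(
            impl ReadLe for $ty {
                fn read_le(bytes: &[u8]) -> Self {
                    let mut buf = [0u8; std::mem::size_of::<$ty>()];
                    // Panics if `bytes` is shorter than the type.
                    buf.copy_from_slice(&bytes[..std::mem::size_of::<$ty>()]);
                    <$ty>::from_le_bytes(buf)
                }
            }
        )*
    };
}

impl_read_le!(u8, u16, u32, u64, i8, i16, i32, i64);
```

Note that `u8` needs no special casing here, since `u8::from_le_bytes` exists like for every other integer type.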

ExpHP · 2019-07-13T18:24:24Z

I'm also now having second thoughts about reading DateTimes/TimeDeltas as u64/i64 in this PR.

Basically, I think that at some point I'll want to add back the widening conversions for integers (because integer arrays produced in python code will often have a type that is dynamically chosen between <i4 and <i8), in which case DateTime as u64 doesn't feel right; there should be a dedicated DateTime wrapper type.
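
To make the concern concrete, this is roughly what a widening reader could look like. It is purely a hypothetical sketch of a possible future design (this PR requires an exact size match), with a `&str` descr and `String` error standing in for `DType` and `DTypeError`.

```rust
/// Hypothetical reader that yields i64 from either a 4-byte or an 8-byte
/// little-endian signed integer column.
pub enum WideningI64Reader {
    FromI4,
    FromI8,
}

/// Validation step: only plain signed integers are accepted.
pub fn i64_reader(descr: &str) -> Result<WideningI64Reader, String> {
    match descr {
        "<i4" => Ok(WideningI64Reader::FromI4),
        "<i8" => Ok(WideningI64Reader::FromI8),
        other => Err(format!("cannot read i64 from dtype {:?}", other)),
    }
}

impl WideningI64Reader {
    pub fn read_one<'a>(&self, bytes: &'a [u8]) -> (i64, &'a [u8]) {
        match self {
            WideningI64Reader::FromI4 => {
                let (head, rest) = bytes.split_at(4);
                let mut buf = [0u8; 4];
                buf.copy_from_slice(head);
                // Sign-extending conversion from i32 to i64.
                (i64::from(i32::from_le_bytes(buf)), rest)
            }
            WideningI64Reader::FromI8 => {
                let (head, rest) = bytes.split_at(8);
                let mut buf = [0u8; 8];
                buf.copy_from_slice(head);
                (i64::from_le_bytes(buf), rest)
            }
        }
    }
}
```

In a design like this, letting the same `i64` impl also accept <m8[us] would blur the line between "integer" and "time", which is the argument for a dedicated wrapper type.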

I really wish I could find a way to make this PR smaller...

(the only bit that I think can really be pulled out is the derive feature, but that accounts for relatively few changes)

ExpHP · 2019-07-15T14:26:07Z

Rats, I just realized DateTime should be an i64, not a u64. I'll just remove support for these for now.

I don't want to yet commit to a specific API for serializing DateTime/TimeDelta, so that we can keep open the option of widening conversions.

xd009642 · 2020-01-21T14:45:40Z

Is there any progress on this? I ask because I've come across this issue myself while trying to deserialise u8 numpy arrays.

ExpHP · 2021-07-02T02:35:45Z

So it is now two years later; Pavel never responded again to this PR or the other, and I found myself needing npy files in rust once again, so I finally came back to work on my fork and have now published it under the name npyz. In addition to the things from this PR, it also has support for:

- io::Read/io::Write
- n-dimensional arrays (:o !!!)
- num-complex

It can be found here: https://github.com/ExpHP/npyz

Hopefully one day these additions can be merged upstream, so that they can finally be under the name npy where people are most likely to look for them; but I've more or less given up on that by this point.

ExpHP mentioned this pull request Jun 5, 2019

Allow OutFile to write to any Write + Seek #16

Open

ExpHP added 9 commits June 6, 2019 11:37

add type_str

0dc62ca

Make DType contain TypeStr

57a79b2

add new serialization traits, impl for ints

1c9bdaf

impl Serialize traits for floats

ad751d8

impl Serialize traits for bytestrings

b6ce98b

support dynamic readers and writers

a500465

Derive Clone for DType

6aee13e

The new derive macros will need this...

add the new derives

af1e445

ExpHP mentioned this pull request Jun 6, 2019

Nested arrays have backwards shapes in dtype #19

Open

ExpHP added 5 commits June 6, 2019 14:04

impl Serialize for arrays

2a6672b

This had to wait until after the derives were added so that the tests could use the derives.

The breaking changes: Update NpyData and OutFile

e0dc7d2

This is the single most important commit in the PR. All breaking changes to existing public APIs are contained in here. Serialize is completely removed. Examples and tests are not yet updated, so they are broken in this commit.

update benches, examples, and tests to use new traits

f0c499b

small bits of cleanup

d177998

Fix a couple of things I missed while rewriting and reorganizing the commit history.

ExpHP force-pushed the npz-prep branch from d4a9c3f to 7344b12 Compare June 6, 2019 19:46

ExpHP mentioned this pull request Jun 12, 2019

Fix benches and tests to use Cursor ExpHP/npyz#6

Closed

ExpHP mentioned this pull request Jun 12, 2019

NpyData::dtype should return a reference ExpHP/npyz#18

Closed

potocpav reviewed Jun 12, 2019

ExpHP and others added 3 commits June 12, 2019 11:45

Apply suggestions from code review

ec62664

Co-Authored-By: Pavel Potocek <pavelpotocek@gmail.com>

change NpyData::dtype to return a reference

132eda2

I'm not really sure why I had it return a clone in the first place...

This was referenced Jun 12, 2019

add more benches #17

Closed

Arrays with ndim != 1 #10

Open

fixup some outdated comments

c4bf3ca

remove datetime; simplify macro; fix comments

646d053

I don't want to yet commit to a specific API for serializing DateTime/TimeDelta, so that we can keep open the option of widening conversions.

ExpHP mentioned this pull request Jan 28, 2020

reading npz files takes soooooo long ExpHP/rsp2#108

Closed
