Add hdf5 support #257

Merged
merged 14 commits into from Oct 12, 2018

Conversation

2 participants
@Vindaar
Contributor

Vindaar commented Jul 22, 2018

This PR adds support for reading and writing tensors to HDF5 files. It's a work in progress.

Features (to be) added for a start:

  • write tensor with generic name
  • write tensor with specific name
  • write tensor to specific group
  • read tensor
  • read tensor, convert to different datatype
  • read tensor of given name
  • read tensor from location other than root group
  • write several tensors without closing H5 file in between calls
  • make use of is_C_contiguous
  • make sure tensors are contiguous before writing
  • write proper tests of writing -> reading back tensors
    can probably always add more tests...
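The core round-trip from the list above (write a tensor's data plus shape metadata, read it back) can be sketched in Python with h5py and NumPy as an analogue of the Nim implementation. The helper names `write_tensor` / `read_tensor` are illustrative, not Arraymancer's actual API.

```python
import numpy as np
import h5py

def write_tensor(f: h5py.File, t: np.ndarray, name: str, group: str = "/") -> None:
    # Store the flattened data plus enough metadata to reconstruct the tensor.
    dset = f.require_group(group).create_dataset(name, data=np.ascontiguousarray(t).ravel())
    dset.attrs["shape"] = t.shape
    dset.attrs["size"] = t.size

def read_tensor(f: h5py.File, name: str, group: str = "/") -> np.ndarray:
    dset = f[group][name]
    return np.asarray(dset[...]).reshape(dset.attrs["shape"])

with h5py.File("tensors.h5", "w") as f:
    t = np.arange(12, dtype=np.float64).reshape(3, 4)
    write_tensor(f, t, "mytensor")
    assert np.array_equal(read_tensor(f, "mytensor"), t)
```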

Once these are implemented potentially (in another PR):

  • serialize NN model to H5
  • read serialized NN model from H5
  • read Keras NN models from H5 files

Regarding Keras H5 files, relevant Keras code: https://github.com/keras-team/keras/blob/master/keras/engine/saving.py

@Vindaar

Contributor

Vindaar commented Jul 23, 2018

The major features that come to mind regarding simple storage of tensors in a (shared) HDF5 file have been implemented, I would say.

Of course it's always possible that I overlooked / forgot / wrongly implemented something... :)

I'm not sure if it's due to the unittest module, but on my laptop compiling the test case takes ages compared to Nim's typical compilation times. Compiling with --verbosity:2 shows that each individual read_hdf5 call takes quite long.
I don't quite know how to debug the compilation further right now (I guess I could run gdb on the Nim compiler while it builds the test case?). In other code that uses the HDF5 library I don't see this effect though.

@mratsim

Owner

mratsim commented Sep 26, 2018

Btw @Vindaar do you still consider this WIP or is this ready for review/merge?

@Vindaar

Contributor

Vindaar commented Sep 27, 2018

Oh, sorry. I got a little side tracked with my "Keras investigation" and then the last few weeks happened.

Reading Keras files will certainly not be part of this PR. Besides some technical issues (for whatever reason they sometimes use static and sometimes variable-length strings for attributes), which I fixed, I have some more general problems implementing this.
My main issue comes down to: how should I use the Keras config, which I read at run time from an H5 file, to create a NN in Arraymancer, where I'd normally use the DSL at compile time? Several hacky ideas later, I'm still not sure. :)
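For context on the Keras file layout being discussed: Keras stores the model configuration as a JSON string in a top-level HDF5 attribute named `model_config`. A minimal Python/h5py sketch (building a fake file with that layout, then reading it back the way a loader would):

```python
import json
import h5py

# Create a file mimicking the Keras layout: the architecture lives in a
# JSON string stored as a top-level attribute called "model_config".
config = {"class_name": "Sequential", "config": {"layers": []}}
with h5py.File("model.h5", "w") as f:
    f.attrs["model_config"] = json.dumps(config)

# A loader reads the attribute back and parses the JSON at run time.
with h5py.File("model.h5", "r") as f:
    loaded = json.loads(f.attrs["model_config"])
print(loaded["class_name"])  # Sequential
```

This run-time availability of the config is exactly the tension with Arraymancer's compile-time DSL mentioned above.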

Are you happy with the canImport solution or should nimhdf5 be a proper dependency? And I'll take a look at the long compilation times again later today and see if that still happens.

@Vindaar Vindaar changed the title from [WIP] Add hdf5 support to Add hdf5 support Sep 27, 2018

@mratsim

LGTM

Do you still have timing issues? I will try on my end as well.

@Vindaar

Contributor

Vindaar commented Oct 2, 2018

Finally checked the compilation times. While read_hdf5 calls are still on the slower side, compiling the hdf5 test takes about 3.3 s on my laptop. Seems fine to me.

Two questions:

  • I just noticed I still have four assert statements in read_hdf5. Shall I remove them?
  • Do you want me to rebase onto the current master branch?
@mratsim

Owner

mratsim commented Oct 10, 2018

The asserts are fine; I need to do a pass to remove all the doAsserts I added anyway and implement proper checks.

Yep rebasing would be perfect!

Vindaar added some commits Jul 22, 2018

add `write_hdf5` proc
A basic proc to write a tensor to an HDF5 file using `nimhdf5`
guard `io_hdf5` by `canImport` template
There might be a better way to check whether the user has the
`nimhdf5` library installed. Check.
add `parseNameAndGroup` helper, which returns dataset and group name
Returns the desired dataset and group name, taking into account
defaults and potentially generic tensor names.
add `read_hdf5` and make use of `parseNameAndGroup` in both procs
Add proc to read tensor from a given H5 file. Optionally allow
arbitrary names and groups.

Also adds wrappers around the default read / write procs, which do not
take `Option[T]` but rather default `string` and `int` values for
convenient access.
add procs to read / write from H5 file without open / close
Allows reading or writing multiple tensors from / to an H5 file without
opening and closing the file after each object.
take C / F contiguous into account, make sure tensor is contiguous
Before writing a tensor make sure it is contiguous. Then take column
or row major ordering into account when reading the tensor back.
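The "make sure tensor is contiguous" step from this commit can be sketched in Python with NumPy (an analogue, not the Nim code; `ensure_contiguous` is a hypothetical helper):

```python
import numpy as np

# If the array is neither C nor F contiguous, copy it to row-major layout,
# and report whether the result is C (row-major) contiguous.
def ensure_contiguous(t: np.ndarray):
    if not (t.flags["C_CONTIGUOUS"] or t.flags["F_CONTIGUOUS"]):
        t = np.ascontiguousarray(t)
    return t, bool(t.flags["C_CONTIGUOUS"])

view = np.arange(6).reshape(2, 3)[:, ::-1]   # a non-contiguous view
cont, is_c = ensure_contiguous(view)
assert cont.flags["C_CONTIGUOUS"] and is_c
assert np.array_equal(cont, view)            # same values, contiguous layout
```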

@Vindaar Vindaar force-pushed the Vindaar:addHdf5Support branch from 29f686a to c2ad13d Oct 11, 2018

@Vindaar

Contributor

Vindaar commented Oct 11, 2018

Some clean up done and rebased. :)

@mratsim

Perfect, just a single question about non-contiguous tensors

dset.attrs["shape"] = @(tCont.shape)
dset.attrs["size"] = tCont.size
# workaround since we can't write bool attributes
dset.attrs["is_C_contiguous"] = if tCont.is_C_contiguous: "true" else: "false"

@mratsim

mratsim Oct 11, 2018

Owner

In which case is this needed? It's forced contiguous on line 159?

@Vindaar

Vindaar Oct 12, 2018

Contributor

Oh, good point. I'm not entirely sure anymore what I had in mind. I guess to store whether the original tensor was contiguous or not.

@Vindaar

Contributor

Vindaar commented Oct 12, 2018

Oh, now I remember what I had in mind. It wasn't about whether the tensor is contiguous, but whether it is row or column major.
The idea was to store that information with the tensor so that when it's read back, it's reconstructed with the same major ordering.
My thinking was: convert the tensor to contiguous, check whether it is now C or F contiguous, and write that as an attribute. The data is then stored, read back, and put into the same ordering.
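That round-trip can be sketched in Python with h5py and NumPy (an analogue of the described logic, not the Nim code): write the data in memory order together with the ordering flag, then restore that ordering on read.

```python
import numpy as np
import h5py

with h5py.File("ordering.h5", "w") as f:
    # A column-major (F contiguous) tensor to store.
    t = np.asfortranarray(np.arange(6, dtype=np.float64).reshape(2, 3))
    dset = f.create_dataset("t", data=np.ravel(t, order="K"))  # memory order
    dset.attrs["shape"] = t.shape
    # workaround: store the flag as a string, mirroring the Nim snippet above
    dset.attrs["is_C_contiguous"] = "true" if t.flags["C_CONTIGUOUS"] else "false"

with h5py.File("ordering.h5", "r") as f:
    dset = f["t"]
    order = "C" if dset.attrs["is_C_contiguous"] == "true" else "F"
    restored = np.reshape(dset[...], dset.attrs["shape"], order=order)

assert restored.flags["F_CONTIGUOUS"]                 # ordering preserved
assert np.array_equal(restored, np.arange(6).reshape(2, 3))  # values preserved
```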

See line 90 here: https://github.com/mratsim/Arraymancer/pull/257/files#diff-e2d8651a1d48176c63843ddfd120c20fR90.

@mratsim

Owner

mratsim commented Oct 12, 2018

Oh I see, I'll re-add it then.

Revert "Remove the contiguous attribute"
This reverts commit dfbaa94.

Reason for reversal: Attribute is used to store whether the given
tensor was in row or column major ordering. When reading a tensor back
this is used to reconstruct the tensor in the same way it was stored
before being saved.

@mratsim mratsim merged commit 0a08949 into mratsim:master Oct 12, 2018

2 checks passed

  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed