
Added HDF5 dataset #9

Merged
merged 15 commits into mila-iqia:master on Mar 5, 2015

Conversation

dmitriy-serdyuk (Contributor)

I added a simple HDF5 dataset.

def __setstate__(self, state):
    self.__dict__ = state
    # Reopen the HDF5 file and set up `nodes`.
    self.nodes = self._open_file()
Contributor

This will fail to unpickle if the HDF5 file is not present in the exact same location. This is very bad and one of the things about pylearn2 that we're trying to get away from.
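
To make the failure mode concrete, here is a hypothetical reproduction (the file name is made up, and it assumes _open_file raises IOError when the HDF5 file is absent):

import pickle

# Unpickling runs __setstate__, which calls _open_file() immediately;
# if the HDF5 file has moved, this raises IOError before you ever get
# back an object whose path you could fix.
with open('experiment.pkl', 'rb') as f:
    dataset = pickle.load(f)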

Contributor Author

If the file is not present in the same location, dill will also give an inconsistent result after loading a file handle.

Contributor

Well, we're ditching dill AFAIK, but the point is that we should always be able to deserialize the dataset object, regardless of whether the data file can be loaded. That way you at least have the opportunity to fix the path programmatically.

This can be a big problem if you need to deserialize an experiment on another machine and the path where your data was on the original machine isn't writable on the new one (and you are not root). The fact that your dataset object can't be unpickled will mean that any object holding a reference to the dataset can't be unpickled either. Your options then are to write a custom Unpickler class or to edit the file path in the pickle binary; neither is particularly appealing.

@bartvm has a way of dealing with this for in-memory datasets called "lazy properties". I don't have all the details of his implementation in my head, but these things (or something very close) can be applied here too.

Member

Lazy properties would actually work here as well. Maybe I should just rename them so that they can be used in general, instead of belonging to the InMemoryDataset class, because it's silly to reimplement this logic for each dataset.

The idea is very simple: You define a series of attributes that are set in a load method. The load method is only called when an attribute is requested. When pickling, you delete all these attributes. Basically, it's just this:

class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        # pop with a default: '_foo' is absent if `foo` was never accessed
        state.pop('_foo', None)
        return state

    def baz(self):
        return self.foo.readline()

It's made transparent so that you can just do:

@lazy_properties('foo')
class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    def baz(self):
        return self.foo.readline()
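
For reference, a minimal sketch of what such a lazy_properties decorator could look like. This is a hypothetical implementation written for this discussion, not the one that ended up in Fuel; it assumes the decorated class defines a load() method that sets every listed attribute:

def lazy_properties(*attrs):
    def wrap(cls):
        for attr in attrs:
            # Bind `attr` through a default argument to avoid the
            # late-binding closure pitfall inside the loop.
            def getter(self, _attr=attr):
                if not hasattr(self, '_' + _attr):
                    self.load()  # populates all lazy attributes
                return getattr(self, '_' + _attr)

            def setter(self, value, _attr=attr):
                setattr(self, '_' + _attr, value)

            setattr(cls, attr, property(getter, setter))

        def __getstate__(self):
            # Drop the loaded data so pickles stay small and portable.
            state = self.__dict__.copy()
            for attr in attrs:
                state.pop('_' + attr, None)
            return state

        cls.__getstate__ = __getstate__
        return cls
    return wrap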

Contributor Author

I still don't understand what a user should do when the dataset has been moved. Inherit and redefine load? That seems clumsy.

I would propose the name not_save or not_pickle for this decorator.

Member

The load method would actually not have the data path hard-coded; it would be read from another attribute, which the user could set. So there is no need to override load; you would simply change the attribute.

class Bar(object):
    path = '/home/data.txt'

    def load(self):
        self.foo = open(self.path)

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        # pop with a default: '_foo' is absent if `foo` was never accessed
        state.pop('_foo', None)
        return state

    def baz(self):
        return self.foo.readline()

And if you wanted to unpickle this and the path had changed, you would just change bar.path = '/home/new_location/data.txt'.
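
A minimal usage sketch of that round trip, with a made-up new location (using the Bar class above):

import pickle

bar = Bar()
pickled = pickle.dumps(bar)   # __getstate__ drops '_foo', so this works
                              # even if data.txt cannot be opened

bar = pickle.loads(pickled)   # always deserializes
bar.path = '/home/new_location/data.txt'  # fix the path programmatically
print(bar.baz())              # load() now opens the new location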

We need to factor out the path logic at some point though, or at least standardize it, because it will always be the same: by default, most datasets should read their data from FUEL_DATA_PATH/dataset_name/*. Most of the time, when data moves, you can just change FUEL_DATA_PATH. However, if the files for some reason aren't in a folder, or the filenames themselves have changed, you need to intervene programmatically, I guess. I'm not sure about making this an optional function call though; I'd prefer an approach where the folder and filenames are saved as attributes, which can simply be changed, e.g.:

@do_not_pickle_attributes('images', 'labels')
class MNIST(FileDataset):
    folder = 'mnist'
    files = {
        'train': {'images': 'train-images-idx3-ubyte', 'labels': 'train-labels-idx1-ubyte'},
        'test': {'images': 't10k-images-idx3-ubyte', 'labels': 't10k-labels-idx1-ubyte'}
    }

    def load_attributes(self):
        self.images = read_mnist_images(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['images']
        ))
        self.labels = read_mnist_labels(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['labels']
        ))

Now you can simply edit the folder or files attributes if the data moved. For this HDF5 class, you could do the same thing. We should maybe factor out the path joining though, e.g. into a mixin class with a get_file_path method that does something like:

class FileDataset(object):
    def get_file_path(self, file):
        folder = getattr(self, 'folder', '')
        return os.path.join(config.data_path, folder, file)
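
With that mixin in place, the MNIST sketch above could shrink to something like this (a hypothetical rewrite reusing the names from that example):

@do_not_pickle_attributes('images', 'labels')
class MNIST(FileDataset):
    folder = 'mnist'
    files = {
        'train': {'images': 'train-images-idx3-ubyte',
                  'labels': 'train-labels-idx1-ubyte'},
        'test': {'images': 't10k-images-idx3-ubyte',
                 'labels': 't10k-labels-idx1-ubyte'}
    }

    def load_attributes(self):
        # get_file_path takes care of joining config.data_path and folder.
        names = self.files[self.which_set]
        self.images = read_mnist_images(self.get_file_path(names['images']))
        self.labels = read_mnist_labels(self.get_file_path(names['labels']))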

sources_in_file : tuple of strings
    Names of nodes in the HDF5 file which contain sources. Should have
    the same length as `sources`. Optional; if not set, it will be equal
    to `sources`.
Member

There should be a blank line before closing the docstring.

@dmitriy-serdyuk (Contributor Author)

It seems that the error is not in my code.

@bartvm (Member) commented Feb 26, 2015

Yeah, Travis has been down all afternoon. Nothing we can do but take the night off! :)

@dmitriy-serdyuk (Contributor Author)

Hopefully this works now.

Do we need any other features from the HDF5 container?

@bartvm (Member) commented Mar 5, 2015

I'll merge it for now just so that we have support. It would be nice if this could inherit from IndexableDataset at some point, but we can do that later on.

bartvm added a commit that referenced this pull request Mar 5, 2015
bartvm merged commit 2cd17bb into mila-iqia:master Mar 5, 2015
vdumoulin referenced this pull request in laurent-dinh/fuel Mar 5, 2015
basveeling pushed a commit to basveeling/fuel that referenced this pull request Mar 22, 2017
Fixed random 2D rotation transformer to work with floats.