
Added HDF5 dataset #9

Merged
merged 15 commits into mila-iqia:master on Mar 5, 2015

Conversation

dmitriy-serdyuk (Contributor)

I added a simple HDF5 dataset.

def __setstate__(self, state):
    self.__dict__ = state
    # Reopen the HDF5 file and set up `nodes`.
    self.nodes = self._open_file()
Contributor

This will fail to unpickle if the HDF5 file is not present in the exact same location. This is very bad and one of the things about pylearn2 that we're trying to get away from.
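
To make the failure mode concrete, here is a hypothetical reproduction (the file name is made up, and it assumes _open_file raises IOError when the HDF5 file is absent):

import pickle

# Unpickling runs __setstate__, which calls _open_file() immediately;
# if the HDF5 file has moved, this raises IOError before you ever get
# back an object whose path you could fix.
with open('experiment.pkl', 'rb') as f:
    dataset = pickle.load(f)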

Contributor Author

If the file is not present in the same location, dill will also give an inconsistent result after loading a file handle.

Contributor

Well, we're ditching dill AFAIK, but the point is that we should always be able to deserialize the dataset object, regardless of whether the data file can be loaded. That way you at least have the opportunity to fix the path programmatically.

This can be a big problem if you need to deserialize an experiment on another machine and the path where your data was on the original machine isn't writable on the new one (and you are not root). The fact that your dataset object can't be unpickled will mean that any object holding a reference to the dataset can't be unpickled either. Your options then are to write a custom Unpickler class or to edit the file path in the pickle binary; neither is particularly appealing.

@bartvm has a way of dealing with this for in-memory datasets called "lazy properties". I don't have all the details of his implementation in my head, but these things (or something very close) can be applied here too.

Member

Lazy properties would actually work here as well. Maybe I should just rename them so that they can be used in general, instead of belonging to the InMemoryDataset class, because it's silly to reimplement this logic for each dataset.

The idea is very simple: You define a series of attributes that are set in a load method. The load method is only called when an attribute is requested. When pickling, you delete all these attributes. Basically, it's just this:

class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        # pop with a default: '_foo' is absent if `foo` was never accessed
        state.pop('_foo', None)
        return state

    def baz(self):
        return self.foo.readline()

It's made transparent so that you can just do:

@lazy_properties('foo')
class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    def baz(self):
        return self.foo.readline()
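
For reference, a minimal sketch of what such a lazy_properties decorator could look like. This is a hypothetical implementation written for this discussion, not the one that ended up in Fuel; it assumes the decorated class defines a load() method that sets every listed attribute:

def lazy_properties(*attrs):
    def wrap(cls):
        for attr in attrs:
            # Bind `attr` through a default argument to avoid the
            # late-binding closure pitfall inside the loop.
            def getter(self, _attr=attr):
                if not hasattr(self, '_' + _attr):
                    self.load()  # populates all lazy attributes
                return getattr(self, '_' + _attr)

            def setter(self, value, _attr=attr):
                setattr(self, '_' + _attr, value)

            setattr(cls, attr, property(getter, setter))

        def __getstate__(self):
            # Drop the loaded data so pickles stay small and portable.
            state = self.__dict__.copy()
            for attr in attrs:
                state.pop('_' + attr, None)
            return state

        cls.__getstate__ = __getstate__
        return cls
    return wrap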

Contributor Author

I still don't understand what a user should do when the dataset has been moved. Inherit and redefine load? That seems clumsy.

I would propose the name not_save or not_pickle for this decorator.

Member

The load method would actually not have the data path hard-coded; it would be read from another attribute, which the user could set. So there is no need to override load; you would simply change the attribute.

class Bar(object):
    path = '/home/data.txt'

    def load(self):
        self.foo = open(self.path)

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        # pop with a default: '_foo' is absent if `foo` was never accessed
        state.pop('_foo', None)
        return state

    def baz(self):
        return self.foo.readline()

And if you wanted to unpickle this and the path had changed, you would just change bar.path = '/home/new_location/data.txt'.
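
A minimal usage sketch of that round trip, with a made-up new location (using the Bar class above):

import pickle

bar = Bar()
pickled = pickle.dumps(bar)   # __getstate__ drops '_foo', so this works
                              # even if data.txt cannot be opened

bar = pickle.loads(pickled)   # always deserializes
bar.path = '/home/new_location/data.txt'  # fix the path programmatically
print(bar.baz())              # load() now opens the new location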

We need to factor out the path logic at some point though, or at least standardize it, because it will always be the same: by default, most datasets should read their data from FUEL_DATA_PATH/dataset_name/*. Most of the time, when data moves, you can just change FUEL_DATA_PATH. However, if the files for some reason aren't in a folder, or the filenames themselves have changed, you need to intervene programmatically, I guess. I'm not sure about making this an optional function call though; I'd prefer an approach where the folder and filenames are saved as attributes, which can simply be changed, e.g.:

@do_not_pickle_attributes('images', 'labels')
class MNIST(FileDataset):
    folder = 'mnist'
    files = {
        'train': {'images': 'train-images-idx3-ubyte', 'labels': 'train-labels-idx1-ubyte'},
        'test': {'images': 't10k-images-idx3-ubyte', 'labels': 't10k-labels-idx1-ubyte'}
    }

    def load_attributes(self):
        self.images = read_mnist_images(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['images']
        ))
        self.labels = read_mnist_labels(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['labels']
        ))

Now you can simply edit the folder or files attributes if the data moved. For this HDF5 class, you could do the same thing. We should maybe factor out the path joining though, e.g. into a mixin class with a get_file_path method that does something like:

class FileDataset(object):
    def get_file_path(self, file):
        folder = getattr(self, 'folder', '')
        return os.path.join(config.data_path, folder, file)
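
With that mixin in place, the MNIST sketch above could shrink to something like this (a hypothetical rewrite reusing the names from that example):

@do_not_pickle_attributes('images', 'labels')
class MNIST(FileDataset):
    folder = 'mnist'
    files = {
        'train': {'images': 'train-images-idx3-ubyte',
                  'labels': 'train-labels-idx1-ubyte'},
        'test': {'images': 't10k-images-idx3-ubyte',
                 'labels': 't10k-labels-idx1-ubyte'}
    }

    def load_attributes(self):
        # get_file_path takes care of joining config.data_path and folder.
        names = self.files[self.which_set]
        self.images = read_mnist_images(self.get_file_path(names['images']))
        self.labels = read_mnist_labels(self.get_file_path(names['labels']))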

sources_in_file : tuple of strings
    Names of nodes in the HDF5 file which contain sources. Should have
    the same length as `sources`. Optional; if not set, it will be equal
    to `sources`.
Member

There should be a blank line before closing the docstring.

@dmitriy-serdyuk (Contributor Author)

It seems that the error is not in my code.

@bartvm (Member) commented Feb 26, 2015

Yeah, Travis has been down all afternoon. Nothing we can do but take the night off! :)

@dmitriy-serdyuk (Contributor Author)

Hopefully this works now.

Do we need any other features from the HDF5 container?

@bartvm (Member) commented Mar 5, 2015

I'll merge it for now just so that we have support. It would be nice if this could inherit from IndexableDataset at some point, but we can do that later on.

bartvm added a commit that referenced this pull request Mar 5, 2015
bartvm merged commit 2cd17bb into mila-iqia:master Mar 5, 2015
vdumoulin referenced this pull request in laurent-dinh/fuel Mar 5, 2015
basveeling pushed a commit to basveeling/fuel that referenced this pull request Mar 22, 2017
Fixed random 2D rotation transformer to work with floats.