Added HDF5 dataset #9
Conversation
```python
def __setstate__(self, state):
    self.__dict__ = state
    # Open HDF5 file again and sets up `nodes`.
    self.nodes = self._open_file()
```
This will fail to unpickle if the HDF5 file is not present in the exact same location. This is very bad and one of the things about pylearn2 that we're trying to get away from.
If the file is not present in the same location, dill will also give inconsistent results after loading a file handle.
Well, we're ditching dill AFAIK, but the point is that we should always be able to deserialize the dataset object, regardless of whether the data file can be loaded. That way you at least have the opportunity to fix the path programmatically.
This can be a big problem if you need to deserialize an experiment on another machine and the path where your data was on the original machine isn't writable on the new one (and you are not root). The fact that your dataset object can't be unpickled means that any object holding a reference to the dataset can't be unpickled either. Your options then are to write a custom `Unpickler` class or to edit the file path in the pickle binary; neither is particularly appealing.
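The failure mode described above can be sketched with a small self-contained example (illustrative only, not the PR's code): a `__setstate__` that eagerly reopens the file makes the whole pickle unloadable once the file moves.

```python
import os
import pickle
import tempfile

class BadDataset(object):
    """Illustrative sketch: reopening the file eagerly in __setstate__
    makes unpickling fail outright if the data file has moved."""
    def __init__(self, path):
        self.path = path
        self.handle = open(path)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['handle']  # file handles can't be pickled
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        self.handle = open(self.path)  # raises if the path is gone

tmp = tempfile.NamedTemporaryFile('w', delete=False)
tmp.write('data')
tmp.close()

ds = BadDataset(tmp.name)
blob = pickle.dumps(ds)
ds.handle.close()
os.remove(tmp.name)  # simulate the data file moving away

try:
    pickle.loads(blob)  # the whole object graph is now unloadable
except FileNotFoundError as e:
    print('unpickling failed:', e)
```

Anything holding a reference to `ds` would fail to unpickle in exactly the same way.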
@bartvm has some way of dealing with this for in-memory datasets called "lazy properties". I don't have all the details of his implementation in my head, but these things (or something very close) can be applied here too.
Lazy properties would actually work here as well. Maybe I should just rename them so that they can be used in general, instead of belonging to the `InMemoryDataset` class, because it's silly to reimplement this logic for each dataset.
The idea is very simple: you define a series of attributes that are set in a `load` method. The `load` method is only called when an attribute is requested. When pickling, you delete all these attributes. Basically, it's just this:
```python
class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_foo']
        return state

    def baz(self):
        return self.foo.readline()
```
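A quick sketch of the pickle round-trip this enables (using an in-memory value instead of a file handle, purely for illustration, and `pop` instead of `del` so pickling works even before the first access):

```python
import pickle

class Bar(object):
    """Stand-in for the class above, using an in-memory value instead
    of a file handle so the example is picklable end to end."""
    def load(self):
        self.foo = 'loaded'  # the real load() would open the data file

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_foo', None)  # drop the lazy attribute when pickling
        return state

bar = Bar()
print(bar.foo)  # 'loaded' -- first access triggers load()
restored = pickle.loads(pickle.dumps(bar))
print('_foo' in restored.__dict__)  # False -- dropped in __getstate__
print(restored.foo)  # 'loaded' -- load() runs again transparently
```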
It's made transparent so that you can just do:
```python
@lazy_properties('foo')
class Bar(object):
    def load(self):
        self.foo = open('data.txt')

    def baz(self):
        return self.foo.readline()
```
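One way such a decorator could be sketched (hypothetical; the actual Fuel implementation may differ): for each name it installs a lazy property backed by `load()`, plus a `__getstate__` that drops the cached values.

```python
def lazy_properties(*names):
    """Hypothetical sketch: each name becomes a lazy property loaded
    on first access via load(), and is excluded from pickling."""
    def wrap(cls):
        for name in names:
            key = '_' + name

            # Bind `key` as a default argument to avoid the
            # late-binding closure pitfall in the loop.
            def getter(self, key=key):
                if not hasattr(self, key):
                    self.load()
                return getattr(self, key)

            def setter(self, value, key=key):
                setattr(self, key, value)

            setattr(cls, name, property(getter, setter))

        def __getstate__(self):
            state = self.__dict__.copy()
            for name in names:
                state.pop('_' + name, None)
            return state

        cls.__getstate__ = __getstate__
        return cls
    return wrap

@lazy_properties('foo')
class Bar(object):
    def load(self):
        self.foo = 'data'

bar = Bar()
print(bar.foo)  # 'data' -- load() ran lazily on first access
print('_foo' in bar.__getstate__())  # False -- excluded from pickling
```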
I still don't understand what a user should do in the case when the dataset was moved. Inherit and redefine `load`? It seems clumsy.
I can propose the name `not_save` or `not_pickle` for this decorator.
The `load` method would actually not have the data path hard-coded; it would be read from another attribute, which the user could set. So there is no need to overwrite `load`; you would simply change the attribute.
```python
class Bar(object):
    path = '/home/data.txt'

    def load(self):
        self.foo = open(self.path)

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_foo']
        return state

    def baz(self):
        return self.foo.readline()
```
And if you wanted to unpickle this and the path had changed, you would just set `bar.path = '/home/new_location/data.txt'`.
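A self-contained sketch of that recovery path, using a temporary file to stand in for the data (here `foo` holds the file's contents rather than an open handle, and `__getstate__` uses `pop` instead of `del`, so the example runs end to end):

```python
import os
import pickle
import tempfile

class Bar(object):
    """Example class from above; the data path is a plain attribute
    that can be fixed after unpickling."""
    path = None

    def load(self):
        with open(self.path) as f:
            self.foo = f.read()

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self.load()
        return self._foo

    @foo.setter
    def foo(self, value):
        self._foo = value

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_foo', None)
        return state

# Write the data file and pickle the dataset object.
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('hello')
tmp.close()

bar = Bar()
bar.path = tmp.name
blob = pickle.dumps(bar)

# Simulate the data moving to a new location.
new_path = tmp.name + '.moved'
os.rename(tmp.name, new_path)

restored = pickle.loads(blob)  # deserializing always succeeds
restored.path = new_path       # fix the path programmatically
print(restored.foo)  # 'hello'
```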
We need to factor out the path logic at some point though, or at least standardize it, because it will always be the same: by default, most datasets should read their data from `FUEL_DATA_PATH/dataset_name/*`. Most of the time, when data moves, you can just change `FUEL_DATA_PATH`. However, if the files for some reason aren't in a folder, or the filenames themselves have changed, you need to intervene programmatically I guess. I'm not sure about making this an optional function call though. I'd prefer a method where the folder and filenames are saved as attributes, which can simply be changed, e.g.
```python
@do_not_pickle_attributes('images', 'labels')
class MNIST(FileDataset):
    folder = 'mnist'
    files = {
        'train': {'images': 'train-images-idx3-ubyte',
                  'labels': 'train-labels-idx1-ubyte'},
        'test': {'images': 't10k-images-idx3-ubyte',
                 'labels': 't10k-labels-idx1-ubyte'}
    }

    def load_attributes(self):
        self.images = read_mnist_images(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['images']))
        self.labels = read_mnist_labels(os.path.join(
            config.data_path, self.folder, self.files[self.which_set]['labels']))
```
Now you can simply edit the `folder` or `files` attributes if the data moved. For this HDF5 class, you could do the same thing. Maybe we should factor out the path joining though, perhaps into a mixin class with a `get_file_path` method that does something like:
```python
class FileDataset(object):
    def get_file_path(self, file):
        folder = getattr(self, 'folder', '')
        return os.path.join(config.data_path, folder, file)
```
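Usage would then look like the following sketch (the mixin is repeated, and a `Config` stand-in replaces Fuel's `config` object, so the snippet runs standalone):

```python
import os

class Config(object):
    data_path = '/data'  # stand-in for Fuel's config.data_path

config = Config()

class FileDataset(object):
    def get_file_path(self, file):
        folder = getattr(self, 'folder', '')
        return os.path.join(config.data_path, folder, file)

class MNIST(FileDataset):
    folder = 'mnist'

print(MNIST().get_file_path('train-images-idx3-ubyte'))
# e.g. '/data/mnist/train-images-idx3-ubyte' on POSIX systems
```

Datasets without a `folder` attribute fall back to reading straight from `data_path`.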
```python
sources_in_file : tuple of strings
    Names of nodes in the HDF5 file which contain sources. Should be
    the same length as `sources`. Optional; if not set, will be equal
    to `sources`.
```
There should be a blank line before closing the docstring.
Seems that the error is not in my code.
Yeah, Travis has been down all afternoon. Nothing we can do but take the night off! :)
Conflicts: .travis.yml tests/test_datasets.py
Conflicts: fuel/datasets/binarized_mnist.py
Hope this works now. Do we need some other features from the HDF5 container?
I'll merge it for now just so that we have support, but it would be nice if this could inherit from the
Fixed random 2D rotation transformer to work with floats.
I added a simple HDF5 dataset.