Dogs vs cats #285
Failing because of an unrelated test, fixed in #286.
output_filename='dogs_vs_cats.hdf5'):
"""Converts the Dogs vs. Cats dataset to HDF5.

Converts the CIFAR-10 dataset to an HDF5 dataset compatible with
Typo: CIFAR-10
@bartvm thanks a lot! Aside from my inline comments, I'm wondering if we should encode JPEG images as in the ILSVRC2010 converter. On one hand, this might mean slower processing at training time, but on the other hand we would stick closer to the original data, which I think is one of Fuel's goals regarding reproducibility. What are your thoughts on this?
I'm not too sure about storing things in JPEG. I thought that for ILSVRC2010 this was decided mostly for memory considerations (JPEG is just so much more space efficient)? If not for that, isn't it faster (as you said) and more user-friendly to store them as arrays? It just seems more intuitive if a user can just open the file in HDF5 without needing to read JPEG blobs. Reproducibility is important, but having the conversion code public and part of Fuel arguably meets that principle.
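To make the "store them as arrays" option concrete, here is a minimal sketch (not the code from this PR) of writing decoded images of different sizes into an HDF5 file with h5py. It assumes h5py and NumPy are installed; the dataset names `image_features` and `image_shapes` and the in-memory file are illustrative choices, and the tiny synthetic arrays stand in for decoded JPEGs.

```python
import h5py
import numpy as np

# Two fake RGB "images" with different shapes, standing in for decoded JPEGs.
images = [np.zeros((4, 6, 3), dtype='uint8'),
          np.ones((5, 3, 3), dtype='uint8')]

# driver='core' with backing_store=False keeps the file in memory only,
# so the sketch leaves nothing on disk.
with h5py.File('sketch.hdf5', 'w', driver='core', backing_store=False) as f:
    # Variable-length uint8 rows: each row holds one flattened image.
    vlen_uint8 = h5py.special_dtype(vlen=np.dtype('uint8'))
    features = f.create_dataset('image_features', (len(images),),
                                dtype=vlen_uint8)
    # Shapes are stored separately so the flat rows can be restored.
    shapes = f.create_dataset('image_shapes', (len(images), 3), dtype='int32')

    for i, image in enumerate(images):
        features[i] = image.flatten()
        shapes[i] = image.shape

    # Reading back needs no JPEG decoder, only a reshape.
    restored = features[1].reshape(tuple(shapes[1]))
```

With this layout a user can open the file and get pixel arrays directly, at the cost of more disk space than JPEG blobs would take.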
Force-pushed from 4689003 to 433c74e.
The ILSVRC call was for disk space reasons. Iteration speed is definitely a concern considering how CPU-hungry JPEG Sticking close to the original data isn't a problem, storing the results of
Would you mind adding unit tests, at least for the downloader and the dataset class?
Remaining Travis error is unrelated, so let me know if this is good to merge (that way I can tell the students in IFT6266 to use it tomorrow morning) :)
Just a note: reading HDF5 vlen data turns out to be pretty damn slow (close to a second with batch size 100 from local magnetic disk on ILSVRC2010). It's fine to keep it around, but it's worth considering maybe a more performant alternative for the case of images (honestly not sure what that might be). Also, once I get a jpeg4py transformer going, because it is both fast and sensible about the GIL, JPEG decode time will be a much smaller deal than once believed.
The PR as it is looks good to me, you can merge. You might want to point the students to this PR I made a while ago documenting how to parallelize data processing.
I added the Dogs vs. Cats dataset (for IFT6266). In general, it'd be nice if e.g. @vdumoulin could have a quick look to see if I'm doing it right. Also, a few things I ran into:
- Content-Disposition headers are formed differently from what the code assumes, i.e. instead of ending with filename=file they look something like filename=file; filename=*UTF8'file' or something weird like that.
- maxval is created, resulting in a default progress bar that goes to 100. When bar.update is called this gives an error and crashes.
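For the Content-Disposition issue, one way to stop assuming the header ends with filename=file is to parse its parameters explicitly. The sketch below is not Fuel's downloader code. The helper name `filename_from_content_disposition` is hypothetical, and it assumes the extended parameter follows the standard `filename*=charset'lang'value` form, which is a guess at what the "weird" header looked like. It uses only the standard library and does not handle semicolons inside quoted filenames.

```python
from urllib.parse import unquote


def filename_from_content_disposition(header):
    """Return the filename from a Content-Disposition value, or None."""
    params = {}
    for part in header.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep:
            # Keep the first occurrence of each parameter, minus any quotes.
            params.setdefault(name.strip().lower(), value.strip().strip('"'))

    # Prefer the extended form when it is well formed: charset'lang'value.
    extended = params.get('filename*')
    if extended is not None:
        pieces = extended.split("'", 2)
        if len(pieces) == 3:
            return unquote(pieces[2], encoding=pieces[0] or 'utf-8')

    # Fall back to the plain parameter (None if it is absent too).
    return params.get('filename')


plain = filename_from_content_disposition('attachment; filename=train.zip')
both = filename_from_content_disposition(
    "attachment; filename=train.zip; filename*=UTF-8''train.zip")
```

Both calls return the same name here, so a header carrying an extra extended parameter no longer breaks filename detection.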