Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dogs vs cats #285

Merged
merged 4 commits into from Jan 28, 2016
Merged

Dogs vs cats #285

merged 4 commits into from Jan 28, 2016

Conversation

bartvm
Copy link
Member

@bartvm bartvm commented Jan 26, 2016

I added the Dogs vs. Cats dataset (for IFT6266). In general, it'd be nice if e.g. @vdumoulin could have a quick look to see if I'm doing it right. Also, few things I ran into:

  • Following RFC5987 the Content-Disposition headers are formed differently from what the code assumes i.e. instead of ending with filename=file they look something like filename=file; filename=*UTF8'file' or something weird like that.
  • Fuel prints "Downloading [target filename]" but the progress bar then displays the source filename, seems a bit odd.
  • If the filesize can't be determined, a progressbar without maxval is created, resulting in a default progress bar that goes to 100. When bar.update is called this gives an error and crashes.
  • H5PyDataset seems to have a hardcoded assumption that it is reshaping a batch, when you create an example stream instead it crashes when trying to reshape.

@bartvm
Copy link
Member Author

bartvm commented Jan 26, 2016

Failing because of unrelated test, fixed in #286

output_filename='dogs_vs_cats.hdf5'):
"""Converts the Dogs vs. Cats dataset to HDF5.

Converts the CIFAR-10 dataset to an HDF5 dataset compatible with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: CIFAR-10

@vdumoulin
Copy link
Contributor

@bartvm thanks a lot!

Aside from my inline comments, I'm wondering if we should encode JPEG images as in the ILSVRC2010 converter. One one hand, this might mean slower processing at training time, but on the other hand we would stick closer to the original data, which I think is one of Fuel's goals regarding reproducibility. What are your thoughts on this?

@bartvm
Copy link
Member Author

bartvm commented Jan 27, 2016

I'm not too sure about storing things in JPEG. I thought that for ILSVRC2010 this was decided mostly for memory considerations (JPEG is just so much more space efficient)? If not for that, isn't it faster (as you said) and more user-friendly to store them as arrays. It just seems more intuitive if a user can just open the file in HDF5 without needing to read JPEG blobs.

Reproducibility is important, but having the conversion code being public and part of Fuel arguably meets that principle.

@bartvm bartvm force-pushed the dogs_vs_cats branch 2 times, most recently from 4689003 to 433c74e Compare January 27, 2016 04:04
@dwf
Copy link
Contributor

dwf commented Jan 27, 2016

The ILSVRC call was for disk space reasons.

Iteration speed is definitely a concern considering how CPU-hungry JPEG
decoding is. It's not really practical without a transcode (which I'm
working on now) or a fuel-server process. It can take upwards of 1.7
seconds to read, decode, resize, and crop a batch of 100 images from
ILSVRC2010 -- the overwhelming bulk of that is decoding. Someone should
eventually look into parallelizing across cores with jpeg4py (a
threadsafe/GIL-aware wrapper around the non-legacy API of libjpeg-turbo),
I'm apparently doing CCW tickets again so maybe I'll look into it in March.

Sticking close to the original data isn't a problem, storing the results of
JPEG decoding are as good as storing the JPEG, just not nearly as space
efficient.

@vdumoulin
Copy link
Contributor

@dwf @bartvm sounds good!

@vdumoulin
Copy link
Contributor

Would you mind adding unit tests, at least for the downloader and the dataset class?

@bartvm
Copy link
Member Author

bartvm commented Jan 27, 2016

0662d46

@bartvm
Copy link
Member Author

bartvm commented Jan 27, 2016

Remaining Travis error is unrelated, so let me know if this is good to merge (that way I can tell the students in IFT6266 to use it tomorrow morning) :)

@dwf
Copy link
Contributor

dwf commented Jan 27, 2016

Just a note: reading HDF5 vlen data turns out to be pretty damn slow (close to a second with batch size 100 from local magnetic disk on ILSVRC2010). It's fine to keep it around, but it's worth considering maybe a more performant alternative for the case of images (honestly not sure what that might be).

Also, once I get a jpeg4py transformer going, because it is both fast and sensible about the GIL, JPEG decode time will be a much smaller deal than once believed.

@vdumoulin
Copy link
Contributor

The PR as it is looks good to me, you can merge. You might want to point the students to this PR I made a while ago documenting how to do parallelize data processing.

bartvm added a commit that referenced this pull request Jan 28, 2016
@bartvm bartvm merged commit db8dd6b into master Jan 28, 2016
@bartvm bartvm deleted the dogs_vs_cats branch January 28, 2016 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants