Dogs vs cats #285

bartvm · 2016-01-26T03:07:12Z

I added the Dogs vs. Cats dataset (for IFT6266). In general, it'd be nice if e.g. @vdumoulin could have a quick look to see if I'm doing it right. Also, a few things I ran into:

Following RFC5987, the Content-Disposition headers are formed differently from what the code assumes, i.e. instead of ending with filename=file they look something like filename=file; filename*=UTF-8''file or something weird like that.
Fuel prints "Downloading [target filename]" but the progress bar then displays the source filename, which seems a bit odd.
If the file size can't be determined, a progress bar without maxval is created, resulting in a default progress bar that goes to 100. When bar.update is called, this gives an error and crashes.
H5PyDataset seems to have a hardcoded assumption that it is reshaping a batch; when you create an example stream instead, it crashes when trying to reshape.

bartvm · 2016-01-26T03:57:49Z

Failing because of unrelated test, fixed in #286

vdumoulin · 2016-01-26T14:16:15Z

fuel/converters/dogs_vs_cats.py

+                         output_filename='dogs_vs_cats.hdf5'):
+    """Converts the Dogs vs. Cats dataset to HDF5.
+
+    Converts the CIFAR-10 dataset to an HDF5 dataset compatible with


Typo: CIFAR-10

vdumoulin · 2016-01-26T14:26:27Z

@bartvm thanks a lot!

Aside from my inline comments, I'm wondering if we should encode JPEG images as in the ILSVRC2010 converter. On one hand, this might mean slower processing at training time, but on the other hand we would stick closer to the original data, which I think is one of Fuel's goals regarding reproducibility. What are your thoughts on this?

bartvm · 2016-01-27T03:50:45Z

I'm not too sure about storing things in JPEG. I thought that for ILSVRC2010 this was decided mostly for memory considerations (JPEG is just so much more space efficient)? If not for that, isn't it faster (as you said) and more user-friendly to store them as arrays? It just seems more intuitive if a user can just open the file in HDF5 without needing to read JPEG blobs.

Reproducibility is important, but having the conversion code be public and part of Fuel arguably meets that principle.

dwf · 2016-01-27T05:39:03Z

The ILSVRC call was for disk space reasons.

Iteration speed is definitely a concern considering how CPU-hungry JPEG
decoding is. It's not really practical without a transcode (which I'm
working on now) or a fuel-server process. It can take upwards of 1.7
seconds to read, decode, resize, and crop a batch of 100 images from
ILSVRC2010 -- the overwhelming bulk of that is decoding. Someone should
eventually look into parallelizing across cores with jpeg4py (a
threadsafe/GIL-aware wrapper around the non-legacy API of libjpeg-turbo),
I'm apparently doing CCW tickets again so maybe I'll look into it in March.

Sticking close to the original data isn't a problem: storing the results of
JPEG decoding is as good as storing the JPEG, just not nearly as space
efficient.
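
To illustrate the idea of spreading per-image decoding across cores, here is a rough sketch using a thread pool. It is not the jpeg4py transformer mentioned above: `decode_batch` is a hypothetical helper, and `zlib.decompress` is used purely as a stand-in decoder so the example runs with only the standard library. Threads only help if the decoder releases the GIL while it works, which is the property attributed to jpeg4py in this thread.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decode_batch(blobs, decode, num_workers=4):
    """Apply `decode` to every blob using a pool of threads.

    Hypothetical helper for illustration. Results come back in the same
    order as `blobs`, because `Executor.map` preserves input order.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(decode, blobs))


# Stand-in data: 100 compressed byte strings instead of 100 JPEG files.
blobs = [zlib.compress(bytes([i]) * 1000) for i in range(100)]

# Stand-in decoder: zlib.decompress instead of a real JPEG decoder.
decoded = decode_batch(blobs, zlib.decompress)
```

With a real image decoder, `decode` would take the raw JPEG bytes and return an array; the rest of the structure would stay the same.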

vdumoulin · 2016-01-27T13:13:22Z

@dwf @bartvm sounds good!

vdumoulin · 2016-01-27T13:18:38Z

Would you mind adding unit tests, at least for the downloader and the dataset class?

bartvm · 2016-01-27T17:04:33Z

0662d46

bartvm · 2016-01-27T22:16:22Z

Remaining Travis error is unrelated, so let me know if this is good to merge (that way I can tell the students in IFT6266 to use it tomorrow morning) :)

dwf · 2016-01-27T22:23:15Z

Just a note: reading HDF5 vlen data turns out to be pretty damn slow (close to a second with batch size 100 from local magnetic disk on ILSVRC2010). It's fine to keep it around, but it's worth considering maybe a more performant alternative for the case of images (honestly not sure what that might be).

Also, once I get a jpeg4py transformer going, because it is both fast and sensible about the GIL, JPEG decode time will be a much smaller deal than once believed.

vdumoulin · 2016-01-28T02:36:44Z

The PR as it is looks good to me, you can merge. You might want to point the students to this PR I made a while ago documenting how to parallelize data processing.

vdumoulin reviewed Jan 26, 2016

bartvm force-pushed the dogs_vs_cats branch 2 times, most recently from 4689003 to 433c74e on January 27, 2016 04:04

bartvm force-pushed the dogs_vs_cats branch from ffb40d8 to 0662d46 on January 27, 2016 17:04

bartvm added 4 commits January 27, 2016 16:12

RFC5987 support and show target not source filename when downloading

8ed009b

Add Dogs vs. Cats dataset

2d7692a

Fix error in case the download size isn't known

159e935

Add Dogs vs. Cats tests

a16d651

bartvm force-pushed the dogs_vs_cats branch from a7689a4 to a16d651 on January 27, 2016 21:12

bartvm added a commit that referenced this pull request Jan 28, 2016

Merge pull request #285 from mila-udem/dogs_vs_cats

db8dd6b

Dogs vs cats

bartvm merged commit db8dd6b into master Jan 28, 2016

bartvm deleted the dogs_vs_cats branch January 28, 2016 03:43
