
Add imagefolder dataset #2830

Merged (40 commits) on Mar 1, 2022

Conversation

@nateraw (Contributor) commented Aug 23, 2021

A generic imagefolder dataset inspired by torchvision.datasets.ImageFolder.

Resolves #2508


Example Usage:

Open In Colab

@nateraw (Author) left a comment

@lhoestq @albertvillanova I think I'm close here, but running into issue with features. How can I make this error go away?

ValueError: Please pass `features` or at least one example when writing data

I have a feeling it has something to do with the pa.schema lines that I'm seeing in json, csv, parquet, pandas, etc. Any tips?

Edit: figured it out, I think. 😅

(Resolved review thread on src/datasets/packaged_modules/imagefolder/imagefolder.py)
@nateraw nateraw marked this pull request as ready for review August 24, 2021 01:52
@nateraw nateraw changed the title [WIP] Add imagefolder dataset Add imagefolder dataset Aug 24, 2021
@nateraw (Author) commented Sep 2, 2021

@lhoestq @albertvillanova it would be super cool if we could get the Image Classification task to work with this. I'm not sure how to have the dataset find the unique label names after the dataset has been loaded. Is that even possible?

My hacky community version here does this, but it wouldn't pass the test suite here. Any thoughts?

@lhoestq (Member) commented Sep 6, 2021

Hi ! Dataset builders that require some data_files like csv or json are handled differently than actual dataset scripts.

In particular:

  • they are placed directly in the src folder of the lib so that you can use them without an internet connection (more exactly in src/datasets/packaged_modules/<builder_name>.py). So feel free to move the dataset python file there. You also need to register it in src/datasets/packaged_modules/__init__.py
  • they are handled a bit differently in our test suite (see the PackagedDatasetTest class in test_dataset_common.py). To be able to test the builder with your dummy data, you just need to modify get_packaged_dataset_dummy_data_files in test_dataset_common.py to return the right data_files for your builder. The dummy data can stay in datasets/image_folder/dummy

Let me know if you have questions or if I can help !

@nateraw (Author) commented Sep 7, 2021

Hey @lhoestq , I actually already did both of those things. I'm trying to get the image-classification task to work now.

For example, when you run ds = load_dataset('imagefolder', data_files='my_files') with a directory called ./my_files that looks like this:

my_files/
├── Cat/
│   ├── image1.jpg
│   └── ...
└── Dog/
    ├── image1.jpg
    └── ...

...we should set the dataset's labels feature to datasets.features.ClassLabel(names=['cat', 'dog']) dynamically, with class names found by listing the directories in my_files (via data_files). Otherwise the datasets.tasks.ImageClassification task will break, as the labels feature is not a ClassLabel.

I couldn't figure out how to access the data_files in the builder's _info function in a way that would pass in the test suite.
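A stdlib-only sketch of the dynamic label discovery described above (the helper name is hypothetical); the resulting names would feed datasets.features.ClassLabel(names=...):

```python
import os

def infer_label_names(data_files):
    # Each label is the directory immediately containing the image file,
    # mirroring torchvision.datasets.ImageFolder's layout convention.
    return sorted({os.path.basename(os.path.dirname(f)) for f in data_files})

names = infer_label_names(
    [
        "my_files/Cat/image1.jpg",
        "my_files/Cat/image2.jpg",
        "my_files/Dog/image1.jpg",
    ]
)
# names == ["Cat", "Dog"]
```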

@lhoestq (Member) commented Sep 7, 2021

Nice ! Then maybe you can use self.config.data_files in _info() ?
What error are you getting in the test suite ?

Also note that data_files was first developed to accept paths to actual files, not directories. In particular, it fetches the metadata of all the data_files to get a unique hash for the caching mechanism. So we may need to do a few changes first.
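One way the per-file metadata fetch could feed the cache hash, sketched with stdlib only (the function name and chosen fields are assumptions):

```python
import hashlib
import os
import tempfile

def data_files_fingerprint(paths):
    # Combine each file's path, size, and mtime into a single digest so the
    # cache is invalidated whenever any listed file changes.
    h = hashlib.sha256()
    for p in sorted(paths):
        st = os.stat(p)
        h.update(f"{p}:{st.st_size}:{int(st.st_mtime)}".encode("utf-8"))
    return h.hexdigest()

# Demo on a throwaway file
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"fake image bytes")
fingerprint = data_files_fingerprint([tmp.name])
```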

@lhoestq (Member) commented Nov 15, 2021

I'm trying to make it work by getting the label names in the _info automatically.
I'll let you know tomorrow how it goes :)

Also cc @mariosasko since we're going to use #3163

Right now I'm getting the label name per file by taking the first word (from regex \w+) after the common prefix of all the files per split
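That heuristic can be sketched as follows (the helper name is illustrative):

```python
import os
import re

def label_for_file(path, common_prefix):
    # First word (regex \w+) after the common prefix of all files in a split
    m = re.search(r"\w+", path[len(common_prefix):])
    return m.group(0) if m else None

files = ["data/train/cat/1.jpg", "data/train/dog/2.jpg"]
prefix = os.path.commonprefix(files)  # "data/train/"
labels = [label_for_file(f, prefix) for f in files]
# labels == ["cat", "dog"]
```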

@lhoestq (Member) commented Nov 15, 2021

Data files resolution takes too much time on my side for a dataset of a few 10,000s of examples. I'll speed it up with some multithreading tomorrow, and maybe by removing the unnecessary checksum verification
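A minimal sketch of that speedup with a thread pool (the callable that fetches one file's metadata is left abstract):

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_all(paths, fetch_one, max_workers=16):
    # Fetch per-file metadata concurrently instead of one request at a time;
    # fetch_one is whatever callable retrieves a single file's metadata.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, paths))

# Demo with a trivial stand-in for a real metadata request
sizes = resolve_all(["a.jpg", "bb.jpg", "ccc.jpg"], len)
# sizes == [5, 6, 7]
```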

@mariosasko (Collaborator) commented Feb 22, 2022

The code is a bit ugly for my taste. I'll try to simplify it tomorrow by avoiding the os.path.commonprefix computation and doing something similar to @nateraw's ImageFolder instead, where only the second-to-last path component is considered a label (and see if I can update the class labels lazily in _generate_examples).

Also, as discussed offline with @lhoestq, I reverted the automatic directory globbing change in data_files.py and will investigate if we can use data_dir for that (e.g. load_dataset("imagefolder", data_dir="path/to/data") would be equal to load_dataset("imagefolder", data_files=["path/to/data/**/*", "path/to/data/*"])). The only problem with data_dir is that it's equal to dl_manager.manual_dir, which would break scripts with manual_download_instructions, so maybe we can limit this behavior to the packaged loaders only? WDYT?
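The equivalence described above can be written down as a tiny hypothetical helper (posixpath keeps the patterns Hub-style on any OS):

```python
import posixpath

def data_dir_to_data_files(data_dir):
    # data_dir shorthand expanded into the two glob patterns discussed above
    return [posixpath.join(data_dir, "**", "*"), posixpath.join(data_dir, "*")]

patterns = data_dir_to_data_files("path/to/data")
# patterns == ["path/to/data/**/*", "path/to/data/*"]
```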

@mariosasko (Collaborator)

An updated example of usage: Open In Colab

@lhoestq (Member) commented Feb 23, 2022

The code is a bit ugly for my taste. I'll try to simplify it tomorrow by avoiding the os.path.commonprefix computation and doing something similar to @nateraw's ImageFolder instead, where only the second-to-last path component is considered a label (and see if I can update the class labels lazily in _generate_examples).

Sounds good ! It's fine if we just support the same format as pytorch ImageFolder.

Regarding the data_dir parameter, what do you think is best ?

  1. dl_manager.data_dir = data_dir
  2. dl_manager.data_files = resolve(os.path.join(data_dir, "**"))

or something else ?

The only problem with data_dir is that it's equal to dl_manager.manual_dir, which would break scripts with manual_download_instructions, so maybe we can limit this behavior to the packaged loaders only? WDYT?

We can still have dl_manager.manual_dir = data_dir though

@lhoestq (Member) commented Feb 23, 2022

The example colab is amazing !

@mariosasko (Collaborator) commented Feb 23, 2022

@lhoestq

Regarding the data_dir parameter, what do you think is best ?

  1. dl_manager.data_dir = data_dir
  2. dl_manager.data_files = resolve(os.path.join(data_dir, "**"))

The second option. Basically, I would like data_files to be equal to:

def _split_generators(self, dl_manager):
    data_files = self.config.data_files
    if data_files is None:
        data_files = glob.glob(f"{self.config.data_dir}/**", recursive=True)
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")

in the scripts of packaged modules. It's probably better to do the resolution in data_files.py though (to handle relative file paths on the Hub, for instance)

@lhoestq (Member) commented Feb 23, 2022

The second option. Basically, I would like data_files to be equal to:

def _split_generators(self, dl_manager):
    data_files = self.config.data_files
    if data_files is None:
        data_files = glob.glob(f"{self.config.data_dir}/**", recursive=True)
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")

in the scripts of packaged modules. It's probably better to do the resolution in data_files.py though (to handle relative file paths on the Hub, for instance)

sounds good !

@albertvillanova (Member) left a comment

Awesome!!!

@davanstrien (Member)

🙌

@nateraw (Author) commented Feb 24, 2022

Hey @mariosasko are we still actually able to load an image folder?

For example...

! wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
! unzip kagglecatsanddogs_3367a.zip

followed by

from datasets import load_dataset

# Does not work
ds = load_dataset('imagefolder', data_files='/PetImages')

# Also doesn't work
ds = load_dataset('imagefolder', data_dir='/PetImages')

Are we going forward with the assumption that the user always wants to download from a URL and that they won't have a dataset locally already? This at least gets us part of the way, but is technically not an "imagefolder" as intended.

Either way, was delighted to see the colab notebook work smoothly outside of the case I just described above. ❤️ thanks so much for the work here.

@davanstrien (Member) commented Feb 25, 2022

Hey @mariosasko are we still actually able to load an image folder?

For example...

! wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
! unzip kagglecatsanddogs_3367a.zip

followed by

from datasets import load_dataset

# Does not work
ds = load_dataset('imagefolder', data_files='/PetImages')

I ran into this too when I was trying it out. At the moment you can still load from a local on-disk directory using a glob pattern, i.e.

from datasets import load_dataset
ds = load_dataset("imagefolder", data_files="PetImages/**/*")

Colab example. I'm not sure if that is the intended behaviour or not. If it is, I think it would be good to document this because I also assumed the approach @nateraw used would work.

@lhoestq (Member) left a comment

Awesome thank you !

Feel free to add some logger.info or logger.debug here and there if you want :)

(Resolved review thread on src/datasets/packaged_modules/imagefolder/imagefolder.py)
Code under review:

labels.add(os.path.basename(os.path.dirname(downloaded_dir_file)))
data_files = self.config.data_files
downloaded_data_files = dl_manager.download_and_extract(data_files)

(Member)

This is unexpectedly long when you have a folder of 100,000 images, it gets stuck for ~2min on "Extracting data files"

(Collaborator)

Even if you set ignore_verifications to True in load_dataset ?

(Member)

Yes even with ignore_verifications=True

(Member)

I tried with a local directory containing around 100,000 images from here: http://cs231n.stanford.edu/tiny-imagenet-200.zip

(Collaborator)

After discussing the issue via Slack, the problem seems to stem from ExtractorManager's checks. We can optimize this part by using dl_manager.download on an image file and dl_manager.download_and_extract on archives. I'll address this in a separate PR.
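The proposed optimization could look roughly like this (the extension list and helper name are illustrative; dl_manager stands for a datasets.DownloadManager):

```python
ARCHIVE_EXTENSIONS = (".zip", ".tar", ".tar.gz", ".tgz")

def fetch_data_files(dl_manager, files):
    # Only archives go through extraction; plain image files are just
    # downloaded, skipping the per-file extraction checks that caused
    # the slowdown on large image folders.
    return [
        dl_manager.download_and_extract(f)
        if f.endswith(ARCHIVE_EXTENSIONS)
        else dl_manager.download(f)
        for f in files
    ]
```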

(Resolved review threads on src/datasets/packaged_modules/imagefolder/imagefolder.py and docs/source/loading.rst)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@mariosasko mariosasko merged commit 207be67 into huggingface:master Mar 1, 2022
Successfully merging this pull request may close these issues.

Load Image Classification Dataset from Local
5 participants