
Add imagefolder dataset #2830

Merged (40 commits) on Mar 1, 2022

Conversation

@nateraw (Contributor) commented Aug 23, 2021

A generic imagefolder dataset inspired by torchvision.datasets.ImageFolder.

Resolves #2508


Example Usage:

Open In Colab

@nateraw (Author) left a comment

@lhoestq @albertvillanova I think I'm close here, but running into issue with features. How can I make this error go away?

ValueError: Please pass `features` or at least one example when writing data

I have a feeling it has something to do with the pa.schema lines that I'm seeing in json, csv, parquet, pandas, etc. Any tips?

Edit: figured it out, I think. 😅

(Resolved review thread on src/datasets/packaged_modules/imagefolder/imagefolder.py)
@nateraw nateraw marked this pull request as ready for review August 24, 2021 01:52
@nateraw nateraw changed the title [WIP] Add imagefolder dataset Add imagefolder dataset Aug 24, 2021
@nateraw (Author) commented Sep 2, 2021

@lhoestq @albertvillanova it would be super cool if we could get the Image Classification task to work with this. I'm not sure how to have the dataset find the unique label names after the dataset has been loaded. Is that even possible?

My hacky community version here does this, but it wouldn't pass the test suite here. Any thoughts?

@lhoestq (Member) commented Sep 6, 2021

Hi ! Dataset builders that require some data_files like csv or json are handled differently than actual dataset scripts.

In particular:

  • they are placed directly in the src folder of the lib so that you can use them without an internet connection (more exactly in src/datasets/packaged_modules/<builder_name>.py). So feel free to move the dataset python file there. You also need to register it in src/datasets/packaged_modules/__init__.py
  • they are handled a bit differently in our test suite (see the PackagedDatasetTest class in test_dataset_common.py). To be able to test the builder with your dummy data, you just need to modify get_packaged_dataset_dummy_data_files in test_dataset_common.py to return the right data_files for your builder. The dummy data can stay in datasets/image_folder/dummy

Let me know if you have questions or if I can help !

@nateraw (Author) commented Sep 7, 2021

Hey @lhoestq , I actually already did both of those things. I'm trying to get the image-classification task to work now.

For example, when you run ds = load_dataset('imagefolder', data_files='my_files') with a directory called ./my_files that looks like this:

my_files/
├── Cat/
│   ├── image1.jpg
│   └── ...
└── Dog/
    ├── image1.jpg
    └── ...

...we should set the dataset's labels feature to datasets.features.ClassLabel(names=['cat', 'dog']) dynamically, with class names found by listing the directories in my_files (via data_files). Otherwise the datasets.tasks.ImageClassification task will break, as the labels feature is not a ClassLabel.

I couldn't figure out how to access the data_files in the builder's _info function in a way that would pass in the test suite.
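A stdlib-only sketch of the dynamic label discovery described above (the helper name is hypothetical); the resulting names would feed datasets.features.ClassLabel(names=...):

```python
import os

def infer_label_names(data_files):
    # Each label is the directory immediately containing the image file,
    # mirroring torchvision.datasets.ImageFolder's layout convention.
    return sorted({os.path.basename(os.path.dirname(f)) for f in data_files})

names = infer_label_names(
    [
        "my_files/Cat/image1.jpg",
        "my_files/Cat/image2.jpg",
        "my_files/Dog/image1.jpg",
    ]
)
# names == ["Cat", "Dog"]
```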

@lhoestq (Member) commented Sep 7, 2021

Nice ! Then maybe you can use self.config.data_files in _info() ?
What error are you getting in the test suite ?

Also note that data_files was first developed to accept paths to actual files, not directories. In particular, it fetches the metadata of all the data_files to get a unique hash for the caching mechanism. So we may need to do a few changes first.
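One way the per-file metadata fetch could feed the cache hash, sketched with stdlib only (the function name and chosen fields are assumptions):

```python
import hashlib
import os
import tempfile

def data_files_fingerprint(paths):
    # Combine each file's path, size, and mtime into a single digest so the
    # cache is invalidated whenever any listed file changes.
    h = hashlib.sha256()
    for p in sorted(paths):
        st = os.stat(p)
        h.update(f"{p}:{st.st_size}:{int(st.st_mtime)}".encode("utf-8"))
    return h.hexdigest()

# Demo on a throwaway file
with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
    tmp.write(b"fake image bytes")
fingerprint = data_files_fingerprint([tmp.name])
```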

@lhoestq (Member) commented Nov 15, 2021

I'm trying to make it work by getting the label names in the _info automatically.
I'll let you know tomorrow how it goes :)

Also cc @mariosasko since we're going to use #3163

Right now I'm getting the label name per file by taking the first word (from regex \w+) after the common prefix of all the files per split
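That heuristic can be sketched as follows (the helper name is illustrative):

```python
import os
import re

def label_for_file(path, common_prefix):
    # First word (regex \w+) after the common prefix of all files in a split
    m = re.search(r"\w+", path[len(common_prefix):])
    return m.group(0) if m else None

files = ["data/train/cat/1.jpg", "data/train/dog/2.jpg"]
prefix = os.path.commonprefix(files)  # "data/train/"
labels = [label_for_file(f, prefix) for f in files]
# labels == ["cat", "dog"]
```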

@lhoestq (Member) commented Nov 15, 2021

Data files resolution takes too much time on my side for a dataset of a few 10,000s of examples. I'll speed it up with some multithreading tomorrow, and maybe by removing the unnecessary checksum verification
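A minimal sketch of that speedup with a thread pool (the callable that fetches one file's metadata is left abstract):

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_all(paths, fetch_one, max_workers=16):
    # Fetch per-file metadata concurrently instead of one request at a time;
    # fetch_one is whatever callable retrieves a single file's metadata.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, paths))

# Demo with a trivial stand-in for a real metadata request
sizes = resolve_all(["a.jpg", "bb.jpg", "ccc.jpg"], len)
# sizes == [5, 6, 7]
```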

@mariosasko (Collaborator) commented Feb 22, 2022

The code is a bit ugly for my taste. I'll try to simplify it tomorrow by avoiding the os.path.commonprefix computation and doing something similar to @nateraw's ImageFolder instead, where only the second-to-last path component is considered a label (and see if I can update the class labels lazily in _generate_examples).

Also, as discussed offline with @lhoestq, I reverted the automatic directory globbing change in data_files.py and will investigate if we can use data_dir for that (e.g. load_dataset("imagefolder", data_dir="path/to/data") would be equal to load_dataset("imagefolder", data_files=["path/to/data/**/*", "path/to/data/*"])). The only problem with data_dir is that it's equal to dl_manager.manual_dir, which would break scripts with manual_download_instructions, so maybe we can limit this behavior to the packaged loaders only? WDYT?
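The equivalence described above can be written down as a tiny hypothetical helper (posixpath keeps the patterns Hub-style on any OS):

```python
import posixpath

def data_dir_to_data_files(data_dir):
    # data_dir shorthand expanded into the two glob patterns discussed above
    return [posixpath.join(data_dir, "**", "*"), posixpath.join(data_dir, "*")]

patterns = data_dir_to_data_files("path/to/data")
# patterns == ["path/to/data/**/*", "path/to/data/*"]
```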

@mariosasko (Collaborator)

An updated example of usage: Open In Colab

@lhoestq (Member) commented Feb 23, 2022

The code is a bit ugly for my taste. I'll try to simplify it tomorrow by avoiding the os.path.commonprefix computation and doing something similar to @nateraw's ImageFolder instead, where only the second-to-last path component is considered a label (and see if I can update the class labels lazily in _generate_examples).

Sounds good ! It's fine if we just support the same format as pytorch ImageFolder.

Regarding the data_dir parameter, what do you think is best ?

  1. dl_manager.data_dir = data_dir
  2. dl_manager.data_files = resolve(os.path.join(data_dir, "**"))

or something else ?

The only problem with data_dir is that it's equal to dl_manager.manual_dir, which would break scripts with manual_download_instructions, so maybe we can limit this behavior to the packaged loaders only? WDYT?

We can still have dl_manager.manual_dir = data_dir though

@lhoestq (Member) commented Feb 23, 2022

The example colab is amazing !

@mariosasko (Collaborator) commented Feb 23, 2022

@lhoestq

Regarding the data_dir parameter, what do you think is best ?

  1. dl_manager.data_dir = data_dir
  2. dl_manager.data_files = resolve(os.path.join(data_dir, "**"))

The second option. Basically, I would like data_files to be equal to:

def _split_generators(self, dl_manager):
    data_files = self.config.data_files
    if data_files is None:
        data_files = glob.glob(f"{self.config.data_dir}/**", recursive=True)
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")

in the scripts of packaged modules. It's probably better to do the resolution in data_files.py though (to handle relative file paths on the Hub, for instance)

@lhoestq (Member) commented Feb 23, 2022

The second option. Basically, I would like data_files to be equal to:

def _split_generators(self, dl_manager):
    data_files = self.config.data_files
    if data_files is None:
        data_files = glob.glob(f"{self.config.data_dir}/**", recursive=True)
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")

in the scripts of packaged modules. It's probably better to do the resolution in data_files.py though (to handle relative file paths on the Hub, for instance)

sounds good !

@albertvillanova (Member) left a comment

Awesome!!!

@davanstrien (Member)

🙌

@nateraw (Author) commented Feb 24, 2022

Hey @mariosasko are we still actually able to load an image folder?

For example...

! wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
! unzip kagglecatsanddogs_3367a.zip

followed by

from datasets import load_dataset

# Does not work
ds = load_dataset('imagefolder', data_files='/PetImages')

# Also doesn't work
ds = load_dataset('imagefolder', data_dir='/PetImages')

Are we going forward with the assumption that the user always wants to download from a URL and that they won't have a dataset locally already? This at least gets us part of the way, but is technically not an "imagefolder" as intended.

Either way, was delighted to see the colab notebook work smoothly outside of the case I just described above. ❤️ thanks so much for the work here.

@davanstrien (Member) commented Feb 25, 2022

Hey @mariosasko are we still actually able to load an image folder?

For example...

! wget https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
! unzip kagglecatsanddogs_3367a.zip

followed by

from datasets import load_dataset

# Does not work
ds = load_dataset('imagefolder', data_files='/PetImages')

I ran into this too when I was trying it out. At the moment you can still load from a local on-disk directory using a glob pattern, i.e.

from datasets import load_dataset
ds = load_dataset("imagefolder", data_files="PetImages/**/*")

Colab example. I'm not sure if that is the intended behaviour or not. If it is, I think it would be good to document this because I also assumed the approach @nateraw used would work.

@lhoestq (Member) left a comment

Awesome thank you !

Feel free to add some logger.info or logger.debug here and there if you want :)

(Resolved review thread on src/datasets/packaged_modules/imagefolder/imagefolder.py)
Code under review:

labels.add(os.path.basename(os.path.dirname(downloaded_dir_file)))
data_files = self.config.data_files
downloaded_data_files = dl_manager.download_and_extract(data_files)

(Member)

This is unexpectedly long when you have a folder of 100,000 images, it gets stuck for ~2min on "Extracting data files"

(Collaborator)

Even if you set ignore_verifications to True in load_dataset ?

(Member)

Yes even with ignore_verifications=True

(Member)

I tried with a local directory containing around 100,000 images from here: http://cs231n.stanford.edu/tiny-imagenet-200.zip

(Collaborator)

After discussing the issue via Slack, the problem seems to stem from ExtractorManager's checks. We can optimize this part by using dl_manager.download on an image file and dl_manager.download_and_extract on archives. I'll address this in a separate PR.
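The proposed optimization could look roughly like this (the extension list and helper name are illustrative; dl_manager stands for a datasets.DownloadManager):

```python
ARCHIVE_EXTENSIONS = (".zip", ".tar", ".tar.gz", ".tgz")

def fetch_data_files(dl_manager, files):
    # Only archives go through extraction; plain image files are just
    # downloaded, skipping the per-file extraction checks that caused
    # the slowdown on large image folders.
    return [
        dl_manager.download_and_extract(f)
        if f.endswith(ARCHIVE_EXTENSIONS)
        else dl_manager.download(f)
        for f in files
    ]
```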

(Resolved review threads on src/datasets/packaged_modules/imagefolder/imagefolder.py and docs/source/loading.rst)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@mariosasko mariosasko merged commit 207be67 into huggingface:master Mar 1, 2022
Successfully merging this pull request may close these issues.

Load Image Classification Dataset from Local
5 participants