Add imagefolder dataset #2830
Conversation
@lhoestq @albertvillanova I think I'm close here, but I'm running into an issue with `features`. How can I make this error go away?

```
ValueError: Please pass `features` or at least one example when writing data
```

I have a feeling it has something to do with the `pa.schema` lines that I'm seeing in `json`, `csv`, `parquet`, `pandas`, etc. Any tips?

Edit: figured it out, I think. 😅
@lhoestq @albertvillanova it would be super cool if we could get the Image Classification task to work with this. I'm not sure how to have the dataset find the unique label names after the dataset has been loaded. Is that even possible? My hacky community version here does this, but it wouldn't pass the test suite here. Any thoughts?
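One common convention, the one `torchvision.datasets.ImageFolder` follows, is to treat each file's parent directory name as its label. A minimal stdlib sketch of collecting unique label names that way (the paths below are hypothetical examples, not from the PR):

```python
import os

def infer_labels(file_paths):
    """Collect unique label names, taking each file's parent
    directory name as its label (the ImageFolder convention)."""
    labels = set()
    for path in file_paths:
        labels.add(os.path.basename(os.path.dirname(path)))
    return sorted(labels)

# Hypothetical layout: PetImages/<label>/<file>
paths = [
    "PetImages/Cat/1.jpg",
    "PetImages/Cat/2.jpg",
    "PetImages/Dog/1.jpg",
]
print(infer_labels(paths))  # ['Cat', 'Dog']
```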
Hi ! Dataset builders that require some … In particular:

Let me know if you have questions or if I can help !
Hey @lhoestq, I actually already did both of those things. I'm trying to get the … For example, when you run … we should set the dataset's … I couldn't figure out how to access the …
Nice ! Then maybe you can use … Also note that …
I'm trying to make it work by getting the label names in `_info` automatically. Also cc @mariosasko since we're going to use #3163. Right now I'm getting the label name per file by taking the first word (from a regex …)
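As a rough illustration of that idea (the exact regex used in the PR is not shown here, so the pattern below is an assumption): take the leading run of letters in a filename as its label.

```python
import re

def label_from_filename(filename):
    # Assumption for illustration: the label is the leading run of
    # letters, e.g. "cat" in "cat.0.jpg". Not the PR's actual regex.
    match = re.match(r"[A-Za-z]+", filename)
    return match.group(0) if match else None

print(label_from_filename("cat.0.jpg"))    # cat
print(label_from_filename("dog_123.png"))  # dog
```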
Data files resolution takes too much time on my side for a dataset of a few 10,000s of examples. I'll speed it up with some multithreading tomorrow, and maybe by removing the unnecessary checksum verification.
The code is a bit ugly for my taste. I'll try to simplify it tomorrow by avoiding the … Also, as discussed offline with @lhoestq, I reverted the automatic directory globbing change in …
Sounds good ! It's fine if we just support the same format as PyTorch's ImageFolder. Regarding the … or something else ? We can still have …
The example Colab is amazing !
The second option. Basically, I would like

```python
def _split_generators(self, dl_manager):
    data_files = self.config.data_files
    if data_files is None:
        data_files = glob.glob(f"{self.config.data_dir}/**", recursive=True)
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")
```

in the scripts of packaged modules. It's probably better to do the resolution in …
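Doing the resolution up front, outside the builder script, could be sketched as below. The helper name and signature are hypothetical, not the library's actual API; the idea is just to glob `data_dir` once and fail early if nothing matches.

```python
import glob
import os
import tempfile

def resolve_data_files(data_files=None, data_dir=None):
    """Hypothetical helper: resolve data_files eagerly, falling back
    to a recursive glob over data_dir, and fail early if empty."""
    if data_files is None and data_dir is not None:
        data_files = [
            p for p in glob.glob(os.path.join(data_dir, "**"), recursive=True)
            if os.path.isfile(p)
        ]
    if not data_files:
        raise ValueError(f"At least one data file must be specified, but got data_files={data_files}")
    return sorted(data_files)

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "Cat"))
    open(os.path.join(tmp, "Cat", "1.jpg"), "w").close()
    print(len(resolve_data_files(data_dir=tmp)))  # 1
```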
Sounds good !
Awesome!!! 🙌
Hey @mariosasko are we still actually able to load an image folder? For example… followed by

```python
from datasets import load_dataset

# Does not work
ds = load_dataset('imagefolder', data_files='/PetImages')

# Also doesn't work
ds = load_dataset('imagefolder', data_dir='/PetImages')
```

Are we going forward with the assumption that the user always wants to download from a URL and that they won't have a dataset locally already? This at least gets us part of the way, but is technically not an "imagefolder" as intended. Either way, I was delighted to see the Colab notebook work smoothly outside of the case I just described above. ❤️ Thanks so much for the work here.
I ran into this too when I was trying it out. At the moment you can still load from a local on-disk directory using a glob pattern, i.e.

```python
from datasets import load_dataset

ds = load_dataset("imagefolder", data_files="PetImages/**/*")
```

Colab example. I'm not sure if that is the intended behaviour or not. If it is, I think it would be good to document this, because I also assumed the approach @nateraw used would work.
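For reference, a recursive pattern like `PetImages/**/*` expands to every entry under the directory tree. A quick stdlib sketch of that expansion (the directory layout here is made up to mirror `PetImages/<label>/<file>`):

```python
import glob
import os
import tempfile

# Made-up layout mirroring PetImages/<label>/<file>
with tempfile.TemporaryDirectory() as root:
    for label in ("Cat", "Dog"):
        os.makedirs(os.path.join(root, label))
        open(os.path.join(root, label, "0.jpg"), "w").close()

    # "**/*" with recursive=True matches dirs and files at any depth
    matches = glob.glob(os.path.join(root, "**", "*"), recursive=True)
    files = [m for m in matches if os.path.isfile(m)]
    print(len(files))  # 2
```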
Awesome thank you !
Feel free to add some `logger.info` or `logger.debug` here and there if you want :)
```python
labels.add(os.path.basename(os.path.dirname(downloaded_dir_file)))
```

```python
data_files = self.config.data_files
downloaded_data_files = dl_manager.download_and_extract(data_files)
```
This takes unexpectedly long when you have a folder of 100,000 images; it gets stuck for ~2 min on "Extracting data files".
Even if you set `ignore_verifications` to `True` in `load_dataset`?
Yes, even with `ignore_verifications=True`.
I tried with a local directory containing around 100,000 images from here: http://cs231n.stanford.edu/tiny-imagenet-200.zip
After discussing the issue via Slack, the problem seems to stem from `ExtractorManager`'s checks. We can optimize this part by using `dl_manager.download` on an image file and `dl_manager.download_and_extract` on archives. I'll address this in a separate PR.
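That routing idea could be sketched as below. The helper name and the archive-extension set are assumptions for illustration, not the library's actual implementation: only paths that look like archives would be sent through extraction, while plain image files would be downloaded as-is.

```python
import os

# Assumed set of archive extensions; the real dl_manager may recognize more.
ARCHIVE_EXTENSIONS = {".zip", ".tar", ".gz", ".tgz", ".bz2", ".xz"}

def needs_extraction(path):
    """Return True if the path looks like an archive that should go
    through download_and_extract rather than a plain download."""
    return os.path.splitext(path)[1].lower() in ARCHIVE_EXTENSIONS

print(needs_extraction("PetImages.zip"))        # True
print(needs_extraction("PetImages/Cat/1.jpg"))  # False
```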
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
A generic `imagefolder` dataset inspired by `torchvision.datasets.ImageFolder`. Resolves #2508.

Example Usage: