How can I load partial parquet files only? #6979

Closed
lucasjinreal opened this issue Jun 18, 2024 · 12 comments

@lucasjinreal

I have a HUGE dataset, about 14 TB. I am unable to download all the parquet files, so I only took about 100 of them.

dataset = load_dataset("xx/", data_files="data/train-001*-of-00314.parquet")

How can I load just parts 000 - 100 out of the 00314 total, instead of all of them?

I searched the whole web and didn't find a solution. It would be really frustrating if this isn't supported.

@Dref360
Contributor

Dref360 commented Jun 20, 2024

Hello,

Have you tried loading the dataset in streaming mode? See the documentation.

This way you wouldn't have to load it all. Also, let's be nice to Parquet, it's a really nice technology and we don't need to be mean :)
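A minimal sketch of what streaming mode looks like, assuming the placeholder repo id "xx/" from the question above:

from datasets import load_dataset

# streaming=True returns an IterableDataset; nothing is downloaded up front
streamed = load_dataset("xx/", split="train", streaming=True)

# only the rows you actually iterate over are fetched from the Hub
for example in streamed.take(5):
    print(example)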

@lucasjinreal
Author

I have downloaded part of it and just want to know how to load that part. Streaming mode does not work for me since my network (in China) is not stable, and I don't want to redo the download again and again.

Just curious, isn't there a way to load only part of it?

@Dref360
Contributor

Dref360 commented Jun 20, 2024

Could you convert the IterableDataset to a Dataset after taking the first 100 rows with .take? This way, you would have a local copy of the first 100 rows on your system and thus won't need to download. Would that work?

Here is a SO question detailing how to do the conversion.
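A minimal sketch of that conversion, assuming the same placeholder repo id and an illustrative row count of 100:

from datasets import Dataset, load_dataset

streamed = load_dataset("xx/", split="train", streaming=True)
head = streamed.take(100)  # IterableDataset limited to the first 100 examples

# rebuild a regular (Arrow-backed) Dataset from the streamed rows so they live locally
local = Dataset.from_generator(lambda: (row for row in head), features=head.features)
local.save_to_disk("first_100_rows")  # hypothetical output path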

@lucasjinreal
Author

lucasjinreal commented Jun 21, 2024

I mean, the parquet files are like:

00000-0143554
00001-0143554
00002-0143554
...
00100-0143554
...
09100-0143554

I just downloaded the first 9900 parts of it.

I cannot load them with load_dataset; it throws an error saying my files don't match the full parquet count.

How can I load only the part I have?

(I really don't want to download them all, because I don't need all of them, and plus, it's huge....)

As I said, I have downloaded about 9999 of them... It's not about streaming... I just want to know how to load the part I have, offline....

@albertvillanova
Member

Hi, @lucasjinreal.

I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Could you provide a reproducible example?

Without knowing all those details, I would naively say that you can load whatever number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet

ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")

@lucasjinreal
Author

@albertvillanova Not sure whether you have tested this or not, but I have tried it.

The only error I get is that it still tries to load all the parquet files, with a progress bar whose maximum is the whole number 014354; it loads my 0 - 000999 part and then throws an error.

It says the num info is not the same.

I am so confused.

@albertvillanova
Member

Yes, my code snippet works.

Could you copy-paste your code and the output? Otherwise we are not able to know what the issue is.

@lucasjinreal
Author

lucasjinreal commented Jun 21, 2024

@albertvillanova Hi, thanks for the tracing of the issue.

This is the output:

python get_llava_recap_cc3m.py
Generating train split:   3%|███▋                                                                                                                | 101910/3199866 [00:16<08:30, 6065.67 examples/s]
Traceback (most recent call last):
  File "get_llava_recap_cc3m.py", line 31, in <module>
    dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1118, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/info_utils.py", line 101, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=156885281898.75, num_examples=3199866, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=4994080770, num_examples=101910, shard_lengths=[10191, 10291, 10291, 10291, 10291, 10191, 10191, 10291, 10291, 9591], dataset_name='llava-recap-cc3m')}]

this is my code:

dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")

My situation and requirements:

00314 is the total, but I downloaded about 150 of them, half of it. As you can see, I used 0000*-of-00314, which should load at most 99 files.

But it just fails.

Can you understand my issue now?

If so, then please do not suggest streaming. I just want to know whether there is a way to load part of it...... and please don't say you cannot replicate my issue when you have downloaded them all. My English is not good, but I think I have already described the whole situation and all the prerequisites.

@albertvillanova
Member

albertvillanova commented Jun 21, 2024

I see you did not use the "parquet" loader as I suggested in my code snippet above: #6979 (comment)
Please try passing "parquet" instead of "llava-recap-cc3m/" to load_dataset, and the complete path to data files in data_files:

load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")

@albertvillanova
Member

Let me explain: you get the error because of this content within the dataset_info YAML tag in the llava-recap-cc3m/README.md:

  - name: train
    num_bytes: 156885281898.75
    num_examples: 3199866

By default, if that content is present in the README file, load_dataset performs a basic check to verify that the generated number of examples matches the expected one, and raises a NonMatchingSplitsSizesError if that is not the case.

You can avoid this basic check by passing verification_mode="no_checks":

load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")

@albertvillanova
Member

And next time you have an issue, please fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml

Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.

@lucasjinreal
Author

Thank you, Albert!

It solved my issue!
