How can I load partial parquet files only? #6979

Closed
lucasjinreal opened this issue Jun 18, 2024 · 12 comments

@lucasjinreal

I have a HUGE dataset, about 14 TB. I am unable to download all the parquet files, so I only took about 100 of them.

dataset = load_dataset("xx/", data_files="data/train-001*-of-00314.parquet")

How can I load just parts 000 - 100 out of the 00314 total, instead of all of them?

I searched the whole web and didn't find a solution. It would be really frustrating if this isn't supported.

@Dref360
Contributor

Dref360 commented Jun 20, 2024

Hello,

Have you tried loading the dataset in streaming mode? See the documentation.

This way you wouldn't have to load it all. Also, let's be nice to Parquet, it's a really nice technology and we don't need to be mean :)
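A minimal sketch of what streaming mode looks like, assuming the placeholder repo id "xx/" from the question above:

from datasets import load_dataset

# streaming=True returns an IterableDataset; nothing is downloaded up front
streamed = load_dataset("xx/", split="train", streaming=True)

# only the rows you actually iterate over are fetched from the Hub
for example in streamed.take(5):
    print(example)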

@lucasjinreal
Author

I have downloaded part of it and just want to know how to load that part. Streaming mode does not work for me since my network (in China) is not stable, and I don't want to redo the download again and again.

Just curious, isn't there a way to load only part of it?

@Dref360
Contributor

Dref360 commented Jun 20, 2024

Could you convert the IterableDataset to a Dataset after taking the first 100 rows with .take? This way, you would have a local copy of the first 100 rows on your system and thus won't need to download. Would that work?

Here is a SO question detailing how to do the conversion.
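A minimal sketch of that conversion, assuming the same placeholder repo id and an illustrative row count of 100:

from datasets import Dataset, load_dataset

streamed = load_dataset("xx/", split="train", streaming=True)
head = streamed.take(100)  # IterableDataset limited to the first 100 examples

# rebuild a regular (Arrow-backed) Dataset from the streamed rows so they live locally
local = Dataset.from_generator(lambda: (row for row in head), features=head.features)
local.save_to_disk("first_100_rows")  # hypothetical output path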

@lucasjinreal
Author

lucasjinreal commented Jun 21, 2024

I mean, the parquet files are like:

00000-0143554
00001-0143554
00002-0143554
...
00100-0143554
...
09100-0143554

I just downloaded the first 9900 parts of it.

I cannot load them with load_dataset; it throws an error saying my files don't match the full parquet count.

How can I load only the part I have?

(I really don't want to download them all, because I don't need all of them, and plus, it's huge....)

As I said, I have downloaded about 9999 of them... It's not about streaming... I just want to know how to load the part I have, offline....

@albertvillanova
Member

Hi, @lucasjinreal.

I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Could you provide a reproducible example?

Without knowing all those details, I would naively say that you can load whatever number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet

ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")

@lucasjinreal
Author

@albertvillanova Not sure whether you have tested this or not, but I have tried it.

The only error I get is that it still tries to load all the parquet files, with a progress bar whose maximum is the whole number 014354; it loads my 0 - 000999 part and then throws an error.

It says the num info is not the same.

I am so confused.

@albertvillanova
Member

Yes, my code snippet works.

Could you copy-paste your code and the output? Otherwise we are not able to know what the issue is.

@lucasjinreal
Author

lucasjinreal commented Jun 21, 2024

@albertvillanova Hi, thanks for the tracing of the issue.

This is the output:

python get_llava_recap_cc3m.py
Generating train split:   3%|███▋                                                                                                                | 101910/3199866 [00:16<08:30, 6065.67 examples/s]
Traceback (most recent call last):
  File "get_llava_recap_cc3m.py", line 31, in <module>
    dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1118, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/usr/local/lib/python3.8/dist-packages/datasets/utils/info_utils.py", line 101, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=156885281898.75, num_examples=3199866, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=4994080770, num_examples=101910, shard_lengths=[10191, 10291, 10291, 10291, 10291, 10191, 10191, 10291, 10291, 9591], dataset_name='llava-recap-cc3m')}]

this is my code:

dataset = load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet")

My situation and requirements:

00314 is the total, but I downloaded about 150 of them, half of it. As you can see, I used 0000*-of-00314, which should load at most 99 files.

But it just fails.

Can you understand my issue now?

If so, then please do not suggest streaming. I just want to know whether there is a way to load part of it...... and please don't say you cannot replicate my issue when you have downloaded them all. My English is not good, but I think I have already described the whole situation and all the prerequisites.

@albertvillanova
Member

albertvillanova commented Jun 21, 2024

I see you did not use the "parquet" loader as I suggested in my code snippet above: #6979 (comment)
Please try passing "parquet" instead of "llava-recap-cc3m/" to load_dataset, and the complete path to data files in data_files:

load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")

@albertvillanova
Member

Let me explain: you get the error because of this content within the dataset_info YAML tag in the llava-recap-cc3m/README.md:

  - name: train
    num_bytes: 156885281898.75
    num_examples: 3199866

By default, if that content is present in the README file, load_dataset performs a basic check to verify that the generated number of examples matches the expected one, and raises a NonMatchingSplitsSizesError if that is not the case.

You can avoid this basic check by passing verification_mode="no_checks":

load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")

@albertvillanova
Member

And next time you have an issue, please fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml

Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.

@lucasjinreal
Author

Thank you, Albert!

It solved my issue!
