How can I load partial parquet files only? #6979
I have a HUGE dataset, about 14 TB, and I am unable to download all the parquet files. I just took about 100 of them.

dataset = load_dataset("xx/", data_files="data/train-001*-of-00314.parquet")

How can I use only parts 000 - 100 out of all 00314?

I searched the whole net and didn't find a solution. This is stupid if they don't support it, and I swear I won't use stupid parquet any more.

Comments
Hello, have you tried loading the dataset in streaming mode? See the documentation. This way you wouldn't have to load it all. Also, let's be nice to Parquet, it's a really nice technology and we don't need to be mean :)
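A minimal sketch of what streaming looks like, assuming a placeholder repository id:

```python
from datasets import load_dataset

# Streaming returns an IterableDataset, so nothing is downloaded up front.
# "user/dataset-name" is a placeholder; replace it with the actual repo id.
ds = load_dataset("user/dataset-name", split="train", streaming=True)

for i, example in enumerate(ds):
    if i >= 100:  # look at just the first 100 rows instead of fetching everything
        break
    print(example)
```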
I have downloaded part of it and just want to know how to load that part. Streaming mode does not work for me since my network (in China) is not stable, and I don't want to do it all again and again. Just curious, isn't there a way to load only part of it?
Could you convert the IterableDataset to a Dataset after taking the first 100 rows with take? Here is a SO question detailing how to do the conversion.
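A sketch of that conversion, assuming a placeholder repository id: take keeps the first 100 rows of the stream, and Dataset.from_generator materializes them into a regular dataset.

```python
from datasets import Dataset, load_dataset

# Placeholder repo id; stream the dataset and keep only the first 100 rows.
streamed = load_dataset("user/dataset-name", split="train", streaming=True).take(100)

def rows():
    yield from streamed

# Materialize the 100 streamed rows as a regular (map-style) Dataset.
small_ds = Dataset.from_generator(rows)
print(small_ds)
```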
I mean, the parquet shards are numbered like 00000-0143554, and I have only downloaded the first 9900 of them. I cannot load them with load_dataset; it throws an error saying my files do not match the full parquet count. How can I load only what I have? (I really don't want to download them all, because I don't need them all, and plus, it's huge....) As I said, I have downloaded about 9999... It's not about streaming... I just want to know how to load the part I have, offline.
Hi, @lucasjinreal. I am not sure I understand your issue. What is the error message and stack trace you get? What version of datasets are you using? Without knowing all those details, I would naively say that you can load any number of Parquet files by using the "parquet" loader: https://huggingface.co/docs/datasets/loading#parquet

ds = load_dataset("parquet", data_files="data/train-001*-of-00314.parquet", split="train")
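If the wildcard pattern is awkward, the same "parquet" loader also accepts an explicit list of the shards that are actually on disk; a sketch, assuming the files sit under a local data/ directory:

```python
import glob
from datasets import load_dataset

# Collect only the shards that were actually downloaded (assumed local layout).
local_files = sorted(glob.glob("data/train-0*-of-00314.parquet"))

# The generic "parquet" loader builds the dataset from just these files.
ds = load_dataset("parquet", data_files=local_files, split="train")
print(ds)
```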
@albertvillanova Not sure whether you have tested this or not, but I have tried it. The only error I get is that it still tries to load all the parquet files, with a progress bar whose maximum is the whole count 014354; it loads my 0 - 000999 part and then throws an error saying the NumInfo is not the same. I am so confused.
Yes, my code snippet works. Could you copy-paste your code and the output? Otherwise we are not able to know what the issue is.
@albertvillanova Hi, thanks for tracing the issue. This is the output:
This is my code:
My situation and requirements: 00314 is the total, but I downloaded about 150, half of it. As you can see, that is what I used, but it just fails. Can you understand my issue now? If so, then do not suggest streaming. I just want to know: is there a way to load part of it? And please don't say you cannot replicate my issue when you have downloaded them all. My English is not good, but I think I have already addressed all the situations and prerequisites.
I see you did not use the "parquet" loader as I suggested in my code snippet above (#6979 (comment)):

load_dataset("parquet", data_files="llava-recap-cc3m/data/train-001*-of-00314.parquet")
Let me explain: you get the error because of the dataset_info content within the README file of the dataset, which records the expected size of each split.

By default, if that content is present in the README file, load_dataset verifies that what it loads matches those recorded sizes. You can avoid this basic check by passing verification_mode="no_checks":

load_dataset("llava-recap-cc3m/", data_files="data/train-0000*-of-00314.parquet", verification_mode="no_checks")
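For what it's worth, the same call can also be written with the VerificationMode enum instead of the string literal; the path and shard pattern below are taken from the comment above.

```python
from datasets import VerificationMode, load_dataset

# Skip the split-size checks recorded in the dataset's README metadata.
ds = load_dataset(
    "llava-recap-cc3m/",
    data_files="data/train-0000*-of-00314.parquet",
    verification_mode=VerificationMode.NO_CHECKS,
)
```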
And please, next time you have an issue, fill in the bug report issue template with all the necessary information: https://github.com/huggingface/datasets/issues/new?assignees=&labels=&projects=&template=bug-report.yml Otherwise it is very difficult for us to understand the underlying problem and to propose a pertinent solution.
Thank you, Albert! It solved my issue!