Describe the bug
Hi,
I am having an issue when loading a dataset.
I have about 200 JSON files, each about 1 GB (about 215 GB in total). Each row has a few features, each of which is a list of ints.
I am trying to load the dataset using `load_dataset`.
The dataset has about 1.5M samples.
I use `num_proc=32` on a node with 378 GB of memory.
About a third of the way through loading, I get an OOM.
I also found an old issue with a similar problem, which suggests setting `writer_batch_size`. I tried lowering it to 10, but it still crashed.
I also tried lowering `num_proc` to 16 and even 8, but I hit the same issue.
Steps to reproduce the bug
```python
from datasets import load_dataset

dataset = load_dataset("json", data_dir=data_config.train_path, num_proc=data_config.num_proc, writer_batch_size=50)["train"]
```
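For completeness, the lower settings mentioned above were along these lines (the `data_config` values come from my own setup); all of them hit the same OOM:

```python
from datasets import load_dataset

# Same call with reduced parallelism and a smaller writer batch size,
# as suggested in the older issue; still OOMs about a third of the way through.
dataset = load_dataset(
    "json",
    data_dir=data_config.train_path,  # directory with the ~200 JSON files
    num_proc=8,                       # also tried 16 and 32
    writer_batch_size=10,             # also tried 50
)["train"]
```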
Expected behavior
Loading a dataset while there is more than 100 GB of memory to spare should not cause an OOM error.
Maybe I am missing something, but I would appreciate some help.
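For comparison, a minimal sketch of what I would expect to stay within memory, using streaming mode (which, as I understand it, yields examples lazily instead of writing the whole dataset to an Arrow cache first); the loop body is just a placeholder:

```python
from datasets import load_dataset

# Streaming mode avoids materializing the full Arrow cache up front,
# so peak memory should stay low regardless of total dataset size.
streamed = load_dataset(
    "json",
    data_dir=data_config.train_path,  # same ~200 JSON files as above
    streaming=True,
)["train"]

for example in streamed.take(5):  # placeholder: just peek at a few rows
    print({k: type(v) for k, v in example.items()})
```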
Environment info
- `datasets` version: 3.5.0
- Platform: Linux-6.6.20-aufs-1-x86_64-with-glibc2.36
- Python version: 3.11.2
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.9.0