Dataset uses excessive memory when loading files #7509

### Describe the bug

Hi
I am running into an out-of-memory (OOM) error when loading a dataset.
I have about 200 JSON files of about 1GB each (about 215GB in total). Each row has a few features, each of which is a list of ints.
I am trying to load the dataset using `load_dataset`.
The dataset has about 1.5M samples.
I use `num_proc=32` on a node with 378GB of memory.
About a third of the way through loading, I get an OOM.
I also saw an old bug report with a similar issue, which suggests setting `writer_batch_size`. I tried lowering it to 10, but it still crashed.
I also tried lowering `num_proc` to 16 and even 8, but hit the same issue.
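
For scale, here is the back-of-the-envelope arithmetic behind the numbers above. It is only a rough sketch: it uses the on-disk JSON size, which is not the same as the in-memory size after parsing, and the helper names `per_worker_budget_gb` and `writer_batch_mb` are made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching the rough figures quoted above

total_bytes = 215 * GB      # total size of the JSON files on disk
num_samples = 1_500_000     # approximate number of rows
node_memory = 378 * GB      # RAM available on the node

# Average on-disk size of one row (not its parsed, in-memory size).
bytes_per_sample = total_bytes / num_samples


def per_worker_budget_gb(num_proc):
    """Memory each worker could use if RAM were split evenly across workers."""
    return node_memory / num_proc / GB


def writer_batch_mb(writer_batch_size):
    """Approximate on-disk size of one writer batch, in megabytes."""
    return writer_batch_size * bytes_per_sample / 1e6


print(f"average JSON size per sample: {bytes_per_sample / 1e3:.0f} kB")
for n in (32, 16, 8):
    print(f"num_proc={n}: about {per_worker_budget_gb(n):.1f} GB per worker")
for size in (50, 10):
    print(f"writer_batch_size={size}: about {writer_batch_mb(size):.1f} MB per batch")
```

By this estimate a writer batch is only a few megabytes while each worker has several gigabytes available, which is why I would not expect `writer_batch_size` or `num_proc` alone to explain the OOM.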

### Steps to reproduce the bug

`dataset = load_dataset("json", data_dir=data_config.train_path, num_proc=data_config.num_proc, writer_batch_size=50)["train"]`
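
As a possible point of comparison, the same files can be opened lazily with `streaming=True`, which returns an iterable dataset instead of converting everything to Arrow up front. This is a minimal sketch I have not verified on this dataset: the helpers `list_json_files` and `stream_train_split` are hypothetical names, and it assumes the files sit directly under the data directory with a `.json` or `.jsonl` extension.

```python
import glob
import os


def list_json_files(data_dir):
    """Return the .json/.jsonl files directly under data_dir, sorted by path."""
    files = []
    for pattern in ("*.json", "*.jsonl"):
        files.extend(glob.glob(os.path.join(data_dir, pattern)))
    return sorted(files)


def stream_train_split(data_dir):
    """Open the files lazily rather than materializing the whole dataset."""
    # Imported here so list_json_files can be used without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields rows on demand as an IterableDataset.
    return load_dataset(
        "json",
        data_files=list_json_files(data_dir),
        split="train",
        streaming=True,
    )
```

With this, `for row in stream_train_split(data_config.train_path): ...` would iterate over rows one at a time. It changes the access pattern (no random indexing), so it is a workaround to compare against rather than a fix for the memory usage of the non-streaming path.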


### Expected behavior

Loading a 215GB dataset on a node with 378GB of memory leaves more than 100GB to spare, so it should not cause an OOM error.
Maybe I am missing something, but I would love some help.

### Environment info

- `datasets` version: 3.5.0
- Platform: Linux-6.6.20-aufs-1-x86_64-with-glibc2.36
- Python version: 3.11.2
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.9.0
