Full-parameter continued pretraining: out-of-memory error while loading the dataset, which is unexpected #4915

Open
1 task done
Adam-fei opened this issue Jul 21, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@Adam-fei
Reminder

  • I have read the README and searched the existing issues.

System Info

stage: pt, full-parameter continued pretraining
The command executed is:

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_7b_full_sft_ds3.yaml

Reproduction

The machine has 900 GB of RAM in total. When running the above script, the process runs out of memory during the data loading stage and exits with the following error:

E0721 20:54:31.520000 140161827444544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 2330) of binary: /python

However, the total amount of data being loaded is less than 250 GB, so I don't understand why memory overflows.
Questions:

  1. Does the loading logic read the entire dataset into memory at once? Is there a more memory-friendly way to load the data?
  2. The total training data I plan to use is around 500 GB. Can this framework support that?

Thanks
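For reference, a sketch of what a streaming-enabled training config might look like. I am assuming the framework exposes a `streaming` flag that it forwards to Hugging Face datasets, and that a fixed `max_steps` is then required because a streamed dataset has no known length; the dataset name below is hypothetical, so please correct me if the key names differ:

```yaml
### method
stage: pt
do_train: true
finetuning_type: full

### dataset (assumed keys)
dataset: my_pt_corpus   # hypothetical dataset name
streaming: true         # load samples lazily instead of reading everything up front
max_steps: 10000        # streamed datasets have no length, so set steps explicitly
```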

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 21, 2024