Full-parameter continued pretraining: out-of-memory error while loading the dataset, which is unexpected #4915

Open
1 task done
Adam-fei opened this issue Jul 21, 2024 · 0 comments
Labels
pending This problem is yet to be addressed

Comments

@Adam-fei
Reminder

  • I have read the README and searched the existing issues.

System Info

stage: pt, full-parameter continued pretraining
The command executed is:

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_7b_full_sft_ds3.yaml

Reproduction

The machine has 900 GB of RAM in total. When running the above script, the process runs out of memory during the data loading stage and exits with the following error:

E0721 20:54:31.520000 140161827444544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 2330) of binary: /python

However, the total amount of data being loaded is less than 250 GB, so I don't understand why memory overflows.
Questions:

  1. Does the loading logic read the entire dataset into memory at once? Is there a more memory-friendly way to load the data?
  2. The total training data I plan to use is around 500 GB. Can this framework support that?

Thanks
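For reference, a sketch of what a streaming-enabled training config might look like. I am assuming the framework exposes a `streaming` flag that it forwards to Hugging Face datasets, and that a fixed `max_steps` is then required because a streamed dataset has no known length; the dataset name below is hypothetical, so please correct me if the key names differ:

```yaml
### method
stage: pt
do_train: true
finetuning_type: full

### dataset (assumed keys)
dataset: my_pt_corpus   # hypothetical dataset name
streaming: true         # load samples lazily instead of reading everything up front
max_steps: 10000        # streamed datasets have no length, so set steps explicitly
```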

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 21, 2024