I am running full-parameter continued (incremental) pretraining with stage: pt. The launch command is:
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2_7b_full_sft_ds3.yaml
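The machine has 900 GB of RAM in total. After launching the command above, the run exhausts memory during the data-loading stage and exits with the following error: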
E0721 20:54:31.520000 140161827444544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 2330) of binary: /python
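However, the data being loaded at that point totals less than 250 GB, so I do not understand why memory runs out. My questions are:
Thanks.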
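For readers interpreting the log line above, the `exitcode: -9` can be decoded as a signal number. The sketch below assumes the launcher follows the usual Python subprocess convention, where a negative exit code `-N` means the worker was terminated by signal `N`. The helper name `describe_exit_code` is hypothetical and is not part of LLaMA-Factory or PyTorch.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Hypothetical helper: explain a worker exit code as shown in the log."""
    if exit_code < 0:
        # Assumed convention (Python subprocess style): -N means the
        # process was terminated by signal N.
        return f"killed by {signal.Signals(-exit_code).name}"
    if exit_code == 0:
        return "exited normally"
    return f"exited with status {exit_code}"


# The value reported for local_rank 0 in the log above.
print(describe_exit_code(-9))
```

On Linux, signal 9 is SIGKILL, which a process cannot catch and does not send to itself. The kernel's out-of-memory killer is one common source, which would be consistent with the memory exhaustion described here, although the exit code alone does not prove it. The kernel log (for example via `dmesg`) is the usual place to confirm.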