
How should I load a checkpoint saved with save_checkpoint in the finetune script, after fine-tuning the model with accelerate and DeepSpeed ZeRO stage-3? #18

Closed
Salierioo opened this issue Apr 21, 2023 · 12 comments

Comments

@Salierioo

No description provided.

@ChawDoe

ChawDoe commented Apr 26, 2023

No description provided.

Have you managed to solve this problem?

@ChawDoe

ChawDoe commented Apr 26, 2023

No description provided.

I've run into the same problem.

@rurubaobao

I have this problem too. How did you solve it?

@Salierioo
Author

Salierioo commented May 5, 2023 via email

@Salierioo
Author

No description provided.

Have you managed to solve this problem?

No description provided.

I've run into the same problem.

I've posted my solution below; I'm not sure whether you've solved it already.

@rurubaobao

Hi, after merging I end up with a model of roughly 60+ GB, but the official model is only about 30 GB. Do you load the 60+ GB model directly?
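The size gap is most likely fp32 vs fp16: zero_to_fp32.py writes full-precision weights, roughly twice the size of an fp16 release. A minimal sketch of casting the merged state dict back to half precision, assuming that is indeed the cause (the path "merged/pytorch_model.bin" is a placeholder, not something from this thread):

```python
# Hedged sketch: assumes the 60+ GB file is the fp32 output of zero_to_fp32.py and that
# the ~30 GB official weights are fp16; casting to half roughly halves the file size.
import torch

state_dict = torch.load("merged/pytorch_model.bin", map_location="cpu")
state_dict = {
    k: (v.half() if torch.is_tensor(v) and v.is_floating_point() else v)
    for k, v in state_dict.items()
}
torch.save(state_dict, "merged/pytorch_model_fp16.bin")
```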

@Salierioo
Author

Salierioo commented May 6, 2023 via email

@Salierioo
Author

Salierioo commented May 14, 2023 via email

@usun1997

I'm stuck at "where expected condition to be a boolean tensor, but got a tensor with dtype Half" and have no idea what to do. I used zero_to_fp32.py to convert the checkpoint .pt files into a single 60+ GB pytorch_model.bin, then changed every shard name in pytorch_model.bin.index to pytorch_model.bin, and this error shows up when I run inference.

@mayurou

mayurou commented Jun 6, 2023

"I used zero_to_fp32.py to convert the checkpoint .pt files into a single 60+ GB pytorch_model.bin, then changed every shard name in pytorch_model.bin.index to pytorch_model.bin"

That doesn't run for me! I used zero_to_fp32.py to convert the checkpoint .pt files into a 60+ GB pytorch_model.bin and changed every model name in pytorch_model.bin.index to pytorch_model.bin, and it fails with TypeError: expected str, bytes or os.PathLike object, not NoneType.
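As a side note (not a confirmed fix for the TypeError above): a single merged pytorch_model.bin should not need an index file at all, since transformers only consults pytorch_model.bin.index.json for sharded checkpoints. A minimal sketch, assuming the directory holds the model's config.json, tokenizer files, and the one merged weight file ("merged_ckpt" is a placeholder path):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# With no *.index.json in the directory, from_pretrained falls back to loading the
# single pytorch_model.bin instead of resolving shards through an index.
tokenizer = AutoTokenizer.from_pretrained("merged_ckpt", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("merged_ckpt", trust_remote_code=True)
```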

@Salierioo
Author

It has been quite a while since I did the fine-tuning, so I may not be remembering it clearly.

But I really have no recollection of hand-writing an .index.json; I just followed the documentation and could load the checkpoint directly, without running into any problems.

@lmc8133

lmc8133 commented Jun 30, 2023

"I found the solution in the DeepSpeed-related documentation in the accelerate GitHub repo; here is the link: https://github.com/huggingface/accelerate/blob/e60f3cab7a54a5519bf8f200fa1c998ce46e75bb/docs/source/usage_guides/deepspeed.mdx . The method is in the Saving and Loading section. In short, you use the .py script generated alongside the checkpoint saved by save_checkpoint to merge the checkpoint shards (this step has requirements on both disk space and RAM); once pytorch_model.bin has been produced, you can load it with from_pretrained."


This link is a bit easier to read:
https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading
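For completeness, a minimal sketch of the flow those docs describe: merge the ZeRO stage-3 shards with DeepSpeed's conversion helper, then save the result in the standard Hugging Face layout so from_pretrained works afterwards. The paths "ckpt_dir" and "moss-merged" and the base model name "fnlp/moss-moon-003-sft" are illustrative assumptions, not values taken from this thread:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from transformers import AutoConfig, AutoModelForCausalLM

# 1) Merge the ZeRO stage-3 shards written by save_checkpoint into one fp32 state dict
#    (the zero_to_fp32.py script that DeepSpeed drops into the checkpoint dir does the same).
#    This step needs enough CPU RAM and disk space to hold the full fp32 weights.
state_dict = get_fp32_state_dict_from_zero_checkpoint("ckpt_dir")

# 2) Rebuild the model from its config, load the merged weights, and save them in the
#    usual Hugging Face format.
config = AutoConfig.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
model.load_state_dict(state_dict)
model.save_pretrained("moss-merged")

# 3) Reload like any other pretrained checkpoint (cast to fp16 for inference if desired).
model = AutoModelForCausalLM.from_pretrained("moss-merged", trust_remote_code=True).half()
```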
