Description
Describe the bug
When using resume_from_checkpoint to load the model and continue training:
--resume_from_checkpoint xx \ --resume_only_model false
Error info:
[rank4]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
[rank4]: (1) In PyTorch 2.6, we changed the default value of the weights_only argument in torch.load from False to True. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank4]: (2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
[rank4]: WeightsUnpickler error: Unsupported global: GLOBAL deepspeed.runtime.fp16.loss_scaler.LossScaler was not an allowed global by default. Please use torch.serialization.add_safe_globals([LossScaler]) or the torch.serialization.safe_globals([LossScaler]) context manager to allowlist this global if you trust this class/function.
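A possible workaround (a minimal sketch, not a fix inside ms-swift itself, and assuming you trust the checkpoint) is to allowlist the DeepSpeed class named in the error before the resume path calls torch.load, e.g. at the top of the training entry script:

```python
# Workaround sketch: allowlist the DeepSpeed LossScaler so that
# torch.load(weights_only=True) under PyTorch 2.6 can unpickle the
# optimizer/scaler state saved in the DeepSpeed checkpoint.
# Assumption: this runs before swift starts loading the checkpoint.
import torch.serialization
from deepspeed.runtime.fp16.loss_scaler import LossScaler

torch.serialization.add_safe_globals([LossScaler])
```

Other DeepSpeed classes in the checkpoint may also need to be allowlisted; alternatively, as the error message itself notes, the loading call could pass weights_only=False to torch.load, but only for checkpoints from a trusted source.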
If resume_from_checkpoint is not used and full training starts from scratch, it works.
Your hardware and system info
swift version: 3.5.0; GPU: H20; CUDA: 12.4; torch: 2.6