resume_only_model has no effect #6084

@thesby

Describe the bug
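When I continue training on an already-trained LoRA with resume_only_model, the step count from the previous run is still loaded, and training ends immediately.

Command: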

nproc_per_node=2
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
nohup swift sft \
--model Qwen3-VL-30B-A3B-Instruct \
--dataset  data.jsonl \
--val_dataset eval.jsonl \
--train_type lora \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--lora_rank 512 \
--lora_alpha 512 \
--freeze_llm false \
--freeze_vit true \
--gradient_accumulation_steps 2 \
--save_steps 500 \
--save_total_limit 5 \
--logging_steps 5 \
--max_new_tokens 8192 \
--max_length 12800 \
--output_dir outputs/Qwen3-VL-30B-A3B-Instruct/ \
--resume_from_checkpoint 'outputs/Qwen3-VL-30B-A3B-Instruct/v1-20251004-222628/checkpoint-35200' \
--resume_only_model true \
--warmup_steps 100 \
--dataloader_num_workers 0 \
--optim adamw_8bit \
--attn_impl flash_attn \
--do_eval false \
--truncation_strategy delete \
--max_pixels 409600  > ./v2.log 2>&1 &
Train:   0%|          | 0/30255 [00:00<?, ?it/s]
                                                
{'train_runtime': 0.8557, 'train_samples_per_second': 141423.405, 'train_steps_per_second': 35355.559, 'train_loss': 0.0, 'epoch': 0.35, 'global_step/max_steps': '35200/30255', 'percentage': '116.34%', 'elapsed_time': '0s', 'remaining_time': '23h 59m 59s', 'memory(GiB)': 83.59, 'train_speed(iter/s)': 86436.446281}

Train:   0%|          | 0/30255 [00:00<?, ?it/s]
Train:   0%|          | 0/30255 [00:00<?, ?it/s]
Train:   0%|          | 0/30255 [00:00<?, ?it/s]
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: outputs/Qwen3-VL-30B-A3B-Instruct/v1-20251004-222628/checkpoint-35200
[INFO:swift] images_dir: outputs/Qwen3-VL-30B-A3B-Instruct/v2-20251010-191404/images
[INFO:swift] End time of running main: 2025-10-10 21:13:41.979910

transformers 4.38.2
ms-swift: latest version, installed from source
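
The log above reports 'global_step/max_steps': '35200/30255', which suggests the step counter saved in the checkpoint was restored and already exceeds the new run's max_steps, so the training loop has nothing left to do. A minimal diagnostic sketch of that check follows. It assumes the checkpoint directory contains the trainer_state.json file that the Hugging Face Trainer writes with a "global_step" field. The helper names read_global_step and would_finish_immediately are hypothetical and are not part of ms-swift. Whether ms-swift is supposed to ignore this state when resume_only_model is set is the open question of this issue.

```python
import json
from pathlib import Path


def read_global_step(checkpoint_dir):
    """Return the global_step stored in a checkpoint's trainer_state.json.

    Assumption: the checkpoint was saved by the Hugging Face Trainer,
    which records its progress in trainer_state.json.
    """
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    with state_path.open(encoding="utf-8") as f:
        return json.load(f)["global_step"]


def would_finish_immediately(restored_step, max_steps):
    """True when the restored step already meets or exceeds max_steps."""
    return restored_step >= max_steps


# Numbers taken from the log above: 'global_step/max_steps': '35200/30255'
restored_step, max_steps = 35200, 30255
print(would_finish_immediately(restored_step, max_steps))  # True
print(f"{restored_step / max_steps:.2%}")  # 116.34%, matching 'percentage' in the log
```

Running read_global_step on the checkpoint-35200 directory from the command would show whether the saved state carries the 35200 steps that appear in the new run.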
