How can we resume training from lora model? #29607

Closed · 2 of 4 tasks
rangehow opened this issue Mar 12, 2024 · 4 comments

Comments

@rangehow

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 16, 'zero3_init_flag': False, 'zero_stage': 0}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:
  • peft version: 0.9.1.dev0

Who can help?

@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I found a relevant PR #24274, but it doesn't seem to work for me (there is a comment on that PR from someone with exactly the same setup as mine, and it was never resolved).
In short, when using the transformers Trainer, if training is interrupted, simply calling trainer.train(resume_from_checkpoint=True) produces the following error:

ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in ***/checkpoint-50.
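For context, a quick way to check what actually ends up in the checkpoint folder (path taken from the TrainingArguments below; the comment is only a hypothesis about why the index lookup fails):

import os

ckpt = "/data/ruanjh/mamba-translate-2.8b-lora/checkpoint-50"  # example checkpoint path
print(sorted(os.listdir(ckpt)))
# With the adapter-injected model, the folder presumably contains only adapter files
# (e.g. adapter_config.json / adapter_model.safetensors) and no
# pytorch_model.bin.index.json or model.safetensors.index.json, which is what the
# resume logic complains about above.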

Reproduction code:


from transformers import (
    MambaForCausalLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Trainer,
    TrainingArguments,
)
import torch
import json
from dataset import MyDataset  # local helper module
from plot import plot_loss     # local helper module
from peft import LoraConfig

model_dir = 'mamba-2.8b-hf'
output_dir = './mamba-translate'
tokenizer = AutoTokenizer.from_pretrained(model_dir, padding_side='left')
model = MambaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

collator = DataCollatorForSeq2Seq(tokenizer, model)

# train_dataset / eval_dataset are built from MyDataset (construction omitted in this snippet)

lora_config = LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
)
model.add_adapter(lora_config)  # injects the LoRA adapter in place (not get_peft_model)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        overwrite_output_dir=True,
        remove_unused_columns=False,
        gradient_accumulation_steps=16,
        # gradient_checkpointing=True,
        # ------------------------------
        evaluation_strategy='steps',
        eval_delay=100,
        eval_steps=50,
        # -------------------------------
        save_strategy='steps',
        save_steps=50,
        save_total_limit=3,
        load_best_model_at_end=True,
        # --------------------------------
        dataloader_num_workers=10,
        learning_rate=2e-3,
        num_train_epochs=30,
        # auto_find_batch_size=True,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        output_dir="/data/ruanjh/mamba-translate-2.8b-lora",
        logging_steps=5,
        bf16=True,
        prediction_loss_only=True,
        lr_scheduler_type="cosine",
        # torch_compile=True,
        # torch_compile_backend='inductor',
        # torch_compile_mode='max-autotune',
        optim='adamw_apex_fused',
        # save_safetensors=False,
    ),
    data_collator=collator,
)

trainer.train(resume_from_checkpoint=True)


Expected behavior

Training resumes correctly from the latest saved checkpoint.

@rangehow
Author

I found the possible cause: in HF there are two ways to turn a model into a PEFT model, model.add_adapter(lora_config) or model = get_peft_model(model, lora_config). The former causes the checkpoint-index error above, while the latter resumes correctly.

However, using the latter raises a different error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
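For reference, a minimal sketch of the get_peft_model variant (same LoRA config as in the reproduction code; only the adapter-wrapping line changes):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
)

# Instead of model.add_adapter(lora_config), wrap the model into a PeftModel:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA weights are trainable

# The wrapped model is then passed to Trainer exactly as before, and
# trainer.train(resume_from_checkpoint=True) picks up the adapter checkpoint,
# but training subsequently hits the CUDA error shown above.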

@amyeroberts
Collaborator

also cc @pacman100

@younesbelkada
Contributor

@rangehow
Apologies for the delay. Could you try to run a single training step on CPU and report the error here? Alternatively, you can run CUDA_LAUNCH_BLOCKING=1 python xxx and paste the error traceback here 🙏
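For what it's worth, a rough sketch of such a debug run (a hypothetical standalone script, assuming the model fits in CPU RAM; CUDA_LAUNCH_BLOCKING only matters for the GPU case and must be set before torch initializes CUDA):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches (GPU runs only)

import torch
from transformers import MambaForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_dir = "mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir, padding_side="left")
model = MambaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float32)  # fp32 on CPU
model = get_peft_model(model, LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
))

# Run one forward/backward pass on CPU; if it fails here too, the traceback
# should point at the actual failing op instead of an asynchronous CUDA error.
inputs = tokenizer("Hello world", return_tensors="pt")
out = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
out.loss.backward()
print("single step OK, loss =", out.loss.item())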


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rangehow closed this as not planned (won't fix, can't repro, duplicate, stale) on May 11, 2024