How can we resume training from lora model? #29607

Closed · 2 of 4 tasks
rangehow opened this issue Mar 12, 2024 · 4 comments

Comments

@rangehow

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 16, 'zero3_init_flag': False, 'zero_stage': 0}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:
  • peft version: 0.9.1.dev0

Who can help?

@younesbelkada

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I found a relevant PR #24274, but it doesn't seem to work for me (there is a comment on that PR from someone with exactly the same setup as mine, and it was never resolved).
In short, when using the transformers Trainer, if training is interrupted, simply calling trainer.train(resume_from_checkpoint=True) produces the following error:

ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in ***/checkpoint-50.
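For context, a quick way to check what actually ends up in the checkpoint folder (path taken from the TrainingArguments below; the comment is only a hypothesis about why the index lookup fails):

import os

ckpt = "/data/ruanjh/mamba-translate-2.8b-lora/checkpoint-50"  # example checkpoint path
print(sorted(os.listdir(ckpt)))
# With the adapter-injected model, the folder presumably contains only adapter files
# (e.g. adapter_config.json / adapter_model.safetensors) and no
# pytorch_model.bin.index.json or model.safetensors.index.json, which is what the
# resume logic complains about above.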

Reproduction code:


from transformers import (
    MambaForCausalLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Trainer,
    TrainingArguments,
)
import torch
import json
from dataset import MyDataset  # local helper module
from plot import plot_loss     # local helper module
from peft import LoraConfig

model_dir = 'mamba-2.8b-hf'
output_dir = './mamba-translate'
tokenizer = AutoTokenizer.from_pretrained(model_dir, padding_side='left')
model = MambaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16)

collator = DataCollatorForSeq2Seq(tokenizer, model)

# train_dataset / eval_dataset are built from MyDataset (construction omitted in this snippet)

lora_config = LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
)
model.add_adapter(lora_config)  # injects the LoRA adapter in place (not get_peft_model)

trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        overwrite_output_dir=True,
        remove_unused_columns=False,
        gradient_accumulation_steps=16,
        # gradient_checkpointing=True,
        # ------------------------------
        evaluation_strategy='steps',
        eval_delay=100,
        eval_steps=50,
        # -------------------------------
        save_strategy='steps',
        save_steps=50,
        save_total_limit=3,
        load_best_model_at_end=True,
        # --------------------------------
        dataloader_num_workers=10,
        learning_rate=2e-3,
        num_train_epochs=30,
        # auto_find_batch_size=True,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        output_dir="/data/ruanjh/mamba-translate-2.8b-lora",
        logging_steps=5,
        bf16=True,
        prediction_loss_only=True,
        lr_scheduler_type="cosine",
        # torch_compile=True,
        # torch_compile_backend='inductor',
        # torch_compile_mode='max-autotune',
        optim='adamw_apex_fused',
        # save_safetensors=False,
    ),
    data_collator=collator,
)

trainer.train(resume_from_checkpoint=True)


Expected behavior

Training resumes correctly from the latest saved checkpoint.

@rangehow
Author

I found the possible cause: in HF there are two ways to turn a model into a PEFT model, model.add_adapter(lora_config) or model = get_peft_model(model, lora_config). The former causes the checkpoint-index error above, while the latter resumes correctly.

However, using the latter raises a different error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
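For reference, a minimal sketch of the get_peft_model variant (same LoRA config as in the reproduction code; only the adapter-wrapping line changes):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
)

# Instead of model.add_adapter(lora_config), wrap the model into a PeftModel:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only the LoRA weights are trainable

# The wrapped model is then passed to Trainer exactly as before, and
# trainer.train(resume_from_checkpoint=True) picks up the adapter checkpoint,
# but training subsequently hits the CUDA error shown above.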

@amyeroberts
Collaborator

also cc @pacman100

@younesbelkada
Contributor

@rangehow
Apologies for the delay. Could you try to run a single training step on CPU and report the error here? Alternatively, you can run CUDA_LAUNCH_BLOCKING=1 python xxx and paste the error traceback here 🙏
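For what it's worth, a rough sketch of such a debug run (a hypothetical standalone script, assuming the model fits in CPU RAM; CUDA_LAUNCH_BLOCKING only matters for the GPU case and must be set before torch initializes CUDA):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches (GPU runs only)

import torch
from transformers import MambaForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_dir = "mamba-2.8b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir, padding_side="left")
model = MambaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float32)  # fp32 on CPU
model = get_peft_model(model, LoraConfig(
    r=64,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none",
    use_rslora=True,
))

# Run one forward/backward pass on CPU; if it fails here too, the traceback
# should point at the actual failing op instead of an asynchronous CUDA error.
inputs = tokenizer("Hello world", return_tensors="pt")
out = model(input_ids=inputs["input_ids"], labels=inputs["input_ids"])
out.loss.backward()
print("single step OK, loss =", out.loss.item())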


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

rangehow closed this as not planned (won't fix, can't repro, duplicate, stale) on May 11, 2024