[WIP] Multi-gpus training with accelerate #1246
Conversation
|
Please merge? |
|
This PR is not working: `make_policy` fails at `return self._untyped_storage.data_ptr()`. |
Could you give me some details about this error? |
|
Hi, torch 2.6+ introduces the DTensor feature. Accelerate won't be able to load the model properly or prepare it for distributed training when DTensor is not disabled. From what I can see in train_accelerate, this error is not handled anywhere. On my side, torch 2.7.1 with the latest transformers and accelerate hits the error when training on multiple GPUs. Once you resolve the state-dict loading error, there is still another one: [rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default! |
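For reference, a rough sketch of the kind of workaround I would try, assuming the failure comes from DTensor-wrapped parameters in the loaded policy; the helper `materialize_dtensors` is made up for illustration and not verified against this PR:

```python
import torch
from torch.distributed.tensor import DTensor  # public module in torch >= 2.5


def materialize_dtensors(model: torch.nn.Module) -> None:
    """Hypothetical workaround: replace DTensor parameters with plain local tensors
    so that accelerate does not hit DTensor code paths when preparing the model."""
    for _, param in model.named_parameters():
        if isinstance(param.data, DTensor):
            # full_tensor() is a collective that gathers the sharded DTensor
            # into a regular torch.Tensor on every rank.
            param.data = param.data.full_tensor()
```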
|
Can you please tell me what modifications need to be made to these saved weights for inference? I find that the multi-GPU weights perform very poorly (compared to single-card training). |
I used |
My checkpoint file looks the same as the single-card one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint-saving logic, or do I need to make other changes? Please advise! |
It is an interesting question. At line 288 the script does:

```python
if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
    logging.info(f"Checkpoint policy after step {step}")
    checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
    # Wait for all processes before saving
    accelerator.wait_for_everyone()
    # Unwrap model for saving
    unwrapped_policy = accelerator.unwrap_model(policy)
    save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
    update_last_checkpoint(checkpoint_dir)
```

You can modify it to:

```python
if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
    logging.info(f"Checkpoint policy after step {step}")
    checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
    # Wait for all processes before saving
    accelerator.wait_for_everyone()
    accelerator.save_model(policy, checkpoint_dir)
```

Then try again? |
|
I am training with multiple GPUs now. |
I'm curious why you don't run into a deadlock problem due to incorrect usage |
I've actually encountered this problem and look forward to the author's answer |
You mentioned above that you encountered the |
You are right. I think the correct code is:

```python
if cfg.save_checkpoint and is_saving_step:
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        logging.info(f"Checkpoint policy after step {step}")
        checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
        logging.info(colored("Saving model in", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
        accelerator.save_model(policy, checkpoint_dir)
    accelerator.wait_for_everyone()
```

I will test it and commit it again. Thanks! |
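For the inference question above, a minimal sketch of how the weights written by `accelerator.save_model` (a `model.safetensors` file by default, if I read the accelerate docs correctly) could be loaded back for single-GPU inference. The checkpoint path is a placeholder, the import path may differ across lerobot versions, and `cfg`/`dataset` are assumed to come from your own setup:

```python
import os

from safetensors.torch import load_file

from lerobot.common.policies.factory import make_policy  # import path may differ across lerobot versions

checkpoint_dir = "outputs/train/pi0_multi_gpu/checkpoints/010000"  # placeholder path

# Build the policy the same way the training script does (cfg and dataset come from your own setup).
policy = make_policy(cfg.policy, ds_meta=dataset.meta)

# accelerator.save_model writes a safetensors file (model.safetensors) by default.
state_dict = load_file(os.path.join(checkpoint_dir, "model.safetensors"))
missing, unexpected = policy.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

policy.eval().to("cuda")
```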
I tested the code and found that with mixed-precision training there is a dtype mismatch error, like this: [rank0]: Traceback (most recent call last): ... Do you know how to fix it? |
Can you give me your running command? |
|
Here is my running command. I set mixed_precision to fp16 or bf16, and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model? python -m accelerate.commands.launch |
Oh, I also used deepspeed, which has some differences from your code, but most of it is pretty consistent |
I think you should add an argument here:

```python
accelerator = Accelerator(
    mixed_precision="fp16" if cfg.policy.use_amp else "no",
    gradient_accumulation_steps=cfg.policy.gradient_accumulation_steps,
    log_with="wandb" if cfg.wandb.enable else None,
    kwargs_handlers=[ddp_kwargs],
    project_dir=cfg.output_dir,
)
```
|
I did not use `with accelerator.autocast():`. I think this is the same as |
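For comparison, a minimal sketch of what a training step wrapped in `accelerator.autocast()` could look like; the policy/optimizer/scheduler names mirror the snippets above, and `grad_clip_norm` is an assumed config value, not necessarily what the script uses:

```python
def training_step(accelerator, policy, batch, optimizer, lr_scheduler, grad_clip_norm):
    """One optimization step with gradient accumulation and autocast (sketch only)."""
    with accelerator.accumulate(policy):
        with accelerator.autocast():
            loss, _ = policy.forward(batch)  # lerobot policies return (loss, output_dict)
        accelerator.backward(loss)
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(policy.parameters(), grad_clip_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    return loss.detach()
```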
I don't know the specific reason, but I guess this is what causes the two parameter dtypes to mismatch when performing matrix operations. I don't know much about accelerate; this is just a guess.
|
In DeepSpeed ZeRO-2, the model weights are cast to the specified dtype (bf16 or fp32) by default, and that is what happens in this case. However, the weight dtypes in pi0 include BOTH fp32 (the linear projectors) and bf16 (the transformer layers). I don't know how to solve it either. accelerate works well, but DeepSpeed / FSDP do not. |
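A quick way to verify the mixed-dtype claim on your side; a small diagnostic sketch that assumes nothing pi0-specific beyond the policy being a torch `nn.Module`:

```python
from collections import Counter

import torch


def dtype_breakdown(model: torch.nn.Module) -> Counter:
    """Count parameters per dtype to check whether the model really mixes fp32 and bf16."""
    counts = Counter()
    for _, param in model.named_parameters():
        counts[param.dtype] += param.numel()
    return counts


# e.g. print(dtype_breakdown(policy))
# If the comment above is right, a pi0 policy should report both torch.float32 and torch.bfloat16.
```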
Hi, I recently ran into similar problems. From a theoretical perspective, the process should hit the deadlock problem, but it doesn't. Have you figured it out? |
|
Can you show a sample of the command you run for multiple GPUs? Is it just a matter of changing gradient_accumulation_steps or num_processes? |
```bash
python -m accelerate.commands.launch \
    --num_processes 4 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_accelerate.py \
    ...
```
|
|
Thanks! However, I still encounter errors but using |
I made several attempts to fix the issue but was unable to with the initial accelerate script. It seems that even though I have set 4 GPUs, it loads the model onto the same one.
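If it helps debugging, a minimal sketch that prints where each rank actually ends up, so you can confirm whether all processes really land on the same GPU (it assumes the same `Accelerator` setup as the script):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"rank {accelerator.process_index}/{accelerator.num_processes} "
    f"-> device {accelerator.device}, visible GPUs: {torch.cuda.device_count()}"
)
# After accelerator.prepare(policy), the parameters should sit on accelerator.device:
# print(next(policy.parameters()).device)
```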
|
|
Thank you so much for the PR; however, I'm closing this as we recently added support for multi-GPU training with accelerate:
I have seen a lot of issues and PRs (#1176 #558 #317 #956 #876 #778) about data-parallel training with multiple GPUs. This shows that multi-GPU training is something this community, as well as I, need. So I wrote `lerobot/scripts/train_accelerate.py` to address it.

What this does

This pull request introduces a new training script leveraging the `accelerate` library for distributed and mixed-precision training. It also adds support for gradient accumulation and updates dependencies accordingly. Below are the most significant changes, grouped by theme:

New Training Script
- Added `lerobot/scripts/train_accelerate.py`, which integrates the `accelerate` library for distributed training, mixed-precision support, and gradient accumulation. The script includes features such as policy updates, checkpointing, evaluation, and integration with Weights & Biases for logging.

Configuration Updates
- Added `gradient_accumulation_steps` to the `PreTrainedConfig` class to support gradient accumulation during training.

Dependency Updates
- Added `accelerate>=1.7.0` to `pyproject.toml` to include the `accelerate` library as a dependency for distributed and mixed-precision training.

How to checkout & try? (for the reviewer)
Provide a simple way for the reviewer to try out your changes.
Examples:
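For instance, the multi-GPU launch used in the discussion above (4 processes on one machine with fp16 mixed precision); the dataset/policy arguments after the script path are elided here:

```bash
python -m accelerate.commands.launch \
    --num_processes 4 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_accelerate.py \
    ...
```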