
Conversation

@YushunXiang
Contributor

I have seen a lot of issues and PRs (#1176 #558 #317 #956 #876 #778) about data-parallel training with multiple GPUs. This shows that multi-GPU training is something this community, as well as I, need. So I wrote lerobot/scripts/train_accelerate.py to address it.

What this does

This pull request introduces a new training script leveraging the accelerate library for distributed and mixed-precision training. It also adds support for gradient accumulation and updates dependencies accordingly. Below are the most significant changes grouped by theme:

New Training Script

  • Added a comprehensive training script in lerobot/scripts/train_accelerate.py that integrates the accelerate library for distributed training, mixed-precision support, and gradient accumulation. The script includes features such as policy updates, checkpointing, evaluation, and integration with Weights & Biases for logging.

Configuration Updates

  • Introduced a new configuration parameter gradient_accumulation_steps in the PreTrainedConfig class to support gradient accumulation during training.

Dependency Updates

  • Added accelerate>=1.7.0 to the pyproject.toml file to include the accelerate library as a dependency for distributed and mixed-precision training.

How to checkout & try? (for the reviewer)


Examples:

python lerobot/scripts/train_accelerate.py \
--policy.type=pi0 \
--dataset.repo_id=danaaubakirova/koch_test

@lucasjinreal

Please merge?

@lucasjinreal

This PR is not working:

return self._untyped_storage.data_ptr()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.

The make_policy will fail

@YushunXiang
Contributor Author

This PR is not working:

return self._untyped_storage.data_ptr()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.

The make_policy will fail

Could you give me some details about this error?

@lucasjinreal

lucasjinreal commented Jun 14, 2025

Hi, torch 2.6+ introduces the DTensor feature. Accelerate won't be able to load the model properly or prepare it for distributed training when DTensor is not disabled. From what I can see in train_accelerate.py, this error is not handled anywhere.

On my side, torch 2.7.1 with the latest transformers and accelerate hits this error when training on multiple GPUs.

Once you resolve the state-dict loading error, there is still another error:

[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!

@zhangxp12345678

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

@YushunXiang
Contributor Author

YushunXiang commented Jun 16, 2025

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

I used model.safetensors as the inference checkpoint.
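For anyone following along, below is a minimal sketch of how such a model.safetensors file can be loaded back into an already-constructed policy for inference. It assumes the policy was built with the same config used for training; the helper name and the example path are hypothetical, not part of this PR.

    import torch
    from safetensors.torch import load_file

    def load_inference_weights(policy: torch.nn.Module, checkpoint_file: str) -> torch.nn.Module:
        # Read the raw tensors saved during training and copy them into the policy.
        state_dict = load_file(checkpoint_file)
        policy.load_state_dict(state_dict)
        return policy.eval()

    # Hypothetical usage:
    # policy = load_inference_weights(policy, "outputs/train/checkpoints/last/pretrained_model/model.safetensors")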

@zhangxp12345678

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

I used model.safetensors as the inference checkpoint.

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

@YushunXiang
Contributor Author

YushunXiang commented Jun 16, 2025

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

It is an interesting question.

At line 288 of lerobot/scripts/train_accelerate.py:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()

            # Unwrap model for saving
            unwrapped_policy = accelerator.unwrap_model(policy)
            save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
            update_last_checkpoint(checkpoint_dir)

You can modify it to:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()
            accelerator.save_model(model, save_directory)

Then try again?

@lucasjinreal

I am training with multiple GPUs now.

@command-z-z

command-z-z commented Jun 17, 2025

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

It is an interesting question.

At line 288 of lerobot/scripts/train_accelerate.py:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()

            # Unwrap model for saving
            unwrapped_policy = accelerator.unwrap_model(policy)
            save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
            update_last_checkpoint(checkpoint_dir)

You can modify it to:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()
            accelerator.save_model(model, save_directory)

Then try again?

I'm curious why you don't run into a deadlock due to the incorrect usage of accelerator.wait_for_everyone(). Only rank 0 (i.e., the main process) enters this if condition and then waits for the other ranks at accelerator.wait_for_everyone(). However, the other ranks never reach this code, which should result in a deadlock.
It may also be that I know too little about the accelerate package. I look forward to your reply to my confusion.
Is the following better or correct?

       if cfg.save_checkpoint and is_saving_step:
            logging.info(f"Process {accelerator.process_index} waiting at barrier before saving.")
            accelerator.wait_for_everyone()
            logging.info(f"Process {accelerator.process_index} passed the barrier.")

            if accelerator.is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
                checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

                logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")

                # Unwrap model for saving
                unwrapped_policy = accelerator.unwrap_model(policy)
                save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
                update_last_checkpoint(checkpoint_dir)
                # if wandb_logger:
                #     wandb_logger.log_policy(checkpoint_dir)

@zhangxp12345678


I've actually encountered this problem and look forward to the author's answer

@xliu0105

I am training with multiple GPUs now.

You mentioned above that you encountered the make_policy problem and another problem. How did you solve them?

@YushunXiang
Contributor Author

I'm curious why you don't run into a deadlock due to the incorrect usage of accelerator.wait_for_everyone(). Only rank 0 (i.e., the main process) enters this if condition and then waits for the other ranks at accelerator.wait_for_everyone(). However, the other ranks never reach this code, which should result in a deadlock. It may also be that I know too little about the accelerate package. I look forward to your reply to my confusion. Is the following better or correct?

       if cfg.save_checkpoint and is_saving_step:
            logging.info(f"Process {accelerator.process_index} waiting at barrier before saving.")
            accelerator.wait_for_everyone()
            logging.info(f"Process {accelerator.process_index} passed the barrier.")

            if accelerator.is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
                checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

                logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")

                # Unwrap model for saving
                unwrapped_policy = accelerator.unwrap_model(policy)
                save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
                update_last_checkpoint(checkpoint_dir)
                # if wandb_logger:
                #     wandb_logger.log_policy(checkpoint_dir)

You are right. I think the correct code is:

if cfg.save_checkpoint and is_saving_step:
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        logging.info(f"Checkpoint policy after step {step}")
        checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
        logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
        accelerator.save_model(model, save_directory)
    accelerator.wait_for_everyone() 

I will test it and commit it again. Thanks!
@command-z-z @zhangxp12345678
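For reference, here is a fleshed-out sketch of that block with the placeholder names resolved. It assumes model stands for the unwrapped policy and save_directory for the step checkpoint directory; this is only an illustration, not the code that was eventually committed.

    if cfg.save_checkpoint and is_saving_step:
        # Every rank reaches the barrier, so no process is left waiting forever.
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
            # Unwrap the accelerate/DDP wrapper and write model.safetensors into checkpoint_dir.
            unwrapped_policy = accelerator.unwrap_model(policy)
            accelerator.save_model(unwrapped_policy, checkpoint_dir)
            update_last_checkpoint(checkpoint_dir)
        # A second barrier keeps the other ranks from racing ahead while rank 0 writes to disk.
        accelerator.wait_for_everyone()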

@xliu0105


I tested the code and found that when using mixed precision training, there will be a dtype mismatch error, like this:

[rank0]: Traceback (most recent call last):
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 399, in
[rank0]: train()
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/configs/parser.py", line 226, in wrapper_inner
[rank0]: response = fn(cfg, *args, **kwargs)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 294, in train
[rank0]: train_tracker, output_dict = update_policy(
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 97, in update_policy
[rank0]: loss, output_dict = policy.forward(batch)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2087, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 319, in forward
[rank0]: losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions, noise, time)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 628, in forward
[rank0]: suffix_embs, suffix_pad_masks, suffix_att_masks = self.embed_suffix(state, x_t, time)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 564, in embed_suffix
[rank0]: state_emb = self.state_proj(state)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank0]: return F.linear(input, self.weight, self.bias)
[rank0]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

Do you know how to fix it?

@YushunXiang
Contributor Author


Can you give me your running command?


@xliu0105

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?

python -m accelerate.commands.launch \
    --num_processes 3 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_dis.py \
    --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 \
    --policy.gradient_accumulation_steps=1 \
    --dataset.repo_id=None \
    --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 \
    --dataset.video_backend=pyav \
    --batch_size=3 \
    --num_workers=4 \
    --log_freq=1 \
    --seed=1000 \
    --steps=100000 \
    --save_checkpoint=True \
    --save_freq=2 \
    --resume=False \
    --wandb.enable=True \
    --wandb.project=lerobot \
    --wandb.entity=xliu-work-huazhong-university-of-science-and-technology \
    --wandb.run_id=None

@xliu0105


Oh, I also used DeepSpeed, which has some differences from your code, but most of it is pretty consistent.

@YushunXiang
Contributor Author

YushunXiang commented Jun 18, 2025

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?

python -m accelerate.commands.launch --num_processes 3 --num_machines 1 --mixed_precision=fp16 lerobot/scripts/train_dis.py --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 --policy.gradient_accumulation_steps=1 --dataset.repo_id=None --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 --dataset.video_backend=pyav --batch_size=3 --num_workers=4 --log_freq=1 --seed=1000 --steps=100000 --save_checkpoint=True --save_freq=2 --resume=False --wandb.enable=True --wandb.project=lerobot --wandb.entity=xliu-work-huazhong-university-of-science-and-technology --wandb.run_id=None \

I think you should add an argument named --policy.use_amp=True, because in train_accelerate.py:

    accelerator = Accelerator(
        mixed_precision="fp16" if cfg.policy.use_amp else "no",
        gradient_accumulation_steps=cfg.policy.gradient_accumulation_steps,
        log_with="wandb" if cfg.wandb.enable else None,
        kwargs_handlers=[ddp_kwargs],
        project_dir=cfg.output_dir,
    )
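For context on the gradient_accumulation_steps setting, the usual accelerate pattern is to wrap the forward/backward pass in the accumulation context. The sketch below is only illustrative; the variable names are assumptions and not necessarily what train_accelerate.py does.

    for batch in dataloader:
        # Gradients are synchronized and the optimizer actually steps only every
        # gradient_accumulation_steps iterations; in between, backward() just
        # accumulates gradients locally on each rank.
        with accelerator.accumulate(policy):
            loss, output_dict = policy.forward(batch)
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()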

@xliu0105

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?
python -m accelerate.commands.launch --num_processes 3 --num_machines 1 --mixed_precision=fp16 lerobot/scripts/train_dis.py --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 --policy.gradient_accumulation_steps=1 --dataset.repo_id=None --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 --dataset.video_backend=pyav --batch_size=3 --num_workers=4 --log_freq=1 --seed=1000 --steps=100000 --save_checkpoint=True --save_freq=2 --resume=False --wandb.enable=True --wandb.project=lerobot --wandb.entity=xliu-work-huazhong-university-of-science-and-technology --wandb.run_id=None \

I think you should add an argument named --policy.use_amp=True, because in train_accelerate.py:

    accelerator = Accelerator(
        mixed_precision="fp16" if cfg.policy.use_amp else "no",
        gradient_accumulation_steps=cfg.policy.gradient_accumulation_steps,
        log_with="wandb" if cfg.wandb.enable else None,
        kwargs_handlers=[ddp_kwargs],
        project_dir=cfg.output_dir,
    )

I did not use policy.use_amp as the condition in my code; instead, I check whether fp16 is set in the accelerate config. I also wrapped the forward pass with autocast:

    with accelerator.autocast():
        loss, output_dict = policy.forward(batch)

I think this is the same as mixed_precision="fp16" if cfg.policy.use_amp else "no"

@command-z-z

fp16

I don't know the specific reason, but I guess accelerate can't convert variables that are explicitly cast to float32 to fp16, such as this one:

return time.to(dtype=torch.float32, device=device)

causing the two operand dtypes to mismatch when performing the matrix multiplication.
I don't know much about accelerate; this is just an insight.
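If that is indeed the cause, one possible direction is to cast the input to the dtype of the projection weights right before the matmul, so F.linear never sees Float and Half together. This is an untested sketch of the idea, not a change from this PR:

    import torch

    def project_state(state_proj: torch.nn.Linear, state: torch.Tensor) -> torch.Tensor:
        # Align the input dtype with the layer's weight dtype before the projection,
        # so the two matmul operands match under mixed precision.
        return state_proj(state.to(dtype=state_proj.weight.dtype))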

@YushunXiang changed the title from "Multi-gpus training with accelerate" to "[WIP] Multi-gpus training with accelerate" on Jul 1, 2025
@Viredery

Viredery commented Jul 3, 2025

Oh, I also used DeepSpeed, which has some differences from your code, but most of it is pretty consistent.

In DeepSpeed ZeRO-2, the model weights are cast to the specified type (bf16 or fp32) by default. However, the weight types in PI0 include BOTH fp32 (the linear projectors) and bf16 (the transformer layers). I don't know how to solve it either.

accelerate works well, but DeepSpeed / FSDP do not.
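One quick way to check that observation is to tally the parameter dtypes of the instantiated policy. This is a small diagnostic sketch; the policy variable is assumed to be the loaded PI0 model.

    from collections import Counter
    import torch

    def count_param_dtypes(module: torch.nn.Module) -> Counter:
        # Count how many parameters the model keeps in each dtype, e.g. to confirm
        # that the linear projectors stay in fp32 while the transformer layers are bf16.
        return Counter(str(p.dtype) for p in module.parameters())

    # Hypothetical usage: print(count_param_dtypes(policy))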

@YushunXiang YushunXiang marked this pull request as draft July 7, 2025 07:45
@chenkang455


Hi, I recently met a similar problem. From a theoretical perspective the process should hit the deadlock, but it doesn't. Have you figured it out?

@CarolinePascal added the enhancement, policies, and performance labels on Jul 25, 2025
@heimrih

heimrih commented Jul 29, 2025

Can you show a sample of the command you run for multiple GPUs? Is it just a matter of changing gradient_accumulation_steps or num_processes?

@YushunXiang
Contributor Author

Can you show a sample of the command you run for multiple GPUs? Is it just a matter of changing gradient_accumulation_steps or num_processes?

python -m accelerate.commands.launch \
    --num_processes 4 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_accelerate.py \
    ...

@heimrih

heimrih commented Jul 30, 2025

Thanks! However, I still encounter errors, but using torchrun --nproc_per_node 4 seems to be fine.

@heimrih

heimrih commented Jul 30, 2025

Thanks! However, I still encounter errors, but using torchrun --nproc_per_node 4 seems to be fine.

I made several attempts to fix the issue but was unable to with the initial accelerate script. It seems that even though I have set 4 GPUs, it loads the model onto the same one:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity
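A common cause of every rank loading onto GPU 0 is that the policy is moved to a hard-coded "cuda" device (i.e. cuda:0) on every process before accelerator.prepare() runs. Using accelerator.device instead usually avoids this; the sketch below only illustrates the idea and is not taken from the script.

    from accelerate import Accelerator
    import torch

    def place_policy(accelerator: Accelerator, policy: torch.nn.Module) -> torch.nn.Module:
        # accelerator.device differs per process (cuda:0, cuda:1, ...), so each rank
        # keeps its copy of the model on its own GPU instead of piling onto GPU 0.
        policy.to(accelerator.device)
        return accelerator.prepare(policy)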

@jadechoghari
Member

Thank you so much for the PR; however, we are closing this as we recently added multi-GPU training support with accelerate:
#2154
