
Conversation

@YushunXiang
Contributor

I have seen a lot of issues and PRs (#1176 #558 #317 #956 #876 #778) about data-parallel training with multiple GPUs. This shows that multi-GPU training is something this community, as well as I, need. So I wrote lerobot/scripts/train_accelerate.py to address it.

What this does

This pull request introduces a new training script leveraging the accelerate library for distributed and mixed-precision training. It also adds support for gradient accumulation and updates dependencies accordingly. Below are the most significant changes grouped by theme:

New Training Script

  • Added a comprehensive training script in lerobot/scripts/train_accelerate.py that integrates the accelerate library for distributed training, mixed-precision support, and gradient accumulation. The script includes features such as policy updates, checkpointing, evaluation, and integration with Weights & Biases for logging.

Configuration Updates

  • Introduced a new configuration parameter gradient_accumulation_steps in the PreTrainedConfig class to support gradient accumulation during training.

Dependency Updates

  • Added accelerate>=1.7.0 to the pyproject.toml file to include the accelerate library as a dependency for distributed and mixed-precision training.

How to checkout & try? (for the reviewer)


Examples:

python lerobot/scripts/train_accelerate.py \
--policy.type=pi0 \
--dataset.repo_id=danaaubakirova/koch_test

@lucasjinreal

Please merge?

@lucasjinreal

This PR is not working:

return self._untyped_storage.data_ptr()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.

The make_policy will fail

@YushunXiang
Contributor Author

This PR is not working:

return self._untyped_storage.data_ptr()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.

The make_policy will fail

Could you give me some details about this error?

@lucasjinreal

lucasjinreal commented Jun 14, 2025

Hi, torch 2.6+ introduces the DTensor feature. Accelerate won't be able to load the model properly or prepare it for distributed training when DTensor is not disabled. From what I can see in train_accelerate.py, this error is not handled anywhere.

On my side, torch 2.7.1 with the latest transformers and accelerate hits this error when training on multiple GPUs.

Once you resolve the state-dict loading error, there is still another error:

[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!

@zhangxp12345678

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

@YushunXiang
Contributor Author

YushunXiang commented Jun 16, 2025

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

I used model.safetensors as the inference checkpoint.
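For anyone following along, below is a minimal sketch of how such a model.safetensors file can be loaded back into an already-constructed policy for inference. It assumes the policy was built with the same config used for training; the helper name and the example path are hypothetical, not part of this PR.

    import torch
    from safetensors.torch import load_file

    def load_inference_weights(policy: torch.nn.Module, checkpoint_file: str) -> torch.nn.Module:
        # Read the raw tensors saved during training and copy them into the policy.
        state_dict = load_file(checkpoint_file)
        policy.load_state_dict(state_dict)
        return policy.eval()

    # Hypothetical usage:
    # policy = load_inference_weights(policy, "outputs/train/checkpoints/last/pretrained_model/model.safetensors")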

@zhangxp12345678

Can you please tell me what modifications need to be made to the saved weights for inference? I find that the multi-GPU weights perform very poorly compared to single-GPU training.

I used model.safetensors as the inference checkpoint.

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

@YushunXiang
Contributor Author

YushunXiang commented Jun 16, 2025

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

It is an interesting question.

At line 288 of lerobot/scripts/train_accelerate.py:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()

            # Unwrap model for saving
            unwrapped_policy = accelerator.unwrap_model(policy)
            save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
            update_last_checkpoint(checkpoint_dir)

You can modify it to:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()
            accelerator.save_model(model, save_directory)

Then try again?

@lucasjinreal

I am training with multiple GPUs now.

@command-z-z

command-z-z commented Jun 17, 2025

My checkpoint file is the same as the single-GPU one, but at inference time (pi0) it has almost no effect. Is there a problem with the checkpoint saving logic, or do I need to make other changes? Please advise!

It is an interesting question.

At line 288 of lerobot/scripts/train_accelerate.py:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()

            # Unwrap model for saving
            unwrapped_policy = accelerator.unwrap_model(policy)
            save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
            update_last_checkpoint(checkpoint_dir)

You can modify it to:

        if cfg.save_checkpoint and is_saving_step and accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

            # Wait for all processes before saving
            accelerator.wait_for_everyone()
            accelerator.save_model(model, save_directory)

Then try again?

I'm curious why you don't run into a deadlock due to the incorrect usage of accelerator.wait_for_everyone(). Only rank 0 (i.e., the main process) enters this if condition and then waits for the other ranks at accelerator.wait_for_everyone(). However, the other ranks never reach this code, which should result in a deadlock.
It may also be that I know too little about the accelerate package. I look forward to your reply to my confusion.
Is the following better or correct?

       if cfg.save_checkpoint and is_saving_step:
            logging.info(f"Process {accelerator.process_index} waiting at barrier before saving.")
            accelerator.wait_for_everyone()
            logging.info(f"Process {accelerator.process_index} passed the barrier.")

            if accelerator.is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
                checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

                logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")

                # Unwrap model for saving
                unwrapped_policy = accelerator.unwrap_model(policy)
                save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
                update_last_checkpoint(checkpoint_dir)
                # if wandb_logger:
                #     wandb_logger.log_policy(checkpoint_dir)

@zhangxp12345678


I've actually encountered this problem and look forward to the author's answer

@xliu0105

I am training with multiple GPUs now.

You mentioned above that you encountered the make_policy problem and another problem. How did you solve them?

@YushunXiang
Contributor Author

I'm curious why you don't run into a deadlock due to the incorrect usage of accelerator.wait_for_everyone(). Only rank 0 (i.e., the main process) enters this if condition and then waits for the other ranks at accelerator.wait_for_everyone(). However, the other ranks never reach this code, which should result in a deadlock. It may also be that I know too little about the accelerate package. I look forward to your reply to my confusion. Is the following better or correct?

       if cfg.save_checkpoint and is_saving_step:
            logging.info(f"Process {accelerator.process_index} waiting at barrier before saving.")
            accelerator.wait_for_everyone()
            logging.info(f"Process {accelerator.process_index} passed the barrier.")

            if accelerator.is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
                checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)

                logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")

                # Unwrap model for saving
                unwrapped_policy = accelerator.unwrap_model(policy)
                save_checkpoint(checkpoint_dir, step, cfg, unwrapped_policy, optimizer, lr_scheduler)
                update_last_checkpoint(checkpoint_dir)
                # if wandb_logger:
                #     wandb_logger.log_policy(checkpoint_dir)

You are right. I think the correct code is:

if cfg.save_checkpoint and is_saving_step:
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        logging.info(f"Checkpoint policy after step {step}")
        checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
        logging.info(colored("This is save model in ", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
        accelerator.save_model(model, save_directory)
    accelerator.wait_for_everyone() 

I will test it and commit it again. Thanks!
@command-z-z @zhangxp12345678
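For reference, here is a fleshed-out sketch of that block with the placeholder names resolved. It assumes model stands for the unwrapped policy and save_directory for the step checkpoint directory; this is only an illustration, not the code that was eventually committed.

    if cfg.save_checkpoint and is_saving_step:
        # Every rank reaches the barrier, so no process is left waiting forever.
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            logging.info(f"Checkpoint policy after step {step}")
            checkpoint_dir = get_step_checkpoint_dir(cfg.output_dir, cfg.steps, step)
            # Unwrap the accelerate/DDP wrapper and write model.safetensors into checkpoint_dir.
            unwrapped_policy = accelerator.unwrap_model(policy)
            accelerator.save_model(unwrapped_policy, checkpoint_dir)
            update_last_checkpoint(checkpoint_dir)
        # A second barrier keeps the other ranks from racing ahead while rank 0 writes to disk.
        accelerator.wait_for_everyone()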

@xliu0105


I tested the code and found that when using mixed precision training, there will be a dtype mismatch error, like this:

[rank0]: Traceback (most recent call last):
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 399, in
[rank0]: train()
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/configs/parser.py", line 226, in wrapper_inner
[rank0]: response = fn(cfg, *args, **kwargs)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 294, in train
[rank0]: train_tracker, output_dict = update_policy(
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/scripts/train_dis.py", line 97, in update_policy
[rank0]: loss, output_dict = policy.forward(batch)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2087, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 319, in forward
[rank0]: losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions, noise, time)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 628, in forward
[rank0]: suffix_embs, suffix_pad_masks, suffix_att_masks = self.embed_suffix(state, x_t, time)
[rank0]: File "/nas13/lx_folder/lx_vla/lerobot/lerobot/common/policies/pi0/modeling_pi0.py", line 564, in embed_suffix
[rank0]: state_emb = self.state_proj(state)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/liuxu/miniforge3/envs/lerobot/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank0]: return F.linear(input, self.weight, self.bias)
[rank0]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half

Do you know how to fix it?

@YushunXiang
Contributor Author


Can you give me your running command?


@xliu0105

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?

python -m accelerate.commands.launch \
    --num_processes 3 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_dis.py \
    --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 \
    --policy.gradient_accumulation_steps=1 \
    --dataset.repo_id=None \
    --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 \
    --dataset.video_backend=pyav \
    --batch_size=3 \
    --num_workers=4 \
    --log_freq=1 \
    --seed=1000 \
    --steps=100000 \
    --save_checkpoint=True \
    --save_freq=2 \
    --resume=False \
    --wandb.enable=True \
    --wandb.project=lerobot \
    --wandb.entity=xliu-work-huazhong-university-of-science-and-technology \
    --wandb.run_id=None

@xliu0105


Oh, I also used DeepSpeed, which has some differences from your code, but most of it is pretty consistent.

@YushunXiang
Contributor Author

YushunXiang commented Jun 18, 2025

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?

python -m accelerate.commands.launch --num_processes 3 --num_machines 1 --mixed_precision=fp16 lerobot/scripts/train_dis.py --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 --policy.gradient_accumulation_steps=1 --dataset.repo_id=None --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 --dataset.video_backend=pyav --batch_size=3 --num_workers=4 --log_freq=1 --seed=1000 --steps=100000 --save_checkpoint=True --save_freq=2 --resume=False --wandb.enable=True --wandb.project=lerobot --wandb.entity=xliu-work-huazhong-university-of-science-and-technology --wandb.run_id=None \

I think you should add an argument named --policy.use_amp=True, because in train_accelerate.py:

    accelerator = Accelerator(
        mixed_precision="fp16" if cfg.policy.use_amp else "no",
        gradient_accumulation_steps=cfg.policy.gradient_accumulation_steps,
        log_with="wandb" if cfg.wandb.enable else None,
        kwargs_handlers=[ddp_kwargs],
        project_dir=cfg.output_dir,
    )
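For context on the gradient_accumulation_steps setting, the usual accelerate pattern is to wrap the forward/backward pass in the accumulation context. The sketch below is only illustrative; the variable names are assumptions and not necessarily what train_accelerate.py does.

    for batch in dataloader:
        # Gradients are synchronized and the optimizer actually steps only every
        # gradient_accumulation_steps iterations; in between, backward() just
        # accumulates gradients locally on each rank.
        with accelerator.accumulate(policy):
            loss, output_dict = policy.forward(batch)
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()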

@xliu0105

Here is my running command. I set mixed_precision to fp16 or bf16 and always get dtype mismatch errors. I'm wondering if there is something wrong with the pi0 model?
python -m accelerate.commands.launch --num_processes 3 --num_machines 1 --mixed_precision=fp16 lerobot/scripts/train_dis.py --policy.path=/nas13/lx_folder/pretrained_models/lerobot-pi0 --policy.gradient_accumulation_steps=1 --dataset.repo_id=None --dataset.root=/nas13/lx_folder/datasets/lerobot/xht_lerobot/chef_z_lerobot_joint_img_256_256_delay_0.1_traj_320 --dataset.video_backend=pyav --batch_size=3 --num_workers=4 --log_freq=1 --seed=1000 --steps=100000 --save_checkpoint=True --save_freq=2 --resume=False --wandb.enable=True --wandb.project=lerobot --wandb.entity=xliu-work-huazhong-university-of-science-and-technology --wandb.run_id=None \

I think you should add an argument named --policy.use_amp=True, because in train_accelerate.py:

    accelerator = Accelerator(
        mixed_precision="fp16" if cfg.policy.use_amp else "no",
        gradient_accumulation_steps=cfg.policy.gradient_accumulation_steps,
        log_with="wandb" if cfg.wandb.enable else None,
        kwargs_handlers=[ddp_kwargs],
        project_dir=cfg.output_dir,
    )

I did not use policy.use_amp as the condition in my code; instead, I check whether fp16 is set in the accelerate config. I also wrapped the forward pass with autocast:

    with accelerator.autocast():
        loss, output_dict = policy.forward(batch)

I think this is the same as mixed_precision="fp16" if cfg.policy.use_amp else "no"

@command-z-z

fp16

I don't know the specific reason, but I guess accelerate can't convert variables that are explicitly cast to float32 to fp16, such as this one:

return time.to(dtype=torch.float32, device=device)

causing the two operand dtypes to mismatch when performing the matrix multiplication.
I don't know much about accelerate; this is just an insight.
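If that is indeed the cause, one possible direction is to cast the input to the dtype of the projection weights right before the matmul, so F.linear never sees Float and Half together. This is an untested sketch of the idea, not a change from this PR:

    import torch

    def project_state(state_proj: torch.nn.Linear, state: torch.Tensor) -> torch.Tensor:
        # Align the input dtype with the layer's weight dtype before the projection,
        # so the two matmul operands match under mixed precision.
        return state_proj(state.to(dtype=state_proj.weight.dtype))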

@YushunXiang changed the title from "Multi-gpus training with accelerate" to "[WIP] Multi-gpus training with accelerate" on Jul 1, 2025
@Viredery

Viredery commented Jul 3, 2025

Oh, I also used DeepSpeed, which has some differences from your code, but most of it is pretty consistent.

In DeepSpeed ZeRO-2, the model weights are cast to the specified type (bf16 or fp32) by default. However, the weight types in PI0 include BOTH fp32 (the linear projectors) and bf16 (the transformer layers). I don't know how to solve it either.

accelerate works well, but DeepSpeed / FSDP do not.
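One quick way to check that observation is to tally the parameter dtypes of the instantiated policy. This is a small diagnostic sketch; the policy variable is assumed to be the loaded PI0 model.

    from collections import Counter
    import torch

    def count_param_dtypes(module: torch.nn.Module) -> Counter:
        # Count how many parameters the model keeps in each dtype, e.g. to confirm
        # that the linear projectors stay in fp32 while the transformer layers are bf16.
        return Counter(str(p.dtype) for p in module.parameters())

    # Hypothetical usage: print(count_param_dtypes(policy))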

@YushunXiang YushunXiang marked this pull request as draft July 7, 2025 07:45
@chenkang455


Hi, I recently met a similar problem. From a theoretical perspective the process should hit the deadlock, but it doesn't. Have you figured it out?

@CarolinePascal added the enhancement, policies, and performance labels on Jul 25, 2025
@heimrih

heimrih commented Jul 29, 2025

Can you show a sample of the command you run for multiple GPUs? Is it just a matter of changing gradient_accumulation_steps or num_processes?

@YushunXiang
Contributor Author

Can you show a sample of the command you run for multiple GPUs? Is it just a matter of changing gradient_accumulation_steps or num_processes?

python -m accelerate.commands.launch \
    --num_processes 4 \
    --num_machines 1 \
    --mixed_precision=fp16 \
    lerobot/scripts/train_accelerate.py \
    ...

@heimrih

heimrih commented Jul 30, 2025

Thanks! However, I still encounter errors, but using torchrun --nproc_per_node 4 seems to be fine.

@heimrih

heimrih commented Jul 30, 2025

Thanks! However, I still encounter errors, but using torchrun --nproc_per_node 4 seems to be fine.

I made several attempts to fix the issue but was unable to with the initial accelerate script. It seems that even though I have set 4 GPUs, it loads the model onto the same one:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity
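A common cause of every rank loading onto GPU 0 is that the policy is moved to a hard-coded "cuda" device (i.e. cuda:0) on every process before accelerator.prepare() runs. Using accelerator.device instead usually avoids this; the sketch below only illustrates the idea and is not taken from the script.

    from accelerate import Accelerator
    import torch

    def place_policy(accelerator: Accelerator, policy: torch.nn.Module) -> torch.nn.Module:
        # accelerator.device differs per process (cuda:0, cuda:1, ...), so each rank
        # keeps its copy of the model on its own GPU instead of piling onto GPU 0.
        policy.to(accelerator.device)
        return accelerator.prepare(policy)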

@jadechoghari
Member

Thank you so much for the PR; however, we are closing this as we recently added multi-GPU training support with accelerate:
#2154
