how to gather checkpoints to master node during multi-nodes training #5452

Closed
conderls opened this issue Apr 23, 2024 · 6 comments

@conderls

Describe the bug
With 2 nodes x 2 GPUs, I can fine-tune an LLM, with checkpoints saved on each node:

node0: checkpoint-2
├── adapter_config.json
├── adapter_model.safetensors
├── global_step2
│   ├── mp_rank_00_model_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_1_mp_rank_00_optim_states.pt
├── latest
├── README.md
├── rng_state_0.pth
├── rng_state_1.pth
├── ......

node1: checkpoint-2
├── adapter_config.json
├── adapter_model.safetensors
├── global_step2
│   ├── mp_rank_00_model_states.pt
│   ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_3_mp_rank_00_optim_states.pt
├── latest
├── README.md
├── rng_state_2.pth
├── rng_state_3.pth
├── ......

The run uses --save_on_each_node true (a transformers TrainingArguments option) and a DeepSpeed config containing

"checkpoint": {
    "use_node_local_storage": true
}

so the checkpoints are saved on each node, as described in #2319.
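
For reference, a minimal sketch (not taken from the actual training script; every field other than the two options discussed above is a placeholder) of how the flags fit together on the transformers side:

from transformers import TrainingArguments

# Sketch only: placeholder values everywhere except the two options above.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
    # DeepSpeed-side option: keep checkpoint shards on each node's local disk.
    "checkpoint": {"use_node_local_storage": True},
}

training_args = TrainingArguments(
    output_dir="checkpoint_out",
    save_on_each_node=True,   # transformers-side flag mentioned above
    deepspeed=ds_config,      # can also be a path to the JSON config file
)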

Questions:

  1. In order to merge the LoRA checkpoints, I need to copy them from the worker nodes to the master node; how can I gather the checkpoints on the master automatically?
  2. If I use a shared filesystem, saving the checkpoints fails with the following error:
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
node1:     self._save_checkpoint(model, trial, metrics=metrics)
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2555, in _save_checkpoint
node1:     shutil.rmtree(staging_output_dir)
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 722, in rmtree
node1:     onerror(os.rmdir, path, sys.exc_info())
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 720, in rmtree
node1:     os.rmdir(path)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/0423220002/tmp-checkpoint-2'

How do I use the shared filesystem properly?

Expected behavior

  1. Without a shared filesystem: automatically gather the checkpoints to the master node (or provide an option to do so);
  2. With a shared filesystem: save all checkpoints to the same path.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: two machines with x8 A40
  • Python version: 3.8.18
  • deepspeed: 0.13.1

Launcher context
deepspeed launcher, pdsh

conderls added the bug and training labels on Apr 23, 2024
@tjruwase
Contributor


This is a transformers error, not a DeepSpeed one. The error occurs because the destination folder does not exist, so the remove-folder call fails.
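
For illustration, a minimal standalone sketch (hypothetical temp paths, no transformers or DeepSpeed involved) of that failure mode on a shared filesystem: one process promotes the staging directory while another still tries to remove it, so the cleanup hits a path that no longer exists:

import os
import shutil
import tempfile

root = tempfile.mkdtemp()
staging = os.path.join(root, "tmp-checkpoint-2")
final = os.path.join(root, "checkpoint-2")
os.makedirs(staging)

# "Process A" (the main process) promotes the staging directory first ...
os.rename(staging, final)

# ... then "process B" (local rank 0 on another node) tries to clean it up.
try:
    shutil.rmtree(staging)
except FileNotFoundError as err:
    print("cleanup failed, the directory is already gone:", err)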

@conderls
Author

@tjruwase thanks for the reply.
I know this error comes directly from transformers (v4.38.2), but it is triggered by multi-node training with a shared filesystem.

I am wondering, for multi-node training with DeepSpeed:

  1. Does the transformers Trainer not work well with DeepSpeed when using a shared filesystem?
  2. What is the most convenient way to collect the checkpoints from the different nodes?

tjruwase reopened this on Apr 24, 2024
@tjruwase
Contributor

@conderls, the standard (i.e., default) multi-node training with DeepSpeed assumes a shared filesystem.

The feature that you referenced earlier, the "checkpoint": {"use_node_local_storage": true} option, was created to support node-local storage.

Have you tried eliminating these options from your config in order to get the standard shared filesystem checkpointing behavior?
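
In other words, something like the following (a sketch only; every value apart from the two settings being removed is a placeholder, including the shared output path):

from transformers import TrainingArguments

# No "checkpoint" section in the DeepSpeed config ...
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
}

# ... and save_on_each_node left at its default (False) on the transformers side.
training_args = TrainingArguments(
    output_dir="/data/output_share/run",  # hypothetical path visible to all nodes
    save_on_each_node=False,
    deepspeed=ds_config,
)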

@conderls
Author

@conderls, the standard (i.e., default) multi-node training with DeepSpeed assumes a shared filesystem.

The feature that you referenced earlier, the "checkpoint": {"use_node_local_storage": true} option, was created to support node-local storage.

Have you tried eliminating these options from your config in order to get the standard shared filesystem checkpointing behavior?

Yes, even with a shared filesystem the error is still raised. I will dig deeper into the logic of transformers/trainer.py.

==> save_on_each_node=false, use_node_local_storage=false <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 720, in rmtree
node1:     os.rmdir(path)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134117/tmp-checkpoint-2'
node1: [2024-04-25 05:42:54,555] [INFO] [launch.py:347:main] Process 714566 exits successfully.

==> save_on_each_node=true, use_node_local_storage=false <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2538, in _save_checkpoint
node1:     os.rename(staging_output_dir, output_dir)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134405/tmp-checkpoint-2' -> '/data/output_share/baichuan_0425134405/checkpoint-2'
node1: [2024-04-25 05:45:52,057] [INFO] [launch.py:347:main] Process 715263 exits successfully.

==> save_on_each_node=true, use_node_local_storage=true <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2538, in _save_checkpoint
node1:     os.rename(staging_output_dir, output_dir)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134759/tmp-checkpoint-2' -> '/data/output_share/baichuan_0425134759/checkpoint-2'
node1: [2024-04-25 05:49:35,814] [INFO] [launch.py:347:main] Process 715971 exits successfully.

@tjruwase
Contributor

@conderls, as you dig into trainer logic, can you also check if using a fixed folder name, i.e., no timestamp, helps.

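For example (a guess at how the run directory is currently named, reconstructed from the paths in the logs above):

import time

# Hypothetical reconstruction of the current pattern: a timestamp evaluated
# independently on each node can differ, so nodes may disagree on the path.
timestamped_dir = "/data/output_share/baichuan_" + time.strftime("%m%d%H%M%S")

# The suggested check: a fixed name that every node and process agrees on.
fixed_dir = "/data/output_share/baichuan_run"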

@conderls
Author

conderls commented Apr 27, 2024

@conderls, as you dig into trainer logic, can you also check if using a fixed folder name, i.e., no timestamp, helps.


I just found that the checkpoint-saving strategy was updated in transformers v4.39.0:

https://github.com/huggingface/transformers/blob/v4.39.0/src/transformers/trainer.py#L2643

Before that, the saving strategy (v4.38.2) caused contention between processes, and the file operations failed:

        # Then go through the rewriting process, only renaming and rotating from main process(es)
        if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            if staging_output_dir != output_dir:
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

                    # Ensure rename completed in cases where os.rename is not atomic
                    # And can only happen on non-windows based systems
                    if os.name != "nt":
                        fd = os.open(output_dir, os.O_RDONLY)
                        os.fsync(fd)
                        os.close(fd)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                # Solely rely on numerical checkpoint id for rotation.
                # mtime is not reliable especially on some fuse fs in cloud environments.
                self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

It works for a shared filesystem now, but for a non-shared filesystem we still need to collect the checkpoints from the workers to the master node.
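
For what it's worth, one way to do that collection by hand (a sketch only; neither transformers nor DeepSpeed provides this, and the hostname, the LOCAL_RANK handling and the use of rsync over ssh are all assumptions) is to let local rank 0 on each worker node push its checkpoint directory to the master after every save, e.g. from a TrainerCallback.on_save hook:

import os
import subprocess

import torch.distributed as dist

MASTER_HOST = "node0"  # hypothetical hostname of the master node

def push_checkpoints_to_master(output_dir: str) -> None:
    """Run on every rank after a checkpoint is saved; copies node-local files to the master."""
    dist.barrier()  # wait until every rank has finished writing its shards
    is_local_rank_zero = int(os.environ.get("LOCAL_RANK", "0")) == 0
    on_master_node = dist.get_rank() == 0  # assumes global rank 0 lives on the master node
    if is_local_rank_zero and not on_master_node:
        # One copy per worker node; rsync over ssh is an assumption, not a built-in feature.
        subprocess.run(
            ["rsync", "-a", output_dir + "/", f"{MASTER_HOST}:{output_dir}/"],
            check=True,
        )
    dist.barrier()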

Thanks for your time.
