how to gather checkpoints to master node during multi-nodes training #5452

Closed
conderls opened this issue Apr 23, 2024 · 6 comments

@conderls

Describe the bug
With 2 nodes x 2 GPUs, I can fine-tune an LLM, with checkpoints saved on each node:

node0: checkpoint-2
├── adapter_config.json
├── adapter_model.safetensors
├── global_step2
│   ├── mp_rank_00_model_states.pt
│   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_1_mp_rank_00_optim_states.pt
├── latest
├── README.md
├── rng_state_0.pth
├── rng_state_1.pth
├── ......

node1: checkpoint-2
├── adapter_config.json
├── adapter_model.safetensors
├── global_step2
│   ├── mp_rank_00_model_states.pt
│   ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│   └── zero_pp_rank_3_mp_rank_00_optim_states.pt
├── latest
├── README.md
├── rng_state_2.pth
├── rng_state_3.pth
├── ......

The run uses --save_on_each_node true (a transformers TrainingArguments option) and a DeepSpeed config containing

"checkpoint": {
    "use_node_local_storage": true
}

so the checkpoints are saved on each node, as described in #2319.
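
For reference, a minimal sketch (not taken from the actual training script; every field other than the two options discussed above is a placeholder) of how the flags fit together on the transformers side:

from transformers import TrainingArguments

# Sketch only: placeholder values everywhere except the two options above.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
    # DeepSpeed-side option: keep checkpoint shards on each node's local disk.
    "checkpoint": {"use_node_local_storage": True},
}

training_args = TrainingArguments(
    output_dir="checkpoint_out",
    save_on_each_node=True,   # transformers-side flag mentioned above
    deepspeed=ds_config,      # can also be a path to the JSON config file
)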

Questions:

  1. In order to merge the LoRA checkpoints, I need to copy them from the worker nodes to the master node; how can I gather the checkpoints on the master automatically?
  2. If I use a shared filesystem, saving the checkpoints fails with the following error:
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
node1:     self._save_checkpoint(model, trial, metrics=metrics)
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2555, in _save_checkpoint
node1:     shutil.rmtree(staging_output_dir)
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 722, in rmtree
node1:     onerror(os.rmdir, path, sys.exc_info())
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 720, in rmtree
node1:     os.rmdir(path)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/0423220002/tmp-checkpoint-2'

How do I use the shared filesystem properly?

Expected behavior

  1. Without a shared filesystem: automatically gather the checkpoints to the master node (or provide an option to do so);
  2. With a shared filesystem: save all checkpoints to the same path.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU count and types: two machines with x8 A40
  • Python version: 3.8.18
  • deepspeed: 0.13.1

Launcher context
deepspeed launcher, pdsh

conderls added the bug and training labels on Apr 23, 2024
@tjruwase
Contributor


This is a transformers error, not a DeepSpeed one. The error occurs because the destination folder does not exist, so the remove-folder call fails.
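
For illustration, a minimal standalone sketch (hypothetical temp paths, no transformers or DeepSpeed involved) of that failure mode on a shared filesystem: one process promotes the staging directory while another still tries to remove it, so the cleanup hits a path that no longer exists:

import os
import shutil
import tempfile

root = tempfile.mkdtemp()
staging = os.path.join(root, "tmp-checkpoint-2")
final = os.path.join(root, "checkpoint-2")
os.makedirs(staging)

# "Process A" (the main process) promotes the staging directory first ...
os.rename(staging, final)

# ... then "process B" (local rank 0 on another node) tries to clean it up.
try:
    shutil.rmtree(staging)
except FileNotFoundError as err:
    print("cleanup failed, the directory is already gone:", err)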

@conderls
Author

@tjruwase thanks for the reply.
I know this error comes directly from transformers (v4.38.2), but it is triggered by multi-node training with a shared filesystem.

I am wondering, for multi-node training with DeepSpeed:

  1. Does the transformers Trainer not work well with DeepSpeed when using a shared filesystem?
  2. What is the most convenient way to collect the checkpoints from the different nodes?

tjruwase reopened this on Apr 24, 2024
@tjruwase
Contributor

@conderls, the standard (i.e., default) multi-node training with DeepSpeed assumes a shared filesystem.

The feature that you referenced earlier, the "checkpoint": {"use_node_local_storage": true} option, was created to support node-local storage.

Have you tried eliminating these options from your config in order to get the standard shared filesystem checkpointing behavior?
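
In other words, something like the following (a sketch only; every value apart from the two settings being removed is a placeholder, including the shared output path):

from transformers import TrainingArguments

# No "checkpoint" section in the DeepSpeed config ...
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {"stage": 2},
}

# ... and save_on_each_node left at its default (False) on the transformers side.
training_args = TrainingArguments(
    output_dir="/data/output_share/run",  # hypothetical path visible to all nodes
    save_on_each_node=False,
    deepspeed=ds_config,
)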

@conderls
Author

@conderls, the standard (i.e., default) multi-node training with DeepSpeed assumes a shared filesystem.

The feature that you referenced earlier, the "checkpoint": {"use_node_local_storage": true} option, was created to support node-local storage.

Have you tried eliminating these options from your config in order to get the standard shared filesystem checkpointing behavior?

Yes, even with a shared filesystem the error is still raised. I will dig deeper into the logic of transformers/trainer.py.

==> save_on_each_node=false, use_node_local_storage=false <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/shutil.py", line 720, in rmtree
node1:     os.rmdir(path)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134117/tmp-checkpoint-2'
node1: [2024-04-25 05:42:54,555] [INFO] [launch.py:347:main] Process 714566 exits successfully.

==> save_on_each_node=true, use_node_local_storage=false <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2538, in _save_checkpoint
node1:     os.rename(staging_output_dir, output_dir)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134405/tmp-checkpoint-2' -> '/data/output_share/baichuan_0425134405/checkpoint-2'
node1: [2024-04-25 05:45:52,057] [INFO] [launch.py:347:main] Process 715263 exits successfully.

==> save_on_each_node=true, use_node_local_storage=true <==
node1:   File "/usr/anaconda3/envs/py3/lib/python3.8/site-packages/transformers/trainer.py", line 2538, in _save_checkpoint
node1:     os.rename(staging_output_dir, output_dir)
node1: FileNotFoundError: [Errno 2] No such file or directory: '/data/output_share/baichuan_0425134759/tmp-checkpoint-2' -> '/data/output_share/baichuan_0425134759/checkpoint-2'
node1: [2024-04-25 05:49:35,814] [INFO] [launch.py:347:main] Process 715971 exits successfully.

@tjruwase
Contributor

@conderls, as you dig into trainer logic, can you also check if using a fixed folder name, i.e., no timestamp, helps.

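For example (a guess at how the run directory is currently named, reconstructed from the paths in the logs above):

import time

# Hypothetical reconstruction of the current pattern: a timestamp evaluated
# independently on each node can differ, so nodes may disagree on the path.
timestamped_dir = "/data/output_share/baichuan_" + time.strftime("%m%d%H%M%S")

# The suggested check: a fixed name that every node and process agrees on.
fixed_dir = "/data/output_share/baichuan_run"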

@conderls
Author

conderls commented Apr 27, 2024

@conderls, as you dig into trainer logic, can you also check if using a fixed folder name, i.e., no timestamp, helps.


I just found that the checkpoint-saving strategy was updated in transformers v4.39.0:

https://github.com/huggingface/transformers/blob/v4.39.0/src/transformers/trainer.py#L2643

Before that, the saving strategy (v4.38.2) caused contention between processes, and the file operations failed:

        # Then go through the rewriting process, only renaming and rotating from main process(es)
        if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            if staging_output_dir != output_dir:
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

                    # Ensure rename completed in cases where os.rename is not atomic
                    # And can only happen on non-windows based systems
                    if os.name != "nt":
                        fd = os.open(output_dir, os.O_RDONLY)
                        os.fsync(fd)
                        os.close(fd)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                # Solely rely on numerical checkpoint id for rotation.
                # mtime is not reliable especially on some fuse fs in cloud environments.
                self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir)

It works for a shared filesystem now, but for a non-shared filesystem we still need to collect the checkpoints from the workers to the master node.
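
For what it's worth, one way to do that collection by hand (a sketch only; neither transformers nor DeepSpeed provides this, and the hostname, the LOCAL_RANK handling and the use of rsync over ssh are all assumptions) is to let local rank 0 on each worker node push its checkpoint directory to the master after every save, e.g. from a TrainerCallback.on_save hook:

import os
import subprocess

import torch.distributed as dist

MASTER_HOST = "node0"  # hypothetical hostname of the master node

def push_checkpoints_to_master(output_dir: str) -> None:
    """Run on every rank after a checkpoint is saved; copies node-local files to the master."""
    dist.barrier()  # wait until every rank has finished writing its shards
    is_local_rank_zero = int(os.environ.get("LOCAL_RANK", "0")) == 0
    on_master_node = dist.get_rank() == 0  # assumes global rank 0 lives on the master node
    if is_local_rank_zero and not on_master_node:
        # One copy per worker node; rsync over ssh is an assumption, not a built-in feature.
        subprocess.run(
            ["rsync", "-a", output_dir + "/", f"{MASTER_HOST}:{output_dir}/"],
            check=True,
        )
    dist.barrier()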

Thanks for your time.
