How to gather checkpoints to the master node during multi-node training #5452
Comments
@tjruwase thanks for the reply. I am wondering about multi-node training with DeepSpeed:
@conderls, the standard (i.e., default) multi-node training with DeepSpeed uses a shared filesystem. The feature that you referenced earlier was created to support node-local storage. Have you tried removing those options from your config to get the standard shared-filesystem checkpointing behavior?
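For reference, a minimal sketch of the two setups being discussed, assuming the node-local-storage feature is toggled via the `checkpoint` section of the DeepSpeed config together with `save_on_each_node` in `TrainingArguments`; the paths and the rest of the config are placeholders:

```python
# Sketch of the two checkpointing modes discussed above; verify the keys
# against your actual DeepSpeed/transformers versions.
from transformers import TrainingArguments

# Node-local storage: every node writes its checkpoint shard to local disk.
ds_config_node_local = {
    "checkpoint": {"use_node_local_storage": True},  # DeepSpeed node-local checkpointing
    # ... the rest of your ZeRO/optimizer config ...
}
args_node_local = TrainingArguments(
    output_dir="/local_disk/ckpts",  # placeholder: node-local path
    save_on_each_node=True,          # transformers: save on every node
    deepspeed=ds_config_node_local,
)

# Standard shared filesystem: drop both options and point output_dir at a
# mount that all nodes can see (e.g. NFS); only the main process renames
# and rotates checkpoints.
args_shared = TrainingArguments(
    output_dir="/nfs/ckpts",         # placeholder: shared mount
    deepspeed="ds_config.json",      # config without the "checkpoint" section
)
```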
yes, with a shared filesystem the error is still raised. I will dig deeper into the trainer logic.
@conderls, as you dig into the trainer logic, can you also check whether using a fixed folder name, i.e., no timestamp, helps?
I just found that the checkpoint-saving strategy was updated in transformers v4.39.0: https://github.com/huggingface/transformers/blob/v4.39.0/src/transformers/trainer.py#L2643. Before that, the saving strategy (v4.38.2) caused thread contention and failed on the file operations:

```python
# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)

            # Ensure rename completed in cases where os.rename is not atomic
            # And can only happen on non-windows based systems
            if os.name != "nt":
                fd = os.open(output_dir, os.O_RDONLY)
                os.fsync(fd)
                os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        # Solely rely on numerical checkpoint id for rotation.
        # mtime is not reliable especially on some fuse fs in cloud environments.
        self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)

elif self.is_local_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir)
```

It works for the shared filesystem now, but for a non-shared filesystem we still need to collect the checkpoints from the worker nodes to the master node. Thanks for your time.
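For the non-shared-filesystem case, neither transformers nor DeepSpeed ships a gather step, so one possible workaround is to let each node's local rank 0 push its checkpoint folder to the master node after saving. This is only a sketch, assuming passwordless ssh from the workers to `MASTER_ADDR` and `rsync` on all nodes; the function name and paths are illustrative:

```python
# Hedged sketch: gather node-local checkpoints onto the master node after
# the Trainer has finished saving on every rank. Assumes passwordless ssh
# from worker nodes to MASTER_ADDR and rsync on all nodes; the function
# name and paths are illustrative, not an existing API.
import os
import subprocess

import torch.distributed as dist


def gather_checkpoints_to_master(local_ckpt_dir: str, master_ckpt_dir: str) -> None:
    # Wait until every rank has finished writing its shard.
    dist.barrier()

    master_addr = os.environ["MASTER_ADDR"]          # set by the launcher
    is_local_main = os.environ.get("LOCAL_RANK") == "0"
    on_master_node = os.environ.get("RANK") == "0"   # assumes global rank 0 runs on the master node

    # One transfer per node: local rank 0 on each non-master node pushes
    # its checkpoint directory to the master node over ssh.
    if is_local_main and not on_master_node:
        subprocess.run(
            ["rsync", "-a", f"{local_ckpt_dir}/", f"{master_addr}:{master_ckpt_dir}/"],
            check=True,
        )

    # Keep the master from rotating/deleting checkpoints before all
    # transfers have completed.
    dist.barrier()
```

Calling something like this right after `trainer.save_model()` (or from a `TrainerCallback.on_save` hook, which runs on every process) would leave a complete copy of every node's shards under `master_ckpt_dir` on the master node.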
Describe the bug
With 2 nodes x 2 GPUs, I can fine-tune an LLM with checkpoints saved on each node, using DeepSpeed with --save_on_each_node true (transformers' TrainingArguments) and a DeepSpeed config with the corresponding options; the checkpoints are then saved on each node, as #2319 describes.
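For concreteness, a launch along these lines reproduces the setup described above; the script name, host names, and config path are placeholders:

```bash
# Illustrative 2-node x 2-GPU launch with the pdsh launcher; train.py,
# the host names, and ds_config.json are placeholders.
cat > hostfile <<EOF
node1 slots=2
node2 slots=2
EOF

deepspeed --hostfile=hostfile --launcher pdsh train.py \
    --deepspeed ds_config.json \
    --save_on_each_node true \
    --output_dir /local_disk/ckpts
```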
Questions:
How can I use the shared filesystem properly?
Expected behavior
System info (please complete the following information):
Launcher context
deepspeed launcher, pdsh