
Updated torchrun instructions #2096

Merged
21 commits merged into huggingface:main on Nov 20, 2023

Conversation

TJ-Solergibert
Contributor

What does this PR do?

Fixes #2095

I have been working with Accelerate for a couple of weeks, and one of the first resources I used to test the environment was these example scripts. Personally, I have been using torchrun and realized that the documentation was incorrect. I am now preparing a small guide for working with Accelerate on a cluster with Slurm; if it would be beneficial to the project, I could add it as well!

@muellerzr

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr
Collaborator

@TJ-Solergibert more SLURM resources are something we plan to look at very soon, so if you have some to offer, that would be phenomenal!

Collaborator

@muellerzr left a comment

Thanks for the improvement! I think I'd still like one example of using the module itself (e.g. python -m) somewhere, as we have users that want to call it in this fashion. (We have it documented that you can do accelerate launch -m, so showing it here as an example of what that means we're doing would be good, if we don't already have this somewhere else.)
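For illustration, a minimal sketch of what that could look like; the package path my_package.nlp_example is a hypothetical placeholder, not something in this repository:

# run the example as a plain script
accelerate launch ./nlp_example.py
# run the same code as a module, analogous to python -m (hypothetical package path)
accelerate launch -m my_package.nlp_example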

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Member

@BenjaminBossan left a comment

Thanks for updating the docs with regard to calling torchrun directly. There are a few small issues, the rest LGTM.

I’d still like one example of using the module itself (e.g. python -m) as we have users that want to do so/call it in this fashion.

I just checked, python -m torchrun does not actually work (anymore?), I had to call python -m torch.distributed.run. It would probably be sufficient to mention once that the two calls are equivalent.
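For reference, the equivalence in question looks like this; the script name and process count are just placeholders taken from the examples folder:

# torchrun is the console entry point for torch.distributed.run,
# so these two commands are interchangeable:
torchrun --nproc_per_node 2 ./nlp_example.py
python -m torch.distributed.run --nproc_per_node 2 ./nlp_example.py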

examples/README.md (6 outdated review comments, resolved)
TJ-Solergibert and others added 4 commits October 27, 2023 14:49
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@TJ-Solergibert
Contributor Author

I just checked, python -m torchrun does not actually work (anymore?), I had to call python -m torch.distributed.run. It would probably be sufficient to mention once that the two calls are equivalent.

python -m torch.distributed.launch is deprecated and will be removed in the future. PyTorch suggests moving to torchrun, so I think it is better to delete all references to torch.distributed.launch so as not to create confusion. I wouldn't spend too much time duplicating the documentation for both cases either: even though the behavior of the two launchers is similar, the flags are different.

@muellerzr
Collaborator

@TJ-Solergibert launch is, but not run. Let's please include calling run via a module :) (And I know it's not deprecated, because it's what we use internally under the hood and we call it as a module. We did switch from launch, however.)

Collaborator

@muellerzr left a comment

Thanks! I think we're almost ready to put this in now! However, since this will live in accelerate, let's please use accelerate launch for the example scripts :)

(And ideally let's have this under examples/slurm rather than examples/Slurm, to stay consistent with how files are named, ideally using submit_multigpu.sh and submit_multinode.sh as well. Easier to tab-complete, etc.)
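As a rough sketch of what a submit_multigpu.sh along those lines might contain; the SBATCH values, GPU count, and example script are assumptions, not the actual file in this PR:

#!/bin/bash
#SBATCH --job-name=accelerate-multigpu   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1              # one launcher task; accelerate spawns the worker processes
#SBATCH --gres=gpu:4                     # assumed 4 GPUs on the node

# accelerate launch starts one process per requested GPU on this node
srun accelerate launch --multi_gpu --num_processes 4 ./complete_nlp_example.py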

examples/Slurm/submit-multiGPU.sh (outdated review comment, resolved)
examples/Slurm/submit-multinode.sh (outdated review comment, resolved)
examples/README.md (2 outdated review comments, resolved)
TJ-Solergibert and others added 5 commits November 2, 2023 15:58
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
@TJ-Solergibert
Contributor Author

Done! Thank you for your comments, I hope this little update will be helpful!

Collaborator

@muellerzr left a comment

This looks great to me! I really appreciate you iterating with us on it; it will be a fantastic addition 🔥

A good follow-up, if you'd like, is to add it to the example zoo doc (so it's in our documentation too); however, I can do this for you :)

examples/README.md (outdated review comment, resolved)
Member

@BenjaminBossan left a comment

Apart from Zach's comments, this LGTM, thanks. I have no experience with SLURM, so I can't comment on the correctness of those instructions.

@TJ-Solergibert
Contributor Author

TJ-Solergibert commented Nov 6, 2023

@muellerzr @BenjaminBossan Before merging, I will run the SLURM scripts to be sure that they work. I will do it in the next few days: we had a breakdown in my facilities last week and the system is now very busy. I will add them to the model zoo!

Contributor

@pacman100 left a comment

Thank you @TJ-Solergibert for updating the docs wrt torchrun and slurm support 🚀! Left a suggestion.

examples/slurm/submit_multinode.sh (outdated review comment, resolved)
@muellerzr
Collaborator

@TJ-Solergibert how did your tests go? Successful/good to merge? 🤗 (Once Sourab's comments have been addressed)

@TJ-Solergibert
Contributor Author

@TJ-Solergibert how did your tests go? Successful/good to merge? 🤗 (Once Sourab's comments have been addressed)

These past weeks I have been quite busy, and I did not want to merge without first checking that the scripts work correctly. I still have to finish reviewing the multi-node training case; I hope to be able to do the final testing this weekend.

At the same time, I have identified an error when computing the evaluation with several nodes on a setup with a parallel filesystem. The error appears in a call to an evaluate function, but to correct it you also have to modify a datasets class. Where should I open the PR?

@muellerzr
Collaborator

muellerzr commented Nov 15, 2023

Can you describe the fix here? I can then upstream it internally and get back to you on where it's best to put the fix.

@TJ-Solergibert
Contributor Author

TJ-Solergibert commented Nov 15, 2023

Hi @muellerzr ! I've finally figured out what wasn't working in the multinode script:

  • I've noticed that both the Python scripts and accelerate launch can't handle multiline command arguments. That's why I changed the parser in the scripts to parse_known_args, so it skips the \ , but now I'm not passing all the tests. I don't know if that change is possible or not, but in the worst case it is as easy as removing the multiline command arguments in the slurm scripts (we lose readability). I also clarify this issue in examples/slurm/submit_multinode.sh.
    Edit: Back to parse_args.
  • I added the main_process_port to the multinode training script and readme, because if it is not specified it is passed to torch as None (see the sketch after this list). Maybe this issue is fixed in current versions of accelerate and pytorch, but on our cluster we have versions 0.20.0 and 1.13 respectively and cannot update to newer versions (we will be able to shortly with the new cluster).
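A minimal sketch of the kind of launch line meant in the second point; the node/GPU counts, the port, and the script name are placeholders, and everything stays on a single line because of the multiline-argument issue above:

MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)  # first allocated node hosts the main process

# pass the port explicitly so accelerate does not forward None to torch
srun accelerate launch --multi_gpu --num_machines $SLURM_NNODES --num_processes $((4 * SLURM_NNODES)) --machine_rank $SLURM_PROCID --main_process_ip $MASTER_ADDR --main_process_port 29500 ./complete_nlp_example.py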

Toni

@TJ-Solergibert
Contributor Author

TJ-Solergibert commented Nov 15, 2023

For the multinode evaluation problem:
First of all, credit to @panserbjorn for this comment!

In short, the datasets file locks use flock, but this way nodes on parallel filesystems can't see the locks held by other nodes. Fixing that issue is as simple as changing from flock to lockf.

class UnixFileLock(BaseFileLock):
    """
    Uses the :func:`fcntl.flock` to hard lock the lock file on unix systems.
    """

    def __init__(self, lock_file, timeout=-1, max_filename_length=None):
        max_filename_length = os.statvfs(os.path.dirname(lock_file)).f_namemax
        super().__init__(lock_file, timeout=timeout, max_filename_length=max_filename_length)

    def _acquire(self):
        open_mode = os.O_RDWR | os.O_CREAT | os.O_TRUNC
        fd = os.open(self._lock_file, open_mode)

        try:
-           fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
+           fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
        else:
            self._lock_file_fd = fd
        return None

    def _release(self):
        # Do not remove the lockfile:
        #
        #   https://github.com/benediktschmitt/py-filelock/issues/31
        #   https://stackoverflow.com/questions/17708885/flock-removing-locked-file-without-race-condition
        fd = self._lock_file_fd
        self._lock_file_fd = None
-       fcntl.flock(fd, fcntl.LOCK_UN)
+       fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
        return None

Applying this patch means that a single process can no longer detect its own lock. To address that, it is necessary to modify the _check_all_processes_locks method of the EvaluationModule class in evaluate/module.py.

def _check_all_processes_locks(self):
    expected_lock_file_names = [
        os.path.join(self.data_dir, f"{self.experiment_id}-{self.num_process}-{process_id}.arrow.lock")
        for process_id in range(self.num_process)
    ]
-   for expected_lock_file_name in expected_lock_file_names:
+   for expected_lock_file_name in expected_lock_file_names[1:]:
        nofilelock = FileFreeLock(expected_lock_file_name)
        try:
            nofilelock.acquire(timeout=self.timeout)
        except Timeout:
            raise ValueError(
                f"Expected to find locked file {expected_lock_file_name} from process {self.process_id} but it doesn't exist."
            ) from None
        else:
            nofilelock.release()

The FileLock class is defined in this line. Could we create a new class for parallel filesystems and select it via an environment variable? The same would go for the _check_all_processes_locks method.

PS: Check point 3 of @panserbjorn's comment; I didn't find any documentation related to this behavior, but he is completely right!

@muellerzr
Collaborator

I'd probably include more detail in your open issue here, as that would be something that'd have to be handled by evaluate: huggingface/evaluate#481

@muellerzr
Collaborator

@TJ-Solergibert aside from the issue is it okay to merge this guide?

@TJ-Solergibert
Contributor Author

For sure! I'm working on a project distributing training across several nodes with transformers, datasets, accelerate and slurm. When I finish, it could be an interesting addition to the example zoo!

@muellerzr merged commit 427ef8b into huggingface:main on Nov 20, 2023
24 checks passed
@wormyu

wormyu commented Dec 12, 2023

Hi,

Thank you for your excellent work on implementing Accelerate with Slurm! I encountered some problems when using the script submit_multinode.sh. My first issue is that, when testing with accelerate==0.21.0, running the script gives the following error:

  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/utils/launch.py", line 114, in prepare_multi_gpu_env
    setattr(args, "node_rank", int(args.machine_rank))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

I fixed this error by passing --machine_rank $SLURM_PROCID to accelerate launch, but other errors still exist. No matter which port I set, even one chosen randomly via $(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)), it always reports that the port is already in use by another process.

ConnectionError: Tried to launch distributed communication on port `11986`, but another process is utilizing it. 
Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in 
your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.

I am wondering whether this error may be related to pytorch/pytorch#65992 (comment). It says that --master_port and --master_addr are only relevant when --rdzv_backend=static.
Should I open a new issue for this, or would you prefer to continue the discussion here? I could open a new issue and provide the detailed error log, the NCCL INFO log, and the methods I have tried to solve the problem.

@muellerzr
Collaborator

A new issue would be good. Please ensure that you've tested with the latest accelerate before you do, though!

@TJ-Solergibert
Contributor Author

Hi @wormyu,

Exactly! It's necessary to add the --machine_rank flag as you indicate. I noticed it a few days after the merge, but I'm glad to see that someone has checked it. Feel free to update it!

Regarding what you mentioned about the port: torchrun recommends using the c10d backend. I think, as the error indicates, it has to do with the network configuration of your infrastructure. I recommend that you consult the support team of your machine; they can tell you which ports you can use. In my case, each user could use the port matching their $UID. You can also try using torchrun instead of accelerate launch; check this example.
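For reference, a sketch of that torchrun-based alternative with the c10d rendezvous backend; the GPU count, port, and script name are placeholders:

MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# with --rdzv_backend c10d, every node rendezvous at the chosen endpoint (port is an assumption)
srun torchrun --nnodes $SLURM_NNODES --nproc_per_node 4 --rdzv_backend c10d --rdzv_id $SLURM_JOB_ID --rdzv_endpoint $MASTER_ADDR:29500 ./complete_nlp_example.py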

@wormyu

wormyu commented Dec 13, 2023

Hi guys,

Thank you so much for your quick response. I just created a new issue #2246 to describe my problem in detail. I have tried both accelerate launch and torchrun, but both fail. I think what @TJ-Solergibert said might be correct, that the issue could be related to the network configuration of our cluster. Still, I want to make sure I am launching the job correctly and let people see the NCCL log, because I don't quite understand it. Thanks again for helping me!

Development

Successfully merging this pull request may close these issues.

Some notes on the examples [torchrun]
6 participants