Distributed training #179

Closed
wants to merge 161 commits into from

Collaborator

@frostedoyster frostedoyster commented May 14, 2024

This PR adds utilities and infrastructure changes to allow distributed, multi-GPU training. The SOAP-BPNN model now supports it.


📚 Documentation preview 📚: https://metatensor-models--179.org.readthedocs.build/en/179/

frostedoyster and others added 30 commits November 20, 2023 15:51
Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>
* Rename package to `metatensor-models`

* Rename module to `metatensor.models`
Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>
* Write train output to hydra's output directory

* Added evaluation function

* Add usage example for cli interface

* update train cli

* Disable MacOS tests

* Add cli skeleton for exporter

---------

Co-authored-by: frostedoyster <bigi.f@libero.it>
* Add gradient calculator
* Temporary losses
* Forces and stresses
* Support multiple model outputs in SOAP-BPNN
@frostedoyster frostedoyster marked this pull request as ready for review May 20, 2024 07:20
@frostedoyster frostedoyster requested a review from Luthaf May 20, 2024 07:20
@frostedoyster frostedoyster force-pushed the distributed branch 4 times, most recently from 83797d7 to 9ce4da6 Compare May 20, 2024 07:46
docs/src/advanced-concepts/multi-gpu.rst (outdated, resolved)
src/metatensor/models/experimental/soap_bpnn/train.py (outdated, resolved)
src/metatensor/models/experimental/soap_bpnn/train.py (outdated, resolved)
src/metatensor/models/experimental/soap_bpnn/train.py (outdated, resolved)
src/metatensor/models/utils/distributed/slurm.py (outdated, resolved)
src/metatensor/models/utils/distributed/slurm.py (outdated, resolved)
tests/resources/run_distributed.sh (outdated, resolved)
@frostedoyster frostedoyster requested a review from Luthaf May 21, 2024 15:09
def _setup_distr_env(self, port: int):
    hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    os.environ["MASTER_ADDR"] = hostnames[0]  # set first node as master
    os.environ["MASTER_PORT"] = str(port)  # set port for communication
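
For context, a minimal sketch of the full set of variables that an `env://`-style initialization of `torch.distributed` typically reads. This is not the code in this PR: the function name `distributed_env` is hypothetical, the node list is assumed to be already expanded, and the mapping from `SLURM_NTASKS`, `SLURM_PROCID`, and `SLURM_LOCALID` to `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` is an assumption about how one might wire this up.

```python
from typing import Dict, List, Mapping


def distributed_env(
    slurm_env: Mapping[str, str], hostnames: List[str], port: int
) -> Dict[str, str]:
    """Build rendezvous variables from SLURM ones (hypothetical helper)."""
    return {
        # The first node of the allocation hosts the rendezvous server
        "MASTER_ADDR": hostnames[0],
        "MASTER_PORT": str(port),
        # Assumed mapping: one SLURM task per training process
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],
        "RANK": slurm_env["SLURM_PROCID"],
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],
    }


# Illustrative values only: two tasks on a single node named "i70"
env = distributed_env(
    {"SLURM_NTASKS": "2", "SLURM_PROCID": "1", "SLURM_LOCALID": "1"},
    hostnames=["i70"],
    port=39591,
)
```

Returning a dictionary rather than writing to `os.environ` directly keeps the sketch side-effect free; a caller would apply it with `os.environ.update(env)`.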
Member

What does the error message look like (following the discussion in #194) if you set the port to one that is already taken? You can test this by starting `python -m http.server 39591` somewhere and setting the port to 39591 here as well.
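
The same collision can be reproduced without a cluster. Below is a minimal sketch using only the standard library; the helper `port_is_free` is hypothetical and not part of this PR. It occupies an OS-chosen port (standing in for the HTTP server on 39591) and then probes it.

```python
import errno
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as e:
            # errno 98 on Linux, as in "Address already in use"
            if e.errno == errno.EADDRINUSE:
                return False
            raise
    return True


# Hold a port open, the way `python -m http.server` would hold 39591
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as holder:
    holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    holder.listen()
    taken_port = holder.getsockname()[1]
    free_while_held = port_is_free(taken_port)
```

A probe like this is inherently racy, since another process can grab the port between the check and the actual bind, so it can improve an error message but cannot replace handling the bind failure itself.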

Collaborator Author

Here is the error:

[2024-05-22 14:14:38][INFO] - This log is also available in 'outputs/2024-05-22/14-14-38/train.log'.
[2024-05-22 14:14:39][INFO] - random seed of this run is 42
[2024-05-22 14:14:39][INFO] - Setting up training set
[2024-05-22 14:14:39][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-05-22 14:14:39][WARNING] - No Stress found in section 'energy'. Continue without stress!
[2024-05-22 14:14:39][INFO] - Setting up test set
[2024-05-22 14:14:39][INFO] - Setting up validation set
[2024-05-22 14:14:39][INFO] - Setting up model
[2024-05-22 14:14:39][INFO] - Calling architecture trainer
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
ERROR: The error below most likely originates from an architecture. If you think this is a bug, please contact its maintainer (see the architecture's documentation).

The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
srun: error: i70: task 0: Exited with exit code 1
srun: Terminating StepId=2083332.0
slurmstepd: error: *** STEP 2083332.0 ON i70 CANCELLED AT 2024-05-22T14:14:39 ***
10.91.27.70 - - [22/May/2024 14:14:39] code 400, message Bad HTTP/0.9 request type ('\x00Î÷')
10.91.27.70 - - [22/May/2024 14:14:39] "�Î÷�<���������init/��������" 400 -
srun: error: i70: task 1: Terminated
srun: Force Terminated StepId=2083332.0

Collaborator Author

The final part (the two lines starting with 10.91.27.70) comes from the HTTP server that was holding the port, not from the training code.

10 participants