Distributed training #179
Conversation
Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>
* Write train output to hydra's output directory
* Added evaluation function
* Add usage example for cli interface
* update train cli
* Disable MacOS tests
* Add cli skeleton for exporter

---------

Co-authored-by: frostedoyster <bigi.f@libero.it>

* Add gradient calculator
* Temporary losses
* Forces and stresses
* Support multiple model outputs in SOAP-BPNN
Force-pushed from 83797d7 to 9ce4da6
Force-pushed from 9ce4da6 to 5c2dfd5
src/metatensor/models/utils/distributed/distributed_data_parallel.py
def _setup_distr_env(self, port: int):
    hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
    os.environ["MASTER_ADDR"] = hostnames[0]  # set first node as master
    os.environ["MASTER_PORT"] = str(port)  # set port for communication
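For context, a minimal standalone sketch of what this method does (hypothetical helper, not the PR's actual code): it designates the first node of the SLURM allocation as the rendezvous master for torch.distributed. The real code uses the hostlist package to expand compressed node lists such as "i[70-71]"; here we assume a plain comma-separated list to keep the sketch self-contained.

```python
import os

def setup_distr_env(port: int) -> None:
    # Sketch: the real implementation expands compressed SLURM node lists
    # with hostlist.expand_hostlist(); we assume a comma-separated list.
    hostnames = os.environ["SLURM_JOB_NODELIST"].split(",")
    os.environ["MASTER_ADDR"] = hostnames[0]  # first node acts as master
    os.environ["MASTER_PORT"] = str(port)     # all ranks must agree on this

# Example: simulate a two-node SLURM allocation
os.environ["SLURM_JOB_NODELIST"] = "i70,i71"
setup_distr_env(39591)
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])  # i70 39591
```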
What does the error message (following the discussion in #194) look like if you set the port to something already taken? You can test this by starting python -m http.server 39591 somewhere and setting the port to 39591 here as well.
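The "address already in use" condition the reviewer suggests testing can also be reproduced with plain sockets, without starting an HTTP server. This is a sketch (not part of the PR): holding a port open and binding it a second time fails with errno 98 (EADDRINUSE), the same errno torch.distributed reports below.

```python
import errno
import socket

# Grab an ephemeral port and keep it bound.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
port = holder.getsockname()[1]

# A second bind on the same port fails with EADDRINUSE (errno 98 on Linux).
clash = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    clash.bind(("127.0.0.1", port))
    in_use = False
except OSError as exc:
    in_use = exc.errno == errno.EADDRINUSE
finally:
    clash.close()
    holder.close()

print(in_use)  # True
```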
Here is the error:
[2024-05-22 14:14:38][INFO] - This log is also available in 'outputs/2024-05-22/14-14-38/train.log'.
[2024-05-22 14:14:39][INFO] - random seed of this run is 42
[2024-05-22 14:14:39][INFO] - Setting up training set
[2024-05-22 14:14:39][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-05-22 14:14:39][WARNING] - No Stress found in section 'energy'. Continue without stress!
[2024-05-22 14:14:39][INFO] - Setting up test set
[2024-05-22 14:14:39][INFO] - Setting up validation set
[2024-05-22 14:14:39][INFO] - Setting up model
[2024-05-22 14:14:39][INFO] - Calling architecture trainer
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
ERROR: The error below most likely originates from an architecture. If you think this is a bug, please contact its maintainer (see the architecture's documentation).
The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
srun: error: i70: task 0: Exited with exit code 1
srun: Terminating StepId=2083332.0
slurmstepd: error: *** STEP 2083332.0 ON i70 CANCELLED AT 2024-05-22T14:14:39 ***
10.91.27.70 - - [22/May/2024 14:14:39] code 400, message Bad HTTP/0.9 request type ('\x00Î÷')
10.91.27.70 - - [22/May/2024 14:14:39] "�Î÷�<���������init/��������" 400 -
srun: error: i70: task 1: Terminated
srun: Force Terminated StepId=2083332.0
The final part of the output (the "Bad HTTP/0.9 request" lines) comes from the http.server process occupying the port.
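One common way to avoid such port collisions (a sketch, not something this PR implements) is to let the OS pick a currently free port by binding to port 0. Note the inherent race: the port could be taken again between closing this probe socket and torch.distributed binding it.

```python
import socket

def find_free_port() -> int:
    # Hypothetical helper: ask the OS for an unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 means "any free port"
        return s.getsockname()[1]

port = find_free_port()
print(port)  # an ephemeral port number
```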
Co-authored-by: Filippo Bigi <98903385+frostedoyster@users.noreply.github.com>
--------- Co-authored-by: Philip Loche <philip.loche@posteo.de>
Force-pushed from 6854333 to 68846e0
This PR adds utilities and infrastructure changes to allow distributed, multi-GPU training. The SOAP-BPNN model now supports it.
📚 Documentation preview 📚: https://metatensor-models--179.org.readthedocs.build/en/179/
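Besides the master address and port set above, each torch.distributed rank needs its global rank and the world size. Under SLURM these can be derived from per-task environment variables; the following is a minimal sketch under that assumption (the helper name and exact mapping are illustrative, not the PR's code).

```python
import os

def slurm_rank_info() -> dict:
    # Sketch: map SLURM's per-task variables onto the names
    # torch.distributed conventionally uses. SLURM_PROCID is the global
    # rank, SLURM_NTASKS the world size, SLURM_LOCALID the rank within
    # a node (often used to select the local GPU).
    return {
        "RANK": int(os.environ["SLURM_PROCID"]),
        "WORLD_SIZE": int(os.environ["SLURM_NTASKS"]),
        "LOCAL_RANK": int(os.environ["SLURM_LOCALID"]),
    }

# Simulate task 1 of a 4-task job:
os.environ.update(SLURM_PROCID="1", SLURM_NTASKS="4", SLURM_LOCALID="1")
print(slurm_rank_info())  # {'RANK': 1, 'WORLD_SIZE': 4, 'LOCAL_RANK': 1}
```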