Distributed training #179

frostedoyster · 2024-05-14T10:55:27Z

This PR adds utilities and infrastructure changes to allow distributed, multi-GPU training. The SOAP-BPNN model now supports it.

📚 Documentation preview 📚: https://metatensor-models--179.org.readthedocs.build/en/179/


docs/src/advanced-concepts/multi-gpu.rst

src/metatensor/models/experimental/soap_bpnn/train.py

src/metatensor/models/utils/distributed/distributed_data_parallel.py

src/metatensor/models/utils/distributed/slurm.py

tests/resources/run_distributed.sh


Luthaf · 2024-05-22T10:26:43Z

src/metatensor/models/utils/distributed/slurm.py

+    def _setup_distr_env(self, port: int):
+        hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
+        os.environ["MASTER_ADDR"] = hostnames[0]  # set first node as master
+        os.environ["MASTER_PORT"] = str(port)  # set port for communication


What does the error message look like (following the discussion in #194) if you set the port to one that is already taken? You can test this by starting `python -m http.server 39591` somewhere and setting the port to 39591 here as well.

Here is the error:

[2024-05-22 14:14:38][INFO] - This log is also available in 'outputs/2024-05-22/14-14-38/train.log'.
[2024-05-22 14:14:39][INFO] - random seed of this run is 42
[2024-05-22 14:14:39][INFO] - Setting up training set
[2024-05-22 14:14:39][INFO] - Forces found in section 'energy'. Forces are taken for training!
[2024-05-22 14:14:39][WARNING] - No Stress found in section 'energy'. Continue without stress!
[2024-05-22 14:14:39][INFO] - Setting up test set
[2024-05-22 14:14:39][INFO] - Setting up validation set
[2024-05-22 14:14:39][INFO] - Setting up model
[2024-05-22 14:14:39][INFO] - Calling architecture trainer
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
ERROR: The error below most likely originates from an architecture. If you think this is a bug, please contact its maintainer (see the architecture's documentation).
The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:39591 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:39591 (errno: 98 - Address already in use).
srun: error: i70: task 0: Exited with exit code 1
srun: Terminating StepId=2083332.0
slurmstepd: error: *** STEP 2083332.0 ON i70 CANCELLED AT 2024-05-22T14:14:39 ***
10.91.27.70 - - [22/May/2024 14:14:39] code 400, message Bad HTTP/0.9 request type ('\x00Î÷')
10.91.27.70 - - [22/May/2024 14:14:39] "�Î÷�<��init/��" 400 -
srun: error: i70: task 1: Terminated
srun: Force Terminated StepId=2083332.0

The final part of the output (the lines starting with 10.91.27.70) comes from the HTTP server, not from the training run.
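The check discussed above (is the chosen port already taken?) can also be done before the training process starts. The following is only a minimal sketch using the Python standard library. The helper name `port_is_free` is hypothetical and is not part of this PR or of `metatensor-models`; it simply tries to bind a TCP socket and reports whether the operating system answers with "Address already in use" (errno 98 on Linux, as in the log).

```python
import errno
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration only. An empty host means
    "all IPv4 interfaces", which mirrors the 0.0.0.0 bind in the log.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Same condition as "Address already in use" in the log above
                return False
            # Any other failure (e.g. permission denied) is not hidden
            raise
    return True
```

A caller could run `port_is_free(39591)` on the first node before setting `MASTER_PORT` and raise a clearer error if it returns `False`. Such a check is inherently racy, since another process may take the port between the check and the actual bind, so it can improve the error message but cannot replace handling the failure itself.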


frostedoyster and others added 30 commits November 20, 2023 15:51

Initial commit

3177680

Update README.md

3e73d1f

Make the package installable

9d83073

Set up CI, linters and tests (#1)

a0222c9

Add scaffold of all types of model (#2)

8937336

Remove setup.py-dependent build test

77c6378

Debug CI

81b8d07

Allow ls for debugging

23e92c3

Retrieve sdist to build wheels from it

6ec7658

Debug again?

d931eeb

This works

b132499

Little cleanup for new purpose of the repo

f81dc7d

Replace matplotlib dependency

9d8e3cf

Merge pull request #3 from lab-cosmo/skeleton

fbdffd5

Add basic cli interface to scripts (#5)

aa3c55e

Fix ci job syntax error (#9)

9a939d3

Add xyz structure and target reader (#8)

b9f0502

Temporarily disable failing Windows tests (#12)

5b18376

Implement a temporary dataset class (#10)

bc8437f

Add documentation structure (#13)

56224a2

Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>

Add SOAP-BPNN (#7)

9c632bb

Rename to metatensor-models (#15)

99ca38b

* Rename package to `metatensor-models` * Rename module to `metatensor.models`

Extract species list from dataset (#17)

35a5996

Add hydra parsing (#18)

09c58cd

Implement saving and loading of models (#19)

44e4f11

Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>

Update cli interface (#21)

dc2299b

* Write train output to hydra's output directory * Added evaluation function * Add usage example for cli interface * update train cli * Disable MacOS tests * Add cli skeleton for exporter --------- Co-authored-by: frostedoyster <bigi.f@libero.it>

Change CLI API (#24)

1a98445

Make train function self-contained (#25)

3882faa

Integrate with metatensor.torch.atomistic (#28)

4d7f847

Add gradient calculators (#26)

51df872

* Add gradient calculator * Temporary losses * Forces and stresses * Support multiple model outputs in SOAP-BPNN

frostedoyster added 2 commits May 19, 2024 11:06

Multi-process handling of logging and final evaluation

abdecc3

Fix logging of metrics

b5905c5

frostedoyster marked this pull request as ready for review May 20, 2024 07:20

frostedoyster requested a review from Luthaf May 20, 2024 07:20

frostedoyster force-pushed the distributed branch 4 times, most recently from 83797d7 to 9ce4da6 on May 20, 2024 07:46

Add test and documentation for distributed training

5c2dfd5

frostedoyster force-pushed the distributed branch from 9ce4da6 to 5c2dfd5 on May 20, 2024 11:32

frostedoyster commented May 21, 2024

docs/src/advanced-concepts/multi-gpu.rst (outdated)

Luthaf reviewed May 21, 2024


frostedoyster and others added 4 commits May 21, 2024 16:26

Update docs/src/advanced-concepts/multi-gpu.rst

45c3b71

Suggestions from code review

1bd4566

Merge branch 'distributed' of https://github.com/lab-cosmo/metatensor-models into distributed

239027e

Allow user to set port

d6dc7d5

frostedoyster requested a review from Luthaf May 21, 2024 15:09

Luthaf reviewed May 22, 2024


frostedoyster and others added 6 commits May 22, 2024 14:37

Raise error if multi-GPU and distributed are both requested

8327185

Split export into pure export and save function (#190)

ade59b4

minor docs tweaks (#197)

9599d31

Co-authored-by: Filippo Bigi <98903385+frostedoyster@users.noreply.github.com>

Merge branch 'main' into distributed

b3ce672

Evaluate per-atom properties (#191)

f29c2e7

--------- Co-authored-by: Philip Loche <philip.loche@posteo.de>

Merge branch 'main' into distributed

68846e0

frostedoyster force-pushed the distributed branch from 6854333 to 68846e0 on May 23, 2024 08:12

frostedoyster added 2 commits May 23, 2024 17:03

Allow users to set weights for different loss terms from outside (#193)

25113b9

Merge branch 'main' into distributed

05f6897

PicoCentauri closed this May 29, 2024

PicoCentauri force-pushed the main branch from e5b5c0a to 19451a6 on May 29, 2024 14:41

frostedoyster mentioned this pull request Jun 9, 2024

Distributed training #239

Merged
