feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow#4244
Conversation
Pull Request Overview
This pull request adds support for distributed training by introducing environment variables for both PyTorch and TensorFlow, and includes a minor import update to facilitate JSON serialization.
- Adds environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) for distributed training.
- Introduces an import for dump_json_str to support JSON serialization in environment configuration.
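The two bullets above can be illustrated with a minimal sketch of how these variables fit together. The helper name and topology arguments below are assumptions for illustration; the PR's actual logic lives in the `get_image_conf` path of `src/ai/backend/manager/registry.py`:

```python
import json

# Hardcoded in the PR; review suggested promoting it to a named constant.
MASTER_PORT = "12345"

def distributed_environ(cluster_hosts: list[str], rank: int, local_rank: int) -> dict[str, str]:
    # Hypothetical helper: cluster_hosts/rank/local_rank stand in for the
    # predefined container and cluster topology mentioned in the PR.
    workers = [f"{host}:{MASTER_PORT}" for host in cluster_hosts]
    return {
        # Rendezvous variables read by torch.distributed
        "WORLD_SIZE": str(len(cluster_hosts)),
        "WORLD_RANK": str(rank),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": cluster_hosts[0],
        "MASTER_PORT": MASTER_PORT,
        # TensorFlow's MultiWorkerMirroredStrategy reads TF_CONFIG as JSON,
        # hence the JSON-serialization import mentioned above.
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": workers},
            "task": {"type": "worker", "index": rank},
        }),
    }
```

Since the topology is fixed at session creation, every value here is deterministic, which is the PR's rationale for auto-configuring them.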
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ai/backend/manager/registry.py | Adds new environment variable configurations and JSON serialization. |
| changes/.feature.md | Documents the feature addition for distributed training. |
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1902
- [nitpick] Consider renaming the variable in the list comprehension (e.g., to 'worker') to avoid shadowing the outer 'kernel' parameter for better clarity.
f"{kernel.cluster_hostname}:12345"
src/ai/backend/manager/registry.py:1897
- [nitpick] Consider replacing the hardcoded port '12345' with a named constant or configurable value to improve maintainability.
"MASTER_PORT": "12345",
.feature.md -> 4244.feature.md
Co-authored-by: octodog <mu001@lablup.com>
.. list-table::
   :header-rows: 1

   * - Environment Variable
How about adding support for more variables to cover various distributed frameworks?
For PyTorch Distributed:
PYTORCH_CUDA_ALLOC_CONF - For memory allocation strategies
NCCL_DEBUG - For debugging NCCL communications
TORCH_DISTRIBUTED_DEBUG - For more verbose debugging information
For DeepSpeed:
DEEPSPEED_ZERO_STAGE - To control ZeRO optimization stages
DEEPSPEED_ALLGATHER_SIZE - For tuning communication efficiency
DEEPSPEED_CPU_OFFLOAD - To enable CPU offloading
For Horovod:
HOROVOD_FUSION_THRESHOLD - For operation fusion tuning
HOROVOD_CYCLE_TIME - For controlling cycle time
HOROVOD_CACHE_CAPACITY - For tensor fusion cache
General:
OMP_NUM_THREADS - For controlling OpenMP parallelism
Thank you for the feedback.
The primary objective of this pull request is to automatically configure environment variables necessary for distributed training with major frameworks (e.g., PyTorch, TensorFlow), particularly those related to inter-worker communication. These variables are deterministic, as the container and cluster topology is predefined.
In contrast, other environment variables—such as PYTORCH_CUDA_ALLOC_CONF—may vary depending on user preferences or optimization strategies. Therefore, I believe it is better not to auto-configure these, in order to avoid unintended side effects.
| "environ": { | ||
| **_pytorch_distributed_environ, | ||
| **_tf_distributed_environ, |
Do you always apply the environment variables for both PyTorch and TF?
Applying them on a per-image basis would be better. Thanks!
Pull Request Overview
This PR adds support for distributed training by introducing environment variables specific to PyTorch and TensorFlow configurations. Key changes include:
- Adding two new Pydantic models, PyTorchDistributedEnviron and TensorFlowDistributedEnviron, for environment variable validation and serialization.
- Integrating distributed training configuration into the image configuration logic in the registry.
- Expanding tests to cover the new distributed environment models and updating documentation accordingly.
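The shape of the two models described above can be sketched as follows. For self-containment this sketch uses stdlib dataclasses; the actual PR defines them as Pydantic `BaseModel` subclasses in `src/ai/backend/common/types.py` with validation and an overridden `model_dump`, and the field names here are inferred from the environment variables rather than copied from the real code:

```python
from dataclasses import asdict, dataclass

@dataclass
class PyTorchDistributedEnviron:
    # Fields mirror the torch.distributed rendezvous variables.
    world_size: int
    world_rank: int
    local_rank: int
    master_addr: str
    master_port: str

    def model_dump(self) -> dict[str, str]:
        # Serialize to uppercase env-var names with string values,
        # as expected inside the container environment.
        return {k.upper(): str(v) for k, v in asdict(self).items()}

@dataclass
class TensorFlowDistributedEnviron:
    tf_config: str  # JSON-encoded cluster/task spec

    def model_dump(self) -> dict[str, str]:
        return {"TF_CONFIG": self.tf_config}
```

The payoff of modeling these as typed objects rather than ad-hoc dicts is exactly what the tests in `tests/common/test_types.py` exercise: validation at construction time and a single serialization point.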
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/common/test_types.py | Added tests for new distributed training environment types |
| src/ai/backend/manager/registry.py | Integrated new environment models into the image config |
| src/ai/backend/common/types.py | Introduced PyTorchDistributedEnviron and TensorFlowDistributedEnviron |
| changes/4244.feature.md | Added feature summary for distributed training support |
Files not reviewed (1)
- docs/concepts/networking.rst: Language not supported
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1856
- [nitpick] Consider defining a constant (e.g. DEFAULT_MASTER_PORT) for the hardcoded port value "12345" to improve maintainability and avoid potential issues if the default ever needs to change.
master_port="12345",
src/ai/backend/common/types.py:1655
- Ensure that the 'override' decorator is imported or defined in the module to avoid a runtime error when overriding the model_dump method.
@override
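The missing-import concern above can be addressed as sketched below; the class and method names are illustrative stand-ins, not the actual `types.py` code. `typing.override` ships with Python 3.12, and on older interpreters a bare `@override` raises `NameError` as soon as the class body is evaluated:

```python
try:
    from typing import override  # available from Python 3.12
except ImportError:
    try:
        from typing_extensions import override  # common backport
    except ImportError:
        def override(func):
            # No-op fallback so the decorator still resolves at runtime;
            # static override checking is lost, but no NameError occurs.
            return func

class BaseEnviron:
    def model_dump(self) -> dict:
        return {}

class PyTorchEnviron(BaseEnviron):
    @override
    def model_dump(self) -> dict:
        # Type checkers verify this really overrides a base-class method.
        return {"WORLD_SIZE": "1"}
```

The decorator is purely a static-analysis aid, which is why the failure mode flagged by the review is an import error rather than wrong behavior.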
HyeockJinKim left a comment
Setting the environment variables here will make the registry too large again. Rather than that, please set up a structure in which the registry simply loads them.
Force-pushed from 1a10632 to 2d8c9ea
Force-pushed from 9552aac to 4af738e
…ner startup

Instead of computing PyTorch/TensorFlow environment variables in the manager registry (as proposed in PR #4244), derive them from existing BACKENDAI_* cluster variables at container startup via a sourced shell script. This keeps the manager clean, avoids fragile image-name matching, and lets any image benefit from the setup automatically.

Resolves #4243

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by #10726, which derives the distributed training env vars from existing BACKENDAI_* cluster variables at container startup.
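The superseding change performs this derivation with a sourced shell script inside the container; the same computation can be sketched in Python. The `BACKENDAI_CLUSTER_HOSTS` and `BACKENDAI_CLUSTER_IDX` names below are assumptions for illustration and should be checked against the variables the agent actually injects:

```python
import json

def derive_distributed_environ(environ: dict[str, str], port: str = "12345") -> dict[str, str]:
    # Read the cluster topology the agent already exposes (assumed names),
    # rather than recomputing it in the manager registry.
    hosts = environ["BACKENDAI_CLUSTER_HOSTS"].split(",")
    rank = int(environ["BACKENDAI_CLUSTER_IDX"])
    return {
        "WORLD_SIZE": str(len(hosts)),
        "WORLD_RANK": str(rank),
        "MASTER_ADDR": hosts[0],
        "MASTER_PORT": port,
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": [f"{h}:{port}" for h in hosts]},
            "task": {"type": "worker", "index": rank},
        }),
    }

# At startup this would be fed dict(os.environ) and its result exported,
# so any image picks up the variables without manager-side image matching.
```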
resolves #4243 (BA-1214)
This pull request introduces support for distributed training by adding environment variables for PyTorch and TensorFlow configurations. It also includes a minor import addition to support JSON serialization.
Distributed training support:
- Added new environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) to facilitate distributed training for PyTorch and TensorFlow. These variables are configured based on the cluster and kernel details in the get_image_conf function. (src/ai/backend/manager/registry.py R1892-R1912)

Checklist: (if applicable)
- docs directory

📚 Documentation preview 📚: https://sorna--4244.org.readthedocs.build/en/4244/
📚 Documentation preview 📚: https://sorna-ko--4244.org.readthedocs.build/ko/4244/