feat(BA-1214): Add initial configs for distributed training in PyTorch and TensorFlow#4244
Conversation
Pull Request Overview
This pull request adds support for distributed training by introducing environment variables for both PyTorch and TensorFlow, and includes a minor import update to facilitate JSON serialization.
- Adds environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) for distributed training.
- Introduces an import for dump_json_str to support JSON serialization in environment configuration.
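The two bullets above can be illustrated with a minimal sketch of how these variables fit together. The helper name and topology arguments below are assumptions for illustration; the PR's actual logic lives in the `get_image_conf` path of `src/ai/backend/manager/registry.py`:

```python
import json

# Hardcoded in the PR; review suggested promoting it to a named constant.
MASTER_PORT = "12345"

def distributed_environ(cluster_hosts: list[str], rank: int, local_rank: int) -> dict[str, str]:
    # Hypothetical helper: cluster_hosts/rank/local_rank stand in for the
    # predefined container and cluster topology mentioned in the PR.
    workers = [f"{host}:{MASTER_PORT}" for host in cluster_hosts]
    return {
        # Rendezvous variables read by torch.distributed
        "WORLD_SIZE": str(len(cluster_hosts)),
        "WORLD_RANK": str(rank),
        "LOCAL_RANK": str(local_rank),
        "MASTER_ADDR": cluster_hosts[0],
        "MASTER_PORT": MASTER_PORT,
        # TensorFlow's MultiWorkerMirroredStrategy reads TF_CONFIG as JSON,
        # hence the JSON-serialization import mentioned above.
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": workers},
            "task": {"type": "worker", "index": rank},
        }),
    }
```

Since the topology is fixed at session creation, every value here is deterministic, which is the PR's rationale for auto-configuring them.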
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ai/backend/manager/registry.py | Adds new environment variable configurations and JSON serialization. |
| changes/.feature.md | Documents the feature addition for distributed training. |
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1902
- [nitpick] Consider renaming the variable in the list comprehension (e.g., to 'worker') to avoid shadowing the outer 'kernel' parameter for better clarity.
f"{kernel.cluster_hostname}:12345"
src/ai/backend/manager/registry.py:1897
- [nitpick] Consider replacing the hardcoded port '12345' with a named constant or configurable value to improve maintainability.
"MASTER_PORT": "12345",
.feature.md -> 4244.feature.md
Co-authored-by: octodog <mu001@lablup.com>
.. list-table::
   :header-rows: 1

   * - Environment Variable
How about adding support for more variables to cover various distributed frameworks?
For PyTorch Distributed:
PYTORCH_CUDA_ALLOC_CONF - For memory allocation strategies
NCCL_DEBUG - For debugging NCCL communications
TORCH_DISTRIBUTED_DEBUG - For more verbose debugging information
For DeepSpeed:
DEEPSPEED_ZERO_STAGE - To control ZeRO optimization stages
DEEPSPEED_ALLGATHER_SIZE - For tuning communication efficiency
DEEPSPEED_CPU_OFFLOAD - To enable CPU offloading
For Horovod:
HOROVOD_FUSION_THRESHOLD - For operation fusion tuning
HOROVOD_CYCLE_TIME - For controlling cycle time
HOROVOD_CACHE_CAPACITY - For tensor fusion cache
General:
OMP_NUM_THREADS - For controlling OpenMP parallelism
Thank you for the feedback.
The primary objective of this pull request is to automatically configure environment variables necessary for distributed training with major frameworks (e.g., PyTorch, TensorFlow), particularly those related to inter-worker communication. These variables are deterministic, as the container and cluster topology is predefined.
In contrast, other environment variables—such as PYTORCH_CUDA_ALLOC_CONF—may vary depending on user preferences or optimization strategies. Therefore, I believe it is better not to auto-configure these, in order to avoid unintended side effects.
| "environ": { | ||
| **_pytorch_distributed_environ, | ||
| **_tf_distributed_environ, |
Do you always apply the environment variables for both PyTorch and TF?
Applying them on a per-image basis would be better. Thanks!
Pull Request Overview
This PR adds support for distributed training by introducing environment variables specific to PyTorch and TensorFlow configurations. Key changes include:
- Adding two new Pydantic models, PyTorchDistributedEnviron and TensorFlowDistributedEnviron, for environment variable validation and serialization.
- Integrating distributed training configuration into the image configuration logic in the registry.
- Expanding tests to cover the new distributed environment models and updating documentation accordingly.
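The shape of the two models described above can be sketched as follows. For self-containment this sketch uses stdlib dataclasses; the actual PR defines them as Pydantic `BaseModel` subclasses in `src/ai/backend/common/types.py` with validation and an overridden `model_dump`, and the field names here are inferred from the environment variables rather than copied from the real code:

```python
from dataclasses import asdict, dataclass

@dataclass
class PyTorchDistributedEnviron:
    # Fields mirror the torch.distributed rendezvous variables.
    world_size: int
    world_rank: int
    local_rank: int
    master_addr: str
    master_port: str

    def model_dump(self) -> dict[str, str]:
        # Serialize to uppercase env-var names with string values,
        # as expected inside the container environment.
        return {k.upper(): str(v) for k, v in asdict(self).items()}

@dataclass
class TensorFlowDistributedEnviron:
    tf_config: str  # JSON-encoded cluster/task spec

    def model_dump(self) -> dict[str, str]:
        return {"TF_CONFIG": self.tf_config}
```

The payoff of modeling these as typed objects rather than ad-hoc dicts is exactly what the tests in `tests/common/test_types.py` exercise: validation at construction time and a single serialization point.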
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/common/test_types.py | Added tests for new distributed training environment types |
| src/ai/backend/manager/registry.py | Integrated new environment models into the image config |
| src/ai/backend/common/types.py | Introduced PyTorchDistributedEnviron and TensorFlowDistributedEnviron |
| changes/4244.feature.md | Added feature summary for distributed training support |
Files not reviewed (1)
- docs/concepts/networking.rst: Language not supported
Comments suppressed due to low confidence (2)
src/ai/backend/manager/registry.py:1856
- [nitpick] Consider defining a constant (e.g. DEFAULT_MASTER_PORT) for the hardcoded port value "12345" to improve maintainability and avoid potential issues if the default ever needs to change.
master_port="12345",
src/ai/backend/common/types.py:1655
- Ensure that the 'override' decorator is imported or defined in the module to avoid a runtime error when overriding the model_dump method.
@override
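The missing-import concern above can be addressed as sketched below; the class and method names are illustrative stand-ins, not the actual `types.py` code. `typing.override` ships with Python 3.12, and on older interpreters a bare `@override` raises `NameError` as soon as the class body is evaluated:

```python
try:
    from typing import override  # available from Python 3.12
except ImportError:
    try:
        from typing_extensions import override  # common backport
    except ImportError:
        def override(func):
            # No-op fallback so the decorator still resolves at runtime;
            # static override checking is lost, but no NameError occurs.
            return func

class BaseEnviron:
    def model_dump(self) -> dict:
        return {}

class PyTorchEnviron(BaseEnviron):
    @override
    def model_dump(self) -> dict:
        # Type checkers verify this really overrides a base-class method.
        return {"WORLD_SIZE": "1"}
```

The decorator is purely a static-analysis aid, which is why the failure mode flagged by the review is an import error rather than wrong behavior.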
HyeockJinKim left a comment
Setting the environment variables here will make the registry too large again. Rather than that, please set up a structure in which the registry simply loads them.
Force-pushed from 1a10632 to 2d8c9ea
Force-pushed from 9552aac to 4af738e
…ner startup

Instead of computing PyTorch/TensorFlow environment variables in the manager registry (as proposed in PR #4244), derive them from existing BACKENDAI_* cluster variables at container startup via a sourced shell script. This keeps the manager clean, avoids fragile image-name matching, and lets any image benefit from the setup automatically.

Resolves #4243

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Superseded by #10726, which derives the distributed training env vars from existing BACKENDAI_* cluster variables at container startup.
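The superseding change performs this derivation with a sourced shell script inside the container; the same computation can be sketched in Python. The `BACKENDAI_CLUSTER_HOSTS` and `BACKENDAI_CLUSTER_IDX` names below are assumptions for illustration and should be checked against the variables the agent actually injects:

```python
import json

def derive_distributed_environ(environ: dict[str, str], port: str = "12345") -> dict[str, str]:
    # Read the cluster topology the agent already exposes (assumed names),
    # rather than recomputing it in the manager registry.
    hosts = environ["BACKENDAI_CLUSTER_HOSTS"].split(",")
    rank = int(environ["BACKENDAI_CLUSTER_IDX"])
    return {
        "WORLD_SIZE": str(len(hosts)),
        "WORLD_RANK": str(rank),
        "MASTER_ADDR": hosts[0],
        "MASTER_PORT": port,
        "TF_CONFIG": json.dumps({
            "cluster": {"worker": [f"{h}:{port}" for h in hosts]},
            "task": {"type": "worker", "index": rank},
        }),
    }

# At startup this would be fed dict(os.environ) and its result exported,
# so any image picks up the variables without manager-side image matching.
```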
resolves #4243 (BA-1214)
This pull request introduces support for distributed training by adding environment variables for PyTorch and TensorFlow configurations. It also includes a minor import addition to support JSON serialization.
Distributed training support:
- Added new environment variables (WORLD_SIZE, WORLD_RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT, TF_CONFIG) to facilitate distributed training for PyTorch and TensorFlow. These variables are configured based on the cluster and kernel details in the get_image_conf function. (src/ai/backend/manager/registry.py R1892-R1912)

Checklist: (if applicable)
- docs directory

📚 Documentation preview 📚: https://sorna--4244.org.readthedocs.build/en/4244/
📚 Documentation preview 📚: https://sorna-ko--4244.org.readthedocs.build/ko/4244/