@HosseinKaviani-H

Auto-detect NCCL network configuration

Automatically detects and configures NCCL network settings based on actual cluster hardware instead of requiring manual configuration.

Changes:

  • Detects InfiniBand interfaces and enables them automatically
  • Falls back to Ethernet if no InfiniBand available
  • Respects user-set NCCL_SOCKET_IFNAME and NCCL_IB_DISABLE if already configured
  • Eliminates need for manual NCCL environment variable setup

Example:

  • Cluster with InfiniBand → Sets NCCL_SOCKET_IFNAME=ibp230s0,ibp212s0,... and NCCL_IB_DISABLE=0
  • Cluster without InfiniBand → Sets NCCL_SOCKET_IFNAME=^lo and NCCL_IB_DISABLE=1
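The two cases above can be sketched as a small pure function. This is a minimal illustration, not the code from this PR: the helper name `choose_nccl_settings` and the `ib` prefix test are assumptions, and only the resulting values are taken from the examples above.

```python
from typing import Dict, Iterable


def choose_nccl_settings(interfaces: Iterable[str]) -> Dict[str, str]:
    """Map network interface names to NCCL settings.

    Hypothetical helper: the name and the "ib" prefix test are assumptions
    made for illustration, not taken from the PR's code.
    """
    # Treat names such as ibp230s0 or ib0 as InfiniBand; keep input order.
    ib_ifaces = [name for name in interfaces if name.startswith("ib")]
    if ib_ifaces:
        return {
            "NCCL_SOCKET_IFNAME": ",".join(ib_ifaces),
            "NCCL_IB_DISABLE": "0",
        }
    # No InfiniBand: exclude interfaces whose names start with "lo"
    # (loopback) and disable the IB transport.
    return {"NCCL_SOCKET_IFNAME": "^lo", "NCCL_IB_DISABLE": "1"}


with_ib = choose_nccl_settings(["lo", "eth0", "ibp230s0", "ibp212s0"])
without_ib = choose_nccl_settings(["lo", "eth0"])
print(with_ib)
print(without_ib)
```

Keeping the decision separate from interface discovery makes it easy to unit-test without InfiniBand hardware.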

- Add get_nccl_env_vars() function to automatically detect network configuration
- Detects InfiniBand interfaces (ibp*, ib*) and enables them automatically
- Falls back to Ethernet (^lo) if no InfiniBand is available
- Respects user-set NCCL environment variables
- Sets NCCL_SOCKET_IFNAME and NCCL_IB_DISABLE based on actual cluster hardware
- Eliminates need for manual NCCL configuration
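One way to read "respects user-set NCCL environment variables" is that detected values only fill in variables the user has not already exported. The sketch below assumes that behavior and assumes interfaces are discovered through Linux sysfs; `list_interfaces` and `keep_user_settings` are hypothetical helpers, not the PR's `get_nccl_env_vars()`.

```python
import os
from typing import Dict, List, Mapping, Optional


def list_interfaces(sys_net_dir: str = "/sys/class/net") -> List[str]:
    """List interface names via Linux sysfs; [] if the path is unavailable.

    Assumption: the PR may discover interfaces differently.
    """
    try:
        return sorted(os.listdir(sys_net_dir))
    except OSError:
        return []


def keep_user_settings(
    detected: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return only the detected values whose variables the user has not set."""
    env = os.environ if environ is None else environ
    return {name: value for name, value in detected.items() if name not in env}


# Hypothetical scenario: detection found no InfiniBand, but the user
# already pinned the socket interface.
detected = {"NCCL_SOCKET_IFNAME": "^lo", "NCCL_IB_DISABLE": "1"}
user_env = {"NCCL_SOCKET_IFNAME": "eth0"}
to_apply = keep_user_settings(detected, user_env)
print(to_apply)  # {'NCCL_IB_DISABLE': '1'}
```

The returned dictionary can then be merged into the environment of the worker processes before NCCL is initialized.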
@meta-cla bot added the CLA Signed label Nov 13, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.10%. Comparing base (4410e90) to head (fc5e0d3).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #565   +/-   ##
=======================================
  Coverage   84.10%   84.10%           
=======================================
  Files          29       29           
  Lines        3687     3687           
=======================================
  Hits         3101     3101           
  Misses        586      586           

@allenwang28 left a comment

@HosseinKaviani-H merged commit f99d7aa into meta-pytorch:main Nov 13, 2025
5 of 8 checks passed
HosseinKaviani-H pushed a commit to HosseinKaviani-H/forge that referenced this pull request Nov 13, 2025
felipemello1 pushed a commit that referenced this pull request Nov 13, 2025
Co-authored-by: Hossein Kavianihamedani <hosseinkh@fb.com>