@msaroufim
Member

Summary

  • Remove round-robin GPU selection and explicit concurrency groups
  • Let GitHub's native self-hosted runner queuing handle job distribution
  • Each self-hosted runner runs only one job at a time by default, which gives per-GPU isolation as long as each runner is bound to a single GPU
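
A minimal sketch of what the simplified job might look like once the runner-selection step and the concurrency group are gone. The job name `run` and the single step are hypothetical and are not taken from nvidia_workflow.yml; only the `runs-on` labels come from this PR.

```yaml
# Hypothetical sketch: job and step names are illustrative assumptions.
jobs:
  run:
    # No select-runner job and no concurrency group. GitHub keeps the job
    # queued until a runner carrying every listed label is idle.
    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
    steps:
      - name: Show the GPUs visible to this runner
        run: nvidia-smi -L
```

With an array in `runs-on`, a job is only dispatched to a runner that has all of the listed labels, so distribution across machines falls out of whichever matching runner becomes free first.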

Test plan

  • Monitor workflow runs to verify jobs distribute across available GPUs
  • Confirm no job cancellations due to concurrency conflicts

Remove round-robin GPU selection and concurrency groups in favor of
letting GitHub's native self-hosted runner queuing handle distribution.
Self-hosted runners only run one job at a time by default.
Copilot AI review requested due to automatic review settings January 14, 2026 01:02
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to simplify NVIDIA workflow runner selection by removing round-robin GPU selection and explicit concurrency groups, allowing GitHub's native self-hosted runner queuing to handle job distribution. The change adds the self-hosted label to the runner configuration.

Changes:

  • Modified the runs-on configuration in nvidia_workflow.yml to explicitly include the self-hosted label alongside the existing nvidia-docker-b200-8-x86-64 label

concurrency:
  group: ${{ needs.select-runner.outputs.runner }}
  cancel-in-progress: false
runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]

Copilot AI Jan 14, 2026

The addition of the self-hosted label creates an inconsistency with other workflows in the repository. The workflows nvidia-on-prem-health.yml (line 12) and runner_ci.yml (lines 15 and 49) use the same runner label [nvidia-docker-b200-8-x86-64] without the self-hosted prefix. For consistency, either all workflows using this runner should include the self-hosted label, or none should. Additionally, the amd-health.yml workflow uses [amdgpu-mi300-8-x86-64] without self-hosted, suggesting a pattern in this repository of not including the self-hosted label explicitly.

Suggested change
- runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
+ runs-on: [nvidia-docker-b200-8-x86-64]

@msaroufim msaroufim merged commit 2d6cc43 into main Jan 14, 2026
7 checks passed
@github-actions

Coverage report

This PR does not seem to contain any modification to coverable code.
