Simplify NVIDIA workflow runner selection #400

msaroufim · 2026-01-14T01:02:29Z

Summary

- Remove round-robin GPU selection and explicit concurrency groups
- Let GitHub's native self-hosted runner queuing handle job distribution
- Self-hosted runners only run one job at a time by default, ensuring per-GPU isolation

Test plan

- Monitor workflow runs to verify jobs distribute across available GPUs
- Confirm no job cancellations due to concurrency conflicts

Remove round-robin GPU selection and concurrency groups in favor of letting GitHub's native self-hosted runner queuing handle distribution. Self-hosted runners only run one job at a time by default.
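
For illustration, a job after this change might look roughly like the sketch below. Only the `runs-on` labels come from the PR; the workflow name, trigger, job name, and step are hypothetical placeholders. The per-GPU isolation claim also assumes each registered runner is bound to a single GPU, which this page does not show.

```yaml
# Minimal sketch, not the actual nvidia_workflow.yml.
# The name, trigger, job name, and step are hypothetical.
name: nvidia-sketch
on: workflow_dispatch

jobs:
  run:
    # No select-runner job and no concurrency group: GitHub queues the job
    # until an idle runner carrying ALL of these labels picks it up.
    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
    steps:
      # Placeholder step; assumes nvidia-smi is installed on the runner.
      - run: nvidia-smi
```

Because each runner process executes one job at a time, extra jobs wait in the queue rather than being cancelled, which is what the test plan is meant to confirm.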

Copilot

Pull request overview

This PR aims to simplify NVIDIA workflow runner selection by removing round-robin GPU selection and explicit concurrency groups, allowing GitHub's native self-hosted runner queuing to handle job distribution. The change adds the self-hosted label to the runner configuration.

Changes:

Modified the runs-on configuration in nvidia_workflow.yml to explicitly include the self-hosted label alongside the existing nvidia-docker-b200-8-x86-64 label

Copilot · 2026-01-14T01:04:14Z

.github/workflows/nvidia_workflow.yml

-    concurrency:
-      group: ${{ needs.select-runner.outputs.runner }}
-      cancel-in-progress: false
+    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]


The addition of the self-hosted label creates an inconsistency with other workflows in the repository. The workflows nvidia-on-prem-health.yml (line 12) and runner_ci.yml (lines 15 and 49) use the same runner label [nvidia-docker-b200-8-x86-64] without the self-hosted prefix. For consistency, either all workflows using this runner should include the self-hosted label, or none should. Additionally, the amd-health.yml workflow uses [amdgpu-mi300-8-x86-64] without self-hosted, suggesting a pattern in this repository of not including the self-hosted label explicitly.

Suggested change

-    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
+    runs-on: [nvidia-docker-b200-8-x86-64]
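
As a sketch of why the two forms are close in practice: a job is routed to a runner that has every label listed in `runs-on`, and self-hosted runners receive the `self-hosted` label by default. Assuming the runners in this repository kept that default label (not confirmed on this page), both hypothetical jobs below would target the same machines, so the choice is mainly one of consistency across workflows.

```yaml
# Hypothetical fragment; job names and steps are illustrative only.
jobs:
  explicit:
    # Matches runners that have BOTH labels.
    runs-on: [self-hosted, nvidia-docker-b200-8-x86-64]
    steps:
      - run: echo "explicit self-hosted label"
  implicit:
    # Matches any runner with the custom label; equivalent to the job above
    # only if every such runner also carries the default self-hosted label.
    runs-on: [nvidia-docker-b200-8-x86-64]
    steps:
      - run: echo "custom label only"
```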

github-actions · 2026-01-14T01:05:07Z

Coverage report

This PR does not seem to contain any modification to coverable code.

msaroufim added 3 commits January 13, 2026 16:54

GPU concurrency

223a9fe

update

e5b72ea

Simplify NVIDIA workflow runner selection

89007f1

Copilot AI review requested due to automatic review settings January 14, 2026 01:02

Copilot started reviewing on behalf of msaroufim January 14, 2026 01:02

Merge main and resolve conflicts

eabfcdf

Copilot AI reviewed Jan 14, 2026

msaroufim merged commit 2d6cc43 into main Jan 14, 2026
7 checks passed
