Skip to content

Comments

Enhance GPU health check with distributed testing#360

Merged
msaroufim merged 18 commits intomainfrom
msaroufim-patch-7
Sep 24, 2025
Merged

Enhance GPU health check with distributed testing#360
msaroufim merged 18 commits intomainfrom
msaroufim-patch-7

Conversation

@msaroufim
Copy link
Member

Increased timeout for health check job and added distributed health check step to verify GPU availability and initialization.

Description

Please provide a brief summary of the changes in this pull request.

Checklist

Before submitting this PR, ensure the following steps have been completed:

  • Run the slash command /verifyruns on your own server.
    • Run the cluster bot on your server:
      python discord-bot.py
    • Start training runs with the slash command /verifyruns.
    • Verify that the bot eventually responds with:
      ✅ All runs completed successfully!
      
      (It may take a few minutes for all runs to finish. In particular, the GitHub
      runs may take a little longer. The Modal run is typically quick.)
      For more information on running a cluster bot on your own server, see
      README.md.

Increased timeout for health check job and added distributed health check step to verify GPU availability and initialization.
Copilot AI review requested due to automatic review settings September 23, 2025 20:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the AMD GPU health check workflow by adding distributed GPU testing capabilities and increasing the timeout period. The changes expand the health check from simple GPU availability to include multi-GPU distributed initialization and communication testing.

  • Increased job timeout from 5 to 10 minutes to accommodate distributed testing
  • Added comprehensive distributed health check that tests NCCL initialization and inter-GPU communication
  • Implemented dynamic GPU detection to test with available hardware (up to 8 GPUs)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

os.environ['MASTER_PORT'] = str(master_port)

try:
dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The device_id parameter in init_process_group should be an integer, not a torch.device object. Change to device_id=rank instead of device_id=torch.device(f'cuda:{rank}').

Suggested change
dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=torch.device(f'cuda:{rank}'))
dist.init_process_group('nccl', rank=rank, world_size=world_size, device_id=rank)

Copilot uses AI. Check for mistakes.

num_gpus = torch.cuda.device_count()
world_size = min(num_gpus, 8) # Test with available GPUs, up to 8
master_port = 12345 + int(time.time()) % 1000 # One port for all ranks
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment 'One port for all ranks' is misleading. This calculation generates a single port number for the entire distributed group, not one port per rank. Consider updating the comment to 'Generate unique port for this test run' for clarity.

Suggested change
master_port = 12345 + int(time.time()) % 1000 # One port for all ranks
master_port = 12345 + int(time.time()) % 1000 # Generate unique port for this test run

Copilot uses AI. Check for mistakes.
Comment on lines 76 to 78
if p.exitcode != 0:
print('✗ Distributed test failed')
exit(1)
Copy link

Copilot AI Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using exit(1) in a workflow step may not properly propagate the failure. Consider using sys.exit(1) after importing sys, or restructure to use a return statement and check the result outside the Python code block.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link

Coverage report

This PR does not seem to contain any modification to coverable code.

@msaroufim msaroufim merged commit dcaaea9 into main Sep 24, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant