allenwang28
Contributor

It seems that with the Monarch pin update, we need to use

MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE

rather than

MONARCH_HOSTMESH_V1

Additionally, the spawn_procs APIs now match. Tomorrow we can probably just remove the V1 path altogether, but I'm keeping it for now just to make sure things are working.
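
Since the flag was renamed, a launch script that still exports the old name would silently stop enabling the V1 host mesh. A minimal sketch of a guard for that case follows. The helper `stale_flag_warning` is hypothetical (it is not part of Monarch or this repo); only the two variable names come from the description above.

```python
import os

# Both names are taken from the PR description; nothing else here is Monarch API.
NEW_FLAG = "MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE"
OLD_FLAG = "MONARCH_HOSTMESH_V1"


def stale_flag_warning(env=None):
    """Hypothetical helper: return a warning if only the old name is set, else None."""
    env = os.environ if env is None else env
    if OLD_FLAG in env and NEW_FLAG not in env:
        return (
            f"{OLD_FLAG} is set but {NEW_FLAG} is not; "
            "the updated Monarch pin appears to expect the new name"
        )
    return None
```

Calling `stale_flag_warning()` at startup and logging a non-None result would surface the mismatch early. Once the V1 path is removed this check would no longer be needed.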

With this, I see that SLURM 32B works:

MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE=1 TORCHSTORE_RDMA_ENABLED=1 python -m apps.grpo.main --config=apps/grpo/qwen3_32b.yaml
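
For anyone launching this from Python instead of a shell, the same invocation can be sketched as below. `build_launch` is a hypothetical helper, not something in the repo; the environment variables, module name, and config path are copied from the command above, and using `sys.executable` in place of `python` is an assumption that the current interpreter is the intended one.

```python
import os
import sys


def build_launch(config_path, base_env=None):
    """Hypothetical helper: build argv and env matching the command in this PR."""
    # Copy the environment so the caller's os.environ is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["MONARCH_HOST_MESH_V1_REMOVE_ME_BEFORE_RELEASE"] = "1"
    env["TORCHSTORE_RDMA_ENABLED"] = "1"
    argv = [sys.executable, "-m", "apps.grpo.main", f"--config={config_path}"]
    return argv, env


# Example usage (not executed here):
#   import subprocess
#   argv, env = build_launch("apps/grpo/qwen3_32b.yaml")
#   subprocess.run(argv, env=env, check=True)
```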

Note the weight push/update timings:

  rl_trainer_perf/push_weights/ts_save/duration_avg_s: 9.150277181121055
  rl_trainer_perf/push_weights/ts_save/duration_max_s: 9.425452718045563
  policy_worker_perf/update_weights/total_duration_avg_s: 77.36585016347817
  policy_worker_perf/update_weights/total_duration_max_s: 88.17838381230831
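
As a rough way to read these numbers, the sketch below parses the four lines and compares the two averages. The parsing helper is hypothetical, and the comparison is only illustrative: the metric names suggest the two timings are recorded by different components, so the ratio is not a like-for-like measurement.

```python
# Metric lines copied verbatim from the note above.
METRICS = """\
rl_trainer_perf/push_weights/ts_save/duration_avg_s: 9.150277181121055
rl_trainer_perf/push_weights/ts_save/duration_max_s: 9.425452718045563
policy_worker_perf/update_weights/total_duration_avg_s: 77.36585016347817
policy_worker_perf/update_weights/total_duration_max_s: 88.17838381230831
"""


def parse_metrics(text):
    """Hypothetical helper: turn 'name: value' lines into a dict of floats."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so slashes in the metric name are preserved.
        key, _, value = line.rpartition(": ")
        out[key] = float(value)
    return out


metrics = parse_metrics(METRICS)
save_avg = metrics["rl_trainer_perf/push_weights/ts_save/duration_avg_s"]
update_avg = metrics["policy_worker_perf/update_weights/total_duration_avg_s"]
ratio = update_avg / save_avg
print(f"update_weights avg is {ratio:.1f}x the ts_save avg")
```

On these values the average update_weights duration is about 8.5 times the average ts_save duration.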

@meta-cla meta-cla bot added the CLA Signed label Oct 13, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@4b3b3c2). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #398   +/-   ##
=======================================
  Coverage        ?   64.70%           
=======================================
  Files           ?       79           
  Lines           ?     7700           
  Branches        ?        0           
=======================================
  Hits            ?     4982           
  Misses          ?     2718           
  Partials        ?        0           


@allenwang28 allenwang28 merged commit ad3f3f5 into meta-pytorch:main Oct 13, 2025
6 checks passed
@allenwang28 allenwang28 deleted the slurm_fixes branch October 13, 2025 22:56