Closed
Description
🐛 Describe the bug
Get metric logger reference in distributed actors
Issue
Calling get_or_create_metric_logger() inside Forge actors fails with:
AttributeError: NYI: attempting to get ProcMesh attribute 'slice' on object that's actually a ProcMeshRef
Root Cause
get_or_create_metric_logger() internally calls this_proc() which doesn't work inside actors spawned from ProcMesh:
- Inside actors: this_proc() returns ProcMeshRef (proxy object)
- Expected: Returns ProcMesh (actual mesh object)
- Result: AttributeError when trying to access ProcMesh methods
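The failure mode above can be illustrated with a small standalone sketch. It does not use Monarch: `MeshSketch` and `MeshRefSketch` are hypothetical stand-ins for ProcMesh and ProcMeshRef, and the only assumption is that the reference type rejects mesh-only attributes such as `slice`.

```python
class MeshSketch:
    """Hypothetical stand-in for a real mesh object (not Monarch's ProcMesh)."""

    def slice(self, **dims):
        # A real mesh can be sliced; here we simply echo the request.
        return dims


class MeshRefSketch:
    """Hypothetical stand-in for a lightweight reference to a mesh."""

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so every
        # mesh-only attribute ends up here and is rejected.
        raise AttributeError(
            f"NYI: attempting to get mesh attribute {name!r} on a reference"
        )


def needs_real_mesh(mesh):
    # Code written against the real mesh breaks when handed a reference.
    return mesh.slice(gpus=0)


ok = needs_real_mesh(MeshSketch())

try:
    needs_real_mesh(MeshRefSketch())
    error = None
except AttributeError as exc:
    error = str(exc)
```

The same helper succeeds with the real object and fails with the reference, which mirrors why code that assumes a ProcMesh breaks when this_proc() hands back a ProcMeshRef.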
Solution
Use get_or_spawn_controller() from Monarch to get a reference to the already-created global logger:
File: apps/sft/main.py (lines 112-120)
async def setup_metric_logger(self):
    """Retrieve the already-initialized metric logger from main process"""
    from monarch.actor import get_or_spawn_controller
    from forge.observability.metric_actors import GlobalLoggingActor

    # Get reference to the existing global logger (don't create new one)
    mlogger = await get_or_spawn_controller("global_logger", GlobalLoggingActor)
    return mlogger
Why This Works
- Main process (line 322): Creates global logger with get_or_create_metric_logger(process_name="Controller")
- Actor setup (line 132): Gets reference using get_or_spawn_controller("global_logger", GlobalLoggingActor)
  - Looks up the existing controller by name
  - Returns a reference without calling this_proc()
  - No ProcMeshRef errors!
- During training (line 297): Flushes metrics with await self.mlogger.flush.call_one(global_step=self.current_step)
Verified
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
# WandB now shows:
# - ForgeSFTRecipe/train_step/loss
# - ForgeSFTRecipe/train/step
Versions
No response