
[metric logging] - open TODOs #258

@felipemello1

Description

Open issues, in no particular order of priority. None of them are urgent, but addressing them would lead to better design, reliability, and UX.

  1. Allow the user to set the wandb run name via the config instead of generating a random name every time (see the config sketch after this list)
  2. In policy, we don't track metrics for the workers because LocalFetcherActor was not spawning properly. Perhaps because it is being spawned inside of an actor? But that shouldn't be an issue, because it goes through provisioner.py anyway. Solved in [Logging] Measure policy weight update #286
  3. Delete apps/toy_rl/toy_metrics/main.py and turn it into an integration test instead
  4. When 'reduce_across_ranks=True', we spawn a wandb run per rank. Each of them outputs all the system metrics to wandb, so on a machine with 8 GPUs we see 8x8 system metric curves (GPU stats are repeated 8x)
  5. When 'reduce_across_ranks=True', we spawn a wandb run per rank. The run name is defined by 'get_actor_name_with_rank', which currently returns "LocalFetcherActor" as the actor name, but ideally it should return 'TrainActor', 'PolicyActor', etc. Perhaps we can get the name in provisioner.py and pass it as an argument to the LocalFetcherActor being spawned. Solved in [Logging] add time stamp logging + test #303
  6. forge.observability.perf_tracker.py::Tracer calls self._timer.step("end") when the user calls Tracer.stop(), as a way to close the interval since the last step. This last interval is added to the total sum, but it is not logged as an individual step. To do it properly, we would have to force CudaTimer to sort and return times in the order they came. The logic works, but it is brittle and could be improved.
  7. forge.observability.perf_tracker.py::Tracer: support tracking memory on .step(). Currently we only track the delta between .start() and .stop().
  8. forge.observability.perf_tracker.py::Tracer: we currently do not support memory tracking of nested functions. E.g. if we have outer(inner()) and we try to measure memory on both, triggering it in inner resets the max memory allocated (illustrated after this list). Maybe there is a solution to this without the torch profiler?
  9. forge.observability.perf_tracker.py::Tracer: we have end_mem = torch.cuda.memory_allocated(). Consider passing the provisioner-assigned device ID to make sure we are tracking the correct device (see the snippet after this list).
  10. In apps/grpo/main.py, we need to first shut down the metric logger, wait a couple of seconds, then shut down the other services. If we don't do that, when reduce_across_ranks=False, wandb instances in other ranks don't have enough time to shut down and they stay alive. Investigate a more elegant/reliable way of doing it (a sketch of the current workaround follows the list).
  11. forge.observability.perf_tracker.py::Tracer: mark memory tracking as experimental
  12. On shutdown, an annoying message pops up for every local fetcher actor. It seems that it tries to shut down after the parent process. Unclear why, but it appears harmless. The instantiation of LocalFetcherActor happens in provisioner.py. Example:
      E0929 17:11:44.961732 1611188 hyperactor/src/proc.rs:1097] _1GrKkNhv3Bm8[0].local_fetcher_actor[1]: failed to send stop message to parent pid 0: ActorError { actor_id: ActorId(Ranked(WorldId("_1GrKkNhv3Bm8"), 0), "local_fetcher_actor", 0), kind: MailboxSender(MailboxSenderError { location: Unbound(ActorId(Ranked(WorldId("_1GrKkNhv3Bm8"), 0), "local_fetcher_actor", 0), "hyperactor::actor::Signal"), kind: Other(channel closed) }) }
  13. We currently log {class_name}/{fn_name}/{metric_name}. @joecummings suggested that capturing the class and fn should be done automatically (a possible approach is sketched after this list). Investigate whether this is a good choice.
  14. When initializing backends in main.py, we need to call it AFTER the services spawn, or it can hang in the mode (logging_mode: per_rank_no_reduce, per_rank_share_run: True). Perhaps there is some race condition and we need a delay between creating the mlogger and calling init_backends? Edit: solved in Metric Logging updates 1/4 #345
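
Regarding item 1, a minimal sketch of how the run name could flow from the config to wandb. The config keys here are hypothetical, not the current schema; `wandb.init(name=...)` is the standard wandb API:

```python
import wandb

# Hypothetical config keys; today the name is auto-generated instead.
cfg = {"project": "grpo_experiments", "run_name": "qwen3_8gpu_lr1e-5"}

# Pass the user-chosen name through; None keeps wandb's generated name.
run = wandb.init(project=cfg["project"], name=cfg.get("run_name"))
```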
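
Regarding item 8, a self-contained illustration of why nested memory tracking breaks: both scopes depend on the same global peak counter, so resetting it in the inner scope erases the peak the outer scope already reached. This is plain torch, not forge's Tracer:

```python
import torch

def measure_peak(fn):
    # Resets the single global peak counter for the current device,
    # which is exactly what clobbers an enclosing measurement.
    torch.cuda.reset_peak_memory_stats()
    fn()
    return torch.cuda.max_memory_allocated()

def inner():
    x = torch.empty(1024, 1024, device="cuda")   # small allocation

def outer():
    tmp = torch.empty(4096, 4096, device="cuda")  # outer's true peak includes this
    del tmp                                        # freed before inner runs
    measure_peak(inner)                            # inner's reset wipes the record of that peak

peak = measure_peak(outer)
# `peak` now under-reports outer: the 4096x4096 buffer is missing because
# inner's reset_peak_memory_stats() ran after it was freed.
```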
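
Regarding item 9, torch.cuda.memory_allocated accepts a device argument, so the change could be as small as passing the assigned device through (the provisioner lookup itself is assumed here, not shown):

```python
import torch

device = torch.device("cuda:3")                 # assume: the device ID the provisioner assigned
end_mem = torch.cuda.memory_allocated(device)   # instead of the implicit current device
```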
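
Regarding item 10, a minimal sketch of the current workaround, with hypothetical names for the logger and service handles (the real calls live in apps/grpo/main.py):

```python
import asyncio

async def shutdown_all(mlogger, services):
    await mlogger.shutdown()        # close the per-rank wandb runs first
    await asyncio.sleep(2)          # give non-zero ranks time to finish flushing
    for service in services:        # only then tear down the remaining services
        await service.shutdown()
```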
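
Regarding item 13, one way to capture the caller's class and function automatically is the standard inspect module; the helper below is illustrative only, not forge's API:

```python
import inspect

def qualified_metric_name(metric_name: str) -> str:
    # Note: inspect.stack() has non-trivial overhead, which is part of the trade-off.
    caller = inspect.stack()[1]                     # frame that called this helper
    fn_name = caller.function
    self_obj = caller.frame.f_locals.get("self")    # set for bound methods
    class_name = type(self_obj).__name__ if self_obj is not None else "<module>"
    return f"{class_name}/{fn_name}/{metric_name}"

class TrainActor:
    def train_step(self):
        # would produce "TrainActor/train_step/loss"
        return qualified_metric_name("loss")
```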


Labels

Better Engineering (tasks which help improve eng productivity, e.g. building tools, cleaning up code, writing docs), Tracking Issue (context for long-tailed tracking), enhancement (new feature or request)
