
[metric logging] - open TODOs #258

@felipemello1

Description

Open issues, in no particular order of priority. None of them are urgent, but addressing them would lead to better design, reliability, and UX.

  1. Allow the user to set the wandb run name via the config instead of generating a random name every time (see the config sketch after this list)
  2. In policy, we don't track metrics for the workers because LocalFetcherActor was not spawning properly. Perhaps because it is being spawned inside of an actor? But that shouldn't be an issue, because it goes through provisioner.py anyway. Solved in [Logging] Measure policy weight update #286
  3. Delete apps/toy_rl/toy_metrics/main.py and turn it into an integration test instead
  4. When 'reduce_across_ranks=True', we spawn a wandb run per rank. Each of them outputs all the system metrics to wandb, so on a machine with 8 GPUs we see 8x8 system metric curves (GPU stats are repeated 8x)
  5. When 'reduce_across_ranks=True', we spawn a wandb run per rank. The run name is defined by 'get_actor_name_with_rank', which currently returns "LocalFetcherActor" as the actor name, but ideally it should return 'TrainActor', 'PolicyActor', etc. Perhaps we can get the name in provisioner.py and pass it as an argument to the LocalFetcherActor being spawned. Solved in [Logging] add time stamp logging + test #303
  6. forge.observability.perf_tracker.py::Tracer calls self._timer.step("end") when the user calls Tracer.stop(), as a way to close the interval since the last step. This last interval is added to the total sum, but it is not logged as an individual step. To do it properly, we would have to force CudaTimer to sort and return times in the order they came. The logic works, but it is brittle and could be improved.
  7. forge.observability.perf_tracker.py::Tracer: support tracking memory on .step(). Currently we only track the delta between .start() and .stop().
  8. forge.observability.perf_tracker.py::Tracer: we currently do not support memory tracking of nested functions. E.g. if we have outer(inner()) and we try to measure memory on both, triggering it in inner resets the max memory allocated (illustrated after this list). Maybe there is a solution to this without the torch profiler?
  9. forge.observability.perf_tracker.py::Tracer: we have end_mem = torch.cuda.memory_allocated(). Consider passing the provisioner-assigned device ID to make sure we are tracking the correct device (see the snippet after this list).
  10. In apps/grpo/main.py, we need to first shut down the metric logger, wait a couple of seconds, then shut down the other services. If we don't do that, when reduce_across_ranks=False, wandb instances in other ranks don't have enough time to shut down and they stay alive. Investigate a more elegant/reliable way of doing it (a sketch of the current workaround follows the list).
  11. forge.observability.perf_tracker.py::Tracer: mark memory tracking as experimental
  12. On shutdown, an annoying message pops up for every local fetcher actor. It seems that it tries to shut down after the parent process. Unclear why, but it appears harmless. The instantiation of LocalFetcherActor happens in provisioner.py. Example:
      E0929 17:11:44.961732 1611188 hyperactor/src/proc.rs:1097] _1GrKkNhv3Bm8[0].local_fetcher_actor[1]: failed to send stop message to parent pid 0: ActorError { actor_id: ActorId(Ranked(WorldId("_1GrKkNhv3Bm8"), 0), "local_fetcher_actor", 0), kind: MailboxSender(MailboxSenderError { location: Unbound(ActorId(Ranked(WorldId("_1GrKkNhv3Bm8"), 0), "local_fetcher_actor", 0), "hyperactor::actor::Signal"), kind: Other(channel closed) }) }
  13. We currently log {class_name}/{fn_name}/{metric_name}. @joecummings suggested that capturing the class and fn should be done automatically (a possible approach is sketched after this list). Investigate whether this is a good choice.
  14. When initializing backends in main.py, we need to call it AFTER the services spawn, or it can hang in the mode (logging_mode: per_rank_no_reduce, per_rank_share_run: True). Perhaps there is some race condition and we need a delay between creating the mlogger and calling init_backends? Edit: solved in Metric Logging updates 1/4 #345
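
Regarding item 1, a minimal sketch of how the run name could flow from the config to wandb. The config keys here are hypothetical, not the current schema; `wandb.init(name=...)` is the standard wandb API:

```python
import wandb

# Hypothetical config keys; today the name is auto-generated instead.
cfg = {"project": "grpo_experiments", "run_name": "qwen3_8gpu_lr1e-5"}

# Pass the user-chosen name through; None keeps wandb's generated name.
run = wandb.init(project=cfg["project"], name=cfg.get("run_name"))
```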
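
Regarding item 8, a self-contained illustration of why nested memory tracking breaks: both scopes depend on the same global peak counter, so resetting it in the inner scope erases the peak the outer scope already reached. This is plain torch, not forge's Tracer:

```python
import torch

def measure_peak(fn):
    # Resets the single global peak counter for the current device,
    # which is exactly what clobbers an enclosing measurement.
    torch.cuda.reset_peak_memory_stats()
    fn()
    return torch.cuda.max_memory_allocated()

def inner():
    x = torch.empty(1024, 1024, device="cuda")   # small allocation

def outer():
    tmp = torch.empty(4096, 4096, device="cuda")  # outer's true peak includes this
    del tmp                                        # freed before inner runs
    measure_peak(inner)                            # inner's reset wipes the record of that peak

peak = measure_peak(outer)
# `peak` now under-reports outer: the 4096x4096 buffer is missing because
# inner's reset_peak_memory_stats() ran after it was freed.
```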
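
Regarding item 9, torch.cuda.memory_allocated accepts a device argument, so the change could be as small as passing the assigned device through (the provisioner lookup itself is assumed here, not shown):

```python
import torch

device = torch.device("cuda:3")                 # assume: the device ID the provisioner assigned
end_mem = torch.cuda.memory_allocated(device)   # instead of the implicit current device
```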
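
Regarding item 10, a minimal sketch of the current workaround, with hypothetical names for the logger and service handles (the real calls live in apps/grpo/main.py):

```python
import asyncio

async def shutdown_all(mlogger, services):
    await mlogger.shutdown()        # close the per-rank wandb runs first
    await asyncio.sleep(2)          # give non-zero ranks time to finish flushing
    for service in services:        # only then tear down the remaining services
        await service.shutdown()
```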
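
Regarding item 13, one way to capture the caller's class and function automatically is the standard inspect module; the helper below is illustrative only, not forge's API:

```python
import inspect

def qualified_metric_name(metric_name: str) -> str:
    # Note: inspect.stack() has non-trivial overhead, which is part of the trade-off.
    caller = inspect.stack()[1]                     # frame that called this helper
    fn_name = caller.function
    self_obj = caller.frame.f_locals.get("self")    # set for bound methods
    class_name = type(self_obj).__name__ if self_obj is not None else "<module>"
    return f"{class_name}/{fn_name}/{metric_name}"

class TrainActor:
    def train_step(self):
        # would produce "TrainActor/train_step/loss"
        return qualified_metric_name("loss")
```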


Labels

Better Engineering (tasks which help improve eng productivity, e.g. building tools, cleaning up code, writing docs), Tracking Issue (context for long-tailed tracking), enhancement (new feature or request)
