Monitoring

DeepSpeed’s Monitor module can log training details to a TensorBoard-compatible file, to WandB, or to simple CSV files. Below is an overview of what DeepSpeed logs automatically.
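As a rough illustration of the three backends named above, the sketch below builds the monitor-related part of a DeepSpeed config as a Python dict. The section and key names (`tensorboard`, `wandb`, `csv_monitor`, `output_path`, `job_name`, `team`, `group`, `project`) reflect the DeepSpeed config JSON as commonly documented and should be checked against the installed version; all values are placeholders, and `enabled_backends` is a hypothetical helper, not part of DeepSpeed.

```python
import json

# Monitor sections of a DeepSpeed-style config. Key names are assumed from
# the DeepSpeed config JSON; paths and names are placeholder values.
monitor_config = {
    "tensorboard": {
        "enabled": True,
        "output_path": "output/ds_logs/",
        "job_name": "train_bert",
    },
    "wandb": {
        "enabled": False,
        "team": "my_team",
        "group": "my_group",
        "project": "my_project",
    },
    "csv_monitor": {
        "enabled": True,
        "output_path": "output/ds_logs/",
        "job_name": "train_bert",
    },
}


def enabled_backends(config):
    """Hypothetical helper: list the sections whose "enabled" flag is set."""
    return [name for name, section in config.items() if section.get("enabled")]


# Print the JSON form that would be merged into a DeepSpeed config file.
print(json.dumps(monitor_config, indent=2))
print(enabled_backends(monitor_config))
```

More than one backend can be switched on in the same config; in this sketch TensorBoard and CSV logging are enabled while WandB is left off.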

.. list-table:: Automatically Logged Data
    :header-rows: 1

    * - Field
      - Description
      - Condition
    * - ``Train/Samples/train_loss``
      - The training loss.
      - None
    * - ``Train/Samples/lr``
      - The learning rate during training.
      - None
    * - ``Train/Samples/loss_scale``
      - The loss scale when training using ``fp16``.
      - ``fp16`` must be enabled.
    * - ``Train/Eigenvalues/ModelBlockParam_{i}``
      - Eigenvalues per parameter block.
      - ``eigenvalue`` must be enabled.
    * - ``Train/Samples/elapsed_time_ms_forward``
      - The global duration of the forward pass.
      - ``flops_profiler.enabled`` or ``wall_clock_breakdown``.
    * - ``Train/Samples/elapsed_time_ms_backward``
      - The global duration of the backward pass.
      - ``flops_profiler.enabled`` or ``wall_clock_breakdown``.
    * - ``Train/Samples/elapsed_time_ms_backward_inner``
      - The backward time that does not include the gradient reduction time.
      - Meaningful only when the gradient reduction is not overlapped with the backward pass. If it is overlapped, the inner time should be about the same as the entire backward time.
    * - ``Train/Samples/elapsed_time_ms_backward_allreduce``
      - The global duration of the allreduce operation.
      - ``flops_profiler.enabled`` or ``wall_clock_breakdown``.
    * - ``Train/Samples/elapsed_time_ms_step``
      - The optimizer step time.
      - ``flops_profiler.enabled`` or ``wall_clock_breakdown``.
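The conditions in the table can be read as a simple mapping from config flags to logged fields. The sketch below encodes that mapping in a hypothetical helper, `expected_fields`, which is not part of DeepSpeed. It assumes each feature is switched on through an `"enabled"` key in its config section (with `wall_clock_breakdown` as a top-level boolean), and it groups `elapsed_time_ms_backward_inner` with the other timing fields, which is also an assumption.

```python
# Hypothetical helper mirroring the table above; not a DeepSpeed API.

TIMING_PHASES = ("forward", "backward", "backward_inner", "backward_allreduce", "step")


def _enabled(config, section):
    """Return True if config[section]["enabled"] is truthy."""
    return bool(config.get(section, {}).get("enabled", False))


def expected_fields(config):
    """List the field names the table says would be logged for this config."""
    # Loss and learning rate have no condition.
    fields = ["Train/Samples/train_loss", "Train/Samples/lr"]
    if _enabled(config, "fp16"):
        fields.append("Train/Samples/loss_scale")
    if _enabled(config, "eigenvalue"):
        # {i} stands for the parameter block index, as in the table.
        fields.append("Train/Eigenvalues/ModelBlockParam_{i}")
    # Timing fields need the flops profiler or the wall clock breakdown.
    if _enabled(config, "flops_profiler") or config.get("wall_clock_breakdown", False):
        fields += [f"Train/Samples/elapsed_time_ms_{phase}" for phase in TIMING_PHASES]
    return fields


# Example config: fp16 training with wall clock breakdown turned on.
ds_config = {
    "fp16": {"enabled": True},
    "wall_clock_breakdown": True,
}

for name in expected_fields(ds_config):
    print(name)
```

With an empty config the helper returns only the loss and learning-rate fields, matching the two rows whose condition is "None".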

TensorBoard

WandB

CSV Monitor