Our goal for the training library is to let downstream consumers track the state of their training job.
In order to effectively track the progress of a training job, we need to emit the following information:
- total number of samples per epoch
- total number of epochs
- samples seen in an epoch thus far
We already track the second and third items (total epochs and samples seen so far); only the first, the total number of samples per epoch, is missing.
The fix is simple: add a total_samples field to our existing log statement.
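As a rough illustration of the change, the sketch below builds a progress record that includes total_samples and logs it as JSON. The function names, the logger name, and every field name other than total_samples are hypothetical and not taken from the library's actual log statement.

```python
import json
import logging

logger = logging.getLogger("training")  # hypothetical logger name


def build_progress_record(epoch, num_epochs, samples_seen, total_samples):
    """Collect the fields a consumer needs to compute job progress.

    All field names except total_samples are assumptions for illustration.
    """
    return {
        "epoch": epoch,
        "num_epochs": num_epochs,
        "samples_seen": samples_seen,
        "total_samples": total_samples,  # the newly added field
    }


def log_progress(epoch, num_epochs, samples_seen, total_samples):
    """Emit the progress record as a single JSON log line and return it."""
    record = build_progress_record(epoch, num_epochs, samples_seen, total_samples)
    logger.info(json.dumps(record))
    return record


record = log_progress(epoch=1, num_epochs=3, samples_seen=256, total_samples=1024)
```

With all three values in the same record, a consumer can compute per-epoch progress as samples_seen / total_samples without any knowledge of the dataset.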
An alternative would be to enumerate over the dataloader and compute progress directly as step/len(train_loader).
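A minimal sketch of that alternative follows. It assumes train_loader supports len() and uses a plain list of batches as a stand-in for a real dataloader. The helper name epoch_progress and the choice to count steps from 1 are assumptions made for illustration.

```python
def epoch_progress(train_loader):
    """Yield (step, fraction_done, batch) for one pass over train_loader.

    Steps are counted from 1 so the fraction reaches 1.0 on the last batch.
    """
    total_steps = len(train_loader)
    for step, batch in enumerate(train_loader, start=1):
        yield step, step / total_steps, batch


# Stand-in for a real dataloader: four batches, the last one partial.
train_loader = [[0, 1], [2, 3], [4, 5], [6]]
fractions = [fraction for _, fraction, _ in epoch_progress(train_loader)]
```

Note that this measures progress in batches rather than samples, so a partial final batch counts the same as a full one.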