The "Test of Epoch [number]" displays an incorrect number #1425
Comments
@Franck-Dernoncourt This is actually not a bug: the current epoch is computed from your current parameters and the global step count persisted in the checkpoint. Take a close look at this excerpt:

```python
# Number of GPUs per worker - fixed for now by local reality or cluster setup
gpus_per_worker = len(available_devices)
# Number of batches processed per job per worker
batches_per_job = gpus_per_worker * max(1, FLAGS.iters_per_worker)
# Number of batches per global step
batches_per_step = gpus_per_worker * max(1, FLAGS.replicas_to_agg)
# Number of global steps per epoch - to be at least 1
steps_per_epoch = max(1, model_feeder.train.total_batches // batches_per_step)
# The start epoch of our training
self._epoch = step // steps_per_epoch
```

So what happens is that your set size during training differs from your current set size, hence the strange epoch number.
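A minimal sketch of the effect described above, with made-up numbers (the helper `epoch_from_step` and the step value 77263 are hypothetical, chosen to match the reported output):

```python
# Hypothetical helper mirroring the epoch computation quoted above.
def epoch_from_step(step, total_batches, batches_per_step):
    # Number of global steps per epoch - at least 1
    steps_per_epoch = max(1, total_batches // batches_per_step)
    # The start epoch derived from the restored global step
    return step // steps_per_epoch

# Suppose training accumulated a global step of 77263 in the checkpoint.
persisted_step = 77263

# Restart with --limit_train 1: the current set has only 1 batch,
# so steps_per_epoch collapses to 1 and the restored step is read
# back as an epoch count.
print(epoch_from_step(persisted_step, total_batches=1, batches_per_step=1))
# -> 77263
```

With the original (larger) set size, the same persisted step divides down to a small epoch number; only the mismatch between the two set sizes produces the surprising value.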
Got it, thanks for the explanation!
The "Test of Epoch [number]" seems to sometimes display an incorrect number.
In the following example, it says `Test of Epoch 77263`, even though there should be just 1 epoch from my understanding, since I gave

    --display_step 1 --limit_train 1 --limit_dev 1 --limit_test 1 --early_stop False --epoch 1

as arguments.

Corresponding Discourse thread: https://discourse.mozilla.org/t/what-does-the-test-of-epoch-number-mean/29770/2