Runtime estimator #1991
…mvpatel2000/composer into mvpatel2000/bottom-up-runtime-estimator
Note: The odd spikes in the wandb graph appear because wandb associates the last logged time with a given step, which distorts the graph when the x-axis is time. I spent a while investigating; happy to discuss offline since it's a little too complicated to write up. As a side effect, @eracah and I discovered that the point where the timestamp is advanced, relative to when metrics are calculated, is inconsistent between batches and epochs. Along for the ride, I reordered the epoch one to be consistent.
say sorry to me
jk
Let's discuss offline... we should get you through a starter task and maybe a training project before you're ready
LGTM with some comments
Also, I'm willing to merge this with just manual tests, but please do add some unit tests in a follow-on PR.
LGTM, after addressing comments. +1 to adding tests in a separate PR
Added a test.
What does this PR do?
Estimates remaining runtime. Uses a simple timer interpolation plus a correction for eval. The graph shows a "normal" run and a corrected run with eval.
Note that this doesn't always work smoothly, e.g. ResNet estimates are not great because the first epoch is a lot slower due to dataloader issues.
We will probably hit more issues later on, so this will require further corrections.
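For readers skimming the thread, here is a rough sketch of what "timer interpolation plus a correction for eval" could look like. This is not the code in this PR: the function name, its parameters, and the choice to average past eval durations are all assumptions made for illustration.

```python
def estimate_remaining_seconds(
    elapsed_train_s: float,
    fraction_done: float,
    eval_total_s: float = 0.0,
    evals_done: int = 0,
    evals_remaining: int = 0,
) -> float:
    """Hypothetical helper: linearly extrapolate remaining wall-clock time.

    elapsed_train_s is assumed to exclude time spent in eval, so that eval
    time does not inflate the projected training rate.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")

    # Interpolation: if fraction_done of training took elapsed_train_s,
    # the whole run is projected to take elapsed_train_s / fraction_done.
    projected_total_train_s = elapsed_train_s / fraction_done
    remaining_train_s = projected_total_train_s * (1.0 - fraction_done)

    # Eval correction (assumed form): charge each remaining eval pass
    # the average duration of the eval passes seen so far.
    avg_eval_s = eval_total_s / evals_done if evals_done > 0 else 0.0
    return remaining_train_s + avg_eval_s * evals_remaining


if __name__ == "__main__":
    # 25% done after 100 s of training, 2 evals took 20 s total, 6 evals left.
    print(estimate_remaining_seconds(100.0, 0.25, 20.0, 2, 6))
```

A linear extrapolation like this is also why a slow first epoch (as in the ResNet case) skews early estimates: the initial rate is not representative of the rest of the run.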
Important Questions:
wall_clock/ is intuitive, but everything else is in seconds, so we need to report in seconds (whereas we would probably prefer to report hours?)
What issue(s) does this change relate to?
CO-1664