Add support for walltime-based saving/logging/evaluating #29984
Comments
FYI @muellerzr
Seems like a good idea to me! Re;
@muellerzr Just an idea, maybe the start time can be added as a property to TrainerState? It can then be read in the
So concretely:
If that sounds good I can give it a go.
Yes I'm open to that! On init we can set it to
@muellerzr I started working on this. I am not entirely sure how to specify the interval, though. So in case IntervalStrategy==TIME, do we assume that
Just an aside, to me this would be both, better to oversave than under. The For interval time, use
Ah, that's also possible, as just an extra check. What about:
as additional arguments in TrainingArguments? Yeah for the delta we can just do
Yep! :D And now it's a very simple API
Feature request
We currently have the save strategies `epoch` or `steps`. It would be useful to add one for `time`, too. After every backward pass we check if a given time interval has passed. If it has, save a checkpoint and reset the timer.
Motivation
Motivation comes from usage on clusters where you have a job time limit. You can first do a test run and see how long a step takes on average and extrapolate from there, but relying on walltime would probably be easier.
Your contribution
I can work on this. I think a condition should be added to the `DefaultFlowCallback` (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L432) to also include this new strategy. I am not sure yet how to track the starting time, though. Should it be passed separately and saved in the `trainer` instance? Or added to `args`?
In terms of implementation, a lot of inspiration can be taken from https://twitter.com/StasBekman/status/1774842972795982160
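To make the idea concrete, here is a minimal sketch of the check described in the feature request: test whether a wall-clock interval has elapsed and, if so, reset the timer. `WalltimeTrigger`, `interval_seconds`, and the injectable `clock` are hypothetical names for illustration only and are not part of transformers; how the start time would actually be stored (trainer, args, or state) is still an open question in this issue.

```python
import time
from typing import Callable, Optional


class WalltimeTrigger:
    """Fires once every `interval_seconds` of elapsed wall-clock time.

    Hypothetical helper for illustration; not part of transformers.
    """

    def __init__(
        self,
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self.interval_seconds = interval_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_fired: Optional[float] = None

    def start(self) -> None:
        """Record the reference time, e.g. when training begins."""
        self._last_fired = self._clock()

    def should_fire(self) -> bool:
        """Return True and reset the timer if the interval has elapsed."""
        if self._last_fired is None:
            # First call without an explicit start(): begin timing now.
            self.start()
            return False
        now = self._clock()
        if now - self._last_fired >= self.interval_seconds:
            self._last_fired = now  # reset the timer after firing
            return True
        return False
```

One possible way to wire this up, assuming the existing callback hooks are used: call `start()` in a `TrainerCallback.on_train_begin` and, in `on_step_end`, set `control.should_save = True` whenever `should_fire()` returns True. The check runs only at step boundaries, so a checkpoint can land up to one step later than the configured interval.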