Add support for walltime-based saving/logging/evaluating #29984

Open
BramVanroy opened this issue Apr 1, 2024 · 8 comments
Labels: Feature request, trainer

Comments

@BramVanroy (Collaborator) commented Apr 1, 2024

Feature request

We currently have the save strategies epoch and steps. It would be useful to add a time-based one, too: after every backward pass we check whether a given time interval has elapsed, and if it has, we save a checkpoint and reset the timer.

Motivation

The motivation comes from usage on clusters with a job time limit. You can do a test run first, see how long a step takes on average, and extrapolate from there, but relying on walltime directly would be easier.

Your contribution

I can work on this. I think a condition should be added to DefaultFlowCallback (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L432) to also cover this new strategy. I am not sure yet how to track the starting time, though. Should it be passed separately and saved on the trainer instance, or added to args?

In terms of implementation, a lot of inspiration can be taken from https://twitter.com/StasBekman/status/1774842972795982160
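
To make the idea concrete, here is a minimal sketch of the described mechanism as a user-side callback built on the existing TrainerCallback/TrainerControl API; the class name WalltimeSaveCallback and the save_interval_secs argument are hypothetical, not existing transformers API.

```python
import time

from transformers import TrainerCallback


class WalltimeSaveCallback(TrainerCallback):
    """Request a checkpoint whenever `save_interval_secs` of wall time has elapsed."""

    def __init__(self, save_interval_secs: float = 30 * 60):
        self.save_interval_secs = save_interval_secs
        self.last_save_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        # Start the clock when training actually begins, not when the Trainer is built.
        self.last_save_time = time.monotonic()

    def on_step_end(self, args, state, control, **kwargs):
        # Runs after every optimizer step, i.e. right after the backward pass.
        if time.monotonic() - self.last_save_time >= self.save_interval_secs:
            control.should_save = True
            self.last_save_time = time.monotonic()  # reset the timer
        return control
```

Such a callback could be passed via the existing callbacks argument, e.g. Trainer(..., callbacks=[WalltimeSaveCallback(45 * 60)]); the feature request is about building the equivalent into the default flow instead.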

ArthurZucker added the Feature request and trainer labels on Apr 2, 2024
@ArthurZucker (Collaborator)

FYI @muellerzr

@muellerzr (Contributor)

Seems like a good idea to me! Re: _TRAIN_START_TIME, I think we can set that when trainer.train() is called; the callbacks have a hook that fires when training begins (literally called on_train_begin), which only gets called once in _inner_training_loop, before the epoch iterations start.

@BramVanroy (Collaborator, Author) commented Apr 2, 2024

@muellerzr Just an idea: maybe the start time can be added as a property on TrainerState? It can then be read in the on_step_end and on_epoch_end hooks of the callbacks, since the state is passed to them. That would mean the train start time is set when the Trainer is initialized, though, so perhaps the start time in the state should be set/updated in on_train_begin instead.

So concretely:

  • add "train_start_time" to TrainerState
  • add on_train_begin to DefaultFlowCallback, which sets train_start_time in the state to the current time
  • add logic so that, if time-based save/log/evaluate is set in the args, on_step_end and on_epoch_end set the corresponding should_X to True once the interval has elapsed and reset the timer

If that sounds good I can give it a go.
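
A rough, stripped-down sketch of what those three changes might look like; save_time_interval here is a hypothetical argument standing in for however the interval ends up being configured, and both classes are heavily simplified stand-ins for the real ones.

```python
import time
from dataclasses import dataclass


@dataclass
class TrainerState:  # stand-in for transformers' TrainerState, showing only the new field
    train_start_time: float = -1.0  # sentinel: state instantiated, training not yet started


class DefaultFlowCallback:  # showing only the proposed additions
    def on_train_begin(self, args, state, control, **kwargs):
        # Record the wall-clock start of training on the state.
        state.train_start_time = time.time()
        return control

    def on_step_end(self, args, state, control, **kwargs):
        # Hypothetical time-based interval (in seconds); None means the feature is off.
        interval = getattr(args, "save_time_interval", None)
        if interval is not None and time.time() - state.train_start_time >= interval:
            control.should_save = True
            state.train_start_time = time.time()  # reset the timer
        return control
```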

@muellerzr (Contributor)

Yes I'm open to that!

On init we can set it to -99 or something equivalent, so we know it's been instantiated but training hasn't started yet.

@BramVanroy (Collaborator, Author)

@muellerzr I started working on this. I am not entirely sure how to specify the interval, though. If IntervalStrategy == TIME, do we assume that logging_steps (and the save/eval equivalents) are given in minutes? I considered allowing datetime strings, but I fear that would be a typing nightmare on the CLI, so keeping it as an int seems best. WDYT?

@muellerzr (Contributor)

Just an aside: to me this would be both, since it's better to oversave than to undersave. The time-based trigger is more of a "backup", and we keep the epoch- and step-based strategies as they are.

For the time interval, use a timedelta, similar to what torch.distributed uses for its timeout: https://docs.python.org/3/library/datetime.html#datetime.timedelta
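
For illustration, a timedelta-based interval could be compared against elapsed wall time roughly like this; the variable names are placeholders, not proposed API.

```python
from datetime import datetime, timedelta

save_time_interval = timedelta(minutes=30)  # analogous to torch.distributed's timeout argument
train_start_time = datetime.now()

# ... later, after a training step ...
if datetime.now() - train_start_time >= save_time_interval:
    print("interval elapsed: request a checkpoint")
    train_start_time = datetime.now()  # reset the timer
```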

@BramVanroy (Collaborator, Author) commented Apr 11, 2024

Ah, that's also possible, as just an extra check. What about:

  • save_every_minutes
  • log_every_minutes
  • eval_every_minutes

as additional arguments in TrainingArguments? Yeah, for the delta we can just do (datetime - datetime).total_seconds() / 60.
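
A minimal sketch of that elapsed-minutes check with the proposed argument names; how they would be wired into TrainingArguments and the flow callback is still open, so this is illustration only.

```python
from datetime import datetime

save_every_minutes = 45              # proposed TrainingArguments field
train_start_time = datetime.now()    # set in on_train_begin, stored on TrainerState

# Inside on_step_end / on_epoch_end:
elapsed_minutes = (datetime.now() - train_start_time).total_seconds() / 60
if save_every_minutes is not None and elapsed_minutes >= save_every_minutes:
    should_save = True                 # control.should_save in the real flow callback
    train_start_time = datetime.now()  # reset the timer
```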

@muellerzr (Contributor)

Yep! :D

And now it's a very simple API
