Add support for walltime-based saving/logging/evaluating #29984

Open
BramVanroy opened this issue Apr 1, 2024 · 8 comments
Labels: Feature request, trainer

Comments

@BramVanroy (Collaborator) commented Apr 1, 2024

Feature request

We currently have the save strategies epoch and steps. It would be useful to add a time-based one, too: after every backward pass we check whether a given time interval has elapsed, and if it has, we save a checkpoint and reset the timer.

Motivation

The motivation comes from usage on clusters with a job time limit. You can do a test run first, see how long a step takes on average, and extrapolate from there, but relying on walltime directly would be easier.

Your contribution

I can work on this. I think a condition should be added to DefaultFlowCallback (https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L432) to also cover this new strategy. I am not sure yet how to track the starting time, though. Should it be passed separately and saved on the trainer instance, or added to args?

In terms of implementation, a lot of inspiration can be taken from https://twitter.com/StasBekman/status/1774842972795982160
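
To make the idea concrete, here is a minimal sketch of the described mechanism as a user-side callback built on the existing TrainerCallback/TrainerControl API; the class name WalltimeSaveCallback and the save_interval_secs argument are hypothetical, not existing transformers API.

```python
import time

from transformers import TrainerCallback


class WalltimeSaveCallback(TrainerCallback):
    """Request a checkpoint whenever `save_interval_secs` of wall time has elapsed."""

    def __init__(self, save_interval_secs: float = 30 * 60):
        self.save_interval_secs = save_interval_secs
        self.last_save_time = None

    def on_train_begin(self, args, state, control, **kwargs):
        # Start the clock when training actually begins, not when the Trainer is built.
        self.last_save_time = time.monotonic()

    def on_step_end(self, args, state, control, **kwargs):
        # Runs after every optimizer step, i.e. right after the backward pass.
        if time.monotonic() - self.last_save_time >= self.save_interval_secs:
            control.should_save = True
            self.last_save_time = time.monotonic()  # reset the timer
        return control
```

Such a callback could be passed via the existing callbacks argument, e.g. Trainer(..., callbacks=[WalltimeSaveCallback(45 * 60)]); the feature request is about building the equivalent into the default flow instead.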

ArthurZucker added the Feature request and trainer labels on Apr 2, 2024
@ArthurZucker (Collaborator)

FYI @muellerzr

@muellerzr (Contributor)

Seems like a good idea to me! Re: _TRAIN_START_TIME, I think we can set that when trainer.train() is called; the callbacks have a hook that fires when training begins (literally called on_train_begin), which only gets called once in _inner_training_loop, before the epoch iterations start.

@BramVanroy (Collaborator, Author) commented Apr 2, 2024

@muellerzr Just an idea: maybe the start time can be added as a property on TrainerState? It can then be read in the on_step_end and on_epoch_end hooks of the callbacks, since the state is passed to them. That would mean the train start time is set when the Trainer is initialized, though, so perhaps the start time in the state should be set/updated in on_train_begin instead.

So concretely:

  • add "train_start_time" to TrainerState
  • add on_train_begin to DefaultFlowCallback, which sets train_start_time in the state to the current time
  • add logic so that, if time-based save/log/evaluate is set in the args, on_step_end and on_epoch_end set the corresponding should_X to True once the interval has elapsed and reset the timer

If that sounds good I can give it a go.
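
A rough, stripped-down sketch of what those three changes might look like; save_time_interval here is a hypothetical argument standing in for however the interval ends up being configured, and both classes are heavily simplified stand-ins for the real ones.

```python
import time
from dataclasses import dataclass


@dataclass
class TrainerState:  # stand-in for transformers' TrainerState, showing only the new field
    train_start_time: float = -1.0  # sentinel: state instantiated, training not yet started


class DefaultFlowCallback:  # showing only the proposed additions
    def on_train_begin(self, args, state, control, **kwargs):
        # Record the wall-clock start of training on the state.
        state.train_start_time = time.time()
        return control

    def on_step_end(self, args, state, control, **kwargs):
        # Hypothetical time-based interval (in seconds); None means the feature is off.
        interval = getattr(args, "save_time_interval", None)
        if interval is not None and time.time() - state.train_start_time >= interval:
            control.should_save = True
            state.train_start_time = time.time()  # reset the timer
        return control
```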

@muellerzr (Contributor)

Yes I'm open to that!

On init we can set it to -99 or something equivalent, so we know it's been instantiated but training hasn't started yet.

@BramVanroy (Collaborator, Author)

@muellerzr I started working on this. I am not entirely sure how to specify the interval, though. If IntervalStrategy == TIME, do we assume that logging_steps (and the save/eval equivalents) are given in minutes? I considered allowing datetime strings, but I fear that would be a typing nightmare on the CLI, so keeping it as an int seems best. WDYT?

@muellerzr (Contributor)

Just an aside: to me this would be both, since it's better to oversave than to undersave. The time-based trigger is more of a "backup", and we keep the epoch- and step-based strategies as they are.

For the time interval, use a timedelta, similar to what torch.distributed uses for its timeout: https://docs.python.org/3/library/datetime.html#datetime.timedelta
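
For illustration, a timedelta-based interval could be compared against elapsed wall time roughly like this; the variable names are placeholders, not proposed API.

```python
from datetime import datetime, timedelta

save_time_interval = timedelta(minutes=30)  # analogous to torch.distributed's timeout argument
train_start_time = datetime.now()

# ... later, after a training step ...
if datetime.now() - train_start_time >= save_time_interval:
    print("interval elapsed: request a checkpoint")
    train_start_time = datetime.now()  # reset the timer
```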

@BramVanroy (Collaborator, Author) commented Apr 11, 2024

Ah, that's also possible, as just an extra check. What about:

  • save_every_minutes
  • log_every_minutes
  • eval_every_minutes

as additional arguments in TrainingArguments? Yeah, for the delta we can just do (datetime - datetime).total_seconds() / 60.
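
A minimal sketch of that elapsed-minutes check with the proposed argument names; how they would be wired into TrainingArguments and the flow callback is still open, so this is illustration only.

```python
from datetime import datetime

save_every_minutes = 45              # proposed TrainingArguments field
train_start_time = datetime.now()    # set in on_train_begin, stored on TrainerState

# Inside on_step_end / on_epoch_end:
elapsed_minutes = (datetime.now() - train_start_time).total_seconds() / 60
if save_every_minutes is not None and elapsed_minutes >= save_every_minutes:
    should_save = True                 # control.should_save in the real flow callback
    train_start_time = datetime.now()  # reset the timer
```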

@muellerzr (Contributor)

Yep! :D

And now it's a very simple API
