[Trainer] memory tracker metrics #10225
Conversation
Thanks for adding this functionality! One general comment I have is on the type of the `stage` argument. Since it has only four possible values from what I can see, it would be better to create an enum for those (to avoid typos and have auto-complete in an IDE).
Otherwise, it looks good!
Oh, let me make it absolutely automatic with […]. And I will collapse the two calls into one in all but […].
So, the API has been simplified to remove the need for naming the stages in the caller, and tests have been added. I'm sure we will think of further improvements down the road; please let me know if this is good for the first iteration. I'm not sure if anybody else wants to review before we merge this.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This PR introduced memory usage metrics in Trainer:

- adds `TrainerMemoryTracker` (pytorch only, no-op for tf), which records deltas of the first gpu and cpu of the main process - and records them for the `init|train|eval|test` stages - if there is no gpu it reports cpu only.
- adds `--skip_memory_metrics` to disable this new behavior - i.e. by default it'll print the memory metrics.
- adds `trainer.metrics_format`, which will intelligently reformat the metrics to do the right thing - this is only for the logger - it moves the manual rounding from the scripts into that helper method. e.g. `2285698228224.0`, which is very unreadable, will now be a nice `2128GF` (similar to `100MB`).
- updates `run_seq2seq.py` to use `trainer.metrics_format` - can replicate to other scripts in another PR.
- updates `run_seq2seq.py` to align the data, so that it's easy to read the relative numbers, e.g. allocated plus peak memory should be in the same column to make a quick read of the situation.
- uses `is_torch_cuda_available` to detect no-gpu setups in one call.
- the stage names follow the existing `train/eval/test` trio - it's very confusing, but at least it's consistent - I proposed to fix this `examples`-wide in [example scripts] inconsistency around eval vs val #10165.

Request: I beg you to allow me to restore the original refactored metrics dump logic in `run_seq2seq.py` - the current repetition doesn't help the readability and it's just dumping a dict - nothing ML/NLP specific here, there is nothing to understand there IMHO. And then it'd be easy to replicate this to other examples. Thanks. This is the original (and will need to add to it a few formatting entries I added in this PR):

transformers/examples/legacy/seq2seq/finetune_trainer.py, lines 132 to 145 at e94d63f
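For context, here is a minimal usage sketch of how the new pieces fit together. The `skip_memory_metrics` flag and `trainer.metrics_format` come from this PR; the model/dataset setup is omitted, and the way the memory keys are filtered below is an illustration, not a guaranteed naming scheme:

```python
from transformers import Trainer, TrainingArguments

# model and train_dataset are assumed to be defined elsewhere
args = TrainingArguments(
    output_dir="out",
    skip_memory_metrics=False,  # memory metrics are on unless this is set to True
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

train_result = trainer.train()
metrics = train_result.metrics

# memory entries are reported per stage; here we assume their keys contain "_mem_"
mem_metrics = {k: v for k, v in metrics.items() if "_mem_" in k}

# metrics_format() reformats the values for logging (rounding, GF/MB-style units)
print(trainer.metrics_format(mem_metrics))
```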
A picture is worth a thousand words:
gives:
To understand the memory reports:

- `alloc_delta` - is the difference in the used/allocated memory counter between the end and the start of the stage - it can be negative if a function released more memory than it allocated.
- `peaked_delta` - is any extra memory that was consumed and then freed - relative to the current allocated memory counter - it is never negative - this is the mysterious cause of OOM, since normally it doesn't register when everything fits into the memory.

Add `alloc_delta` + `peaked_delta` and you know how much memory was needed to complete that stage. But the two numbers need to be separate.

We can change the names if you'd like, but if we do, let's make sure that allocated/used shows up before peaked when alphabetically sorted - as they should be read in that order. Also it would be useful to have them of the same length so it's less noisy vertically. I was thinking perhaps to add `m` to `alloc` (giving `malloc_delta`, the same length as `peaked_delta`)? Then it becomes perfect.
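To make the two deltas concrete, here is a minimal sketch of how such per-stage numbers can be measured with the standard `torch.cuda` counters. This is only the general idea (gpu-only, single device, not re-entrant), not the Trainer's actual implementation:

```python
import torch

def measure_stage(fn):
    """Run fn() and return (alloc_delta, peaked_delta) in bytes for gpu 0."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # start peak tracking from "now"
    mem_start = torch.cuda.memory_allocated()

    fn()

    mem_end = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()

    alloc_delta = mem_end - mem_start      # can be negative if memory was released
    peaked_delta = max(0, peak - mem_end)  # extra memory consumed and then freed
    return alloc_delta, peaked_delta
```

For example, `measure_stage(lambda: model(batch))` would give rough numbers for a single forward pass.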
Logic behind `init`: `__init__` can consume a lot of memory, so it's important that we trace it too, but since any of the stages can be skipped, I basically push it into the metrics of whichever stage gets to update metrics first, so it gets tacked on to that group of metrics. In the above example it happens to be `train`.
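A tiny sketch of what that attribution rule could look like (the names here are invented for illustration; the real `TrainerMemoryTracker` internals may differ):

```python
# The __init__ deltas are stashed and merged into whichever stage reports first.
init_deltas = {"init_mem_gpu_alloc_delta": 123_456_789}  # captured during __init__
init_reported = False

def update_stage_metrics(stage: str, metrics: dict, deltas: dict) -> None:
    global init_reported
    for key, value in deltas.items():
        metrics[f"{stage}_{key}"] = value
    if not init_reported:  # the first stage to report also carries the init deltas
        metrics.update(init_deltas)
        init_reported = True
```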
Logic behind nested calls: `torch.cuda.max_memory_allocated` is a single counter, so if it gets reset by a nested eval call, train will report incorrect info. One day pytorch will fix this issue (support for multiple torch.cuda.max_memory_allocated() counters pytorch/pytorch#16266) and then it will be possible to be re-entrant; for now we will only track the outer-level `train`/`evaluation`/`predict` functions.
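One simple way to implement that "outer level only" rule is to guard the tracker with the currently tracked stage, so a nested call can't reset the single peak counter. A sketch of the idea (not necessarily the Trainer's actual internals, and the metric key names are assumptions):

```python
import torch

class OuterOnlyMemoryTracker:
    """Track gpu memory deltas only for the outermost stage (illustrative sketch)."""

    def __init__(self):
        self.cur_stage = None

    def start(self, stage: str) -> None:
        if self.cur_stage is not None:   # nested call (e.g. eval inside train):
            return                       # don't reset the single peak counter
        self.cur_stage = stage
        torch.cuda.reset_peak_memory_stats()
        self.mem_start = torch.cuda.memory_allocated()

    def stop(self, stage: str) -> dict:
        if stage != self.cur_stage:      # ignore the end of a nested stage
            return {}
        mem_end = torch.cuda.memory_allocated()
        peak = torch.cuda.max_memory_allocated()
        self.cur_stage = None
        return {
            f"{stage}_mem_gpu_alloc_delta": mem_end - self.mem_start,
            f"{stage}_mem_gpu_peaked_delta": max(0, peak - mem_end),
        }
```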
After this addition we can already profile/detect regressions for specific training stages. But this doesn't give us the full picture, as there are other allocations outside of the Trainer - i.e. in the user's code. It's a start.
Down the road I may code a different version, based on pynvml, which gives somewhat different numbers and has its own complications. But it gives you the exact gpu memory usage, so you know exactly how much memory is used or left; PyTorch, on the other hand, only reports its internal allocations.
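For reference, querying the real per-device usage with pynvml looks roughly like this (a sketch; device index 0 is assumed):

```python
# Reads the actual gpu memory usage via NVML (pip install pynvml).
# Unlike the torch.cuda counters, this includes the CUDA context and other processes.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first gpu assumed
info = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"used:  {info.used  >> 20} MB")
print(f"free:  {info.free  >> 20} MB")
print(f"total: {info.total >> 20} MB")

pynvml.nvmlShutdown()
```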
@patrickvonplaten, this feature should already give us a partial way to track memory regressions. So this could be the low-hanging fruit you and I were discussing.
It should also be possible to extend the tracker to TF, but I don't know anything about TF.
@sgugger, @patil-suraj, @LysandreJik, @patrickvonplaten