
[Trainer] memory tracker metrics #10225

Merged — 15 commits merged from mem-metrics-trainer into huggingface:master on Feb 18, 2021
Conversation

@stas00 (Contributor) commented on Feb 17, 2021

This PR introduces memory usage metrics in the Trainer:

  • adds TrainerMemoryTracker (pytorch only, no-op for tf), which records memory deltas for the first gpu and for the cpu of the main process, for the init|train|eval|test stages - if there is no gpu it reports cpu only.
  • adds --skip_memory_metrics to disable this new behavior - i.e. by default the memory metrics are reported.
  • adds trainer.metrics_format, which intelligently reformats the metrics to do the right thing for the logger - this moves the manual rounding from the scripts into that helper method (see the sketch after this list).
  • formats total_flos as a GF number: 2285698228224.0 is very hard to read, whereas 2128GF is nice and compact (similar to 100MB).
  • as a sample, changes run_seq2seq.py to use trainer.metrics_format - this can be replicated to the other scripts in another PR.
  • changes the metrics logger in run_seq2seq.py to align the data, so that related numbers are easy to compare - e.g. allocated and peaked memory end up in the same column for a quick read of the situation.
  • adds a new file_utils helper, is_torch_cuda_available, to detect no-gpu setups in one call.
  • adds a test.
  • consistently uses the strange train/eval/test trio - it's very confusing, but at least it's consistent - I proposed to fix this examples-wide in [example scripts] inconsistency around eval vs val #10165.
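To make the trainer.metrics_format idea concrete, here is a minimal sketch of the kind of reformatting described above. The helper name, the bytes assumption for memory deltas, and the 2**30 divisor for GF are illustrative assumptions rather than the actual implementation (though 2285698228224.0 does come out as 2128GF with that divisor):

def metrics_format_sketch(metrics):
    # illustrative only: render memory deltas as MB, total_flos as GF, round other floats
    formatted = {}
    for k, v in metrics.items():
        if "_mem_" in k:
            formatted[k] = f"{int(v) >> 20}MB"   # bytes -> MB
        elif k == "total_flos":
            formatted[k] = f"{int(v) >> 30}GF"   # flops -> gigaflops
        elif isinstance(v, float):
            formatted[k] = round(v, 4)
        else:
            formatted[k] = v
    return formatted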

Request: I beg you to allow me to restore the original refactored metrics dump logic in run_seq2seq.py. The current repetition doesn't help readability, and it's just dumping a dict - there is nothing ML/NLP-specific there and, IMHO, nothing to understand. It would then be easy to replicate this to the other examples. Thanks. This is the original (it will need a few of the formatting entries I added in this PR):

def handle_metrics(split, metrics, output_dir):
    """
    Log and save metrics

    Args:
    - split: one of train, val, test
    - metrics: metrics dict
    - output_dir: where to save the metrics
    """
    logger.info(f"***** {split} metrics *****")
    for key in sorted(metrics.keys()):
        logger.info(f"  {key} = {metrics[key]}")
    save_json(metrics, os.path.join(output_dir, f"{split}_results.json"))

A picture is worth a thousand words:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 python examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --do_train --do_predict --evaluation_strategy=steps  --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 --max_train_samples 100 --max_val_samples 100 --max_test_samples 100  --dataset_name wmt16-en-ro-pre-processed --source_prefix "translate English to Romanian: "

gives:

02/16/2021 17:06:39 - INFO - __main__ -   ***** train metrics *****
02/16/2021 17:06:39 - INFO - __main__ -     epoch                      =    1.0
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_alloc_delta   =    2MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_alloc_delta   =  230MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     total_flos                 = 2128GF
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_alloc_delta  =   55MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_alloc_delta  =  692MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_peaked_delta =  661MB
02/16/2021 17:06:39 - INFO - __main__ -     train_runtime              = 2.3114
02/16/2021 17:06:39 - INFO - __main__ -     train_samples              =    100
02/16/2021 17:06:39 - INFO - __main__ -     train_samples_per_second   =  3.028

02/16/2021 17:06:43 - INFO - __main__ -   ***** val metrics *****
02/16/2021 17:13:05 - INFO - __main__ -     epoch                     =     1.0
02/16/2021 17:13:05 - INFO - __main__ -     eval_bleu                 = 24.6502
02/16/2021 17:13:05 - INFO - __main__ -     eval_gen_len              =    32.9
02/16/2021 17:13:05 - INFO - __main__ -     eval_loss                 =  3.7533
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_peaked_delta =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_peaked_delta =   510MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_runtime              =  3.9266
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples              =     100
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples_per_second   =  25.467

02/16/2021 17:06:48 - INFO - __main__ -     ***** test metrics *****
02/16/2021 17:06:48 - INFO - __main__ -     test_bleu                 = 27.146
02/16/2021 17:06:48 - INFO - __main__ -     test_gen_len              =  41.37
02/16/2021 17:06:48 - INFO - __main__ -     test_loss                 = 3.6682
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_peaked_delta =  645MB
02/16/2021 17:06:48 - INFO - __main__ -     test_runtime              = 5.1136
02/16/2021 17:06:48 - INFO - __main__ -     test_samples              =    100
02/16/2021 17:06:48 - INFO - __main__ -     test_samples_per_second   = 19.556

To understand the memory reports:

  • alloc_delta - the difference in the used/allocated memory counter between the end and the start of the stage - it can be negative if a function released more memory than it allocated.
  • peaked_delta - any extra memory that was consumed and then freed, relative to the allocated memory counter at the end of the stage - it is never negative. This is the mysterious cause of OOM, since it normally goes unnoticed when everything fits into memory.
  • so when you look at the metrics of any stage, add up alloc_delta + peaked_delta and you know how much memory was needed to complete that stage - but the two numbers need to stay separate (see the sketch right after this list).
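As a concrete illustration of the two deltas, here is a rough sketch of how they can be computed from torch.cuda's counters (gpu side only, ignoring the init folding and nested-call caveats discussed below); it is not the actual TrainerMemoryTracker code:

import torch

def gpu_stage_deltas_sketch(stage_fn):
    # measure one stage: run it and compute alloc_delta and peaked_delta as defined above
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()
    stage_fn()                                    # e.g. the train, eval or test stage
    mem_after = torch.cuda.memory_allocated()
    mem_peak = torch.cuda.max_memory_allocated()
    alloc_delta = mem_after - mem_before          # may be negative if memory was released
    peaked_delta = max(mem_peak - mem_after, 0)   # transient memory that was freed again
    return alloc_delta, peaked_delta

For example, reading the log above: the train stage needed roughly 692MB + 661MB ≈ 1353MB of GPU memory on top of what was already allocated.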

We can change the names if you'd like, but if we do, let's make sure that allocated/used still sorts before peaked alphabetically - they should be read in that order.

Also, it would be useful to make the names the same length so the output is less noisy vertically. I was thinking of perhaps adding an m to alloc? Then it lines up perfectly:

test_mem_cpu_malloc_delta =    0MB
test_mem_cpu_peaked_delta =    0MB

Logic behind init:

  • since Trainer's __init__ can consume a lot of memory, it's important that we trace it too, but since any of the stages can be skipped, I basically push the init deltas into the metrics of whichever stage gets to update its metrics first, so they get tacked onto that group of metrics. In the example above that happens to be train (a tiny sketch of this folding follows).
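A tiny sketch of that folding, with hypothetical names - the real tracker keeps more state than a plain dict:

class InitFoldingSketch:
    def __init__(self):
        self.init_metrics = {}      # filled with init_mem_* deltas during Trainer.__init__
        self.init_reported = False

    def update_metrics(self, stage_metrics):
        # whichever stage reports first also gets the pending init numbers
        if not self.init_reported:
            stage_metrics.update(self.init_metrics)
            self.init_reported = True
        # ... this stage's own alloc/peaked deltas would be added here as well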

Logic behind nested calls:

  • since eval calls may be intermixed with train calls, we can't handle nested invocations: torch.cuda.max_memory_allocated is a single counter, so if it gets reset by a nested eval call, train will report incorrect info. One day pytorch will fix this issue (support for multiple torch.cuda.max_memory_allocated() counters, pytorch/pytorch#16266) and then it will be possible to be re-entrant; for now we will only track the outer-level train/evaluate/predict functions (see the guard sketch below).
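A sketch of the kind of guard this implies - only the outermost stage resets and reads the shared peak counter (class and method names are illustrative, not the merged code):

import torch

class StageGuardSketch:
    def __init__(self):
        self.cur_stage = None

    def start(self, stage):
        if self.cur_stage is not None and self.cur_stage != stage:
            return  # nested call (e.g. eval inside train): leave the shared counter alone
        self.cur_stage = stage
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

    def stop(self, stage):
        if self.cur_stage != stage:
            return  # closing a nested call we chose not to track
        self.cur_stage = None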

After this addition we can already profile and detect regressions for specific training stages. It doesn't give us the full picture, though, since there are other allocations outside of the Trainer - i.e. in the user's code - but it's a start.

Down the road I may code a different version based on pynvml, which gives somewhat different numbers and has its own complications, but it reports the exact gpu memory usage, so you know exactly how much memory is used or left. PyTorch, on the other hand, only reports its own internal allocations.

@patrickvonplaten, this feature already gives us a partial way to track memory regressions, so it could be the low-hanging fruit you and I were discussing.

It should also be possible to extend the tracker to TF, but I don't know anything about TF.

@sgugger, @patil-suraj, @LysandreJik, @patrickvonplaten

@sgugger (Collaborator) left a comment

Thanks for adding this functionality! One general comment I have is on the type of the stage argument. Since it has only four possible values from what I can see, it would be better to create an enum for those (to avoid typos and have auto-complete in an IDE).

Otherwise, it looks good!

@stas00 (Contributor, Author) commented on Feb 18, 2021

Thanks for adding this functionality! One general comment I have is on the type of the stage argument. Since it has only four possible values from what I can see, it would be better to create an enum for those (to avoid typos and have auto-complete in an IDE).

Oh, let me make it absolutely automatic with inspect so it won't need a stage argument at all.

And I will collapse the two calls into one in all but __init__, so it'll be less noisy.
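A rough sketch of the inspect-based stage detection mentioned above; the frame walk and the name mapping are assumptions for illustration, not the merged code:

import inspect

def derive_stage_sketch():
    # map the calling function's name onto a stage label,
    # so the caller never has to pass a stage argument explicitly
    caller = inspect.currentframe().f_back.f_code.co_name
    stages = {"__init__": "init", "train": "train", "evaluate": "eval", "predict": "test"}
    return stages.get(caller, caller)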

@stas00 (Contributor, Author) commented on Feb 18, 2021

So, the API has been simplified to remove the need for naming the stages in the caller, and tests have been added.

I'm sure we will think of further improvements down the road, please let me know if this is good for the first iteration.

I'm not sure if anybody else wants to review before we merge this.

stas00 and others added 3 commits February 18, 2021 08:08
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@stas00 stas00 merged commit 97e688b into huggingface:master Feb 18, 2021
@stas00 stas00 deleted the mem-metrics-trainer branch February 18, 2021 17:27