
[Trainer] memory tracker metrics #10225

Merged — 15 commits merged from mem-metrics-trainer into huggingface:master on Feb 18, 2021
Conversation

@stas00 (Contributor) commented on Feb 17, 2021

This PR introduces memory usage metrics in the Trainer:

  • adds TrainerMemoryTracker (pytorch only, no-op for tf), which records memory deltas for the first gpu and for the cpu of the main process, for the init|train|eval|test stages - if there is no gpu it reports cpu only.
  • adds --skip_memory_metrics to disable this new behavior - i.e. by default the memory metrics are reported.
  • adds trainer.metrics_format, which intelligently reformats the metrics to do the right thing for the logger - this moves the manual rounding from the scripts into that helper method (see the sketch after this list).
  • formats total_flos as a GF number: 2285698228224.0 is very hard to read, whereas 2128GF is nice and compact (similar to 100MB).
  • as a sample, changes run_seq2seq.py to use trainer.metrics_format - this can be replicated to the other scripts in another PR.
  • changes the metrics logger in run_seq2seq.py to align the data, so that related numbers are easy to compare - e.g. allocated and peaked memory end up in the same column for a quick read of the situation.
  • adds a new file_utils helper, is_torch_cuda_available, to detect no-gpu setups in one call.
  • adds a test.
  • consistently uses the strange train/eval/test trio - it's very confusing, but at least it's consistent - I proposed to fix this examples-wide in [example scripts] inconsistency around eval vs val #10165.
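To make the trainer.metrics_format idea concrete, here is a minimal sketch of the kind of reformatting described above. The helper name, the bytes assumption for memory deltas, and the 2**30 divisor for GF are illustrative assumptions rather than the actual implementation (though 2285698228224.0 does come out as 2128GF with that divisor):

def metrics_format_sketch(metrics):
    # illustrative only: render memory deltas as MB, total_flos as GF, round other floats
    formatted = {}
    for k, v in metrics.items():
        if "_mem_" in k:
            formatted[k] = f"{int(v) >> 20}MB"   # bytes -> MB
        elif k == "total_flos":
            formatted[k] = f"{int(v) >> 30}GF"   # flops -> gigaflops
        elif isinstance(v, float):
            formatted[k] = round(v, 4)
        else:
            formatted[k] = v
    return formatted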

Request: I beg you to allow me to restore the original refactored metrics dump logic in run_seq2seq.py. The current repetition doesn't help readability, and it's just dumping a dict - there is nothing ML/NLP-specific there and, IMHO, nothing to understand. It would then be easy to replicate this to the other examples. Thanks. This is the original (it will need a few of the formatting entries I added in this PR):

def handle_metrics(split, metrics, output_dir):
    """
    Log and save metrics

    Args:
    - split: one of train, val, test
    - metrics: metrics dict
    - output_dir: where to save the metrics
    """
    logger.info(f"***** {split} metrics *****")
    for key in sorted(metrics.keys()):
        logger.info(f"  {key} = {metrics[key]}")
    save_json(metrics, os.path.join(output_dir, f"{split}_results.json"))

A picture is worth a thousand words:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0 python examples/seq2seq/run_seq2seq.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --do_train --do_predict --evaluation_strategy=steps  --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 --max_train_samples 100 --max_val_samples 100 --max_test_samples 100  --dataset_name wmt16-en-ro-pre-processed --source_prefix "translate English to Romanian: "

gives:

02/16/2021 17:06:39 - INFO - __main__ -   ***** train metrics *****
02/16/2021 17:06:39 - INFO - __main__ -     epoch                      =    1.0
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_alloc_delta   =    2MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_cpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_alloc_delta   =  230MB
02/16/2021 17:06:39 - INFO - __main__ -     init_mem_gpu_peaked_delta  =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     total_flos                 = 2128GF
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_alloc_delta  =   55MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_alloc_delta  =  692MB
02/16/2021 17:06:39 - INFO - __main__ -     train_mem_gpu_peaked_delta =  661MB
02/16/2021 17:06:39 - INFO - __main__ -     train_runtime              = 2.3114
02/16/2021 17:06:39 - INFO - __main__ -     train_samples              =    100
02/16/2021 17:06:39 - INFO - __main__ -     train_samples_per_second   =  3.028

02/16/2021 17:06:43 - INFO - __main__ -   ***** val metrics *****
02/16/2021 17:13:05 - INFO - __main__ -     epoch                     =     1.0
02/16/2021 17:13:05 - INFO - __main__ -     eval_bleu                 = 24.6502
02/16/2021 17:13:05 - INFO - __main__ -     eval_gen_len              =    32.9
02/16/2021 17:13:05 - INFO - __main__ -     eval_loss                 =  3.7533
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_cpu_peaked_delta =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_alloc_delta  =     0MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_mem_gpu_peaked_delta =   510MB
02/16/2021 17:13:05 - INFO - __main__ -     eval_runtime              =  3.9266
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples              =     100
02/16/2021 17:13:05 - INFO - __main__ -     eval_samples_per_second   =  25.467

02/16/2021 17:06:48 - INFO - __main__ -     ***** test metrics *****
02/16/2021 17:06:48 - INFO - __main__ -     test_bleu                 = 27.146
02/16/2021 17:06:48 - INFO - __main__ -     test_gen_len              =  41.37
02/16/2021 17:06:48 - INFO - __main__ -     test_loss                 = 3.6682
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_cpu_peaked_delta =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_alloc_delta  =    0MB
02/16/2021 17:06:48 - INFO - __main__ -     test_mem_gpu_peaked_delta =  645MB
02/16/2021 17:06:48 - INFO - __main__ -     test_runtime              = 5.1136
02/16/2021 17:06:48 - INFO - __main__ -     test_samples              =    100
02/16/2021 17:06:48 - INFO - __main__ -     test_samples_per_second   = 19.556

To understand the memory reports:

  • alloc_delta - the difference in the used/allocated memory counter between the end and the start of the stage - it can be negative if a function released more memory than it allocated.
  • peaked_delta - any extra memory that was consumed and then freed, relative to the allocated memory counter at the end of the stage - it is never negative. This is the mysterious cause of OOM, since it normally goes unnoticed when everything fits into memory.
  • so when you look at the metrics of any stage, add up alloc_delta + peaked_delta and you know how much memory was needed to complete that stage - but the two numbers need to stay separate (see the sketch right after this list).
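As a concrete illustration of the two deltas, here is a rough sketch of how they can be computed from torch.cuda's counters (gpu side only, ignoring the init folding and nested-call caveats discussed below); it is not the actual TrainerMemoryTracker code:

import torch

def gpu_stage_deltas_sketch(stage_fn):
    # measure one stage: run it and compute alloc_delta and peaked_delta as defined above
    torch.cuda.reset_peak_memory_stats()
    mem_before = torch.cuda.memory_allocated()
    stage_fn()                                    # e.g. the train, eval or test stage
    mem_after = torch.cuda.memory_allocated()
    mem_peak = torch.cuda.max_memory_allocated()
    alloc_delta = mem_after - mem_before          # may be negative if memory was released
    peaked_delta = max(mem_peak - mem_after, 0)   # transient memory that was freed again
    return alloc_delta, peaked_delta

For example, reading the log above: the train stage needed roughly 692MB + 661MB ≈ 1353MB of GPU memory on top of what was already allocated.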

We can change the names if you'd like, but if we do, let's make sure that allocated/used still sorts before peaked alphabetically - they should be read in that order.

Also, it would be useful to make the names the same length so the output is less noisy vertically. I was thinking of perhaps adding an m to alloc? Then it lines up perfectly:

test_mem_cpu_malloc_delta =    0MB
test_mem_cpu_peaked_delta =    0MB

Logic behind init:

  • since Trainer's __init__ can consume a lot of memory, it's important that we trace it too, but since any of the stages can be skipped, I basically push the init deltas into the metrics of whichever stage gets to update its metrics first, so they get tacked onto that group of metrics. In the example above that happens to be train (a tiny sketch of this folding follows).
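A tiny sketch of that folding, with hypothetical names - the real tracker keeps more state than a plain dict:

class InitFoldingSketch:
    def __init__(self):
        self.init_metrics = {}      # filled with init_mem_* deltas during Trainer.__init__
        self.init_reported = False

    def update_metrics(self, stage_metrics):
        # whichever stage reports first also gets the pending init numbers
        if not self.init_reported:
            stage_metrics.update(self.init_metrics)
            self.init_reported = True
        # ... this stage's own alloc/peaked deltas would be added here as well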

Logic behind nested calls:

  • since eval calls may be intermixed with train calls, we can't handle nested invocations: torch.cuda.max_memory_allocated is a single counter, so if it gets reset by a nested eval call, train will report incorrect info. One day pytorch will fix this issue (support for multiple torch.cuda.max_memory_allocated() counters, pytorch/pytorch#16266) and then it will be possible to be re-entrant; for now we will only track the outer-level train/evaluate/predict functions (see the guard sketch below).
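A sketch of the kind of guard this implies - only the outermost stage resets and reads the shared peak counter (class and method names are illustrative, not the merged code):

import torch

class StageGuardSketch:
    def __init__(self):
        self.cur_stage = None

    def start(self, stage):
        if self.cur_stage is not None and self.cur_stage != stage:
            return  # nested call (e.g. eval inside train): leave the shared counter alone
        self.cur_stage = stage
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

    def stop(self, stage):
        if self.cur_stage != stage:
            return  # closing a nested call we chose not to track
        self.cur_stage = None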

After this addition we can already profile and detect regressions for specific training stages. It doesn't give us the full picture, though, since there are other allocations outside of the Trainer - i.e. in the user's code - but it's a start.

Down the road I may code a different version based on pynvml, which gives somewhat different numbers and has its own complications, but it reports the exact gpu memory usage, so you know exactly how much memory is used or left. PyTorch, on the other hand, only reports its own internal allocations.

@patrickvonplaten, this feature already gives us a partial way to track memory regressions, so it could be the low-hanging fruit you and I were discussing.

It should also be possible to extend the tracker to TF, but I don't know anything about TF.

@sgugger, @patil-suraj, @LysandreJik, @patrickvonplaten

@sgugger (Collaborator) left a comment

Thanks for adding this functionality! One general comment I have is on the type of the stage argument. Since it has only four possible values from what I can see, it would be better to create an enum for those (to avoid typos and have auto-complete in an IDE).

Otherwise, it looks good!

@stas00 (Contributor, Author) commented on Feb 18, 2021

Thanks for adding this functionality! One general comment I have is on the type of the stage argument. Since it has only four possible values from what I can see, it would be better to create an enum for those (to avoid typos and have auto-complete in an IDE).

Oh, let me make it absolutely automatic with inspect so it won't need a stage argument at all.

And I will collapse the two calls into one in all but __init__, so it'll be less noisy.
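A rough sketch of the inspect-based stage detection mentioned above; the frame walk and the name mapping are assumptions for illustration, not the merged code:

import inspect

def derive_stage_sketch():
    # map the calling function's name onto a stage label,
    # so the caller never has to pass a stage argument explicitly
    caller = inspect.currentframe().f_back.f_code.co_name
    stages = {"__init__": "init", "train": "train", "evaluate": "eval", "predict": "test"}
    return stages.get(caller, caller)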

@stas00 (Contributor, Author) commented on Feb 18, 2021

So, the API has been simplified to remove the need for naming the stages in the caller, and tests have been added.

I'm sure we will think of further improvements down the road, please let me know if this is good for the first iteration.

I'm not sure if anybody else wants to review before we merge this.

stas00 and others added 3 commits February 18, 2021 08:08
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@stas00 stas00 merged commit 97e688b into huggingface:master Feb 18, 2021
@stas00 stas00 deleted the mem-metrics-trainer branch February 18, 2021 17:27