dvclive callback: warn instead of fail when logging non-scalars #27608

Merged: 2 commits into huggingface:main on Nov 21, 2023

Conversation

dberenbaum
Contributor

What does this PR do?

Fixes #27352 (comment). This will warn instead of fail when trying to log non-scalars as metrics.
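
For illustration, the proposed behavior looks roughly like the sketch below. This is a hedged sketch, not the exact diff in this PR: the helper name and the warning text are placeholders, and only the pattern (skip non-scalar values with a warning instead of raising inside the dvclive callback) comes from the description above.

import logging

logger = logging.getLogger(__name__)

def _log_scalars_or_warn(live, logs):
    # Hypothetical helper sketching the pattern described above: scalar values
    # are passed to dvclive's log_metric, anything else is skipped with a
    # warning instead of raising inside the callback.
    for name, value in logs.items():
        if isinstance(value, (int, float)):
            live.log_metric(name, value)
        else:
            logger.warning(
                "Skipping %r: value %r of type %s is not a scalar and will not be logged as a metric.",
                name, value, type(value),
            )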

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerz Could you please take a look?

@dberenbaum
Contributor (Author)

@muellerz This makes the tests pass, but I'm not sure if it's intended that the test here logs the learning rate as a list rather than as a scalar (which will fail under several of the existing loggers, but only with a warning like in this PR).

logs["learning_rate"] = self.lr_scheduler._last_lr

self.lr_scheduler._last_lr is a list. Should a scalar value be extracted like self.lr_scheduler._last_lr[0]? That's the value being tested later as ["learning_rate"][0]:

for i, log in enumerate(logs[:-1]):  # Compare learning rate to next epoch's
    loss = log["eval_loss"]
    just_decreased = False
    if loss > best_loss:
        bad_epochs += 1
        if bad_epochs > patience:
            self.assertLess(logs[i + 1]["learning_rate"][0], log["learning_rate"][0])
            just_decreased = True
            bad_epochs = 0
    else:
        best_loss = loss
        bad_epochs = 0
    if not just_decreased:
        self.assertEqual(logs[i + 1]["learning_rate"][0], log["learning_rate"][0])

Everywhere else in the codebase, it looks like a scalar is extracted:

def _get_learning_rate(self):
    if self.is_deepspeed_enabled:
        # with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
        # not run for the first few dozen steps while loss scale is too large, and thus during
        # that time `get_last_lr` will fail if called during that warm up stage, so work around it:
        try:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        except AssertionError as e:
            if "need to call step" in str(e):
                logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0")
                last_lr = 0
            else:
                raise
    else:
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            last_lr = self.optimizer.param_groups[0]["lr"]
        else:
            last_lr = self.lr_scheduler.get_last_lr()[0]
        if torch.is_tensor(last_lr):
            last_lr = last_lr.item()
    return last_lr

tensorboard_logs = {"loss": loss, "rate": lr_scheduler.get_last_lr()[-1]}

@muellerzr
Contributor

In the future it's @muellerzr @dberenbaum, don't want to be pinging random people :)

Yes, let's go with [0] as the one being extracted/the scalar.
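
In other words, the agreed-upon change to the test quoted above would look roughly like this (a sketch based on the snippet earlier in this thread, not the exact commit):

logs["learning_rate"] = self.lr_scheduler._last_lr[0]  # log the scalar, not the full list

with the later assertions then presumably comparing log["learning_rate"] directly instead of log["learning_rate"][0].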

Contributor

@muellerzr left a comment

Thanks for the fix! If we can log just the scalar as part of this, that would be great too. Otherwise this PR LGTM. Appreciate the quick fix :)

@dberenbaum
Contributor (Author)

@muellerzr Apologies to you and the other person who was pinged here, Zach! Added the change to the test in the last commit. The current test failures look unrelated.

Collaborator

@ArthurZucker left a comment

Thanks 😉

@ArthurZucker ArthurZucker merged commit 8eb9e29 into huggingface:main Nov 21, 2023
21 checks passed