
EMA Bug #87

Closed
CiaoHe opened this issue May 12, 2022 · 5 comments
CiaoHe commented May 12, 2022

Hi Phil,

This morning I tried to run the decoder training part. I decided to use DecoderTrainer, but found an issue with the EMA update.

After using decoder_trainer to do sampling, the next training forward pass throws a RuntimeError:

Traceback (most recent call last):
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 321, in <module>    main()
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 318, in main
    train(decoder_trainer, train_dl, val_dl, train_config, device)
  File "/home/caohe/DPMs/dalle2/train_decoder.py", line 195, in train
    trainer.update(unet_number)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 288, in update
    self.ema_unets[index].update()
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 119, in update
    self.update_moving_average(self.ema_model, self.online_model)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 129, in update_moving_average
    ema_param.data = calculate_ema(self.beta, old_weight, up_weight)
  File "/home/caohe/DPMs/dalle2/dalle2_pytorch/train.py", line 125, in calculate_ema
    return old * beta + new * (1 - beta)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and CPU!

def update(self):
    self.step += 1
    if self.step <= self.update_after_step or (self.step % self.update_every) != 0:
        return
    if not self.initted:
        # on first update, copy the online weights into the EMA model
        self.ema_model.load_state_dict(self.online_model.state_dict())
        self.initted.data.copy_(torch.Tensor([True]))
    self.update_moving_average(self.ema_model, self.online_model)
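For reference, the arithmetic at the heart of calculate_ema is just linear interpolation between the old EMA weight and the new online weight. A minimal pure-Python sketch (no torch; names are illustrative):

```python
# Sketch of the EMA interpolation from calculate_ema:
# new_ema = old * beta + new * (1 - beta)

def calculate_ema(beta, old, new):
    return old * beta + new * (1 - beta)

# With beta close to 1, the EMA drifts only slowly toward the online weight.
ema = 1.0
for online in [0.0, 0.0, 0.0]:
    ema = calculate_ema(0.99, ema, online)
print(round(ema, 6))  # 0.970299, i.e. 0.99 ** 3
```

Note this formula assumes both operands live on the same device, which is exactly where the RuntimeError above comes from.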

I checked up_weight.device (online model) and old_weight.device (EMA model): the online model is on cuda:0 but the EMA model is on the CPU. It's really weird; I debugged for a long time and I think it's caused by the DecoderTrainer.sample() process. When swapping between the EMA and online unets, something goes wrong with the device placement.

@torch.no_grad()
def sample(self, *args, **kwargs):
    if self.use_ema:
        trainable_unets = self.decoder.unets
        self.decoder.unets = self.unets  # swap in exponential moving averaged unets for sampling

    output = self.decoder.sample(*args, **kwargs)

    if self.use_ema:
        self.decoder.unets = trainable_unets  # restore original training unets

    return output

The way I fixed it was to add self.ema_model = self.ema_model.to(next(self.online_model.parameters()).device) before calling self.update_moving_average(self.ema_model, self.online_model) (pretty naive, haha).
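The guard can be illustrated without torch or a GPU. Below is a minimal pure-Python simulation where Param and Model are hypothetical stand-ins for tensors and modules carrying a .device attribute, and to() mimics tensor movement:

```python
# Hypothetical stand-ins for tensors/modules with a .device attribute,
# illustrating the device-alignment guard before the EMA update.
class Param:
    def __init__(self, value, device):
        self.value, self.device = value, device

class Model:
    def __init__(self, device):
        self.params = [Param(1.0, device)]
    def parameters(self):
        return iter(self.params)
    def to(self, device):
        for p in self.params:
            p.device = device
        return self

online_model = Model('cuda:0')  # online weights live on the GPU
ema_model = Model('cpu')        # EMA copy was left on the CPU after sampling

# The fix: move the EMA model to the online model's device before updating.
ema_model = ema_model.to(next(online_model.parameters()).device)
assert all(p.device == 'cuda:0' for p in ema_model.parameters())
```

In real torch code the same one-liner works because Module.to() moves every parameter in place.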

Hope to hear your solution

Enjoy!

lucidrains (Owner) commented May 12, 2022

@CiaoHe ohh yes, you are correct, thank you! i think this should fix it 924455d

CiaoHe commented May 12, 2022

btw, how do you usually debug/test when adding new functions or starting a new repo? My efficiency is quite low (either I run from the command line and wait for an error, or I copy code into a Jupyter notebook and test again and again...)

lucidrains (Owner) commented May 12, 2022

@CiaoHe i've come full circle and just use a simple test.py in the root directory + print lol

CiaoHe commented May 12, 2022

> @CiaoHe i've come full circle and just use a simple test.py in the root directory + print lol

@lucidrains lol. But when moving to a cluster to train, things sometimes get out of control (I hate bugs)

lucidrains (Owner) commented May 12, 2022

🪰 🪱 🐞

CiaoHe closed this as completed May 12, 2022