
[Bug] MMSeparateDistributedDataParallel skip init_weights #1042

Closed
makecent opened this issue Apr 3, 2023 · 5 comments · Fixed by #1045
Labels
bug Something isn't working

Comments


makecent commented Apr 3, 2023

Prerequisite

Reproduces the problem - code sample

I encountered this problem when running my own custom project based on mmengine; the project is too complicated to present here.

Additional information

MMSeparateDistributedDataParallel is a model wrapper of model wrappers: the sub-modules of its module may themselves be wrapped in MMDistributedDataParallel.

sub_module = MMDistributedDataParallel(
    module=sub_module.to(device),
    broadcast_buffers=broadcast_buffers,
    find_unused_parameters=find_unused_parameters,
    **kwargs)
module._modules[name] = sub_module

When initializing the weights of a wrapped model, the runner initializes the weights of runner.model.module:

def _init_model_weights(self) -> None:
    """Initialize the model weights if the model has
    :meth:`init_weights`"""
    model = self.model.module if is_model_wrapper(
        self.model) else self.model
    if hasattr(model, 'init_weights'):
        model.init_weights()
        # sync params and buffers
        for name, params in model.state_dict().items():
            broadcast(params)

The above code prevents the model's init_weights functions from running, because in this case the children of runner.model.module are of type MMDistributedDataParallel, which does NOT have an init_weights function.
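
To make the failure concrete, here is a minimal self-contained sketch (plain PyTorch rather than the mmengine classes; Backbone, Detector, and Wrapper are hypothetical names) showing how replacing a child module with a wrapper that lacks init_weights silently skips the child's initialization:

import torch.nn as nn


class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

    def init_weights(self):
        nn.init.constant_(self.conv.weight, 1.0)


class Detector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = Backbone()

    def init_weights(self):
        # Recurse into children: a child is only initialized
        # if it exposes init_weights itself.
        for child in self.children():
            if hasattr(child, 'init_weights'):
                child.init_weights()


class Wrapper(nn.Module):  # stand-in for MMDistributedDataParallel
    def __init__(self, module):
        super().__init__()
        self.module = module


model = Detector()
# Emulate what MMSeparateDistributedDataParallel does: replace the child
# with a wrapper that has no init_weights of its own.
model._modules['backbone'] = Wrapper(model.backbone)
model.init_weights()  # Backbone.init_weights is now never reached

Before the replacement, Detector.init_weights would reach Backbone.init_weights; after it, the hasattr check on the wrapper fails and the constant initialization never runs.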

makecent added the bug label Apr 3, 2023
zhouzaida (Member)

Hi, the module is defined in

Therefore, if you implement the init_weights method in the module, the runner can call it as expected.
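
For reference, a minimal sketch of that suggestion (MyModel is a hypothetical name): since _init_model_weights above unwraps runner.model via .module and then looks for init_weights on the result, defining the method on the top-level module is enough for the runner to invoke it.

import torch.nn as nn


class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def init_weights(self):
        # Custom initialization the runner is expected to trigger.
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)


model = MyModel()
if hasattr(model, 'init_weights'):  # mirrors the runner's check above
    model.init_weights()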

makecent (Author) commented Apr 3, 2023

@zhouzaida But the sub-modules in self.module are replaced by the model wrapper MMDistributedDataParallel:

for name, sub_module in module._modules.items():
    # module without parameters.
    if next(sub_module.parameters(), None) is None:
        sub_module = sub_module.to(device)
    elif all(not p.requires_grad for p in sub_module.parameters()):
        sub_module = sub_module.to(device)
    else:
        sub_module = MMDistributedDataParallel(
            module=sub_module.to(device),
            broadcast_buffers=broadcast_buffers,
            find_unused_parameters=find_unused_parameters,
            **kwargs)
    module._modules[name] = sub_module

Therefore, MMSeparateDistributedDataParallel.module does have an init_weights function, but its sub-modules (MMDistributedDataParallel) do not, which makes the init_weights functions of the modules wrapped in MMDistributedDataParallel inaccessible during initialization.

zhouzaida (Member) commented Apr 3, 2023

Oh, the init_weights method is defined in the sub-module, so there might be an issue with that.

HAOCHENYE (Collaborator)

@makecent Hi, I created a PR; will this solve the problem?
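
The content of the linked PR is not shown in this thread, so the following is only a sketch of one possible direction, not the actual fix: when recursing over children to call init_weights, look through model wrappers via their .module attribute so that a wrapped module's init_weights stays reachable.

import torch.nn as nn
from mmengine.model import is_model_wrapper


def init_child_weights(module: nn.Module) -> None:
    # Hypothetical helper for illustration: unwrap each child before
    # checking for init_weights, so wrappers such as
    # MMDistributedDataParallel no longer hide the method.
    for child in module.children():
        target = child.module if is_model_wrapper(child) else child
        if hasattr(target, 'init_weights'):
            target.init_weights()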

makecent (Author) commented Apr 4, 2023

@HAOCHENYE LGTM. The initialization works as expected with the PR.

makecent closed this as completed Apr 4, 2023