[Fix] fix the wrong weight reference bug in BaseTransformerLayer #1418

gaotongxiao · 2021-10-20T08:34:14Z

Fixes the wrong function/weight reference bug in BaseTransformerLayer when batch_first is True.

Motivation

Users sometimes clone a module with copy.deepcopy to build, for example, a ModuleList. This becomes tricky with BaseTransformerLayer: if we build a ModuleList this way with batch_first=True, move it to the GPU, and run the forward function, an error reports that the weights are not on the GPU.

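import copy

import torch
from mmcv.cnn.bricks.transformer import BaseTransformerLayer
from mmcv.runner.base_module import ModuleList

baselayer = BaseTransformerLayer(
    operation_order=('self_attn', 'ffn'),
    batch_first=True,
    attn_cfgs=dict(
        type='MultiheadAttention',
        embed_dims=256,
        num_heads=8,
    ),
)
# Clone the layer twice, then move only the clones to the GPU.
baselayers = ModuleList([copy.deepcopy(baselayer) for _ in range(2)])
baselayers.to('cuda')
x = torch.rand(2, 10, 256).cuda()  # (batch, num_query, embed_dims)
out = baselayers[0](x)  # raises the RuntimeError shown below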

The error:

RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)

The cause is forward_wrapper, which captures the bound method self.attn.forward during initialization and always calls that captured method after the transpose operations. This works for the module itself but not for modules deepcopied from it: copy.deepcopy copies function objects by reference, so each copy keeps the same wrapper and still calls the forward function of the original module. In effect, the copies compute with the weights of the original module, which in the example above are still on the CPU.

mmcv/mmcv/cnn/bricks/transformer.py

Lines 105 to 125 in e8489a7

    
        if self.batch_first:

            def _bnc_to_nbc(forward):
                """Because the dataflow('key', 'query', 'value') of
                ``torch.nn.MultiheadAttention`` is (num_query, batch,
                embed_dims), We should adjust the shape of dataflow from
                batch_first (batch, num_query, embed_dims) to num_query_first
                (num_query ,batch, embed_dims), and recover ``attn_output``
                from num_query_first to batch_first."""

                def forward_wrapper(**kwargs):
                    convert_keys = ('key', 'query', 'value')
                    for key in kwargs.keys():
                        if key in convert_keys:
                            kwargs[key] = kwargs[key].transpose(0, 1)
                    attn_output, attn_output_weights = forward(**kwargs)
                    return attn_output.transpose(0, 1), attn_output_weights

                return forward_wrapper

            self.attn.forward = _bnc_to_nbc(self.attn.forward)
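The failure mode can be reproduced without PyTorch. The following is a minimal sketch, not mmcv code: ToyAttention and BuggyLayer are hypothetical stand-ins, and a plain number plays the role of the module's weights. It only assumes the documented behavior of copy.deepcopy, which treats function objects as atomic and copies them by reference.

```python
import copy


class ToyAttention:
    """Hypothetical stand-in for an attention module; `weight` mimics its parameters."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return x * self.weight


class BuggyLayer:
    """Mimics the pattern quoted above: the wrapper is installed once, in __init__."""

    def __init__(self, weight):
        self.attn = ToyAttention(weight)

        def wrap(forward):
            def forward_wrapper(x):
                # `forward` is a bound method of the ORIGINAL self.attn.
                return forward(x)

            return forward_wrapper

        # Stored as an instance attribute, so deepcopy keeps a reference to it.
        self.attn.forward = wrap(self.attn.forward)


original = BuggyLayer(weight=2)
clone = copy.deepcopy(original)
clone.attn.weight = 10  # analogous to moving only the clone's weights to the GPU

same_wrapper = clone.attn.forward is original.attn.forward
result = clone.attn.forward(3)

print(same_wrapper)  # True: the clone shares the original's wrapper
print(result)        # 6, not 30: the original's weight was used
```

In the real layer the same sharing means the clone's forward pass reads parameters that never left the CPU, which matches the addmm device error above.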

Modification

Hardcoding a function pointer in the initializer is not good practice. I moved the transpose logic into forward, which should be much safer. A unit test for this case is also added.
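As an illustration of the idea only (this is not the actual patch to mmcv/cnn/bricks/transformer.py), the sketch below does the layout conversion inside forward, so self.attn is looked up on the current instance at call time. ToyAttention, FixedLayer and swap_first_two_dims are hypothetical names, and nested lists stand in for tensors.

```python
import copy


def swap_first_two_dims(rows):
    """Transpose a 2-D nested list (toy stand-in for tensor.transpose(0, 1))."""
    return [list(col) for col in zip(*rows)]


class ToyAttention:
    """Hypothetical stand-in for an attention module."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, query):
        # Scales every element by this instance's weight.
        return [[v * self.weight for v in row] for row in query]


class FixedLayer:
    """Converts layouts inside forward instead of patching attn.forward in __init__."""

    def __init__(self, weight, batch_first=True):
        self.attn = ToyAttention(weight)
        self.batch_first = batch_first

    def forward(self, query):
        if self.batch_first:
            query = swap_first_two_dims(query)  # batch-first -> query-first
        out = self.attn.forward(query)  # resolved on this instance at call time
        if self.batch_first:
            out = swap_first_two_dims(out)  # back to batch-first
        return out


original = FixedLayer(weight=2)
clone = copy.deepcopy(original)
clone.attn.weight = 10

x = [[1, 2, 3], [4, 5, 6]]  # (batch=2, num_query=3)
out = clone.forward(x)
print(out)  # [[10, 20, 30], [40, 50, 60]]: the clone's own weight is used
```

Because no function object is stored on the instance, a deep copy carries no reference back to the module it was copied from.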

BC-breaking (Optional)

No

jshilong

LGTM

fix the wrong function reference bug in BaseTransformerLayer when batch_first is True

8a2ce4e

zhouzaida requested a review from jshilong October 20, 2021 09:30

jshilong approved these changes Oct 26, 2021

zhouzaida approved these changes Oct 31, 2021

ZwwWayne approved these changes Nov 2, 2021

ZwwWayne merged commit c522b47 into open-mmlab:master Nov 2, 2021

gaotongxiao deleted the fix_transformer branch November 3, 2021 01:24

zhouzaida pushed a commit that referenced this pull request Nov 3, 2021

fix the wrong function reference bug in BaseTransformerLayer when batch_first is True (#1418)

2ea5e5e

zhouzaida mentioned this pull request Nov 7, 2021

Iteration Plan v1.3.17 - Nov 2021 #1439

Closed

13 tasks

This was referenced Nov 12, 2021

Catch symlink failure on Windows open-mmlab/mmdetection#6482

Merged

[Fix] Fix SpatialReductionAttention in PVT. open-mmlab/mmdetection#6488

Merged

This was referenced Nov 12, 2021

[Feature] Add Cutout transform open-mmlab/mmsegmentation#1022

Merged

[Fix] Fix EfficientMultiheadAttention in SegFormer open-mmlab/mmsegmentation#1037

Merged

MengzhangLI mentioned this pull request Dec 20, 2021

[Fix] Fix mmcv version compatibility in get_started.md. open-mmlab/mmsegmentation#1154

Closed

3 tasks

JiaquanYe mentioned this pull request Apr 3, 2022

[Model] Add MASTER open-mmlab/mmocr#807

Merged
