
[Feature] Support Side Adapter Network #3232

Merged (67 commits into open-mmlab:dev-1.x, Sep 20, 2023)

Conversation

angiecao (Contributor) commented Jul 25, 2023

Motivation

Support SAN for open-vocabulary semantic segmentation.
Paper: Side Adapter Network for Open-Vocabulary Semantic Segmentation (https://arxiv.org/abs/2302.12242)
Official code: SAN (https://github.com/MendelXu/SAN)

Modification

  • Added the backbone ViT parameters needed to implement the CLIP image encoder.
  • Added the text encoder code.
  • Added the multimodal encoder-decoder segmentor code for open-vocabulary semantic segmentation.
  • Added the SideAdapterNetwork decode head code.
  • Added config files for training and inference.
  • Added tools for converting pretrained models.
  • Added the loss implementation for mask classification models such as SAN and MaskFormer, removing the dependency on mmdetection.
  • Added unit tests for the text encoder, multimodal encoder-decoder, SAN decode head and hungarian_assigner.

Use cases

Convert Models

pretrained SAN model
The official pretrained models can be downloaded from san_clip_vit_b_16.pth (https://huggingface.co/Mendel192/san/blob/main/san_vit_b_16.pth) and san_clip_vit_large_14.pth (https://huggingface.co/Mendel192/san/blob/main/san_vit_large_14.pth).
Use tools/model_converters/san2mmseg.py to convert the official model into mmseg style:
python tools/model_converters/san2mmseg.py <MODEL_PATH> <OUTPUT_PATH>

pretrained CLIP model
Use the CLIP model provided by OpenAI to train SAN. The CLIP models can be downloaded from ViT-B-16.pt (https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt) and ViT-L-14-336px.pt (https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt).
Use tools/model_converters/clip2mmseg.py to convert the model into mmseg style:
python tools/model_converters/clip2mmseg.py <MODEL_PATH> <OUTPUT_PATH>
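
For example, converting the downloaded checkpoints could look like this (the output paths below are arbitrary and only for illustration):

    python tools/model_converters/san2mmseg.py san_vit_b_16.pth pretrain/san_vit_b_16_mmseg.pth
    python tools/model_converters/clip2mmseg.py ViT-B-16.pt pretrain/clip_vit_b_16_mmseg.pth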

Inference

Test the san_vit-base-16 model on the COCO-Stuff164k dataset:
python tools/test.py ./configs/san/san-vit-b16_coco-stuff164k-640x640.py <TRAINED_MODEL_PATH>

Train

Train the san_vit-base-16 model on the COCO-Stuff164k dataset:
python tools/train.py ./configs/san/san-vit-b16_coco-stuff164k-640x640.py --cfg-options model.pretrained=<PRETRAINED_MODEL_PATH>

Comparison Results

Train on COCO-Stuff164k

|                 |          | mIoU  | mAcc  | pAcc  |
| --------------- | -------- | ----- | ----- | ----- |
| san-vit-base16  | official | 41.93 | 56.73 | 67.69 |
|                 | mmseg    | 41.93 | 56.84 | 67.84 |
| san-vit-large14 | official | 45.57 | 59.52 | 69.76 |
|                 | mmseg    | 45.78 | 59.61 | 69.21 |

Evaluate on Pascal Context

|                 |          | mIoU  | mAcc  | pAcc  |
| --------------- | -------- | ----- | ----- | ----- |
| san-vit-base16  | official | 54.05 | 72.96 | 77.77 |
|                 | mmseg    | 54.04 | 73.74 | 77.71 |
| san-vit-large14 | official | 57.53 | 77.56 | 78.89 |
|                 | mmseg    | 56.89 | 76.96 | 78.74 |

Evaluate on Voc12Aug

|                 |          | mIoU  | mAcc  | pAcc  |
| --------------- | -------- | ----- | ----- | ----- |
| san-vit-base16  | official | 93.86 | 96.61 | 97.11 |
|                 | mmseg    | 94.58 | 97.01 | 97.38 |
| san-vit-large14 | official | 95.17 | 97.61 | 97.63 |
|                 | mmseg    | 95.58 | 97.75 | 97.79 |

CLAassistant commented Jul 25, 2023

CLA assistant check
All committers have signed the CLA.

@xiexinch xiexinch changed the base branch from main to dev-1.x July 31, 2023 09:32
@angiecao angiecao changed the title [WIP] Support Side Adapter Network [Feature] Support Side Adapter Network Sep 7, 2023
Collaborator
Might remove this file.

Comment on lines 89 to 93
def all_reduce_dict(py_dict, op='sum', group=None, to_float=True):
"""Apply all reduce function for python dict object.

The code is modified from https://github.com/Megvii-
BaseDetection/YOLOX/blob/main/yolox/utils/allreduce_norm.py.
Collaborator
This method seems not in use.

Comment on lines 148 to 151
def sync_random_seed(seed=None, device='cuda'):
"""Make sure different ranks share the same seed.

All workers must call this function, otherwise it will deadlock.
Collaborator
This method seems not in use.

Comment on lines 79 to 86
@functools.lru_cache()
def _get_global_gloo_group():
"""Return a process group based on gloo backend, containing all the ranks
The result is cached."""
if dist.get_backend() == 'nccl':
return dist.new_group(backend='gloo')
else:
return dist.group.WORLD
Collaborator
This method seems not in use.

Comment on lines 37 to 41
def allreduce_grads(params, coalesce=True, bucket_size_mb=-1):
"""Allreduce gradients.

Args:
params (list[torch.Parameters]): List of parameters of a model
Collaborator
The same as other methods, it's not in use.

"""
import gzip
import html
# https://stackoverflow.com/q/62691279
Collaborator
Might remove this line.

Comment on lines 18 to 21
class CLIPTextEncoder(BaseModule):
"""A text encoder with transformer architecture to encode the label text.

Args:
Collaborator
We should add a reference link to the original implementation and add a license.


from mmseg.models.backbones.vit import TransformerEncoderLayer

# Modified from https://github.com/MendelXu/SAN/blob/main/san/model/attn_helper.py # noqa: E501
Collaborator
Should move this line to the top of this file.

Comment on lines 386 to 392
class LayerNorm(nn.Module):
"""A LayerNorm variant, popularized by Transformers, that performs point-
wise mean and variance normalization over the channel dimension for inputs
that have shape (batch_size, channels, height, width).

https://github.com/facebookresearch/ConvNeXt/blob/d1fa8f6fef0a165b27399986cc2bdacc92777e40/models/convnext.py#L119 # noqa B950
"""
Collaborator
Could we use nn.LayerNorm instead?

Contributor Author
Yes, this method can be replaced with nn.LayerNorm plus two torch.permute calls, but I think keeping this LayerNorm variant is the better choice.
The variant, popularized by Transformers, accepts inputs directly in the shape (B, C, H, W). The vision transformer layers also take and produce this shape, so there is no need to keep permuting inputs and outputs between network layers as there would be with nn.LayerNorm.
To distinguish it from nn.LayerNorm, I renamed it to LayerNorm2d.
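
For reference, a minimal sketch of such a channels-first LayerNorm (an illustrative re-implementation in the spirit of the ConvNeXt version, not the exact code added in this PR):

    import torch
    import torch.nn as nn

    class LayerNorm2d(nn.Module):
        """LayerNorm over the channel dimension for (B, C, H, W) inputs."""

        def __init__(self, num_channels: int, eps: float = 1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(num_channels))
            self.bias = nn.Parameter(torch.zeros(num_channels))
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Point-wise mean/variance over the channel dimension, then affine transform.
            mean = x.mean(dim=1, keepdim=True)
            var = (x - mean).pow(2).mean(dim=1, keepdim=True)
            x = (x - mean) / torch.sqrt(var + self.eps)
            return self.weight[:, None, None] * x + self.bias[:, None, None]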

Comment on lines 293 to 294
qkv_bias (int): Whether to use bias in multihead-attention.
Default: True.
Collaborator
Suggested change:
-   qkv_bias (int): Whether to use bias in multihead-attention.
-       Default: True.
+   qkv_bias (bool): Whether to use bias in multihead-attention.
+       Default: True.

Comment on lines 376 to 379
def init_para(self):
if hasattr(self, 'sos_token'):
nn.init.normal_(self.sos_token, std=0.02)

Collaborator
It's not in use.

Comment on lines 421 to 422
def forward(self, bias, feature):
"""Forward function."""
Collaborator
Might add type hints and docstring.

Comment on lines 368 to 371
def cross_attn_layer(self: TransformerEncoderLayer, x, mem, attn_bias):
"""Implementation of transformer layer with cross attention
Args:
self (TransformerEncoderLayer): The Module of vision transformer layer.
Collaborator
It is suggested that self be renamed to something else.

Comment on lines 16 to 27
def cross_attn_with_self_bias(self, query, key, value, attn_mask=None):
"""Implementation of cross attention layer which shares the embedding
weights with self-attention.

Args:
self: self-attention layer
query, key, value: map a query and a set of key-value pairs to
an output. See "Attention Is All You Need" for more details.
attn_mask: mask that prevents attention to certain positions.
"""
return cross_attn_with_self_bias_func(
query,
Collaborator
Can the two methods be combined into one? Also, the naming of the self param is a bit misleading.

embed_dims=embed_dims,
feedforward_channels=mlp_ratio * embed_dims,
act_cfg=act_cfg),
operation_order=('norm', 'self_attn', 'norm', 'ffn')))
Collaborator
Could we use self.cross_attn to control whether to add 'cross_attn' to the operation_order?

Contributor Author
The cross-attention implementation in SAN differs from the one in mmcv: it also computes self-attention for the query.
The query's self-attention weight is concatenated with the cross-attention weights:

    self_weight = (q * q_k).sum(dim=-1, keepdim=True)
    total_attn_output_weights = torch.cat([attn_output_weights, self_weight], dim=-1)
    total_attn_output_weights = F.softmax(total_attn_output_weights, dim=-1)

and the weighted query value is added to the final attention output:

    attn_output = attn_output + self_weight * q_v
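
Putting these pieces together, a minimal self-contained sketch of the idea (assumed shapes and names; an illustration, not the PR's actual implementation):

    import torch
    import torch.nn.functional as F

    def cross_attn_with_self_weight(q, q_k, q_v, k, v):
        # q, q_k, q_v: (B, Nq, C) query and its self-attention key/value projections
        # k, v: (B, Nk, C) memory key/value projections
        scale = q.shape[-1] ** -0.5
        attn_output_weights = (q * scale) @ k.transpose(-2, -1)    # (B, Nq, Nk)
        self_weight = (q * scale * q_k).sum(dim=-1, keepdim=True)  # (B, Nq, 1)
        # Softmax over cross-attention weights and the query's self weight jointly.
        total = torch.cat([attn_output_weights, self_weight], dim=-1)
        total = F.softmax(total, dim=-1)
        cross_weight, self_weight = total[..., :-1], total[..., -1:]
        # Weighted memory values plus the query's own weighted value.
        return cross_weight @ v + self_weight * q_v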

@xiexinch xiexinch merged commit 608e319 into open-mmlab:dev-1.x Sep 20, 2023
2 checks passed
zsc1220 commented Oct 13, 2023

How to support open-vocabulary prompting at inference?

angiecao (Contributor Author)

How to support open-vocabulary prompting at inference?

  1. If you want a custom set of category names, define model.text_encoder.vocabulary in the config file and set model.text_encoder.dataset_name to None:

    model = dict(
        text_encoder=dict(dataset_name=None,
                          vocabulary=['classA', 'classB', 'classC']))

  2. If you want to set prompt templates, define a new list of templates in mmseg/utils/get_templates.py and point model.text_encoder.templates to your template name in the config file:

    PREDEFINED_TEMPLATES = {
        'custom': [
            'a photo of a {}.',
            'This is a photo of a {}',
            'There is a {} in the scene',
        ],
    }

    model = dict(text_encoder=dict(templates='custom'))

The prompt engineering is implemented in the template_encode function in mmseg/models/text_encoder/clip_text_encoder.py.
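
For context, prompt ensembling along these lines usually works roughly as follows (a simplified sketch, not the exact mmseg code; encode_text is an assumed helper standing in for a CLIP-style text encoder):

    import torch

    def template_encode_sketch(class_names, templates, encode_text):
        # encode_text: callable mapping a list of prompt strings to (N, C) embeddings
        class_embeddings = []
        for name in class_names:
            prompts = [t.format(name) for t in templates]  # fill each template
            emb = encode_text(prompts)                     # (num_templates, C)
            emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalize
            class_embeddings.append(emb.mean(dim=0))       # average over templates
        return torch.stack(class_embeddings)               # (num_classes, C)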

zsc1220 commented Oct 13, 2023

@angiecao Wow, thank you so much! But if I just want to give some categories when predicting, and only predict those categories, how should I do that? Like this:

python predict.py test/xxx.jpg configs/xxx.py work_dirs/xxx.pth --class-names 'Oculus' 'Ukulele'  --output ./pred/xxx.jpg

emily-lin pushed a commit to emily-lin/mmsegmentation that referenced this pull request Nov 18, 2023
nahidnazifi87 pushed a commit to nahidnazifi87/mmsegmentation_playground that referenced this pull request Apr 5, 2024