About object detection training #6
Comments
Hi @louis624, thank you for your interest in our paper.

> 1. Is PiT for object detection trained with a fixed size of 667 by 400 (half of 1333 and 800)? If so, were the images zero-padded in cases where the resized image was smaller than that size (667 by 400)?

Sorry for the confusion; let me explain our detection setting in detail.

Original (Deformable-DETR) transforms:

```python
scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=1333),
            T.Compose([
                T.RandomResize([400, 500, 600]),
                T.RandomSizeCrop(384, 600),
                T.RandomResize(scales, max_size=1333),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([800], max_size=1333),
        normalize,
    ])
```

Ours (roughly half of the original sizes):

```python
scales = [400 - i * 16 for i in range(11)]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=666),
            T.Compose([
                T.RandomResize([200, 250, 300]),
                T.RandomSizeCrop(192, 300),
                T.RandomResize(scales, max_size=666),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([400], max_size=666),
        normalize,
    ])
```

So it is not a fixed-size setting, and we did not use any extra code for zero padding.
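As a concrete illustration (a small sketch of my own, not code from the repo), assuming DETR's `T.RandomResize` semantics, where the sampled value is the target shorter side and `max_size` caps the longer side:

```python
# The halved multi-scale setting used above for training.
scales = [400 - i * 16 for i in range(11)]
print(scales)
# [400, 384, 368, 352, 336, 320, 304, 288, 272, 256, 240]

# T.RandomResize picks one of these values as the target shorter side and keeps
# the aspect ratio, capping the longer side at max_size=666. For example, a
# 480x640 COCO image resized with target 400 becomes roughly 400x533, so there
# is no fixed 667x400 canvas and no zero padding.
```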
> 2. For object detection, the input size is clearly different from the data for image classification. Then, does the patch size of PiT change, or does the number of patches change?
>
> 3. If the number of patches for detection is kept the same as for image classification, does the patch embedding (conv_embedding) have a larger kernel size?

When the input size changes, PiT uses a different number of patches (the patch size stays the same). I think the PiT code used for Deformable-DETR is the clearest answer to this question:

```python
class PoolingTransformer(nn.Module):
    def __init__(self, image_size, patch_size, stride,
                 num_classes, base_dims, depth, heads, mlp_ratio, in_chans=3,
                 attn_drop_rate=.0, drop_rate=.0, drop_path_rate=.0,
                 replace_stride_with_dilation=None):
        super(PoolingTransformer, self).__init__()

        total_block = sum(depth)
        padding = 0
        block_idx = 0
        if replace_stride_with_dilation is None:
            replace_stride_with_dilation = [False, False]
        self.dilation = 1

        width = math.floor(
            (image_size + 2 * padding - patch_size) / stride + 1)

        self.base_dims = base_dims
        self.heads = heads
        self.num_classes = num_classes
        self.patch_size = patch_size

        self.pos_embed = nn.Parameter(
            torch.randn(1, base_dims[0] * heads[0], width, width),
            requires_grad=True)
        self.patch_embed = conv_embedding(in_chans, base_dims[0] * heads[0],
                                          patch_size, stride, padding)

        self.cls_token = nn.Parameter(
            torch.randn(1, 1, base_dims[0] * heads[0]),
            requires_grad=True)
        self.pos_drop = nn.Dropout(p=drop_rate)

        self.transformers = nn.ModuleList([])
        self.pools = nn.ModuleList([])

        for stage in range(len(depth)):
            drop_path_prob = [drop_path_rate * i / total_block
                              for i in
                              range(block_idx, block_idx + depth[stage])]
            block_idx += depth[stage]

            self.transformers.append(
                Transformer(base_dims[stage], depth[stage], heads[stage],
                            mlp_ratio,
                            drop_rate, attn_drop_rate, drop_path_prob)
            )
            if stage < len(heads) - 1:
                stride = 2
                if replace_stride_with_dilation[stage]:
                    self.dilation *= stride
                    stride = 1
                self.pools.append(
                    conv_head_pooling(base_dims[stage] * heads[stage],
                                      base_dims[stage + 1] * heads[stage + 1],
                                      stride=stride,
                                      dilation=self.dilation)
                )

        self.norm = nn.LayerNorm(base_dims[-1] * heads[-1], eps=1e-6)

        # Classifier head
        self.head = nn.Linear(base_dims[-1] * heads[-1],
                              num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.pos_embed, std=.02)
        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim,
                              num_classes) if num_classes > 0 else nn.Identity()

    def no_grad_head(self):
        self.head.weight.requires_grad_(False)
        self.head.bias.requires_grad_(False)
        self.norm.weight.requires_grad_(False)
        self.norm.bias.requires_grad_(False)

    def change_resolution(self, h, w):
        self.pos_embed = nn.Parameter(
            F.interpolate(self.pos_embed.data, (h, w), mode='bicubic'),
            requires_grad=True
        )

    def forward_features(self, x):
        x = self.patch_embed(x)

        if x.shape[2:4] == self.pos_embed.shape[2:4]:
            pos_embed = self.pos_embed
        else:
            pos_embed = F.interpolate(self.pos_embed, x.shape[2:4],
                                      mode='bicubic')

        x = self.pos_drop(x + pos_embed)
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)

        features = []
        for stage in range(len(self.pools)):
            x, cls_tokens = self.transformers[stage](x, cls_tokens)
            features.append(x)
            x, cls_tokens = self.pools[stage](x, cls_tokens)
        x, cls_tokens = self.transformers[-1](x, cls_tokens)
        features.append(x)

        return features, cls_tokens

    def forward(self, x):
        features, cls_tokens = self.forward_features(x)
        return features
```

I hope my answers resolve your questions about our detection setting. Best
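To make the "different number of patches" point concrete, here is a small sketch (mine, with PiT-S-style values assumed for patch_size, stride, and the embedding width; they are not taken from this thread): the patch grid simply follows the input resolution, and pos_embed is bicubically resized to match, as in forward_features above.

```python
import math

import torch
import torch.nn.functional as F

# Assumed example values in the style of PiT-S: a 16x16 patch-embedding
# convolution with stride 8 and no padding, producing 144 channels.
patch_size, stride, padding = 16, 8, 0

def grid_size(side):
    # Spatial size of the patch-embedding output along one axis.
    return math.floor((side + 2 * padding - patch_size) / stride + 1)

print(grid_size(224), grid_size(224))   # 27 27 -> 27x27 patches at 224x224
print(grid_size(400), grid_size(666))   # 49 82 -> 49x82 patches at 400x666

# The learned positional embedding is simply resized to the new grid.
pos_embed = torch.randn(1, 144, 27, 27)
pos_embed_det = F.interpolate(pos_embed, (grid_size(400), grid_size(666)),
                              mode='bicubic')
print(pos_embed_det.shape)              # torch.Size([1, 144, 49, 82])
```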
Thank you for the detailed explanation of my questions! Just one more question about the architecture: in the code you shared, a dilation argument is passed to conv_head_pooling, but the conv_head_pooling class does not have that parameter.
In this case, since self.dilation is just 1, which is the default dilation of torch.nn.Conv2d, can I simply ignore it? Thank you!
Yes, you can ignore the dilation option. Because Deformable-DETR supports a dilation option for the backbone network, I implemented it for PiT as well.
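For later readers, here is a rough sketch of how a conv_head_pooling that accepts the optional dilation argument could look (a reconstruction for illustration, not the repo's exact class); since dilation=1 is already nn.Conv2d's default, dropping the argument changes nothing:

```python
import torch.nn as nn

class conv_head_pooling(nn.Module):
    """Grouped strided conv that pools spatial tokens; a linear layer pools
    the class token. The dilation argument is the optional extension used by
    the detection backbone; with dilation=1 it matches the plain version."""

    def __init__(self, in_feature, out_feature, stride, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_feature, out_feature,
                              kernel_size=stride + 1,
                              stride=stride,
                              padding=stride // 2,
                              dilation=dilation,
                              groups=in_feature)
        self.fc = nn.Linear(in_feature, out_feature)

    def forward(self, x, cls_token):
        return self.conv(x), self.fc(cls_token)
```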
Great! Thank you for the detailed explanations!
Dear authors
Thank you for the great paper and model architecture.
I have some questions related to object detection in your paper.
In Section 4.2 (Object Detection), it is written as follows:
So, my questions are:
Thank you in advance.