
Position encoding for downstream tasks when pos_mask_ratio=1, and other questions #5

Closed
KJ-rc opened this issue May 3, 2024 · 4 comments


KJ-rc commented May 3, 2024

Hi,
Thank you for the impressive work. I want to double-check a few points about the paper and code.

  • When setting pos_mask_ratio=1 in pre-training, do we apply any position encoding in downstream tasks, e.g., linear probing? Also, could we say DropPos is almost equivalent to Zhai et al. [1] under this setting?
  • I found "--multi_task" in the pre-training code, but there seem to be no reported results for it. I am curious how much it boosts performance.
  • The visible patches with masked positions are processed by the encoder. This differs from MAE; shouldn't they join later, at the decoder stage (which could further speed up training)? Under this setting, what is the difference between the encoder and the decoder?

[1] Zhai et al., "Position Prediction as an Effective Pretraining Strategy"

Haochen-Wang409 (Owner) commented

Hi, thanks for your interest in our work! Here are point-by-point responses:

  • The positional embeddings are added back in downstream tasks even when pos_mask_ratio=1 is used in pre-training. DropPos with pos_mask_ratio=1 is not equivalent to MP3 [1], because the visible patches of DropPos are encoded with positional embeddings, while no positional information is added to the context tokens in [1]. Moreover, DropPos employs a patch-masking stage, so DropPos is more efficient than [1].
  • The multi_task setting is expected to boost top-1 accuracy on ImageNet-1K by ~0.5% with a ViT-B backbone pre-trained for 200 epochs.
  • DropPos tries to reconstruct dropped positions based on patch appearances; the visible patches without positional embeddings provide sufficient information for that reconstruction. As in most self-supervised methods, the encoder is responsible for learning scalable feature representations, while the decoder serves the particular pretext task, i.e., reconstructing the dropped positions in DropPos. A rough sketch of the position-masking step follows below.
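
A minimal PyTorch sketch of that position-masking step (illustrative only, not taken from this repo; names such as `pos_embed` and `mask_token` are assumptions): each visible token either keeps its positional embedding or has it replaced by a shared learnable mask token, so pos_mask_ratio=1 removes all positional information from the encoder input.

```python
import torch

def mask_positions(patches, pos_embed, mask_token, pos_mask_ratio):
    """Drop positional embeddings for a random subset of visible tokens.

    patches:    (B, N, D) already-embedded visible tokens
    pos_embed:  (1, N, D) learned positional embeddings
    mask_token: (1, 1, D) learned placeholder for dropped positions
    """
    B, N, D = patches.shape
    num_masked = int(N * pos_mask_ratio)

    # Randomly choose which tokens lose their positional embedding.
    noise = torch.rand(B, N, device=patches.device)
    ids = noise.argsort(dim=1)  # random permutation per sample
    masked = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    masked.scatter_(1, ids[:, :num_masked], True)

    # Masked tokens receive the shared mask token; the rest keep their position.
    pos = torch.where(masked.unsqueeze(-1),
                      mask_token.expand(B, N, -1),
                      pos_embed.expand(B, -1, -1))
    return patches + pos, masked  # `masked` marks the positions to predict
```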

KJ-rc closed this as completed May 6, 2024

KJ-rc commented May 6, 2024

Thank you for the explanation. I still have a few questions.

  1. When pos_mask_ratio=1, DropPos does not see any position information either, does it?
  2. Regarding my third question: if no new tokens join at the decoder, what is the difference between a "12-layer encoder + 2-layer decoder" setting and a "14-layer encoder" setting?

KJ-rc reopened this May 6, 2024
Haochen-Wang409 (Owner) commented

There seems to be no difference. The only thing that matters may be which layer's features to take for downstream classification, as in the sketch below.
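
A small PyTorch sketch of that point (stand-in block type and names, not this repo's code): when the decoder receives the encoder output with no extra tokens, a "12-block encoder + 2-block decoder" is computationally just a 14-block stack, and the split only decides after which block features are taken for the downstream head.

```python
import torch
import torch.nn as nn

# 14 identical transformer blocks; the encoder/decoder boundary is only a label.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(14)
)

def forward_features(x, probe_layer=12):
    """Run all 14 blocks; return the features after `probe_layer` blocks
    (the nominal "encoder output") alongside the final output."""
    feats = None
    for i, blk in enumerate(blocks):
        x = blk(x)
        if i + 1 == probe_layer:
            feats = x  # features a linear probe would use
    return feats, x    # x is the final ("decoder") output

tokens = torch.randn(2, 196, 768)  # (batch, num_patches, dim)
probe_feats, final_out = forward_features(tokens)
```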


KJ-rc commented May 9, 2024

Thank you. That answers my questions.

KJ-rc closed this as completed May 9, 2024