
Position encoding for downstream tasks when pos_mask_ratio=1, and other questions #5

Closed
KJ-rc opened this issue May 3, 2024 · 4 comments


KJ-rc commented May 3, 2024

Hi,
Thank you for the impressive work. I want to double-check a few points about the paper and code.

  • When setting pos_mask_ratio=1 in pre-training, do we apply any position encoding in downstream tasks, e.g., linear probing? Also, could we say DropPos is almost equivalent to Zhai et al. [1] under this setting?
  • I found "--multi_task" in the pre-training code, but there seem to be no reported results for it. I am curious how much it boosts performance.
  • The visible patches with masked positions are processed by the encoder. This differs from MAE; shouldn't they join later, at the decoder stage (which could further speed up training)? Under this setting, what is the difference between the encoder and the decoder?

[1] Zhai et al., "Position Prediction as an Effective Pretraining Strategy"

Haochen-Wang409 (Owner) commented

Hi, thanks for your interest in our work! Here are point-by-point responses:

  • The positional embeddings are added back in downstream tasks even when pos_mask_ratio=1 is used in pre-training. DropPos with pos_mask_ratio=1 is not equivalent to MP3 [1], because the visible patches of DropPos are encoded with positional embeddings, while no positional information is added to the context tokens in [1]. Moreover, DropPos employs a patch-masking stage, so DropPos is more efficient than [1].
  • The multi_task setting is expected to boost top-1 accuracy on ImageNet-1K by ~0.5% with a ViT-B backbone pre-trained for 200 epochs.
  • DropPos tries to reconstruct dropped positions based on patch appearances; the visible patches without positional embeddings provide sufficient information for that reconstruction. As in most self-supervised methods, the encoder is responsible for learning scalable feature representations, while the decoder serves the particular pretext task, i.e., reconstructing the dropped positions in DropPos. A rough sketch of the position-masking step follows below.
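
A minimal PyTorch sketch of that position-masking step (illustrative only, not taken from this repo; names such as `pos_embed` and `mask_token` are assumptions): each visible token either keeps its positional embedding or has it replaced by a shared learnable mask token, so pos_mask_ratio=1 removes all positional information from the encoder input.

```python
import torch

def mask_positions(patches, pos_embed, mask_token, pos_mask_ratio):
    """Drop positional embeddings for a random subset of visible tokens.

    patches:    (B, N, D) already-embedded visible tokens
    pos_embed:  (1, N, D) learned positional embeddings
    mask_token: (1, 1, D) learned placeholder for dropped positions
    """
    B, N, D = patches.shape
    num_masked = int(N * pos_mask_ratio)

    # Randomly choose which tokens lose their positional embedding.
    noise = torch.rand(B, N, device=patches.device)
    ids = noise.argsort(dim=1)  # random permutation per sample
    masked = torch.zeros(B, N, dtype=torch.bool, device=patches.device)
    masked.scatter_(1, ids[:, :num_masked], True)

    # Masked tokens receive the shared mask token; the rest keep their position.
    pos = torch.where(masked.unsqueeze(-1),
                      mask_token.expand(B, N, -1),
                      pos_embed.expand(B, -1, -1))
    return patches + pos, masked  # `masked` marks the positions to predict
```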

KJ-rc closed this as completed May 6, 2024

KJ-rc commented May 6, 2024

Thank you for the explanation. I still have a few questions.

  1. When pos_mask_ratio=1, DropPos does not see any position information either, does it?
  2. Regarding my third question: if no new tokens join at the decoder, what is the difference between a "12-layer encoder + 2-layer decoder" setting and a "14-layer encoder" setting?

KJ-rc reopened this May 6, 2024
Haochen-Wang409 (Owner) commented

There seems to be no difference. The only thing that matters may be which layer's features to take for downstream classification, as in the sketch below.
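
A small PyTorch sketch of that point (stand-in block type and names, not this repo's code): when the decoder receives the encoder output with no extra tokens, a "12-block encoder + 2-block decoder" is computationally just a 14-block stack, and the split only decides after which block features are taken for the downstream head.

```python
import torch
import torch.nn as nn

# 14 identical transformer blocks; the encoder/decoder boundary is only a label.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(14)
)

def forward_features(x, probe_layer=12):
    """Run all 14 blocks; return the features after `probe_layer` blocks
    (the nominal "encoder output") alongside the final output."""
    feats = None
    for i, blk in enumerate(blocks):
        x = blk(x)
        if i + 1 == probe_layer:
            feats = x  # features a linear probe would use
    return feats, x    # x is the final ("decoder") output

tokens = torch.randn(2, 196, 768)  # (batch, num_patches, dim)
probe_feats, final_out = forward_features(tokens)
```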


KJ-rc commented May 9, 2024

Thank you. That answers my questions.

KJ-rc closed this as completed May 9, 2024