Position Embedding #6

Closed
michael-camilleri opened this issue May 10, 2022 · 12 comments

@michael-camilleri

As I was tweaking the code, I found some design choices in the position embeddings which I cannot understand.

  1. For the STLT, it seems that the position embedding vector does not take into account the number of sampled frames. Specifically, although it does use the config parameter config.layout_num_frames, this is never set in train.py (lines 87-96) or inference.py (lines 47-54), which means that the position embedding vector is always of size 256 (configs.py line 109). Is there a reason for this? I have changed this in my code to ensure that the position embedding is always of size config.layout_num_frames + 1 - does this make sense?

  2. For CACNF, the position embedding seems to depend on the batch size and on the image size itself for the summation with the features to work (models.py line 267). If I change config.spatial_size, or set the batch size to less than 16, then there is a dimension mismatch. I hacked around this by computing some intermediary values in my code, but the result is still somewhat hard-coded to my data. Is there a reason for this dependency/architecture?

@michael-camilleri
Author

Hi Gorjan

Did you have some time to look at the above?

@gorjanradevski
Owner

Will do so on Monday. Sorry for the delay.

@michael-camilleri
Author

No worries, thanks. Looking forward to it.

@michael-camilleri
Author

Hi Gorjan

Sorry to bug you again. Do you have any update on the above?

Thanks

@gorjanradevski
Owner

Hi Michael, I had time to look at this today.

  1. The position embedding vector in STLT is always created with the maximum size of 256, but it is sliced in the forward pass to select only the appropriate number of frames in the batch. See L105 in models.py; a minimal sketch of the idea follows after this list.

  2. In CACNF, the position embedding is an absolute position embedding inferred from the "fixed" number of frames you want to sample during training/inference. It is not dependent on the batch size. In general, you should be able to change both the batch size and the spatial size as you wish; however, I think you would have an issue with spatial sizes smaller than 112, but that issue would appear in the ResNet, I believe. Feel free to post the error you're getting in this issue and I'll have a look.
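
A rough sketch of the slicing idea for point 1 (the names, sizes, and module structure are illustrative, not the exact code in models.py):

import torch
import torch.nn as nn

class PositionEmbeddingSketch(nn.Module):
    def __init__(self, max_positions: int = 256, hidden_size: int = 768):
        super().__init__()
        # The buffer is created once at the maximum length...
        self.register_buffer("position_ids", torch.arange(max_positions).expand((1, -1)))
        self.embeddings = nn.Embedding(max_positions, hidden_size)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # ...and sliced here to the actual sequence length of the batch.
        seq_len = frame_features.size(1)  # frame_features: [batch, seq_len, hidden]
        position_ids = self.position_ids[:, :seq_len]
        return frame_features + self.embeddings(position_ids)

embedder = PositionEmbeddingSketch()
print(embedder(torch.rand(2, 17, 768)).size())  # e.g. 16 frames + class token -> torch.Size([2, 17, 768])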

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

Hi Gorjan

So, with regard to STLT, that is in fact what I understood. In my branch I am simply sizing the embedding vector according to layout_num_frames to save space:
self.register_buffer("position_ids", torch.arange(config.layout_num_frames+1).expand((1, -1)))

With regards to CACNF, if I use your code, with the following configuration,

appearance_num_frames = 12
frame_size = 200 # This is a new config parameter I introduced to control the frame size
batch_size = 1 # Had to do this to run on my machine; in reality, I am able to run on a cluster with a batch size of 5, but I get the same error.

I end up with the following error for line 267 features = features + self.pos_embed:
The size of tensor a (85) must match the size of tensor b (13) at non-singleton dimension 0

I had found a fix around this through the following calculation:

_frames = 1 if config.appearance_num_frames < 16 else 2
_height = np.ceil(config.spatial_size / 32)
_width = np.ceil(1280/720 * config.spatial_size / 32)
_size = int(_frames * _width * _height + 1)

Note that the 1280/720 is due to the size of the images in my data.
In particular, note that the dependence is on whether the number of appearance frames is below or above 16, rather than on the value itself.
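
For concreteness, plugging the configuration above into this calculation (assuming config.spatial_size = 200 here, matching frame_size) reproduces the 85 in the error message, versus the position embedding length of 12 + 1 = 13:

import numpy as np

_frames = 1 if 12 < 16 else 2               # appearance_num_frames = 12 -> 1
_height = np.ceil(200 / 32)                 # -> 7
_width = np.ceil(1280 / 720 * 200 / 32)     # -> 12
print(int(_frames * _width * _height + 1))  # -> 85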

@gorjanradevski
Owner

gorjanradevski commented May 31, 2022

I understand where the issues come from. The Resnet3D internals, as well as the code I wrote on top of it, assume that the frame size is 112 x 112 -- which it ideally should be, as the Resnet3D has been (pre-)trained with frames of that size. Now, when the frame size is different from 112 x 112, the output from the Resnet3D will be different too: https://github.com/gorjanradevski/revisiting-spatial-temporal-layouts/blob/main/src/modelling/models.py#L256

An easy, straightforward fix would be to perform pooling right after the feature extraction, to ensure that regardless of the frame size (as long as it's larger than 112 x 112), the output is the same as if the frame size were 112 x 112. Add the following after L256: features = nn.AdaptiveAvgPool3d((2, 4, 4))(features).

With this, you should be fine inputting frames of any size to the model (again, as long as they are > 112); test below:

import torch

from modelling.models import TransformerResnet
from modelling.configs import AppearanceModelConfig

config = AppearanceModelConfig(appearance_num_frames=32,
                               resnet_model_path="../models/renamed_models/r3d50_KMS_200ep.pth",
                               num_classes=10)
model = TransformerResnet(config)

with torch.no_grad():
    output = model({"video_frames": torch.rand(4, 3, 32, 200, 200)})
    print(output["resnet3d"].size())

Note that the code above makes the assumption that the number of frames is > 32.

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

Hi Gorjan

That seems to have solved part of the problem indeed; however, there is still the issue of the dependence on appearance_num_frames (as expected):

When appearance_num_frames = 12
I get:

RuntimeError: The size of tensor a (33) must match the size of tensor b (13) at non-singleton dimension 0

When appearance_num_frames = 16

RuntimeError: The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 0

Obviously, it works as expected with appearance_num_frames = 32, but it fails again with appearance_num_frames = 33: i.e. only appearance_num_frames = 32 is supported.

Shouldn't the resnet output dimension also change with appearance_num_frames? Or is there something wrong in the addition?

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

I did some further investigation:

Basically, the Temporal dimension of the ResNet output [Batch, Hidden, Temporal, Spatial, Spatial] is always ceil(appearance_num_frames / 16)
(e.g. if appearance_num_frames <= 16, then Temporal = 1; if 16 < appearance_num_frames <= 32, then Temporal = 2; etc.).
Is this expected?
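
Spelled out, the pattern I am observing (just tabulating the relationship above):

import math

for n in (12, 16, 17, 32, 33, 48):
    print(n, math.ceil(n / 16))  # 12 -> 1, 16 -> 1, 17 -> 2, 32 -> 2, 33 -> 3, 48 -> 3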

However, I also noticed that the position embedding is never explicitly initialised, i.e. it starts as all zeros. Is it encoding the position of the frame?

@gorjanradevski
Owner

gorjanradevski commented Jun 2, 2022

Hi Michael,

What you found w.r.t. the Resnet is definitely expected, and it occurs because of the pooling taking place in the Resnet. The position embedding is zero-initialized, but it is learned during training. You might want to try random initialization

self.pos_embed = nn.Parameter(torch.rand(config.appearance_num_frames + 1, 1, config.hidden_size))

but I don't think you'll see much improvement there. Besides that, the solution I came up with for handling a variable number of frames is something like the following (a rough sketch is given after this list):

  • Create self.pos_embed as self.pos_embed = nn.Parameter(torch.rand(max_num_frames + 1, 1, config.hidden_size)). Note that max_num_frames is a misnomer here; nevertheless, set it to something like 128.
  • Perform average pooling to deal with the variable frame size: features = nn.AdaptiveAvgPool3d((None, 4, 4))(features). Note the None, meaning the temporal dimension will be kept the same.
  • Combine self.pos_embed with features by slicing: features = features + self.pos_embed[:seq_len, :, :], essentially dealing with the variable number of frames.
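
Putting the three steps together, a rough sketch (placeholder sizes and tensor shapes; the class token handling is omitted):

import torch
import torch.nn as nn

hidden_size, max_num_frames = 2048, 128

# 1. Position embedding created once at a generous maximum length.
pos_embed = nn.Parameter(torch.rand(max_num_frames + 1, 1, hidden_size))

# 2. Pool only the spatial dimensions; None keeps the temporal dimension as-is.
pool = nn.AdaptiveAvgPool3d((None, 4, 4))

# Resnet3D-style output: [batch, hidden, temporal, height, width]
features = torch.rand(4, hidden_size, 2, 7, 7)
features = pool(features)                        # [4, 2048, 2, 4, 4]
features = features.flatten(2).permute(2, 0, 1)  # [seq_len = 32, batch, hidden]

# 3. Slice the position embedding to the actual sequence length.
seq_len = features.size(0)
features = features + pos_embed[:seq_len, :, :]
print(features.size())  # torch.Size([32, 4, 2048])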

@michael-camilleri
Author

michael-camilleri commented Jun 3, 2022

Hi Gorjan

Thanks for the above.

Right now, I am just doing the below:

_emb_sz = int(np.ceil(config.appearance_num_frames / 16))  # Temporal dimension out of the Resnet3D
self.pooler = nn.AdaptiveAvgPool3d((_emb_sz, 4, 4))

and then:

self.pos_embed = nn.Parameter(torch.zeros(_emb_sz * 16 + 1, 1, config.hidden_size))  # _emb_sz * 4 * 4 tokens + class token
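
For completeness, a rough sketch of how I expect these two pieces to fit together (illustrative shapes; the class token here is just a placeholder):

import numpy as np
import torch
import torch.nn as nn

appearance_num_frames, hidden_size, batch = 12, 2048, 4

_emb_sz = int(np.ceil(appearance_num_frames / 16))                # 1
pooler = nn.AdaptiveAvgPool3d((_emb_sz, 4, 4))
pos_embed = nn.Parameter(torch.zeros(_emb_sz * 16 + 1, 1, hidden_size))

features = pooler(torch.rand(batch, hidden_size, _emb_sz, 7, 7))  # [4, 2048, 1, 4, 4]
features = features.flatten(2).permute(2, 0, 1)                   # [16, 4, 2048]
cls_token = torch.zeros(1, batch, hidden_size)                    # placeholder class token
features = torch.cat([cls_token, features], dim=0)                # [17, 4, 2048]
features = features + pos_embed                                   # pos_embed length = _emb_sz * 16 + 1 = 17
print(features.size())                                            # torch.Size([17, 4, 2048])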

@gorjanradevski
Owner

This is fine too; I don't think there would be a big difference in performance between what you're doing and what I wrote above.
