Position Embedding #6

Closed
michael-camilleri opened this issue May 10, 2022 · 12 comments

@michael-camilleri

As I was tweaking the code, I found some design choices in the position embeddings which I cannot understand.

  1. For the STLT, it seems that the position embedding vector does not take into account the number of sampled frames. Specifically, although it does use the config parameter config.layout_num_frames, this is never set in train.py (lines 87-96) or inference.py (lines 47-54), which means that the position embedding vector is always of size 256 (configs.py line 109). Is there a reason for this? I have changed this in my code to ensure that the position embedding is always of size config.layout_num_frames + 1 - does this make sense?

  2. For CACNF, the position embedding seems to depend on the batch size and on the image size itself for the summation with the features to work (models.py line 267). If I change config.spatial_size, or set the batch size to less than 16, then there is a dimension mismatch. I hacked around this by computing some intermediary values in my code, but the result is still somewhat hard-coded to my data. Is there a reason for this dependency/architecture?

@michael-camilleri
Author

Hi Gorjan

Did you have some time to look at the above?

@gorjanradevski
Owner

Will do so on Monday. Sorry for the delay.

@michael-camilleri
Author

No worries, thanks. Looking forward to it.

@michael-camilleri
Author

Hi Gorjan

Sorry to bug you again. Do you have any update on the above?

Thanks

@gorjanradevski
Owner

Hi Michael, I had time to look at this today.

  1. The position embedding vector in STLT is always created with the maximum size of 256, but it is sliced in the forward pass to select only the appropriate number of frames in the batch. See L105 in models.py; a minimal sketch of the idea follows after this list.

  2. In CACNF, the position embedding is an absolute position embedding inferred from the "fixed" number of frames you want to sample during training/inference. It is not dependent on the batch size. In general, you should be able to change both the batch size and the spatial size as you wish; however, I think you would have an issue with spatial sizes smaller than 112, but that issue would appear in the ResNet, I believe. Feel free to post the error you're getting in this issue and I'll have a look.
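
A rough sketch of the slicing idea for point 1 (the names, sizes, and module structure are illustrative, not the exact code in models.py):

import torch
import torch.nn as nn

class PositionEmbeddingSketch(nn.Module):
    def __init__(self, max_positions: int = 256, hidden_size: int = 768):
        super().__init__()
        # The buffer is created once at the maximum length...
        self.register_buffer("position_ids", torch.arange(max_positions).expand((1, -1)))
        self.embeddings = nn.Embedding(max_positions, hidden_size)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # ...and sliced here to the actual sequence length of the batch.
        seq_len = frame_features.size(1)  # frame_features: [batch, seq_len, hidden]
        position_ids = self.position_ids[:, :seq_len]
        return frame_features + self.embeddings(position_ids)

embedder = PositionEmbeddingSketch()
print(embedder(torch.rand(2, 17, 768)).size())  # e.g. 16 frames + class token -> torch.Size([2, 17, 768])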

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

Hi Gorjan

So, with regard to STLT, that is in fact what I understood. In my branch I am simply sizing the embedding vector according to layout_num_frames to save space:
self.register_buffer("position_ids", torch.arange(config.layout_num_frames+1).expand((1, -1)))

With regards to CACNF, if I use your code, with the following configuration,

appearance_num_frames = 12
frame_size = 200 # This is a new config parameter I introduced to control the frame size
batch_size = 1 # Had to do this to run on my machine; in reality, I am able to run on a cluster with a batch size of 5, but I get the same error.

I end up with the following error for line 267 features = features + self.pos_embed:
The size of tensor a (85) must match the size of tensor b (13) at non-singleton dimension 0

I had found a fix around this through the following calculation:

_frames = 1 if config.appearance_num_frames < 16 else 2
_height = np.ceil(config.spatial_size / 32)
_width = np.ceil(1280/720 * config.spatial_size / 32)
_size = int(_frames * _width * _height + 1)

Note that the 1280/720 is due to the size of the images in my data.
In particular, note that the dependence is on whether the number of appearance frames is below or above 16, rather than on the value itself.
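
For concreteness, plugging the configuration above into this calculation (assuming config.spatial_size = 200 here, matching frame_size) reproduces the 85 in the error message, versus the position embedding length of 12 + 1 = 13:

import numpy as np

_frames = 1 if 12 < 16 else 2               # appearance_num_frames = 12 -> 1
_height = np.ceil(200 / 32)                 # -> 7
_width = np.ceil(1280 / 720 * 200 / 32)     # -> 12
print(int(_frames * _width * _height + 1))  # -> 85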

@gorjanradevski
Owner

gorjanradevski commented May 31, 2022

I understand where the issues come from. The Resnet3D internals, as well as the code I wrote on top of it, assume that the frame size is 112 x 112 -- which it ideally should be, as the Resnet3D has been (pre-)trained with frames of that size. Now, when the frame size is different from 112 x 112, the output from the Resnet3D will be different too: https://github.com/gorjanradevski/revisiting-spatial-temporal-layouts/blob/main/src/modelling/models.py#L256

An easy, straightforward fix would be to perform pooling right after the feature extraction, to ensure that regardless of the frame size (as long as it's larger than 112 x 112), the output is the same as if the frame size were 112 x 112. Add the following after L256: features = nn.AdaptiveAvgPool3d((2, 4, 4))(features).

With this, you should be fine inputting frames of any size to the model (again, as long as they are > 112); test below:

import torch

from modelling.models import TransformerResnet
from modelling.configs import AppearanceModelConfig

config = AppearanceModelConfig(appearance_num_frames=32,
                               resnet_model_path="../models/renamed_models/r3d50_KMS_200ep.pth",
                               num_classes=10)
model = TransformerResnet(config)

with torch.no_grad():
    output = model({"video_frames": torch.rand(4, 3, 32, 200, 200)})
    print(output["resnet3d"].size())

Note that the code above makes the assumption that the number of frames is > 32.

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

Hi Gorjan

That seems to have solved part of the problem indeed; however, there is still the issue of the dependence on appearance_num_frames (as expected):

When appearance_num_frames = 12
I get:

RuntimeError: The size of tensor a (33) must match the size of tensor b (13) at non-singleton dimension 0

When appearance_num_frames = 16

RuntimeError: The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 0

Obviously, it works as expected with appearance_num_frames = 32, but it fails again with appearance_num_frames = 33: i.e. only appearance_num_frames = 32 is supported.

Shouldn't the resnet output dimension also change with appearance_num_frames? Or is there something wrong in the addition?

@michael-camilleri
Author

michael-camilleri commented May 31, 2022

I did some further investigation:

Basically, the Temporal dimension of the ResNet output [Batch, Hidden, Temporal, Spatial, Spatial] is always ceil(appearance_num_frames / 16)
(e.g. if appearance_num_frames <= 16, then Temporal = 1; if 16 < appearance_num_frames <= 32, then Temporal = 2; etc.).
Is this expected?
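
Spelled out, the pattern I am observing (just tabulating the relationship above):

import math

for n in (12, 16, 17, 32, 33, 48):
    print(n, math.ceil(n / 16))  # 12 -> 1, 16 -> 1, 17 -> 2, 32 -> 2, 33 -> 3, 48 -> 3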

However, I also noticed that the position embedding is never explicitly initialised, i.e. it starts as all zeros. Is it encoding the position of the frame?

@gorjanradevski
Owner

gorjanradevski commented Jun 2, 2022

Hi Michael,

What you found w.r.t. the Resnet is definitely expected, and it occurs because of the pooling taking place in the Resnet. The position embedding is zero-initialized, but it is learned during training. You might want to try random initialization

self.pos_embed = nn.Parameter(torch.rand(config.appearance_num_frames + 1, 1, config.hidden_size))

but I don't think you'll see much improvement there. Besides that, the solution I came up with for handling a variable number of frames is something like the following (a rough sketch is given after this list):

  • Create self.pos_embed as self.pos_embed = nn.Parameter(torch.rand(max_num_frames + 1, 1, config.hidden_size)). Note that max_num_frames is a misnomer here; nevertheless, set it to something like 128.
  • Perform average pooling to deal with the variable frame size: features = nn.AdaptiveAvgPool3d((None, 4, 4))(features). Note the None, meaning the temporal dimension will be kept the same.
  • Combine self.pos_embed with features by slicing: features = features + self.pos_embed[:seq_len, :, :], essentially dealing with the variable number of frames.
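
Putting the three steps together, a rough sketch (placeholder sizes and tensor shapes; the class token handling is omitted):

import torch
import torch.nn as nn

hidden_size, max_num_frames = 2048, 128

# 1. Position embedding created once at a generous maximum length.
pos_embed = nn.Parameter(torch.rand(max_num_frames + 1, 1, hidden_size))

# 2. Pool only the spatial dimensions; None keeps the temporal dimension as-is.
pool = nn.AdaptiveAvgPool3d((None, 4, 4))

# Resnet3D-style output: [batch, hidden, temporal, height, width]
features = torch.rand(4, hidden_size, 2, 7, 7)
features = pool(features)                        # [4, 2048, 2, 4, 4]
features = features.flatten(2).permute(2, 0, 1)  # [seq_len = 32, batch, hidden]

# 3. Slice the position embedding to the actual sequence length.
seq_len = features.size(0)
features = features + pos_embed[:seq_len, :, :]
print(features.size())  # torch.Size([32, 4, 2048])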

@michael-camilleri
Author

michael-camilleri commented Jun 3, 2022

Hi Gorjan

Thanks for the above.

Right now, I am just doing the below:

_emb_sz = int(np.ceil(config.appearance_num_frames / 16))  # Temporal dimension out of the Resnet3D
self.pooler = nn.AdaptiveAvgPool3d((_emb_sz, 4, 4))

and then:

self.pos_embed = nn.Parameter(torch.zeros(_emb_sz * 16 + 1, 1, config.hidden_size))  # _emb_sz * 4 * 4 tokens + class token
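
For completeness, a rough sketch of how I expect these two pieces to fit together (illustrative shapes; the class token here is just a placeholder):

import numpy as np
import torch
import torch.nn as nn

appearance_num_frames, hidden_size, batch = 12, 2048, 4

_emb_sz = int(np.ceil(appearance_num_frames / 16))                # 1
pooler = nn.AdaptiveAvgPool3d((_emb_sz, 4, 4))
pos_embed = nn.Parameter(torch.zeros(_emb_sz * 16 + 1, 1, hidden_size))

features = pooler(torch.rand(batch, hidden_size, _emb_sz, 7, 7))  # [4, 2048, 1, 4, 4]
features = features.flatten(2).permute(2, 0, 1)                   # [16, 4, 2048]
cls_token = torch.zeros(1, batch, hidden_size)                    # placeholder class token
features = torch.cat([cls_token, features], dim=0)                # [17, 4, 2048]
features = features + pos_embed                                   # pos_embed length = _emb_sz * 16 + 1 = 17
print(features.size())                                            # torch.Size([17, 4, 2048])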

@gorjanradevski
Owner

This is fine too; I don't think there would be a big difference in performance between what you're doing and what I wrote above.
