Position Embedding #6
Hi Gorjan, did you have time to look at the above? |
Will do so on Monday. Sorry for the delay. |
No worries. Thanks. Looking forward. |
Hi Gorjan Sorry to bug you again. Do you have any update on the above? Thanks |
Hi Michael, I had time to look at this today.
|
Hi Gorjan, with regards to CACNF: if I use your code with the following configuration,
I end up with the following error at line 267. I found a fix for this through the following calculation:
Note that the 1280/720 is due to the size of the images in my data. |
I understand where the issues come from. An easy, straightforward fix would be to perform pooling right after the feature extraction, to ensure that the model works regardless of the frame size (as long as it's larger than 112). With this, you should be fine inputting frames of any size to the model (again, as long as they are > 112); test below:

```python
import torch

from modelling.models import TransformerResnet
from modelling.configs import AppearanceModelConfig

config = AppearanceModelConfig(
    appearance_num_frames=32,
    resnet_model_path="../models/renamed_models/r3d50_KMS_200ep.pth",
    num_classes=10,
)
model = TransformerResnet(config)
with torch.no_grad():
    output = model({"video_frames": torch.rand(4, 3, 32, 200, 200)})
print(output["resnet3d"].size())
```

Note that the code above assumes that the number of frames is > 32. |
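The effect of pooling right after feature extraction can be illustrated with plain shape arithmetic (a sketch of the idea only; the stride value and function names below are assumptions, not the repository's actual code). Without pooling, the number of spatial tokens grows with the frame size, so a 1280x720 input produces a different token count than a 112x112 one; an adaptive pool collapses any input to a fixed grid:

```python
import math

RESNET_SPATIAL_STRIDE = 32  # hypothetical overall spatial stride of the 3D ResNet backbone

def feature_map_side(frame_side: int) -> int:
    """Spatial side of the backbone's output feature map (sketch)."""
    return math.ceil(frame_side / RESNET_SPATIAL_STRIDE)

def tokens_without_pooling(height: int, width: int) -> int:
    # Token count depends on the input frame size.
    return feature_map_side(height) * feature_map_side(width)

def tokens_with_adaptive_pooling(height: int, width: int, out_side: int = 4) -> int:
    # AdaptiveAvgPool-style: the output grid is fixed regardless of input size.
    return out_side * out_side

print(tokens_without_pooling(112, 112))        # 16
print(tokens_without_pooling(720, 1280))       # 23 * 40 = 920 -> mismatch
print(tokens_with_adaptive_pooling(720, 1280)) # 16 -> always matches
```

This is why the dimension mismatch at line 267 only appears for non-default frame sizes: the position embedding table was sized for the default token count.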
Hi Gorjan, that seems to have solved part of the problem indeed; however, there is still an issue with the dependence on the number of frames. Obviously, it works as expected with the default configuration. Shouldn't the ResNet output dimension also change with appearance_num_frames? Or is there something wrong in the addition? |
I did some further investigation. Basically, the temporal dimension of the ResNet output [Batch, Hidden, Temporal, Spatial, Spatial] is always the same, regardless of the number of input frames. However, I also noticed that the position embedding is never initialised, i.e. it is always zero. Is it encoding the position of the frame? |
Hi Michael, what you found w.r.t. the ResNet is definitely expected, and it occurs because of the pooling taking place in the ResNet. The position embedding is zero-initialized, but it is learned during training. You might want to try random initialization:

```python
self.pos_embed = nn.Parameter(torch.rand(config.appearance_num_frames + 1, 1, config.hidden_size))
```

but I don't think you'll see much improvement there. Besides, the solution I was able to come up with w.r.t. the variable number of frames can be something like:
|
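A common pattern for handling a variable number of frames with a learned position embedding (a sketch of the general idea, not the repository's code; the names and sizes below are made up for illustration) is to allocate the embedding table for the maximum expected length and slice it to the actual sequence length before the addition:

```python
import random

MAX_FRAMES = 32
HIDDEN = 8  # tiny hidden size, just for illustration

# In practice this is a learned nn.Parameter; here it is a plain list of
# randomly initialised rows, one per frame plus one for the CLS token.
pos_embed = [[random.uniform(-0.02, 0.02) for _ in range(HIDDEN)]
             for _ in range(MAX_FRAMES + 1)]

def add_positions(features):
    """Add position rows to a (seq_len, HIDDEN) list of feature vectors.

    seq_len may be anything up to MAX_FRAMES + 1: the embedding table is
    sliced instead of assuming a fixed number of frames.
    """
    seq_len = len(features)
    assert seq_len <= MAX_FRAMES + 1
    return [[f + p for f, p in zip(frame, pos)]
            for frame, pos in zip(features, pos_embed[:seq_len])]

short_clip = [[0.0] * HIDDEN for _ in range(9)]  # CLS token + 8 frames
out = add_positions(short_clip)
print(len(out), len(out[0]))  # 9 8
```

With this, shorter clips simply use a prefix of the table, so no dimension mismatch occurs when the frame count changes.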
Hi Gorjan, thanks for the above. Right now, I am just doing the below:
and then:
|
This is fine too; I don't think there would be a big difference in performance between what you're doing and what I wrote above. |
As I was tweaking the code, I found some design choices in the position embeddings which I cannot understand.

For the STLT, it seems that the position embedding vector does not take the number of sampled frames into consideration. Specifically, although it does use the config parameter config.layout_num_frames, this is never set in train.py (lines 87-96) or inference.py (lines 47-54). This means that the position embedding vector is always of size 256 (configs.py line 109). Is there a reason for this? I have changed this in my code, ensuring that the position embedding is always of size config.layout_num_frames + 1 - does this make sense?

For CACNF, the position embedding seems to depend on the batch size and the image size itself for the summation with the features to work (models.py line 267). If I change config.spatial_size or the batch size to less than 16, there is a dimension mismatch. I hacked around this by computing some intermediary values in my code, but it is somewhat hard-coded all the same, based on my data. Is there a reason for this dependency/architecture?
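The CACNF mismatch described above can be made concrete with a small shape sketch (assumptions only: the stride value, the default spatial size, and the function names below are hypothetical, not taken from the repository). If the position embedding table is sized for the token count of one particular spatial_size, any other spatial_size yields a different token count and the summation fails:

```python
STRIDE = 16  # hypothetical overall spatial stride of the CNN backbone

def num_spatial_tokens(spatial_size: int) -> int:
    # Number of spatial positions after the backbone downsamples the image.
    side = spatial_size // STRIDE
    return side * side

# Table sized once, for an assumed default spatial_size of 224.
POS_EMBED_LEN = num_spatial_tokens(224)

for s in (224, 112):
    tokens = num_spatial_tokens(s)
    status = "matches" if tokens == POS_EMBED_LEN else "dimension mismatch"
    print(s, tokens, status)
```

Sizing the table from the config (or slicing/interpolating it to the actual token count) would remove this implicit dependency on the image size.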