Description
System Info
I'm not sure if I've missed something in the code, but I can't seem to find where the CLS token is added. I have input data of shape (64, 45, 2, 32, 32) with tubelet_size = 5 and patch_size = 4, which results in a sequence length of 576. From my understanding, that is the total number of tubelets. After the data is passed through the embedding layer, the final embedding shape is (64, 576, 768), where 768 is the hidden size. However, shouldn't the dimensions be (64, 577, 768), since a CLS token should be added to the sequence?
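For reference, this is the arithmetic I'm using to arrive at 576 (a sketch of my assumption, not library code):

num_frames, tubelet_size = 45, 5
image_size, patch_size = 32, 4
num_patches_per_frame = (image_size // patch_size) ** 2   # 8 * 8 = 64 spatial patches
num_tubelets = num_frames // tubelet_size                 # 9 temporal slices
seq_length = num_tubelets * num_patches_per_frame         # 9 * 64 = 576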
Would be great to hear back soon, because I'm not sure whether I'm wrong or there is something wrong with the code.
Thanks!
@NielsRogge
Reproduction
import torch
from transformers import VideoMAEConfig, VideoMAEModel

pixel_values = torch.randn(1, 45, 2, 32, 32)

config = VideoMAEConfig()
config.num_frames = 45
config.image_size = 32
config.patch_size = 4
config.tubelet_size = 5
config.num_channels = 2

# expected sequence length: (frames / tubelet_size) * patches per frame
num_patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * num_patches_per_frame
print(seq_length)  # 576

videomae = VideoMAEModel(config)
output = videomae(pixel_values, output_hidden_states=True)
sequence_output = output[0]
print(sequence_output.shape)
Expected behavior
seq_length = 576
sequence_output.shape = (1, 577, 768)
The embedding sequence length should be the total number of tubelets + 1 (to account for the CLS token).
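For comparison, this is roughly how ViT-style models prepend a learnable CLS token to the patch embeddings (a hypothetical sketch of what I expected, not the actual VideoMAE code):

import torch

batch_size, seq_length, hidden_size = 1, 576, 768
embeddings = torch.randn(batch_size, seq_length, hidden_size)  # tubelet embeddings
cls_token = torch.zeros(1, 1, hidden_size)                     # a learnable parameter in real models
cls_tokens = cls_token.expand(batch_size, -1, -1)              # broadcast to the batch
embeddings = torch.cat((cls_tokens, embeddings), dim=1)        # prepend along the sequence dim
print(embeddings.shape)  # torch.Size([1, 577, 768])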