VideoMAE missing CLS tokens in embedding #21016

@z5163449

Description

System Info

I'm not sure if I've missed something in the code, but I can't find where the CLS token is added. I have input data of shape (64, 45, 2, 32, 32) with tubelet_size = 5 and patch_size = 4, which results in a sequence length of 576; from my understanding, that is the total number of tubelets. After the data is passed through the embedding layer, the final embedding shape is (64, 576, 768), where 768 is the hidden size. However, shouldn't the dimensions be (64, 577, 768), since a CLS token should be prepended to the sequence?
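For context, this is the ViT-style CLS prepend I was expecting to find somewhere in the embedding layer. It's a minimal sketch of the usual pattern (not VideoMAE's actual code); the shapes just match my input:

import torch
import torch.nn as nn

batch_size, seq_length, hidden_size = 64, 576, 768
embeddings = torch.randn(batch_size, seq_length, hidden_size)

# learnable [CLS] token, broadcast across the batch and prepended
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
cls_tokens = cls_token.expand(batch_size, -1, -1)        # (64, 1, 768)
embeddings = torch.cat((cls_tokens, embeddings), dim=1)  # (64, 577, 768)
print(embeddings.shape)  # torch.Size([64, 577, 768])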

It would be great to hear back soon, because I'm not sure if I'm wrong or if there is something wrong with the code.

Thanks!
@NielsRogge

Reproduction

import torch
from transformers import VideoMAEConfig, VideoMAEModel

pixel_values = torch.randn(1, 45, 2, 32, 32)

config = VideoMAEConfig()
config.num_frames = 45
config.image_size = 32
config.patch_size = 4
config.tubelet_size = 5
config.num_channels = 2

# total number of tubelets: (45 // 5) * (32 // 4) ** 2 = 9 * 64 = 576
num_patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * num_patches_per_frame
print(seq_length)  # 576 (an int, so there is no .shape here)

videomae = VideoMAEModel(config)
output = videomae(pixel_values, output_hidden_states=True)
sequence_output = output[0]
print(sequence_output.shape)  # torch.Size([1, 576, 768])

Expected behavior

seq_length = 576
sequence_output.shape = (1, 577, 768)
The embedding sequence length should be the total number of tubelets + 1 (for the CLS token).
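For concreteness, the arithmetic behind those expected numbers, just restating the config values above:

num_frames, tubelet_size = 45, 5
image_size, patch_size = 32, 4

num_patches_per_frame = (image_size // patch_size) ** 2              # 8 ** 2 = 64
num_tubelets = (num_frames // tubelet_size) * num_patches_per_frame  # 9 * 64 = 576
print(num_tubelets, num_tubelets + 1)  # 576 577 (577 if a CLS token were prepended)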
