Description
System Info
I'm not sure if I've missed something in the code, but I can't seem to find where the CLS token is added. I have input data of shape (64, 45, 2, 32, 32) with tubelet_size = 5 and patch_size = 4, which results in a sequence length of 576. From my understanding, that is the total number of tubelets. After the data is passed through the embedding layer, the final embedding shape is (64, 576, 768), where 768 is the hidden size. However, shouldn't the dimensions be (64, 577, 768), since a CLS token should be added to the sequence?
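For reference, this is the arithmetic I'm using to arrive at 576 (a sketch of my assumption, not library code):

num_frames, tubelet_size = 45, 5
image_size, patch_size = 32, 4
num_patches_per_frame = (image_size // patch_size) ** 2   # 8 * 8 = 64 spatial patches
num_tubelets = num_frames // tubelet_size                 # 9 temporal slices
seq_length = num_tubelets * num_patches_per_frame         # 9 * 64 = 576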
Would be great to hear back soon, because I'm not sure whether I'm wrong or there is something wrong with the code.
Thanks!
@NielsRogge
Reproduction
import torch
from transformers import VideoMAEConfig, VideoMAEModel

pixel_values = torch.randn(1, 45, 2, 32, 32)

config = VideoMAEConfig()
config.num_frames = 45
config.image_size = 32
config.patch_size = 4
config.tubelet_size = 5
config.num_channels = 2

# expected sequence length: (frames / tubelet_size) * patches per frame
num_patches_per_frame = (config.image_size // config.patch_size) ** 2
seq_length = (config.num_frames // config.tubelet_size) * num_patches_per_frame
print(seq_length)  # 576

videomae = VideoMAEModel(config)
output = videomae(pixel_values, output_hidden_states=True)
sequence_output = output[0]
print(sequence_output.shape)
Expected behavior
seq_length = 576
sequence_output.shape = (1, 577, 768)
The embedding sequence length should be the total number of tubelets + 1 (to account for the CLS token).
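For comparison, this is roughly how ViT-style models prepend a learnable CLS token to the patch embeddings (a hypothetical sketch of what I expected, not the actual VideoMAE code):

import torch

batch_size, seq_length, hidden_size = 1, 576, 768
embeddings = torch.randn(batch_size, seq_length, hidden_size)  # tubelet embeddings
cls_token = torch.zeros(1, 1, hidden_size)                     # a learnable parameter in real models
cls_tokens = cls_token.expand(batch_size, -1, -1)              # broadcast to the batch
embeddings = torch.cat((cls_tokens, embeddings), dim=1)        # prepend along the sequence dim
print(embeddings.shape)  # torch.Size([1, 577, 768])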