From ViT models to audio

Hi Khaled,

In your code, it is possible to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with inputs of the size they were trained on (224*224 for example)? If so, how did you fine-tune on AudioSet a model that was initially trained on ImageNet (going from 224*224 to 128*998 for example)? Is this procedure implemented somewhere in your repo?
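To make the size mismatch concrete, here is a minimal sketch of the patch-grid arithmetic. It assumes non-overlapping 16x16 patches with no padding (as the "patch16" in the model name suggests); the helper name `patch_grid` is hypothetical and not taken from the repo.

```python
def patch_grid(height, width, patch=16, stride=16):
    """Return (rows, cols) of patches for an unpadded conv-style patch embedding."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows, cols


# ImageNet-style input: 224 x 224 pixels
image_grid = patch_grid(224, 224)

# Spectrogram input: 128 mel bins x 998 time frames
audio_grid = patch_grid(128, 998)

print("image grid:", image_grid, "tokens:", image_grid[0] * image_grid[1])
print("audio grid:", audio_grid, "tokens:", audio_grid[0] * audio_grid[1])
```

Under these assumptions the number and layout of patch tokens differ between the two inputs, so the pretrained positional embedding cannot be reused as-is.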

I read the AST paper, which I guess you took inspiration from, and they discuss this in some detail.
I was just wondering how I would reproduce the whole process (ImageNet -> AudioSet -> ESC50) on my end.
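For reference, this is roughly how I understand the adaptation step from the AST paper: average the patch-embedding weights over the three RGB channels to get a single-channel filter, and bilinearly interpolate the positional embedding to the new patch grid. The sketch below uses PyTorch with random tensors standing in for real weights; the function names, the embedding width of 192, and the single class token are my assumptions, and I have not checked whether the repo does it this way.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=1):
    """Bilinearly resize a (1, prefix + H*W, dim) positional embedding to a new grid."""
    prefix = pos_embed[:, :num_prefix_tokens]    # e.g. class token, kept unchanged
    patches = pos_embed[:, num_prefix_tokens:]   # (1, H*W, dim)
    dim = patches.shape[-1]
    # Lay the tokens out as a 2D map: (1, dim, H, W)
    patches = patches.reshape(1, old_grid[0], old_grid[1], dim).permute(0, 3, 1, 2)
    patches = F.interpolate(patches, size=new_grid, mode="bilinear", align_corners=False)
    # Flatten back to a token sequence: (1, H'*W', dim)
    patches = patches.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([prefix, patches], dim=1)


def rgb_to_mono_patch_weight(weight):
    """Average a (dim, 3, p, p) patch-embedding kernel over its input channels."""
    return weight.mean(dim=1, keepdim=True)


# Stand-ins for pretrained tensors (assumed shapes, random values)
pos = torch.randn(1, 1 + 14 * 14, 192)
new_pos = resize_pos_embed(pos, old_grid=(14, 14), new_grid=(8, 62))

kernel = torch.randn(192, 3, 16, 16)
mono_kernel = rgb_to_mono_patch_weight(kernel)

print(new_pos.shape, mono_kernel.shape)
```

The transformer blocks themselves do not depend on the sequence length, so only these two tensors (and the classification head) would need adapting before fine-tuning on AudioSet and then ESC50.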

Thanks a lot.

Antoine

From ViT models to audio #45

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

From ViT models to audio #45

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions