Skip to content
Discussion options

You must be logged in to vote

A rough intuition goes like:

The Mel spectrogram is a good representation of audio, and we need a way to upscale the 80 dimensions to the transformer model's width, like 512. Ideally, we want each coordinate of that 512-dimensional distribution to be more or less independent from each other, and convolutional layers are a great way to learn such features from continuous inputs. We also reduced the context length from 3000 to 1500 while doing this, which is computationally advantageous because self-attention is O(L^2).

But given the success of Vision Transformers and Audio Spectrogram Transformers, all this might not be necessary for larger models.

Replies: 3 comments 4 replies

Comment options

You must be logged in to vote
0 replies
Comment options

You must be logged in to vote
0 replies
Comment options

You must be logged in to vote
4 replies
@Majdoddin
Comment options

@ozancaglayan
Comment options

@fleek
Comment options

@fleek
Comment options

Answer selected by jongwook
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Category
Q&A
Labels
None yet
7 participants