-
|
Hey, I have a question regarding the architecture. After computing the mel features we get an 80 x 3000 matrix, which we send through a couple of convolution filters to generate a 512 x 1500 matrix. We then send this (along with positional embeddings) to the encoder. What's the intuition behind using these conv filters? |
Replies: 3 comments 4 replies
-
|
Section 2.2 of the paper mentions these I think, and references another paper. Maybe that gives you something? |
-
|
Possibly these numbers were just tested and found to work well within the available hardware budget. |
-
|
A rough intuition goes like this: the Mel spectrogram is a good representation of audio, and we need a way to upscale the 80 dimensions to the transformer model's width, like 512. Ideally, we want each coordinate of that 512-dimensional representation to be more or less independent of the others, and convolutional layers are a great way to learn such features from continuous inputs. We also reduced the context length from 3000 to 1500 while doing this, which is computationally advantageous because self-attention is O(L^2) in the context length L, so halving L cuts that cost to roughly a quarter. But given the success of Vision Transformers and Audio Spectrogram Transformers, all this might not be necessary for larger models. |
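To make the shapes and the O(L^2) point concrete, here is a minimal sketch in plain Python. It only does the length arithmetic, not the actual convolutions. The stem hyperparameters (two 1-D convolutions with kernel size 3 and padding 1, the second with stride 2) are an assumption on my part and are not stated in this thread, so check them against the paper or the code before relying on them. The helper name `conv1d_out_len` is made up for illustration.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Sizes taken from the question above.
n_mels, n_frames, width = 80, 3000, 512

# ASSUMED stem: conv(k=3, stride=1, pad=1) followed by conv(k=3, stride=2, pad=1).
len1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
len2 = conv1d_out_len(len1, kernel_size=3, stride=2, padding=1)

shape_in = (n_mels, n_frames)   # channels x time going into the stem
shape_out = (width, len2)       # channels x time going into the encoder

# Self-attention cost grows with the square of the sequence length,
# so compare the cost at the original length with the downsampled one.
attention_cost_ratio = (n_frames ** 2) / (len2 ** 2)

print(shape_in, "->", shape_out)
print("attention cost ratio:", attention_cost_ratio)
```

Under those assumptions the first convolution keeps the length at 3000 while widening 80 channels to 512, and the strided second one halves the length to 1500, which is where the roughly 4x saving in attention cost comes from.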