What's the intuition behind using Conv filter on the top of Mel-log feature? #316

sgondala · 2022-10-13T21:38:01Z

Hey,

I have a question regarding the architecture. After the mel features are computed, we get an 80 x 3000 matrix, which we send through a couple of convolution filters to generate a 512 x 1500 matrix. We then send this matrix (along with positional embeddings) to the encoder.

What's the intuition behind using these conv filters?
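
For reference, the shapes in the question can be checked with the standard 1-D convolution output-length formula. This is only a rough sketch: it assumes two layers with kernel size 3 and padding 1, the first with stride 1 and the second with stride 2. Those settings are an assumption based on the released model code, not something stated in this thread, and `conv1d_out_len` is a hypothetical helper name.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


n_mels, n_frames, width = 80, 3000, 512

# Assumed layer settings: kernel 3, padding 1; stride 1, then stride 2.
after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)

# (channels, time) before the convs, after the first, and after the second.
shapes = [(n_mels, n_frames), (width, after_conv1), (width, after_conv2)]
print(shapes)  # [(80, 3000), (512, 3000), (512, 1500)]
```

Under these assumptions the first layer only widens the channels from 80 to 512, and the strided second layer is what halves the time axis from 3000 to 1500.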

Answered by jongwook

Oct 17, 2022


Arlen22 · 2022-10-16T12:13:00Z

I think Section 2.2 of the paper mentions these and references another paper. Maybe that gives you something to go on?

https://cdn.openai.com/papers/whisper.pdf

0 replies

FurkanGozukara · 2022-10-16T20:46:53Z

Possibly these numbers were simply tested and found to work well within the available hardware budget.

0 replies

jongwook · 2022-10-17T18:33:59Z

Maintainer

A rough intuition goes like:

The Mel spectrogram is a good representation of audio, and we need a way to upscale the 80 dimensions to the transformer model's width, like 512. Ideally, we want the coordinates of that 512-dimensional representation to be more or less independent of one another, and convolutional layers are a great way to learn such features from continuous inputs. We also reduced the context length from 3000 to 1500 while doing this, which is computationally advantageous because self-attention is O(L^2) in the sequence length L.

But given the success of Vision Transformers and Audio Spectrogram Transformers, all this might not be necessary for larger models.
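
To put a number on the O(L^2) point, here is a back-of-the-envelope sketch. It counts only the multiply-accumulates for the attention score matrix (Q times K transposed) in a single layer, summed over all heads, and ignores everything else in the model. `attention_score_macs` is a hypothetical helper, and the width of 512 is just the example value used above.

```python
def attention_score_macs(seq_len, width):
    """Multiply-accumulates for Q @ K^T, summed over all heads: L * L * d."""
    return seq_len * seq_len * width


full = attention_score_macs(3000, 512)    # no temporal downsampling
halved = attention_score_macs(1500, 512)  # after halving the context length

ratio = full / halved
print(full, halved, ratio)  # 4608000000 1152000000 4.0
```

Halving the context length cuts this term by a factor of four, which is why the downsampling matters more for attention than for the parts of the model that scale linearly with L.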

4 replies

Majdoddin Dec 29, 2022

Thank you for the explanation and the references. Don't you think that even the Mel spectrogram is unnecessary, and that the transformer could be trained directly on raw waveform data, maybe after some scaling?

ozancaglayan Jan 7, 2023

I think encoding raw audio purely with stacks of conv layers is slower and takes significant processing time compared to using spectral features, which are cheaper to extract. There is the MelHuBERT paper, which replaced the raw-audio encoder with Mel-based features to save around 30% of computation time. But I am curious whether this was the motivation for Whisper as well.

fleek Jan 7, 2023

I think there is a newer technique that uses FFT, whereas Whisper is currently still attention-based. I think we will need to wait for a Whisper 2 to come along.

fleek Jan 7, 2023

I think improving further on Whisper would require regressive behaviour, because the sliding window it is currently trained with is 30 seconds long. A lot of words can be said in 30 seconds. If a window cannot be matched to a given confidence level, halving the window at each iteration may yield more accurate results, albeit with longer processing times.
