How does the cross attention work? #943

SinanAkkoyun · 2023-02-08T17:13:21Z

SinanAkkoyun
Feb 8, 2023

Hello,
I am trying to understand the autoregressiveness of the decoder.
Regarding the audio feature cross-attention: in every iteration of the for loop (decoder.py, line 595, _main_loop), does each cross-attention step take ALL audio features (xa) at once, or does it only take specific samples?

I want to understand how parallel/serial the whole approach is. I highly appreciate any answer!

SinanAkkoyun · 2023-02-09T01:01:01Z

SinanAkkoyun
Feb 9, 2023
Author

@jongwook I hope the ping is ok

Forget what I asked above, this is the right question:
Suppose the for loop is at the 2s position of a 5s wav file: it uses the audio_features of all 5s for prediction at every step, right? So, does predicting the token at the 2s mark also take future audio_features into consideration?
What would happen if the decoder only had access to the current audio_features?
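
For reference, a minimal sketch of the point in question, not Whisper's actual code. In the public Whisper implementation the decoder's cross-attention over the audio features (xa) is applied without a mask, so every decoding step sees all encoder frames, including those after the current position. The toy single-head attention below is plain Python with made-up values and no learned projections (both are assumptions for illustration). It shows that changing only the last frame changes the output for a token near the 2s mark.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, audio_features):
    """Single-head attention of one decoder query over ALL audio frames.

    No mask is applied, so frames that lie "in the future" relative to the
    token being decoded still receive attention weight. For simplicity the
    frames act as both keys and values (no learned projections).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, frame))
              for frame in audio_features]
    weights = softmax(scores)
    dim = len(audio_features[0])
    output = [sum(w * frame[d] for w, frame in zip(weights, audio_features))
              for d in range(dim)]
    return output, weights


# Five toy "audio frames" standing in for 5 s of encoder output (hypothetical).
xa = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.9], [0.7, 0.3]]
query = [1.0, 0.5]  # toy decoder state for a token spoken around the 2 s mark

out_before, weights = cross_attention(query, xa)

# Change only the LAST frame (the "future") and decode the same token again.
xa_changed = [frame[:] for frame in xa]
xa_changed[-1] = [-0.7, 2.0]
out_after, _ = cross_attention(query, xa_changed)

print(weights)                  # every frame gets a non-zero weight
print(out_before != out_after)  # the future frame influenced the output
```

Restricting the decoder to "current" audio only would amount to masking some of these weights to zero, which is a different attention pattern from the one sketched here.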

8 replies

SinanAkkoyun Feb 9, 2023
Author

Awesome. Once we have hopefully gathered some more intel, we could work on that. I've been working on it, along with an interface, for some time and plan to open a PR when it's done.

Yes, masked attention will be necessary, but it will also require retraining... Regarding the relative positional embedding, why would that be needed? The sinusoidal positional encoding does scale from the beginning of a sentence, no?

atyshka Feb 9, 2023

I guess the question is how would you handle sequences longer than 30 seconds? I’d envision chopping off the previous audio so you have a constant window of 30s audio. But if you do that, the sinusoidal position embeddings no longer start at 0… I’m not sure if that would be a problem or not
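
To make the concern concrete, here is a small sketch, assuming the embedding index is simply a frame's offset inside the current window (as when a positional table is added per frame of the input). The function names and the 8-channel size are hypothetical and chosen only for illustration. Under that assumption the index restarts at 0 after chopping, so the same speech lands on a different position each time the window moves.

```python
import math


def sinusoid(position, channels=8, max_timescale=10000.0):
    """Sinusoidal embedding for one position: sines first, then cosines."""
    half = channels // 2
    increment = math.log(max_timescale) / (half - 1)
    inv_timescales = [math.exp(-increment * i) for i in range(half)]
    return ([math.sin(position * t) for t in inv_timescales]
            + [math.cos(position * t) for t in inv_timescales])


def window_positions(window_start_frame, num_frames):
    """Map absolute frame numbers to the index used for the embedding.

    Assumption: the index is the offset inside the current window, so it
    restarts at 0 whenever earlier audio is chopped off.
    """
    return {window_start_frame + i: i for i in range(num_frames)}


first = window_positions(0, 4)    # window at the start of the stream
later = window_positions(100, 4)  # window after 100 frames were dropped

print(first)
print(later)
# The first frame of either window receives the position-0 embedding.
print(sinusoid(first[0]) == sinusoid(later[100]))
```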

SinanAkkoyun Feb 10, 2023
Author

That's no problem. No single sentence is more than 30 seconds long. My plan is to use VAD to detect sentence boundaries and reset the positional embeddings for every sentence. Contextual robustness at that level is not needed (inter-sentence context is negligible for real-time detection).

SinanAkkoyun Feb 10, 2023
Author

The bigger problem lies within the architecture and retraining... The whole decoder cross-attention needs to be replaced and the decoder retrained...

atyshka Feb 10, 2023

The encoder also needs to be retrained, right? But yeah, the bigger problem is there's no code provided for training from scratch, and I'm not sure how many GPU-hours a model like this would take to train.

SinanAkkoyun · 2023-02-12T22:10:05Z

SinanAkkoyun
Feb 12, 2023
Author

@jongwook Are there any resources on how to retrain the decoder (instead of fine-tuning it)? Or do you have an idea of how to solve the above without major retraining?

3 replies

jongwook Feb 15, 2023
Maintainer

The decoder is autoregressive while the encoder is not. The encoder needs only one forward pass, so running the encoder on every slice using a sliding 30s window (with ~1s stride) is not a big performance hit, relatively speaking. A more disciplined way is to use a Transducer model like this, but it'd be an entirely different architecture and would probably require training from scratch.
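
A rough sketch of the sliding-window schedule described above, under stated assumptions: the helper name `sliding_windows` and its parameters are hypothetical, and the actual encoder and decoder calls are left out. It only computes which slice of the stream would be re-encoded after each ~1s stride.

```python
def sliding_windows(total_s, window_s=30.0, stride_s=1.0):
    """Yield (start_s, end_s) of the trailing audio window after each stride.

    Early windows are shorter than window_s because less audio has arrived
    yet; once the stream exceeds window_s, the oldest audio is dropped.
    """
    steps = int(total_s // stride_s)
    for i in range(1, steps + 1):
        end = i * stride_s
        start = max(0.0, end - window_s)
        yield (start, end)


# 33 s of streamed audio, re-encoded once per second.
windows = list(sliding_windows(total_s=33.0))

print(len(windows))  # one encoder pass per stride
print(windows[0])    # first, short window
print(windows[29])   # first full 30 s window
print(windows[-1])   # window has slid forward by 3 s
```

Each yielded slice would get one encoder forward pass followed by autoregressive decoding. Consecutive windows overlap heavily, so most of the audio is encoded many times, which is the cost this approach accepts.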

SinanAkkoyun Feb 16, 2023
Author

I see, thank you very much. Do you mean a different architecture than Whisper, or than the NVIDIA model?

jongwook Feb 16, 2023
Maintainer

The linked NVIDIA model is an example of a transducer model, which works differently (while still being transformer-based) from Whisper, which is an encoder-decoder model. This blog article is a great introduction to how transducer models work.
