Making a batch from samples with different number of tokens #753

Majdoddin · 2022-12-26T20:42:14Z

To fine-tune Whisper's decoder, I need to build batches of training samples. Each sample has an audio_features tensor, which is the output of the encoder, and a tokens tensor.
All audio_features tensors have the same length (30 s window), but each sample can have a different number of tokens, and here is the problem: all the tokens tensors have to be packed into one batch tensor to be passed as an argument to model.decoder() (which maps to TextDecoder.forward(), which in turn calls torch.nn.functional.embedding). But tensors of different lengths can't be packed into one tensor :(

@jongwook How did you do it during training, please?
Is there some "neutral" token to pad all tensors to the same size?
Or should I change the codebase to do the embedding sample by sample?

A similar problem is discussed here.

Answered by jongwook

jongwook · 2023-01-08T12:49:06Z

Maintainer

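You can pad the tokens tensor (to the right) with any token. The choice does not affect training, because the autoregressive attention mask used in the decoder keeps each position from attending to the positions after it, so the real tokens never see the padding that follows them. It is also important that the loss is masked accordingly, so that the padded positions do not contribute to it.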
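A minimal sketch of the padding and loss masking described above, in PyTorch. The names `PAD_ID`, `IGNORE_INDEX`, `collate_tokens`, and `masked_loss` are hypothetical helpers made up for illustration and are not part of the Whisper codebase; the random logits stand in for the decoder output. It assumes each sample's tokens tensor is a 1-D LongTensor and that the usual next-token shift between inputs and targets is wanted.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0           # hypothetical filler; any valid token id should work
IGNORE_INDEX = -100  # default ignore_index of F.cross_entropy


def collate_tokens(token_seqs):
    """Right-pad 1-D LongTensors into (batch, max_len - 1) inputs and targets."""
    # Decoder inputs: every token except the last one of each sequence.
    inputs = pad_sequence(
        [t[:-1] for t in token_seqs], batch_first=True, padding_value=PAD_ID
    )
    # Targets: the same sequences shifted by one; padded slots get IGNORE_INDEX.
    targets = pad_sequence(
        [t[1:] for t in token_seqs], batch_first=True, padding_value=IGNORE_INDEX
    )
    return inputs, targets


def masked_loss(logits, targets):
    """Cross-entropy over (batch, seq, vocab) logits, skipping padded targets."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )


seqs = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6])]
inputs, targets = collate_tokens(seqs)

# Stand-in for the decoder output; in a real loop this would come from
# something like model.decoder(inputs, audio_features).
torch.manual_seed(0)
logits = torch.randn(inputs.shape[0], inputs.shape[1], 10)
loss = masked_loss(logits, targets)
```

Padding the targets with the `ignore_index` value is one way to mask the loss; multiplying a per-token loss by a 0/1 mask and dividing by the number of real tokens would be an equivalent alternative.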

In the code, the mask is implemented as follows:

whisper/whisper/model.py

Line 175 in 28769fc

mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1)
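To see why this mask makes the choice of padding token irrelevant, here is a toy check. It builds the mask exactly as in the line above, but the attention function `causal_self_attention` is a hypothetical single-head stand-in with identity projections, not Whisper's actual attention module. Two inputs that differ only at the trailing "padding" positions should give the same output at the earlier positions.

```python
import numpy as np
import torch

n_ctx = 4
# Same construction as in the quoted line: 0 on and below the diagonal,
# -inf strictly above it.
mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1)


def causal_self_attention(x, mask):
    """Toy single-head self-attention on x of shape (seq, dim)."""
    n = x.shape[0]
    scores = x @ x.T / x.shape[-1] ** 0.5
    scores = scores + mask[:n, :n]  # future positions get -inf
    return torch.softmax(scores, dim=-1) @ x


torch.manual_seed(0)
x_a = torch.randn(4, 8)
x_b = x_a.clone()
x_b[2:] = torch.randn(2, 8)  # different "padding" embeddings at positions 2, 3

out_a = causal_self_attention(x_a, mask)
out_b = causal_self_attention(x_b, mask)

# Positions 0 and 1 only attend to positions <= themselves, so they match.
same_prefix = torch.allclose(out_a[:2], out_b[:2])
```

The outputs at the padded positions themselves do differ between the two inputs, which is why the loss at those positions still has to be masked out.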
