Solved issue related to prediction padding and pad_token_id #68

nkaenzig · 2022-09-07T22:52:13Z

This is a relatively hidden bug and took me quite some time to debug :)

Issue
_compute_metrics() in the evaluation loop calculates wrong CER & WER metrics when using a TokenSet where the pad_token_id is not equal to 0. This doesn't affect the loss calculation / training as such, but the logged metrics during training will be wrong and won't match the metrics calculated using model.evaluate() after training.

Reason
Similar to the label_ids the prediction logits are passed to _compute_metrics() as a matrix padded with -100 values.
Currently the argmax call which maps logits to token ids converts these -100 values to 0. So after the argmax, pred_ids will be 0-padded. For most wav2vec2 models this is not an issue, because their vocab.json assigns the ID 0 to the <pad> token. However, if you use a custom TokenSet for finetuning, <pad> will most probably not be mapped to 0, so the obtained "0-padding" values will wrongly correpond to another token.

See relevant code here: https://github.com/jonatasgrosman/huggingsound/blob/main/huggingsound/trainer.py#L599

pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)

pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

Proposed Solution
Save a padding_mask which stores the location of the padded -100 in the prediction logits. Then after applying the argmax use the mask to set the padded entries to the ID corresponding to the padding token.

nkaenzig · 2022-09-13T16:08:35Z

@jonatasgrosman Would be great if we could merge this :)

jonatasgrosman · 2022-09-14T18:42:39Z

Thanks, @nkaenzig, for noticing that bug and making this contribution :)
And sorry for the delayed response

fix: solved issue related to prediction padding and pad_token_id

77b31b3

jonatasgrosman merged commit 9b355a8 into jonatasgrosman:main Sep 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Solved issue related to prediction padding and pad_token_id #68

Solved issue related to prediction padding and pad_token_id #68

nkaenzig commented Sep 7, 2022

nkaenzig commented Sep 13, 2022

jonatasgrosman commented Sep 14, 2022

Solved issue related to prediction padding and pad_token_id #68

Solved issue related to prediction padding and pad_token_id #68

Conversation

nkaenzig commented Sep 7, 2022

nkaenzig commented Sep 13, 2022

jonatasgrosman commented Sep 14, 2022