Solved issue related to prediction padding and pad_token_id #68

Merged
1 commit merged into jonatasgrosman:main on Sep 14, 2022

Conversation

@nkaenzig (Contributor) commented on Sep 7, 2022

This is a relatively hidden bug and took me quite some time to debug :)

Issue
_compute_metrics() in the evaluation loop calculates wrong CER & WER metrics when using a TokenSet where the pad_token_id is not equal to 0. This doesn't affect the loss calculation / training as such, but the logged metrics during training will be wrong and won't match the metrics calculated using model.evaluate() after training.

Reason
Similar to the label_ids, the prediction logits are passed to _compute_metrics() as a matrix padded with -100 values.
Currently, the argmax call that maps logits to token IDs turns these -100 padded positions into 0 (argmax over a row of identical -100 values returns index 0), so after the argmax pred_ids is effectively 0-padded. For most wav2vec2 models this is not an issue, because their vocab.json assigns ID 0 to the <pad> token. However, if you use a custom TokenSet for fine-tuning, <pad> will most likely not be mapped to 0, so the obtained "0-padding" values will wrongly correspond to another token.
See relevant code here: https://github.com/jonatasgrosman/huggingsound/blob/main/huggingsound/trainer.py#L599

pred_logits = pred.predictions  # logits matrix, padded with -100 like the labels
pred_ids = np.argmax(pred_logits, axis=-1)  # padded positions all become token ID 0

pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id  # labels get the pad ID restored, predictions don't
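
To make the failure mode concrete, here is a small, self-contained illustration (the logit values, sequence length, and vocabulary size are invented for the example): np.argmax over a row that is entirely -100 returns index 0, so every padded time step becomes token ID 0.

import numpy as np

# Two real time steps plus two padded steps, vocab size 3 (values are made up).
pred_logits = np.array([[0.1, 0.7, 0.2],
                        [0.3, 0.1, 0.6],
                        [-100.0, -100.0, -100.0],
                        [-100.0, -100.0, -100.0]])

pred_ids = np.argmax(pred_logits, axis=-1)
print(pred_ids)  # [1 2 0 0] -> padded steps turn into token ID 0

# With a custom TokenSet where ID 0 is a real character and <pad> has some
# other ID, those trailing 0s are decoded as that character and inflate CER/WER.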

Proposed Solution
Save a padding_mask that stores the locations of the -100 padding in the prediction logits. Then, after applying the argmax, use the mask to set the padded entries to the ID of the padding token.
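
A minimal sketch of that idea, assuming the prediction logits are padded with -100 across the whole vocabulary dimension (variable names are illustrative and may differ from the merged commit):

pred_logits = pred.predictions

# Remember which positions were padding: every logit in the row equals -100.
padding_mask = np.all(pred_logits == -100, axis=-1)

pred_ids = np.argmax(pred_logits, axis=-1)
# Replace the bogus 0s at padded positions with the real pad token ID.
pred_ids[padding_mask] = processor.tokenizer.pad_token_id

pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id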

@nkaenzig (Contributor, Author)

@jonatasgrosman Would be great if we could merge this :)

@jonatasgrosman merged commit 9b355a8 into jonatasgrosman:main on Sep 14, 2022
@jonatasgrosman (Owner)

Thanks, @nkaenzig, for noticing that bug and making this contribution :)
And sorry for the delayed response
