data2vec-audio returns different results with padded input #25621
I just noticed this in the documentation: https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel.forward.attention_mask
Does it mean the preprocessor config is wrong?
cc @sanchit-gandhi and @ylacombe
Apologies @gau-nernst, this slipped the net previously! @ylacombe, are you able to take a look? It would be worth running some side-by-side debugging with padded and un-padded inputs to see if there's a divergence.
Hey @gau-nernst, I've looked into the matter, and you rightfully highlighted two shortcomings:
I've studied a bit more where the computation starts to differ, and it happens right here, when computing positional embeddings. To address this issue, we should thus:
This could be a great PR for you @gau-nernst, WDYT about working on this? Of course, I'll support you if you have any questions!
Thank you for the detailed investigation and explanation. Do you know why
As I understand it, when passing padding zeros through PyTorch's conv layers, the values after the valid sequence length do not stay zero: the bias and the kernel overlapping the boundary make them non-zero. This poses problems because the positional encoding stacks several conv layers, so the values after this length are then non-zero inputs for the subsequent conv layers. Note that it wouldn't be a problem if there were only one conv layer.
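A minimal sketch of this effect (the shapes, kernel size, and layer count are illustrative assumptions, not the actual data2vec-audio configuration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv1d(1, 1, kernel_size=3, padding=1)
conv2 = nn.Conv1d(1, 1, kernel_size=3, padding=1)

x = torch.randn(1, 1, 10)                                # valid length 10
x_padded = torch.cat([x, torch.zeros(1, 1, 5)], dim=-1)  # padded to length 15

# After one conv, the valid region still matches: the explicit padding zeros
# behave exactly like Conv1d's implicit zero padding at the boundary.
y, y_padded = conv1(x), conv1(x_padded)
print(torch.allclose(y, y_padded[..., :10]))  # True
print(y_padded[..., 10:].abs().sum() > 0)     # True: the padded tail is now non-zero

# A second conv sees that non-zero tail near the boundary, so the valid
# region diverges between the padded and un-padded inputs.
print(torch.allclose(conv2(y), conv2(y_padded)[..., :10]))  # False
```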
I see, that makes sense. That's why Wav2Vec2 doesn't have this issue, since it uses only 1 convolution layer for positional encoding. I think the way to fix this is to fill the values after the attention mask with zeros. This has to be done after every conv layer in the positional encoding; a rough sketch follows. Not sure if there is a more elegant way. Another note: it seems the original fairseq implementation also has the same problem (padded input will give different results), since they don't appear to do any special processing (I haven't actually run the code to check). Not sure if we should deviate from the official implementation if that's the case.
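A sketch of that idea, assuming length-preserving conv layers and a frame-level attention mask; the function and argument names are hypothetical, not the actual modeling code:

```python
import torch
import torch.nn as nn

def positional_conv_with_mask(hidden_states, attention_mask, conv_layers):
    """hidden_states: (batch, channels, time); attention_mask: (batch, time), 1 = valid frame."""
    mask = attention_mask.unsqueeze(1).to(hidden_states.dtype)  # (batch, 1, time)
    for conv in conv_layers:
        # Zero out the padded frames after every conv so the next layer
        # never sees the non-zero tail produced by the previous one.
        hidden_states = conv(hidden_states) * mask
    return hidden_states
```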
That's exactly what is done here. I think we need to discuss it further with @sanchit-gandhi, since batching (and thus padding) still seems to give correct results in the integration tests. It could be interesting to experiment a bit with your solution and check whether it gives correct and coherent results. Would you be OK with experimenting with this? You could pass the attention mask through those layers, or you could do something with the sequence lengths, and then compare results with what's in the integration tests.
Note that the integration tests use
The model is probably robust enough so that the final predictions are not affected. Do you have any thoughts about not replicating the exact fairseq implementation? This is the fairseq code and the config file |
Hey @gau-nernst, I had the occasion to discuss the matter internally with @sanchit-gandhi and @patrickvonplaten, and here are our thoughts! At the moment, we have strict equivalence with the fairseq implementation, which leads us to believe that the current behavior might be intended, or that it is simply an oversight on their part. In any case, we'd like to keep the current behavior as the default, since it doesn't seem to impact the outputs much according to the integration tests! However, if you are really interested in the matter, you can still drive a PR to correct this behavior, provided that the current behavior stays the default and that the change is really useful in terms of quality! BTW, could you also tell me the intended use of the model and how you encountered this problem? Many thanks! If you encountered this issue while fine-tuning the model, you might want to group samples by length (see the sketch below), since it appears that your issue was amplified by a large padding-to-length ratio!
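For reference, a hedged sketch of length grouping with the `Trainer` API; the argument values are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    group_by_length=True,          # batch samples of similar length to reduce padding
    length_column_name="length",   # dataset column holding each sample's length
)
```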
Sadly, I don't have the bandwidth to do that experiment. I'm mainly using audio models for audio classification, so I'm interested in encoder-only models. I was evaluating which model to use and found out about the strange behaviour of different results for padded inputs with the Wav2Vec2 Base and HuBERT Base models, which is due to their use of group norm. Then I tried to see if other models had this problem, and thus found it for data2vec-audio. Currently I don't use data2vec-audio models, since I think Wav2Vec2 XLS-R is much better thanks to its larger pre-training data. I believe the solution for now is to update the documentation.
@ylacombe What do you think of the solution I proposed above? I can submit a PR if you are OK with it. It's mainly a documentation fix, since I won't have the bandwidth to do experiments with the model code.
Hey @gau-nernst, thanks for the reminder! It would be nice to have your contribution in a PR here! I agree with the solution you proposed for now, feel free to ping me on the PR!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

`transformers` version: 4.31.0

Who can help?

@sanchit-gandhi

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Running the same audio through the model padded (with its attention mask) and un-padded: the `extract_features` outputs are the same, but `last_hidden_state` is not. A minimal sketch of the comparison follows.
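This is not the issue's original script; the checkpoint name, clip lengths, and tolerance are illustrative assumptions:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, Data2VecAudioModel

model_name = "facebook/data2vec-audio-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Data2VecAudioModel.from_pretrained(model_name).eval()

short = np.random.randn(16000).astype(np.float32)  # 1 s of fake 16 kHz audio
long = np.random.randn(32000).astype(np.float32)   # 2 s clip, forces padding of `short`

# Un-padded forward pass on the short clip alone.
inputs = feature_extractor(short, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Batched forward pass: the short clip is padded up to the long clip's length.
batch = feature_extractor([short, long], sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    out_padded = model(input_values=batch.input_values, attention_mask=batch.attention_mask)

n = out.extract_features.shape[1]  # number of valid frames for the short clip
print(torch.allclose(out.extract_features, out_padded.extract_features[:1, :n], atol=1e-4))    # True per the issue
print(torch.allclose(out.last_hidden_state, out_padded.last_hidden_state[:1, :n], atol=1e-4))  # False per the issue
```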
Expected behavior

The two outputs should be the same.
Note that when I change the model to `facebook/wav2vec2-xls-r-300m`, the outputs are identical. I would expect data2vec and wav2vec 2.0 to have similar behavior, since they have very similar architectures. A quick glance at the source code also indicates that there should be no reason why data2vec cannot use the attention mask correctly. The preprocessor config here also indicates the model should be able to use the attention mask:
https://huggingface.co/facebook/data2vec-audio-base/blob/main/preprocessor_config.json
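A quick, hedged way to check what that config advertises (the commented value is what the linked file suggests, not verified here):

```python
from transformers import AutoFeatureExtractor

fe = AutoFeatureExtractor.from_pretrained("facebook/data2vec-audio-base")
print(fe.return_attention_mask)  # the linked preprocessor_config.json suggests True
```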