[MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 #23813
Conversation
Hey @patrickvonplaten - thanks for working on this! I reviewed the options provided, and I believe the second one would work best from a developer standpoint. IMO it ensures that all the adapter weights are in one repository, and everything works as it should if someone wants to use a different language with the base model. I am not a big fan of option 1 because it would make it difficult to run the model in a resource-constrained environment. I am a bit conflicted about option 3, primarily because it gives the end-user the same experience as regular Wav2Vec2, without worrying about the specific language adapter layers and so on; although having 1000+ repos for this sounds a bit wasteful IMO. Question: how would this work for fine-tuning? I am assuming that if someone fine-tunes Wav2Vec2-MMS on a language "X", they'll push their adapter weights to a new repo and pull from that. So purely from a UX perspective, we should allow for the …
I think 2 is probably the better solution, and I would also make it possible to set the lang in the processor:

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor

ckpt = "./mms-1b-all/"

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt, target_lang="esp")
processor.set_lang("esp")
model.to("cuda")

### Stuff

# want to change the language:
model.load_adapter("fra")
```
+1 on the composite solution proposed by @sgugger. Regarding fine-tuning @Vaibhavs10: users will save both the fine-tuned base weights and the adapter layer weights to the same repo. (This is different to PEFT, where we only save the adapter weights, since here the base weights are also trainable. The way to view the adapter layer is as an extra small feed-forward network on top of the transformer block, i.e. a regular layer of weights rather than a parameter-efficient one.) So we can probably assume we're loading the base weights and adapter weights from the same repo.
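To illustrate the point about saving, here is a toy sketch (not the transformers API: the `FineTunedMMSLike` class and its layer names are made up for illustration). Because the adapter is a regular submodule rather than a PEFT-style external module, a single checkpoint naturally contains both base and adapter keys:

```python
import io

import torch
import torch.nn as nn


class FineTunedMMSLike(nn.Module):
    # Toy stand-in: after MMS fine-tuning, BOTH base and adapter weights have
    # changed, so both belong in the same checkpoint (unlike PEFT, which
    # saves adapter weights only).
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(4, 4)
        self.adapter = nn.Linear(4, 4)


model = FineTunedMMSLike()

# Serialize the full state dict to one (in-memory) checkpoint.
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

checkpoint_keys = set(torch.load(buf).keys())
# checkpoint_keys contains both "base.*" and "adapter.*" entries.
```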
Agreed with all of the above - 2 would be my choice.
I'll leave more in-detail functionality for fine-tuning adapter weights for a future PR, but in short we can already do the following:

```python
from transformers import Wav2Vec2ForCTC

ckpt = "patrickvonplaten/mms-1b"

model = Wav2Vec2ForCTC.from_pretrained(ckpt, num_attn_adapters=1, vocab_size=277)

adapter_keys = set(model._adapters.keys())
for name, param in model.named_parameters():
    if name not in adapter_keys:
        param.requires_grad = False
```

So once we add adapter fine-tuning to the wav2vec2 fine-tuning script, we could also add a simple `freeze_all_but_adapter()` function or something.
The code is now finished. I still need to upload the adapters for the smaller checkpoints, transfer them to the Facebook org and write some nice docs. All modeling files except Wav2Vec2 are changed due to the `# Copied from` mechanism. I think this is better than removing the copy-from mechanism, but happy to change.
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Really nice PR and new model abilities. Thanks for adding! Tests especially were clear and helped show the expected behaviour 🤗

Mostly just nits - the main comment is some missing asserts in the tests.

I'm not so sure about the logic and layers added to the models that copy from wav2vec2. It's the classic inheritance problem: it's a bit counter-intuitive to have calls to a method - `load_adapter` - which the model doesn't have in its modeling code. Not a big issue - if we find it confuses users, we can handle it by e.g. defining a `load_adapter` which raises `NotImplementedError`. Noting it just as a limitation of adapting models with copying logic.
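The suggestion could look roughly like this (a hypothetical sketch, not the merged code; the class name is made up): models copied from Wav2Vec2 that do not support adapters could still define `load_adapter` and fail loudly instead of confusing users with a missing method:

```python
class Wav2Vec2LikeWithoutAdapters:
    # Hypothetical model copied from Wav2Vec2 that has no adapter layers.
    # Defining load_adapter here turns a confusing AttributeError into a
    # clear, intentional error message.
    def load_adapter(self, target_lang: str):
        raise NotImplementedError(
            f"{type(self).__name__} does not support language adapters."
        )


model = Wav2Vec2LikeWithoutAdapters()
try:
    model.load_adapter("fra")
    raised = False
except NotImplementedError:
    raised = True
# raised is True: the call fails loudly with a descriptive message.
```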
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
…dapter to Wav2Vec2 (huggingface#23813)

* add fine-tuned with adapter layer
* Add set_target_lang to tokenizer
* Implement load adapter
* add tests
* make style
* Apply suggestions from code review
* Update src/transformers/models/wav2vec2/tokenization_wav2vec2.py
* make fix-copies
* Apply suggestions from code review
* make fix-copies
* make style again
* mkae style again
* fix doc string
* Update tests/models/wav2vec2/test_tokenization_wav2vec2.py
* Apply suggestions from code review
* fix
* Correct wav2vec2 adapter
* mkae style
* Update src/transformers/models/wav2vec2/modeling_wav2vec2.py
* add more nice docs
* finish
* finish
* Apply suggestions from code review
* all finish

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
What does this PR do?
This PR adds the MMS models fine-tuned on speech recognition.
See official announcement here: https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/
See more details here: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#asr
Fixes #23811 and #23665
For now checkpoints are uploaded here:
Pretrained-only
ASR fine-tuned
The fine-tuned checkpoints are based on adapter layers, as can be seen in this PR. The ASR fine-tuned weights consist of two parts: the base `mms-1b-all` model weights and the language-specific adapter weights.

If one wants to use a specific language, the corresponding adapter weights need to be loaded into `mms-1b-all`. By default, `mms-1b-all` et al. load the English adapter layer weights, as is currently done in https://huggingface.co/patrickvonplaten/mms-1b-all

The following works with this PR:
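The example itself was not captured here; a sketch of the intended usage, under stated assumptions (the exact loading calls are an assumption based on the standard Wav2Vec2 ASR pattern; the checkpoint name is taken from the PR description), could look like the block below. The `greedy_ctc_collapse` helper is included only to spell out the decoding rule that `processor.batch_decode` applies to the argmax ids:

```python
def greedy_ctc_collapse(ids, blank_id=0):
    # Greedy CTC decoding rule used by Wav2Vec2-style ASR heads:
    # collapse consecutive repeated ids, then drop the blank token.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


# Hypothetical end-to-end usage (commented out: downloads ~3 GB of weights):
#
# import torch
# from transformers import Wav2Vec2ForCTC, AutoProcessor
#
# ckpt = "patrickvonplaten/mms-1b-all"
# processor = AutoProcessor.from_pretrained(ckpt)
# model = Wav2Vec2ForCTC.from_pretrained(ckpt)
#
# inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
# with torch.no_grad():
#     logits = model(**inputs).logits
# pred_ids = logits.argmax(dim=-1)
# print(processor.batch_decode(pred_ids))
```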
Now, the question is what API we want to build to allow the user to easily switch between languages for the fine-tuned weights.

Note: how should we design this model? We need some kind of language-switching function anyway. I see the following APIs that could work.
1.) By default, we download all adapter layers and load them all in RAM, but we provide functionality to remove all languages but one from RAM.

A problem with this is also that it's not trivial to switch between languages, because one needs to load the whole model again and then set the language again. Also, we would have to add a `set_adapter_weights` function to Wav2Vec2, which is not ideal.

2.) By default we only load the adapter weights of one language (e.g. English), and then load more adapter layers upon request.
I think this is quite user-friendly and intuitive, and this way we also never require more than 3.1 GB of RAM. It does however require adding a pretty specific `load_adapter` function to Wav2Vec2 (I think that's fine though).

3.) We just upload 1000+ repos, one for each language. This way we don't need any "set" or "load" function and we just treat each set of adapter weights as its own model.
The big disadvantage is that it's pretty wasteful, since an adapter layer is just ~0.3% of all the model's weights.
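A rough back-of-the-envelope calculation (assuming the ~3.1 GB total checkpoint size and ~0.3% adapter fraction mentioned in this thread; both numbers are approximations) makes the waste concrete:

```python
total_mb = 3_100          # ~3.1 GB full mms-1b-all checkpoint (approximation)
adapter_fraction = 0.003  # adapter layers are ~0.3% of the model's weights

adapter_mb = total_mb * adapter_fraction  # one language adapter: ~9.3 MB

# Option 3: 1000 repos, each duplicating the full base weights.
full_repos_gb = 1000 * total_mb / 1000

# Option 2: one shared base plus 999 extra adapter files.
shared_base_gb = (total_mb + 999 * adapter_mb) / 1000
# ~3100 GB for option 3 vs ~12.4 GB for option 2.
```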
=> Overall, I'm leaning towards API 2.) because it's the most user-friendly and intuitive. It'd just require adding a somewhat specific `load_adapter` function to Wav2Vec2, but I think that's totally fine.
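For context, a toy sketch of what option 2's `load_adapter` could do (a stand-in module, not the real Wav2Vec2 implementation; the class and attribute names are made up): filter a checkpoint's state dict down to the adapter keys and load only those, leaving the base weights untouched:

```python
import torch
import torch.nn as nn


class TinyWav2Vec2Like(nn.Module):
    # Toy stand-in for Wav2Vec2ForCTC: a large frozen "base" plus a small
    # swappable per-language adapter.
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 8)
        self.adapter = nn.Linear(8, 8)

    def load_adapter(self, state_dict):
        # Keep only the adapter weights from the checkpoint and load them
        # in place; base weights are never touched.
        prefix = "adapter."
        adapter_sd = {
            k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)
        }
        self.adapter.load_state_dict(adapter_sd)


model = TinyWav2Vec2Like()
fra = TinyWav2Vec2Like()               # pretend this holds the "fra" adapter weights
old_base = model.base.weight.clone()

model.load_adapter(fra.state_dict())   # switch language: only adapter weights change
```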
Thoughts @sanchit-gandhi @Vaibhavs10 @sgugger @LysandreJik @amyeroberts ?