
[MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 #23813

Merged
merged 29 commits into main from add_wav2vec2_mms on Jun 2, 2023

Conversation

@patrickvonplaten (Contributor) commented May 27, 2023

What does this PR do?

This PR adds the MMS models fine-tuned on speech recognition.
See official announcement here: https://about.fb.com/news/2023/05/ai-massively-multilingual-speech-technology/
See more details here: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/README.md#asr

Fixes #23811 and #23665

For now, checkpoints are uploaded here:

  • Pretrained-only
  • ASR fine-tuned

The fine-tuned checkpoints are based on adapter layers, as can be seen in this PR. The ASR fine-tuned weights consist of two parts:

  • The non-adapter weights, which are exactly the same as the base model weights
  • Language-specific fine-tuned adapter layer weights. This means we have 1,000+ sets of adapter weights for mms-1b-all

If one wants to use a specific language, the corresponding adapter weights need to be loaded into mms-1b-all.
By default, mms-1b-all et al. load the English adapter layer weights, as is currently done in https://huggingface.co/patrickvonplaten/mms-1b-all

The following works with this PR:

from transformers import Wav2Vec2ForCTC, AutoProcessor
import soundfile as sf
import torch

ckpt = "./mms-1b-fl102/"
ckpt = "./mms-1b-l1107"
ckpt = "./mms-1b-all/"

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)

# get audio.flac from https://huggingface.co/datasets/patrickvonplaten/audios/blob/main/audio.flac
audio, sr = sf.read("./audio.flac")

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

transcription = processor.batch_decode(logits.argmax(-1))[0]

print(f"Transcription: {transcription}")

Now, the question is what API we want to build to allow the user to easily switch between languages for the fine-tuned weights.

Note:

  • To switch from one language to another, both the tokenizer's vocab and the model's adapter layers need to be switched out
  • The tokenizer can easily hold all language dicts in RAM, because each language has around 150 entries, so ~150,000 entries in total, which is not too much for RAM
  • However, things are a bit trickier for the model. The base model requires 3.1 GB of RAM in FP32 and each set of adapter weights is around 9 MB, so loading all adapter layers into RAM would cost ~9 GB, which is quite a bit (see the quick estimate below).
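For reference, a quick back-of-the-envelope estimate using the numbers above (rough figures only):

```py
# Rough figures from the notes above (not exact numbers)
num_langs = 1000             # 1,000+ languages for mms-1b-all
entries_per_lang = 150       # tokenizer vocab entries per language
adapter_size_mb = 9          # each set of adapter weights is ~9 MB
base_model_gb = 3.1          # FP32 base model

print(f"Tokenizer entries: ~{num_langs * entries_per_lang:,}")                  # ~150,000
print(f"All adapters: ~{num_langs * adapter_size_mb / 1024:.1f} GB")            # ~8.8 GB
print(f"Base model + all adapters: ~{base_model_gb + num_langs * adapter_size_mb / 1024:.1f} GB")
```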

How should we design this model? We need some kind of language-switching function anyway. I see the following APIs that could work.

1.) By default, we download all adapter layers and load them all into RAM, but we provide functionality to remove all languages but one from RAM:

from transformers import Wav2Vec2ForCTC, AutoProcessor

ckpt = "./mms-1b-all/"

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)  # requires at least 10GB of CPU RAM

target_lang = "esp"

processor.set_lang("esp")
adapter_id = processor.lang_to_id["esp"]
model.set_adapter_weights(adapter_id)  # throw away all but one set of adapter weights => 3.1 GB of CPU RAM

model.to("cuda")

A problem with this is that it's not trivial to switch between languages, because one needs to load the whole model again and then set the language again. Also, we would have to add a set_adapter_weights function to Wav2Vec2, which is not ideal.
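For illustration, a hedged sketch of what switching languages would look like under this option, reusing the hypothetical set_adapter_weights function from the snippet above:

```py
# Hypothetical sketch only: set_adapter_weights is the proposed (not existing)
# function from the snippet above. Once the other adapters were thrown away,
# switching languages means re-loading the full checkpoint first.
model = Wav2Vec2ForCTC.from_pretrained(ckpt)            # back to ~10 GB of CPU RAM
processor.set_lang("fra")
model.set_adapter_weights(processor.lang_to_id["fra"])  # drop everything but French again
model.to("cuda")
```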

2.) By default, we only load the adapter weights of one language (e.g. English) and then load more adapter layers upon request:

from transformers import Wav2Vec2ForCTC, AutoProcessor

ckpt = "./mms-1b-all/"

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)  # requires only 3GB of CPU RAM

target_lang = "esp"

processor.set_lang("esp")
model.load_adapter("esp") # This will load a file called "adapter.esp.bin" from: https://huggingface.co/patrickvonplaten/mms-1b-all , cache it and replace the adapter

model.to("cuda")

I think this is quite user-friendly and intuitive, and this way we also never require more than 3.1 GB of RAM. It does, however, require adding a pretty specific load_adapter function to Wav2Vec2 (I think that's fine though).
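For comparison, switching languages under this option would just be another load_adapter call (sketch based on the API proposed above; exact names may still change):

```py
# Sketch of switching languages under the proposed option 2 API (names as
# proposed above, subject to change): each call only downloads/caches a ~9 MB
# adapter file and swaps it in place, so CPU RAM stays around the 3.1 GB base model.
processor.set_lang("fra")
model.load_adapter("fra")
model.to("cuda")
```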

3.) We just upload 1,000+ repos, one for each language. This way we don't need any "set" or "load" function and we just treat each set of adapter weights as its own model:

from transformers import Wav2Vec2ForCTC, AutoProcessor

ckpt = "./mms-1b-all-esp/" # repo names then become lang specific

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt)  # requires only 3GB of CPU RAM
model.to("cuda")

The big disadvantage is that it's pretty wasteful, since an adapter layer is just 0.3% of all the model's weights.

=> Overall, I'm leaning towards API 2.) because it's the most user-friendly and intuitive. It'd just require adding a somewhat specific "load_adapter" function to Wav2Vec2, but I think that's totally fine.

Thoughts @sanchit-gandhi @Vaibhavs10 @sgugger @LysandreJik @amyeroberts ?

@patrickvonplaten patrickvonplaten changed the title add fine-tuned with adapter layer [RFC] Add fine-tuned with adapter layer May 27, 2023
@HuggingFaceDocBuilderDev commented May 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@Vaibhavs10 (Member)

Hey @patrickvonplaten - thanks for working on this; I reviewed the options provided. I believe the second one would work best from a developer standpoint. IMO it ensures that all the adapter weights are in one repository, and everything works the way it should if someone wants to use a different language with the base model.

I am not a big fan of option 1 because it would make it difficult to run the model in a resource-constrained environment.

I am a bit conflicted about option 3, primarily because it gives the end-user the same experience as regular Wav2Vec2 without them having to worry about the specific language adapter layers and so on. However, having 1,000+ repos for this sounds a bit wasteful IMO.

Question: how would this work for fine-tuning? I am assuming that if someone fine-tunes Wav2Vec2-MMS on a language "X", they'll push their adapter weights to a new repo and pull from there. Purely from a UX perspective, that would mean we should allow the load_adapter function to pull from a separate repository too, right?
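For example, something like the following (purely hypothetical; the adapter_repo keyword and the community repo name below are made up just to make the question concrete):

```py
from transformers import Wav2Vec2ForCTC

# Purely hypothetical: the `adapter_repo` keyword and the community repo name
# are made up to illustrate the question, they are not part of the proposed API.
model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/mms-1b-all")
model.load_adapter("xyz", adapter_repo="my-user/mms-1b-all-xyz-adapter")
```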

@sgugger (Collaborator) commented May 30, 2023

I think 2 is probably the better solution, and I would also make it possible to set the lang in the from_pretrained call:

from transformers import Wav2Vec2ForCTC, AutoProcessor

ckpt = "./mms-1b-all/"

processor = AutoProcessor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt, target_lang="esp")

processor.set_lang("esp")

model.to("cuda")

# Later, if we want to change the language:
model.load_adapter("fra") 

@sanchit-gandhi (Contributor)

+1 on the composite solution proposed by @sgugger. Regarding fine-tuning @Vaibhavs10: users will save both the fine-tuned base weights and the adapter layer weights to the same repo. This is different from PEFT, where we only save the adapter weights, since here the base weights are also trainable. The way to view the adapter layer is as an extra small feed-forward network on top of the transformer block, i.e. a regular layer of weights rather than a parameter-efficient one. So we can probably assume we're loading the base weights and adapter weights from the same repo.
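To make the "extra small feed-forward network" picture concrete, here is a minimal, schematic bottleneck-adapter block (an illustration of the general idea, not the actual implementation in this PR):

```py
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Schematic adapter: layer norm, down-projection, non-linearity,
    up-projection, added residually on top of a transformer block's output."""

    def __init__(self, hidden_size: int, adapter_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(adapter_dim, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        hidden_states = self.norm(hidden_states)
        hidden_states = self.up(self.act(self.down(hidden_states)))
        # residual connection: the block behaves like the base model when the
        # adapter output is (near-)zero
        return residual + hidden_states
```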

@amyeroberts (Collaborator)

Agreed with all of the above - 2 would be my choice:

  • 1 doesn't feel very user-friendly. I'd expect most people to only use a consistent subset, so downloading everything is slow and wasteful.
  • 2 feels the most intuitive with the current API, and flexible. Seconding @Vaibhavs10's questions about fine-tuning, pushing to the hub, and loading fine-tuned weights: if we load model weights from mms-1b-fl102 and want our own fine-tuned adapter weights, how do I specify that when loading, and how is this information saved? How would we differentiate the weights such that, when I call model.push_to_hub, the adapter weights are uploaded separately from the rest of the model (pattern matching? see the rough sketch after this list)? Should the adapter weights be tied to a specific version of the 'base model' weights?
  • 3 is probably the simplest to do - but seems like a waste with many repeated weights.
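One possible way to differentiate them would be simple name-based pattern matching on the state dict (rough sketch; the "adapter_layer" substring is an assumption, not necessarily the real parameter naming):

```py
# Rough sketch: split a checkpoint's state dict into adapter vs. base weights
# by matching on parameter names. "adapter_layer" is an assumed substring,
# not necessarily the real naming used in the modeling code.
state_dict = model.state_dict()
adapter_weights = {k: v for k, v in state_dict.items() if "adapter_layer" in k}
base_weights = {k: v for k, v in state_dict.items() if "adapter_layer" not in k}

# The adapter weights could then be saved/pushed separately, e.g. as an
# "adapter.<lang>.bin" file, while the base weights go into the usual model file.
```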

@patrickvonplaten (Contributor, Author)

I'll leave more detailed functionality for fine-tuning adapter weights to a future PR, but in short we can already do the following:

from transformers import Wav2Vec2ForCTC

ckpt = "patrickvonplaten/mms-1b"
model = Wav2Vec2ForCTC.from_pretrained(ckpt, num_attn_adapters=1, vocab_size=277)

# freeze everything except the adapter weights
adapter_keys = set(model._adapters.keys())
for name, param in model.named_parameters():
    if name not in adapter_keys:
        param.requires_grad = False

So once we add adapter fine-tuning to the wav2vec2 fine-tuning script, we could also add a simple "freeze_all_but_adapter()" function or something.
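Such a helper could essentially just wrap the loop above (hypothetical name, as suggested; it relies on the same model._adapters mapping used in the snippet):

```py
def freeze_all_but_adapter(model):
    """Hypothetical helper: freeze every parameter except the adapter weights."""
    adapter_keys = set(model._adapters.keys())
    for name, param in model.named_parameters():
        # only the adapter parameters stay trainable
        param.requires_grad = name in adapter_keys
```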

@patrickvonplaten patrickvonplaten changed the title [RFC] Add fine-tuned with adapter layer Add fine-tuned with adapter layer May 31, 2023
@patrickvonplaten patrickvonplaten changed the title Add fine-tuned with adapter layer [MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 May 31, 2023
@patrickvonplaten (Contributor, Author) commented May 31, 2023

The code is now finished. I still need to upload the adapters for the smaller checkpoints, transfer them to Facebook and write some nice docs.

All modeling files except Wav2Vec2's are changed due to the # Copied from mechanism. I think this is better than removing the copy mechanism, but happy to change.

@amyeroberts (Collaborator) left a comment


Really nice PR and new model abilities. Thanks for adding! Tests especially were clear and helped show the expected behaviour 🤗

Mostly just nits - main comment is some missing asserts in the tests.

I'm not so sure about the added logic and layers in the models copying from Wav2Vec2. It's the classic inheritance problem, but it's a bit counter-intuitive to have calls to a method -- load_adapter -- which the model doesn't have in its modeling code. Not a big issue -- if we find it confuses users, then we can handle it, e.g. with a def load_adapter that raises NotImplementedError. Noting this just as a limitation of adapting models with copying logic.
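i.e. something along these lines (sketch only):

```py
# Possible stub in the copied models' modeling code to make the limitation explicit
def load_adapter(self, target_lang: str, **kwargs):
    raise NotImplementedError(
        f"{self.__class__.__name__} does not support loading language adapters."
    )
```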

@patrickvonplaten patrickvonplaten merged commit 5dfd407 into main Jun 2, 2023
@patrickvonplaten patrickvonplaten deleted the add_wav2vec2_mms branch June 2, 2023 09:30
gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
[MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 (huggingface#23813)

* add fine-tuned with adapter layer

* Add set_target_lang to tokenizer

* Implement load adapter

* add tests

* make style

* Apply suggestions from code review

* Update src/transformers/models/wav2vec2/tokenization_wav2vec2.py

* make fix-copies

* Apply suggestions from code review

* make fix-copies

* make style again

* mkae style again

* fix doc string

* Update tests/models/wav2vec2/test_tokenization_wav2vec2.py

* Apply suggestions from code review

* fix

* Correct wav2vec2 adapter

* mkae style

* Update src/transformers/models/wav2vec2/modeling_wav2vec2.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* add more nice docs

* finish

* finish

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Apply suggestions from code review

* all finish

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@ydshieh ydshieh mentioned this pull request Jun 12, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
[MMS] Scaling Speech Technology to 1,000+ Languages | Add attention adapter to Wav2Vec2 (huggingface#23813)
