
Add Speech AutoModels #13655

Merged

Conversation

Contributor

@patrickvonplaten commented Sep 20, 2021

What does this PR do?

This PR adds two auto models for speech:

  • one for CTC (Wav2Vec2, Hubert)
  • one for Seq2Seq (SpeechEncoderDecoder, Speech2Text)
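The practical difference between the two groups is the decoding API, and a toy sketch can make that concrete. Everything below is a hypothetical stand-in (`TinyCTCModel` and `TinySeq2SeqModel` are not transformers classes): a CTC head transcribes with a single forward pass followed by a collapse of repeated frames, while a seq2seq head decodes autoregressively through generate().

```python
# Hypothetical stand-ins for the two decoding styles; not transformers classes.

BLANK = 0  # conventional CTC blank token id

class TinyCTCModel:
    """CTC-style head: one forward pass, then collapse repeats and drop blanks."""
    def forward(self, frame_ids):
        # frame_ids: per-frame argmax token ids, e.g. [0, 3, 3, 0, 3, 5, 5]
        out, prev = [], BLANK
        for tok in frame_ids:
            if tok != BLANK and tok != prev:
                out.append(tok)
            prev = tok
        return out

class TinySeq2SeqModel:
    """Seq2seq-style head: decode token by token via generate()."""
    def generate(self, start_id, eos_id=9, max_len=5):
        out, tok = [], start_id  # toy decoder: just counts up until EOS
        while len(out) < max_len:
            tok += 1
            if tok == eos_id:
                break
            out.append(tok)
        return out

print(TinyCTCModel().forward([0, 3, 3, 0, 3, 5, 5]))  # -> [3, 3, 5]
print(TinySeq2SeqModel().generate(5))                 # -> [6, 7, 8]
```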

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

# Only load from `config.architectures`, AutoModelForCTC and AutoModelForConditionalGeneration
# do not exist yet.
- "pt": () if is_torch_available() else (),
+ "pt": (AutoModelForCTC, AutoModelForSpeechSeq2Seq) if is_torch_available() else (),
Contributor Author
Think that's cleaner @Narsil
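The fallback the tuple in the diff above enables could look roughly like this in plain Python — the `supported` sets and the `resolve_model_class` helper are illustrative assumptions, not transformers internals:

```python
# Illustrative dispatch: try each auto class registered for the task until one
# claims the checkpoint's architecture. Not the actual transformers code.

class AutoModelForCTC:
    supported = {"Wav2Vec2ForCTC", "HubertForCTC"}

class AutoModelForSpeechSeq2Seq:
    supported = {"SpeechEncoderDecoderModel", "Speech2TextForConditionalGeneration"}

SUPPORTED_TASKS = {
    "automatic-speech-recognition": {
        "pt": (AutoModelForCTC, AutoModelForSpeechSeq2Seq),
    },
}

def resolve_model_class(task, architecture, framework="pt"):
    for auto_cls in SUPPORTED_TASKS[task][framework]:
        if architecture in auto_cls.supported:
            return auto_cls
    raise ValueError(f"No auto class found for {architecture!r}")

print(resolve_model_class("automatic-speech-recognition", "Wav2Vec2ForCTC").__name__)
# -> AutoModelForCTC
```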

pipeline(
    task="automatic-speech-recognition",
    model="hf-internal-testing/tiny-random-wav2vec2",
)
Contributor Author
A tokenizer was recently added to this repo, so I created a new repo without a tokenizer. Also, it should give an OSError IMO.

Member

@LysandreJik left a comment

This looks good to me.

(Not too related to this PR) The fact that this pipeline maps to two different AutoModel* implementations reminds me of the ongoing issue relative to the 1-to-1 mapping that auto models/pipelines have, as:

  • Some pipelines map to several auto models (like this one)
  • Some auto models map to several pipelines (AutoModelForSeq2SeqLM maps to ~4 pipelines)
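The mismatch can be written down as two plain dicts (the ASR pair is from this PR; the four pipelines listed for AutoModelForSeq2SeqLM are an illustrative guess at the "~4"):

```python
# Neither mapping direction is 1-to-1; the pairs below are illustrative.

pipeline_to_automodels = {
    "automatic-speech-recognition": ["AutoModelForCTC", "AutoModelForSpeechSeq2Seq"],
}
automodel_to_pipelines = {
    "AutoModelForSeq2SeqLM": [
        "summarization", "translation", "text2text-generation", "conversational",
    ],
}

# One pipeline, several auto models:
assert len(pipeline_to_automodels["automatic-speech-recognition"]) == 2
# One auto model, several pipelines:
assert len(automodel_to_pipelines["AutoModelForSeq2SeqLM"]) == 4
```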

Should there be a slight refactor so that auto models map 1-to-1 to a pipeline? This could introduce AutoModelForSummarization, AutoModelForTranslation, etc.

WDYT @Narsil @sgugger @patrickvonplaten @patil-suraj?

Member

@anton-l left a comment

LGTM, and the AutoClass names should work well down the road 👍

Collaborator

@sgugger left a comment

Looks great, thanks for adding!

Contributor Author

@patrickvonplaten commented Sep 20, 2021

I'm not really in favor of aligning pipelines and AutoModels 1-to-1 to be honest. IMO there should be a 1-to-1 alignment between model classes Wav2Vec2For... and AutoModelFor..., but not between AutoModelFor... and the pipelines.

In this case, an auto model class called AutoModelForAutomaticSpeechRecognition would be problematic because:

  1. It would include models with different APIs: Wav2Vec2ForCTC does one forward pass and leverages just the forward() method, whereas SpeechEncoderDecoder leverages the generate() function. I don't think we want to follow this practice anymore.
  2. It could lead to problems as soon as there are multiple Wav2Vec2... classes that can do ASR.

On the other hand, I agreed with @Narsil that for the pipelines it doesn't really make sense to have AutomaticSpeechRecognitionWithCTC / WithSeq2Seq / ..., as the target group of the pipelines shouldn't have to know the difference between CTC / Seq2Seq / ...

More generally, I believe that the AutoModel... classes should follow our "barebone"/"no magic"/"easy-to-read code" design whereas the pipelines rather fall in the category "very user-friendly"/"complexity is absorbed at the cost of hard to read/understand code".

So I'm not really in favor of the 1-to-1 alignment.

=> Would maybe be a good idea to have a chat about this more generally though!

@patrickvonplaten patrickvonplaten merged commit 48fa42e into huggingface:master Sep 21, 2021
@patrickvonplaten patrickvonplaten deleted the add_speech_automodels branch September 21, 2021 06:50
@Narsil
Contributor

Narsil commented Sep 21, 2021

I agree with @patrickvonplaten actually.

In my mind, AutoModelFor... designates a kind of head on a particular model, which includes some specific weights and API.
It should have nothing to do with pipelines and needs to be completely separate (at least, models shouldn't care about pipelines).

On the other hand, for pipelines, relying on AutoModelFor... in a 1-to-1 fashion is tempting, but it falls apart for Seq2Seq (text-generation, translation, summarization), which is 1-to-n, and again for ForCTC and ForSpeechSeq2Seq, which is n-to-1.
For those, it really seems that it just doesn't make any sense to keep the 1-to-1 relationship. But what we do want is a 1-to-1 relationship
between the AutoModel and the widget being displayed, and that's possible because there is a sane default for Seq2Seq, which is text-generation (we called it text2text-generation, but really it's exactly the same API as text-generation).

Forcing 1-to-1 here would mean forcing model developers to care about pipelines and enabling all potential uses for a given head. It would also mean that users wouldn't necessarily know which AutoModelFor... to use, as there would now be aliases. In addition, it would mean we would be forced to unnecessarily add new tasks automatic-speech-recognition-ctc and automatic-speech-recognition-seq2seq.
Both are highly undesirable IMHO.

Keeping the current architecture is alright, I think.
In case of a 1-to-n mapping, there needs to be a default task that works (text-generation will always work, for instance).
And in case of an n-to-1 mapping, adding switches within the pipeline is OK IMO. As long as there's one switch only and the rest of the pipeline is common, it works fine, but we might want to consider other designs as the pipelines for different AutoModelFor... start to differ significantly (like #13622). This is a code architecture issue that is very doable IMO.
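The "one switch only" shape described here can be sketched with toy models — asr_pipeline, CTCModel, Seq2SeqModel, and the token scheme are all hypothetical, not the actual pipeline code:

```python
# Toy illustration of a pipeline with shared pre-/post-processing and a single
# branch on the head type. None of these names are real transformers code.

class CTCModel:
    def forward(self, feats):
        return [t for t in feats if t != 0]  # toy CTC: drop blank (0) frames

class Seq2SeqModel:
    def generate(self, feats):
        return feats[::-1]  # toy autoregressive decode

def preprocess(audio):
    return audio  # shared front-end (feature extraction would go here)

def postprocess(token_ids):
    return "".join(chr(96 + t) for t in token_ids)  # shared id-to-char decode

def asr_pipeline(model, audio):
    feats = preprocess(audio)
    if hasattr(model, "generate"):   # the single switch
        ids = model.generate(feats)
    else:
        ids = model.forward(feats)
    return postprocess(ids)

print(asr_pipeline(CTCModel(), [0, 8, 9, 0]))  # -> "hi"
print(asr_pipeline(Seq2SeqModel(), [8, 9]))    # -> "ih"
```

Everything outside the `if hasattr(...)` branch stays common, which is the property Narsil is arguing for.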

Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022