
Add Speech AutoModels #13655

Merged

Conversation

Contributor

@patrickvonplaten commented Sep 20, 2021

What does this PR do?

This PR adds two auto models for speech:

  • one for CTC (Wav2Vec2, Hubert)
  • one for Seq2Seq (SpeechEncoderDecoder, Speech2Text)
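The practical difference between the two groups is the decoding API, and a toy sketch can make that concrete. Everything below is a hypothetical stand-in (`TinyCTCModel` and `TinySeq2SeqModel` are not transformers classes): a CTC head transcribes with a single forward pass followed by a collapse of repeated frames, while a seq2seq head decodes autoregressively through generate().

```python
# Hypothetical stand-ins for the two decoding styles; not transformers classes.

BLANK = 0  # conventional CTC blank token id

class TinyCTCModel:
    """CTC-style head: one forward pass, then collapse repeats and drop blanks."""
    def forward(self, frame_ids):
        # frame_ids: per-frame argmax token ids, e.g. [0, 3, 3, 0, 3, 5, 5]
        out, prev = [], BLANK
        for tok in frame_ids:
            if tok != BLANK and tok != prev:
                out.append(tok)
            prev = tok
        return out

class TinySeq2SeqModel:
    """Seq2seq-style head: decode token by token via generate()."""
    def generate(self, start_id, eos_id=9, max_len=5):
        out, tok = [], start_id  # toy decoder: just counts up until EOS
        while len(out) < max_len:
            tok += 1
            if tok == eos_id:
                break
            out.append(tok)
        return out

print(TinyCTCModel().forward([0, 3, 3, 0, 3, 5, 5]))  # -> [3, 3, 5]
print(TinySeq2SeqModel().generate(5))                 # -> [6, 7, 8]
```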

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

# Only load from `config.architectures`, AutoModelForCTC and AutoModelForConditionalGeneration
# do not exist yet.
- "pt": () if is_torch_available() else (),
+ "pt": (AutoModelForCTC, AutoModelForSpeechSeq2Seq) if is_torch_available() else (),
Contributor Author
Think that's cleaner @Narsil
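The fallback the tuple in the diff above enables could look roughly like this in plain Python — the `supported` sets and the `resolve_model_class` helper are illustrative assumptions, not transformers internals:

```python
# Illustrative dispatch: try each auto class registered for the task until one
# claims the checkpoint's architecture. Not the actual transformers code.

class AutoModelForCTC:
    supported = {"Wav2Vec2ForCTC", "HubertForCTC"}

class AutoModelForSpeechSeq2Seq:
    supported = {"SpeechEncoderDecoderModel", "Speech2TextForConditionalGeneration"}

SUPPORTED_TASKS = {
    "automatic-speech-recognition": {
        "pt": (AutoModelForCTC, AutoModelForSpeechSeq2Seq),
    },
}

def resolve_model_class(task, architecture, framework="pt"):
    for auto_cls in SUPPORTED_TASKS[task][framework]:
        if architecture in auto_cls.supported:
            return auto_cls
    raise ValueError(f"No auto class found for {architecture!r}")

print(resolve_model_class("automatic-speech-recognition", "Wav2Vec2ForCTC").__name__)
# -> AutoModelForCTC
```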

pipeline(
    task="automatic-speech-recognition",
    model="hf-internal-testing/tiny-random-wav2vec2",
)
Contributor Author
A tokenizer was recently added to this repo, so I created a new repo without a tokenizer. Also, it should give an OSError IMO.

Member

@LysandreJik left a comment

This looks good to me.

(Not too related to this PR) The fact that this pipeline maps to two different AutoModel* implementations reminds me of the ongoing issue relative to the 1-to-1 mapping that auto models/pipelines have, as:

  • Some pipelines map to several auto models (like this one)
  • Some auto models map to several pipelines (AutoModelForSeq2SeqLM maps to ~4 pipelines)
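The mismatch can be written down as two plain dicts (the ASR pair is from this PR; the four pipelines listed for AutoModelForSeq2SeqLM are an illustrative guess at the "~4"):

```python
# Neither mapping direction is 1-to-1; the pairs below are illustrative.

pipeline_to_automodels = {
    "automatic-speech-recognition": ["AutoModelForCTC", "AutoModelForSpeechSeq2Seq"],
}
automodel_to_pipelines = {
    "AutoModelForSeq2SeqLM": [
        "summarization", "translation", "text2text-generation", "conversational",
    ],
}

# One pipeline, several auto models:
assert len(pipeline_to_automodels["automatic-speech-recognition"]) == 2
# One auto model, several pipelines:
assert len(automodel_to_pipelines["AutoModelForSeq2SeqLM"]) == 4
```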

Should there be a slight refactor so that auto models map 1-to-1 to a pipeline? This could introduce AutoModelForSummarization, AutoModelForTranslation, etc.

WDYT @Narsil @sgugger @patrickvonplaten @patil-suraj?

Member

@anton-l left a comment

LGTM, and the AutoClass names should work well down the road 👍

Collaborator

@sgugger left a comment

Looks great, thanks for adding!

Contributor Author

@patrickvonplaten commented Sep 20, 2021

I'm not really in favor of aligning pipelines and AutoModels 1-to-1 to be honest. IMO there should be a 1-to-1 alignment between model classes Wav2Vec2For... and AutoModelFor..., but not between AutoModelFor... and the pipelines.

In this case, an auto model class called AutoModelForAutomaticSpeechRecognition would be problematic because:

  1. It would include models with different APIs: Wav2Vec2ForCTC does one forward pass and leverages just the forward() method, whereas SpeechEncoderDecoder leverages the generate() function. I don't think we want to follow this practice anymore.
  2. It could lead to problems as soon as there are multiple Wav2Vec2... classes that can do ASR.

On the other hand, I agreed with @Narsil that for the pipelines it doesn't really make sense to have AutomaticSpeechRecognitionWithCTC / WithSeq2Seq / ..., as the target group of the pipelines shouldn't have to know the difference between CTC / Seq2Seq / ...

More generally, I believe that the AutoModel... classes should follow our "barebone"/"no magic"/"easy-to-read code" design whereas the pipelines rather fall in the category "very user-friendly"/"complexity is absorbed at the cost of hard to read/understand code".

So I'm not really in favor of the 1-to-1 alignment.

=> Would maybe be a good idea to have a chat about this more generally though!

@patrickvonplaten patrickvonplaten merged commit 48fa42e into huggingface:master Sep 21, 2021
@patrickvonplaten patrickvonplaten deleted the add_speech_automodels branch September 21, 2021 06:50
@Narsil
Contributor

Narsil commented Sep 21, 2021

I agree with @patrickvonplaten actually.

In my mind, AutoModelFor... designates a kind of head on a particular model, which includes some specific weights and API.
It should have nothing to do with pipelines and needs to be completely separate (at least, models shouldn't care about pipelines).

On the other hand, for pipelines, relying on AutoModelFor... in a 1-to-1 fashion is tempting, but it falls apart for Seq2Seq (text-generation, translation, summarization), which is 1-to-n, and again for ForCTC and ForSpeechSeq2Seq, which is n-to-1.
For those, it really seems that it just doesn't make any sense to keep the 1-to-1 relationship. But what we do want is a 1-to-1 relationship
between the AutoModel and the widget being displayed, and that's possible because there is a sane default for Seq2Seq, which is text-generation (we called it text2text-generation, but really it's exactly the same API as text-generation).

Forcing 1-to-1 here would mean forcing model developers to care about pipelines and enabling all potential uses for a given head. It would also mean that users wouldn't necessarily know which AutoModelFor... to use, as there would now be aliases. In addition, it would mean we would be forced to unnecessarily add new tasks automatic-speech-recognition-ctc and automatic-speech-recognition-seq2seq.
Both are highly undesirable IMHO.

Keeping the current architecture is alright, I think.
In case of a 1-to-n mapping, there needs to be a default task that works (text-generation will always work, for instance).
And in case of an n-to-1 mapping, adding switches within the pipeline is OK IMO. As long as there's one switch only and the rest of the pipeline is common, it works fine, but we might want to consider other designs as the pipelines for different AutoModelFor... start to differ significantly (like #13622). This is a code architecture issue that is very doable IMO.
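The "one switch only" shape described here can be sketched with toy models — asr_pipeline, CTCModel, Seq2SeqModel, and the token scheme are all hypothetical, not the actual pipeline code:

```python
# Toy illustration of a pipeline with shared pre-/post-processing and a single
# branch on the head type. None of these names are real transformers code.

class CTCModel:
    def forward(self, feats):
        return [t for t in feats if t != 0]  # toy CTC: drop blank (0) frames

class Seq2SeqModel:
    def generate(self, feats):
        return feats[::-1]  # toy autoregressive decode

def preprocess(audio):
    return audio  # shared front-end (feature extraction would go here)

def postprocess(token_ids):
    return "".join(chr(96 + t) for t in token_ids)  # shared id-to-char decode

def asr_pipeline(model, audio):
    feats = preprocess(audio)
    if hasattr(model, "generate"):   # the single switch
        ids = model.generate(feats)
    else:
        ids = model.forward(feats)
    return postprocess(ids)

print(asr_pipeline(CTCModel(), [0, 8, 9, 0]))  # -> "hi"
print(asr_pipeline(Seq2SeqModel(), [8, 9]))    # -> "ih"
```

Everything outside the `if hasattr(...)` branch stays common, which is the property Narsil is arguing for.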

Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022