Add model like #14992

Merged

sgugger merged 62 commits into master from add_model_like on Jan 24, 2022

Conversation

sgugger (Collaborator) commented Dec 30, 2021

What does this PR do?

This PR adds a new command, transformers-cli add-new-model-like, that creates a new model in the Transformers library as an exact copy of an existing one, for new models that are just tweaked versions of an existing model.

Fixes #14032

Note that this PR is still a draft; I'm just putting all of this here so that @stas00 can start playing around with it. There are still some rough edges, including:

  • it only works with a config file right now; the plan is to create a simple questionnaire for the user to fill in
  • the doc page of the model is not created
  • the command should issue some warnings/recommendations (see below)
  • we can't filter the frameworks yet (to include just PyTorch/TF/Flax)

Also it needs to be cleaned up and tested before a merge, but I'm going on vacation, and this way there is a prototype available to play with while I'm not here :-)

The whole thing works with string matches/replacements, so it can easily fall apart when a model uses the same string for its model type and its checkpoint (like gpt2), or for its camel-cased and upper-cased names (like GPT2). I used gpt2 as a test model to make sure those cases don't get out of control, but it's still a good idea to proofread the result.
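For illustration, here is a minimal sketch (not the PR's actual code) of the kind of collision meant here, using the hypothetical gpt-new-new patterns from the config example further down:

line = 'tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # model_type: gpt2'

# Replacing the model type also clobbers the checkpoint name, since both are
# the exact string "gpt2" (case-sensitive, so GPT2Tokenizer is untouched):
print(line.replace("gpt2", "gpt-new-new"))
# -> tokenizer = GPT2Tokenizer.from_pretrained("gpt-new-new")  # model_type: gpt-new-new
# The checkpoint should instead have become "huggingface/gpt-new-base".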

Other than that the command:

  • creates the necessary submodules and test files
  • reuses the tokenizer of the model we copy from if the user indicates they are the same (probably the most common use case)
  • puts everything in the right places in the inits and auto modules
  • adds # Copied from statements everywhere it can (some will need to be removed in the submodules tweaked by the user); see the example after this list
  • adds a draft of the doc file
  • filters per selected frameworks (if given)
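For reference, a generated statement looks like this (a minimal sketch; the GPTNewNew names are hypothetical):

from torch import nn

# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->GPTNewNew
class GPTNewNewSelfOutput(nn.Module):
    ...

The check-copies quality script then checks that the body of such a class stays in sync with BertSelfOutput, with the Bert->GPTNewNew rename applied.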

There are some rough edges still:

  • if tokenization files are created, they won't contain the proper mappings, so they need to be fixed manually. Likewise, if there is a fast tokenizer, the converter needs to be added manually (the command warns the user they have to do this in that case); see the sketch after this list
  • some models, like BERT (which imports BasicTokenizer), expose objects in the main init that are not prefixed with the model name; in that case the init needs to be fixed manually to avoid duplicate imports
  • with the copied from option activated, there might be some # Copied from statements to remove manually because they fight with black/the doc-styler
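For the converter fix mentioned in the first point, the manual step amounts to one entry in the SLOW_TO_FAST_CONVERTERS mapping of src/transformers/convert_slow_tokenizer.py. A minimal sketch, assuming the new tokenizer behaves exactly like GPT-2's (GPTNewNewTokenizer is hypothetical):

from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, GPT2Converter

# Reuse the existing GPT-2 converter for the new slow tokenizer class so a
# fast tokenizer can be built from it (in practice this entry is added
# directly in the mapping's definition rather than patched at runtime):
SLOW_TO_FAST_CONVERTERS["GPTNewNewTokenizer"] = GPT2Converter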

Those can be fixed in followup PRs if necessary, but I think they are acceptable steps for a user to fix manually for now. The added test creates a DistilBert-like model (we can't use Bert, as seen above) and checks that the new model passes all the quality checks and that its common PyTorch tests pass (running the TF/Flax tests is significantly slower, but they pass too).

To test right now, use an environment where the clone of the transformers repo you are working in is the registered transformers library (it might work even if it isn't, since the transformers module is only used to get the constants in the auto modules, but that's untested).

The easiest way is to run transformers-cli add-new-model-like and follow the prompts.

Otherwise, create a config.json file like the "add_new_model_config.json" in the first commits of this PR (provided as an example; it has been removed in the final version) with content like this (the # comments are explanations only, not valid JSON, so remove them from the actual file):

{
    "add_copied_from": true, # If false, won't add any Copied from statements
    "old_model_type": "gpt2", # Needs to be a valid model type
    "new_model_patterns": {
        "model_name": "GPT-New new", # Model name for the doc
        "checkpoint": "huggingface/gpt-new-base", # checkpoint to use in all examples
        "model_type": "gpt-new-new", # Model type as saved in the configs
        "model_lower_cased": "gpt_new_new", # Used for the function names and module name
        "model_camel_cased": "GPTNewNew", # Used for the class names
        "model_upper_cased": "GPT_NEW_NEW", # Used for the constant names
        "config_class": "GPTNewNewConfig", # Config class, will default to {model_camel_cased}Config if not provided
        "tokenizer_class": "GPT2Tokenizer" # Tokenizer class, will default to {model_camel_cased}Tokenizer if not provided (which creates a new tokenizer)
    },
    "frameworks": [
        "pt",
        "tf",
        "flax"
    ]
} 

then run

transformers-cli add-new-model-like --config_file path_to_config

Note that you may have to redo an editable install of the repo with this branch checked out to properly register the new CLI command.
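From the root of your transformers clone, that usually means re-running the standard editable install:

pip install -e .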

@sgugger sgugger marked this pull request as draft December 30, 2021 21:27
LysandreJik (Member):

This is really cool, looking forward to it!

@sgugger sgugger marked this pull request as ready for review January 12, 2022 21:52
LysandreJik (Member) left a comment

Thank you for your PR, @sgugger! This is a fantastic addition, which will be extremely helpful. Having it work perfectly is a complex problem, so I'm very impressed to see it perform so well already.

Questions/suggestions:

  • It generates the conversion script, but I'm not sure this is relevant/exact in most cases. I would put it behind a question as well: "Would you like to have the same conversion script as XXX in order to convert from an original checkpoint?"
  • # Copied from statements appear twice when we're adding a model like another one that already has copied from statements. Both statements do point to the appropriate model name, however.
  • I tried to add a model like the roberta model; here are some attributes I would have expected to change but which remained like the original:
  • The copyright at the top is a complex question IMO, as the code is definitely inheriting from another, so the copyright should be kept - but shouldn't the script ask for the organization which authored the model, so that the copyright of that org is also respected? Maybe not, if they don't modify much of it. Open question!
  • All copyrights should be changed to 2022
  • The integration test generated didn't use the checkpoint I specified, and instead changed the model identifier to include the lowercase model name I chose. Not necessarily important, or worth the (most likely complex!) change.
  • I think the current approach is only set to work with NLP models? (at least models that have a tokenizer, even if non-NLP) Trying it with detr, for example, throws a KeyError since detr is not in TOKENIZER_MAPPING_NAMES
  • Absolutely love this:
What is the model you would like to duplicate? s2t
s2t is not a valid model type.
What is the model you would like to duplicate? speech2text
speech2text is not a valid model type.
Did you mean speech_to_text or speech_to_text_2 or unispeech?

- "src/**"
- "tests/**"
- ".github/**"
types: [assigned, opened, synchronize, reopened]
Member:

(I know this is in the original model templates too) I don't think assigned is necessary, as it will relaunch the tests when the PR is assigned to someone even though the tests will already have been run

sgugger (Collaborator, Author):

Will remove!

Comment on lines +14 to +18
"frameworks": [
"pt",
"tf",
"flax"
]
Member:

Cool!

stas00 (Contributor) commented Jan 14, 2022

Once you're happy with it, we can put it to practical use and give it a good real-case test by redoing #14084, which got out of sync with all the recent revamps.

So basically cloning GPT2 to create GPTMeg and then adding the 3 small changes that are the real difference from GPT2.

sgugger (Collaborator, Author) commented Jan 14, 2022

Note that GPT-2 is the model where this command is the most likely to fail, as its checkpoint name is the same as its model type/lower-cased model name, which creates bad replacements I can't really control.

It will also duplicate the GPT2DoubleHeadsModel class, which I'm not sure you want for your new model. I would advise duplicating GPT-Neo or GPT-J.

stas00 (Contributor) commented Jan 14, 2022

But I'm not modifying GPT-Neo or GPT-J. I don't understand why going from GPT2 to GPTMeg is different from going from GPTNeo to GPTMeg. Is it because GPTNeo has a postfix after the prefix GPT?

Could you please give an example of what you mean when you say it'd fail?

Perhaps it can be done in 2 steps? GPT to XYZ, and then a few one-liners to rename XYZ to the target?

sgugger (Collaborator, Author) commented Jan 17, 2022

Like I said, it's because the checkpoint for GPT2 is named gpt2, which is also the model type of GPT2. It's the only model that conflates the two, which results in instances of the checkpoint name not being replaced by the new model's checkpoint, but by the new model's model type. So the generated files need careful proofreading.

  • Could you please give an example of what you mean when you say it'd fail?

Just run the command and look at the result.

@sgugger sgugger force-pushed the add_model_like branch 2 times, most recently from c3a6fa6 to 877f5fe Compare January 19, 2022 13:50
sgugger (Collaborator, Author) commented Jan 20, 2022

This PR is now in a state that is good in my opinion. I have:

  • revamped the core of the replacement script to make it more resilient. This solves the issues you pointed out with the RoBERTa naming and the wrong checkpoints
  • added support for non-NLP models
  • added a whole test suite for the utilities the command uses.

Also, there was a fix to the check-copies script on master which makes all the quality tests pass when the newly added model includes # Copied from comments.

I don't have much more time to spend on this, so for the following problems (or potential bugs) I'd like us to rely on the community. The added test suite is there to catch any regression.

  • It generates the conversion script, but I'm not sure this is relevant/exact in most cases. I would put it behind a question as well "Would you like to have the same conversion script as XXX in order to convert from an original checkpoint?"

This could be a nice feature to add, but it's also super easy to just remove the conversion scripts since they are completely independent.

  • # Copied from statements appear twice in case we're adding a model like another which already has copied from statements. The two statements seem to copy to the appropriate model name chosen, however.

This is fixed.

  • The copyright at the top is a complex question IMO, as the code is definitely inheriting from another so the copyright should be kept - but shouldn't the script ask for the organization which authored the model so that the copyright to that org is also respected? Maybe not, if they don't modify much of it. Open question!
  • All copyrights should be changed to 2022

I haven't touched the copyrights at all. The new model's author should add themselves manually, but if not much is changed, I think it's good to keep the defaults pointing to the authors of the copied model. For the change of year, we can make it a good first issue.

LysandreJik (Member) left a comment

For all intents and purposes, this seems to work very well!

I tested with the following starting models:

  • DETR
  • BERT
  • RoBERTa
  • Wav2Vec2
  • Vision encoder decoder

One bug I found with the vision encoder-decoder class (though it's questionable whether this class should be supported by this script) is that the resulting code still has [CONFIG_CLASS] everywhere.

Other than that, it looks great!

sgugger and others added 2 commits January 24, 2022 14:54
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
sgugger (Collaborator, Author) commented Jan 24, 2022

Noted for the bug in VisionEncoderDecoder. It doesn't block the merge since it's not really a target of this PR, but it's good to keep in mind!

sgugger (Collaborator, Author) commented Jan 24, 2022

The failure is unrelated, so merging this!

@sgugger sgugger merged commit 81156d2 into master Jan 24, 2022
@sgugger sgugger deleted the add_model_like branch January 24, 2022 20:25
Linked issue: [feature request] a tool to clone existing models to make new models with small changes (#14032)