Add model like #14992

Merged

sgugger merged 62 commits into master from add_model_like on Jan 24, 2022

Conversation

sgugger (Collaborator) commented Dec 30, 2021

What does this PR do?

This PR adds a new command, transformers-cli add-new-model-like, that creates a new model in the Transformers library as an exact copy of an existing one, for new models that are just tweaked versions of an existing model.

Fixes #14032

Note that this PR is still a draft; I'm just putting all of this here so that @stas00 can start playing around with it. There are still some rough edges, including:

  • it only works with a config file right now; the plan is to create a simple questionnaire for the user to fill in
  • the doc page of the model is not created
  • the command should issue some warnings/recommendations (see below)
  • we can't filter the frameworks yet (to include just PyTorch/TF/Flax)

Also it needs to be cleaned up and tested before a merge, but I'm going on vacation, and this way there is a prototype available to play with while I'm not here :-)

The whole thing works with string matches/replacements, so it can easily fall apart when a model uses the same string for its model type and its checkpoint (like gpt2), or for its camel-cased and upper-cased names (like GPT2). I used gpt2 as a test model to make sure those cases don't get out of control, but it's still a good idea to proofread the result.
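For illustration, here is a minimal sketch (not the PR's actual code) of the kind of collision meant here, using the hypothetical gpt-new-new patterns from the config example further down:

line = 'tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # model_type: gpt2'

# Replacing the model type also clobbers the checkpoint name, since both are
# the exact string "gpt2" (case-sensitive, so GPT2Tokenizer is untouched):
print(line.replace("gpt2", "gpt-new-new"))
# -> tokenizer = GPT2Tokenizer.from_pretrained("gpt-new-new")  # model_type: gpt-new-new
# The checkpoint should instead have become "huggingface/gpt-new-base".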

Other than that the command:

  • creates the necessary submodules and test files
  • reuses the tokenizer of the model we copy from if the user indicates they are the same (probably the most common use case)
  • puts everything in the right places in the inits and auto modules
  • adds # Copied from statements everywhere it can (some will need to be removed in the submodules tweaked by the user); see the example after this list
  • adds a draft of the doc file
  • filters per selected frameworks (if given)
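For reference, a generated statement looks like this (a minimal sketch; the GPTNewNew names are hypothetical):

from torch import nn

# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->GPTNewNew
class GPTNewNewSelfOutput(nn.Module):
    ...

The check-copies quality script then checks that the body of such a class stays in sync with BertSelfOutput, with the Bert->GPTNewNew rename applied.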

There are some rough edges still:

  • if tokenization files are created, they won't contain the proper mappings, so they need to be fixed manually. Likewise, if there is a fast tokenizer, the converter needs to be added manually (the command warns the user they have to do this in that case); see the sketch after this list
  • some models, like BERT (which imports BasicTokenizer), expose objects in the main init that are not prefixed with the model name; in that case the init needs to be fixed manually to avoid duplicate imports
  • with the copied from option activated, there might be some # Copied from statements to remove manually because they fight with black/the doc-styler
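For the converter fix mentioned in the first point, the manual step amounts to one entry in the SLOW_TO_FAST_CONVERTERS mapping of src/transformers/convert_slow_tokenizer.py. A minimal sketch, assuming the new tokenizer behaves exactly like GPT-2's (GPTNewNewTokenizer is hypothetical):

from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, GPT2Converter

# Reuse the existing GPT-2 converter for the new slow tokenizer class so a
# fast tokenizer can be built from it (in practice this entry is added
# directly in the mapping's definition rather than patched at runtime):
SLOW_TO_FAST_CONVERTERS["GPTNewNewTokenizer"] = GPT2Converter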

Those can be fixed in followup PRs if necessary, but I think they are acceptable steps for a user to fix manually for now. The added test creates a DistilBert-like model (we can't use Bert, as seen above) and checks that the new model passes all the quality checks and that its common PyTorch tests pass (running the TF/Flax tests is significantly slower, but they pass too).

To test right now, use an environment where the clone of the transformers repo you are working in is the registered transformers library (it might work even if it isn't, since the transformers module is only used to get the constants in the auto modules, but that's untested).

The easiest way is to run transformers-cli add-new-model-like and follow the prompts.

Otherwise, create a config.json file like the "add_new_model_config.json" in the first commits of this PR (provided as an example; it has been removed in the final version) with content like this (the # comments are explanations only, not valid JSON, so remove them from the actual file):

{
    "add_copied_from": true, # If false, won't add any Copied from statements
    "old_model_type": "gpt2", # Needs to be a valid model type
    "new_model_patterns": {
        "model_name": "GPT-New new", # Model name for the doc
        "checkpoint": "huggingface/gpt-new-base", # checkpoint to use in all examples
        "model_type": "gpt-new-new", # Model type as saved in the configs
        "model_lower_cased": "gpt_new_new", # Used for the function names and module name
        "model_camel_cased": "GPTNewNew", # Used for the class names
        "model_upper_cased": "GPT_NEW_NEW", # Used for the constant names
        "config_class": "GPTNewNewConfig", # Config class, will default to {model_camel_cased}Config if not provided
        "tokenizer_class": "GPT2Tokenizer" # Tokenizer class, will default to {model_camel_cased}Tokenizer if not provided (which creates a new tokenizer)
    },
    "frameworks": [
        "pt",
        "tf",
        "flax"
    ]
} 

then run

transformers-cli add-new-model-like --config_file path_to_config

Note that you may have to redo an editable install of the repo with this branch checked out to properly register the new CLI command.
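From the root of your transformers clone, that usually means re-running the standard editable install:

pip install -e .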

@sgugger sgugger marked this pull request as draft December 30, 2021 21:27
LysandreJik (Member):

This is really cool, looking forward to it!

@sgugger sgugger marked this pull request as ready for review January 12, 2022 21:52
LysandreJik (Member) left a comment

Thank you for your PR, @sgugger! This is a fantastic addition, which will be extremely helpful. Having it work perfectly is a complex problem, so I'm very impressed to see it perform so well already.

Questions/suggestions:

  • It generates the conversion script, but I'm not sure this is relevant/exact in most cases. I would put it behind a question as well: "Would you like to have the same conversion script as XXX in order to convert from an original checkpoint?"
  • # Copied from statements appear twice when we're adding a model like another one that already has copied from statements. Both statements do point to the appropriate model name, however.
  • I tried to add a model like the roberta model; here are some attributes I would have expected to change but which remained like the original:
  • The copyright at the top is a complex question IMO, as the code is definitely inheriting from another, so the copyright should be kept - but shouldn't the script ask for the organization which authored the model, so that the copyright of that org is also respected? Maybe not, if they don't modify much of it. Open question!
  • All copyrights should be changed to 2022
  • The integration test generated didn't use the checkpoint I specified, and instead changed the model identifier to include the lowercase model name I chose. Not necessarily important, or worth the (most likely complex!) change.
  • I think the current approach is only set to work with NLP models? (at least models that have a tokenizer, even if non-NLP) Trying it with detr, for example, throws a KeyError since detr is not in TOKENIZER_MAPPING_NAMES
  • Absolutely love this:
What is the model you would like to duplicate? s2t
s2t is not a valid model type.
What is the model you would like to duplicate? speech2text
speech2text is not a valid model type.
Did you mean speech_to_text or speech_to_text_2 or unispeech?

- "src/**"
- "tests/**"
- ".github/**"
types: [assigned, opened, synchronize, reopened]
Member:

(I know this is in the original model templates too) I don't think assigned is necessary, as it will relaunch the tests when the PR is assigned to someone even though the tests will already have been run

sgugger (Collaborator, Author):

Will remove!

Comment on lines +14 to +18
"frameworks": [
"pt",
"tf",
"flax"
]
Member:

Cool!

stas00 (Contributor) commented Jan 14, 2022

Once you're happy with it, we can put it to practical use and give it a good real-case test by redoing #14084, which got out of sync with all the recent revamps.

So basically cloning GPT2 to create GPTMeg and then adding the 3 small changes that are the real difference from GPT2.

sgugger (Collaborator, Author) commented Jan 14, 2022

Note that GPT-2 is the model where this command is the most likely to fail, as its checkpoint name is the same as its model type/lower-cased model name, which creates bad replacements I can't really control.

It will also duplicate the GPT2DoubleHeadsModel class, which I'm not sure you want for your new model. I would advise duplicating GPT-Neo or GPT-J.

stas00 (Contributor) commented Jan 14, 2022

But I'm not modifying GPT-Neo or GPT-J. I don't understand why going from GPT2 to GPTMeg is different from going from GPTNeo to GPTMeg. Is it because GPTNeo has a postfix after the prefix GPT?

Could you please give an example of what you mean when you say it'd fail?

Perhaps it can be done in 2 steps? GPT to XYZ, and then a few one-liners to rename XYZ to the target?

sgugger (Collaborator, Author) commented Jan 17, 2022

Like I said, it's because the checkpoint for GPT2 is named gpt2, which is also the model type of GPT2. It's the only model that conflates the two, which results in instances of the checkpoint name not being replaced by the new model's checkpoint, but by the new model's model type. So the generated files need careful proofreading.

  • Could you please give an example of what you mean when you say it'd fail?

Just run the command and look at the result.

@sgugger sgugger force-pushed the add_model_like branch 2 times, most recently from c3a6fa6 to 877f5fe Compare January 19, 2022 13:50
sgugger (Collaborator, Author) commented Jan 20, 2022

This PR is now in a state that is good in my opinion. I have:

  • revamped the core of the replacement script to make it more resilient. This solves the issues you pointed out with the RoBERTa naming and the wrong checkpoints
  • added support for non-NLP models
  • added a whole test suite for the utilities the command uses.

Also, there was a fix to the check-copies script on master which makes all the quality tests pass when the newly added model includes # Copied from comments.

I don't have much more time to spend on this, so for the following problems (or potential bugs) I'd like us to rely on the community. The added test suite is there to catch any regression.

  • It generates the conversion script, but I'm not sure this is relevant/exact in most cases. I would put it behind a question as well "Would you like to have the same conversion script as XXX in order to convert from an original checkpoint?"

This could be a nice feature to add, but it's also super easy to just remove the conversion scripts since they are completely independent.

  • # Copied from statements appear twice in case we're adding a model like another which already has copied from statements. The two statements seem to copy to the appropriate model name chosen, however.

This is fixed.

  • The copyright at the top is a complex question IMO, as the code is definitely inheriting from another so the copyright should be kept - but shouldn't the script ask for the organization which authored the model so that the copyright to that org is also respected? Maybe not, if they don't modify much of it. Open question!
  • All copyrights should be changed to 2022

I haven't touched the copyrights at all. The new model's author should add themselves manually, but if not much is changed, I think it's good to keep the defaults pointing to the authors of the copied model. For the change of year, we can make it a good first issue.

LysandreJik (Member) left a comment

For all intents and purposes, this seems to work very well!

I tested with the following starting models:

  • DETR
  • BERT
  • RoBERTa
  • Wav2Vec2
  • Vision encoder decoder

One bug I found with the vision encoder-decoder class (though it's questionable whether this class should be supported by this script) is that the resulting code still has [CONFIG_CLASS] everywhere.

Other than that, it looks great!

sgugger and others added 2 commits January 24, 2022 14:54
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
sgugger (Collaborator, Author) commented Jan 24, 2022

Noted for the bug in VisionEncoderDecoder. It doesn't block the merge since it's not really a target of this PR, but it's good to keep in mind!

sgugger (Collaborator, Author) commented Jan 24, 2022

The failure is unrelated, so merging this!

@sgugger sgugger merged commit 81156d2 into master Jan 24, 2022
@sgugger sgugger deleted the add_model_like branch January 24, 2022 20:25
Linked issue: [feature request] a tool to clone existing models to make new models with small changes (#14032)