
Remove n_ctx from configs #14165

Merged
5 commits merged into master on Oct 29, 2021

Conversation

@thomasw21 (Contributor) commented Oct 26, 2021

What does this PR do?

Remove n_ctx from configs, as it's just a duplicate of n_positions.

GPTJ was left unchanged because it has a linear layer that depends on n_ctx. I'm unclear why; is it a bug where the author meant n_embed?
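
(For context, the layer in question has roughly this shape. This is an illustrative paraphrase with made-up values, not the exact code in modeling_gptj:)

import torch.nn as nn

class DummyGPTJConfig:
    n_ctx = 2048   # context length
    n_embd = 4096  # hidden size

config, num_labels = DummyGPTJConfig(), 2

# The projection depends on config.n_ctx where the hidden size (n_embd) may
# have been intended, which is why GPT-J was left unchanged in this PR.
score = nn.Linear(config.n_ctx, num_labels, bias=False)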

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten @stas00

@patrick-s-h-lewis (Contributor)

yo, you probably meant to tag @patrickvonplaten :)

@thomasw21 (Contributor, Author)

Oops! Thanks!

@thomasw21 force-pushed the thomas/remove_ctx_from_gpt2_config branch from c6d49b3 to d34f577 on October 26, 2021 16:42
@thomasw21 marked this pull request as ready for review on October 26, 2021 16:59
@stas00 (Contributor) commented Oct 26, 2021

Thank you, Thomas!

I haven't had a chance to confirm that it's truly identical. Perhaps someone else could do it, since I won't be able to right now.

But a question: how are we going to deal with backward compatibility if we are removing a config key? Do we assume that n_positions must be there and therefore it should just work?

@sgugger, what do you think?

@sgugger (Collaborator) commented Oct 26, 2021

In terms of backward compatibility, it won't change anything to have this attribute or not: the config will still set the n_ctx attribute to the value inside the config.json on the model repo (if it is defined there). If it's not used anywhere in a model's code, it can be removed, IMO.
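
(For illustration, a minimal sketch of that behavior, assuming unknown config keys are kept as plain attributes as described above:)

from transformers import GPT2Config

# A key that is no longer declared on the config class (like n_ctx after this
# PR) is still stored as a plain attribute, so old config.json files that
# contain it keep loading without errors.
config = GPT2Config(n_ctx=1024)
print(config.n_ctx)        # 1024, still accessible for old configs
print(config.n_positions)  # 1024, the attribute the modeling code actually uses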

cc @LysandreJik

@stas00 (Contributor) commented Oct 26, 2021

What about the function arguments? This is part of this PR:

- def __init__(self, nx, n_ctx, config, scale=False):
+ def __init__(self, nx, n_positions, config, scale=False):

@sgugger (Collaborator) commented Oct 26, 2021

Changing the inside blocks is fine, I think, since those are internal, although I'm not sure the slight breaking change is worth it. However, changing the config arguments used here and there (e.g. using config.n_positions instead of config.n_ctx) is not fine (I had not noticed it on my first pass) unless we can guarantee that every model on the Hub uses the same values for n_ctx and n_positions.

All in all, I'm not sure this is worth changing anything compared to what we are potentially breaking.

@stas00 (Contributor) commented Oct 26, 2021

So that means we will never be able to clean it up, not even in the proverbial v5.

I suppose in v5 we could add an assert: if config.json has n_ctx and it's not the same as n_positions, we alert the user that n_ctx is deprecated and use n_positions instead? (A rough sketch of what such a check could look like is below.)

Or, asking a different question instead: is it even possible for a config to have different values for n_ctx and n_positions? If such a config can't work by definition, i.e. there would be some mismatch error, then it should be safe to assume that n_ctx and n_positions are identical on the Hub. I haven't validated that this is so, but if it is, then...
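
Something along these lines (a hypothetical sketch, not code from this PR; the helper name is made up):

import warnings

def resolve_n_positions(config_dict):
    """Hypothetical v5-style check: prefer n_positions, warn on legacy n_ctx."""
    n_positions = config_dict.get("n_positions")
    n_ctx = config_dict.get("n_ctx")
    if n_ctx is not None:
        warnings.warn("`n_ctx` is deprecated, use `n_positions` instead.", FutureWarning)
        if n_positions is not None and n_ctx != n_positions:
            raise ValueError(
                f"Conflicting values in config.json: n_ctx={n_ctx} vs n_positions={n_positions}."
            )
        if n_positions is None:
            n_positions = n_ctx
    return n_positions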

@sgugger (Collaborator) commented Oct 26, 2021

A major release won't change anything, as breaking changes to existing models on the Hub will never be accepted. I have no idea why the original model design has those two attributes, so I can't answer your second question.

I guess the best path forward is to study all configs on the Hub with a script, and check if there is one with two different values for those attributes. If there are none, we can proceed with the PR as it is.

@stas00 (Contributor) commented Oct 26, 2021

Agreed!

Is there an easy way to download all the config files? Does the Hub have some sort of index with all files and their locations? (Assuming config.json is not in LFS, as it's a tiny file.)

@sgugger (Collaborator) commented Oct 26, 2021

I think @patrickvonplaten might have a script to help on this :-)

@patrickvonplaten (Contributor) commented Oct 26, 2021

Okay, as far as I can see the only model where anything could potentially break is modeling_openai. I'm happy to scan the Hub to check whether there are any models out there where this could break. Will run the script tomorrow and let you know.

@patrickvonplaten (Contributor)

GPT2 can be cleaned up for sure, IMO; n_ctx is never actually used there.

@stas00 requested review from stas00 and removed the request for stas00 on October 26, 2021 22:40
@thomasw21 (Contributor, Author)

Okay, so the only config-wise "breaking changes" are GPTJForSequenceClassification and the OpenAIGPT* models.

I scanned the Hub for configs where n_ctx != n_positions and found none (which doesn't mean there aren't any in the wild, I guess...).

Sub-blocks are not considered a breaking change. cc @patrickvonplaten

@thomasw21 (Contributor, Author)

Script used to scan the Hub:

import argparse
from typing import Tuple

from transformers import AutoConfig
import json
from multiprocessing import Pool

def get_args():
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--model-ids-file", default=None, type=str, required=True, help="Path to the json file containing all models ids."
    )
    parser.add_argument(
        "--procs", default=1, type=int, help="Number of processes."
    )
    return parser.parse_args()

def check_config(model_id) -> Tuple[str, bool, str]:
    model_id = model_id.strip()
    try:
        config = AutoConfig.from_pretrained(model_id)
    except Exception:
        return model_id, False, f"{model_id} cannot load config"

    if isinstance(config.architectures, list) and len(config.architectures) > 0:
        # here need to check for all `OpenAIGPT...` class names and filter

        # for architecture in config.architectures:
        #     if architecture == "GPTJForSequenceClassification":
        #         if config.n_ctx != config.n_positions:
        #             return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        # return model_id, False, f"No architecture matched GPTJForSequenceClassification or all have matching n_ctx n_positions"

        for architecture in config.architectures:
            if architecture.startswith("OpenAIGPT"):
                if config.n_ctx != config.n_positions:
                    return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        return model_id, False, f"No architecture matched OpenAIGPT* or all have matching n_ctx and n_positions"



        # model_type_filter = ["gpt", "gpt2", "ctrl"]
        # if config.model_type in model_type_filter:
        #     if config.n_ctx != config.n_positions:
        #         return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        #     return model_id, False, f"{config.model_type}, n_ctx == n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"
        # return model_id, False, f"{config.model_type} not in {model_type_filter}"
    else:
        return model_id, False, f"{model_id} is BAD!"

def main():
    args = get_args()

    with open(args.model_ids_file, "r") as f:
        lines = json.load(f)

    model_ids = [line["modelId"] for line in lines]
    print(model_ids)
    if args.procs > 1:
        pool = Pool(args.procs)
        model_ids_and_reasons = pool.imap(check_config, model_ids)
    else:
        model_ids_and_reasons = [check_config(model_id) for model_id in model_ids]

    all_matches = []
    for i, model_ids_and_reason in enumerate(model_ids_and_reasons):
        if i % 1000 == 0:
            print(i)
        if model_ids_and_reason[1] is False:
            continue

        else:
            all_matches.append(model_ids_and_reason)

    for match in all_matches:
        print(f"{match[0]} is not safe {match[2]}")

if __name__ == "__main__":
    main()
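
(For reference, the --model-ids-file input can presumably be produced from the Hub's public model index. An untested sketch, assuming https://huggingface.co/api/models returns a JSON list of objects that each carry a "modelId" field:)

import json

import requests

# Hypothetical helper, not part of the PR: dump the Hub's model index so it
# can be passed to the scan script via --model-ids-file.
models = requests.get("https://huggingface.co/api/models").json()
with open("model_ids.json", "w") as f:
    json.dump(models, f)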

@patrickvonplaten (Contributor) left a comment

Thanks for working on it! I verified that CTRL and GPT2 actually never make use of n_ctx, so it's safe to remove it.

The only models where n_ctx has an effect are the OpenAIGPT* models and GPTJForSequenceClassification. All those classes are rarely used, and there is not a single config on the Hub that has n_ctx != n_positions, so IMO the breaking change in this PR is OK!

@sgugger (Collaborator) commented Oct 28, 2021

Thanks a lot for checking the existing configs @patrickvonplaten !

@stas00 (Contributor) commented Oct 28, 2021

Could we please save the scanning scripts under https://github.com/huggingface/transformers/tree/master/scripts?
And how do we get the list of models that the script takes? Thank you!

@patrickvonplaten (Contributor)

Tests currently fail on master as well. I think this needs to be fixed and then the PR should be rebased :-)

@LysandreJik (Member) left a comment

Fine by me, thank you for working on this!

@stas00 (Contributor) commented Oct 28, 2021

Tests currently fail on master as well. I think this needs to be fixed and then the PR should be rebased :-)

FWIW, the run_tests_hub CI is currently broken; the failure is not related to this PR.

@thomasw21 (Contributor, Author)

Reran the tests; they seem to pass now. The previous failures seemed unrelated to this PR. Merging, then.

@thomasw21 merged commit 5b45422 into master on Oct 29, 2021
@thomasw21 deleted the thomas/remove_ctx_from_gpt2_config branch on October 29, 2021 09:50