
Remove n_ctx from configs #14165

Merged
5 commits merged into master on Oct 29, 2021

Conversation

@thomasw21 (Contributor) commented Oct 26, 2021

What does this PR do?

Remove n_ctx from configs, as it's just a duplicate of n_positions.

GPTJ was left unchanged because it has a linear layer that depends on n_ctx. I'm unclear why; is it a bug where the author meant n_embed?
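
(For context, the layer in question has roughly this shape. This is an illustrative paraphrase with made-up values, not the exact code in modeling_gptj:)

import torch.nn as nn

class DummyGPTJConfig:
    n_ctx = 2048   # context length
    n_embd = 4096  # hidden size

config, num_labels = DummyGPTJConfig(), 2

# The projection depends on config.n_ctx where the hidden size (n_embd) may
# have been intended, which is why GPT-J was left unchanged in this PR.
score = nn.Linear(config.n_ctx, num_labels, bias=False)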

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@patrickvonplaten @stas00

@patrick-s-h-lewis (Contributor)

yo, you probably meant to tag @patrickvonplaten :)

@thomasw21 (Contributor, Author)

Oops! Thanks!

@thomasw21 force-pushed the thomas/remove_ctx_from_gpt2_config branch from c6d49b3 to d34f577 on October 26, 2021 16:42
@thomasw21 marked this pull request as ready for review on October 26, 2021 16:59
@stas00 (Contributor) commented Oct 26, 2021

Thank you, Thomas!

I haven't had a chance to confirm that it's truly identical. Perhaps someone else could do it, since I won't be able to right now.

But a question: how are we going to deal with backward compatibility if we are removing a config key? Do we assume that n_positions must be there and therefore it should just work?

@sgugger, what do you think?

@sgugger (Collaborator) commented Oct 26, 2021

In terms of backward compatibility, it won't change anything to have this attribute or not: the config will still set the n_ctx attribute to the value inside the config.json on the model repo (if it is defined there). If it's not used anywhere in a model's code, it can be removed, IMO.
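
(For illustration, a minimal sketch of that behavior, assuming unknown config keys are kept as plain attributes as described above:)

from transformers import GPT2Config

# A key that is no longer declared on the config class (like n_ctx after this
# PR) is still stored as a plain attribute, so old config.json files that
# contain it keep loading without errors.
config = GPT2Config(n_ctx=1024)
print(config.n_ctx)        # 1024, still accessible for old configs
print(config.n_positions)  # 1024, the attribute the modeling code actually uses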

cc @LysandreJik

@stas00 (Contributor) commented Oct 26, 2021

What about the function arguments? This is part of this PR:

- def __init__(self, nx, n_ctx, config, scale=False):
+ def __init__(self, nx, n_positions, config, scale=False):

@sgugger (Collaborator) commented Oct 26, 2021

Changing the inside blocks is fine, I think, since those are internal, although I'm not sure the slight breaking change is worth it. However, changing the config arguments used here and there (e.g. using config.n_positions instead of config.n_ctx) is not fine (I had not noticed it on my first pass) unless we can guarantee that every model on the Hub uses the same values for n_ctx and n_positions.

All in all, I'm not sure this is worth changing anything compared to what we are potentially breaking.

@stas00 (Contributor) commented Oct 26, 2021

So that means we will never be able to clean it up, not even in the proverbial v5.

I suppose in v5 we could add an assert: if config.json has n_ctx and it's not the same as n_positions, we alert the user that n_ctx is deprecated and use n_positions instead? (A rough sketch of what such a check could look like is below.)

Or, asking a different question instead: is it even possible for a config to have different values for n_ctx and n_positions? If such a config can't work by definition, i.e. there would be some mismatch error, then it should be safe to assume that n_ctx and n_positions are identical on the Hub. I haven't validated that this is so, but if it is, then...
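
Something along these lines (a hypothetical sketch, not code from this PR; the helper name is made up):

import warnings

def resolve_n_positions(config_dict):
    """Hypothetical v5-style check: prefer n_positions, warn on legacy n_ctx."""
    n_positions = config_dict.get("n_positions")
    n_ctx = config_dict.get("n_ctx")
    if n_ctx is not None:
        warnings.warn("`n_ctx` is deprecated, use `n_positions` instead.", FutureWarning)
        if n_positions is not None and n_ctx != n_positions:
            raise ValueError(
                f"Conflicting values in config.json: n_ctx={n_ctx} vs n_positions={n_positions}."
            )
        if n_positions is None:
            n_positions = n_ctx
    return n_positions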

@sgugger (Collaborator) commented Oct 26, 2021

A major release won't change anything, as breaking changes to existing models on the Hub will never be accepted. I have no idea why the original model design has those two attributes, so I can't answer your second question.

I guess the best path forward is to study all configs on the Hub with a script, and check if there is one with two different values for those attributes. If there are none, we can proceed with the PR as it is.

@stas00 (Contributor) commented Oct 26, 2021

Agreed!

Is there an easy way to download all the config files? Does the Hub have some sort of index with all files and their locations? (Assuming config.json is not in LFS, as it's a tiny file.)

@sgugger (Collaborator) commented Oct 26, 2021

I think @patrickvonplaten might have a script to help on this :-)

@patrickvonplaten (Contributor) commented Oct 26, 2021

Okay, as far as I can see the only model where anything could potentially break is modeling_openai. I'm happy to scan the Hub to check whether there are any models out there where this could break. Will run the script tomorrow and let you know.

@patrickvonplaten (Contributor)

GPT2 can be cleaned up for sure, IMO; n_ctx is never actually used there.

@stas00 requested review from stas00 and removed the request for stas00 on October 26, 2021 22:40
@thomasw21 (Contributor, Author)

Okay, so the only config-wise "breaking changes" are GPTJForSequenceClassification and the OpenAIGPT* models.

I scanned the Hub for configs where n_ctx != n_positions and found none (which doesn't mean there aren't any in the wild, I guess...).

Sub-blocks are not considered a breaking change. cc @patrickvonplaten

@thomasw21 (Contributor, Author)

Script used to scan the Hub:

import argparse
from typing import Tuple

from transformers import AutoConfig
import json
from multiprocessing import Pool

def get_args():
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--model-ids-file", default=None, type=str, required=True, help="Path to the json file containing all models ids."
    )
    parser.add_argument(
        "--procs", default=1, type=int, help="Number of processes."
    )
    return parser.parse_args()

def check_config(model_id) -> Tuple[str, bool, str]:
    model_id = model_id.strip()
    try:
        config = AutoConfig.from_pretrained(model_id)
    except Exception:
        return model_id, False, f"{model_id} cannot load config"

    if isinstance(config.architectures, list) and len(config.architectures) > 0:
        # here need to check for all `OpenAIGPT...` class names and filter

        # for architecture in config.architectures:
        #     if architecture == "GPTJForSequenceClassification":
        #         if config.n_ctx != config.n_positions:
        #             return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        # return model_id, False, f"No architecture matched GPTJForSequenceClassification or all have matching n_ctx n_positions"

        for architecture in config.architectures:
            if architecture.startswith("OpenAIGPT"):
                if config.n_ctx != config.n_positions:
                    return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        return model_id, False, f"No architecture matched OpenAIGPT* or all have matching n_ctx and n_positions"



        # model_type_filter = ["gpt", "gpt2", "ctrl"]
        # if config.model_type in model_type_filter:
        #     if config.n_ctx != config.n_positions:
        #         return model_id, True, f"{config.model_type}, n_ctx != n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"

        #     return model_id, False, f"{config.model_type}, n_ctx == n_positions with n_ctx={config.n_ctx} n_positions={config.n_positions}"
        # return model_id, False, f"{config.model_type} not in {model_type_filter}"
    else:
        return model_id, False, f"{model_id} is BAD!"

def main():
    args = get_args()

    with open(args.model_ids_file, "r") as f:
        lines = json.load(f)

    model_ids = [line["modelId"] for line in lines]
    print(model_ids)
    if args.procs > 1:
        pool = Pool(args.procs)
        model_ids_and_reasons = pool.imap(check_config, model_ids)
    else:
        model_ids_and_reasons = [check_config(model_id) for model_id in model_ids]

    all_matches = []
    for i, model_ids_and_reason in enumerate(model_ids_and_reasons):
        if i % 1000 == 0:
            print(i)
        if model_ids_and_reason[1] is False:
            continue

        else:
            all_matches.append(model_ids_and_reason)

    for match in all_matches:
        print(f"{match[0]} is not safe {match[2]}")

if __name__ == "__main__":
    main()
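
(For reference, the --model-ids-file input can presumably be produced from the Hub's public model index. An untested sketch, assuming https://huggingface.co/api/models returns a JSON list of objects that each carry a "modelId" field:)

import json

import requests

# Hypothetical helper, not part of the PR: dump the Hub's model index so it
# can be passed to the scan script via --model-ids-file.
models = requests.get("https://huggingface.co/api/models").json()
with open("model_ids.json", "w") as f:
    json.dump(models, f)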

@patrickvonplaten (Contributor) left a comment

Thanks for working on it! I verified that CTRL and GPT2 actually never make use of n_ctx, so it's safe to remove it.

The only models where n_ctx has an effect are the OpenAIGPT* models and GPTJForSequenceClassification. All those classes are rarely used, and there is not a single config on the Hub that has n_ctx != n_positions, so IMO the breaking change in this PR is OK!

@sgugger (Collaborator) commented Oct 28, 2021

Thanks a lot for checking the existing configs @patrickvonplaten !

@stas00 (Contributor) commented Oct 28, 2021

Could we please save the scanning scripts under https://github.com/huggingface/transformers/tree/master/scripts?
And how do we get the list of models that the script takes? Thank you!

@patrickvonplaten (Contributor)

Tests currently fail on master as well. I think this needs to be fixed and then the PR should be rebased :-)

@LysandreJik (Member) left a comment

Fine by me, thank you for working on this!

@stas00 (Contributor) commented Oct 28, 2021

Tests currently fail on master as well. I think this needs to be fixed and then the PR should be rebased :-)

FWIW, the run_tests_hub CI is currently broken; the failure is not related to this PR.

@thomasw21 (Contributor, Author)

Reran the tests; they seem to pass now. The previous failures seemed unrelated to this PR. Merging, then.

@thomasw21 merged commit 5b45422 into master on Oct 29, 2021
@thomasw21 deleted the thomas/remove_ctx_from_gpt2_config branch on October 29, 2021 09:50