add context_length to model configs that use HF models (and as param to HFTokenizer) #222

Closed
iejMac opened this issue Nov 13, 2022 · 4 comments

Comments

@iejMac
Contributor

iejMac commented Nov 13, 2022

No description provided.

@iejMac iejMac mentioned this issue Nov 13, 2022
@usuyama

usuyama commented Nov 16, 2022

How about something like this?

# assuming this would live next to get_model_config in open_clip/factory.py;
# HFTokenizer and tokenize come from open_clip.tokenizer
def get_tokenizer(model_name):
    config = get_model_config(model_name)
    text_cfg = config['text_cfg']
    # fall back to CLIP's default context length of 77 when the config omits it
    context_length = text_cfg.get('context_length', 77)
    if 'hf_tokenizer_name' in text_cfg:
        # HF-backed text tower: wrap the matching HuggingFace tokenizer
        tokenizer = HFTokenizer(text_cfg['hf_tokenizer_name'], context_length)
    else:
        # default CLIP BPE tokenizer
        tokenizer = lambda texts: tokenize(texts, context_length)
    return tokenizer
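
For illustration, calling the returned tokenizer might look like this (the model name below is only an example of a config that sets hf_tokenizer_name, not something specified in this thread):

# hypothetical usage; 'roberta-ViT-B-32' is just an example config name
tokenizer = get_tokenizer('roberta-ViT-B-32')   # HF-backed branch
tokens = tokenizer(["a photo of a cat", "a photo of a dog"])
# tokens is a LongTensor of shape [2, context_length], same as the default tokenize() path
print(tokens.shape)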

@rom1504
Collaborator

rom1504 commented Nov 20, 2022

Sure, why not.

@iejMac
Contributor Author

iejMac commented Nov 20, 2022

btw, I'm not sure if this is urgent, so I've been putting it off. If people need this soon, I can push it up the priority queue.

@rom1504
Collaborator

rom1504 commented Nov 21, 2022

We noticed that some models that use relative positional embeddings (e.g. mT5) do not have a fixed context length.

That's an edge case to take into account here.
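
One way that edge case could be covered (a sketch only, assuming a convention where context_length=None means "no fixed limit", which the thread does not settle): the HF wrapper would skip truncation and fixed-length padding when no context length is configured.

from transformers import AutoTokenizer

class HFTokenizer:
    """Sketch of an HF tokenizer wrapper, not the actual open_clip class."""
    def __init__(self, tokenizer_name, context_length=77):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # None would signal models like mT5 (relative positional embeddings, no hard limit)
        self.context_length = context_length

    def __call__(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        if self.context_length is None:
            # no fixed limit: pad only to the longest text in the batch
            return self.tokenizer(texts, return_tensors='pt', padding=True).input_ids
        return self.tokenizer(
            texts,
            return_tensors='pt',
            max_length=self.context_length,
            padding='max_length',
            truncation=True,
        ).input_ids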
