
[TBD] add a feature to continue training a tokenizer #879

Closed
SaulLu opened this issue Jan 14, 2022 · 3 comments

@SaulLu
Contributor

SaulLu commented Jan 14, 2022

I have seen this request more than once on transformers: many users would like to be able to continue training a tokenizer on a new dataset (see for example this issue).

The use case I've heard about several times is the following: a user wants to continue training an already pre-trained model, but notices that some characters that are very frequent in their dataset are unknown to the tokenizer. Since the user plans to retrain the model anyway, they can take the opportunity to add some tokens and enlarge the model's embedding matrix accordingly, i.e. add randomly initialized embeddings for the newly added tokens (thanks to the resize_token_embeddings method). Currently, the user can only add tokens with the add_tokens method, which means they have to identify by themselves which tokens are most relevant to add. Maybe it would instead be possible to continue training the tokenizer so that it picks up the most relevant tokens according to the tokenization algorithm used.
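For reference, the current workaround looks roughly like the sketch below (the checkpoint name and the token list are just placeholders):

```python
# Minimal sketch of the current workaround: manually chosen tokens are added to
# the tokenizer and the model's embedding matrix is enlarged to match.
# The checkpoint name and the new tokens below are placeholders.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# The user has to decide by themselves which tokens are worth adding.
new_tokens = ["domain_token_1", "domain_token_2"]
tokenizer.add_tokens(new_tokens)

# The new rows of the embedding matrix are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
```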

I imagine this feature will be very specific to each tokenization algorithm.

What do you think about it?

@Narsil
Collaborator

Narsil commented Jan 14, 2022

This issue #839 also seems to point in that direction.

  • For BPE it seems sound (a toy sketch of that case follows this list).
  • For Unigram it's debatable: since it relies on Expectation Maximisation, the algorithm starts with MANY potential tokens which get pruned progressively. To "fine-tune" such a tokenizer, we would need a way to "add back" some tokens without throwing away previous work, and the previously learned tokens that might not appear in the new data would have to be protected from eviction. I don't think there's any obvious way to do that.
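To make the BPE side concrete, here is a toy sketch in plain Python (not the tokenizers implementation, and ignoring word frequencies for brevity): the existing merge list is replayed unchanged on the new corpus, and additional merges are appended from pair counts on the new data.

```python
# Toy illustration of continuing BPE training: old merges are frozen and replayed,
# new merges are appended from the most frequent symbol pairs in the new corpus.
# This is a simplified sketch, not the tokenizers library's implementation.
from collections import Counter

def apply_merge(word, pair):
    """Merge every adjacent occurrence of `pair` inside a word (a tuple of symbols)."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def continue_bpe(words, existing_merges, num_new_merges):
    """`words` is a list of symbol tuples, e.g. [('l', 'o', 'w'), ('l', 'o', 'w', 'e', 'r')]."""
    merges = list(existing_merges)

    # Replay the previously learned merges on the new corpus.
    for pair in merges:
        words = [apply_merge(w, pair) for w in words]

    # Learn additional merges from the new data only; earlier merges are never revisited.
    for _ in range(num_new_merges):
        pair_counts = Counter(p for w in words for p in zip(w, w[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        words = [apply_merge(w, best) for w in words]

    return merges
```

Nothing comparable falls out for Unigram, since its vocabulary is obtained by pruning a large candidate set rather than by appending merges.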

@SaulLu
Contributor Author

SaulLu commented Jan 14, 2022

I did not put it in my initial message, but I had the same impression: I have a fairly clear idea of how to do it for BPE, and I don't see (yet?) a satisfactory solution for Unigram either. ☺️


github-actions bot commented Mar 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 4, 2024
@github-actions github-actions bot closed this as not planned Mar 10, 2024