
[TBD] add a feature to continue training a tokenizer #879

Closed
SaulLu opened this issue Jan 14, 2022 · 3 comments

@SaulLu
Contributor

SaulLu commented Jan 14, 2022

I have seen this request more than once on transformers: many users would like to be able to continue training a tokenizer on a new dataset (see for example this issue).

The use case I've heard about several times is the following: a user wants to continue training an already pre-trained model, but notices that some characters that are very frequent in their dataset are unknown to the tokenizer. Since the user plans to retrain the model anyway, they can take the opportunity to add some tokens and enlarge the model's embedding matrix accordingly, i.e. add randomly initialized embeddings for the newly added tokens (thanks to the resize_token_embeddings method). Currently, the user can only add tokens with the add_tokens method, which means they have to identify by themselves which tokens are most relevant to add. Maybe it would instead be possible to continue training the tokenizer so that it picks up the most relevant tokens according to the tokenization algorithm used.
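For reference, the current workaround looks roughly like the sketch below (the checkpoint name and the token list are just placeholders):

```python
# Minimal sketch of the current workaround: manually chosen tokens are added to
# the tokenizer and the model's embedding matrix is enlarged to match.
# The checkpoint name and the new tokens below are placeholders.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# The user has to decide by themselves which tokens are worth adding.
new_tokens = ["domain_token_1", "domain_token_2"]
tokenizer.add_tokens(new_tokens)

# The new rows of the embedding matrix are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
```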

I imagine this feature will be very specific to each tokenization algorithm.

What do you think about it?

@Narsil
Collaborator

Narsil commented Jan 14, 2022

This issue #839 also seems to point in that direction.

  • For BPE it seems sound (a toy sketch of that case follows this list).
  • For Unigram it's debatable: since it relies on Expectation Maximisation, the algorithm starts with MANY potential tokens which get pruned progressively. To "fine-tune" such a tokenizer, we would need a way to "add back" some tokens without throwing away previous work, and the previously learned tokens that might not appear in the new data would have to be protected from eviction. I don't think there's any obvious way to do that.
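To make the BPE side concrete, here is a toy sketch in plain Python (not the tokenizers implementation, and ignoring word frequencies for brevity): the existing merge list is replayed unchanged on the new corpus, and additional merges are appended from pair counts on the new data.

```python
# Toy illustration of continuing BPE training: old merges are frozen and replayed,
# new merges are appended from the most frequent symbol pairs in the new corpus.
# This is a simplified sketch, not the tokenizers library's implementation.
from collections import Counter

def apply_merge(word, pair):
    """Merge every adjacent occurrence of `pair` inside a word (a tuple of symbols)."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def continue_bpe(words, existing_merges, num_new_merges):
    """`words` is a list of symbol tuples, e.g. [('l', 'o', 'w'), ('l', 'o', 'w', 'e', 'r')]."""
    merges = list(existing_merges)

    # Replay the previously learned merges on the new corpus.
    for pair in merges:
        words = [apply_merge(w, pair) for w in words]

    # Learn additional merges from the new data only; earlier merges are never revisited.
    for _ in range(num_new_merges):
        pair_counts = Counter(p for w in words for p in zip(w, w[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        words = [apply_merge(w, best) for w in words]

    return merges
```

Nothing comparable falls out for Unigram, since its vocabulary is obtained by pruning a large candidate set rather than by appending merges.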

@SaulLu
Contributor Author

SaulLu commented Jan 14, 2022

I did not put it in my initial message, but I had the same impression: I have a fairly clear idea of how to do it for BPE, and I don't see (yet?) a satisfactory solution for Unigram either. ☺️


github-actions bot commented Mar 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 4, 2024
@github-actions github-actions bot closed this as not planned Mar 10, 2024