I have seen this request more than once on `transformers`: many users would like to be able to continue training a tokenizer on a new dataset (see for example this issue).
The use case I've heard about several times is this: a user wants to continue training an already pre-trained model, but notices that some characters that are very frequent in their dataset are unknown to the tokenizer. Since the user plans to retrain the model anyway, they can take the opportunity to add some tokens and enlarge the model's token embedding matrix accordingly, i.e. add randomly initialized embeddings for the newly added tokens (thanks to the `resize_token_embeddings` method). Currently, the user can only add tokens with the `add_tokens` method, which means they have to identify by themselves which tokens are most relevant to add. But maybe it would be possible to continue training the tokenizer so that it finds the most relevant tokens according to the tokenization algorithm used.
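For reference, the manual workaround available today looks roughly like this (a minimal sketch; the checkpoint name and the example tokens are just placeholders):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Any pre-trained checkpoint works the same way; "bert-base-uncased" is
# only an example here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Today the user has to spot the missing tokens themselves.
new_tokens = ["Δ", "θ"]  # placeholder examples of frequent but unknown symbols
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix; the rows for the new tokens are randomly
# initialized and will be learned during the continued training.
model.resize_token_embeddings(len(tokenizer))
```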
I imagine this feature will be very specific to each tokenization algorithm.
What do you think about it?
For Unigram it is harder: since it uses Expectation Maximisation (EM), the algorithm starts with MANY potential tokens which get removed progressively. For "fine-tuning" the tokenizer, we would need to find a way to "add back" some tokens without undoing previous work: the previously found tokens, which might not be seen in the new data, would need to be prevented from being evicted. I don't think there's any obvious way to do that.
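To make the direction of the algorithm concrete, here is a deliberately simplified sketch of the pruning loop (the scores and the pruning criterion are toy stand-ins, not the real `tokenizers` trainer, which drops the tokens whose removal least degrades the corpus likelihood and re-runs EM between rounds):

```python
import math

def prune_vocab(vocab_scores, target_size, shrink_factor=0.75):
    """Toy stand-in for Unigram's pruning step.

    vocab_scores maps token -> log-probability (as estimated by EM).
    Here we simply drop the lowest-scoring tokens each round.
    """
    vocab = dict(vocab_scores)
    while len(vocab) > target_size:
        n_keep = max(target_size, int(len(vocab) * shrink_factor))
        best = sorted(vocab, key=vocab.get, reverse=True)[:n_keep]
        vocab = {tok: vocab[tok] for tok in best}
    return vocab

# A pruned token is gone for good: nothing in this loop can re-admit a
# token that was useful for the *old* corpus but is rare in the *new*
# one, which is exactly the problem with "continuing" Unigram training.
seed = {"▁the": math.log(0.05), "▁th": math.log(0.01), "e": math.log(0.04)}
print(prune_vocab(seed, target_size=2))
```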
I did not put it in my initial message, but I had the same impression: I think I have a good idea of how to do it for BPE, but I do not (yet?) see a satisfactory solution for Unigram either. ☺️
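For BPE the mechanics are at least easy to picture: the learned merges form an ordered list, so "continuing" training could mean applying the existing merges to the new corpus and then learning additional merges on top. A rough sketch of that idea (not an existing `tokenizers` API, just the plain BPE training loop seeded with prior merges):

```python
from collections import Counter

def extend_bpe(word_freqs, merges, n_new_merges):
    """Learn `n_new_merges` extra merges from a new corpus, appended
    after the existing ordered `merges` list.

    word_freqs: dict mapping word -> frequency in the new corpus.
    merges:     ordered list of (left, right) pairs from the old tokenizer.
    (End-of-word markers and pre-tokenization details are omitted.)
    """
    def merge_once(symbols, pair):
        # Replace every non-overlapping occurrence of `pair` in `symbols`.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return tuple(out)

    # Start each word from characters, then apply the old merges in order,
    # so new merges are learned on top of the existing segmentation.
    corpus = Counter()
    for word, freq in word_freqs.items():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_once(symbols, pair)
        corpus[symbols] += freq

    for _ in range(n_new_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges = merges + [best]
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_once(symbols, best)] += freq
        corpus = new_corpus
    return merges
```

Note that the delicate part is not the merge list but the vocabulary: each new merge creates a new token, which would have to be assigned an id after all existing ones so that the model's old embedding rows stay aligned with the old tokens.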