Ability to re-train a Tokenizer with relevant parameters #525
Comments
Just want to check that I'm facing the same problem:

```rust
use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, Model, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    // Build a fresh BPE tokenizer.
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );

    // First training pass, with explicitly configured special tokens.
    let mut trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
            AddedToken::from("[PAD]", true),
            AddedToken::from("[MASK]", true),
        ])
        .build();
    tokenizer.with_pre_tokenizer(Whitespace::default());

    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: re-train with a trainer recovered from the model.
    let mut new_trainer = tokenizer.get_model().get_trainer();
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;
    Ok(())
}
```

This is basically the code that is crashing. What would be the right strategy?
This issue is more a feature request than a bug.
Current state
When we want to train a `Tokenizer`, we need to give it a `Trainer` initialized with a set of custom parameters:
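For instance, with the Rust `BpeTrainer` builder, something along these lines (the exact values here are only illustrative):

```rust
use tokenizers::models::bpe::BpeTrainer;
use tokenizers::AddedToken;

fn main() {
    // A trainer carrying custom parameters chosen by the user.
    let _trainer = BpeTrainer::builder()
        .vocab_size(30_000)
        .min_frequency(2)
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
        ])
        .build();
}
```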

Goal
Add the ability to re-train a `Tokenizer` using the same custom parameters that were used for training the first time. This would allow users to re-train some pre-trained tokenizers provided by the community with their own dataset. We'd be able to do this:
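Something like the following sketch, where the file paths are placeholders; today this compiles, but `get_trainer` comes back with default parameters rather than the original ones:

```rust
use tokenizers::{Model, Tokenizer};

fn main() -> Result<(), tokenizers::Error> {
    // Load a pre-trained tokenizer shared by the community.
    let mut tokenizer = Tokenizer::from_file("pretrained/tokenizer.json")?;

    // Desired behavior: recover a trainer carrying the exact parameters
    // (special tokens, vocab_size, ...) used for the original training.
    let mut trainer = tokenizer.get_model().get_trainer();

    // Re-train on our own data with those same parameters.
    tokenizer.train_from_files(&mut trainer, vec!["my-dataset.txt".into()])?;
    tokenizer.save("retrained-tokenizer.json", false)?;
    Ok(())
}
```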

and expect to get a `Tokenizer` very similar to the one we originally loaded (same `special_tokens`, `vocab_size`, ...), with a brand new vocabulary.

How
One of the ways to achieve this is to make the `Trainer` save its training params on the `Model` during training, thus allowing `Model::get_trainer` to return a `Trainer` instantiated as expected. All of this should be added to the serialization process.
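As a very rough sketch of that idea, using hypothetical names that are not part of the crate today:

```rust
// Hypothetical sketch only; none of these types exist in the crate.
#[derive(Clone)]
struct TrainerParams {
    vocab_size: usize,
    min_frequency: u32,
    special_tokens: Vec<String>,
}

struct BpeModel {
    // ...vocab, merges, and the other existing fields elided...
    // Written by the trainer during training and included in the model's
    // serialization, so it survives save/load round-trips.
    trainer_params: Option<TrainerParams>,
}

impl BpeModel {
    // A `get_trainer` built on top of this could return a trainer
    // configured exactly like the original one, instead of a default.
    fn get_trainer_params(&self) -> TrainerParams {
        self.trainer_params.clone().unwrap_or(TrainerParams {
            // Fall back to the trainer's usual defaults.
            vocab_size: 30_000,
            min_frequency: 0,
            special_tokens: Vec::new(),
        })
    }
}
```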

Considerations
If some tokens were added with `add_tokens` or `add_special_tokens`, re-training is not currently supported, because `AddedVocabulary` adds tokens on top of an existing vocabulary and expects it to never change (cf. #523). Also depends on #527.