Ability to re-train a Tokenizer with relevant parameters #525

Closed
n1t0 opened this issue Nov 13, 2020 · 3 comments
Labels: enhancement (New feature or request), Stale

Comments

n1t0 (Member) commented Nov 13, 2020

Current state

When we want to train a Tokenizer, we need to provide a Trainer initialized with a set of custom parameters:

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# We need to provide the relevant parameters, to avoid using the general defaults
trainer = BpeTrainer(vocab_size=30000, special_tokens=[...], initial_alphabet=[...], ...)
tokenizer.train(files=[...], trainer=trainer)

Goal

Add the ability to re-train a Tokenizer using the same custom parameters that were used the first time it was trained.
This would allow users to re-train pre-trained tokenizers provided by the community on their own dataset. We'd be able to do this:

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tokenizer.train(files=[ ... ])

and expect to get a Tokenizer very similar to the one we originally loaded (same special_tokens, vocab_size, ...), with a brand new vocabulary.

How

One way to achieve this is to make the Trainer save its training parameters on the Model during training, thus allowing Model::get_trainer to return a Trainer instantiated as expected. These parameters would also need to be covered by the serialization process.
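
Below is a rough, standalone Rust sketch of that idea, using hypothetical TrainerConfig, Trainer and Model types (this is not the actual tokenizers API): the model remembers the configuration its trainer was built with, so get_trainer can later rebuild an equivalent trainer instead of a default one.

// Hypothetical types for illustration only; not the tokenizers crate API.
#[derive(Clone)]
struct TrainerConfig {
    vocab_size: usize,
    special_tokens: Vec<String>,
}

struct Trainer {
    config: TrainerConfig,
}

#[derive(Default)]
struct Model {
    vocab: Vec<String>,
    // Stored next to the vocabulary (and included in serialization), so a
    // later re-training can reuse the same parameters.
    trainer_config: Option<TrainerConfig>,
}

impl Model {
    fn train(&mut self, trainer: &Trainer, _files: &[&str]) {
        // ... learn a new vocabulary from the files ...
        self.vocab = vec!["<learned>".to_string()];
        // Remember how this model was trained.
        self.trainer_config = Some(trainer.config.clone());
    }

    // Instead of returning a default trainer, rebuild one from the stored config.
    fn get_trainer(&self) -> Trainer {
        Trainer {
            config: self.trainer_config.clone().unwrap_or(TrainerConfig {
                vocab_size: 30_000,
                special_tokens: vec![],
            }),
        }
    }
}

fn main() {
    let mut model = Model::default();
    let trainer = Trainer {
        config: TrainerConfig {
            vocab_size: 30_000,
            special_tokens: vec!["[UNK]".into(), "[CLS]".into()],
        },
    };
    model.train(&trainer, &["wiki.train.raw"]);
    assert!(!model.vocab.is_empty());

    // Re-training now picks up the original parameters automatically.
    let new_trainer = model.get_trainer();
    assert_eq!(new_trainer.config.vocab_size, 30_000);
    assert_eq!(new_trainer.config.special_tokens.len(), 2);
}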

Considerations

If some tokens were added with add_tokens or add_special_tokens, re-training is not currently supported, because AddedVocabulary adds tokens on top of an existing vocabulary and expects it to never change (cf. #523).
This also depends on #527.

n1t0 added the enhancement label on Nov 13, 2020
n1t0 mentioned this issue on Nov 13, 2020

Virviil commented Jun 6, 2023

Just want to check that I'm facing the same problem:

use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, Model, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );

    let mut trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
            AddedToken::from("[PAD]", true),
            AddedToken::from("[MASK]", true),
        ])
        .build();
    tokenizer.with_pre_tokenizer(Whitespace::default());
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: re-train with the trainer returned by the model, expecting it
    // to carry the same parameters as the original one
    let mut new_trainer = tokenizer.get_model().get_trainer();
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    Ok(())
}

This is basically the code that crashes with:

thread 'main' panicked at 'Missing additional token', /.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.13.3/src/tokenizer/added_vocabulary.rs:293:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:147:5
   3: core::panicking::panic_str
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:131:5
   4: core::option::expect_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:2045:5
   5: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
   6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold
   7: tokenizers::tokenizer::added_vocabulary::AddedVocabulary::add_special_tokens
   8: tokenizers::utils::iter::ResultShunt<I,E>::process
   9: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train_from_files
  10: toktest::main

get_trainer does not return a valid trainer but a default one, which is missing some tokens.

What would be the right strategy?

ArthurZucker (Collaborator) commented

This issue is more of a feature request than a problem.
You are doing something wrong, as the error indicates: pretty sure the special tokens are missing from the tokenizer while they are only added to the trainer builder. Yes, the feature presented here would help you!
Do you want to have a go at it? 🤗
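
For reference, here is a minimal sketch of that workaround, assuming the same tokenizers 0.13 setup as the snippet earlier in this thread: it simply rebuilds the second trainer with the same special tokens instead of relying on get_trainer, whose result does not carry them. The special tokens should then end up in the re-trained vocabulary again, so AddedVocabulary can still resolve them.

use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );
    tokenizer.with_pre_tokenizer(Whitespace::default());

    // The same parameters are reused for every training run.
    let special_tokens = vec![
        AddedToken::from("[UNK]", true),
        AddedToken::from("[CLS]", true),
        AddedToken::from("[SEP]", true),
        AddedToken::from("[PAD]", true),
        AddedToken::from("[MASK]", true),
    ];
    let files: Vec<String> = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];

    // First wave.
    let mut trainer = BpeTrainer::builder()
        .special_tokens(special_tokens.clone())
        .build();
    tokenizer.train_from_files(&mut trainer, files.clone())?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: rebuild the trainer with the same special tokens instead of
    // calling tokenizer.get_model().get_trainer().
    let mut new_trainer = BpeTrainer::builder()
        .special_tokens(special_tokens)
        .build();
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    Ok(())
}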

github-actions bot commented May 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 4, 2024
github-actions bot closed this as not planned on May 9, 2024