Ability to re-train a Tokenizer with relevant parameters #525

Closed
n1t0 opened this issue Nov 13, 2020 · 3 comments
Labels: enhancement (New feature or request), Stale

Comments

n1t0 (Member) commented Nov 13, 2020

Current state

When we want to train a Tokenizer, we need to provide a Trainer initialized with a set of custom parameters:

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

# We need to provide the relevant parameters, to avoid using the general defaults
trainer = BpeTrainer(vocab_size=30000, special_tokens=[...], initial_alphabet=[...], ...)
tokenizer.train(files=[...], trainer=trainer)

Goal

Add the ability to re-train a Tokenizer using the same custom parameters that were used the first time it was trained.
This would allow users to re-train pre-trained tokenizers provided by the community on their own dataset. We'd be able to do this:

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
tokenizer.train(files=[ ... ])

and expect to get a Tokenizer very similar to the one we originally loaded (same special_tokens, vocab_size, ...), with a brand new vocabulary.

How

One way to achieve this is to make the Trainer save its training parameters on the Model during training, thus allowing Model::get_trainer to return a Trainer instantiated as expected. These parameters would also need to be covered by the serialization process.
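
Below is a rough, standalone Rust sketch of that idea, using hypothetical TrainerConfig, Trainer and Model types (this is not the actual tokenizers API): the model remembers the configuration its trainer was built with, so get_trainer can later rebuild an equivalent trainer instead of a default one.

// Hypothetical types for illustration only; not the tokenizers crate API.
#[derive(Clone)]
struct TrainerConfig {
    vocab_size: usize,
    special_tokens: Vec<String>,
}

struct Trainer {
    config: TrainerConfig,
}

#[derive(Default)]
struct Model {
    vocab: Vec<String>,
    // Stored next to the vocabulary (and included in serialization), so a
    // later re-training can reuse the same parameters.
    trainer_config: Option<TrainerConfig>,
}

impl Model {
    fn train(&mut self, trainer: &Trainer, _files: &[&str]) {
        // ... learn a new vocabulary from the files ...
        self.vocab = vec!["<learned>".to_string()];
        // Remember how this model was trained.
        self.trainer_config = Some(trainer.config.clone());
    }

    // Instead of returning a default trainer, rebuild one from the stored config.
    fn get_trainer(&self) -> Trainer {
        Trainer {
            config: self.trainer_config.clone().unwrap_or(TrainerConfig {
                vocab_size: 30_000,
                special_tokens: vec![],
            }),
        }
    }
}

fn main() {
    let mut model = Model::default();
    let trainer = Trainer {
        config: TrainerConfig {
            vocab_size: 30_000,
            special_tokens: vec!["[UNK]".into(), "[CLS]".into()],
        },
    };
    model.train(&trainer, &["wiki.train.raw"]);
    assert!(!model.vocab.is_empty());

    // Re-training now picks up the original parameters automatically.
    let new_trainer = model.get_trainer();
    assert_eq!(new_trainer.config.vocab_size, 30_000);
    assert_eq!(new_trainer.config.special_tokens.len(), 2);
}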

Considerations

If some tokens were added with add_tokens or add_special_tokens, re-training is not currently supported, because AddedVocabulary adds tokens on top of an existing vocabulary and expects it to never change (cf. #523).
This also depends on #527.

n1t0 added the enhancement label on Nov 13, 2020
n1t0 mentioned this issue on Nov 13, 2020

Virviil commented Jun 6, 2023

Just want to check that I'm facing the same problem:

use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, Model, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );

    let mut trainer = BpeTrainer::builder()
        .special_tokens(vec![
            AddedToken::from("[UNK]", true),
            AddedToken::from("[CLS]", true),
            AddedToken::from("[SEP]", true),
            AddedToken::from("[PAD]", true),
            AddedToken::from("[MASK]", true),
        ])
        .build();
    tokenizer.with_pre_tokenizer(Whitespace::default());
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: re-train with the trainer returned by the model, expecting it
    // to carry the same parameters as the original one
    let mut new_trainer = tokenizer.get_model().get_trainer();
    let files = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    Ok(())
}

This is basically the code that crashes with:

thread 'main' panicked at 'Missing additional token', /.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.13.3/src/tokenizer/added_vocabulary.rs:293:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_display
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:147:5
   3: core::panicking::panic_str
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:131:5
   4: core::option::expect_failed
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/option.rs:2045:5
   5: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
   6: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold
   7: tokenizers::tokenizer::added_vocabulary::AddedVocabulary::add_special_tokens
   8: tokenizers::utils::iter::ResultShunt<I,E>::process
   9: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train_from_files
  10: toktest::main

get_trainer does not return a valid trainer but a default one, which is missing some tokens.

What would be the right strategy?

ArthurZucker (Collaborator) commented

This issue is more of a feature request than a problem.
You are doing something wrong, as the error indicates: pretty sure the special tokens are missing from the tokenizer while they are only added to the trainer builder. Yes, the feature presented here would help you!
Do you want to have a go at it? 🤗
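
For reference, here is a minimal sketch of that workaround, assuming the same tokenizers 0.13 setup as the snippet earlier in this thread: it simply rebuilds the second trainer with the same special tokens instead of relying on get_trainer, whose result does not carry them. The special tokens should then end up in the re-trained vocabulary again, so AddedVocabulary can still resolve them.

use tokenizers::{
    models::bpe::{BpeTrainer, BPE},
    pre_tokenizers::whitespace::Whitespace,
    AddedToken, DecoderWrapper, NormalizerWrapper, PostProcessorWrapper,
    PreTokenizerWrapper, TokenizerImpl,
};

fn main() -> Result<(), tokenizers::Error> {
    let mut tokenizer: TokenizerImpl<
        BPE,
        NormalizerWrapper,
        PreTokenizerWrapper,
        PostProcessorWrapper,
        DecoderWrapper,
    > = TokenizerImpl::new(
        BPE::builder()
            .unk_token("[UNK]".to_string())
            .build()
            .unwrap(),
    );
    tokenizer.with_pre_tokenizer(Whitespace::default());

    // The same parameters are reused for every training run.
    let special_tokens = vec![
        AddedToken::from("[UNK]", true),
        AddedToken::from("[CLS]", true),
        AddedToken::from("[SEP]", true),
        AddedToken::from("[PAD]", true),
        AddedToken::from("[MASK]", true),
    ];
    let files: Vec<String> = vec![
        "wikitext-103-raw/wiki.train.raw".into(),
        "wikitext-103-raw/wiki.test.raw".into(),
        "wikitext-103-raw/wiki.valid.raw".into(),
    ];

    // First wave.
    let mut trainer = BpeTrainer::builder()
        .special_tokens(special_tokens.clone())
        .build();
    tokenizer.train_from_files(&mut trainer, files.clone())?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    // Next wave: rebuild the trainer with the same special tokens instead of
    // calling tokenizer.get_model().get_trainer().
    let mut new_trainer = BpeTrainer::builder()
        .special_tokens(special_tokens)
        .build();
    tokenizer.train_from_files(&mut new_trainer, files)?;
    tokenizer.save("tokenizer-wiki.json", false)?;

    Ok(())
}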

github-actions bot commented May 4, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on May 4, 2024
github-actions bot closed this as not planned on May 9, 2024