Trainer improvements #519
Merged: 11 commits merged on Nov 20, 2020

Conversation

@n1t0 (Member) commented Nov 10, 2020

  • Add a WordLevelTrainer (Fix #265: Add the WordLevelTrainer)
  • Add a default impl for Trainer::process_tokens
  • Each Model can now return its associated trainer with get_trainer. This removes the need to provide a trainer in Python (it could be made optional in Rust too). (Fix #527: Make the trainer optional when training)
  • The Model on the Tokenizer now gets trained in-place. This allows keeping its customization (dropout, unk_token, ...) while just training it. That's also more in line with the role of the trainer, which is to train a Model, not to build one. See the sketch after this list. (Fix #201: BPE dropout not working as expected & Fix #526: Trainer trains the Model in-place)
  • This makes us wrap the Model in a RwLock on the Python side. This is something that we'll have to do anyway very soon, to provide the ability to customize the components' properties in Python.
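A minimal sketch of the intended usage after this change (argument names follow the examples later in this thread; the exact signatures may still move before merge, and "data.txt" is just a placeholder file):

# Sketch only: the trainer becomes optional and the existing Model is trained
# in place, so its customization (dropout, unk_token, ...) is kept.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))

# No trainer given: the Model provides its own default trainer via get_trainer.
tokenizer.train(files=["data.txt"])

# Explicit trainer for custom parameters; the same Model is trained in place.
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data.txt"], trainer=trainer)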

Missing before merge:

  • Node bindings
  • Update parts of the docs where we reload the Model after training
  • Update parts of the docs where Tokenizer.train is used

@Narsil (Collaborator) left a comment

I really like the inversion of control!
Mostly it seems fine, but we should probably have a discussion about the available syntaxes for training if we're changing it.

Options:

  1. tokenizer.train(trainer, files) (current)
  2. tokenizer.train(files, trainer=trainer) or tokenizer.train(files) (proposed)
  3. tokenizer.train(files, vocab_size=X, ...)
  4. trainer.train(tokenizer, files)

I like 2. because it can completely drop the concept of a trainer in the default case, but I think it makes the case where you need one a bit awkward. 3. fixes that, but then knowing the exact list of options is going to be tricky (as it depends on tokenizer.model now).

Option 4. does not remove the concept of a trainer, and the inversion of flow makes it closer to the Rust version. But it doesn't feel as simple as it could be.

Right now, I'm leaning towards 4., which would get called automatically by a syntax like 2. in the default case, and we would drop the non-default case. So:

tokenizer.train(files) works, and is equivalent to trainer = tokenizer.default_trainer(); trainer.train(tokenizer, files).
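A purely hypothetical sketch of that equivalence (default_trainer and the trainer.train(tokenizer, files) signature are proposed names from this comment, not the existing API):

# Hypothetical only: the simple syntax (option 2) delegates to a
# trainer-driven call (option 4) under the hood.
class Tokenizer:
    def __init__(self, model):
        self.model = model

    def default_trainer(self):
        # Would return the trainer matching self.model
        # (e.g. a BpeTrainer for a BPE model).
        raise NotImplementedError

    def train(self, files, trainer=None):
        # tokenizer.train(files) picks a default trainer, then the trainer
        # does the actual work, as in option 4.
        trainer = trainer if trainer is not None else self.default_trainer()
        trainer.train(self, files)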

use tempfile::NamedTempFile;
use tk::normalizers::{Lowercase, NFKC};

#[test]
fn serialize() {
- let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(
+ let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(RwLock::new(
@Narsil (Collaborator):

Could you explain why all those RwLocks are needed? I'm not sure why we would need them.

@n1t0 (Member, Author):

Sure, we need them because anything inside an Arc is immutable. These RwLocks provide us with a way to actually mutate the Model here.
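An illustrative sketch of that point (not the actual PyModel code from this PR): a value behind an Arc is shared and therefore immutable, so mutating it requires interior mutability such as a RwLock.

// Illustrative only: Arc<Model> alone cannot be mutated, Arc<RwLock<Model>> can.
use std::sync::{Arc, RwLock};

struct Model {
    vocab_size: usize,
}

fn main() {
    let model = Arc::new(RwLock::new(Model { vocab_size: 0 }));
    let shared = Arc::clone(&model);

    // Reading takes a shared lock.
    println!("before: {}", shared.read().unwrap().vocab_size);

    // Training can take a write lock and mutate the Model in place,
    // which a plain Arc<Model> would not allow.
    model.write().unwrap().vocab_size = 32000;
    println!("after: {}", shared.read().unwrap().vocab_size);
}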

@@ -115,4 +115,4 @@ def train(
)
if isinstance(files, str):
files = [files]
- self._tokenizer.train(trainer, files)
+ self._tokenizer.train(files, trainer=trainer)
@Narsil (Collaborator):

As long as we're breaking the signature, I would argue for a different signature like

train(files, options=bpe_train_options), or train(files, vocab_size=X, ...). What do you think?

I like the second version better; the only trouble is that the exact description of those options is going to get fuzzy pretty fast, and error handling a bit hard. But it "feels" more Pythonic, what do you think?

Either that, or if we keep the trainer concept, we should stick to something closer to Rust, with trainer.train(tokenizer, files). I actually like that last version better at this moment; the control flow feels more natural.

fn tokenize(&self, tokens: &str) -> tk::Result<Vec<Token>> {
- self.model.tokenize(tokens)
+ self.model.read().unwrap().tokenize(tokens)
@Narsil (Collaborator):

I'm always a bit scared by adding that many unwraps everywhere... Do you think there's a way we could avoid them?

@n1t0 (Member, Author):

The unwrap here is for the std::sync::LockResult that is returned by the RwLock when you try to access its content. The Err case happens when a thread that was holding the lock panicked. Since this shouldn't happen, and we don't want to recover from it, I think the unwrap should be ok.
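A small illustration of the two options being weighed here (not code from this PR):

// RwLock::read returns a std::sync::LockResult; the Err case only occurs if
// another thread panicked while holding the lock (the lock is "poisoned").
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(vec![1, 2, 3]);

    // What the PR does: unwrap, since poisoning means another thread
    // panicked mid-operation and we don't want to recover from that.
    let len = lock.read().unwrap().len();

    // The explicit alternative, if recovering from poisoning were desired.
    let len_recovered = match lock.read() {
        Ok(guard) => guard.len(),
        Err(poisoned) => poisoned.into_inner().len(),
    };

    println!("{} {}", len, len_recovered);
}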

@n1t0 (Member, Author) commented Nov 12, 2020

Sure! Happy to discuss the possible options.

With 2. tokenizer.train(files, trainer=trainer) & tokenizer.train(files), the expected usage is:

# Default training, for example when you just want to train a pre-trained tokenizer on your own dataset
tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Not available as we speak, but soon
tokenizer.train(files=[ ... ])

# Same case but with some custom parameters
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
trainer = tokenizer.model.get_trainer()
print(trainer) # inspect the current training params
trainer.vocab_size = 32000
trainer.special_tokens = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], trainer=trainer)

# Or building your own
trainer = BpeTrainer(...)
tokenizer.train(files=[ ... ], trainer=trainer)

I think we should avoid 3. tokenizer.train(files, vocab_size=X, ...). It looks good at first sight, but it would be pretty impossible to document accurately.

I'm not entirely sure about what you have in mind with 4. trainer.train(tokenizer, files). Can you provide some examples of the various usages in Python (and maybe the expected signatures in Rust too)?

@Jess0-0 commented Nov 14, 2020

Hi,

I'm using the WordLevelTrainer in this branch. I'm able to import the package and initialize a trainer. However, when I try to use it like this:

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(trainer, ["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"])

it generates this error:

TypeError: Can't convert <WordLevelTrainer object at 0x00000194EC6DBD90> to Sequence

Wondering if there is any way to fix this? Thanks!

@n1t0 (Member, Author) commented Nov 14, 2020

In this PR we make the trainer optional when calling train, so you need to provide the files first:

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)

@Narsil (Collaborator) commented Nov 16, 2020

> Sure! Happy to discuss the possible options.
>
> With 2. tokenizer.train(files, trainer=trainer) & tokenizer.train(files), the expected usage is:
>
> # Default training, for example when you just want to train a pre-trained tokenizer on your own dataset
> tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Not available as we speak, but soon
> tokenizer.train(files=[ ... ])
>
> # Same case but with some custom parameters
> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
> trainer = tokenizer.model.get_trainer()
> print(trainer) # inspect the current training params
> trainer.vocab_size = 32000
> trainer.special_tokens = [ "[UNK]", "[PAD]" ]
> tokenizer.train(files=[ ... ], trainer=trainer)

I think this shows there's a problem with the naming of things. This example is using the trainer as glorified options (or params). It's even mentioned in the comment. The user seems to be passing pure arguments, but the name trainer suggests that the object does something, namely that it should train, which feels odd. That's the main reason why I like trainer.train(tokenizer, files) better: the trainer does train, as it's supposed to.
The other way to think about it is, if we want to stick to 2., we should rename everything to something more passive like options or params. That choice should also be reflected in the Rust API.

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
train_options = tokenizer.model.get_train_options()
print(train_options) # it's actually a raw Python dict
train_options["vocab_size"] = 32000
train_options["special_tokens"] = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], train_options=train_options)
# or 
tokenizer.train(files=[...], **train_options)

Small nit:

train_options = tokenizer.model.get_train_options()
tokenizer.train(..., train_options)

feels strange, why not

train_options = tokenizer.get_train_options() # No need to call the underlying model.
tokenizer.train(..., train_options)

The main issue I have with that API is for training a tokenizer from scratch:

tokenizer = Tokenizer(BPE())
train_options = tokenizer.get_train_options()

should not work, as there's no way this code can know what vocabulary size I want. This code should raise an exception IMO, saying the user should define the vocab_size.
I'm unsure what the best way to fix it is.

  • Tokenizer(BPE(vocab_size=X)) seems the most reasonable. You should get an error when trying to encode, which would be the case anyway (the unk token is not defined).
  • tokenizer.get_train_options(vocab_size=XX) feels too odd (why does it need to be defined sometimes and not others, and so on).

A BpeTrainer with a Unigram model does not make sense; we should avoid enabling that mixup at all costs.
We also probably don't want to make choices in the user's stead where it's not necessary (for instance, vocab_size is a core argument; we should never choose one by default if it's not already available).

@n1t0 force-pushed the trainer-improvements branch 3 times, most recently from b818d10 to f1a9742 on November 20, 2020 at 01:16
@n1t0 (Member, Author) commented Nov 20, 2020

> I think this shows there's a problem with the naming of things. This example is using the trainer as glorified options (or params). It's even mentioned in the comment. The user seems to be passing pure arguments, but the name trainer suggests that the object does something, namely that it should train, which feels odd. That's the main reason why I like trainer.train(tokenizer, files) better: the trainer does train, as it's supposed to.
> The other way to think about it is, if we want to stick to 2., we should rename everything to something more passive like options or params. That choice should also be reflected in the Rust API.

I think you might have the wrong idea about the way things currently work. The trainer actually trains: it trains a Model, and it is not some sort of glorified params. We haven't exposed its methods yet, because we are still changing them and it is not the main path, so there's no rush on this side, but we will. It could actually even be used to train multiple Models while only ingesting the data once, later on. None of this would be possible with a dict.
Also, I don't think the Trainer should somehow have a dependency on the Tokenizer. It trains a Model given some input text, and it should stay that way. The fact that the Tokenizer pre-processes this text before feeding it to the Trainer is, and should stay, the responsibility of the Tokenizer.

> A BpeTrainer with a Unigram model does not make sense; we should avoid enabling that mixup at all costs.

Agreed, and we don't allow it. It makes me realize that I treated this path as unreachable though, so I will need to change it to raise instead.
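A hedged illustration of the mismatch being discussed (the exact exception type, and the example file, are assumptions, not behavior confirmed in this thread):

# Sketch: pairing a BpeTrainer with a Unigram model should fail with an
# explicit error rather than hit an unreachable code path.
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(Unigram())
trainer = BpeTrainer(vocab_size=1000)

try:
    tokenizer.train(["data.txt"], trainer)  # mismatched trainer/model
except Exception as err:
    print(f"rejected as expected: {err}")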
