Trainer improvements #519
Merged: 11 commits merged on Nov 20, 2020

Conversation

@n1t0 (Member) commented Nov 10, 2020

  • Add a WordLevelTrainer (Fix #265: Add the WordLevelTrainer)
  • Add a default impl for Trainer::process_tokens
  • Each Model can now return its associated trainer with get_trainer. This removes the need to provide a trainer in Python (it could be made optional in Rust too). (Fix #527: Make the trainer optional when training)
  • The Model on the Tokenizer now gets trained in-place. This allows keeping its customization (dropout, unk_token, ...) while just training it. That's also more in line with the role of the trainer, which is to train a Model, not to build one. See the sketch after this list. (Fix #201: BPE dropout not working as expected & Fix #526: Trainer trains the Model in-place)
  • This makes us wrap the Model in a RwLock on the Python side. This is something that we'll have to do anyway very soon, to provide the ability to customize the components' properties in Python.
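A minimal sketch of the intended usage after this change (argument names follow the examples later in this thread; the exact signatures may still move before merge, and "data.txt" is just a placeholder file):

# Sketch only: the trainer becomes optional and the existing Model is trained
# in place, so its customization (dropout, unk_token, ...) is kept.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))

# No trainer given: the Model provides its own default trainer via get_trainer.
tokenizer.train(files=["data.txt"])

# Explicit trainer for custom parameters; the same Model is trained in place.
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["data.txt"], trainer=trainer)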

Missing before merge:

  • Node bindings
  • Update parts of the docs where we reload the Model after training
  • Update parts of the docs where Tokenizer.train is used

@Narsil (Collaborator) left a comment

I really like the inversion of control!
Mostly it seems fine, but we should probably have a discussion about the available syntaxes for training if we're changing it.

Options:

  1. tokenizer.train(trainer, files) (current)
  2. tokenizer.train(files, trainer=trainer) or tokenizer.train(files) (proposed)
  3. tokenizer.train(files, vocab_size=X, ...)
  4. trainer.train(tokenizer, files)

I like 2. because it can completely drop the concept of a trainer in the default case, but I think it makes the case where you need one a bit awkward. 3. fixes that, but then knowing the exact list of options is going to be tricky (as it depends on tokenizer.model now).

Option 4. does not remove the concept of a trainer, and the inversion of flow makes it closer to the Rust version. But it doesn't feel as simple as it could be.

Right now, I'm leaning towards 4., which would get called automatically by a syntax like 2. in the default case, and we would drop the non-default case. So:

tokenizer.train(files) works, and is equivalent to trainer = tokenizer.default_trainer(); trainer.train(tokenizer, files).
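A purely hypothetical sketch of that equivalence (default_trainer and the trainer.train(tokenizer, files) signature are proposed names from this comment, not the existing API):

# Hypothetical only: the simple syntax (option 2) delegates to a
# trainer-driven call (option 4) under the hood.
class Tokenizer:
    def __init__(self, model):
        self.model = model

    def default_trainer(self):
        # Would return the trainer matching self.model
        # (e.g. a BpeTrainer for a BPE model).
        raise NotImplementedError

    def train(self, files, trainer=None):
        # tokenizer.train(files) picks a default trainer, then the trainer
        # does the actual work, as in option 4.
        trainer = trainer if trainer is not None else self.default_trainer()
        trainer.train(self, files)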

use tempfile::NamedTempFile;
use tk::normalizers::{Lowercase, NFKC};

#[test]
fn serialize() {
- let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(
+ let mut tokenizer = Tokenizer::new(PyModel::new(Arc::new(RwLock::new(
@Narsil (Collaborator):

Could you explain why all those RwLocks are needed? I'm not sure why we would need them.

@n1t0 (Member, Author):

Sure, we need them because anything inside an Arc is immutable. These RwLocks provide us with a way to actually mutate the Model here.
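An illustrative sketch of that point (not the actual PyModel code from this PR): a value behind an Arc is shared and therefore immutable, so mutating it requires interior mutability such as a RwLock.

// Illustrative only: Arc<Model> alone cannot be mutated, Arc<RwLock<Model>> can.
use std::sync::{Arc, RwLock};

struct Model {
    vocab_size: usize,
}

fn main() {
    let model = Arc::new(RwLock::new(Model { vocab_size: 0 }));
    let shared = Arc::clone(&model);

    // Reading takes a shared lock.
    println!("before: {}", shared.read().unwrap().vocab_size);

    // Training can take a write lock and mutate the Model in place,
    // which a plain Arc<Model> would not allow.
    model.write().unwrap().vocab_size = 32000;
    println!("after: {}", shared.read().unwrap().vocab_size);
}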

@@ -115,4 +115,4 @@ def train(
)
if isinstance(files, str):
files = [files]
- self._tokenizer.train(trainer, files)
+ self._tokenizer.train(files, trainer=trainer)
@Narsil (Collaborator):

As long as we're breaking the signature, I would argue for a different signature like

train(files, options=bpe_train_options), or train(files, vocab_size=X, ...). What do you think?

I like the second version better; the only trouble is that the exact description of those options is going to get fuzzy pretty fast, and error handling a bit hard. But it "feels" more Pythonic, what do you think?

Either that, or if we keep the trainer concept, we should stick to something closer to Rust, with trainer.train(tokenizer, files). I actually like that last version better at this moment; the control flow feels more natural.

fn tokenize(&self, tokens: &str) -> tk::Result<Vec<Token>> {
- self.model.tokenize(tokens)
+ self.model.read().unwrap().tokenize(tokens)
@Narsil (Collaborator):

I'm always a bit scared by adding that many unwraps everywhere... Do you think there's a way we could avoid them?

@n1t0 (Member, Author):

The unwrap here is for the std::sync::LockResult that is returned by the RwLock when you try to access its content. The Err case happens when a thread that was holding the lock panicked. Since this shouldn't happen, and we don't want to recover from it, I think the unwrap should be ok.
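A small illustration of the two options being weighed here (not code from this PR):

// RwLock::read returns a std::sync::LockResult; the Err case only occurs if
// another thread panicked while holding the lock (the lock is "poisoned").
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(vec![1, 2, 3]);

    // What the PR does: unwrap, since poisoning means another thread
    // panicked mid-operation and we don't want to recover from that.
    let len = lock.read().unwrap().len();

    // The explicit alternative, if recovering from poisoning were desired.
    let len_recovered = match lock.read() {
        Ok(guard) => guard.len(),
        Err(poisoned) => poisoned.into_inner().len(),
    };

    println!("{} {}", len, len_recovered);
}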

@n1t0 (Member, Author) commented Nov 12, 2020

Sure! Happy to discuss the possible options.

With 2. tokenizer.train(files, trainer=trainer) & tokenizer.train(files), the expected usage is:

# Default training, for example when you just want to train a pre-trained tokenizer on your own dataset
tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Not available as we speak, but soon
tokenizer.train(files=[ ... ])

# Same case but with some custom parameters
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
trainer = tokenizer.model.get_trainer()
print(trainer) # inspect the current training params
trainer.vocab_size = 32000
trainer.special_tokens = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], trainer=trainer)

# Or building your own
trainer = BpeTrainer(...)
tokenizer.train(files=[ ... ], trainer=trainer)

I think we should avoid 3. tokenizer.train(files, vocab_size=X, ...). It looks good at first sight, but it would be pretty impossible to document accurately.

I'm not entirely sure about what you have in mind with 4. trainer.train(tokenizer, files). Can you provide some examples of the various usages in Python (and maybe the expected signatures in Rust too)?

@Jess0-0 commented Nov 14, 2020

Hi,

I'm using the WordLevelTrainer in this branch. I'm able to import the package and initialize a trainer. However, when I try to use it like this:

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(trainer, ["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"])

it generates this error:

TypeError: Can't convert <WordLevelTrainer object at 0x00000194EC6DBD90> to Sequence

Wondering if there is any way to fix this? Thanks!

@n1t0 (Member, Author) commented Nov 14, 2020

In this PR we make the trainer optional when calling train, so you need to provide the files first:

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)

@Narsil (Collaborator) commented Nov 16, 2020

> Sure! Happy to discuss the possible options.
>
> With 2. tokenizer.train(files, trainer=trainer) & tokenizer.train(files), the expected usage is:
>
> # Default training, for example when you just want to train a pre-trained tokenizer on your own dataset
> tokenizer = Tokenizer.from_pretrained("bert-base-uncased") # Not available as we speak, but soon
> tokenizer.train(files=[ ... ])
>
> # Same case but with some custom parameters
> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
> trainer = tokenizer.model.get_trainer()
> print(trainer) # inspect the current training params
> trainer.vocab_size = 32000
> trainer.special_tokens = [ "[UNK]", "[PAD]" ]
> tokenizer.train(files=[ ... ], trainer=trainer)

I think this shows there's a problem with the naming of things. This example is using the trainer as glorified options (or params). It's even mentioned in the comment. The user seems to be passing pure arguments, but the name trainer suggests that the object does something, namely that it should train, which feels odd. That's the main reason why I like trainer.train(tokenizer, files) better: the trainer does train, as it's supposed to.
The other way to think about it is, if we want to stick to 2., we should rename everything to something more passive like options or params. That choice should also be reflected in the Rust API.

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
train_options = tokenizer.model.get_train_options()
print(train_options) # it's actually a raw Python dict
train_options["vocab_size"] = 32000
train_options["special_tokens"] = [ "[UNK]", "[PAD]" ]
tokenizer.train(files=[ ... ], train_options=train_options)
# or 
tokenizer.train(files=[...], **train_options)

Small nit:

train_options = tokenizer.model.get_train_options()
tokenizer.train(..., train_options)

feels strange, why not

train_options = tokenizer.get_train_options() # No need to call the underlying model.
tokenizer.train(..., train_options)

The main issue I have with that API is for training a tokenizer from scratch:

tokenizer = Tokenizer(BPE())
train_options = tokenizer.get_train_options()

should not work, as there's no way this code can know what vocabulary size I want. This code should raise an exception IMO, saying the user should define the vocab_size.
I'm unsure what the best way to fix it is.

  • Tokenizer(BPE(vocab_size=X)) seems the most reasonable. You should get an error when trying to encode, which would be the case anyway (the unk token is not defined).
  • tokenizer.get_train_options(vocab_size=XX) feels too odd (why does it need to be defined sometimes and not others, and so on).

A BpeTrainer with a Unigram model does not make sense; we should avoid enabling that mixup at all costs.
We also probably don't want to make choices in the user's stead where it's not necessary (for instance, vocab_size is a core argument; we should never choose one by default if it's not already available).

@n1t0 force-pushed the trainer-improvements branch 3 times, most recently from b818d10 to f1a9742 on November 20, 2020 at 01:16
@n1t0 (Member, Author) commented Nov 20, 2020

> I think this shows there's a problem with the naming of things. This example is using the trainer as glorified options (or params). It's even mentioned in the comment. The user seems to be passing pure arguments, but the name trainer suggests that the object does something, namely that it should train, which feels odd. That's the main reason why I like trainer.train(tokenizer, files) better: the trainer does train, as it's supposed to.
> The other way to think about it is, if we want to stick to 2., we should rename everything to something more passive like options or params. That choice should also be reflected in the Rust API.

I think you might have the wrong idea about the way things currently work. The trainer actually trains: it trains a Model, and it is not some sort of glorified params. We haven't exposed its methods yet, because we are still changing them and it is not the main path, so there's no rush on this side, but we will. It could actually even be used to train multiple Models while only ingesting the data once, later on. None of this would be possible with a dict.
Also, I don't think the Trainer should somehow have a dependency on the Tokenizer. It trains a Model given some input text, and it should stay that way. The fact that the Tokenizer pre-processes this text before feeding it to the Trainer is, and should stay, the responsibility of the Tokenizer.

> A BpeTrainer with a Unigram model does not make sense; we should avoid enabling that mixup at all costs.

Agreed, and we don't allow it. It makes me realize that I treated this path as unreachable though, so I will need to change it to raise instead.
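A hedged illustration of the mismatch being discussed (the exact exception type, and the example file, are assumptions, not behavior confirmed in this thread):

# Sketch: pairing a BpeTrainer with a Unigram model should fail with an
# explicit error rather than hit an unreachable code path.
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(Unigram())
trainer = BpeTrainer(vocab_size=1000)

try:
    tokenizer.train(["data.txt"], trainer)  # mismatched trainer/model
except Exception as err:
    print(f"rejected as expected: {err}")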
