
[finetune_trainer] enhancements and fixes #9042

Merged - 18 commits merged into master on Dec 15, 2020

Conversation

@stas00 (Contributor) commented Dec 10, 2020

The main need was to add speed metrics in order to be able to check for speed/performance regressions, but along the way a bunch of other things got worked on. Hopefully you will find the proposed changes useful.

This PR changes the trainer:

  • adds an optional metric_key_prefix argument to the evaluate and predict functions so that metrics are returned with a user-chosen key prefix instead of the default eval_ (see the sketch below)
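
For example, usage might look roughly like this (a minimal sketch; val_dataset and test_dataset are placeholder variables, not names from this PR):

# Hedged sketch of the new argument; omitting it keeps the old eval_ keys,
# so existing code remains unbroken.
val_metrics = trainer.evaluate(eval_dataset=val_dataset, metric_key_prefix="val")
# -> {"val_loss": ..., "val_bleu": ...}

test_output = trainer.predict(test_dataset, metric_key_prefix="test")
# -> test_output.metrics == {"test_loss": ..., "test_bleu": ...}

metrics = trainer.evaluate()
# -> {"eval_loss": ...} as before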

This PR changes finetune_trainer:

  • utils: sorts JSON keys when dumping to the filesystem
  • renames s/eval/val/ for the validation dataset results
  • adds speed metrics for all modes: train/eval/test (samples_per_second/runtime/n_objs)
  • refactors the logging/saving code for each mode
  • renames internal variables to distinguish plain metrics from outputs that contain more than just metrics
  • fixes a bug where all_results.json wasn't getting saved in the right place
  • rounds loss values to 4 decimals - before it was "eval_loss": 368.2950744628906 - not sure whether this is better done upstream in the trainer? (see the sketch below)
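
A hedged sketch of how the speed metrics, rounding, and sorted JSON dump above could be wired together (the helper names speed_metrics and save_json are illustrative, not necessarily the ones used in the PR):

import json
import time

def speed_metrics(split: str, start_time: float, num_samples: int) -> dict:
    # Runtime, throughput and number of objects for one split (train/val/test).
    runtime = time.time() - start_time
    return {
        f"{split}_runtime": round(runtime, 4),
        f"{split}_samples_per_second": round(num_samples / runtime, 3),
        f"{split}_n_objs": num_samples,
    }

def save_json(content: dict, path: str) -> None:
    # sort_keys makes results files easy to diff between runs.
    with open(path, "w") as f:
        json.dump(content, f, indent=4, sort_keys=True)

# e.g. after evaluating on the validation split:
# metrics["val_loss"] = round(metrics["val_loss"], 4)
# metrics.update(speed_metrics("val", start_time, num_val_samples))
# save_json(all_results, os.path.join(output_dir, "all_results.json"))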

Here is a sample of all_results.json after this change:

{
    "epoch": 1.0,
    "test_bleu": 22.8548,
    "test_gen_len": 35.9,
    "test_loss": 734.8612,
    "test_n_objs": 10,
    "test_runtime": 2.5185,
    "test_samples_per_second": 3.971,
    "train_n_objs": 200,
    "train_runtime": 24.9101,
    "train_samples_per_second": 8.029,
    "val_bleu": 26.581,
    "val_gen_len": 31.4,
    "val_loss": 738.3438,
    "val_n_objs": 200,
    "val_runtime": 33.9329,
    "val_samples_per_second": 5.894
}

@sgugger, @patil-suraj, @patrickvonplaten

@sgugger (Collaborator) commented Dec 10, 2020

Unfortunately the naming was settled a long time ago, and even if it's not ideal, we can't break it like this because people rely on the names of those keys in their code. I would advocate for the renaming to be done directly in the script and not inside Trainer.

If a lot of people really ask for it, we can think of a strategy to rename those keys progressively with some kind of deprecation warning, but since it's merely cosmetic, I would leave that to scripts using Trainer.

@stas00 (Contributor, Author) commented Dec 10, 2020

I see what you mean: someone relying on "eval_loss" when doing predict would have their code broken. Yes, we can't do that.
I moved this fix back into the finetune trainer, as it was originally.

Could we set a target for when we could do breaking changes and fix this bug?

I also find it strange that we use --n_val but eval_

And then predict vs test_.

The callbacks are inconsistent too :(

I'd plan a design session where we first collect all the rough edges and inputs on what needs to be polished and then adjust the trainer so that it's not limping for the rest of its life. Can this be done?

@stas00 changed the title from "[trainer and finetune_trainer] enhancements and fixes" to "[finetune_trainer] enhancements and fixes" on Dec 10, 2020
@sgugger (Collaborator) commented Dec 10, 2020

> Could we set a target for when we could do breaking changes and fix this bug?

Like I said, unless there is strong demand for it, I think we're just going to leave it as is. It's not the ideal naming choice but we have to deal with it now (kind of like PretrainedConfig vs PreTrainedConfig).

> I also find it strange that we use --n_val but eval_
> And then predict vs test_.

I don't understand that part. Also predict could be used for test or evaluation, so predict does not mean test.

> The callbacks are inconsistent too

Could you elaborate? If it's the evaluate vs predict inconsistency you mentioned, there is a reason: prediction_step is called both in predict and evaluate, whereas on_evaluate is only called in evaluate.
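
For illustration, a minimal callback sketch (assuming the standard TrainerCallback API; the class name is made up):

from transformers import TrainerCallback

class EvalLogger(TrainerCallback):
    # on_evaluate fires only at the end of Trainer.evaluate(), not Trainer.predict(),
    # even though both run prediction_step on every batch internally.
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"evaluation finished at step {state.global_step}: {metrics}")

# trainer.add_callback(EvalLogger())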

@stas00 (Contributor, Author) commented Dec 10, 2020

> Could we set a target for when we could do breaking changes and fix this bug?

> Like I said, unless there is strong demand for it, I think we're just going to leave it as is. It's not the ideal naming choice but we have to deal with it now (kind of like PretrainedConfig vs PreTrainedConfig).

I'm not sure how this is similar. I call trainer.predict() and get eval_ metrics in return - this is very confusing.

> I also find it strange that we use --n_val but eval_
> And then predict vs test_.

> I don't understand that part. Also predict could be used for test or evaluation, so predict does not mean test.

I suppose from the perspective of an existing trainer script like finetune they are the same. But surely this is much less of an issue than val vs eval.

> The callbacks are inconsistent too

> Could you elaborate? If it's the evaluate vs predict inconsistency you mentioned, there is a reason: prediction_step is called both in predict and evaluate, whereas on_evaluate is only called in evaluate.

Ah, I see, thank you for clarifying that - then why is there no on_predict to match on_evaluate? I assumed it was the former.

@sgugger (Collaborator) commented Dec 10, 2020

There is no on_predict event because the training loop never calls Trainer.predict. It does however call Trainer.evaluate. I guess we could add the on_predict event that would be called at the end of a Trainer.predict method.

> But surely this is much less of an issue than val vs eval.

Could you please clarify that part? I'm not sure what you mean by this.

> I'm not sure how this is similar. I call trainer.predict() and get eval_ metrics in return - this is very confusing.

If we go down that road, trainer.predict should only return predictions and not even the metrics (which we won't do either as it's a bigger breaking change but it would definitely make sense to me). Predict and evaluate do not mean test vs evaluation, it's really a matter of getting the predictions of the model vs evaluating the metrics on a given dataset (which could be train/eval/test).

I can get behind adding a prefix argument to those methods that defaults to None and is used to prefix the metrics. If one is passed, it's used (so it's easier to get the test_ prefix you want, without ugly post-processing); otherwise eval_ is used to avoid any breaking changes. Would that work for you?

@stas00 (Contributor, Author) commented Dec 11, 2020

> But surely this is much less of an issue than val vs eval.

> Could you please clarify that part? I'm not sure what you mean by this.

Of course: we have --n_val (mnemonic: validation), but then we return eval_(foo|bar) as the metrics for "validation". But see below.

So now that you have further expanded on eval+predict not being correlated to validation+testing (thank you!), I think I'm not approaching the problem in the right way.

Really, then, there is no bug in both predict and evaluate returning metrics with eval_-prefixed keys - the bug is in the end use in finetune_trainer.py. Here is what I'm thinking:

  1. It shouldn't be eval_bleu and test_bleu, it should be val_bleu and test_bleu - because these are both evaluation reports on 2 different splits, so the --n_val dataset should lead to val_bleu metrics and --n_test to test_bleu (not sure whether valid or val - probably val to match --n_val).
  2. Ideally that whole eval_ prefix should be removed altogether, since it has the potential to be confused with val as in validation, and there are no other metrics in that context - the trainer code forcefully adds eval_ to all metrics. But as we said, that's not possible without a breaking change, and it's not really a problem anyway: these are just evaluation metrics - no problem here.
  3. What the interface could use then is a split argument which it would prepend to the metric keys, so if someone is doing evaluation on the validation dataset the returned metrics could start with val_eval_.

So if my logic makes sense, practically we can either:

  1. leave the trainer alone and recode finetune_trainer.py to prefix eval_ with the split name - so it'll be val_eval_bleu and test_eval_bleu
  2. add an optional trainer argument split for evaluate and predict and have the trainer prepend the split name to the metrics, as in the option above.

Probably the 1st one is simpler, since it gives the user full flexibility (see the sketch below).
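
For option 1, the script-side renaming could look roughly like this (a hedged sketch; the split name is hard-coded only for illustration):

# The Trainer keeps returning eval_-prefixed metrics; the example script
# prepends the split name itself.
metrics = trainer.evaluate(eval_dataset=eval_dataset)    # {"eval_bleu": 26.581, ...}
metrics = {f"val_{k}": v for k, v in metrics.items()}    # {"val_eval_bleu": 26.581, ...}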

The same should happen to the results file naming - we need to choose whether those are (val|test)_results.json or (eval|predict)_results.json - and not the currently confusing mix of eval_results.json but test_results.json.

@sgugger (Collaborator) commented Dec 11, 2020

If you're happy with val_eval_bleu and test_eval_bleu, it's fine by me. I'd rather name split something like prefix in solution 2, unless I'm misunderstanding what you mean by it. It's also fine by me and could be a feature other users find useful (if they don't want eval_xxx as names).

@stas00 (Contributor, Author) commented Dec 11, 2020

> If you're happy with val_eval_bleu and test_eval_bleu, it's fine by me. I'd rather name split something like prefix in solution 2, unless I'm misunderstanding what you mean by it. It's also fine by me and could be a feature other users find useful (if they don't want eval_xxx as names).

OK, 3 follow-up questions:

  1. I suggested split since it's typically one of train|val|test, but prefix works just as well. Except it's then unclear in the function API - a prefix to what? metrics_key_prefix?

  2. So we are discussing optionally prefixing eval_bleu, etc. with something, and not replacing eval_, yes? So the end result would be f"{prefix}_eval_bleu", etc.

  3. If so, should the prefix include the separator _ (test_), or just be test with the trainer doing "_".join([prefix, key])? I suppose the latter.

What do you think, @patrickvonplaten and @patil-suraj? I think @sgugger's priority is the trainer itself, but what do you think would be ideal for the seq2seq examples domain?

@sgugger (Collaborator) commented Dec 11, 2020

For 1, yes, metric_key_prefix sounds better. For 2, I was actually thinking of replacing the eval_, which goes with 3: the prefix should not include the _. (See the sketch below.)
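
In other words, roughly this behaviour (a hedged sketch of the agreed naming scheme, not a copy of the merged code):

from typing import Dict

def prefix_metrics(metrics: Dict[str, float], metric_key_prefix: str = "eval") -> Dict[str, float]:
    # The caller passes e.g. "val" or "test" without the underscore; the separator
    # is added here, and the default preserves the existing eval_* keys.
    return {f"{metric_key_prefix}_{k}": v for k, v in metrics.items()}

print(prefix_metrics({"bleu": 26.581, "loss": 738.3438}, "val"))
# {'val_bleu': 26.581, 'val_loss': 738.3438}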

@stas00 (Contributor, Author) commented Dec 11, 2020

@sgugger, please have a look - do we want None as the default and fall back to eval in the code, or eval directly in the function signature? I suppose the latter, right? I'm a bit confused here about Optional, having a non-None default, and keeping the API unbroken. Help?

@stas00 (Contributor, Author) commented Dec 11, 2020

And then the only remaining thing I'm stuck on: do we call the results of evaluate in the finetune trainer val or eval? We call them test for predict - so confusing. Are those results named after the dataset split they run on, in which case they should be val and test, or after the functionality they check, in which case they should be eval and predict? But predict doesn't work, since the results are evaluation results.

I think they should be val and test since both sets are evaluation results on 2 different splits.

@sgugger (Collaborator) left a review comment

Looking good, thanks a lot! I have a few comments but nothing really important.

(Inline review threads on examples/seq2seq/finetune_trainer.py and src/transformers/trainer.py - outdated, resolved.)
@sgugger (Collaborator) left a review comment

Trying to get clearer, let me know if I managed or if I should just call it a day for today ;-)

(Inline review threads on examples/seq2seq/finetune_trainer.py - resolved.)
@stas00 (Contributor, Author) commented Dec 11, 2020

As the comments issue is unrelated to this PR, how about I just let you edit those comments as you think best, @sgugger? Anything you choose works for me. Thank you.

@stas00 (Contributor, Author) commented Dec 14, 2020

It doesn't look like others are going to review this PR. I didn't want to force anybody by requesting a review, just tagging. Is it better to ask for a review explicitly?

@sgugger, please let me know if you still want to adjust the comments unrelated to this PR, or should I merge it and you can deal with them later?

Thank you!

@patrickvonplaten (Contributor) left a review comment

Left some nits

stas00 and others added 3 commits on December 14, 2020 (Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>)

@stas00 (Contributor, Author) commented Dec 14, 2020

> Left some nits

Thank you, @patrickvonplaten!

I went ahead and removed the "optional" word from the docs section as well, to match the function signature. You didn't suggest I do that, so I just want to make sure I did the right thing.

(Inline review threads on src/transformers/trainer.py - outdated, resolved.)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@stas00 (Contributor, Author) commented Dec 15, 2020

> I went ahead and removed the "optional" word from the docs section as well, to match the function signature. You didn't suggest I do that, so I just want to make sure I did the right thing.

So that was wrong - thank you for fixing that, @sgugger.

  • We are removing Optional from the function signature because Optional == Union[..., None] and we have no None default here,
  • but we are still documenting to the user that the argument is optional (see the sketch below).
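
A small sketch of the distinction (illustrative function and docstring, not a copy of the merged code):

from typing import Dict

def evaluate_stub(metrics: Dict[str, float], metric_key_prefix: str = "eval") -> Dict[str, float]:
    """Illustrative stand-in showing the signature/docs convention.

    Args:
        metric_key_prefix (str, optional, defaults to "eval"):
            Prefix prepended to every returned metric key.
    """
    # Plain str, not Optional[str] = None: the default is a real value ("eval"),
    # so typing.Optional (i.e. Union[str, None]) does not belong in the annotation,
    # even though the docstring still describes the argument as optional to the user.
    return {f"{metric_key_prefix}_{k}": v for k, v in metrics.items()}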

@stas00 merged commit c19d046 into master on Dec 15, 2020
@stas00 deleted the speed branch on December 15, 2020 at 01:45