Rework TF trainer #6038
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6038 +/- ##
==========================================
- Coverage 78.80% 77.92% -0.89%
==========================================
Files 146 146
Lines 26325 26332 +7
==========================================
- Hits 20746 20519 -227
- Misses 5579 5813 +234
Continue to review full report at Codecov.
Yes, you now have to use
LGTM and it generally looks cleaner, thanks!
I just have two nits and one comment to go with #6015
src/transformers/trainer_tf.py
Outdated
if num_examples < 0:
    raise ValueError("The training dataset must have an asserted cardinality")

if self.args.dataloader_drop_last:
Nit: I like simple tests like those on one line:
approx = math.floor if self.args.dataloader_drop_last else math.ceil
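A quick sketch of how the suggested one-liner reads in context (num_examples, the batch size, and the drop_last flag are stubbed locals here, standing in for the trainer's values):

import math

num_examples = 1000  # stands in for the asserted dataset cardinality
batch_size = 32      # stands in for self.args.train_batch_size
drop_last = False    # stands in for self.args.dataloader_drop_last

# Pick the rounding function once, then apply it:
approx = math.floor if drop_last else math.ceil
steps_per_epoch = approx(num_examples / batch_size)  # 32 with ceil, 31 with floor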
src/transformers/trainer_tf.py
Outdated
name="training", step=self.global_step, profiler_outdir=self.args.logging_dir | ||
) | ||
self.train_loss = tf.keras.metrics.Sum() | ||
distributed_training_steps = self._create_training_routines() |
Nit: maybe_distributed_training_steps? It's not always distributed training.
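For context, a minimal sketch of why the "maybe" matters, assuming the routine either wraps the step function for a distribution strategy or returns it unchanged (names and structure are illustrative, not the PR's actual code):

import tensorflow as tf

def create_training_routine(strategy: tf.distribute.Strategy, step_fn):
    # Only multi-replica setups get the distributed wrapper; with a single
    # replica the step function runs as-is.
    if strategy.num_replicas_in_sync > 1:
        return lambda batch: strategy.run(step_fn, args=(batch,))
    return step_fn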
src/transformers/trainer_tf.py
Outdated
def _accumulate_next_gradients(self, ds):
    """Accumulates the gradients from the next element in dataset."""
    iterator = iter(ds)

    def training_step(features, labels):
This part would need to be more easily accessible for subclassing. I don't think that's the case when it's nested in another method.
For me, this function should not be accessible for subclassing, and more precisely should not be changed. Generally, that's why the important functions are nested: so that they cannot be updated accidentally.
You then remove the ability for any user to customize the way the loss is computed (like adding a penalty to the loss), or should that be done in run_model?
I haven't used TF a lot, but I thought overriding the train_step function was the way to go in Keras, so it would be more natural for users to have this name as a customization point.
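For readers following along, the Keras train_step override pattern being referenced looks roughly like this (a minimal sketch based on the TF >= 2.2 Keras API; the L2 penalty and its weight are illustrative assumptions, not part of this PR):

import tensorflow as tf

class PenalizedModel(tf.keras.Model):
    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred)
            # User customization: add a penalty on top of the model's own loss
            loss += 1e-2 * tf.add_n(
                [tf.nn.l2_loss(v) for v in self.trainable_variables]
            )
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}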
Loss computation is done directly inside the model, like in the PyTorch models; you cannot customize its computation. If you want to modify the result of the loss computation, you have to redefine the run_model function and update it according to what you want. Here we are not using the train_step method that has been available since TF 2.2.
Yes, loss computation is done directly in the model. I'm talking about the times when the user does not want exactly that and wants to customize it. I know we are not using the train_step method; I'm just saying that a user discovering our API might look for something with that name as a customization point instead of run_model.
Also, the PyTorch side has a training_step and a prediction_step as points of customization; it would be great if they had the same names for consistency.
But consistency can also be achieved by moving some parts of the PT Trainer API to look more like the TF one, so there is no obligation to change things here if it does not make sense.
Nice! I think up to now the goal was to have the two trainers be very similar, in that the API would be mostly the same. Is that still the case? I understand that some methods must diverge, though.
As an aside, there is this issue (#5765) as well, it would be great if you could check it out and batch it here with your changes!
src/transformers/trainer_tf.py
Outdated
@@ -21,6 +23,15 @@
logger = logging.getLogger(__name__)


if (*map(int, tf.__version__.split(".")),) < (2, 2, 0):
We added the packaging dependency a few months ago for that exact purpose. I don't think this check would work with alpha/beta/release-candidate versions.
Oh, I didn't know about that package! I will check how to use it for that purpose :)
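To illustrate the difference (version strings here are examples, assuming TF reports pre-releases like "2.2.0-rc1"):

from packaging.version import parse

# The naive tuple check breaks on pre-releases: int("0-rc1") raises
# ValueError when "2.2.0-rc1" is split on ".".
# packaging parses release candidates correctly:
assert parse("2.2.0-rc1").release == (2, 2, 0)  # an rc of 2.2.0 passes the check
assert parse("2.1.0").release < (2, 2, 0)       # older versions are rejected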
src/transformers/trainer_tf.py
Outdated
def get_optimizers(
    self, num_training_steps: int,
If we're not returning the optimizers with this method but assigning them to self, then the TF and PT trainers diverge, don't they?
And in that case, wouldn't create_optimizers be more appropriate? I'm expecting a get_optimizers method to return the optimizers.
I agree with create_optimizers; at first I was a bit puzzled about returning a class field when it can just be updated in place.
src/transformers/trainer_tf.py
Outdated
- optimizer, lr_scheduler = self.get_optimizers(num_training_steps=t_total)
- iterations = optimizer.iterations
+ self.get_optimizers(num_training_steps=t_total)
+ iterations = self.optimizers[0].iterations
  self.global_step = iterations.numpy()
  folder = os.path.join(self.args.output_dir, PREFIX_CHECKPOINT_DIR)
- ckpt = tf.train.Checkpoint(optimizer=optimizer, model=self.model)
+ ckpt = tf.train.Checkpoint(optimizer=self.optimizers[0], model=self.model)
I think this is a bit harder to read, as I don't immediately understand what self.optimizers[0] is.
What would you prefer? Creating a self.optimizer (singular)?
To go with this comment and the previous one, having the method be named create_optimizers and update an optimizer and an lr_scheduler attribute is fine by me. I can update the PT side to do the same and it shouldn't break anything.
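A hedged sketch of the convention agreed on here, with the method creating and assigning the attributes instead of returning them (the schedule and optimizer choices are illustrative assumptions, not the PR's implementation):

import tensorflow as tf

class TrainerSketch:
    def __init__(self, learning_rate: float = 5e-5):
        self.learning_rate = learning_rate
        self.optimizer = None
        self.lr_scheduler = None

    def create_optimizers(self, num_training_steps: int) -> None:
        # Create once, store as attributes; callers read self.optimizer
        # instead of unpacking a returned tuple.
        self.lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(
            initial_learning_rate=self.learning_rate,
            decay_steps=num_training_steps,
            end_learning_rate=0.0,
        )
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=self.lr_scheduler)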
I can rework it to make it more similar to the PT one, but I don't think they must be strictly the same, as they don't really work the same way... In any case, I can rethink it :) About the issue, I don't get what it means, so I asked for more details.
@jplu With my custom code I'm getting the following error:

Not sure if it's from my code or this PR though...
Hmm, this is weird... I never got this error myself. How did you come to this, so that I can try to replicate it?
@LysandreJik @sgugger Did I properly address all your comments?
LGTM, thanks for all the adjustments.
LGTM. I still think it would be good to keep the two trainers as close as possible, both with regard to their API and their functionality.
if parse(tf.__version__).release < (2, 2, 0):
    logger.info(
        "You need to run the TensorFlow trainer with at least the version 2.2.0, your version is {}".format(
            tf.__version__
        )
    )
    sys.exit(1)
Would be cool to update this still! Please disregard, this is great.
Squashed commits brought in by the rebase on master:

commit 54f9fbe (Julien Plu, Jul 29 2020): Rework TF trainer (huggingface#6038): fully rework training/prediction loops; fix method/variable/property names, scope, tuple indexing, and indentation; fix eval before log; add drop remainder for the test dataset; fix step numbers, logging datetime, and eval loss value; use global step instead of step and fix logging at step 0; fix breaking out of the train/test loops and the step in the prediction loop; force TF at least 2.2 for the trainer; use assert_cardinality to facilitate the dataset size computation; log steps per epoch; make tfds compliant with TPU; use the TF dataset enumerate instead of the Python one; fix data_dir; apply style; rebase on master; address Sylvain's and Lysandre's comments; trigger CI; remove unused import.
commit 3f94170 (Lysandre Debut, Jul 29 2020): [WIP] Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} models and tests (huggingface#5614)
commit 8a8ae27 (Sylvain Gugger, Jul 29 2020): Use google style to document properties (huggingface#6130)
commit fc64559 (Julien Plu, Jul 29 2020): Fix TF CTRL model naming (huggingface#6134)
commit 641b873 (Lysandre Debut, Jul 29 2020): XLNet PLM Readme (huggingface#6121)
commit 8d157c9 (Timo Moeller, Jul 29 2020): add deepset/xlm-roberta-large-squad2 model card (huggingface#6128)
commit 6c00285 (Morgan Funtowicz, Jul 29 2020): Added capability to quantize a model while exporting through ONNX (huggingface#6089)
commit 25de74c (Sylvain Gugger, Jul 29 2020): Use FutureWarning to deprecate (huggingface#6111)
commit 640550f (Morgan Funtowicz, Jul 29 2020): ONNX documentation (huggingface#5992)
commit 92f8ce2 (Sam Shleifer, Jul 28 2020): Fix deebert tests (huggingface#6102)
commit c49cd92 (Sam Shleifer, Jul 28 2020): [Fix] position_ids tests again (huggingface#6100)
commit 40796c5 (Sam Shleifer, Jul 28 2020): [fix] add bart to LM_MAPPING (huggingface#6099)
commit 5abe503 (Sam Shleifer, Jul 28 2020): Fix huggingface#6096: MBartTokenizer's mask token (huggingface#6098)
commit b1c8b76 (Joe Davison, Jul 28 2020): Fix zero-shot pipeline single seq output shape (huggingface#6104)
commit 06834bc (Lysandre Debut, Jul 28 2020): Logs should not be hidden behind a logger.info (huggingface#6097)
commit dafa296 (Sam Shleifer, Jul 28 2020): [s2s] Delete useless method, log tokens_per_batch (huggingface#6081)
commit dc4755c (Tanmay Thakur, Jul 28 2020): create model-card for lordtt13/emo-mobilebert (huggingface#6030)
commit 28931f8 (Sylvain Gugger, Jul 28 2020): Fix huggingface#6092 (huggingface#6093)
commit 5e97c82 (Manuel Romero, Jul 28 2020): Create README.md (huggingface#6076)
commit 54f49af (Clement Delangue, Jul 28 2020): Add inference widget examples (huggingface#5825)
commit 0206efb (Sylvain Gugger, Jul 28 2020): Make all data collators accept dict (huggingface#6065)
commit 31a5486 (Sam Shleifer, Jul 28 2020): github issue template suggests who to tag (huggingface#5790)
commit f0c7008 (Stas Bekman, Jul 28 2020): link to README.md (huggingface#6068)
commit 4f814fd (Pavel Soriano, Jul 28 2020): [Model Card] camembert-base-squadFR-fquad-piaf (huggingface#6087)
commit 3c7fbf3 (Sam Shleifer, Jul 28 2020): MBART: support summarization tasks where max_src_len > max_tgt_len (huggingface#6003)
commit 842eb45 (Tanmay Thakur, Jul 28 2020): New Community NB Add (huggingface#5824)
commit 018d61f (Andrés Felipe Cruz, Jul 28 2020): Moving transformers package import statements to relative imports in some files (huggingface#5796)
commit 7214954 (Lysandre Debut, Jul 28 2020): Should return a tuple for serialization (huggingface#6061)
commit 7a68d40 (Sam Shleifer, Jul 27 2020): [s2s] Don't mention packed data in README (huggingface#6079)
commit b7345d2 (Sam Shleifer, Jul 27 2020): [fix] no warning for position_ids buffer (huggingface#6063)
commit 1e00ef6 (Sam Shleifer, Jul 27 2020): [s2s] dont document packing because it hurts performance (huggingface#6077)
commit 9d0d3a6 (sgugger, Jul 27 2020): Pin TF while we wait for a fix
commit 769e6ba (Ramsri Goutham Golla, Jul 28 2020): Create README.md (huggingface#6032)

Merge commit message: Switch from return_tuple to return_dict; Fix test; Test TF Flaubert + Add {XLM, Flaubert}{TokenClassification, MultipleChoice} models and tests (#5614); Rework TF trainer (#6038); Add recent model. Co-authored-by: Lysandre Debut <lysandre@huggingface.co>, Julien Plu <plu.julien@gmail.com>
This PR brings several improvements to the TensorFlow Trainer: fully reworked training/prediction loops, a hard requirement on TensorFlow 2.2 or later, dataset size computation through assert_cardinality, TPU-compliant tfds handling, and more robust step counting and logging.
@colanim Can you please test this PR with TPU to see if it looks OK?
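For anyone trying the PR out, a hedged sketch of driving the reworked trainer end to end against the trainer API as of this PR (the toy features, checkpoint name, and argument values are illustrative; the PR-specific point is that a dataset of unknown size needs an asserted cardinality, otherwise training raises the ValueError discussed above):

import tensorflow as tf
from transformers import (
    TFAutoModelForSequenceClassification,
    TFTrainer,
    TFTrainingArguments,
)

# Toy, unbatched dataset: TFTrainer batches it itself from the training args.
features = {"input_ids": tf.random.uniform((32, 64), maxval=1000, dtype=tf.int32)}
labels = tf.random.uniform((32,), maxval=2, dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((features, labels))
# Datasets whose size TF cannot infer (e.g. loaded via tfds) must have their
# cardinality asserted so the trainer can compute steps per epoch:
train_ds = train_ds.apply(tf.data.experimental.assert_cardinality(32))

args = TFTrainingArguments(output_dir="./out", per_device_train_batch_size=8)
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
trainer = TFTrainer(model=model, args=args, train_dataset=train_ds)
trainer.train()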