
Conversation

chenmoneygithub (Contributor)

  1. Add checkpoints so that the intermediate training status is not lost.
  2. Fix the problem that LinearDecayWithWarmup does not implement get_config.

)
optimizer = keras.optimizers.Adam(learning_rate=learning_rate_schedule)

if FLAGS.restore_from_checkpoint is not None:

Member

I think we need to make this work so it restores after a failure in an automatic fashion.

If the checkpoint save path is set, we should save a checkpoint after each epoch (is this currently saving the best checkpoint, or one for each epoch?).

If the script is terminated and re-run with the same arguments, we should automatically pick up where we left off. Then we could add a skip_restore flag, defaulting to false, to opt out of this behavior and always start from scratch.

Collaborator

+1, it is very important that training be fully autonomous: there will be failures, and we do NOT want to have to rerun the script with different manually specified arguments every time there is a failure. The restart should be automated and should resume from the latest saved state (both model-wise and data-pipeline-wise), and that state should be retrievable without modifying any command-line argument.
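
For reference, a minimal sketch of this automatic-resume flow using keras.callbacks.BackupAndRestore (TensorFlow 2.8+). The toy model, dataset, and backup directory below are illustrative stand-ins, not the ones in this script:

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy stand-ins for the script's model and dataset.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(64, 4).astype("float32"),
     np.random.rand(64, 1).astype("float32"))
).batch(8)

# BackupAndRestore writes a temporary checkpoint at the end of every epoch.
# If fit() is interrupted and the script is re-run with the same arguments,
# training resumes from the last completed epoch with no extra flags; the
# backup is deleted once training finishes successfully.
backup = keras.callbacks.BackupAndRestore(backup_dir="/tmp/bert_training_backup")

model.fit(dataset, epochs=3, callbacks=[backup])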

Contributor Author

Updated this PR, please take another look, thanks!

Member

The file loading logic looks a little fragile to me. Have you taken a look at these:

We should make sure we are modeling best practices for doing this with the example we have here. It seems like there should be a better flow than the one here?

Contributor Author

Changed!

@chenmoneygithub force-pushed the fix-training branch 2 times, most recently from 7bc9424 to e348fd8 on May 20, 2022 01:22

@mattdangerw (Member)

Running just the testing code snippet we have in the README, I get the following. Is this expected?

WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.iter
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.beta_1
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.beta_2
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.decay
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer.learning_rate
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'm' for (root)._pooler_layer.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'm' for (root)._pooler_layer.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'm' for (root)._logit_layer.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'm' for (root)._logit_layer.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root)._pooler_layer.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root)._pooler_layer.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root)._logit_layer.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root)._logit_layer.bias
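
For context, these warnings appear when a tf.train.Checkpoint restore status is discarded while some saved values (here, the optimizer hyperparameters and slot variables) were never matched to objects. A minimal sketch of silencing them when the optimizer state is intentionally not restored; the model and checkpoint directory below are illustrative stand-ins:

import tensorflow as tf

# Illustrative stand-in for the saved model object.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

latest = tf.train.latest_checkpoint("/tmp/bert_training_backup")  # illustrative path
if latest is not None:
    # expect_partial() marks any values left unrestored (e.g. optimizer slots
    # saved during training) as intentionally skipped, which suppresses the
    # "unrestored values" warnings shown above.
    tf.train.Checkpoint(model=model).restore(latest).expect_partial()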


callbacks = []
if FLAGS.checkpoint_save_path is not None:
    checkpoint_path = FLAGS.checkpoint_save_path + "/checkpoint_{epoch:2d}"

Member

Have you tested this out? Just looking through the backup and restore code in this dir, it looks like this directory gets joined with other directory names and passed directly to tf.train.CheckpointManager, which I don't think supports the {epoch:2d} format you are using.

Make sure you actually ls the directory you are saving to when testing this out.

Contributor Author

Yeah, this was legacy code. This style works well with the ModelCheckpoint callback, but not with BackupAndRestore. I will update the code.
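
To illustrate the difference (the paths below are illustrative): keras.callbacks.ModelCheckpoint formats a filepath template per epoch itself, while keras.callbacks.BackupAndRestore manages its own files and expects a plain directory with no format placeholders.

from tensorflow import keras

# ModelCheckpoint formats the filepath itself, so an {epoch} template is fine here.
model_checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="/tmp/checkpoints/checkpoint_{epoch:02d}",
    save_weights_only=True,
)

# BackupAndRestore manages the files inside the directory itself, so it should
# be given a plain directory path with no format placeholders.
backup = keras.callbacks.BackupAndRestore(backup_dir="/tmp/training_backup")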

)
optimizer = keras.optimizers.Adam(learning_rate=learning_rate_schedule)

if FLAGS.skip_restore:

@mattdangerw (Member), May 23, 2022

Maybe we want skip_restore to just clear the directory without a warning?

Otherwise I don't totally get this. If you point checkpoint_save_path to an empty directory, doesn't this flag do nothing?

Either:

  • the directory is empty, and the flag does nothing
  • the directory is not empty, and the flag causes an error

What is the use case?

If we clear the directory when skip_restore is set, I understand the use case better: you can re-run the script with the same training args, but make sure you restart from scratch each time.

Contributor Author

Sounds good, I am now clearing out the directory when skip_restore is set to True.

)

flags.DEFINE_string(
"checkpoint_save_path",

Member

checkpoint_directory? To hint that this should be a directory, not a filepath?

Contributor Author

Yeah, I updated the code to use checkpoint_save_dir.

Member

Use the whole word directory to agree with arg naming in create_sentence_split_data

Contributor Author

done
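
For reference, a sketch of the renamed flag with the full word spelled out; the help string here is illustrative, not the one in the script:

from absl import flags

flags.DEFINE_string(
    "checkpoint_save_directory",
    None,
    "The directory where training checkpoints are saved and automatically "
    "restored from on restart.",
)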

)
optimizer = keras.optimizers.Adam(learning_rate=learning_rate_schedule)

if FLAGS.skip_restore and len(os.listdir(FLAGS.checkpoint_save_dir)) > 0:

@mattdangerw (Member), May 23, 2022

This hits some weird undefined behavior if skip_restore=True and checkpoint_save_dir=None. Maybe something like:

# Assumes `import os` and `import shutil` at the top of the script.
if FLAGS.skip_restore and FLAGS.checkpoint_save_dir is not None:
    if os.path.exists(FLAGS.checkpoint_save_dir):
        if os.path.isdir(FLAGS.checkpoint_save_dir):
            shutil.rmtree(FLAGS.checkpoint_save_dir)
        else:
            raise ValueError("`checkpoint_save_dir` should be a directory.")

Contributor Author

Yes, we should check if it is a dir before deleting it, but we probably don't want to error out when it's not a directory? If it's not a directory, they can still write to the path. My concern is that we are exposing strange logic: if you want to skip restoring, you have to make sure that checkpoint_save_directory points either to nothing or to a directory.

I updated the code with a directory check, and we can discuss whether an error is needed here.

@mattdangerw (Member), May 24, 2022

I'm not sure what use case you are trying to protect against. There is never a case where a user can write checkpoints to a file.

In the version I suggested, if the checkpoint dir is a file, you get an error right away. In the version you pushed, you only get the error after the first epoch, from deeper within TensorFlow: tensorflow.python.framework.errors_impl.FailedPreconditionError: /home/matt/bert_test_output/myfile.ckpt is not a directory.

Thinking about it more, it is probably nicer to give that error regardless.

if FLAGS.checkpoint_save_dir is not None and os.path.exists(FLAGS.checkpoint_save_dir):
    if not os.path.isdir(FLAGS.checkpoint_save_dir):
        raise ValueError("`checkpoint_save_dir` should be a directory.")
    elif FLAGS.skip_restore:
        shutil.rmtree(FLAGS.checkpoint_save_dir)

Contributor Author

I made a mistake: I thought it was okay to have a directory and a file with the same name under the same parent directory, but apparently that is not allowed.

Code updated.

@mattdangerw (Member) left a comment

LGTM, thanks!

@chenmoneygithub merged commit 1ca9ae1 into keras-team:master on May 24, 2022
@chenmoneygithub deleted the fix-training branch on November 30, 2022 21:12