TF2 porting: Enable early stopping + model save and load #739
Conversation
From what I can see, the flip side of early stopping is saving model weights. In researching this aspect, I believe this is relevant to Ludwig because Ludwig's implementation is built on subclassing Model and Layer. Is this a correct interpretation? |
This looks good. Regarding your question, I believe the most relevant part is: https://www.tensorflow.org/guide/keras/save_and_serialize#apis_for_saving_weights_to_disk_and_loading_them_back What I'm imagining is also an improvement over the previous implementation in a couple of ways:
Let me know what you think about these 3 points. |
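As a reference point, here is a minimal sketch of the weights-only save/load API from the linked guide (the layer sizes and checkpoint path are illustrative assumptions, not from this PR):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.build(input_shape=(None, 4))        # create the variables

model.save_weights('weights_checkpoint')  # TF-format checkpoint files

restored = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])
restored.build(input_shape=(None, 4))     # variables must exist first
restored.load_weights('weights_checkpoint')
```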
All three suggestions make sense. Do you want me to continue with the work to save model weights on this PR or open a new PR? |
Let me add another explanation of a piece of logic that may not be super obvious: progress. |
This reenables the functionality for saving weights during training when there is an improvement and at the end. This just enables the current TF1 behavior. I wanted to establish a working baseline and create some unit tests before working on the improvements. These are the new unit tests:
If this looks like a good starting point, I'll start working on these changes:
y_pred = np.load(os.path.join(exp_dir_name, 'y_predictions.npy'))
mse = mean_squared_error(y_pred, generated_data.test_df['y'])
what we can do here is that after the first full experiment we load the numpy predictions, and after the second experiment with resume we load the numpy predictions and then we assert that they are the same with np.isclose(first_preds, second_preds)
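A hedged sketch of the suggested assertion (the directory-variable names are hypothetical, for illustration only):

```python
import os
import numpy as np

def assert_same_predictions(first_dir, second_dir):
    # first_dir / second_dir: output dirs of the first full experiment
    # and the resumed experiment, respectively
    first_preds = np.load(os.path.join(first_dir, 'y_predictions.npy'))
    second_preds = np.load(os.path.join(second_dir, 'y_predictions.npy'))
    assert np.all(np.isclose(first_preds, second_preds))
```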
I made the recommended change in the test. Here is the commit 5a33a45
I think either restore may not be saving the weights or resume is not loading the weights correctly. The last epoch of the first full_experiment looks like this:
Epoch 28
Training: 100%|██████████| 3/3 [00:00<00:00, 65.62it/s]
Evaluation train: 100%|██████████| 3/3 [00:00<00:00, 110.61it/s]
Evaluation vali : 100%|██████████| 1/1 [00:00<00:00, 117.43it/s]
Evaluation test : 100%|██████████| 1/1 [00:00<00:00, 116.25it/s]
Took 0.1004s
╒═══════╤═════════╤═════════╤══════════════════════╤═══════════════════════╤════════╕
│ y │ loss │ error │ mean_squared_error │ mean_absolute_error │ r2 │
╞═══════╪═════════╪═════════╪══════════════════════╪═══════════════════════╪════════╡
│ train │ 20.6874 │ -3.7530 │ 20.6874 │ 3.8724 │ 0.9998 │
├───────┼─────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ vali │ 24.5326 │ -4.2580 │ 24.5326 │ 4.2998 │ 0.9997 │
├───────┼─────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ test │ 25.1545 │ -4.3416 │ 25.1545 │ 4.3740 │ 0.9997 │
╘═══════╧═════════╧═════════╧══════════════════════╧═══════════════════════╧════════╛
On the second full_experiment with resume, the first epoch reported is epoch 28, which I think makes sense, but the values don't look correct:
Resuming training of model: /tmp/pytest-of-root/pytest-0/test_model_save_resume0/results/experiment_run/model
Epoch 28
Training: 100%|██████████| 3/3 [00:00<00:00, 40.29it/s]
Evaluation train: 100%|██████████| 3/3 [00:00<00:00, 110.75it/s]
Evaluation vali : 100%|██████████| 1/1 [00:00<00:00, 115.91it/s]
Evaluation test : 100%|██████████| 1/1 [00:00<00:00, 118.83it/s]
Took 0.1300s
╒═══════╤══════════╤═════════════╤══════════════════════╤═══════════════════════╤═════════╕
│ y │ loss │ error │ mean_squared_error │ mean_absolute_error │ r2 │
╞═══════╪══════════╪═════════════╪══════════════════════╪═══════════════════════╪═════════╡
│ train │ 458.7538 │ 287591.8750 │ 460.9559 │ 287591.8438 │ -2.3807 │
├───────┼──────────┼─────────────┼──────────────────────┼───────────────────────┼─────────┤
│ vali │ 514.6791 │ 339206.7188 │ 514.6791 │ 339206.7188 │ -3.0783 │
├───────┼──────────┼─────────────┼──────────────────────┼───────────────────────┼─────────┤
│ test │ 495.4767 │ 311607.8438 │ 495.4767 │ 311607.8438 │ -3.2109 │
╘═══════╧══════════╧═════════════╧══════════════════════╧═══════════════════════╧═════════╛
Great work so far! It's also great that for loading and saving weights it is as simple as 1 function call on the ecd object :) |
re:
Actually, I was thinking this was too easy of a change. This means I must have missed something. :-) Thank you for the suggestion on how to test model reloading. Once I confirm loading is working, I'll start on the new functions. re:
Since saving weights only involves the
and when we want to keep "the model that performed the best on validation" do
Any thoughts on this approach? |
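To make the approach concrete, here is a hedged sketch of the "keep the best-on-validation weights" idea under discussion (not the actual Ludwig trainer; the two callables are hypothetical stand-ins):

```python
def train(ecd, num_epochs, train_one_epoch, evaluate_validation_loss):
    # ecd: a tf.keras.Model subclass; train_one_epoch and
    # evaluate_validation_loss stand in for Ludwig's epoch loop
    # and validation pass (hypothetical helpers).
    best_val_loss = float('inf')
    best_weights = None
    for _ in range(num_epochs):
        train_one_epoch(ecd)
        val_loss = evaluate_validation_loss(ecd)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_weights = ecd.get_weights()  # local copy, no self.*
    if best_weights is not None:
        ecd.set_weights(best_weights)  # keep the best-on-validation model
    return ecd
```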
Sounds good to me. My only doubt is that we probably don't need to assign it to self, because that ecd copy object will be needed only inside the train function, and at the end of it, it will be either returned or discarded (depending on the logic), so we can probably get away without assigning it to self. |
Here is the log file for this comment on the model save/resume unit test |
Got it, so it looks like it doesn't work. |
Yeah, like I wrote earlier... this seemed too easy. :-) No conclusions yet, but I have some observations to pass along. When I look in the debugger after loading the weights, it appears that the lists for weights are empty, even though I noticed this from the API documentation
Given that we have user-defined classes, it sounds like "a first call" is needed. The following example may illustrate this observation. I create a custom Model class and define a couple of Dense layers. After "running" one input tensor through the custom model, I save the weights and then reload the weights into a second instance of the custom model.
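A sketch of the experiment just described (layer sizes and checkpoint path are illustrative):

```python
import numpy as np
import tensorflow as tf

class CustomModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(16, activation='relu')
        self.dense2 = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))

x = np.random.random((2, 4)).astype('float32')

m1 = CustomModel()
m1(x)                          # "first call" creates the variables
m1.save_weights('custom_ckpt')

m2 = CustomModel()
m2(x)                          # build variables before loading;
                               # otherwise the restore is deferred
m2.load_weights('custom_ckpt')

assert np.allclose(m1(x).numpy(), m2(x).numpy())
```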
If this line
OTOH, if I comment out
Anyway, I'm going to pursue this avenue to see how it could resolve the issue. |
I may be wrong, but from your example my guess is that the weights are initialized lazily the first time you execute a call(). But it's weird that you can save them and reload them before then... not really sure what's going on :) |
Still working on the model training resume function. In looking over TF2's docs, I found this discussion: https://www.tensorflow.org/guide/checkpoint. I think this is the kind of functionality we are looking for. I'm looking to see how I can adapt this to Ludwig. |
Checkpointing saves every k steps and so accumulates the model at all steps, which is not really what we want (although in that guide they also talk about restoring). This one should be more relevant, I guess: https://www.tensorflow.org/guide/keras/save_and_serialize |
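For concreteness, a sketch of the tf.train.Checkpoint + CheckpointManager pattern from the checkpoint guide (toy model and the ./tf_ckpts path are assumptions); max_to_keep is what bounds how many periodic checkpoints accumulate on disk:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build(input_shape=(None, 4))
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)

ckpt.restore(manager.latest_checkpoint)  # no-op on the very first run
# ... training steps ...
manager.save()  # writes ./tf_ckpts/ckpt-1, ckpt-2, ...
```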
re:
Actually, this was the first approach I tried. When
This is how
Anyway, I'm still working on understanding how weight restoration operates. |
Where does that error come from? We could try to dive deeper and understand its origin. |
OK, I'll take another look at the error. If interested, here is the full error stack trace. |
So I don't really know what
Or maybe TensorFlow internally uses the names of variables in a way we don't really know about, and requires that you don't change the variables a call function receives as inputs (which is what the error message seems to suggest). In that case, the responsible code is probably this: https://github.com/uber/ludwig/blob/tf2_porting/ludwig/features/base_feature.py#L170-L173
And refactor the remaining code to use |
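To make the suspected issue concrete, a hedged illustration (a hypothetical layer, not the actual base_feature.py code) contrasting mutating the structure that call() receives with building a new local one:

```python
import tensorflow as tf

class FeatureBlock(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.fc = tf.keras.layers.Dense(8)

    # Suspect pattern: writing back into the received inputs
    # def call(self, inputs):
    #     inputs['hidden'] = self.fc(inputs['hidden'])  # mutates input
    #     return inputs

    # Safer pattern: leave the received structure untouched
    def call(self, inputs):
        outputs = dict(inputs)  # shallow copy of the inputs dict
        outputs['hidden'] = self.fc(inputs['hidden'])
        return outputs
```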
At first blush this fits the error message. I'll make your recommended changes and test. |
Before turning in, I wanted to provide an update.
Still encountering the same error. Digging deeper, I noticed
When I pick this up tomorrow, I'll see if I can redesign this part of the code so that its input parameter is not modified. |
I added the test_model_save_reload_API test, which enables us to test saving and loading of models in a fine-grained way. Some feature types are commented out; those are the feature types for which it's not currently working. Let me know if the way it works is clear. It's not super polished (in particular, directories, data creation, etc. can be improved), but it's a starting point. |
I pulled the new unit test and it works for me. I'm still working on the issues re: sequence model save/restore procedure. |
That test can help you with the sequence input and output features. If you uncomment them at the beginning of the test, you'll see that if they are present it doesn't pass (and the same for all the other commented features). |
Updates on my side: using the newly added test, I solved most of the problems with input and output features. The only input feature that is untested is the timeseries feature (because it still needs to be ported), and the only two output features that are not covered yet are sequence and text. All other feature types work fine. Guess this is good complementary work with yours, as you were focusing on the sequence output. I also discovered some minor bugs in pre- and post-processing on some features and solved them along the way, which is good :) |
I also ported the timeseries feature and added it to the test. It works fine. Now the only missing features are the sequence and text output features. |
Great news on the other features. re: model weights save/restore...
At this point, I've exhausted all the possibilities. So I'm going to take your advice and submit an issue with the TFA project. The example will be a custom model and layers with
I should have the examples ready in the next day or two and will make the submission. |
Commit eedc838 is the fix for the error that occurs when making a prediction with the Generator decoder after restoring weights. We still have the issue of validating the restored weight values. I just wanted to get this into the baseline for future work. |
Sounds good. In the meantime I will keep on working on the restore, trying to figure out the metrics / optimizer issue. |
OK by me, though I'll point out that the PR started by re-enabling the early stopping function; the save/load weights work was then layered on top. One name could be
Or, if you prefer, the rename is |
@w4nderlust This is just an update. As we discussed, I'm creating a minimal example to open an issue in TFA re: saving and loading weights when the Generator decoder is used. Actually, I'm creating two minimal examples:
The reason for this posting is that I've confirmed that restoring the optimizer is key to avoiding this situation:
Using the simple regression example, I can recreate the "sudden jump" in the training loss value. The "sudden jump" occurs if I use a new optimizer when resuming training. However, if I reuse the optimizer from the initial training, there is no "sudden jump". Here are log files demonstrating the two situations. Not reusing the optimizer... there is a "sudden jump" in the training loss when resuming training:
Reusing the optimizer from initial training...no sudden jump in training loss when resuming training
Here is the code if you're interested in seeing the effect of the optimizer. See line 176; comment or uncomment the line as desired. |
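A sketch of the two resume strategies being compared (toy model and checkpoint path are hypothetical). Checkpointing the optimizer together with the model is what avoids the "sudden jump" on resume:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build(input_shape=(None, 1))
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
ckpt.write('./resume_ckpt')  # after the initial training run

# Resume, variant A: restore model AND optimizer state -> smooth loss
ckpt.restore('./resume_ckpt')

# Resume, variant B: fresh optimizer, weights only -> loss jumps,
# because Adam's moment estimates restart from zero
# optimizer = tf.keras.optimizers.Adam()
```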
Funny, I just posted this: It shows how, by saving and reloading a stateless optimizer, things work fine, while saving and reloading a stateful optimizer breaks things. My take is that we should separate these two aspects: the resume aspect (making sure that when we resume, things keep on working) and the plain save and load aspect (we only care about weights and predictions being the same, not whether training can resume correctly). For the second one, all features work except the sequence and text output features, which is what's important to fix on your side, I believe. Anyway, it's great that we both confirmed the same behavior; now we know exactly where the errors are (optimizer restoration and TFA loading). |
@w4nderlust Two topics:
The failure indicates an "unexpected keyword" during initialization of the Adam optimizer. The "unexpected keyword" is
Let me know if this test works for you.
To eliminate that possibility, I tried recreating the issue with the standard Ludwig unit tests. This is when I encountered the Adam optimizer issue. While not all tests were passing in
Looks like a recent change may have broken something. |
I dug deeper into the Adam initialization error. I submitted PR #749 to fix. |
After fixing the Adam initialization error, I tried the original Ludwig program where the weights for the sequence model do not appear to be loaded correctly. The problem still exists in that version of the program. Assuming this may be related to TFA, I worked on a "minimal example" to illustrate the issue, with plans to open an issue with TFA. Here is the minimal example. Right now I'm going back over both Ludwig's TF2 implementation and how I constructed the minimal example. I'm hoping that by comparing the two, I find the difference that may help us with save/load of the sequence model. |
Regarding the Adam parameters: yes, those names were the TF1 parameter names; changing to the TF2 names fixed it. Regarding the sequence feature, I think you are on the right track. If the custom example works, then there's something in our current implementation that differs from the example and makes it not work; we should identify what it is. |
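An illustration of the renamed Adam arguments (the specific values are just examples); the TF1-style keywords are what raise the "unexpected keyword" TypeError under TF2:

```python
import tensorflow as tf

# TF1 (tf.train.AdamOptimizer) style -- fails under tf.keras:
# tf.keras.optimizers.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999)

# TF2 names:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```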
Re-introduced the early stopping option. From what I can tell there is no existing unit test for early stopping, so I added such a test. This test used two values (3 and 5) for the early_stop option and confirmed training stopped per the early stop specification. The test_model_training_options.py test can be used as a foundation for other unit tests around model training options, e.g., saving or not saving various logs. This PR enabled only the early stopping test; nothing else was re-enabled.
If this looks OK, I can start re-enabling other model training options.
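A hedged sketch of the early-stopping check described above (the fixtures and the exact layout of the returned stats are assumptions for this illustration, not confirmed Ludwig API details):

```python
import pytest
from ludwig.api import LudwigModel

@pytest.mark.parametrize('early_stop', [3, 5])
def test_early_stopping(early_stop, model_definition, train_df):
    # model_definition / train_df: hypothetical fixtures supplying a
    # minimal model config and training dataframe
    model_definition['training'] = {'early_stop': early_stop,
                                    'epochs': 100}
    model = LudwigModel(model_definition)
    train_stats = model.train(data_df=train_df)

    # With early stopping, training should halt well before the
    # 100-epoch cap (stats layout here is an assumption).
    epochs_run = len(train_stats['training']['combined']['loss'])
    assert epochs_run < 100
```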