
Fix issue 1093 loss value mismatch #1103

Merged

Conversation

@jimthompson5802 (Collaborator) commented Feb 21, 2021

Code Pull Requests

Fix Issue #1093

Before this PR, when a model consists of a single output feature, the loss reported for combined differs slightly from the loss reported for the feature itself. In the case of multiple output features, the sum of the individual losses for each feature differs slightly from the loss reported for combined. As best I can tell, this is just a reporting issue; it has no effect on model convergence.

At this point, this PR addresses the issue for the numeric, binary, and categorical output features. Additional work is needed to propagate the fix to other output feature types.
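
To make the kind of discrepancy concrete, here is a minimal sketch (my own illustration, not Ludwig code, and the aggregation scheme shown is only an assumed stand-in for the real cause): averaging per-batch mean losses can drift slightly from a mean taken over all examples whenever the last batch is a partial batch.

```python
# Illustration only (not Ludwig code): two reasonable ways of aggregating a
# per-example loss disagree slightly when the final batch is a partial batch.
import numpy as np

rng = np.random.default_rng(0)
# 21 full batches of 128 examples plus one final batch of 40 examples
batch_losses = [rng.random(128) for _ in range(21)] + [rng.random(40)]

mean_of_batch_means = np.mean([b.mean() for b in batch_losses])  # batches weighted equally
mean_over_examples = np.concatenate(batch_losses).mean()         # examples weighted equally

print(f"{mean_of_batch_means:.4f} vs {mean_over_examples:.4f}")  # small mismatch
```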

Before PR

Here is an excerpt of the Ludwig log that illustrates the issue.
b4_fix_log.txt

Here is a specific example for the numeric output feature. Note the difference between the reported loss for y1 and the combined loss.

Epoch 2
Training: 100%|██████████| 22/22 [00:00<00:00, 262.69it/s]
Evaluation train: 100%|██████████| 22/22 [00:00<00:00, 721.01it/s]
Evaluation vali : 100%|██████████| 3/3 [00:00<00:00, 625.74it/s]
Evaluation test : 100%|██████████| 7/7 [00:00<00:00, 800.83it/s]
Took 0.1817s
╒═══════╤════════╤═════════╤══════════════════════╤═══════════════════════╤═════════╕
│ y1    │   loss │   error │   mean_squared_error │   mean_absolute_error │      r2 │
╞═══════╪════════╪═════════╪══════════════════════╪═══════════════════════╪═════════╡
│ train │ 0.5693 │  0.2523 │               0.5693 │                0.5761 │ -5.4792 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ vali  │ 0.6719 │  0.2941 │               0.6719 │                0.6306 │ -6.6828 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ test  │ 0.7436 │  0.3089 │               0.7436 │                0.6677 │ -8.4869 │
╘═══════╧════════╧═════════╧══════════════════════╧═══════════════════════╧═════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 0.6402 │
├────────────┼────────┤
│ vali       │ 0.6639 │
├────────────┼────────┤
│ test       │ 0.6642 │
╘════════════╧════════╛

For the multiple output feature use case, note that the sum of the individual losses for the output features does not equal the combined loss:

Epoch 3
Training: 100%|██████████| 22/22 [00:00<00:00, 121.39it/s]
Evaluation train: 100%|██████████| 22/22 [00:00<00:00, 223.90it/s]
Evaluation vali : 100%|██████████| 3/3 [00:00<00:00, 556.86it/s]
Evaluation test : 100%|██████████| 7/7 [00:00<00:00, 483.78it/s]
Took 0.4030s
╒═══════╤════════╤═════════╤══════════════════════╤═══════════════════════╤═════════╕
│ y1    │   loss │   error │   mean_squared_error │   mean_absolute_error │      r2 │
╞═══════╪════════╪═════════╪══════════════════════╪═══════════════════════╪═════════╡
│ train │ 0.4639 │  0.2116 │               0.4639 │                0.5136 │ -4.2804 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ vali  │ 0.5617 │  0.2632 │               0.5617 │                0.5691 │ -5.4227 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ test  │ 0.6211 │  0.2713 │               0.6211 │                0.6050 │ -6.9327 │
╘═══════╧════════╧═════════╧══════════════════════╧═══════════════════════╧═════════╛
╒═══════╤════════╤════════════╕
│ y2    │   loss │   accuracy │
╞═══════╪════════╪════════════╡
│ train │ 0.7907 │     0.4682 │
├───────┼────────┼────────────┤
│ vali  │ 0.8694 │     0.4375 │
├───────┼────────┼────────────┤
│ test  │ 0.7454 │     0.5755 │
╘═══════╧════════╧════════════╛
╒═══════╤════════╤════════════╤═════════════╕
│ y3    │   loss │   accuracy │   hits_at_k │
╞═══════╪════════╪════════════╪═════════════╡
│ train │ 2.5125 │     0.0809 │      0.2688 │
├───────┼────────┼────────────┼─────────────┤
│ vali  │ 2.4445 │     0.0417 │      0.2917 │
├───────┼────────┼────────────┼─────────────┤
│ test  │ 2.4906 │     0.0566 │      0.2642 │
╘═══════╧════════╧════════════╧═════════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 3.7727 │
├────────────┼────────┤
│ vali       │ 3.9990 │
├────────────┼────────┤
│ test       │ 3.8001 │
╘════════════╧════════╛

After PR

With the PR, here is the same output for the single numeric output feature:

Epoch 2
Training: 100%|██████████| 22/22 [00:00<00:00, 236.35it/s]
Evaluation train: 100%|██████████| 22/22 [00:00<00:00, 1067.87it/s]
Evaluation vali : 100%|██████████| 3/3 [00:00<00:00, 361.55it/s]
Evaluation test : 100%|██████████| 7/7 [00:00<00:00, 679.90it/s]
Took 0.1920s
╒═══════╤════════╤═════════╤══════════════════════╤═══════════════════════╤═════════╕
│ y1    │   loss │   error │   mean_squared_error │   mean_absolute_error │      r2 │
╞═══════╪════════╪═════════╪══════════════════════╪═══════════════════════╪═════════╡
│ train │ 0.6367 │  0.3242 │               0.6367 │                0.5967 │ -7.4361 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ vali  │ 0.5193 │  0.1818 │               0.5193 │                0.5644 │ -5.7765 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ test  │ 0.5360 │  0.1913 │               0.5360 │                0.5402 │ -6.6955 │
╘═══════╧════════╧═════════╧══════════════════════╧═══════════════════════╧═════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 0.6367 │
├────────────┼────────┤
│ vali       │ 0.5193 │
├────────────┼────────┤
│ test       │ 0.5360 │
╘════════════╧════════╛

And with multiple output features, the sum of the individual feature losses now equals the combined loss:

Epoch 3
Training: 100%|██████████| 22/22 [00:00<00:00, 147.71it/s]
Evaluation train: 100%|██████████| 22/22 [00:00<00:00, 402.81it/s]
Evaluation vali : 100%|██████████| 3/3 [00:00<00:00, 444.30it/s]
Evaluation test : 100%|██████████| 7/7 [00:00<00:00, 497.02it/s]
Took 0.2890s
╒═══════╤════════╤═════════╤══════════════════════╤═══════════════════════╤═════════╕
│ y1    │   loss │   error │   mean_squared_error │   mean_absolute_error │      r2 │
╞═══════╪════════╪═════════╪══════════════════════╪═══════════════════════╪═════════╡
│ train │ 0.5158 │  0.2757 │               0.5158 │                0.5341 │ -5.8364 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ vali  │ 0.4241 │  0.1516 │               0.4241 │                0.5097 │ -4.5333 │
├───────┼────────┼─────────┼──────────────────────┼───────────────────────┼─────────┤
│ test  │ 0.4400 │  0.1593 │               0.4400 │                0.4871 │ -5.3179 │
╘═══════╧════════╧═════════╧══════════════════════╧═══════════════════════╧═════════╛
╒═══════╤════════╤════════════╕
│ y2    │   loss │   accuracy │
╞═══════╪════════╪════════════╡
│ train │ 0.7735 │     0.4682 │
├───────┼────────┼────────────┤
│ vali  │ 0.6530 │     0.6667 │
├───────┼────────┼────────────┤
│ test  │ 0.7751 │     0.4811 │
╘═══════╧════════╧════════════╛
╒═══════╤════════╤════════════╤═════════════╕
│ y3    │   loss │   accuracy │   hits_at_k │
╞═══════╪════════╪════════════╪═════════════╡
│ train │ 2.5176 │     0.0636 │      0.2514 │
├───────┼────────┼────────────┼─────────────┤
│ vali  │ 2.5683 │     0.1250 │      0.2500 │
├───────┼────────┼────────────┼─────────────┤
│ test  │ 2.5416 │     0.0660 │      0.2170 │
╘═══════╧════════╧════════════╧═════════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 3.8069 │
├────────────┼────────┤
│ vali       │ 3.6454 │
├────────────┼────────┤
│ test       │ 3.7568 │
╘════════════╧════════╛

Note: Due to rounding, hand-calculating the sum of the individual losses may not match the displayed combined loss in the 4th decimal place.
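For example, on the train split 0.5158 + 0.7735 + 2.5176 = 3.8069, which matches the reported combined loss exactly, while on the test split 0.4400 + 0.7751 + 2.5416 = 3.7567 versus the reported 3.7568.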
Here is the full log extract:
after_fix_log.txt

@jimthompson5802 (Collaborator, Author) commented:

Commit 6a10c10 resolves an issue I encountered with sampled_softmax_cross_entropy. In testing, I noticed the output feature and combined loss values still differ slightly. For this specific case, I think this is expected because sampling is involved in calculating the loss. Loss calculations occur in two different locations: once during training and a second time when the metric is updated. Since the loss computations occur at two different times, the samples used are most likely different.

@jimthompson5802 (Collaborator, Author) commented:

Given the changes for the sequence feature, this PR also fixes Issue #1096.

@jimthompson5802 (Collaborator, Author) commented:

@w4nderlust This PR is finally ready for review. The following summarizes the key changes:

Fix the mismatch in reported loss values:

  • Converted every output feature's eval_loss_function to be a subclass of tf.keras.losses.Loss; this is the primary change to resolve the mismatch in reported loss values (see the sketch below).
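
As a reference point, here is a minimal sketch of that kind of conversion (the class name and details are my own illustration, assuming an MSE-style numeric loss, not Ludwig's actual implementation):

```python
import tensorflow as tf

# Minimal sketch, assuming an MSE-style numeric eval loss; the class name and
# details are illustrative rather than Ludwig's actual implementation.
class MSEEvalLoss(tf.keras.losses.Loss):
    def call(self, y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        # Return per-example losses; the Loss base class applies the reduction,
        # so per-feature and combined reporting aggregate the same way.
        return tf.math.squared_difference(y_true, y_pred)

loss_fn = MSEEvalLoss(reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)
```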

Fix for sampled_softmax_cross_entropy for sequence-like features:

  • Added custom classes BasicDecoder and BasicDecoderOutput to support retrieval of the projection input tensor required for the sampled softmax calculation.
  • Modified the decoder_teacher_forcing() method to use the above custom classes to make the projection input tensor (PROJECTION_INPUT) available.
  • Created a sequence-specific sampled softmax loss class: SequenceSampledCrossEntropyLoss.
  • Renamed several classes to be more explicit.
  • Created a custom FixedUnigramCandidateSampler class that supports updating the counts for Laplace smoothing of the sampled candidate tensors.
  • Updated the sequence_sampled_softmax_cross_entropy() function to support how Ludwig passes tensors in TF2.
  • Created a sequence-specific function to sample sequence-like features: sampled_values_from_sequence.
  • Updated the OutputFeature.call() method to support passing the PROJECTION_INPUT tensor required for the sampled softmax calculation (see the sketch after this list).
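
For context, a rough sketch of why the projection input tensor has to be passed around (variable names and shapes are illustrative assumptions, not Ludwig's actual code): tf.nn.sampled_softmax_loss consumes the pre-projection hidden state together with the projection weights and biases, rather than the final logits.

```python
import tensorflow as tf

# Illustrative shapes; not Ludwig's actual decoder code.
num_classes, hidden_size, num_sampled = 1000, 256, 25
proj_w = tf.Variable(tf.random.normal([num_classes, hidden_size]))  # projection weights
proj_b = tf.Variable(tf.zeros([num_classes]))                       # projection biases

def sampled_loss(labels, projection_input):
    # labels: [batch, 1] int64 targets; projection_input: [batch, hidden] pre-projection state
    return tf.nn.sampled_softmax_loss(
        weights=proj_w,
        biases=proj_b,
        labels=labels,
        inputs=projection_input,
        num_sampled=num_sampled,
        num_classes=num_classes,
    )
```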

@jimthompson5802 (Collaborator, Author) commented Apr 25, 2021

Completed rework of categorical sampled softmax tensor passing.

@jimthompson5802 (Collaborator, Author) commented:

@w4nderlust

  • Resolved the last two comments. However, I took a different approach than the one suggested for this comment; the comment thread explains the approach.
  • Tested the num_reserved_ids=2 parameter and adjusted the unigrams parameter to use class_counts[2:] for the fixed_unigram sampler. It did not make a difference; I still saw zero entries in the true_expected_count tensor. So I backed off the use of num_reserved_ids and am still making use of Laplace smoothing to avoid numerical issues in the sampled softmax calculation of the logits (a minimal sketch of the smoothing idea follows this list).
  • Separated the learned_unigram sampler from the fixed_unigram sampler.
  • Updated comments in the sequence decoders to reflect our discussion on how to name and document the shapes of key tensors.
  • Cleaned up deprecated code in various modules.
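
A minimal sketch of the Laplace smoothing idea mentioned above (the counts and parameters are illustrative assumptions, not the actual FixedUnigramCandidateSampler code): add-one smoothing of the class counts passed as unigrams keeps every candidate's expected count away from zero.

```python
import tensorflow as tf

class_counts = [0, 0, 120, 47, 3, 0, 88]      # raw class counts; some are zero
smoothed = [c + 1 for c in class_counts]      # Laplace (add-one) smoothing

true_classes = tf.constant([[2], [3]], dtype=tf.int64)  # [batch, num_true]
sampled, true_expected, sampled_expected = tf.random.fixed_unigram_candidate_sampler(
    true_classes=true_classes,
    num_true=1,
    num_sampled=4,
    unique=True,
    range_max=len(smoothed),
    unigrams=smoothed,
)
```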

Assuming the GitHub Actions run is clean, the PR is ready for the next round of review.
