
PUBDEV-7831: Add new learning curve plotting function/method #5164

Merged
merged 25 commits into rel-zipf from tomf_pubdev-7831_learning_curve on Mar 22, 2021

Conversation

Contributor

@tomasfryda tomasfryda commented Dec 1, 2020

https://h2oai.atlassian.net/browse/PUBDEV-7831

Needs to be tested with multiple different model configurations.
For now, this should work at least on the models we get from AutoML (it is tested mainly on those).

Questionable choices:

  • When given a StackedEnsemble, draw the learning curve of its metalearner and include the metalearner model id in the plot title so that this choice is explicitly visible.

  • If it seems hard to draw a nice ribbon, skip the ribbon and draw individual CV lines instead. This is because DeepLearning epochs don't seem to be saved at the same points across CV models, so the sd estimate for the ribbon fails (a rough sketch of this fallback is below).
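
A rough sketch of that fallback logic (not the PR's actual implementation; it assumes each CV scoring history is available as a pandas DataFrame):

import numpy as np

def ribbon_or_lines(cv_histories, x_col, y_col):
    # cv_histories: one scoring-history DataFrame per CV model.
    xs = [tuple(h[x_col]) for h in cv_histories]
    if len(set(xs)) == 1:
        # All CV models were scored at the same x values (e.g. the same tree counts),
        # so a mean +/- sd ribbon is well defined.
        ys = np.array([h[y_col].to_numpy() for h in cv_histories])
        return {"kind": "ribbon", "x": np.array(xs[0]), "mean": ys.mean(axis=0), "sd": ys.std(axis=0)}
    # Otherwise (e.g. DeepLearning epochs differ between the CV models) fall back
    # to drawing each CV model's curve individually.
    return {"kind": "lines", "curves": [(h[x_col].to_numpy(), h[y_col].to_numpy()) for h in cv_histories]}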

Related fixes

  • CoxPH scoring history is missing the last entry: Fix CoxPH scoring history #5186 (merged)
  • GLM with lambda search reports the number of iterations of the last model, not of the winning submodel (merged)

Example

[example learning curve plot]

@tomasfryda tomasfryda self-assigned this Dec 1, 2020
@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch 3 times, most recently from cf7e9ca to a61a4d2 on December 7, 2020 17:21
@tomasfryda tomasfryda marked this pull request as ready for review December 8, 2020 09:56
@wendycwong
Contributor

@tomasfryda

The scoring history only has the xval deviance and nothing else. However, if you want to check out other values apart from deviance, I suggest that you add them to the scoring history instead of having cv_scoring_history. I think it will be easier.

Also, CV in GLM only calculates the deviance values. If you need other metrics, you can add them there (in method cv_computeAndSetOptimalParameters(ModelBuilder[] cvModelBuilders)) as well. CV is used in GLM only to figure out the best alpha and lambda values, nothing else. Also, the number of iterations may not be the same as in the main run.

I will check what happens when you limit the max runtime.

Wendy

@wendycwong
Contributor

Also, at the end of cross-validation, we calculate the deviance on the hold-out set and pick the best deviance for the alpha/lambda value. However, after that, we do not save the deviance values for the later iterations. This is where it happens (in method cv_computeAndSetOptimalParameters(ModelBuilder[] cvModelBuilders)):

  _parms._lambda = Arrays.copyOf(_parms._lambda,lmin_max+1);
  _xval_deviances = Arrays.copyOf(_xval_deviances, lmin_max+1);
  _xval_sd = Arrays.copyOf(_xval_sd, lmin_max+1);

where lmin_max is the index of the submodel with the lowest test deviance.

@wendycwong
Contributor

Okay, when you set max_runtime_secs and nfolds > 1, each CV model and the main model are allocated an equal amount of time to build, like this (in maxRuntimeSecsPerModel(int cvModelsCount, int parallelization) of ModelBuilder.java):

_parms._max_runtime_secs / Math.ceil((double)cvModelsCount / parallelization + 1).
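
As a quick numeric illustration of that expression (just the arithmetic as quoted, not H2O code):

import math

def max_runtime_secs_per_model(max_runtime_secs, cv_models_count, parallelization):
    # Same expression as the Java line above.
    return max_runtime_secs / math.ceil(cv_models_count / parallelization + 1)

# A 60 s budget with 5-fold CV built sequentially: each CV model and the main model
# get 60 / ceil(5/1 + 1) = 10 s.
print(max_runtime_secs_per_model(60, 5, 1))  # 10.0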

So, in theory, they should not run out of time. Also, GLM will always run for at least one iteration before it quits, so as not to return an empty model. This implies that we may use more time than is allowed by max_runtime_secs. Hope this helps. Wendy

@tomasfryda
Contributor Author

_parms._max_runtime_secs / Math.ceil((double)cvModelsCount / parallelization + 1).

So, in theory, they should not run out of time. Also, GLM will always run for at least one iteration before it quits, so as not to return an empty model. This implies that we may use more time than is allowed by max_runtime_secs. Hope this helps. Wendy

Thank you @wendycwong. Sorry, I most likely misled you by concentrating on the time constraint.

I added a lot of logging and fortunately I was able to reproduce it after a couple of runs. It looks like the problem is that GLM selects one of the first alpha values tested, but generateSummary is called after the model is finished and uses _state._iter as number_of_iterations; however, _state._iter contains the number of iterations from the last alpha value tried (which in this case is not alpha_best). My fix would be something like adding a line to hex.glm.GLMModel#generateSummary:

if (_parms._lambda_search) { //h2o-algos/src/main/java/hex/glm/GLMModel.java#L1534
    lambdaSearch = 1;
    iter = _output._submodels[_output._selected_submodel_idx].iteration;  // THIS IS THE ADDED LINE
    _output._model_summary.set(0, 3, "nlambda = " + _parms._nlambdas + ", lambda.max = " + MathUtils.roundToNDigits(_lambda_max, 4) + ", lambda.min = " + MathUtils.roundToNDigits(_output.lambda_best(), 4) + ", lambda.1se = " + MathUtils.roundToNDigits(_output.lambda_1se(), 4));
}

Note that I still don't understand well how GLM works internally, e.g., I don't know whether the submodels are created during CV or only for the final model, and whether the iteration in a submodel corresponds to the iteration of the final model that is used for prediction, so this fix might be incorrect. What do you think?


The scoring history only has the xval deviance and nothing else. However, if you want to check out other values apart from deviance, I suggest that you add them to the scoring history instead of having cv_scoring_history. I think it will be easier.

I created cv_scoring_history[] because it seemed easier to me than adding it to the scoring_history.

I don't have a strong opinion about this, but when I look at glm@model$scoring_history I get the table below, which already seems like a very wide table with a lot of NAs.

And since I want to get the actual scoring history of the CV models, I would probably have to either

  • add a new column indicating whether a row belongs to a CV model or the final model and rbind the CV scoring histories (see the sketch after the table below), or

  • add multiple columns such as cv_training_rmse so there is a clear distinction between the final model and the CV models; and since some models are scored on a time interval (I think deep learning), this would add a lot of new rows with NAs for either the CV models or the final one.

But I might be missing some better way to do it - if you still think your idea is better, could you please describe it more concretely?

            timestamp   duration iterations negative_log_likelihood objective training_rmse training_logloss
1 2020-12-11 12:25:13  0.000 sec          0               689.51201   0.66236            NA               NA
2 2020-12-11 12:25:13  0.001 sec          1               252.03686   0.36284            NA               NA
3 2020-12-11 12:25:13  0.002 sec          2               222.59586   0.35667            NA               NA
4 2020-12-11 12:25:13  0.002 sec          3               220.00584   0.35661            NA               NA
5 2020-12-11 12:25:13  0.003 sec          4               219.97507   0.35661       0.21102          0.21131
  training_r2 training_auc training_pr_auc training_lift training_classification_error validation_rmse
1          NA           NA              NA            NA                            NA              NA
2          NA           NA              NA            NA                            NA              NA
3          NA           NA              NA            NA                            NA              NA
4          NA           NA              NA            NA                            NA              NA
5     0.81033           NA              NA       2.65561                       0.02498         0.20668
  validation_logloss validation_r2 validation_auc validation_pr_auc validation_lift
1                 NA            NA             NA                NA              NA
2                 NA            NA             NA                NA              NA
3                 NA            NA             NA                NA              NA
4                 NA            NA             NA                NA              NA
5            0.20681       0.82245        0.99494           0.99149         2.48148
  validation_classification_error
1                              NA
2                              NA
3                              NA
4                              NA
5                         0.02239
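
For concreteness, the first alternative (indicator column + rbind) could look roughly like the following pandas sketch. This is a hypothetical illustration, not code from this PR; frame and column names are just for the example.

import pandas as pd

def combined_scoring_history(main_history, cv_histories):
    # main_history: the final model's scoring history as a DataFrame;
    # cv_histories: one scoring-history DataFrame per CV model.
    main = main_history.assign(model="main")
    folds = [h.assign(model="cv_%d" % (i + 1)) for i, h in enumerate(cv_histories)]
    # Columns reported by only some of the models end up as NAs for the others,
    # which is the drawback mentioned above.
    return pd.concat([main] + folds, ignore_index=True, sort=False)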

Thank you again for helping me with the GLM.

@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch from 859a430 to 6ec0c9d on December 14, 2020 13:41
@tomasfryda
Contributor Author

@sebhrusen @ledell I went through the algos and created a mapping between stopping criteria (as specified in the documentation) and the columns present in scoring_history. I couldn't find some stopping criteria in the scoring history, namely mean_per_class_error, MSE, and RMSLE. I added some "metadata" after a "/" to note when the criterion is present in the scoring history.

[{
    "mean_per_class_error": {},
    "MSE": {},
    "RMSLE": {},
    "anomaly_score": {
      "IsolationForest": ["mean_anomaly_score"]
    },
    "custom": {
      "GBM":["training_custom", "validation_custom"],
      "DRF":["training_custom", "validation_custom"]
    },
    "custom_increasing": {
      "GBM":["training_custom", "validation_custom"],
      "DRF":["training_custom", "validation_custom"]
    },
    "deviance": {
      "GLM/lambda_search": ["deviance_train", "deviance_test", "deviance_xval", "deviance_se"],
      "DRF/regression": ["training_deviance", "validation_deviance"],
      "GBM/regression": ["training_deviance", "validation_deviance"],
      "DeepLearning/regression": ["training_deviance", "validation_deviance"],
      "XGBoost/regression": ["training_deviance", "validation_deviance"],
    },
    "logloss/binomial,multinomial": {
      "DeepLearning": ["training_logloss", "validation_logloss"],
      "DRF": ["training_logloss", "validation_logloss"],
      "GBM": ["training_logloss", "validation_logloss"],
      "XGBoost": ["training_logloss", "validation_logloss"]
    },
    "RMSE/binomial,multinomial,regression": {
      "DeepLearning": ["training_rmse", "validation_rmse"],
      "DRF": ["training_rmse", "validation_rmse"],
      "GBM": ["training_rmse", "validation_rmse"],
      "XGBoost": ["training_rmse", "validation_rmse"]
    },
    "MAE/regression": {
      "DRF": ["training_mae", "validation_mae"],
      "GBM": ["training_mae", "validation_mae"],
      "DeepLearning": ["training_mae", "validation_mae"],
      "XGBoost": ["training_mae", "validation_mae"]
    },
    "AUC/binomial, + opt-in in multinomial": {
      "DeepLearning": ["training_auc", "validation_auc"],
      "DRF": ["training_auc", "validation_auc"],
      "GBM": ["training_auc", "validation_auc"],
      "XGBoost": ["training_auc", "validation_auc"]
    },
    "AUCPR/binomial, + opt-in in multinomial": {
      "DeepLearning": ["training_pr_auc", "validation_pr_auc"],
      "DRF": ["training_pr_auc", "validation_pr_auc"],
      "GBM": ["training_pr_auc", "validation_pr_auc"],
      "XGBoost": ["training_pr_auc", "validation_pr_auc"]
    },
    "lift_top_group/binomial": {
      "DeepLearning": ["training_lift", "validation_lift"],
      "DRF": ["training_lift", "validation_lift"],
      "GBM": ["training_lift", "validation_lift"],
      "XGBoost": ["training_lift", "validation_lift"]
    },
    "misclassification/binomial,multinomial": {
      "DeepLearning": ["training_classification_error", "validation_classification_error"],
      "DRF": ["training_classification_error", "validation_classification_error"],
      "GBM": ["training_classification_error", "validation_classification_error"],
      "XGBoost": ["training_classification_error", "validation_classification_error"]
    }
  },
  {
    "Not a stopping criterion (based on docs) but present in scoring history": {
      "DeepLearning": ["training_r2", "validation_r2"],
      "CoxPH": ["loglik"],
      "IsolationForest": ["mean_tree_path_length"],
      "GLM": ["objective", "convergence", "negative_log_likelihood", "sum(etai-eta0)^2"],
      "PCA": ["objective"]
    }
  }
]

Should I accept only the names from the stopping criteria (e.g. misclassification), or should I allow the user to specify both (misclassification or classification_error)?
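
For reference, a minimal sketch of how such a mapping could be consumed either way (illustrative names only, not the PR's internals):

METRIC_TO_COLUMNS = {
    "misclassification": ("training_classification_error", "validation_classification_error"),
    "logloss": ("training_logloss", "validation_logloss"),
    "rmse": ("training_rmse", "validation_rmse"),
    # ... remaining entries from the mapping above; aliases such as
    # "classification_error" could simply be added as extra keys.
}

def resolve_metric_columns(metric, scoring_history_columns):
    # Lower-casing makes "Misclassification" and "misclassification" equivalent.
    columns = METRIC_TO_COLUMNS.get(metric.lower())
    if columns is None:
        raise ValueError("Unsupported metric: %s" % metric)
    return [c for c in columns if c in scoring_history_columns]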

@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch from 6ec0c9d to f99a194 on December 15, 2020 10:03
@tomasfryda
Contributor Author

tomasfryda commented Dec 16, 2020

FYI I created a separate PR for the GLM fix: #5191 (merged now)

@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch from a24bfd6 to 19626a8 on January 13, 2021 18:10
Contributor

@ledell ledell left a comment


We discussed this on a call, but I am summarizing the requested changes here:

  1. New plot label names:
  • Training
  • Training (CV Models)
  • Validation
  • Cross-validation

  2. Support lower- and upper-case versions of the metric names (e.g. tolower() in R), with the official version being lower case. We should consider whether to make all our metric names default to lower case... something to think about for another day.
> h2o.learning_curve_plot(gbm, metric = "AUC")
Error in match.arg(metric) : 
  'arg' should be one of “AUTO”, “auc”, “aucpr”, “mae”, “rmse”, “anomaly_score”, “convergence”, “custom”, “custom_increasing”, “deviance”, “lift_top_group”, “logloss”, “misclassification”, “negative_log_likelihood”, “objective”, “sumetaieta02”

  3. Remove the suffix of the AutoML model IDs (for plotting purposes) so that it's easier to read.

  4. Change the cutoff line color to red or green.

  5. Add a newline after the "Selected" plot label for more space (let's see if this looks better or worse).

  6. Notes about GLM differences:

  • deviance (lambda search), iteration
  • objective (no lambda search), iterations

The GLM scoring history is missing some metrics, and in the case of no lambda search we don't have the cross-validation metrics either. Can we add all of this? Let's discuss in #dev-h2o-3 to see if we can fill in the missing pieces in GLM so we can have a more unified plotting experience (especially since we are using the GLM metalearner as the curve we plot for Stacked Ensembles).

@tomasfryda
Contributor Author

I made the modifications (1-5) and here are the results:

Python: [learning curve plot screenshot]

R: [learning curve plot screenshot, XRT model]

Comment on lines +2392 to +2402
#' Create a learning curve plot for an H2O Model. Learning curves show the dependence of an error metric on
#' learning progress, e.g., RMSE vs. the number of trees trained so far in GBM. There can be up to 4 curves
#' showing Training, Validation, Training on CV Models, and Cross-validation error.
#'
#' @param model an H2O model
#' @param metric Metric to be used for the learning curve plot. These should mostly correspond with the stopping metric.
#' @param cv_ribbon if TRUE, plot the CV mean as a line and the CV standard deviation as a ribbon around the mean;
#' if NULL, it will attempt to automatically determine whether this is a suitable visualisation
#' @param cv_lines if TRUE, plot the scoring history for individual CV models; if NULL, it will attempt to
#' automatically determine whether this is a suitable visualisation
#'
Contributor Author

New docstring here. The Python version is basically the same, except NULL -> None.
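
Hypothetical Python usage, assuming the Python method mirrors the R signature above (i.e. model.learning_curve_plot with metric / cv_ribbon / cv_lines arguments):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
# Any binary-classification dataset works; prostate.csv with the CAPSULE response
# is used here only as an example.
train = h2o.import_file("prostate.csv")
train["CAPSULE"] = train["CAPSULE"].asfactor()

gbm = H2OGradientBoostingEstimator(nfolds=5, seed=1)
gbm.train(y="CAPSULE", training_frame=train)

# Leaving cv_ribbon/cv_lines as None lets the function decide between a CV ribbon
# and individual CV lines, as described in the docstring.
gbm.learning_curve_plot(metric="logloss", cv_ribbon=None, cv_lines=None)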

@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch 2 times, most recently from 2bfbbc6 to 43ffcf2 on January 25, 2021 18:41
h2o-core/src/main/java/hex/ModelBuilder.java (review thread resolved)
h2o-py/h2o/model/model_base.py (outdated; review thread resolved)
h2o-py/h2o/explanation/_explain.py (outdated; review thread resolved)
@tomasfryda tomasfryda force-pushed the tomf_pubdev-7831_learning_curve branch from 30fd5cb to 996acbe on March 19, 2021 12:54
@tomasfryda tomasfryda changed the base branch from rel-zermelo to master March 19, 2021 12:54
Comment on lines +212 to +213
parms._generate_scoring_history = true;
parms._score_iteration_interval = (parms._valid == null) ? 5 : -1;
Contributor Author

score_iteration_interval might change depending on benchmark results.

When _valid is specified, we use lambda search, which provides plenty of information for the learning curve. Otherwise, lambda search is off and the scoring history often contains just one or two entries. Even a metalearner trained on a 250k-row subset of the Airlines dataset has fewer than 10 iterations, so even this score_iteration_interval might be too big.

Contributor Author

Based on benchmark results from @wendycwong, we decided to keep the value at 5, as it has less than a 2% performance impact but significantly improves the learning curve in some situations. This affects only the AUTO metalearner in SE.

[benchmark results screenshot, March 22, 2021]

@tomasfryda tomasfryda changed the base branch from master to rel-zipf March 22, 2021 19:08
        copiedScoringHistory.set(rowIndex, colIndex, sh.get(rowIndex, colIndex));
      }
    }
    mainModel._output._cv_scoring_history[i] = copiedScoringHistory;
Contributor

Does cloning the object not work for this use case? It seems like you are just copying the table.

if model.algo == "stackedensemble":
model = model.metalearner()

if model.algo not in ("stackedensemble", "glm", "gam", "glrm", "deeplearning",
Contributor

It would be nice to generalize this: instead of enumerating the algos here, add "an interface" that each algo would implement.

This would let us get rid of the big if that determines allowed_metrics and allowed_timesteps.
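
A rough sketch of what that interface might look like on the Python side (hypothetical, not part of this PR):

class LearningCurveInfo:
    # Each algo-specific model class could implement these instead of being
    # enumerated in a big if/elif in _explain.py.
    def allowed_metrics(self):
        raise NotImplementedError

    def allowed_timesteps(self):
        raise NotImplementedError


class GBMLearningCurveInfo(LearningCurveInfo):
    def allowed_metrics(self):
        return ["deviance", "logloss", "rmse", "mae", "auc", "aucpr",
                "lift_top_group", "misclassification", "custom"]

    def allowed_timesteps(self):
        return ["number_of_trees"]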

@michalkurka michalkurka merged commit 8173e3f into rel-zipf Mar 22, 2021
flaviusburca pushed a commit to mware-solutions/h2o-3 that referenced this pull request Apr 21, 2021

* Add prototype implementation of learning curve in R

* Update R version

* Fix NPE when no scoring history available

* Add initial python version

* Unify colors between R and Python

* Add error for models without scoring history in python

* Fix alpha selection in GAM/GLM

* Use glm_model_summary as model_summary in GAMs

* Add tests and fix bugs

* Fix python cv_ribbon default override

* Change default colors and improve R legend

* Fix logic error in R

* Add coxPH and rename cv_individual_lines to cv_lines

* Add CoxPH and IsolationForest

* Map stopping metric to metric in scoring history

* Adjust docstring

* Add examples to docstrings and fix R cran check

* Fix legend in matplotlib2

* Incorporate suggestions from MLI meeting

* Add more docstrings and make logloss as default metric for multiple scenarios

* Copy TwoDimTable as in GAM instead of clone

See hex.gam.GAMModel#copyTwoDimTable for the GAM's implementation.

* Remove matplotlib import at the top of the _explain.py file

* Move GAM specific modification of ModelBase to h2o-bindings/bin/custom/python/gen_gam.py

* Assign default implementation to learning curve plot that complains about missing matplotlib

* Adapt to the new features from Wendy's PR

(cherry picked from commit 8173e3f)