PUBDEV-7831: Add new learning curve plotting function/method #5164

Merged
25 commits merged on Mar 22, 2021

Commits (25)
5f2f507
Add prototype implementation of learning curve in R
tomasfryda Nov 30, 2020
d4af1d8
Update R version
tomasfryda Dec 1, 2020
398497b
Fix NPE when no scoring history available
tomasfryda Dec 2, 2020
e8ce498
Add initial python version
tomasfryda Dec 2, 2020
5451e61
Unify colors between R and Python
tomasfryda Dec 2, 2020
c76f80c
Add error for models without scoring history in python
tomasfryda Dec 4, 2020
06b5fca
Fix alpha selection in GAM/GLM
tomasfryda Dec 4, 2020
c79b90c
Use glm_model_summary as model_summary in GAMs
tomasfryda Dec 4, 2020
c1d532a
Add tests and fix bugs
tomasfryda Dec 7, 2020
08ed93a
Fix python cv_ribbon default override
tomasfryda Dec 8, 2020
c44f710
Change default colors and improve R legend
tomasfryda Dec 10, 2020
694ab40
Fix logic error in R
tomasfryda Dec 11, 2020
28e4fe7
Add coxPH and rename cv_individual_lines to cv_lines
tomasfryda Dec 14, 2020
019230d
Add CoxPH and IsolationForest
tomasfryda Dec 14, 2020
9cbd15a
Map stopping metric to metric in scoring history
tomasfryda Dec 16, 2020
adcfcfe
Adjust docstring
tomasfryda Dec 16, 2020
059b00a
Add examples to docstrings and fix R CRAN check
tomasfryda Dec 17, 2020
e1ebe85
Fix legend in matplotlib2
tomasfryda Jan 13, 2021
19e8c0d
Incorporate suggestions from MLI meeting
tomasfryda Jan 14, 2021
dae0c4d
Add more docstrings and make logloss as default metric for multiple s…
tomasfryda Jan 15, 2021
37aee53
Copy TwoDimTable as in GAM instead of clone
tomasfryda Jan 25, 2021
c94895d
Remove matplotlib import at the top of the _explain.py file
tomasfryda Jan 26, 2021
634e104
Move GAM specific modification of ModelBase to h2o-bindings/bin/custo…
tomasfryda Jan 26, 2021
2b0bd96
Assign default implementation to learning curve plot that complains a…
tomasfryda Mar 19, 2021
996acbe
Adapt to the new features from Wendy's PR
tomasfryda Mar 19, 2021
3 changes: 3 additions & 0 deletions h2o-algos/src/main/java/hex/ensemble/Metalearners.java
@@ -209,6 +209,9 @@ protected void setCustomParams(GLMParameters parms) {
//add GLM custom params
super.setCustomParams(parms);

parms._generate_scoring_history = true;
parms._score_iteration_interval = (parms._valid == null) ? 5 : -1;
Comment on lines +212 to +213
tomasfryda (Contributor, Author) commented:
_score_iteration_interval might change depending on benchmark results.

When _valid is specified, we use lambda search, which provides plenty of information for the learning curve. Otherwise, lambda search is off and the scoring history often contains just one or two entries. Even a metalearner trained on a 250k-row subset of the Airlines dataset has fewer than 10 iterations, so even this _score_iteration_interval of 5 might be too big.

tomasfryda (Contributor, Author) replied:

Based on benchmark results from @wendycwong, we decided to keep the value at 5: it has less than a 2% performance impact but significantly improves the learning curve in some situations. This affects only the AUTO metalearner in Stacked Ensembles.

[Screenshot: Screen Shot 2021-03-22 at 20 11 48]
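As a usage illustration (not part of this PR's diff), here is a minimal Python sketch of how the denser scoring history these defaults produce feeds the new plotting method. It assumes the Python GLM exposes generate_scoring_history and score_iteration_interval as counterparts of the Java parameters above; the file path and response column are placeholders.

import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
# Hypothetical 250k-row subset of the Airlines data mentioned above.
train = h2o.import_file("airlines_subset_250k.csv")

glm = H2OGeneralizedLinearEstimator(
    family="binomial",
    generate_scoring_history=True,  # assumed Python counterpart of parms._generate_scoring_history
    score_iteration_interval=5      # assumed Python counterpart of parms._score_iteration_interval
)
glm.train(y="IsDepDelayed", training_frame=train)  # placeholder response column

glm.learning_curve_plot()  # the new plotting method added by this PR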


//specific to AUTO mode
parms._non_negative = true;
//parms._alpha = new double[] {0.0, 0.25, 0.5, 0.75, 1.0};
10 changes: 9 additions & 1 deletion h2o-bindings/bin/custom/R/gen_gam.py
@@ -61,7 +61,15 @@
if(!missing(missing_values_handling))
parms$missing_values_handling <- missing_values_handling
""",

module="""
.h2o.fill_gam <- function(model, parameters, allparams) {
if (is.null(model$scoring_history))
model$scoring_history <- model$glm_scoring_history
if (is.null(model$model_summary))
model$model_summary <- model$glm_model_summary
return(model)
}
"""
)

doc = dict(
25 changes: 25 additions & 0 deletions h2o-bindings/bin/custom/python/gen_gam.py
@@ -15,6 +15,31 @@ def Lambda(self):
def Lambda(self, value):
self._parms["lambda"] = value

def _additional_used_columns(self, parms):
"""
:return: Gam columns if specified.
"""
return parms["gam_columns"]

def summary(self):
"""Print a detailed summary of the model."""
model = self._model_json["output"]
if "glm_model_summary" in model and model["glm_model_summary"] is not None:
return model["glm_model_summary"]
print("No model summary for this model")

def scoring_history(self):
"""
Retrieve Model Score History.

:returns: The score history as an H2OTwoDimTable or a Pandas DataFrame.
"""
model = self._model_json["output"]
if "glm_scoring_history" in model and model["glm_scoring_history"] is not None:
return model["glm_scoring_history"].as_data_frame()
print("No score history for this model")


extensions = dict(
__imports__="""
import h2o
1 change: 1 addition & 0 deletions h2o-core/src/main/java/hex/Model.java
@@ -928,6 +928,7 @@ public String[] features() {
* User-facing model scoring history - 2D table with modeling accuracy as a function of time/trees/epochs/iterations, etc.
*/
public TwoDimTable _scoring_history;
public TwoDimTable[] _cv_scoring_history;

public double[] _distribution;
public double[] _modelClassDist;
26 changes: 26 additions & 0 deletions h2o-core/src/main/java/hex/ModelBuilder.java
@@ -936,6 +936,32 @@ public void cv_mainModelScores(int N, ModelMetrics.MetricBuilder mbs[], ModelBui
Log.info(mainModel._output._cross_validation_metrics.toString());
mainModel._output._cross_validation_metrics_summary = makeCrossValidationSummaryTable(cvModKeys);

// Put cross-validation scoring history to the main model
if (mainModel._output._scoring_history != null) { // check if scoring history is supported (e.g., NaiveBayes doesn't)
mainModel._output._cv_scoring_history = new TwoDimTable[cvModKeys.length];
for (int i = 0; i < cvModKeys.length; i++) {
TwoDimTable sh = cvModKeys[i].get()._output._scoring_history;
String[] rowHeaders = sh.getRowHeaders();
String[] colTypes = sh.getColTypes();
int tableSize = rowHeaders.length;
int colSize = colTypes.length;
TwoDimTable copiedScoringHistory = new TwoDimTable(
sh.getTableHeader(),
sh.getTableDescription(),
sh.getRowHeaders(),
sh.getColHeaders(),
sh.getColTypes(),
sh.getColFormats(),
sh.getColHeaderForRowHeaders());
for (int rowIndex = 0; rowIndex < tableSize; rowIndex++) {
for (int colIndex = 0; colIndex < colSize; colIndex++) {
copiedScoringHistory.set(rowIndex, colIndex, sh.get(rowIndex, colIndex));
}
}
mainModel._output._cv_scoring_history[i] = copiedScoringHistory;
Reviewer (Contributor) commented:
does cloning the object not work for this use case? it seems like you are just copying the table

}
}
michalkurka marked this conversation as resolved.

if (!_parms._keep_cross_validation_models) {
int count = Model.deleteAll(cvModKeys);
Log.info(count+" CV models were removed");
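A hypothetical sketch (not from the diff) of how the per-fold scoring histories copied here surface to Python users; the cv_ribbon and cv_lines arguments are assumed from the commit messages above, and the dataset and column names are placeholders.

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")  # placeholder dataset

# With nfolds > 1, each fold model's scoring history is copied into _cv_scoring_history.
gbm = H2OGradientBoostingEstimator(nfolds=5, seed=42)
gbm.train(y="response", training_frame=train)  # placeholder response column

gbm.learning_curve_plot(cv_ribbon=True)  # assumed flag: summarize the CV folds as a mean +/- ribbon
gbm.learning_curve_plot(cv_lines=True)   # assumed flag: one line per fold (renamed from cv_individual_lines)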
@@ -50,6 +50,9 @@ public class ModelOutputSchemaV3<O extends Model.Output, S extends ModelOutputSc
@API(help="Scoring history", direction=API.Direction.OUTPUT, level=API.Level.secondary)
public TwoDimTableV3 scoring_history;

@API(help="Cross-Validation scoring history", direction=API.Direction.OUTPUT, level=API.Level.secondary)
public TwoDimTableV3 cv_scoring_history[];

@API(help="Model reproducibility information", direction=API.Direction.OUTPUT, level=API.Level.secondary)
public TwoDimTableV3[] reproducibility_information_table;

24 changes: 24 additions & 0 deletions h2o-py/h2o/estimators/gam.py
@@ -1078,3 +1078,27 @@ def Lambda(self):
@Lambda.setter
def Lambda(self, value):
self._parms["lambda"] = value

def _additional_used_columns(self, parms):
"""
:return: Gam columns if specified.
"""
return parms["gam_columns"]

def summary(self):
"""Print a detailed summary of the model."""
model = self._model_json["output"]
if "glm_model_summary" in model and model["glm_model_summary"] is not None:
return model["glm_model_summary"]
print("No model summary for this model")

def scoring_history(self):
"""
Retrieve Model Score History.

:returns: The score history as an H2OTwoDimTable or a Pandas DataFrame.
"""
model = self._model_json["output"]
if "glm_scoring_history" in model and model["glm_scoring_history"] is not None:
return model["glm_scoring_history"].as_data_frame()
print("No score history for this model")
4 changes: 3 additions & 1 deletion h2o-py/h2o/explanation/__init__.py
@@ -15,6 +15,7 @@ def _register_dummy_methods():
h2o.model.ModelBase.explain_row = _complain_about_matplotlib
h2o.model.ModelBase.pd_plot = _complain_about_matplotlib
h2o.model.ModelBase.ice_plot = _complain_about_matplotlib
h2o.model.ModelBase.learning_curve_plot = _complain_about_matplotlib

h2o.automl._base.H2OAutoMLBaseMixin.pd_multi_plot = _complain_about_matplotlib
h2o.automl._base.H2OAutoMLBaseMixin.varimp_heatmap = _complain_about_matplotlib
@@ -27,7 +28,7 @@ def _register_dummy_methods():
import numpy
import matplotlib
from ._explain import varimp_heatmap, model_correlation_heatmap, shap_explain_row_plot, shap_summary_plot,\
explain, explain_row, pd_plot, pd_multi_plot, ice_plot, residual_analysis_plot
explain, explain_row, pd_plot, pd_multi_plot, ice_plot, residual_analysis_plot, learning_curve_plot

__all__ = [
"explain",
@@ -52,6 +53,7 @@ def register_explain_methods():
h2o.model.ModelBase.explain_row = explain_row
h2o.model.ModelBase.pd_plot = pd_plot
h2o.model.ModelBase.ice_plot = ice_plot
h2o.model.ModelBase.learning_curve_plot = learning_curve_plot

h2o.automl._base.H2OAutoMLBaseMixin.pd_multi_plot = pd_multi_plot
h2o.automl._base.H2OAutoMLBaseMixin.varimp_heatmap = varimp_heatmap
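Only the registration lines appear in these hunks; as a sketch of the surrounding fallback pattern (the real stub's message and exception type may differ), the idea is that when matplotlib is unavailable the plotting methods are replaced with a stub that raises a clear error at call time:

import h2o

def _complain_about_matplotlib(*args, **kwargs):
    # Stand-in for the stub registered above; the real message/exception may differ.
    raise RuntimeError("matplotlib is required for learning_curve_plot(); please install matplotlib.")

try:
    import matplotlib  # noqa: F401 -- availability check only
    from h2o.explanation._explain import learning_curve_plot
    h2o.model.ModelBase.learning_curve_plot = learning_curve_plot
except ImportError:
    h2o.model.ModelBase.learning_curve_plot = _complain_about_matplotlib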