Issue + logs described here: [https://support.h2o.ai/a/tickets/101619|https://support.h2o.ai/a/tickets/101619|smart-link]
{noformat}java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at hex.rulefit.Rule.generateLanguageRule(Rule.java:37)
at hex.rulefit.Rule.<init>(Rule.java:25)
at hex.rulefit.Rule.traverseNodes(Rule.java:90)
at hex.rulefit.Rule.traverseNodes(Rule.java:100)
at hex.rulefit.Rule.traverseNodes(Rule.java:98)
at hex.rulefit.Rule.traverseNodes(Rule.java:100)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:108)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:100)
at hex.rulefit.Rule.traverseNodes(Rule.java:105)
at hex.rulefit.Rule.traverseNodes(Rule.java:100)
at hex.rulefit.Rule.extractRulesFromTree(Rule.java:79)
at hex.rulefit.Rule.extractRulesListFromModel(Rule.java:70)
at hex.rulefit.RuleFit$RuleFitDriver.computeImpl(RuleFit.java:201)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:242)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1577){noformat}
Based on the call, the issue occurs only when high-cardinality categorical variables are used; otherwise training works. Here one column has roughly 64K distinct values and a second has roughly 10K.
High cardinality can produce rules that are very long, so long that they are barely readable for a human.
Currently we pre-calculate the language representation of every rule, which wastes resources in such a case and leads to the memory issue.
->
The rule language representation should be calculated only on demand, not pre-calculated.
A first fix could be to pre-calculate language representations only for rules that have a non-zero coefficient, i.e. those present in the {{model._model_json["output"]['rule_importance']}} table. This could help dramatically.
The proper fix is to not pre-calculate rule language representations at all: modify the {{model.rule_importance()}} method so that it returns the current content (with language representations calculated on demand), while {{model._model_json["output"]['rule_importance']}} no longer contains language representations.
Since rules produced on such high-cardinality data are not readable or explainable at all, EnumLimited categorical encoding should be introduced. A parameter controlling the number of most frequent categories taken into account could be exposed.
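The on-demand idea above can be illustrated with a minimal Python sketch. It is not the {{hex.rulefit.Rule}} implementation: the {{Condition}} and {{Rule}} classes, their fields, the {{language_rule}} property and the {{rule_importance}} helper are hypothetical names chosen for illustration. Each rule keeps only its structured conditions, and the potentially huge string is built (and cached) the first time it is requested, which in this sketch happens only for rules with a non-zero coefficient.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import List, Tuple


@dataclass(frozen=True)
class Condition:
    # Hypothetical stand-in for one categorical split of a tree path.
    feature: str
    categories: Tuple[str, ...]

    def to_language(self) -> str:
        # With ~64K categories this string becomes very large.
        return "(" + self.feature + " in {" + ", ".join(self.categories) + "})"


@dataclass
class Rule:
    conditions: List[Condition]
    coefficient: float = 0.0

    @cached_property
    def language_rule(self) -> str:
        # Built lazily on first access, then cached on the instance.
        return " & ".join(c.to_language() for c in self.conditions)


def rule_importance(rules: List[Rule]) -> List[Tuple[str, float]]:
    # Only rules with a non-zero coefficient are rendered;
    # the others never allocate their language representation.
    return [(r.language_rule, r.coefficient) for r in rules if r.coefficient != 0.0]
```

The design choice here is to store the structured conditions as the source of truth and treat the text as a derived view, so memory use during rule extraction no longer depends on the length of the rendered rules.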
Until these improvements are incorporated into h2o, it could help to preprocess the data in an EnumLimited-like fashion: keep only a certain number of the most frequent categories and leave the other categories empty.
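Such preprocessing could look like the following sketch. It works on a plain Python list using only the standard library and does not call any H2O API; the function name {{limit_categories}}, the {{top_n}} parameter and the use of {{None}} to represent an empty value are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence


def limit_categories(values: Sequence[Optional[str]], top_n: int) -> List[Optional[str]]:
    """Keep the top_n most frequent categories; replace all others with None."""
    # Count only non-missing values.
    counts = Counter(v for v in values if v is not None)
    # Set of categories that survive the cut.
    keep = {category for category, _ in counts.most_common(top_n)}
    # Preserve row order; rare categories and missing values become None.
    return [v if v in keep else None for v in values]


# Example: apply to one high-cardinality column before training.
column = ["a", "b", "a", "c", "a", "b"]
limited = limit_categories(column, top_n=2)
```

The same function would be applied separately to each high-cardinality column, with {{top_n}} chosen small enough that the resulting rules stay readable.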