Rulefit model estimator memory issue #7060

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 3 comments

Comments

@exalate-issue-sync

Issue and logs are described here: https://support.h2o.ai/a/tickets/101619

{noformat}java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at hex.rulefit.Rule.generateLanguageRule(Rule.java:37)
        at hex.rulefit.Rule.<init>(Rule.java:25)
        at hex.rulefit.Rule.traverseNodes(Rule.java:90)
        at hex.rulefit.Rule.traverseNodes(Rule.java:100)
        at hex.rulefit.Rule.traverseNodes(Rule.java:98)
        at hex.rulefit.Rule.traverseNodes(Rule.java:100)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:108)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:100)
        at hex.rulefit.Rule.traverseNodes(Rule.java:105)
        at hex.rulefit.Rule.traverseNodes(Rule.java:100)
        at hex.rulefit.Rule.extractRulesFromTree(Rule.java:79)
        at hex.rulefit.Rule.extractRulesListFromModel(Rule.java:70)
        at hex.rulefit.RuleFit$RuleFitDriver.computeImpl(RuleFit.java:201)
        at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:242)
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1577){noformat}

Based on a call, the issue occurs only when high-cardinality categorical variables are used; otherwise training is fine. Here there were two such variables: one with roughly 64K distinct values and a second with roughly 10K distinct values.

High cardinality can result in rules that are VERY long, so long that they are barely readable for a human.

Currently we pre-calculate the language representation of every rule. In such a case that is a big waste of resources, and it is what leads to the memory issue.

->

Rule language representations should be calculated only “on demand”, not pre-calculated.

A first fix could be to pre-calculate representations only for rules that have a non-zero coefficient, i.e. the rules that end up in the {{model._model_json["output"]['rule_importance']}} table. This could help dramatically.

The proper fix is to not pre-calculate rule language representations at all → modify the {{model.rule_importance()}} method so that it returns the same content as today (with language representations calculated on demand), while {{model._model_json["output"]['rule_importance']}} no longer contains language representations.
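To illustrate the on-demand idea, here is a minimal Python sketch. It is not the actual H2O code (the real fix lives in the Java `hex.rulefit.Rule` class); `Condition`, `Rule`, `language_rule` and the rule text format are hypothetical names and choices made up for this example. The text form of a rule is built only on first access, and only rules with a non-zero coefficient are ever asked for it.

```python
from functools import cached_property


class Condition:
    """One split on the path to a leaf (hypothetical structure)."""

    def __init__(self, feature, categories):
        self.feature = feature
        self.categories = tuple(categories)


class Rule:
    """Hypothetical rule that builds its text form lazily."""

    def __init__(self, conditions, coefficient):
        self.conditions = list(conditions)
        self.coefficient = coefficient

    @cached_property
    def language_rule(self):
        # Built on first access only, then cached on the instance.
        parts = []
        for c in self.conditions:
            parts.append("(" + c.feature + " in {" + ", ".join(c.categories) + "})")
        return " & ".join(parts)


def rule_importance(rules):
    """Return (text, coefficient) rows for rules with a non-zero coefficient."""
    return [(r.language_rule, r.coefficient) for r in rules if r.coefficient != 0.0]
```

With this shape, a rule whose coefficient is zero never materializes its (possibly huge) string at all.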

In addition, since rules produced on such high-cardinality data are not readable/explainable at all, EnumLimited categorical encoding should be introduced. A parameter controlling the number of most frequent categories taken into account could be exposed.

Until these improvements are incorporated into h2o, it could help to preprocess the data in an enum-limited fashion: keep only a certain number of the most frequent categories and leave the other categories empty.
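The preprocessing workaround could look roughly like the sketch below. It works on a plain Python list of column values and uses `None` as the "empty" value; the helper name `limit_categories` and the `top_n` parameter are made up for this example and are not part of h2o. The same logic would be applied to each high-cardinality column before the data is loaded into H2O.

```python
from collections import Counter


def limit_categories(values, top_n):
    """Keep the top_n most frequent categories; replace all others with None."""
    # Count only non-missing values so that None never becomes a "category".
    counts = Counter(v for v in values if v is not None)
    keep = {category for category, _ in counts.most_common(top_n)}
    return [v if v in keep else None for v in values]


column = ["a", "a", "a", "b", "b", "c"]
limited = limit_categories(column, top_n=2)
```

How many categories to keep is a judgment call: fewer categories give shorter, more readable rules, at the cost of discarding information carried by the rare levels.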

@exalate-issue-sync
Author

Zuzana Olajcová commented: Resolved in https://github.com//pull/6191

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8616
Assignee: Zuzana Olajcová
Reporter: Zuzana Olajcová
State: Resolved
Fix Version: 3.36.1.3
Attachments: N/A
Development PRs: Available

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#6191

@h2o-ops h2o-ops closed this as completed May 14, 2023