Improve memory efficiency of H2OMOJOPipelineModel #4557

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 6 comments

@exalate-issue-sync

Can we cache loaded MOJO models in memory to avoid duplication when the H2OMOJOPipelineModel transformer is instantiated multiple times?

Actions:

  1. Verify how many times the transformer is instantiated when there are n partitions on an executor.
  2. Introduce a local cache (WeakReference) for loaded MOJO models.
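
The second action could look roughly like the sketch below. It is only an illustration under stated assumptions: `WeakCache` and `getOrLoad` are hypothetical names, not Sparkling Water APIs, and a plain byte array stands in for a loaded MOJO model. Values are held through `java.lang.ref.WeakReference`, so the garbage collector can reclaim a model once nothing else references it.

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Hypothetical helper (not part of Sparkling Water): caches one value per key
// while holding it only weakly, so an unused value can be garbage collected.
final class WeakCache[K, V <: AnyRef] {
  private val entries = mutable.Map.empty[K, WeakReference[V]]

  def getOrLoad(key: K)(load: => V): V = synchronized {
    // WeakReference.get returns null once the referent has been collected,
    // so Option(...) turns a cleared reference into None.
    entries.get(key).flatMap(ref => Option(ref.get)) match {
      case Some(value) => value
      case None =>
        val value = load
        entries.update(key, new WeakReference(value))
        value
    }
  }
}

// Usage with a stand-in for an expensive model load.
val cache = new WeakCache[String, Array[Byte]]
var loads = 0
val first = cache.getOrLoad("pipeline.mojo") { loads += 1; new Array[Byte](4) }
val second = cache.getOrLoad("pipeline.mojo") { loads += 1; new Array[Byte](4) }
```

Because `first` is still strongly referenced, the second call reuses it instead of running the loader again. The trade-off of weak references is that a model may be reloaded if it is collected between uses.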

CC: [~accountid:5c9943ec3a5542225fedb6b9] [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f]

@exalate-issue-sync
Author

Nidhi Mehta commented: #94138 (https://support.h2o.ai/a/tickets/94138) - Re: Deploying MOJO on Spark

@exalate-issue-sync
Author

Jakub Hava commented: [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] We have this code, but after testing it today I can confirm that it does not work as expected: the reader backend is created every time a prediction is made. Will fix. For a start, this is the current code:
{code:java}
def getOrCreateModel(): MojoPipeline = {
  // Load the pipeline lazily and reuse the instance on later calls
  if (model == null) {
    val reader = MojoPipelineReaderBackendFactory.createReaderBackend(new ByteArrayInputStream(mojoData))
    model = MojoPipeline.loadFrom(reader)
  }
  model
}
{code}

@exalate-issue-sync
Author

Jakub Hava commented:
To test:
{code:java}
import org.apache.spark.ml.h2o.models.H2OMOJOPipelineModel

val mojo = H2OMOJOPipelineModel.createFromMojo("/Users/kuba/devel/repos/sparkling-water/ml/src/test/resources/mojo2data/pipeline.mojo")
val csv = spark.read.option("header", "true").csv("/Users/kuba/devel/repos/sparkling-water/examples/smalldata/prostate/prostate.csv")

// Run the same transformation repeatedly to see how often the model is created
0.until(100).foreach { _ =>
  mojo.transform(csv).take(1)
}
{code}

If we put a print statement into getOrCreateModel, we see that the model is created over and over again.

The first step is to create some sort of registry that is local to the executor and ensures that the MOJO bytes do not have to be serialized and deserialized, and a new instance created, each time.
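
A minimal sketch of such a registry follows; it is not the implementation that was released. `ModelRegistry`, `getOrCreate`, and `expensiveLoad` are hypothetical names, and a string stands in for the loaded pipeline. The sketch relies on a Scala `object` being initialized once per JVM, and on each Spark executor running in its own JVM. It assumes Scala 2.12 or newer, where a lambda can be passed as a Java functional interface.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical registry (names are illustrative, not Sparkling Water APIs).
// A Scala object exists once per JVM, so every task running on the same
// executor sees the same map.
object ModelRegistry {
  private val models = new ConcurrentHashMap[String, AnyRef]()

  // Returns the instance registered under uid, creating it at most once.
  def getOrCreate[T <: AnyRef](uid: String)(create: => T): T =
    models.computeIfAbsent(uid, _ => create).asInstanceOf[T]
}

// Stand-in for reading the MOJO bytes and building the pipeline.
var created = 0
def expensiveLoad(): String = { created += 1; new String("model") }

// Mirrors the test loop above: repeated lookups should reuse one instance.
val models = (1 to 100).map(_ => ModelRegistry.getOrCreate("mojo-1")(expensiveLoad()))
```

With this shape the transformer only needs to carry a small key such as its uid, and `computeIfAbsent` ensures the loader runs at most once per key even when several tasks ask concurrently.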

@exalate-issue-sync
Author

Jakub Hava commented: Created a first implementation that avoids serializing the MOJO and creating a new instance for each row.

We should, however, investigate why this was happening in the first place. I am putting this change into the release so the user can try it as soon as possible.

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1199
Assignee: Jakub Hava
Reporter: Michal Malohlava
State: Resolved
Fix Version: 2.1.53
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1160

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-04-12T13:50:18.384-0700
