Improve memory efficiency of H2OMOJOPipelineModel #4557

exalate-issue-sync · 2023-05-22T16:52:08Z

Can we cache loaded MOJO models in memory to avoid duplication if H2OMOJOPipelineModel transformer is instantiated multiple times?

Actions:

Verify how many times the transformer is instantiated when there are n partitions on an executor.
Introduce a local cache (WeakReference) for loaded MOJO models.
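
A minimal sketch of what such a WeakReference-based cache could look like. `WeakModelCache` and `getOrLoad` are hypothetical names used only for illustration and are not part of Sparkling Water; the value type is left generic, so a loaded `MojoPipeline` could be stored the same way.

```scala
import java.lang.ref.WeakReference
import scala.collection.mutable

// Hypothetical helper, not an existing Sparkling Water class.
// Values are held through WeakReference, so the GC may reclaim a model
// once nothing else references it; the next lookup then reloads it.
final class WeakModelCache[K, V <: AnyRef] {
  private val entries = mutable.Map.empty[K, WeakReference[V]]

  def getOrLoad(key: K)(load: => V): V = synchronized {
    // ref.get returns null if the referent has already been collected
    entries.get(key).flatMap(ref => Option(ref.get)) match {
      case Some(model) => model
      case None =>
        val model = load
        entries.update(key, new WeakReference[V](model))
        model
    }
  }
}
```

As long as a caller keeps a strong reference to the returned value, repeated lookups with the same key return that same instance instead of loading a duplicate.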

CC: [~accountid:5c9943ec3a5542225fedb6b9] [~accountid:557058:eeeb611c-665e-431d-b442-1f255171db6f]

exalate-issue-sync · 2023-05-22T16:52:09Z

Nidhi Mehta commented: #94138 (https://support.h2o.ai/a/tickets/94138) - Re: Deploying MOJO on Spark

exalate-issue-sync · 2023-05-22T16:52:11Z

Jakub Hava commented: [~accountid:557058:389d9607-5bd8-4611-8c6a-755fe9295223] We have this code, but after testing it today I can confirm that it does not work as expected: the reader backend is created every time a prediction is made. Will fix. For a start, this is the code in question:
{code:java}
def getOrCreateModel(): MojoPipeline = {
  if (model == null) {
    val reader = MojoPipelineReaderBackendFactory.createReaderBackend(new ByteArrayInputStream(mojoData))
    model = MojoPipeline.loadFrom(reader)
  }
  model
}
{code}

exalate-issue-sync · 2023-05-22T16:52:13Z

Jakub Hava commented:
To test:
{code:java}
import org.apache.spark.ml.h2o.models.H2OMOJOPipelineModel
val mojo = H2OMOJOPipelineModel.createFromMojo("/Users/kuba/devel/repos/sparkling-water/ml/src/test/resources/mojo2data/pipeline.mojo")
val csv = spark.read.option("header", "true").csv("/Users/kuba/devel/repos/sparkling-water/examples/smalldata/prostate/prostate.csv")

0.until(100).foreach { _ =>
  mojo.transform(csv).take(1)
}
{code}

If we put a print statement into getOrCreateModel, we see that the model is being created over and over again.

The first step is to create some sort of registry that is local to the executor and ensures that the MOJO bytes do not have to be serialized and deserialized, and a new instance created, each time.
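
One possible shape for such a registry, as a sketch only: a Scala `object` is initialized once per JVM, so each executor process would hold its own copy, and only a small key (here a model uid string) would need to identify the model. `ModelRegistry`, `getOrCreate`, and the uid key are assumptions made for illustration; the implementation that went into the release may look different.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical registry, not the actual Sparkling Water implementation.
// Being a singleton object, it exists once per JVM (that is, once per executor).
object ModelRegistry {
  private val models = new ConcurrentHashMap[String, AnyRef]()

  // Loads the model at most once per uid; later calls return the cached instance.
  def getOrCreate[M <: AnyRef](uid: String)(load: => M): M = {
    val loader = new JFunction[String, AnyRef] {
      override def apply(key: String): AnyRef = load
    }
    // The cast assumes callers always use the same type M for a given uid.
    models.computeIfAbsent(uid, loader).asInstanceOf[M]
  }

  def size: Int = models.size()
}
```

With something like this, `getOrCreateModel` could delegate to the registry using the transformer's uid, so a freshly deserialized transformer instance would find the already loaded model instead of rebuilding it from the bytes.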

exalate-issue-sync · 2023-05-22T16:52:15Z

Jakub Hava commented: Created a first implementation that avoids serializing the MOJO and creating a new instance for each row.

However, we should investigate why this was happening in the first place. Putting this change into the release so the user can try it as soon as possible.

DinukaH2O · 2023-05-23T11:20:32Z

JIRA Issue Migration Info

Jira Issue: SW-1199
Assignee: Jakub Hava
Reporter: Michal Malohlava
State: Resolved
Fix Version: 2.1.53
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1160

hasithjp · 2023-05-29T14:28:36Z

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-04-12T13:50:18.384-0700

DinukaH2O assigned jakubhava May 23, 2023

DinukaH2O closed this as completed May 23, 2023

DinukaH2O added the fixVersion/2.1.53 label May 23, 2023
