Problem

The code at sparkling-water/scoring/src/main/scala/ai/h2o/sparkling/ml/models/H2OMOJOPipelineModel.scala (lines 36 to 39 in ea5b11a) loads the MOJO pipeline from the raw bytes. However, this happens once per thread running in the executor (i.e., per thread representing an executor core), which adds significant memory and time overhead for larger MOJO models.

Goal

Load the MOJO model only once per JVM and share it across multiple executor threads. If we decide to cache the MOJO, we have to make sure we do not keep it in memory for too long, and we should also expect that a single Spark job can use multiple MOJOs.

Michal Malohlava commented: Should we broadcast the model? (similar to the idea sketched here: https://stackoverflow.com/questions/40435741/object-cache-on-spark-executors)

Jakub Hava commented: We already broadcast the MOJO bytes → the Spark driver registers them as a broadcast variable, and the executors fetch them when they need them. We could also try creating an instance of the MOJO model on the driver and broadcasting that instead, but I am not sure whether we would hit serialization issues. It would definitely be a good improvement, though. I can have a look at it today and at least check whether the model can be serialized or not.
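The goal (one MOJO instance per JVM, shared across executor threads) combined with the broadcast approach from the comments can be sketched as a JVM-level cache keyed by model UID. This is a minimal illustration under stated assumptions, not the actual Sparkling Water implementation: `MojoPipeline`, `MojoCache`, `SharedMojo`, and `fetchBytes` are hypothetical names, and in practice the bytes would come from the broadcast variable mentioned in the comments.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the real MOJO pipeline runtime class; in Sparkling
// Water this would be the pipeline that H2OMOJOPipelineModel loads from bytes.
final class MojoPipeline(val bytes: Array[Byte])

// JVM-wide cache: one pipeline instance per model UID, shared by every task
// thread in the executor JVM. computeIfAbsent runs the loader at most once
// per key, even under concurrent access from multiple scoring threads.
object MojoCache {
  private val cache = new ConcurrentHashMap[String, MojoPipeline]()

  // `bytes` is by-name, so the (potentially large) MOJO bytes are only
  // materialized -- e.g. fetched from a broadcast variable -- on a cache miss.
  def getOrLoad(uid: String, bytes: => Array[Byte]): MojoPipeline =
    cache.computeIfAbsent(uid, _ => new MojoPipeline(bytes))

  // Eviction hook, so a long-running executor does not pin every MOJO it has
  // ever scored with (a single Spark job may legitimately use several MOJOs).
  def evict(uid: String): Unit = cache.remove(uid)
}

// Serializable handle shipped inside Spark task closures: only the UID and a
// serializable byte-fetching function travel with the task; the pipeline
// itself is resolved lazily, through the shared cache, on the executor JVM.
final class SharedMojo(uid: String, fetchBytes: () => Array[Byte]) extends Serializable {
  @transient lazy val pipeline: MojoPipeline = MojoCache.getOrLoad(uid, fetchBytes())
}
```

The `@transient lazy val` mirrors the pattern from the linked Stack Overflow discussion: the field is excluded from task serialization and is re-resolved, via the shared per-JVM cache, the first time any task thread touches it on each executor.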