Add Rule of Thumb for Data Conversion #5509

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 2 comments

@exalate-issue-sync

We should add the following Rule of Thumb to the Data Sharing section of the Sparkling Water documentation:
http://docs.h2o.ai/sparkling-water/2.3/latest-stable/doc/design/data_sharing.html

h3. Memory Considerations When Converting Between Data Frame Types

When Using Sparkling Water External Backend:

If you have allocated the recommended amount of memory to your H2O cluster (4x the size of your dataset), you don't need to worry about memory constraints when converting between a Spark DataFrame and an H2OFrame. The H2O cluster runs separately from Spark, so the conversion does not compete with Spark storage memory.

Note: the 4x figure assumes your dataset is stored as CSV. If your dataset is stored as JSON, XML, or Parquet, the requirements may differ significantly.
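
As a rough illustration of the external backend rule above, the sketch below computes a suggested H2O cluster memory size from the size of a CSV dataset. The function name and the 10 GB example are hypothetical; only the 4x multiplier comes from the rule itself, and it applies to CSV input only.

```python
def external_backend_h2o_memory_gb(csv_size_gb: float, multiplier: float = 4.0) -> float:
    """Estimate total H2O cluster memory (GB) for the external backend.

    Assumes the dataset size is measured as CSV; other formats
    (JSON, XML, Parquet) may need a different multiplier.
    """
    if csv_size_gb < 0:
        raise ValueError("csv_size_gb must be non-negative")
    return multiplier * csv_size_gb


# Hypothetical example: a 10 GB CSV dataset.
print(external_backend_h2o_memory_gb(10.0))  # 40.0
```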

When Using Sparkling Water Internal Backend:

In internal backend mode, H2O-3 shares the JVM with the Spark executors. In this case, allocate enough memory to run Spark transformations on your DataFrame (at minimum, the size of your dataset plus working memory for those transformations), plus an additional 4x the size of your dataset for H2O.

Note: data is duplicated when you convert a Spark DataFrame to an H2OFrame (though H2O uses compression tricks to help reduce the memory requirements for this conversion). Data is not duplicated when you convert an H2OFrame to a Spark DataFrame, because Sparkling Water uses a wrapper around the H2OFrame that exposes the RDD/DataFrame API.
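
The internal backend rule above can be sketched as simple arithmetic. This is only an illustration: the function name, the example sizes, and the `transform_overhead_gb` parameter are hypothetical, since the rule does not state how much working memory Spark transformations need. That value has to be estimated for your own workload.

```python
def internal_backend_memory_gb(
    dataset_gb: float,
    transform_overhead_gb: float,
    h2o_multiplier: float = 4.0,
) -> float:
    """Estimate total executor memory (GB) for the internal backend.

    Spark side: the dataset itself plus working memory for transformations
    (transform_overhead_gb is a workload-specific assumption).
    H2O side: h2o_multiplier times the dataset size (4x assumes CSV).
    """
    if dataset_gb < 0 or transform_overhead_gb < 0:
        raise ValueError("sizes must be non-negative")
    spark_gb = dataset_gb + transform_overhead_gb
    h2o_gb = h2o_multiplier * dataset_gb
    return spark_gb + h2o_gb


# Hypothetical example: 10 GB dataset, 5 GB assumed for transformations.
print(internal_backend_memory_gb(10.0, 5.0))  # 55.0
```

The result is a total across all executors; how it is split per executor depends on your cluster configuration.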

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1581
Assignee: Jakub Hava
Reporter: Lauren DiPerna
State: Resolved
Fix Version: 3.26.5
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1517

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-08-29T12:20:26.521-0700
