Cluster sizing for binary file formats #7788

exalate-issue-sync · 2023-05-11T18:12:07Z

Users are not sure how to size a cluster for binary file formats (such as Parquet). There is clear guidance for CSV files (5x the file size), but there is no such guidance for binary files.
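For reference, the CSV rule of thumb amounts to a single multiplication. The snippet below is only a minimal sketch of that rule; the function name and the `factor` parameter are illustrative and not part of any H2O API.

```python
GIB = 1024 ** 3


def csv_memory_estimate_bytes(file_size_bytes: int, factor: float = 5.0) -> int:
    """Rule-of-thumb memory estimate for a CSV file: factor times the on-disk size."""
    if file_size_bytes < 0:
        raise ValueError("file_size_bytes must be non-negative")
    return int(file_size_bytes * factor)


# A 2 GiB CSV file would call for roughly 10 GiB of cluster memory under this rule.
print(csv_memory_estimate_bytes(2 * GIB) / GIB)
```

No comparable multiplier exists for Parquet and similar formats, because their on-disk size depends on encoding and compression rather than on the data alone.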

The goal of this Jira is to identify which characteristics of a file can be used to approximate its final size in H2O memory.

Ideas:

dimensions (#rows, #columns)
sparsity
number of numerical columns, categorical, string
data size on disk
…

The given set of “features” should be easy for the user to provide (the input shouldn’t be too complex).
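To make the idea of a small, user-friendly input concrete, the candidate features listed above could be collected in one simple record. This is a hypothetical sketch: the class name, field names, and the definition of sparsity (fraction of zero or missing cells) are assumptions made for illustration, not an agreed interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetFeatures:
    """Hypothetical input record for a memory estimator."""
    n_rows: int
    n_numeric_cols: int
    n_categorical_cols: int
    n_string_cols: int
    sparsity: float            # assumed meaning: fraction of zero or missing cells
    size_on_disk_bytes: int

    def __post_init__(self) -> None:
        counts = (self.n_rows, self.n_numeric_cols, self.n_categorical_cols,
                  self.n_string_cols, self.size_on_disk_bytes)
        if any(c < 0 for c in counts):
            raise ValueError("counts and sizes must be non-negative")
        if not 0.0 <= self.sparsity <= 1.0:
            raise ValueError("sparsity must be between 0.0 and 1.0")

    @property
    def n_cols(self) -> int:
        """Total number of columns across all column types."""
        return self.n_numeric_cols + self.n_categorical_cols + self.n_string_cols


features = DatasetFeatures(n_rows=1_000_000, n_numeric_cols=5, n_categorical_cols=3,
                           n_string_cols=2, sparsity=0.1, size_on_disk_bytes=4096)
print(features.n_cols)
```

Every field here is something a user can read from file metadata or a quick schema inspection, which is the constraint the description asks for.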

The output of this work can either be a rule or a model representation (MOJO) of the rule.

exalate-issue-sync · 2023-05-11T18:12:09Z

Michal Kurka commented: Based on an internal discussion on Slack: https://h2oai.slack.com/archives/C03HXQSLW/p1603374166470900

exalate-issue-sync · 2023-05-11T18:12:11Z

Tomas Pastorek commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] Just to specify the Steam interface.

Given the input parameters (dimensions, sparsity, ...), the output should be the following set of values:

Number of Nodes: Specify the number of nodes of the H2O cluster.
Memory per Node [GB]: Specify the amount of memory allocated to a single node of the H2O cluster.
Extra Memory [%]: Specify the extra memory allocated to a single node as a percentage of memory per node. Algorithms like XGBoost use this additional memory, and you may need to increase this value if you are unable to build XGBoost models.
H2O Threads per Node: Specify the number of threads (CPUs) to use per node.
YARN Virtual Cores per Node: Specify the number of YARN virtual cores per node that will be requested from the YARN resource scheduler.

→ All of these are integer parameters. Maybe the last two are not needed?

exalate-issue-sync · 2023-05-11T18:12:12Z

Michal Kurka commented: @Tomas Pastorek I think the output should only be “memory required in H2O” - this is where I would draw the line for the responsibility of this tool.

Steam is aware of the profiles and capabilities of the cluster. It should take the output of the tool (the total amount of memory) and translate it into the number of nodes and the memory per node using the cluster profiles.

This design makes the tool usable across H2O versions.

WDYT?
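The split of responsibilities proposed in this comment can be sketched as follows: the sizing tool returns one number (total memory required), and a separate step on the Steam side turns it into a node count and memory per node. Everything in the snippet is an assumption made for illustration; `ClusterProfile`, its fields, and `plan_cluster` are hypothetical names and do not reflect Steam's actual profile model.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical stand-in for a Steam cluster profile."""
    max_nodes: int
    max_memory_per_node_gb: int


def plan_cluster(required_memory_gb: float, profile: ClusterProfile) -> Tuple[int, int]:
    """Return (number_of_nodes, memory_per_node_gb) covering the required memory."""
    if required_memory_gb <= 0:
        raise ValueError("required_memory_gb must be positive")
    # Use as few nodes as the per-node memory limit allows.
    nodes = math.ceil(required_memory_gb / profile.max_memory_per_node_gb)
    if nodes > profile.max_nodes:
        raise ValueError("profile cannot provide the required memory")
    # Spread the requirement evenly and round up to whole gigabytes.
    memory_per_node_gb = math.ceil(required_memory_gb / nodes)
    return nodes, memory_per_node_gb


profile = ClusterProfile(max_nodes=10, max_memory_per_node_gb=32)
print(plan_cluster(100, profile))
```

Because the estimator only reports total memory, it does not need to know anything about profiles, which is what keeps it reusable across H2O versions.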

exalate-issue-sync · 2023-05-11T18:12:14Z

Tomas Pastorek commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] makes sense, this is probably better

exalate-issue-sync · 2023-05-11T18:12:16Z

Ondrej Nekola commented: The current proposal: keep the file size parameter for estimates of `.csv` files, and add `nrows` and `ncols`.
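A rough sketch of what an estimator with these parameters might look like is shown below. It is not the implementation from the linked PR: the function name is hypothetical, and the 8 bytes per cell figure is an assumption that treats every cell as an uncompressed 64-bit value, which ignores H2O's in-memory compression and the different cost of string columns.

```python
from typing import Optional

CSV_FACTOR = 5       # rule of thumb from the issue description
BYTES_PER_CELL = 8   # assumption: each cell held as an uncompressed 64-bit value


def estimate_memory_bytes(file_size_bytes: Optional[int] = None,
                          nrows: Optional[int] = None,
                          ncols: Optional[int] = None) -> int:
    """Estimate memory from dimensions if given, otherwise from CSV file size."""
    if nrows is not None and ncols is not None:
        # Binary formats: size on disk is unreliable, so use the dimensions.
        return nrows * ncols * BYTES_PER_CELL
    if file_size_bytes is not None:
        # CSV files: fall back to the file-size rule of thumb.
        return file_size_bytes * CSV_FACTOR
    raise ValueError("provide either file_size_bytes or both nrows and ncols")


print(estimate_memory_bytes(nrows=1_000_000, ncols=50))
print(estimate_memory_bytes(file_size_bytes=200_000_000))
```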

h2o-ops · 2023-05-14T20:53:00Z

JIRA Issue Migration Info

Jira Issue: PUBDEV-7854
Assignee: Ondrej Nekola
Reporter: Michal Kurka
State: Resolved
Fix Version: 3.32.1.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#5123

h2o-ops closed this as completed May 14, 2023

h2o-ops added the fixVersion/3.32.1.1 label May 14, 2023
