Cluster sizing for binary file formats #7788

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 6 comments

Comments

@exalate-issue-sync

Users are not sure how to size a cluster for binary files (such as Parquet). There is clear guidance for CSV files (5x the file size), but no such guidance for binary files.

The goal of this Jira is to identify which characteristics of a file can be used to approximate its final size in H2O memory.

Ideas:

  • dimensions (#rows, #columns)
  • sparsity
  • number of numerical columns, categorical, string
  • data size on disk

The given set of “features” should be easy to provide by the user (the input shouldn’t be too complex).

The output of this work can either be a rule or a model representation (MOJO) of the rule.
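
One way to picture such a rule is as a small function over the features listed above. The sketch below is only an illustration: the `DatasetFeatures` record, the per-cell byte weights, and the way sparsity discounts the total are all placeholder assumptions, not measured H2O behavior and not anything decided in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetFeatures:
    """User-supplied description of a dataset (hypothetical record)."""
    nrows: int
    numeric_cols: int = 0
    categorical_cols: int = 0
    string_cols: int = 0
    sparsity: float = 0.0  # fraction of empty/zero cells, in [0, 1]


# Placeholder per-cell costs in bytes. These are assumptions for illustration,
# not measured values; a real rule would have to be fitted to observations.
BYTES_PER_CELL = {"numeric": 8, "categorical": 4, "string": 32}


def estimate_bytes(f: DatasetFeatures) -> int:
    """Crude size estimate: weighted cell count, discounted by sparsity."""
    if f.nrows < 0 or not 0.0 <= f.sparsity <= 1.0:
        raise ValueError("nrows must be >= 0 and sparsity must be in [0, 1]")
    bytes_per_row = (
        f.numeric_cols * BYTES_PER_CELL["numeric"]
        + f.categorical_cols * BYTES_PER_CELL["categorical"]
        + f.string_cols * BYTES_PER_CELL["string"]
    )
    dense = f.nrows * bytes_per_row
    return int(dense * (1.0 - f.sparsity))
```

Keeping the weights in one table makes it easy to swap the placeholders for fitted values, or to replace the whole function with a trained model, without changing what the user has to provide.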

@exalate-issue-sync
Author

Michal Kurka commented: Based on internal discussion on Slack: https://h2oai.slack.com/archives/C03HXQSLW/p1603374166470900

@exalate-issue-sync
Author

Tomas Pastorek commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] Just to specify Steam's interface.

Given the input parameters (dimensions, sparsity, ..), the output should be the following set of values:

  • Number of Nodes: Specify the number of nodes of the H2O cluster.
  • Memory per Node [GB]: Specify the amount of memory allocated to a single node of the H2O cluster.
  • Extra Memory [%]: Specify the extra memory allocated to a single node as a percentage of memory per node. Algorithms like XGBoost use this additional memory, and you may need to increase this value if you are unable to build XGBoost models.
  • H2O Threads per Node: Specify the number of threads (CPUs) to use per node.
  • YARN Virtual Cores per Node: Specify the number of YARN virtual cores per node that will be requested from the YARN resource scheduler.

→ All are integer parameters. Maybe the last two are not needed?

@exalate-issue-sync
Author

Michal Kurka commented: @Tomas Pastorek I think the output should only be “memory required in H2O”; this is where I would draw the line for the responsibility of this tool.

Steam is aware of the profiles and capabilities of the cluster. It should take the output of the tool (the total amount of memory) and translate it into the number of nodes and the memory per node using the cluster profiles.

This design makes the tool usable across H2O versions.

WDYT?
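
The split of responsibilities described above could look roughly like the following on the Steam side. This is a minimal sketch under stated assumptions: `ClusterProfile` and `plan_cluster` are hypothetical names, and the profile fields are not Steam's actual schema. The only input taken from the tool is the total memory required.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusterProfile:
    """Limits of a cluster profile (hypothetical shape, not Steam's schema)."""
    max_nodes: int
    max_memory_per_node_gb: int


def plan_cluster(required_gb: float, profile: ClusterProfile) -> Tuple[int, int]:
    """Return (number_of_nodes, memory_per_node_gb) covering required_gb."""
    if required_gb <= 0:
        raise ValueError("required_gb must be positive")
    # Fewest nodes that can hold the requirement at the per-node limit.
    nodes = math.ceil(required_gb / profile.max_memory_per_node_gb)
    if nodes > profile.max_nodes:
        raise ValueError("requirement exceeds what this profile allows")
    # Spread the requirement evenly, rounding up to whole gigabytes.
    memory_per_node_gb = math.ceil(required_gb / nodes)
    return nodes, memory_per_node_gb
```

For example, `plan_cluster(100, ClusterProfile(max_nodes=10, max_memory_per_node_gb=32))` yields 4 nodes with 25 GB each. Settings such as extra memory or thread counts would be layered on by the caller, since the estimator itself knows nothing about them.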

@exalate-issue-sync
Author

Tomas Pastorek commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] makes sense, this is probably better

@exalate-issue-sync
Author

Ondrej Nekola commented: The current proposal: keep the file size parameter for estimates of `.csv` files, and add `nrows` and `ncols`.
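
To make the shape of that proposal concrete, here is a hedged sketch of an estimator that accepts either a file size or the two dimensions. The function name, the precedence between the inputs, and the 8-bytes-per-cell figure are assumptions made for illustration; only the 5x CSV factor comes from this issue.

```python
from typing import Optional

CSV_EXPANSION_FACTOR = 5   # the "5x the size" CSV guidance quoted in this issue
DENSE_BYTES_PER_CELL = 8   # assumption: every cell costs one 8-byte double


def required_memory_bytes(
    file_size_bytes: Optional[int] = None,
    nrows: Optional[int] = None,
    ncols: Optional[int] = None,
) -> int:
    """Estimate memory from dimensions if given, else from CSV file size."""
    if nrows is not None and ncols is not None:
        # Dimension-based path, intended for binary formats such as Parquet.
        return nrows * ncols * DENSE_BYTES_PER_CELL
    if file_size_bytes is not None:
        # File-size path, meaningful only for CSV input.
        return file_size_bytes * CSV_EXPANSION_FACTOR
    raise ValueError("provide either file_size_bytes or both nrows and ncols")
```

Treating every cell as a dense double ignores column types and sparsity, so the dimension-based number is best read as a rough ceiling for numeric data rather than a prediction.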

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-7854
Assignee: Ondrej Nekola
Reporter: Michal Kurka
State: Resolved
Fix Version: 3.32.1.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#5123
