Cluster sizing for binary file formats #7788
Comments
Michal Kurka commented: Based on internal discussion on Slack: https://h2oai.slack.com/archives/C03HXQSLW/p1603374166470900
Tomas Pastorek commented: Just to specify Steam's interface: given the input parameters (dimensions, sparsity, ...), the output should be a set of the following values:
→ all integer parameters. Maybe the last two are not needed?
Michal Kurka commented: @Tomas Pastorek I think the output should only be “memory required in H2O”; this is where I would draw the line for the responsibility of this tool. Steam is aware of the profiles and capabilities of the cluster. It should take the output of the tool (the total amount of memory) and translate it to the number of nodes and the memory per node using the cluster profiles. This design makes the tool usable across H2O versions. WDYT?
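To illustrate the split of responsibilities described in this comment, here is a minimal sketch of the Steam-side translation step. It assumes a cluster profile can be reduced to a single usable-memory-per-node figure; the function name `nodes_for_memory` and its parameters are hypothetical and are not part of Steam or H2O.

```python
import math


def nodes_for_memory(required_gb: float, memory_per_node_gb: float) -> int:
    """Return the smallest node count whose combined memory covers required_gb.

    Hypothetical helper: 'memory_per_node_gb' stands in for whatever a
    cluster profile exposes; real profiles may carry more constraints.
    """
    if required_gb <= 0 or memory_per_node_gb <= 0:
        raise ValueError("both memory values must be positive")
    # Round up so the cluster is never undersized.
    return math.ceil(required_gb / memory_per_node_gb)


if __name__ == "__main__":
    # The sizing tool reports 100 GB; the profile allows 32 GB per node.
    print(nodes_for_memory(100, 32))
```

Keeping this step outside the sizing tool means the tool only has to report one number, which is the point made above about reuse across H2O versions.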
Tomas Pastorek commented: Makes sense, this is probably better.
Ondrej Nekola commented: The current proposal: keep the file size parameter for estimates of `.csv` files, and add `nrows` and `ncols`.
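A rough sketch of what such an interface could look like follows. The CSV branch uses the 5x rule quoted in the issue description. The `nrows`/`ncols` branch is purely an assumption for illustration: it treats every cell as an uncompressed 8-byte value, which is likely a pessimistic bound rather than the rule this issue is trying to find. The function and constant names are hypothetical.

```python
from typing import Optional

CSV_FACTOR = 5       # guidance quoted in the issue: 5x the CSV file size
BYTES_PER_CELL = 8   # ASSUMPTION: dense, uncompressed 8-byte cells


def estimate_h2o_memory_bytes(
    file_size_bytes: Optional[int] = None,
    nrows: Optional[int] = None,
    ncols: Optional[int] = None,
) -> int:
    """Estimate the memory needed in H2O, in bytes.

    Prefers the dimension-based estimate when both nrows and ncols are
    given; otherwise falls back to the CSV file-size rule.
    """
    if nrows is not None and ncols is not None:
        if nrows < 0 or ncols < 0:
            raise ValueError("nrows and ncols must be non-negative")
        return nrows * ncols * BYTES_PER_CELL
    if file_size_bytes is not None:
        if file_size_bytes < 0:
            raise ValueError("file_size_bytes must be non-negative")
        return CSV_FACTOR * file_size_bytes
    raise ValueError("provide file_size_bytes, or both nrows and ncols")


if __name__ == "__main__":
    print(estimate_h2o_memory_bytes(file_size_bytes=2 * 1024**3))  # CSV rule
    print(estimate_h2o_memory_bytes(nrows=1_000_000, ncols=100))   # dimensions
```

Inputs such as sparsity or column types, mentioned earlier in the thread, are not modeled here.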
JIRA Issue Migration Info: Jira Issue: PUBDEV-7854. Linked PRs from JIRA
Users are not sure how to size a cluster for binary files (like Parquet). There is clear guidance for CSV files (5x the file size), but there is no such guidance for binary files.
The goal of this Jira is to identify which characteristics of the file can be used to approximate its final size in H2O memory.
Ideas:
The set of “features” should be easy for the user to provide (the input shouldn’t be too complex).
The output of this work can either be a rule or a model representation (MOJO) of the rule.