Parser fails if uploading more than 331186 columns #6527

Open
exalate-issue-sync bot opened this issue Feb 21, 2023 · 4 comments
Labels: bug, Major, Parser

Comments

@exalate-issue-sync

When writing tests for https://h2oai.atlassian.net/browse/PUBDEV-8876, I encountered this issue:

{noformat}
~/repos/h2o/h2o-3/h2o-py/h2o/h2o.py in parse_setup(raw_frames, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, custom_non_data_line_markers, partition_by, quotechar, escapechar)
    874 if len(column_names) != len(j["column_types"]): raise ValueError(
    875 "length of col_names should be equal to the number of columns: %d vs %d"
--> 876 % (len(column_names), len(j["column_types"])))
    877 j["column_names"] = column_names
    878 counter = 0

ValueError: length of col_names should be equal to the number of columns: 1000000 vs 331186
{noformat}
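For anyone trying to reproduce this, here is a minimal sketch of how a short, ultra-wide CSV could be generated with the standard library. The helper names and the `C1..Cn` column names are assumptions made for illustration; the original test file and the exact upload call are not shown in this issue.

```python
import csv
import os
import tempfile


def write_wide_csv(path, n_cols, n_rows=1):
    """Write a CSV with n_cols columns and n_rows data rows.

    Hypothetical helper: the column names C1..Cn are an assumption,
    not the names used in the original test.
    """
    names = ["C%d" % (i + 1) for i in range(n_cols)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)
        for r in range(n_rows):
            writer.writerow([r] * n_cols)
    return names


def header_size_bytes(path):
    """Return the length in bytes of the first line, including its line ending."""
    with open(path, "rb") as f:
        return len(f.readline())
```

Calling `write_wide_csv(path, n_cols=1000000)` and then `header_size_bytes(path)` shows how large the header row is for a given naming scheme, which can then be compared against the parser's chunk size.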
@exalate-issue-sync
Author

Sebastien Poirier commented: comments from internal discussion:

{noformat}
michalkurka 4 days ago

There is a magic number: the default chunk size for uploaded data is 4MB. Your file is about 22MB, so it is split into 6 chunks. It looks like the header is longer than one chunk, so the data is parsed incorrectly.

Something like this should only happen for datasets like yours, super-short and ultra-wide.

It is a bug that should be documented.
{noformat}
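The arithmetic in the quoted comment can be checked with a short sketch. The function names are hypothetical, and the 22MB figure is approximate ("about 22MB" in the comment), so treat the numbers as illustrative.

```python
import math

MB = 1024 * 1024


def chunk_count(file_bytes, chunk_bytes):
    """Number of fixed-size chunks needed to hold file_bytes."""
    return math.ceil(file_bytes / chunk_bytes)


def header_spans_chunks(header_bytes, chunk_bytes):
    """True when the header row does not fit inside a single chunk."""
    return header_bytes > chunk_bytes


# A file of roughly 22MB with 4MB chunks needs 6 chunks (22 / 4 = 5.5, rounded up).
print(chunk_count(22 * MB, 4 * MB))
```

If `header_spans_chunks` is true for a file, that file falls into the "header longer than one chunk" case described above.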

@h2o-ops
Collaborator

h2o-ops commented May 10, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8898
Assignee: New H2O Bugs
Reporter: Sebastien Poirier
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

@wendycwong
Contributor

The error occurs because the header takes more than one chunk to store. If we can auto-detect this condition and re-assign the chunk size, that should avoid the problem.
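A rough sketch of what such detection could look like on the client side, using only the standard library. This is not H2O's implementation; `suggest_chunk_size` is a hypothetical helper, and doubling the chunk size until the header fits is just one possible policy.

```python
def suggest_chunk_size(path, default_chunk_bytes=4 * 1024 * 1024):
    """Return (header_len, chunk_bytes) where chunk_bytes >= header_len.

    Hypothetical helper: starts from the default chunk size and doubles it
    until the first line of the file (assumed to be the header) fits.
    """
    with open(path, "rb") as f:
        header_len = len(f.readline())  # first line, including its newline
    size = default_chunk_bytes
    while size < header_len:
        size *= 2
    return header_len, size
```

When the header already fits, the default chunk size is returned unchanged, so ordinary files would be unaffected by such a check.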

@wendycwong
Contributor

Could be related to this GH: #6610
