Parser fails if uploading more than 331186 columns #6527

exalate-issue-sync · 2023-02-21T22:09:27Z

When writing tests for https://h2oai.atlassian.net/browse/PUBDEV-8876, encountered this issue:

{noformat} ~/repos/h2o/h2o-3/h2o-py/h2o/h2o.py in parse_setup(raw_frames, destination_frame, header, separator, column_names, column_types, na_strings, skipped_columns, custom_non_data_line_markers, partition_by, quotechar, escapechar)
874 if len(column_names) != len(j["column_types"]): raise ValueError(
875 "length of col_names should be equal to the number of columns: %d vs %d"
--> 876 % (len(column_names), len(j["column_types"])))
877 j["column_names"] = column_names
878 counter = 0

ValueError: length of col_names should be equal to the number of columns: 1000000 vs 331186{noformat}


exalate-issue-sync · 2023-02-21T22:09:29Z

Sebastien Poirier commented: comments from internal discussion:

{noformat}michalkurka 4 days ago

There is a magic number, the default chunk size for Uploaded data is 4MB, your file is about 22MB, and it is split into 6 chunks. It looks like the header is longer than one chunk and the data is incorrectly parsed.

Something like this should only happen for datasets like yours, super-short and ultra-wide.

It is a bug that should be documented.{noformat}
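The arithmetic in the comment above can be checked with a short sketch. It only uses the 4 MB chunk size and the roughly 22 MB file size quoted in the thread; the column names `C1..Cn` and the helper names are assumptions made for illustration, not the reporter's actual data or H2O's parser code.

```python
import math

# 4 MB default chunk size for uploaded data, as quoted in the comment above.
CHUNK_SIZE = 4 * 1024 * 1024


def chunk_count(file_size, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold a file of file_size bytes."""
    return math.ceil(file_size / chunk_size)


def header_chunks(n_cols, chunk_size=CHUNK_SIZE, sep=","):
    """Size in bytes of a header line with hypothetical names C1..Cn,
    and the number of chunks that header line alone would span."""
    header = sep.join(f"C{i}" for i in range(1, n_cols + 1)) + "\n"
    header_bytes = len(header.encode("utf-8"))
    return header_bytes, math.ceil(header_bytes / chunk_size)


# A file of about 22 MB is split into 6 chunks of 4 MB.
print(chunk_count(22 * 1024 * 1024))

# With these assumed names, a 1,000,000-column header does not fit in one chunk.
print(header_chunks(1_000_000))
```

Under these assumptions, the header line by itself is larger than a single chunk, which is the condition described in the comment.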

h2o-ops · 2023-05-10T13:53:59Z

JIRA Issue Details

Jira Issue: PUBDEV-8898
Assignee: New H2O Bugs
Reporter: Sebastien Poirier
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

wendycwong · 2023-06-25T00:11:12Z

The error occurs because the header takes more than one chunk to store. If we can auto-detect this condition and re-assign the chunk size, that should avoid the problem.
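A minimal sketch of the detection idea above, written as standalone Python rather than H2O parser code. The helper names `header_length` and `suggest_chunk_size` are hypothetical, and doubling the chunk size until the header fits is just one possible policy, not something the thread specifies. The demo uses a deliberately tiny chunk size so it runs quickly.

```python
import os
import tempfile

# 4 MB default chunk size for uploaded data, as quoted earlier in the thread.
DEFAULT_CHUNK_SIZE = 4 * 1024 * 1024


def header_length(path):
    """Size in bytes of the first line of the file, newline included."""
    with open(path, "rb") as f:
        return len(f.readline())


def suggest_chunk_size(path, default=DEFAULT_CHUNK_SIZE):
    """Hypothetical helper: double the default chunk size until the
    whole header line fits inside a single chunk."""
    needed = header_length(path)
    size = default
    while size < needed:
        size *= 2
    return size


# Demo: a super-short, wide file (100 columns, 1 data row) and a 64-byte "chunk".
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write((",".join(f"C{i}" for i in range(1, 101)) + "\n").encode("utf-8"))
    tmp.write((",".join("0" for _ in range(100)) + "\n").encode("utf-8"))
    demo_path = tmp.name

demo_header_len = header_length(demo_path)
demo_chunk = suggest_chunk_size(demo_path, default=64)
os.remove(demo_path)

print(demo_header_len, demo_chunk)
```

Whether the chunk size can actually be re-assigned at this point in the upload path is not stated in the thread, so this only illustrates the check itself.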

wendycwong · 2023-06-25T22:47:21Z

Could be related to this GH: #6610

wendycwong added bug, Parser, Major labels Jun 25, 2023

wendycwong mentioned this issue Jun 25, 2023

Error importing 270k column sparse data file into H2O #6610

Open

wendycwong assigned krasinski May 16, 2024
