Parameter force_col_type does not work with skipped_columns when parsing parquet files #15860

wendycwong · 2023-10-23T19:33:40Z

support ticket: https://h2osupport.freshdesk.com/a/tickets/106711

I was testing the new parameter force_col_types with import_file in v3.44.0.1. It worked in its basic form (reading a file with two parameters: path and force_col_types=True). However, it failed with the error below when additional parameters were passed to import_file.

The parameters are:
path_2_data: points to a parquet file
destination_frame: a UUID
skipped_columns: a list of the indices of the columns to skip

input_data = h2o.import_file(
    path_2_data,
    destination_frame=destination_frame,
    skipped_columns=skipped_columns,
    force_col_types=True,
)

../../env/h2o/lib/python3.10/site-packages/h2o/h2o.py:501: in import_file
return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings,
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:449: in _import_parse
self._parse(rawkey, destination_frame, header, separator, column_names, column_types, na_strings,
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:466: in _parse
return self._parse_raw(setup)
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:495: in _parse_raw
H2OJob(h2o.api("POST /3/Parse", data=p), "Parse").poll()

self =
poll_updates = None

def poll(self, poll_updates=None):
    """
    Wait until the job finishes.

    This method will continuously query the server about the status of the job, until the job reaches a
    completion. During this time we will display (in stdout) a progress bar with % completion status.
    :param poll_updates: a callback function called a each polling iteration with 2 arguments:
        (current_job: H2OJob, bar_progression: float)
    """
    try:
        hidden = not H2OJob.__PROGRESS_BAR__
        pb = ProgressBar(widgets=self.__PROGRESS_WIDGETS__, hidden=hidden)
        if poll_updates:
            pb.execute(self._refresh_job_status, progress_monitor_fn=ft.partial(poll_updates, self))
        else:
            pb.execute(self._refresh_job_status)
    except StopIteration as e:
        if str(e) == "cancelled":
            self.cancel()
        # Potentially we may want to re-raise the exception here

    assert self.status in {"DONE", "CANCELLED", "FAILED"} or self._poll_count <= 0, \
        "Polling finished while the job has status %s" % self.status
    if self.warnings:
        for w in self.warnings:
            warnings.warn(w)

    # check if failed... and politely print relevant message
    if self.status == "CANCELLED":
        raise H2OJobCancelled("Job<%s> was cancelled by the user." % self.job_key)
    if self.status == "FAILED":
        if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
            raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
                                   "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))

E OSError: Job with key $03017f00000132d4ffffffff$_9dba8853d1ee9ccbf0e9c6ac0b44788 failed with an exception: java.lang.ArrayIndexOutOfBoundsException
E stacktrace:
E java.lang.ArrayIndexOutOfBoundsException

../../env/h2o/lib/python3.10/site-packages/h2o/job.py:88: OSError


* fixed problem with skipped columns. The skipped columns needed to be removed from column types
* Update h2o-core/src/main/java/water/parser/ParseDataset.java
Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>
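
The fix note above says the skipped columns had to be removed from the column types. A minimal Python sketch of that idea follows. It is not the actual Java change in ParseDataset.java; the function name drop_skipped_types and the type strings are hypothetical and chosen only for illustration. The assumption is that the forced type list covered every column in the file while the parser only handled the retained columns, so the two lists differed in length.

```python
from typing import List, Sequence


def drop_skipped_types(col_types: Sequence[str],
                       skipped_columns: Sequence[int]) -> List[str]:
    """Return the types of the columns that are actually parsed.

    Hypothetical helper for illustration only: col_types holds one entry per
    column in the file, and skipped_columns holds the 0-based indices of the
    columns that the parser will not read.
    """
    skipped = set(skipped_columns)
    for idx in skipped:
        # Reject indices that do not refer to a column in the file.
        if not 0 <= idx < len(col_types):
            raise IndexError("skipped column index %d is out of range" % idx)
    # Keep a type only if its column index is not skipped.
    return [t for i, t in enumerate(col_types) if i not in skipped]


# Four columns in the file, two of them skipped: only two types remain,
# matching the number of columns the parser will produce.
all_types = ["int", "real", "enum", "string"]
kept_types = drop_skipped_types(all_types, skipped_columns=[1, 3])
print(kept_types)  # ['int', 'enum']
```

With the type list trimmed this way, its length equals the number of retained columns, which is the condition the fix note describes.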

wendycwong added the bug label Oct 23, 2023

wendycwong self-assigned this Oct 23, 2023

wendycwong mentioned this issue Oct 24, 2023

GH-15860: fixed problem with skipped columns. #15867

Merged

wendycwong added this to the 3.44.0.2 milestone Oct 24, 2023

wendycwong linked a pull request Oct 25, 2023 that will close this issue

GH-15860: fixed problem with skipped columns. #15867

Merged

wendycwong closed this as completed Oct 30, 2023
