Parameter force_col_types does not work with skipped_columns when parsing parquet files #15860

Closed
wendycwong opened this issue Oct 23, 2023 · 0 comments · Fixed by #15867


@wendycwong

support ticket: https://h2osupport.freshdesk.com/a/tickets/106711

I was testing the new parameter force_col_types with import_file in v3.44.0.1. It worked in its basic form (reading a file with two parameters: path and force_col_types=True). However, it failed with the error below when additional parameters were passed to import_file.

The parameters are:
path_2_data: points to a parquet file
destination_frame: a UUID
skipped_columns: a list of the indices of the columns to skip

input_data = h2o.import_file(
    path_2_data,
    destination_frame=destination_frame,
    skipped_columns=skipped_columns,
    force_col_types=True,
)

../../env/h2o/lib/python3.10/site-packages/h2o/h2o.py:501: in import_file
return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings,
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:449: in _import_parse
self._parse(rawkey, destination_frame, header, separator, column_names, column_types, na_strings,
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:466: in _parse
return self._parse_raw(setup)
../../env/h2o/lib/python3.10/site-packages/h2o/frame.py:495: in _parse_raw
H2OJob(h2o.api("POST /3/Parse", data=p), "Parse").poll()


self =
poll_updates = None

def poll(self, poll_updates=None):
    """
    Wait until the job finishes.

    This method will continuously query the server about the status of the job, until the job reaches a
    completion. During this time we will display (in stdout) a progress bar with % completion status.
    :param poll_updates: a callback function called a each polling iteration with 2 arguments:
        (current_job: H2OJob, bar_progression: float)
    """
    try:
        hidden = not H2OJob.__PROGRESS_BAR__
        pb = ProgressBar(widgets=self.__PROGRESS_WIDGETS__, hidden=hidden)
        if poll_updates:
            pb.execute(self._refresh_job_status, progress_monitor_fn=ft.partial(poll_updates, self))
        else:
            pb.execute(self._refresh_job_status)
    except StopIteration as e:
        if str(e) == "cancelled":
            self.cancel()
        # Potentially we may want to re-raise the exception here

    assert self.status in {"DONE", "CANCELLED", "FAILED"} or self._poll_count <= 0, \
        "Polling finished while the job has status %s" % self.status
    if self.warnings:
        for w in self.warnings:
            warnings.warn(w)

    # check if failed... and politely print relevant message
    if self.status == "CANCELLED":
        raise H2OJobCancelled("Job<%s> was cancelled by the user." % self.job_key)
    if self.status == "FAILED":
        if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
            raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
                                   "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))

E OSError: Job with key $03017f00000132d4ffffffff$_9dba8853d1ee9ccbf0e9c6ac0b44788 failed with an exception: java.lang.ArrayIndexOutOfBoundsException
E stacktrace:
E java.lang.ArrayIndexOutOfBoundsException

../../env/h2o/lib/python3.10/site-packages/h2o/job.py:88: OSError

@wendycwong wendycwong added the bug label Oct 23, 2023
@wendycwong wendycwong self-assigned this Oct 23, 2023
@wendycwong wendycwong added this to the 3.44.0.2 milestone Oct 24, 2023
@wendycwong wendycwong linked a pull request Oct 25, 2023 that will close this issue
wendycwong added a commit that referenced this issue Oct 30, 2023
* fixed problem with skipped columns.  The skipped columns needed to be removed from column types

* Update h2o-core/src/main/java/water/parser/ParseDataset.java

Co-authored-by: Marek Novotný <marek.novotny@h2o.ai>
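
The commit message says the skipped columns had to be removed from the column types. The idea can be illustrated with a minimal Python sketch. This is not the actual Java change in ParseDataset.java; the helper name `drop_skipped_types` and the example type names are hypothetical and only show how a full-width type list is reduced to match the columns that are actually parsed.

```python
from typing import List, Sequence


def drop_skipped_types(column_types: Sequence[str],
                       skipped_columns: Sequence[int]) -> List[str]:
    """Return column_types without the entries at the skipped indices.

    Hypothetical helper for illustration only; it is not part of the h2o API.
    """
    skipped = set(skipped_columns)
    # Reject indices that do not refer to a column in the file.
    out_of_range = sorted(i for i in skipped if not 0 <= i < len(column_types))
    if out_of_range:
        raise ValueError("skipped column index out of range: %s" % out_of_range)
    # Keep only the types of the columns that will actually be parsed.
    return [t for i, t in enumerate(column_types) if i not in skipped]


# Hypothetical four-column file in which columns 1 and 3 are skipped.
all_types = ["numeric", "string", "enum", "time"]
parsed_types = drop_skipped_types(all_types, [1, 3])
print(parsed_types)  # ['numeric', 'enum']
```

After the reduction, the type list has one entry per parsed column, so indexing it by parsed-column position stays in bounds.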