sw does not support parquet import #4170

exalate-issue-sync · 2023-05-22T15:08:39Z

The Parquet parser is not registered:

{code:java}
09-26 20:21:13.862 127.0.0.1:54321 4219 #r thread INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, CSV]
{code}

Steps to reproduce:

{code}
# Pin the Sparkling Water version before loading rsparkling
library(sparklyr)
library(h2o)
options(rsparkling.sparklingwater.version = "2.1.14")
library(rsparkling)
Sys.setenv(SPARK_HOME="~/spark/spark-2.1.0-bin-hadoop2.7")

# Option names containing hyphens must be backtick-quoted in R
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '7G'
config$`sparklyr.shell.executor-memory` <- '7G'
sc <- spark_connect(master='local', version='2.1.0', config=config)

# Start H2O on the Spark cluster, then attempt the import that fails
h2o_context(sc)
h2o.clusterInfo()
h2o.importFile("/Users/nidhimehta/full/",destination_frame = "full")
{code}


exalate-issue-sync · 2023-05-22T15:08:41Z

Michal Malohlava commented: The original intention was not to include h2o-parsers in the SW assembly, since they are not fully featured and depend on fixed versions of libraries (e.g. Avro), which can lead to unexpected behavior.
Users should use Spark to load Parquet data and transfer it into H2O.
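
A minimal sketch of that Spark-based route, assuming the `sc` connection from the reproduction steps above. The table name `full_tbl` is hypothetical, and the sketch relies on `spark_read_parquet` from sparklyr and `as_h2o_frame` from rsparkling; their behavior may differ between releases.

```r
library(sparklyr)
library(rsparkling)

# Assumes `sc` is the Spark connection created in the reproduction steps.
# Let Spark read the Parquet directory instead of H2O's own parser.
# "full_tbl" is a hypothetical name for the registered Spark table.
full_tbl <- spark_read_parquet(sc, name = "full_tbl", path = "/Users/nidhimehta/full/")

# Transfer the Spark DataFrame into H2O as an H2OFrame.
full_hf <- as_h2o_frame(sc, full_tbl)
```

This avoids `h2o.importFile` entirely, so it does not depend on which parsers are registered in H2O.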

exalate-issue-sync · 2023-05-22T15:08:42Z

Michal Kurka commented: This most likely needs to be fixed on H2O's side. We should use Parquet version 1.8.x (the one compatible with Spark).

exalate-issue-sync · 2023-05-22T15:08:44Z

Michal Kurka commented: I was also able to reproduce this on 2.1.14.

exalate-issue-sync · 2023-05-22T15:08:46Z

Jakub Hava commented: This was fixed in H2O.

exalate-issue-sync · 2023-05-22T15:08:48Z

Michal Kurka commented: Parquet import was not working because the Parquet libraries are part of Spark's distribution and are loaded by another class loader. We were accessing the package-private class InternalParquetRecordReader, which threw an IllegalAccessException even though the calling class was in the same package (it was, however, loaded by a different class loader with a different protection domain). The solution was to copy InternalParquetRecordReader into H2O's code base and adapt it for our purposes. The copied class uses only the public, developer-facing API. We also adopted Parquet 1.8.1.

This was not tested on Spark 2.2.x.

DinukaH2O · 2023-05-23T10:31:04Z

JIRA Issue Migration Info

Jira Issue: SW-542
Assignee: Michal Kurka
Reporter: Nidhi Mehta
State: Resolved
Fix Version: 2.1.15
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

h2oai/h2o-3#1618

hasithjp · 2023-05-29T13:49:11Z

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2017-09-26T15:27:59.366-0700

DinukaH2O closed this as completed May 23, 2023

DinukaH2O added the fixVersion/2.1.15 label May 23, 2023
