sw does not support parquet import #4170

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 7 comments

@exalate-issue-sync

The Parquet parser is not registered:

{code:java}
09-26 20:21:13.862 127.0.0.1:54321 4219 #r thread INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, CSV]
{code}

Steps to reproduce:

{code}
library(sparklyr)
library(h2o)
options(rsparkling.sparklingwater.version = "2.1.14")
library(rsparkling)
Sys.setenv(SPARK_HOME="~/spark/spark-2.1.0-bin-hadoop2.7")

config <- spark_config()
# Backticks are required: without them R parses the hyphen as subtraction.
config$`sparklyr.shell.driver-memory` <- '7G'
config$`sparklyr.shell.executor-memory` <- '7G'
sc <- spark_connect(master='local', version='2.1.0', config=config)
h2o_context(sc)
h2o.clusterInfo()
h2o.importFile("/Users/nidhimehta/full/",destination_frame = "full")
{code}

@exalate-issue-sync
Author

Michal Malohlava commented: The original intention was not to include h2o-parsers in the SW assembly, since they are not fully featured and depend on fixed versions of libraries (e.g. Avro), which can lead to unexpected behavior.
Users should use Spark to load Parquet data and then transfer it into H2O.
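
A minimal sketch of that workaround, assuming `sparklyr`, `rsparkling` and `h2o` are installed and that `sc` is an open Spark connection on which `h2o_context(sc)` has already been called. The helper name `parquet_to_h2o` and the default table name are hypothetical, chosen only for illustration; this has not been verified against every rsparkling release.

```r
# Sketch only: let Spark parse the Parquet files, then hand the result to H2O.
# `parquet_to_h2o` is a hypothetical helper, not part of any library.
parquet_to_h2o <- function(sc, path, table_name = "full") {
  # Spark's own Parquet reader does the parsing, so no H2O parser is involved.
  spark_tbl <- sparklyr::spark_read_parquet(sc, name = table_name, path = path)
  # Convert the Spark DataFrame into an H2OFrame through Sparkling Water.
  rsparkling::as_h2o_frame(sc, spark_tbl)
}

# Possible usage, reusing the path from the reproduction steps:
# full <- parquet_to_h2o(sc, "/Users/nidhimehta/full/")
```

Package-qualified calls are used so that defining the helper does not itself require a running Spark or H2O cluster; the work happens only when the function is called.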

@exalate-issue-sync
Author

Michal Kurka commented: This most likely needs to be fixed on H2O's side. We should use Parquet 1.8.x (the version compatible with Spark).

@exalate-issue-sync
Author

Michal Kurka commented: I was also able to reproduce this on 2.1.14.

@exalate-issue-sync
Author

Jakub Hava commented: This was fixed in h2o

@exalate-issue-sync
Author

Michal Kurka commented: Parquet import was not working because the Parquet libraries are part of Spark's distribution and are loaded by a different class loader. We were accessing the package-private class InternalParquetRecordReader, which threw an IllegalAccessException even though the calling class was in the same package (it was, however, loaded by a different class loader with a different protection domain). The solution was to copy InternalParquetRecordReader into H2O's code base and adapt it for our purposes. This class uses only the public, developer-facing API. We also adopted Parquet 1.8.1.

This was not tested on Spark 2.2.x.

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-542
Assignee: Michal Kurka
Reporter: Nidhi Mehta
State: Resolved
Fix Version: 2.1.15
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

h2oai/h2o-3#1618

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2017-09-26T15:27:59.366-0700
