Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XmlSerde for Hive with PrestoDB #9002

Open
leolorenzoluis opened this issue Sep 19, 2017 · 14 comments
Open

XmlSerde for Hive with PrestoDB #9002

leolorenzoluis opened this issue Sep 19, 2017 · 14 comments

Comments

@leolorenzoluis
Copy link

I have a table in Hive that uses XmlSerde from https://github.com/dvasilen/Hive-XML-SerDe.
I've uploaded the required jars in /hive-hadoop2 in presto coordinator/nodes and I am able to query the hdfs file that it points to from presto using the ff command:

hdfs dfs -cat hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml'

It prints the XML file. However, when I try to query it with presto-cli then I get the ff error:

uery 20170919_022807_00021_hwgdm failed: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
	at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:200)
	at com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:72)
	at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
	at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
	at com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:71)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:160)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:259)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:300)
	at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:234)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:538)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:229)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:623)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:478)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at com.ibm.spss.hive.serde2.xml.XmlInputFormat$XmlRecordReader.<init>(XmlInputFormat.java:72)
	at com.ibm.spss.hive.serde2.xml.XmlInputFormat.getRecordReader(XmlInputFormat.java:57)
	at com.facebook.presto.hive.HiveUtil.lambda$createRecordReader$3(HiveUtil.java:197)
	at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
	at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:197)
	... 20 more

I can query the xml files perfectly fine if I'm in beeline or hive client. Is this an issue from the Serde lib or Presto?

@electrum
Copy link
Contributor

electrum commented Sep 19, 2017 via email

@leolorenzoluis
Copy link
Author

leolorenzoluis commented Sep 19, 2017

@electrum Thanks for your response. Does presto have a built in serde for handling XML or what would you recommend as a solution in handling XML files? Convert to other file formats that are supported such as Avra, Parquet or Orc? Spin up a postgresql and dump all XML files there since it's natively supported?

I think this is a good case where Presto should provide standard interfaces for adding custom serdes without having to fork the code and repackage it.

Does this also mean this is no longer true? #868

@electrum
Copy link
Contributor

electrum commented Sep 19, 2017

We would need to copy that code into Presto. Or possibly depend on it if they publish a useable Maven artifact.

The topic of custom serdes comes up maybe once a year. Having a pluggable interface is a lot of work. It's better if we have first class support for the few formats that are needed.

@leolorenzoluis
Copy link
Author

leolorenzoluis commented Sep 19, 2017

Why would Presto require the custom serde? As far as I understand, Presto queries where the data lives. What is Hive doing in this case if a presto worker is just asking hive to stream back the results and report back to the coordinator? Am I wrong and missing something?

@ashwinhs
Copy link

This is what the documentation says -

The Hive connector allows querying data stored in a Hive data warehouse. Hive is a combination of three components:

Data files in varying formats that are typically stored in the Hadoop Distributed File System (HDFS) or in Amazon S3.
Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database such as MySQL and is accessed via the Hive metastore service.
A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.
Presto only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive’s execution environment.

@leolorenzoluis
Copy link
Author

leolorenzoluis commented Sep 19, 2017

Okay, so if that's the case. Can I just create a custom JDBC connector that uses HiveQL to query the XML from the table? Would there be any problem with existing Hive Connector (still want to use it for other stuff) and use the new connector for XML in Hive using JDBC?

Edit: Initial findings - It's really slow to do aggregation from Hive to Presto. Parsing takes awhile. I think I may end up needing to transform the XML to a different file format.

@ashwinhs How does presto query the data? Using hdfs client?

@electrum
Copy link
Contributor

electrum commented Sep 19, 2017 via email

@electrum
Copy link
Contributor

electrum commented Sep 19, 2017 via email

@leolorenzoluis
Copy link
Author

That's true. I ended up having to create another table with a different format such as Parquet or Orc and Presto of course was able to query and was fast. I'm not sure if that's good enough, but the problem would be maintaining the datasets between two tables, and syncing them.

@leolorenzoluis
Copy link
Author

@electrum What about other connectors that uses JDBC Connection? Wouldn't it have the same problem?

@electrum
Copy link
Contributor

electrum commented Sep 20, 2017 via email

@leolorenzoluis
Copy link
Author

@electrum May I know why it would be a single JDBC connection? Also, does presto uses single JDBC connection per connector or per transaction such as getting schemas, columns, etc? Is it reusing the JDBC pools?

If what you say is true for its limitation then that means presto will be slow for example, a PostgreSQL that uses JDBC connection, and a Redis that uses JDBC connection. Performing a simple join between two data sources you say is gonna be slow?

@sarwarbhuiyan
Copy link

+1
Seems xml is fading in the background in the big data ecosystems but big financials still use a lot of XML in the message interchange and storage for various integration scenarios. We'd like to use the XMLSerde as part of Amazon Athena which uses Presto under the hood. Would very much appreciate the support or any workaround.

@tooptoop4
Copy link

prestosql 334 has same issue, trinodb/trino#3888 might be an alternative

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants