XmlSerde for Hive with PrestoDB #9002

leolorenzoluis · 2017-09-19T02:29:31Z

I have a table in Hive that uses XmlSerde from https://github.com/dvasilen/Hive-XML-SerDe.
I've uploaded the required jars in /hive-hadoop2 in presto coordinator/nodes and I am able to query the hdfs file that it points to from presto using the ff command:

hdfs dfs -cat hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml'

It prints the XML file. However, when I try to query it with presto-cli then I get the ff error:

uery 20170919_022807_00021_hwgdm failed: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://namenode:8020/user/hive/warehouse/xml_auctions/in_hdfs.xml (offset=0, length=420) using com.ibm.spss.hive.serde2.xml.XmlInputFormat: null
	at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:200)
	at com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:72)
	at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
	at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
	at com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:71)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:160)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:87)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:259)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:300)
	at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:234)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:538)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:229)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:623)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:478)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at com.ibm.spss.hive.serde2.xml.XmlInputFormat$XmlRecordReader.<init>(XmlInputFormat.java:72)
	at com.ibm.spss.hive.serde2.xml.XmlInputFormat.getRecordReader(XmlInputFormat.java:57)
	at com.facebook.presto.hive.HiveUtil.lambda$createRecordReader$3(HiveUtil.java:197)
	at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
	at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:197)
	... 20 more

I can query the xml files perfectly fine if I'm in beeline or hive client. Is this an issue from the Serde lib or Presto?

The text was updated successfully, but these errors were encountered:

electrum · 2017-09-19T04:03:18Z

Presto doesn't really support adding custom serde jars. The best option is to add the code to Presto with appropriate unit tests. This will ensure that it works and stays working in the future.

leolorenzoluis · 2017-09-19T15:03:56Z

@electrum Thanks for your response. Does presto have a built in serde for handling XML or what would you recommend as a solution in handling XML files? Convert to other file formats that are supported such as Avra, Parquet or Orc? Spin up a postgresql and dump all XML files there since it's natively supported?

I think this is a good case where Presto should provide standard interfaces for adding custom serdes without having to fork the code and repackage it.

Does this also mean this is no longer true? #868

electrum · 2017-09-19T15:42:45Z

We would need to copy that code into Presto. Or possibly depend on it if they publish a useable Maven artifact.

The topic of custom serdes comes up maybe once a year. Having a pluggable interface is a lot of work. It's better if we have first class support for the few formats that are needed.

leolorenzoluis · 2017-09-19T16:05:07Z

Why would Presto require the custom serde? As far as I understand, Presto queries where the data lives. What is Hive doing in this case if a presto worker is just asking hive to stream back the results and report back to the coordinator? Am I wrong and missing something?

ashwinhs · 2017-09-19T17:57:02Z

This is what the documentation says -

The Hive connector allows querying data stored in a Hive data warehouse. Hive is a combination of three components:

Data files in varying formats that are typically stored in the Hadoop Distributed File System (HDFS) or in Amazon S3.
Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database such as MySQL and is accessed via the Hive metastore service.
A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.
Presto only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive’s execution environment.

leolorenzoluis · 2017-09-19T18:07:19Z

Okay, so if that's the case. Can I just create a custom JDBC connector that uses HiveQL to query the XML from the table? Would there be any problem with existing Hive Connector (still want to use it for other stuff) and use the new connector for XML in Hive using JDBC?

Edit: Initial findings - It's really slow to do aggregation from Hive to Presto. Parsing takes awhile. I think I may end up needing to transform the XML to a different file format.

@ashwinhs How does presto query the data? Using hdfs client?

electrum · 2017-09-19T19:49:32Z

See the overview here: https://prestodb.io/docs/current/connector/hive.html

electrum · 2017-09-19T19:53:45Z

That's possible, but it will be very slow since all the data will be pulled via a single JDBC connection.

leolorenzoluis · 2017-09-19T23:12:56Z

That's true. I ended up having to create another table with a different format such as Parquet or Orc and Presto of course was able to query and was fast. I'm not sure if that's good enough, but the problem would be maintaining the datasets between two tables, and syncing them.

leolorenzoluis · 2017-09-20T03:59:25Z

@electrum What about other connectors that uses JDBC Connection? Wouldn't it have the same problem?

electrum · 2017-09-20T14:24:07Z

Yes, they have the same limitation.

leolorenzoluis · 2017-09-25T05:22:02Z

@electrum May I know why it would be a single JDBC connection? Also, does presto uses single JDBC connection per connector or per transaction such as getting schemas, columns, etc? Is it reusing the JDBC pools?

If what you say is true for its limitation then that means presto will be slow for example, a PostgreSQL that uses JDBC connection, and a Redis that uses JDBC connection. Performing a simple join between two data sources you say is gonna be slow?

sarwarbhuiyan · 2018-07-04T12:17:17Z

+1
Seems xml is fading in the background in the big data ecosystems but big financials still use a lot of XML in the message interchange and storage for various integration scenarios. We'd like to use the XMLSerde as part of Amazon Athena which uses Presto under the hood. Would very much appreciate the support or any workaround.

tooptoop4 · 2020-06-05T11:27:10Z

prestosql 334 has same issue, trinodb/trino#3888 might be an alternative

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XmlSerde for Hive with PrestoDB #9002

XmlSerde for Hive with PrestoDB #9002

leolorenzoluis commented Sep 19, 2017

electrum commented Sep 19, 2017 via email

leolorenzoluis commented Sep 19, 2017 •

edited

electrum commented Sep 19, 2017 •

edited

leolorenzoluis commented Sep 19, 2017 •

edited

ashwinhs commented Sep 19, 2017

leolorenzoluis commented Sep 19, 2017 •

edited

electrum commented Sep 19, 2017 via email

electrum commented Sep 19, 2017 via email

leolorenzoluis commented Sep 19, 2017

leolorenzoluis commented Sep 20, 2017

electrum commented Sep 20, 2017 via email

leolorenzoluis commented Sep 25, 2017

sarwarbhuiyan commented Jul 4, 2018

tooptoop4 commented Jun 5, 2020

XmlSerde for Hive with PrestoDB #9002

XmlSerde for Hive with PrestoDB #9002

Comments

leolorenzoluis commented Sep 19, 2017

electrum commented Sep 19, 2017 via email

leolorenzoluis commented Sep 19, 2017 • edited

electrum commented Sep 19, 2017 • edited

leolorenzoluis commented Sep 19, 2017 • edited

ashwinhs commented Sep 19, 2017

leolorenzoluis commented Sep 19, 2017 • edited

electrum commented Sep 19, 2017 via email

electrum commented Sep 19, 2017 via email

leolorenzoluis commented Sep 19, 2017

leolorenzoluis commented Sep 20, 2017

electrum commented Sep 20, 2017 via email

leolorenzoluis commented Sep 25, 2017

sarwarbhuiyan commented Jul 4, 2018

tooptoop4 commented Jun 5, 2020

leolorenzoluis commented Sep 19, 2017 •

edited

electrum commented Sep 19, 2017 •

edited

leolorenzoluis commented Sep 19, 2017 •

edited

leolorenzoluis commented Sep 19, 2017 •

edited