XmlSerde for Hive with PrestoDB #9002
Presto doesn't really support adding custom serde jars. The best option is to add the code to Presto with appropriate unit tests. This will ensure that it works and stays working in the future.
@electrum Thanks for your response. Does Presto have a built-in serde for handling XML, or what would you recommend as a solution for handling XML files? Convert to other supported file formats such as Avro, Parquet, or ORC? Spin up a PostgreSQL instance and dump all XML files there, since it's natively supported? I think this is a good case where Presto should provide standard interfaces for adding custom serdes without having to fork the code and repackage it. Does this also mean #868 is no longer true?
We would need to copy that code into Presto, or possibly depend on it if they publish a usable Maven artifact. The topic of custom serdes comes up maybe once a year. Having a pluggable interface is a lot of work. It's better if we have first-class support for the few formats that are needed.
Why would Presto require the custom serde? As far as I understand, Presto queries the data where it lives. What is Hive doing in this case, if a Presto worker is just asking Hive to stream back the results and report back to the coordinator? Am I wrong and missing something?
This is what the documentation says -
Okay, so if that's the case, can I just create a custom JDBC connector that uses HiveQL to query the XML from the table? Would there be any problem with keeping the existing Hive connector (I still want to use it for other things) and using the new connector for XML in Hive via JDBC? Edit: Initial findings — it's really slow to do aggregation from Hive through Presto this way; parsing takes a while. I think I may end up needing to transform the XML into a different file format. @ashwinhs How does Presto query the data? Using an HDFS client?
See the overview here: https://prestodb.io/docs/current/connector/hive.html
That's possible, but it will be very slow since all the data will be pulled via a single JDBC connection.
That's true. I ended up having to create another table in a different format such as Parquet or ORC, and Presto of course was able to query it quickly. I'm not sure that's good enough, though: the problem becomes maintaining the datasets across two tables and keeping them in sync.
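The conversion described above can be sketched as a HiveQL CTAS run on the Hive side, where the XML serde is available. The table and partition names here are hypothetical:

```sql
-- Create an ORC copy of the XML-backed table; Presto can scan this quickly.
CREATE TABLE events_orc
STORED AS ORC
AS SELECT * FROM events_xml;

-- Keeping the copy in sync must be done separately, e.g. partition-wise:
-- INSERT OVERWRITE TABLE events_orc PARTITION (ds='2017-08-01')
-- SELECT id, payload FROM events_xml WHERE ds='2017-08-01';
```

This only automates the copy; the synchronization problem between the two tables remains.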
@electrum What about other connectors that use a JDBC connection? Wouldn't they have the same problem?
Yes, they have the same limitation.
@electrum May I know why it would be a single JDBC connection? Also, does Presto use a single JDBC connection per connector, or per operation such as getting schemas, columns, etc.? Does it reuse JDBC connection pools? If that limitation holds, then Presto would be slow for, say, a PostgreSQL source over JDBC joined with a Redis source. Are you saying a simple join between two such data sources is going to be slow?
+1
prestosql 334 has the same issue; trinodb/trino#3888 might be an alternative.
I have a table in Hive that uses XmlSerde from https://github.com/dvasilen/Hive-XML-SerDe. I've uploaded the required jars into /hive-hadoop2 on the Presto coordinator and worker nodes, and from those nodes I can read the HDFS file the table points to: it prints the XML file. However, when I try to query the table with presto-cli, I get an error. I can query the XML files perfectly fine from beeline or the Hive client. Is this an issue with the serde lib or with Presto?
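For context on why XML-backed tables scan slowly, here is a rough illustration (not the serde's actual code) of the work an XML serde does for every row: parse the document and pull out column values via XPath-like expressions. The column names and paths below are hypothetical.

```python
import xml.etree.ElementTree as ET

# One serialized row as it might sit in the HDFS file.
record = "<event><id>42</id><user><name>alice</name></user></event>"

# Column-name -> path mapping, analogous to the SerDe's per-column
# XPath table properties (names here are made up for illustration).
columns = {"id": "./id", "name": "./user/name"}

# Per-record parse and field extraction -- repeated for every row,
# which is why this is much slower than columnar formats like ORC.
root = ET.fromstring(record)
row = {col: root.find(path).text for col, path in columns.items()}
print(row)  # {'id': '42', 'name': 'alice'}
```

Columnar formats avoid this entirely: the values are already decoded and laid out per column, so no per-row parsing is needed at scan time.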