use-column-names by default #1558
Comments
We discussed this with @mfeteanu-dni @kgalieva @ryanrupp @piyushnarang @electrum at an issue in the old repo and also on our Slack (https://prestosql.slack.com/archives/CFLB9AMBN/p1556892406365700). Quoting @Praveen2112
Unfortunately, we have not yet established with certainty what Hive's default behavior actually is.
from https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
@pettyjamesm do you happen to know if this is Athena-specific or true for Hive 2 and 3 as well?
I am marking this as |
My test results are summarized in #6316. The tests themselves are built in #6479.
In Hive 3.1.2-6, for both transactional and non-transactional tables, columns in ORC tables are matched by column number in Hive. The column names in data files don't matter for reads; instead, the column names come from the metastore. This means that after a column rename, the column values in files created before the rename are still seen by Hive. The values of renamed columns are also seen by Trino if the
In Hive 3.1.2-6, columns in Parquet tables are matched by name in Hive. After a column rename, the column values in files created before the rename won't be seen by Hive, and the default value of the column will be returned instead. The values of renamed columns will be seen by Trino if the parameter hive.parquet.use-column-names has its default value of false. To match Hive, hive.parquet.use-column-names should be true.
Of course, that is not an upwards-compatible change, and @dain and others have expressed concern about making such a change. The Hive Parquet default behavior of ignoring column values in data files created before a column rename seems nutty to me, FWIW.
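To make the difference between the two matching strategies concrete, here is a toy sketch. It is not Hive or Trino code; the function names, the column names (`id`, `amount`, `total`) and the data are all made up for illustration. It only models a reader that resolves table columns against a file either by position or by name, after a column has been renamed in the metastore.

```python
def read_by_position(file_columns, file_row, table_columns):
    """Resolve each table column by its ordinal position in the file.

    Column names stored in the file are ignored (file_columns is unused
    on purpose); a table column with no matching position reads as None.
    """
    return {
        name: (file_row[i] if i < len(file_row) else None)
        for i, name in enumerate(table_columns)
    }


def read_by_name(file_columns, file_row, table_columns):
    """Resolve each table column by looking up its name in the file.

    A table column whose name is absent from the file reads as None.
    """
    values = dict(zip(file_columns, file_row))
    return {name: values.get(name) for name in table_columns}


# Hypothetical data file written before the rename.
file_columns = ["id", "amount"]
file_row = [1, 9.5]

# Hypothetical metastore schema after renaming 'amount' to 'total'.
table_columns = ["id", "total"]

by_position = read_by_position(file_columns, file_row, table_columns)
by_name = read_by_name(file_columns, file_row, table_columns)

print(by_position)  # the old value is still visible under the new name
print(by_name)      # the renamed column has no match in the old file
```

In this model, positional matching keeps the pre-rename value visible under the new column name, while name matching returns a missing value for it, which mirrors the ORC and Parquet observations described above.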
It's not, but it's also technically a "bug" (i.e., it should've never worked that way). For compatibility with Hive and Spark and anything else that might read Parquet files from Hive tables, we should fix it. It's also a source of many questions from users who run into this unexpected behavior. I think the main concern right now is about decoupling the ORC vs Parquet behavior: they are a little entangled right now, at least in terms of what configuration options affect what behavior. @findepi, what were the other related options that overlap with both formats?
@martint Following up on your comment: does Trino support schema evolution with ORC as the data format? I have found many comments and tests, and I'm hesitant about which data format to use with Trino. @findepi recommended ORC to me, but my question is always about schema evolution. Thanks
@djsstarburst What happens when the table has columns
@martint They are entangled in a sense that if you enable Of course, it doesn't prevent us from setting |
I just added this test, @findepi. It demonstrates that the data is swapped, as seen both by Hive and Presto, and for both transactional and non-transactional ORC tables:
Currently hive.parquet.use-column-names, hive.orc.use-column-names and hive.partition-use-column-names all default to false.
This certainly is not in line with Hive's default behavior.
Options:
hive.parquet.use-column-names=true and hive.orc.use-column-names=true matches Hive's default behavior
This requires further investigation.
We should consider ORC and Parquet files created with different Hive versions (and maybe with Spark).
Ideally this should be verified against different Hive versions.
(Relates to: #1556)