Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use-column-names by default #1558

Closed
findepi opened this issue Sep 19, 2019 · 11 comments · Fixed by #6479
Closed

use-column-names by default #1558

findepi opened this issue Sep 19, 2019 · 11 comments · Fixed by #6479
Labels
bug Something isn't working

Comments

@findepi
Copy link
Member

findepi commented Sep 19, 2019

Currently hive.parquet.use-column-names, hive.orc.use-column-names and hive.partition-use-column-names all default to false.
This certainly is not in line with Hive's default behavior.

Options:

  • it might be that hive.parquet.use-column-names=true and hive.orc.use-column-names=true matches Hive's default behavior
  • it might be that Hive's default behavior is more complex, e.g. some mix of approaches.
  • it might be that Hive's default depends on external vs internal tables (suggested here, worth double checking)

This requires further investigation.
We should consider ORC and Parquet files created with different Hive versions (and maybe with Spark).
Ideally this should be verified against different Hive versions.

(Relates to: #1556)

@findepi
Copy link
Member Author

findepi commented Sep 19, 2019

We discussed this with @mfeteanu-dni @kgalieva @ryanrupp @piyushnarang @electrum at an issue in the old repo and also on our Slack (https://prestosql.slack.com/archives/CFLB9AMBN/p1556892406365700).

Quoting @Praveen2112

if we write the parquet file from a hive or spark will he maintain the actual column name or does he name it as col0,col1.. if he doesn't maintain actual names then it might cause issue right ?

Unfortunately, we didn't reach the certainty yet as to what's actually Hive's default behavior.

@lxynov
Copy link
Member

lxynov commented Sep 19, 2019

@findepi hive.orc.use-column-names affects the top-level behavior but #1556 is about the sub-level behavior

@findepi
Copy link
Member Author

findepi commented Apr 8, 2020

from https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html

In this table, observe that Parquet and ORC are columnar formats with different default column access methods. By default, Parquet will access columns by name and ORC by index (ordinal value). Therefore, Athena provides a SerDe property defined when creating a table to toggle the default column access method which enables greater flexibility with schema evolution.

For Parquet, the parquet.column.index.access property may be set to TRUE, which sets the column access method to use the column’s ordinal number. Setting this property to FALSE will change the column access method to use column name. Similarly, for ORC use the orc.column.index.access property to control the column access method. For more information, see Index Access in ORC and Parquet.

Athena reads ORC by index by default
Athena reads Parquet by name by default

@pettyjamesm do you happen to know if this is Athena-specific or true for Hive 2, 3 as well?

@findepi
Copy link
Member Author

findepi commented May 8, 2020

#3668

@findepi
Copy link
Member Author

findepi commented May 9, 2020

I am marking this as bug. As shown in #3668, with current defaults we are not compatible with Hive.

@findepi
Copy link
Member Author

findepi commented May 11, 2020

#3683 (comment)

@djsstarburst
Copy link
Member

djsstarburst commented Jan 7, 2021

My test results are summarized in #6316. The tests themselves are built in #6479

In Hive 3.1.2-6, for both transactional and non-transactional tables, columns in ORC tables are matched by column number in Hive. The column names in data files don't matter for read. Instead the column names come from the metastore.

This means that after a column rename, the column values in files created before the rename are seen by Hive. The values of renamed columns are also seen by Trino if the hive.orc.use-column-names has the default value of false. I don't see a good argument for change the default value of hive.orc.use-column-names to true.

In Hive 3.1.2-6, columns in Parquet tables are matched by name in Hive. After a column rename, the column values in files created before the rename won't be seen by Hive, and the default value of the column will be returned. The values of renamed columns will be seen by Trino if the parameters hive.parquet.use-column-names has the default value of false. To match Hive, the parameter hive.parquet.use-column-names should be true. And of course that is not upwards-compatible change, and @dain and others have expressed concern about making such a change.

The Hive Parquet default behavior of ignoring column values in data files created before a column rename seems nutty to me, FWIW.

@martint
Copy link
Member

martint commented Jan 7, 2021

And of course that is not upwards-compatible change

It's not, but it's also technically a "bug" (i.e., it should've never worked that way). For compatibility with Hive and Spark and anything else that might read parquet files from Hive tables, we should fix it. It's also a source of many questions from users who run into this unexpected behavior. I think the main concern right now is about decoupling the ORC vs Parquet behavior -- they are a little entangled right now, at least in terms of what configuration options affect what behavior. @findepi, what were the other related options that overlap with both formats?

@Sarrouna
Copy link

Sarrouna commented Jan 8, 2021

@martint As your comment, is Trino with ORC as a data format supports schema evolution ? Really, I found many comments, tests, and I'm hesitate which data format use with Trino if I will use it. @findepi recommended me ORC as a data format. But my question is always arround the schema evolution. Thanks

@findepi
Copy link
Member Author

findepi commented Jan 8, 2021

In Hive 3.1.2-6 [..] ORC [...] This means that after a column rename, the column values in files created before the rename are seen by Hive.

@djsstarburst What happens when the table has columns a, b and file has columns b, a? Is the data swapped?
(in theory by-ordinal matching could be a fallback when by-name did not work)

what were the other related options that overlap with both formats?

@martint hive.partition-use-column-names for table/partition matching.
we have this by-ordinal today, and i believe Athena has it by-name (cc @pettyjamesm) and i do not know about Hive -- needs to be tested, ideally for both ORC and Parquet

They are entangled in a sense that if you enable hive.partition-use-column-names the hive.orc.use-column-names and hive.parquet.use-column-names also need to be true.
I do not know whether this logic is sound and whether it matches Hive behavior

Of course, it doesn't prevent us from setting hive.parquet.use-column-names (if this is the only thing we want to do), as it is an implication only.
Thus we can perhaps make an improvement for Parquet and face the entanglement when we try to change hive.partition-use-column-names.

@djsstarburst
Copy link
Member

What happens when the table has columns a, b and file has columns b, a? Is the data swapped?
(in theory by-ordinal matching could be a fallback when by-name did not work)

I just added this test, @findepi. It demonstrates that the data is swapped, as seen both by Hive and Presto, and for both transactional and non-transactional ORC tables:

    public void testOrcColumnSwap(boolean transactional)
    {
        withTemporaryTable("test_orc_column_renames", transactional, false, NONE, tableName -> {
            onPresto().executeQuery(format("CREATE TABLE %s (name VARCHAR, state VARCHAR) WITH (format = 'ORC', transactional = %s)", tableName, transactional));
            onPresto().executeQuery(format("INSERT INTO %s VALUES ('Katy', 'CA'), ('Joe', 'WA')", tableName));
            verifySelectForPrestoAndHive("SELECT * FROM " + tableName, "true", row("Katy", "CA"), row("Joe", "WA"));

            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN name TO new_name", tableName));
            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN state TO name", tableName));
            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN new_name TO state", tableName));
            log.info("This shows that Presto and Hive can still query old data after a single rename");
            verifySelectForPrestoAndHive("SELECT state, name FROM " + tableName, "TRUE", row("Katy", "CA"), row("Joe", "WA"));
        });
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
5 participants