use-column-names by default #1558

findepi · 2019-09-19T07:47:45Z

Currently hive.parquet.use-column-names, hive.orc.use-column-names and hive.partition-use-column-names all default to false.
This certainly is not in line with Hive's default behavior.

Options:

It might be that hive.parquet.use-column-names=true and hive.orc.use-column-names=true match Hive's default behavior.
It might be that Hive's default behavior is more complex, e.g. some mix of approaches.
It might be that Hive's default depends on whether the table is external or internal (suggested here, worth double checking).

This requires further investigation.
We should consider ORC and Parquet files created with different Hive versions (and maybe with Spark).
Ideally this should be verified against different Hive versions.

(Relates to: #1556)


findepi · 2019-09-19T08:07:43Z

We discussed this with @mfeteanu-dni @kgalieva @ryanrupp @piyushnarang @electrum at an issue in the old repo and also on our Slack (https://prestosql.slack.com/archives/CFLB9AMBN/p1556892406365700).

Quoting @Praveen2112

If we write the Parquet file from Hive or Spark, will it maintain the actual column names, or does it name them col0, col1, ...? If it doesn't maintain the actual names, that might cause issues, right?

Unfortunately, we haven't yet reached certainty about what Hive's default behavior actually is.

lxynov · 2019-09-19T17:33:25Z

@findepi hive.orc.use-column-names affects the top-level behavior but #1556 is about the sub-level behavior

findepi · 2020-04-08T07:09:29Z

From the Athena documentation (https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html):

In this table, observe that Parquet and ORC are columnar formats with different default column access methods. By default, Parquet will access columns by name and ORC by index (ordinal value). Therefore, Athena provides a SerDe property defined when creating a table to toggle the default column access method which enables greater flexibility with schema evolution.

For Parquet, the parquet.column.index.access property may be set to TRUE, which sets the column access method to use the column’s ordinal number. Setting this property to FALSE will change the column access method to use column name. Similarly, for ORC use the orc.column.index.access property to control the column access method. For more information, see Index Access in ORC and Parquet.

Athena reads ORC by index by default
Athena reads Parquet by name by default
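To make the two access methods concrete, here is a minimal sketch of how a reader could map table columns to file columns under each one. The class and method names are hypothetical and this is not Trino's, Hive's, or Athena's actual code; it only illustrates what happens to a renamed column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the actual reader code of any engine.
public class ColumnMatching
{
    // By index: table column i reads file column i, whatever it is called in the file.
    // Returns -1 for table columns that lie beyond the end of the file schema.
    public static List<Integer> matchByIndex(List<String> tableColumns, List<String> fileColumns)
    {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            result.add(i < fileColumns.size() ? i : -1);
        }
        return result;
    }

    // By name: each table column reads the file column with the same name.
    // Returns -1 when the file has no column with that name (exact match assumed).
    public static List<Integer> matchByName(List<String> tableColumns, List<String> fileColumns)
    {
        List<Integer> result = new ArrayList<>();
        for (String column : tableColumns) {
            result.add(fileColumns.indexOf(column));
        }
        return result;
    }

    public static void main(String[] args)
    {
        // The file was written when the first column was still called "name".
        List<String> fileColumns = List.of("name", "state");
        // The metastore schema after renaming "name" to "full_name".
        List<String> tableColumns = List.of("full_name", "state");

        System.out.println(matchByIndex(tableColumns, fileColumns)); // prints [0, 1]
        System.out.println(matchByName(tableColumns, fileColumns));  // prints [-1, 1]
    }
}
```

With by-index access the renamed column still resolves to the old data, while with by-name access it has no match in the old file, so a reader would presumably return NULL or a default for it.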

@pettyjamesm do you happen to know if this is Athena-specific or true for Hive 2, 3 as well?

findepi · 2020-05-08T10:50:50Z

#3668

findepi · 2020-05-09T12:20:45Z

I am marking this as a bug. As shown in #3668, with current defaults we are not compatible with Hive.

findepi · 2020-05-11T21:53:22Z

#3683 (comment)

djsstarburst · 2021-01-07T22:02:36Z

My test results are summarized in #6316. The tests themselves are built in #6479

In Hive 3.1.2-6, for both transactional and non-transactional tables, columns in ORC tables are matched by column number in Hive. The column names in data files don't matter for read. Instead the column names come from the metastore.

This means that after a column rename, the column values in files created before the rename are seen by Hive. The values of renamed columns are also seen by Trino if hive.orc.use-column-names has the default value of false. I don't see a good argument for changing the default value of hive.orc.use-column-names to true.

In Hive 3.1.2-6, columns in Parquet tables are matched by name in Hive. After a column rename, the column values in files created before the rename won't be seen by Hive, and the default value of the column will be returned. The values of renamed columns will be seen by Trino if the parameter hive.parquet.use-column-names has the default value of false. To match Hive, the parameter hive.parquet.use-column-names should be true. And of course that is not an upwards-compatible change, and @dain and others have expressed concern about making such a change.

The Hive Parquet default behavior of ignoring column values in data files created before a column rename seems nutty to me, FWIW.

martint · 2021-01-07T22:56:12Z

And of course that is not an upwards-compatible change

It's not, but it's also technically a "bug" (i.e., it should've never worked that way). For compatibility with Hive and Spark and anything else that might read parquet files from Hive tables, we should fix it. It's also a source of many questions from users who run into this unexpected behavior. I think the main concern right now is about decoupling the ORC vs Parquet behavior -- they are a little entangled right now, at least in terms of what configuration options affect what behavior. @findepi, what were the other related options that overlap with both formats?

Sarrouna · 2021-01-08T14:56:43Z

@martint Following your comment: does Trino support schema evolution with ORC as the data format? I have found many comments and tests, and I'm still unsure which data format to use if I adopt Trino. @findepi recommended ORC to me, but my question is always about schema evolution. Thanks

findepi · 2021-01-08T21:20:06Z

In Hive 3.1.2-6 [..] ORC [...] This means that after a column rename, the column values in files created before the rename are seen by Hive.

@djsstarburst What happens when the table has columns a, b and file has columns b, a? Is the data swapped?
(in theory by-ordinal matching could be a fallback when by-name did not work)
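One possible reading of that fallback idea, sketched under assumptions: match each table column by name first, and fall back to the column's ordinal only when the name is absent from the file. The class name is hypothetical and this is not how Hive or Trino is known to behave; it is just a way to reason about the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "by name, with by-ordinal fallback"; not an existing implementation.
public class NameThenOrdinalMatching
{
    // For each table column, returns the file column index to read, or -1 if nothing can be matched.
    public static List<Integer> match(List<String> tableColumns, List<String> fileColumns)
    {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            int byName = fileColumns.indexOf(tableColumns.get(i));
            if (byName >= 0) {
                // The name exists in the file: use it, regardless of position.
                result.add(byName);
            }
            else if (i < fileColumns.size()) {
                // The name is unknown to the file: fall back to the same ordinal.
                result.add(i);
            }
            else {
                result.add(-1);
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Table (a, b) over a file written as (b, a): names win, so nothing is swapped.
        System.out.println(match(List.of("a", "b"), List.of("b", "a"))); // prints [1, 0]
        // Table (new_name, state) over a file written as (name, state): ordinal fallback.
        System.out.println(match(List.of("new_name", "state"), List.of("name", "state"))); // prints [0, 1]
    }
}
```

A per-column fallback like this can map two table columns to the same file column (for example table (x, a) over file (a, b)), so a real implementation would need a rule for that case.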

what were the other related options that overlap with both formats?

@martint hive.partition-use-column-names for table/partition matching.
We have this by-ordinal today, and I believe Athena has it by-name (cc @pettyjamesm). I do not know about Hive -- this needs to be tested, ideally for both ORC and Parquet.

They are entangled in the sense that if you enable hive.partition-use-column-names, then hive.orc.use-column-names and hive.parquet.use-column-names also need to be true.
I do not know whether this logic is sound or whether it matches Hive behavior.

Of course, this doesn't prevent us from setting hive.parquet.use-column-names to true (if this is the only thing we want to do), as the dependency is a one-way implication only.
Thus we can perhaps make an improvement for Parquet first and face the entanglement when we try to change hive.partition-use-column-names.
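The one-way implication described above can be written down as a small check. This is a sketch with hypothetical names, not the connector's actual validation code; the three booleans stand for hive.partition-use-column-names, hive.orc.use-column-names and hive.parquet.use-column-names.

```java
// Hypothetical sketch of the constraint between the three settings; not Trino's code.
public class UseColumnNamesCheck
{
    // partition-use-column-names implies both per-format settings,
    // but the per-format settings do not imply anything in return.
    public static boolean isConsistent(
            boolean partitionUseColumnNames,
            boolean orcUseColumnNames,
            boolean parquetUseColumnNames)
    {
        return !partitionUseColumnNames || (orcUseColumnNames && parquetUseColumnNames);
    }

    public static void main(String[] args)
    {
        // Changing only the Parquet setting is allowed.
        System.out.println(isConsistent(false, false, true)); // prints true
        // Enabling the partition setting without ORC by-name is rejected.
        System.out.println(isConsistent(true, false, true));  // prints false
    }
}
```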

djsstarburst · 2021-01-08T22:05:30Z

What happens when the table has columns a, b and file has columns b, a? Is the data swapped?
(in theory by-ordinal matching could be a fallback when by-name did not work)

I just added this test, @findepi. It demonstrates that the data is swapped, as seen both by Hive and Presto, and for both transactional and non-transactional ORC tables:

    public void testOrcColumnSwap(boolean transactional)
    {
        withTemporaryTable("test_orc_column_renames", transactional, false, NONE, tableName -> {
            onPresto().executeQuery(format("CREATE TABLE %s (name VARCHAR, state VARCHAR) WITH (format = 'ORC', transactional = %s)", tableName, transactional));
            onPresto().executeQuery(format("INSERT INTO %s VALUES ('Katy', 'CA'), ('Joe', 'WA')", tableName));
            verifySelectForPrestoAndHive("SELECT * FROM " + tableName, "true", row("Katy", "CA"), row("Joe", "WA"));

            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN name TO new_name", tableName));
            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN state TO name", tableName));
            onPresto().executeQuery(format("ALTER TABLE %s RENAME COLUMN new_name TO state", tableName));
            log.info("This shows that Presto and Hive match ORC columns by position, so the values appear swapped after the column names are swapped");
            verifySelectForPrestoAndHive("SELECT state, name FROM " + tableName, "TRUE", row("Katy", "CA"), row("Joe", "WA"));
        });
    }

findepi mentioned this issue May 8, 2020

Alter table and select throws Exception #3668

Closed

findepi added the bug Something isn't working label May 9, 2020

findepi mentioned this issue Jun 10, 2020

Support schema evolution by default #3983

Closed

ArvinZheng mentioned this issue Jul 2, 2020

Presto does not support to read ORC structs by the ordinal of Hive metadata #4321

Open

findepi mentioned this issue Jan 8, 2021

Allow case-insensitive fieldname matching for struct coercion in hive connector #5575

Closed

findepi mentioned this issue Jan 27, 2021

Change Trino to match Hive on column rename/drop/add #6479

Merged

electrum closed this as completed in #6479 Feb 3, 2021
