
Parquet Hive schema evolution checks based on column names #12212

Closed
piyushnarang opened this issue Jan 11, 2019 · 2 comments · Fixed by #16011
@piyushnarang (Contributor)

We occasionally run into an issue with our Hive Parquet partitioned tables when users introduce new columns in the middle of the table's schema.
For example:

CREATE TABLE my_table (
    day varchar,
    hour integer,
    user_id_fast bigint)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['day_part'])
;

If new columns are added in the middle:

CREATE TABLE my_table (
    day varchar,
    is_attributed boolean,
    platform varchar,
    hour integer,
    user_id_fast bigint)
...
;

Attempting to access one of the partitions written with the older schema results in an error, because Presto matches the types of the columns in the partition and the table by index. Index 1 ends up being an integer in one schema and a boolean in the other, which triggers a type-mismatch error (https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java#L299). The behavior is similar for structs: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveCoercionPolicy.java#L99

From what I understand, Hive performs this check based on column names in the case of Parquet. Is this something we can implement here as well? (I'm not sure what the correct behavior should be for formats like ORC, so we may need different checks for different formats.)

Along similar lines: #8911. Once the flag hive.parquet.use-column-names is set, we use column names while reading Parquet data, but this flag is currently not consulted at the schema-checking step.
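To illustrate the difference between the two matching strategies (this is only a hypothetical sketch, not Presto's actual HiveSplitManager/HiveCoercionPolicy code; column types are simplified to plain strings and a `SchemaCheck` helper class is invented for the example), a name-based check would look up each table column in the partition schema by name rather than comparing columns at the same ordinal position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch comparing a partition schema against a table schema.
// Each column is a {name, type} pair. Index-based matching (the behavior
// described above) reports false conflicts when columns are inserted in the
// middle; name-based matching (Hive's Parquet behavior) tolerates it.
public class SchemaCheck {
    // Names of table columns whose types conflict under index-based matching.
    static List<String> mismatchByIndex(List<String[]> table, List<String[]> partition) {
        List<String> conflicts = new ArrayList<>();
        for (int i = 0; i < Math.min(table.size(), partition.size()); i++) {
            if (!table.get(i)[1].equals(partition.get(i)[1])) {
                conflicts.add(table.get(i)[0]);
            }
        }
        return conflicts;
    }

    // Names of table columns whose types conflict under name-based matching.
    // Columns missing from the partition are simply read as NULL, not errors.
    static List<String> mismatchByName(List<String[]> table, List<String[]> partition) {
        Map<String, String> partitionTypes = new HashMap<>();
        for (String[] col : partition) {
            partitionTypes.put(col[0], col[1]);
        }
        List<String> conflicts = new ArrayList<>();
        for (String[] col : table) {
            String partitionType = partitionTypes.get(col[0]);
            if (partitionType != null && !partitionType.equals(col[1])) {
                conflicts.add(col[0]);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Old partition schema vs. new table schema from the example above.
        List<String[]> partition = Arrays.asList(
            new String[]{"day", "varchar"},
            new String[]{"hour", "integer"},
            new String[]{"user_id_fast", "bigint"});
        List<String[]> table = Arrays.asList(
            new String[]{"day", "varchar"},
            new String[]{"is_attributed", "boolean"},
            new String[]{"platform", "varchar"},
            new String[]{"hour", "integer"},
            new String[]{"user_id_fast", "bigint"});

        System.out.println("by index: " + mismatchByIndex(table, partition));
        System.out.println("by name:  " + mismatchByName(table, partition));
    }
}
```

With the schemas from the example, the index-based check reports spurious conflicts for `is_attributed` and `platform`, while the name-based check reports none, since every name present in both schemas has the same type.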

@piyushnarang (Contributor, Author)

@dain / @nezihyigitbasi - what do you guys think about this? (checking schema fields by columnName instead of index)


stale bot commented Jan 17, 2021

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.
