
Parquet Hive schema evolution checks based on column names #12212

Closed
piyushnarang opened this issue Jan 11, 2019 · 2 comments · Fixed by #16011
@piyushnarang (Contributor)

We occasionally run into an issue with our Hive Parquet partitioned tables when users introduce new columns in the middle of the table's schema.
For example:

CREATE TABLE my_table (
    day varchar,
    hour integer,
    user_id_fast bigint)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['day_part'])
;

If new columns are added in the middle:

CREATE TABLE my_table (
    day varchar,
    is_attributed boolean,
    platform varchar,
    hour integer,
    user_id_fast bigint)
...
;

Attempting to access one of the partitions written with the older schema results in an error, because Presto matches the types of the columns in the partition and the table by index. Index 1 ends up being an integer in one schema and a boolean in the other, which triggers a type-mismatch error (https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitManager.java#L299). The behavior is similar for structs: https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveCoercionPolicy.java#L99

From what I understand, Hive performs this check based on column names in the case of Parquet. Is this something we can implement here as well? (I'm not sure what the correct behavior should be for formats like ORC, so we may need different checks for different formats.)

Along similar lines: #8911. Once the flag hive.parquet.use-column-names is set, we use column names while reading Parquet data, but this flag is currently not consulted at the schema-checking step.
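To illustrate the difference between the two matching strategies (this is only a hypothetical sketch, not Presto's actual HiveSplitManager/HiveCoercionPolicy code; column types are simplified to plain strings and a `SchemaCheck` helper class is invented for the example), a name-based check would look up each table column in the partition schema by name rather than comparing columns at the same ordinal position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch comparing a partition schema against a table schema.
// Each column is a {name, type} pair. Index-based matching (the behavior
// described above) reports false conflicts when columns are inserted in the
// middle; name-based matching (Hive's Parquet behavior) tolerates it.
public class SchemaCheck {
    // Names of table columns whose types conflict under index-based matching.
    static List<String> mismatchByIndex(List<String[]> table, List<String[]> partition) {
        List<String> conflicts = new ArrayList<>();
        for (int i = 0; i < Math.min(table.size(), partition.size()); i++) {
            if (!table.get(i)[1].equals(partition.get(i)[1])) {
                conflicts.add(table.get(i)[0]);
            }
        }
        return conflicts;
    }

    // Names of table columns whose types conflict under name-based matching.
    // Columns missing from the partition are simply read as NULL, not errors.
    static List<String> mismatchByName(List<String[]> table, List<String[]> partition) {
        Map<String, String> partitionTypes = new HashMap<>();
        for (String[] col : partition) {
            partitionTypes.put(col[0], col[1]);
        }
        List<String> conflicts = new ArrayList<>();
        for (String[] col : table) {
            String partitionType = partitionTypes.get(col[0]);
            if (partitionType != null && !partitionType.equals(col[1])) {
                conflicts.add(col[0]);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Old partition schema vs. new table schema from the example above.
        List<String[]> partition = Arrays.asList(
            new String[]{"day", "varchar"},
            new String[]{"hour", "integer"},
            new String[]{"user_id_fast", "bigint"});
        List<String[]> table = Arrays.asList(
            new String[]{"day", "varchar"},
            new String[]{"is_attributed", "boolean"},
            new String[]{"platform", "varchar"},
            new String[]{"hour", "integer"},
            new String[]{"user_id_fast", "bigint"});

        System.out.println("by index: " + mismatchByIndex(table, partition));
        System.out.println("by name:  " + mismatchByName(table, partition));
    }
}
```

With the schemas from the example, the index-based check reports spurious conflicts for `is_attributed` and `platform`, while the name-based check reports none, since every name present in both schemas has the same type.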

@piyushnarang (Contributor, Author)

@dain / @nezihyigitbasi - what do you guys think about this? (checking schema fields by columnName instead of index)


stale bot commented Jan 17, 2021

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.
