Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for partition schema evolution for HUDI tables #16348

Merged
merged 1 commit into from Jul 14, 2021

Conversation

imjalpreet
Copy link
Member

This is a follow-up PR for #16011. This PR enables partition schema evolution for HUDI tables.

== RELEASE NOTES ==

Hive Changes
* Add support for allowing to match columns between table and partition schemas by names for HUDI tables. This is enabled when configuration property ``hive.parquet.use-column-names`` or the hive catalog session property ``parquet_use_column_names`` is set to true. By default they are mapped by index. 

@imjalpreet imjalpreet requested a review from zhenxiao June 28, 2021 09:03
Copy link
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor thing

Optional<HiveStorageFormat> hiveStorageFormat = getHiveStorageFormat(storageFormat);

// Use Hive Storage Format as Parquet if table is of HUDI format
Optional<HiveStorageFormat> finalHiveStorageFormat = (!hiveStorageFormat.isPresent() && isHudiFormat(storageFormat)) ? Optional.of(PARQUET) : hiveStorageFormat;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:
s/finalHiveStorageFormat/resolvedStorageFormat/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, made the requested change.

@@ -385,7 +389,11 @@ public CounterStat getHighMemorySplitSource()
}
}

Optional<HiveStorageFormat> storageFormat = getHiveStorageFormat(table.getStorage().getStorageFormat());
StorageFormat storageFormat = table.getStorage().getStorageFormat();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we check session property or configuration here? according to the release note:
This is enabled when configuration property hive.parquet.use-column-names or the hive catalog session property parquet_use_column_names is set to true. By default they are mapped by index.

Copy link
Member Author

@imjalpreet imjalpreet Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao Line 392 and Line 393 are required in any case. Only next couple of lines are not mandatory if hive.parquet.use-column-names is not set to true. But in any case value of this variable is only used in one method getTableToPartitionMapping which has the required check at line 528.

Anyways, I have added a check here as well, let me know if you feel it is not necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice

Copy link
Member Author

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao Sorry for the delay, I was away for the past few days.

I have made the requested changes, please let me know your views.

@@ -385,7 +389,11 @@ public CounterStat getHighMemorySplitSource()
}
}

Optional<HiveStorageFormat> storageFormat = getHiveStorageFormat(table.getStorage().getStorageFormat());
StorageFormat storageFormat = table.getStorage().getStorageFormat();
Copy link
Member Author

@imjalpreet imjalpreet Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao Line 392 and Line 393 are required in any case. Only next couple of lines are not mandatory if hive.parquet.use-column-names is not set to true. But in any case value of this variable is only used in one method getTableToPartitionMapping which has the required check at line 528.

Anyways, I have added a check here as well, let me know if you feel it is not necessary.

Optional<HiveStorageFormat> hiveStorageFormat = getHiveStorageFormat(storageFormat);

// Use Hive Storage Format as Parquet if table is of HUDI format
Optional<HiveStorageFormat> finalHiveStorageFormat = (!hiveStorageFormat.isPresent() && isHudiFormat(storageFormat)) ? Optional.of(PARQUET) : hiveStorageFormat;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, made the requested change.

Copy link
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, @imjalpreet
looks good. once minor issue

@@ -385,7 +389,11 @@ public CounterStat getHighMemorySplitSource()
}
}

Optional<HiveStorageFormat> storageFormat = getHiveStorageFormat(table.getStorage().getStorageFormat());
StorageFormat storageFormat = table.getStorage().getStorageFormat();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks nice

StorageFormat storageFormat = table.getStorage().getStorageFormat();
Optional<HiveStorageFormat> hiveStorageFormat = getHiveStorageFormat(storageFormat);

Optional<HiveStorageFormat> resolvedHiveStorageFormat;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

Optional<HiveStorageFormat> resolvedHiveStorageFormat = hiveStorageFormat;
if (isUseParquetColumnNames(session)) {
 ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao The variable resolvedHiveStorageFormat is being used in a lambda function so it has to be a final or an effectively final variable. If I set it before the if condition it won't remain an effectively final variable as it's value will change for the second time inside the if block.

Due to this reason I had to use an else block.

Let me know if you think I can improve it some other way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might miss something. which lambda function is used for resolvedHiveStorageFormat? seems it is only used in getTableToPartitionMapping?

Copy link
Member Author

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao I have left a comment at the line where the lambda function starts. Can you have a look and let me know in case of any concerns.

}
else {
resolvedHiveStorageFormat = hiveStorageFormat;
}

Iterable<List<HivePartition>> partitionNameBatches = partitionExponentially(hivePartitions, minPartitionBatchSize, maxPartitionBatchSize);
Iterable<List<HivePartitionMetadata>> partitionBatches = transform(partitionNameBatches, partitionBatch -> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhenxiao I was talking about this. You are right that resolvedHiveStorageFormat is only being used in the method getTableToPartitionMapping but that method is being called from this lambda function. The call is on the line 466 in the updated code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get it. you are correct

}
else {
resolvedHiveStorageFormat = hiveStorageFormat;
}

Iterable<List<HivePartition>> partitionNameBatches = partitionExponentially(hivePartitions, minPartitionBatchSize, maxPartitionBatchSize);
Iterable<List<HivePartitionMetadata>> partitionBatches = transform(partitionNameBatches, partitionBatch -> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get it. you are correct

@zhenxiao zhenxiao merged commit 5090612 into prestodb:master Jul 14, 2021
41 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants