Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieve the partition storage location from the metastore #11864

Merged
merged 2 commits into from
Apr 13, 2022

Conversation

findinpath
Copy link
Contributor

Description

Follow the guideline outlined in Hive's source code:

Users should always use the MetaStore API to get the path
name for a partition.
Users should not directly take partition values and turn
it into a path name by themselves, because this internal
logic below may change in the future.

See org.apache.hadoop.hive.common.FileUtils.charToEscape for details.

Using the guideline outlined above comes in handy when dealing with
characters that need escaping in the path of the partition either
on HMS or on HDFS side.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Hive connector

How would you describe this change to a non-technical end user or system administrator?

Fix the comparison mechanism for the partition names in the system.sync_partition_metadata stored procedure.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Hive
* Fix the comparison mechanism for the partition names in the `system.sync_partition_metadata` stored procedure.

@ebyhr
Copy link
Member

ebyhr commented Apr 8, 2022

Does this PR resolve the same issue of #11719?

List<String> partitionsInMetastore = metastore.getPartitionsByNames(schemaName, tableName, partitionsNamesInMetastore).values().stream()
.filter(Optional::isPresent).map(Optional::get)
.map(partition -> new Path(partition.getStorage().getLocation()).toUri())
.map(uri -> tableLocation.toUri().relativize(uri).getPath())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync_partition_metadata assumes the partitions names match locations
is it still well-defined if they don't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the question.
The logic here is relatively similar to what is being done in io.trino.plugin.hive.HiveSplitManager where the partition names are retrieved first and then the corresponding partitions are retrieved afterwards by names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sync_partition_metadata relies on certain naming convention.
Split manager shouldn't and doesn't rely on that convention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What changes here? Were the partition names returned as escaped or unescaped? In the new test you added, the locations in HDFS are already escaped.

@findinpath
Copy link
Contributor Author

@nineinchnick I seem to have added a duplicate for #11719 . Sorry for not checking before whether there is already a PR for this issue in progress. :(

Thank you @ebyhr for the heads up.

@findinpath findinpath force-pushed the hive-sync-partition-metadata branch from 9209583 to c234aeb Compare April 8, 2022 08:18
@findepi
Copy link
Member

findepi commented Apr 8, 2022

@nineinchnick I seem to have added a duplicate for #11719 . Sorry for not checking before whether there is already a PR for this issue in progress. :(

the two are significantly different.

Copy link
Member

@nineinchnick nineinchnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11719 tries to address a different problem, where the locations read from the file system are not escaped, but HMS is returning them escaped and sync_partition_metadata tries to add them again on the second run and fails with an AlreadyExists exception.

@@ -64,6 +64,33 @@ public void testAddPartition()
cleanup(tableName);
}

@Test(groups = {HIVE_PARTITIONING, SMOKE, TRINO_JDBC})
@Flaky(issue = ERROR_COMMITTING_WRITE_TO_HIVE_ISSUE, match = ERROR_COMMITTING_WRITE_TO_HIVE_MATCH)
public void testAddPartitionContainingCharactersThatNeedUrlEncoding()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are already tests for adding and dropping partitions, maybe it's enough to add new dirs in the prepare() method instead of adding more tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. There are already tests for adding/dropping partitions. However, the newly added tests deal with special characters which are getting encoded in HMS (and on HDFS).

List<String> partitionsInMetastore = metastore.getPartitionsByNames(schemaName, tableName, partitionsNamesInMetastore).values().stream()
.filter(Optional::isPresent).map(Optional::get)
.map(partition -> new Path(partition.getStorage().getLocation()).toUri())
.map(uri -> tableLocation.toUri().relativize(uri).getPath())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What changes here? Were the partition names returned as escaped or unescaped? In the new test you added, the locations in HDFS are already escaped.

Follow the guideline outlined in Hive's source code:

> Users should always use the MetaStore API to get the path
> name for a partition.
> Users should not directly take partition values and turn
> it into a path name by themselves, because this internal
> logic below may change in the future.

See `org.apache.hadoop.hive.common.FileUtils.charToEscape` for details.

Using the guideline outlined above comes in handy when dealing with
characters that need escaping in the path of the partition either
on HMS or on HDFS side.
Follow the guideline outlined in Hive's source code:

> Users should always use the MetaStore API to get the path
> name for a partition.
> Users should not directly take partition values and turn
> it into a path name by themselves, because this internal
> logic below may change in the future.

See `org.apache.hadoop.hive.common.FileUtils.charToEscape` for details.

Using the guideline outlined above comes in handy when dealing with
characters that need escaping in the path of the partition either
on HMS or on HDFS side.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants