-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve iceberg hive statistics documentation #22162
Improve iceberg hive statistics documentation #22162
Conversation
Codenotify: Notifying subscribers in CODENOTIFY files for diff 29cc196...dcf052d.
|
- Documented the configuration properties for merging statistics from the hive metastore. - Added a single integration test to ensure stats are always stored when ANALYZE is run, regardless of whether the iceberg.hive-statistics-merge-strategy is set
50824eb
to
dcf052d
Compare
==================================================== ============================================================ ============= | ||
Property Name Description Default | ||
==================================================== ============================================================ ============= | ||
``iceberg.hive-statistics-merge-strategy`` The strategy to use when merging statistics from the Hive ``NONE`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZacBlanco This was already documented last week, maybe we can edit that if we want to modify the explanation: https://github.com/prestodb/presto/blame/29cc1963bb6f1f268e4a46044385616e715e2214/presto-docs/src/main/sphinx/connector/iceberg.rst#L228
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks. I think my fork wasn't updated. I'll modify this PR
Session noMergeSession = Session.builder(getSession()) | ||
.setCatalogSessionProperty("iceberg", HIVE_METASTORE_STATISTICS_MERGE_STRATEGY, NONE.name()) | ||
.build(); | ||
assertQuerySucceeds("CREATE TABLE analyzeWithoutProp as SELECT * FROM orders"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a case for a partitioned table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my test:
❯ presto-cli \
--catalog iceberg \
--schema tpcds_sf1000_parquet_varchar_iceberg_part \
--server https://***************
presto:tpcds_sf1000_parquet_varchar_iceberg_part> set session iceberg.hive_statistics_merge_strategy='USE_NULLS_FRACTION_AND_NDV';
SET SESSION
presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory;
column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_val>
----------------------+---------------+-----------------------+---------------------+-----------+-----------+--------->
inv_item_sk | 2.271521964E9 | NULL | 0.0 | NULL | 1 | 300000 >
inv_warehouse_sk | 1.93783029E8 | NULL | 0.0 | NULL | 1 | 20 >
inv_quantity_on_hand | 3318960.0 | NULL | 0.05000479948914432 | NULL | 0 | 1000 >
inv_date_sk | 6932.0 | NULL | 0.0 | NULL | 2450815 | 2452635 >
NULL | NULL | NULL | NULL | 7.83E8 | NULL | NULL >
(5 rows)
Query 20240312_081321_00012_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 0:04, server-side: 0:04] [0 rows, 0B] [0 rows/s, 0B/s]
presto:tpcds_sf1000_parquet_varchar_iceberg_part> analyze inventory;
ANALYZE: 783000000 rows
Query 20240312_081333_00013_65hxe, FINISHED, 9 nodes
Splits: 295 total, 295 done (100.00%)
[Latency: client-side: 0:20, server-side: 0:19] [783M rows, 2.3GB] [40.4M rows/s, 122MB/s]
presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory;
column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_val>
----------------------+---------------+-----------------------+---------------------+-----------+-----------+--------->
inv_item_sk | 2.271521964E9 | NULL | 0.0 | NULL | 1 | 300000 >
inv_warehouse_sk | 1.93783029E8 | NULL | 0.0 | NULL | 1 | 20 >
inv_quantity_on_hand | 3318960.0 | NULL | 0.05000479948914432 | NULL | 0 | 1000 >
inv_date_sk | 6932.0 | NULL | 0.0 | NULL | 2450815 | 2452635 >
NULL | NULL | NULL | NULL | 7.83E8 | NULL | NULL >
(5 rows)
Query 20240312_081357_00014_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 452ms, server-side: 347ms] [0 rows, 0B] [0 rows/s, 0B/s]
presto:tpcds_sf1000_parquet_varchar_iceberg_part> show create table inventory;
Create Table
------------------------------------------------------------------------------------
CREATE TABLE iceberg.tpcds_sf1000_parquet_varchar_iceberg_part.inventory (
"inv_item_sk" integer,
"inv_warehouse_sk" integer,
"inv_quantity_on_hand" integer,
"inv_date_sk" integer
)
WITH (
delete_mode = 'copy-on-write',
format = 'PARQUET',
format_version = '2',
location = 's3a://presto-workload/tpcds-sf1000-parquet-iceberg-part/inventory',
partitioning = ARRAY['inv_date_sk']
)
(1 row)
Query 20240312_081417_00015_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 209ms, server-side: 104ms] [0 rows, 0B] [0 rows/s, 0B/s]
presto:tpcds_sf1000_parquet_varchar_iceberg_part>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized we don't return NDV stats for partitioned tables. See #22168
Closing this. I think the current documentation is fine for now |
Description
Motivation and Context
Documentation improvements
Impact
N/A
Test Plan
N/A
Contributor checklist
Release Notes