Improve iceberg hive statistics documentation #22162

ZacBlanco · 2024-03-12T05:45:54Z

Description

Documented the configuration properties for merging statistics from the hive metastore.
Added a single integration test to ensure stats are always stored when ANALYZE is run, regardless of whether the iceberg.hive-statistics-merge-strategy is set

Motivation and Context

Documentation improvements

Impact

N/A

Test Plan

N/A

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== NO RELEASE NOTE ==

github-actions · 2024-03-12T05:46:48Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 29cc196...dcf052d.

Notify	File(s)
@steveburnett	presto-docs/src/main/sphinx/connector/iceberg.rst

- Documented the configuration properties for merging statistics from the hive metastore. - Added a single integration test to ensure stats are always stored when ANALYZE is run, regardless of whether the iceberg.hive-statistics-merge-strategy is set

imjalpreet · 2024-03-12T07:08:08Z

presto-docs/src/main/sphinx/connector/iceberg.rst

+==================================================== ============================================================ =============
+Property Name                                        Description                                                  Default
+==================================================== ============================================================ =============
+``iceberg.hive-statistics-merge-strategy``           The strategy to use when merging statistics from the Hive    ``NONE``


@ZacBlanco This was already documented last week, maybe we can edit that if we want to modify the explanation: https://github.com/prestodb/presto/blame/29cc1963bb6f1f268e4a46044385616e715e2214/presto-docs/src/main/sphinx/connector/iceberg.rst#L228

Ah thanks. I think my fork wasn't updated. I'll modify this PR

ethanyzhang · 2024-03-12T08:10:59Z

presto-iceberg/src/test/java/com/facebook/presto/iceberg/hive/TestIcebergHiveStatistics.java

+        Session noMergeSession = Session.builder(getSession())
+                .setCatalogSessionProperty("iceberg", HIVE_METASTORE_STATISTICS_MERGE_STRATEGY, NONE.name())
+                .build();
+        assertQuerySucceeds("CREATE TABLE analyzeWithoutProp as SELECT * FROM orders");


Can you add a case for a partitioned table?

This is my test:

❯ presto-cli \ --catalog iceberg \ --schema tpcds_sf1000_parquet_varchar_iceberg_part \ --server https://*************** presto:tpcds_sf1000_parquet_varchar_iceberg_part> set session iceberg.hive_statistics_merge_strategy='USE_NULLS_FRACTION_AND_NDV'; SET SESSION presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory; column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_val> ----------------------+---------------+-----------------------+---------------------+-----------+-----------+---------> inv_item_sk | 2.271521964E9 | NULL | 0.0 | NULL | 1 | 300000 > inv_warehouse_sk | 1.93783029E8 | NULL | 0.0 | NULL | 1 | 20 > inv_quantity_on_hand | 3318960.0 | NULL | 0.05000479948914432 | NULL | 0 | 1000 > inv_date_sk | 6932.0 | NULL | 0.0 | NULL | 2450815 | 2452635 > NULL | NULL | NULL | NULL | 7.83E8 | NULL | NULL > (5 rows) Query 20240312_081321_00012_65hxe, FINISHED, 1 node Splits: 1 total, 1 done (100.00%) [Latency: client-side: 0:04, server-side: 0:04] [0 rows, 0B] [0 rows/s, 0B/s] presto:tpcds_sf1000_parquet_varchar_iceberg_part> analyze inventory; ANALYZE: 783000000 rows Query 20240312_081333_00013_65hxe, FINISHED, 9 nodes Splits: 295 total, 295 done (100.00%) [Latency: client-side: 0:20, server-side: 0:19] [783M rows, 2.3GB] [40.4M rows/s, 122MB/s] presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory; column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_val> ----------------------+---------------+-----------------------+---------------------+-----------+-----------+---------> inv_item_sk | 2.271521964E9 | NULL | 0.0 | NULL | 1 | 300000 > inv_warehouse_sk | 1.93783029E8 | NULL | 0.0 | NULL | 1 | 20 > inv_quantity_on_hand | 3318960.0 | NULL | 0.05000479948914432 | NULL | 0 | 1000 > inv_date_sk | 6932.0 | NULL | 0.0 | NULL | 2450815 | 2452635 > NULL | NULL | NULL | NULL | 7.83E8 | NULL | NULL > (5 rows) Query 20240312_081357_00014_65hxe, FINISHED, 1 node Splits: 1 total, 1 done (100.00%) [Latency: client-side: 452ms, server-side: 347ms] [0 rows, 0B] [0 rows/s, 0B/s] presto:tpcds_sf1000_parquet_varchar_iceberg_part> show create table inventory; Create Table ------------------------------------------------------------------------------------ CREATE TABLE iceberg.tpcds_sf1000_parquet_varchar_iceberg_part.inventory ( "inv_item_sk" integer, "inv_warehouse_sk" integer, "inv_quantity_on_hand" integer, "inv_date_sk" integer ) WITH ( delete_mode = 'copy-on-write', format = 'PARQUET', format_version = '2', location = 's3a://presto-workload/tpcds-sf1000-parquet-iceberg-part/inventory', partitioning = ARRAY['inv_date_sk'] ) (1 row) Query 20240312_081417_00015_65hxe, FINISHED, 1 node Splits: 1 total, 1 done (100.00%) [Latency: client-side: 209ms, server-side: 104ms] [0 rows, 0B] [0 rows/s, 0B/s] presto:tpcds_sf1000_parquet_varchar_iceberg_part>

I realized we don't return NDV stats for partitioned tables. See #22168

ZacBlanco · 2024-03-19T20:33:45Z

Closing this. I think the current documentation is fine for now

ZacBlanco requested a review from a team as a code owner March 12, 2024 05:45

ZacBlanco requested a review from presto-oss March 12, 2024 05:45

ZacBlanco mentioned this pull request Mar 12, 2024

TPC-DS SF-1K Q72 extremely slow in Iceberg #22148

Closed

ZacBlanco force-pushed the upstream-iceberg-hive-stats-merge-docs branch from 50824eb to dcf052d Compare March 12, 2024 05:56

imjalpreet reviewed Mar 12, 2024

View reviewed changes

ethanyzhang reviewed Mar 12, 2024

View reviewed changes

ethanyzhang mentioned this pull request Mar 12, 2024

TPC-DS SF-1K 13 Prestissimo query results mismatch when data is partitioned #22149

Closed

ZacBlanco closed this Mar 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve iceberg hive statistics documentation #22162

Improve iceberg hive statistics documentation #22162

ZacBlanco commented Mar 12, 2024

github-actions bot commented Mar 12, 2024 •

edited

Loading

imjalpreet Mar 12, 2024

ZacBlanco Mar 12, 2024

ethanyzhang Mar 12, 2024

ethanyzhang Mar 12, 2024

ZacBlanco Mar 12, 2024

ZacBlanco commented Mar 19, 2024

Improve iceberg hive statistics documentation #22162

Improve iceberg hive statistics documentation #22162

Conversation

ZacBlanco commented Mar 12, 2024

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

github-actions bot commented Mar 12, 2024 • edited Loading

imjalpreet Mar 12, 2024

Choose a reason for hiding this comment

ZacBlanco Mar 12, 2024

Choose a reason for hiding this comment

ethanyzhang Mar 12, 2024

Choose a reason for hiding this comment

ethanyzhang Mar 12, 2024

Choose a reason for hiding this comment

ZacBlanco Mar 12, 2024

Choose a reason for hiding this comment

ZacBlanco commented Mar 19, 2024

github-actions bot commented Mar 12, 2024 •

edited

Loading