Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve iceberg hive statistics documentation #22162

Conversation

ZacBlanco
Copy link
Contributor

Description

  • Documented the configuration properties for merging statistics from the hive metastore.
  • Added a single integration test to ensure stats are always stored when ANALYZE is run, regardless of whether the iceberg.hive-statistics-merge-strategy is set

Motivation and Context

Documentation improvements

Impact

N/A

Test Plan

N/A

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

Copy link

github-actions bot commented Mar 12, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 29cc196...dcf052d.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/connector/iceberg.rst

- Documented the configuration properties for merging statistics from
  the hive metastore.
- Added a single integration test to ensure stats are always stored when
  ANALYZE is run, regardless of whether the
  iceberg.hive-statistics-merge-strategy is set
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-hive-stats-merge-docs branch from 50824eb to dcf052d Compare March 12, 2024 05:56
==================================================== ============================================================ =============
Property Name Description Default
==================================================== ============================================================ =============
``iceberg.hive-statistics-merge-strategy`` The strategy to use when merging statistics from the Hive ``NONE``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZacBlanco This was already documented last week, maybe we can edit that if we want to modify the explanation: https://github.com/prestodb/presto/blame/29cc1963bb6f1f268e4a46044385616e715e2214/presto-docs/src/main/sphinx/connector/iceberg.rst#L228

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks. I think my fork wasn't updated. I'll modify this PR

Session noMergeSession = Session.builder(getSession())
.setCatalogSessionProperty("iceberg", HIVE_METASTORE_STATISTICS_MERGE_STRATEGY, NONE.name())
.build();
assertQuerySucceeds("CREATE TABLE analyzeWithoutProp as SELECT * FROM orders");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case for a partitioned table?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my test:

❯ presto-cli \
--catalog iceberg \
--schema tpcds_sf1000_parquet_varchar_iceberg_part \
--server https://***************
presto:tpcds_sf1000_parquet_varchar_iceberg_part> set session iceberg.hive_statistics_merge_strategy='USE_NULLS_FRACTION_AND_NDV';
SET SESSION
presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory;
     column_name      |   data_size   | distinct_values_count |   nulls_fraction    | row_count | low_value | high_val>
----------------------+---------------+-----------------------+---------------------+-----------+-----------+--------->
 inv_item_sk          | 2.271521964E9 | NULL                  |                 0.0 | NULL      | 1         | 300000  >
 inv_warehouse_sk     |  1.93783029E8 | NULL                  |                 0.0 | NULL      | 1         | 20      >
 inv_quantity_on_hand |     3318960.0 | NULL                  | 0.05000479948914432 | NULL      | 0         | 1000    >
 inv_date_sk          |        6932.0 | NULL                  |                 0.0 | NULL      | 2450815   | 2452635 >
 NULL                 | NULL          | NULL                  | NULL                |    7.83E8 | NULL      | NULL    >
(5 rows)

Query 20240312_081321_00012_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 0:04, server-side: 0:04] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpcds_sf1000_parquet_varchar_iceberg_part> analyze inventory;
ANALYZE: 783000000 rows

Query 20240312_081333_00013_65hxe, FINISHED, 9 nodes
Splits: 295 total, 295 done (100.00%)
[Latency: client-side: 0:20, server-side: 0:19] [783M rows, 2.3GB] [40.4M rows/s, 122MB/s]

presto:tpcds_sf1000_parquet_varchar_iceberg_part> show stats for inventory;
     column_name      |   data_size   | distinct_values_count |   nulls_fraction    | row_count | low_value | high_val>
----------------------+---------------+-----------------------+---------------------+-----------+-----------+--------->
 inv_item_sk          | 2.271521964E9 | NULL                  |                 0.0 | NULL      | 1         | 300000  >
 inv_warehouse_sk     |  1.93783029E8 | NULL                  |                 0.0 | NULL      | 1         | 20      >
 inv_quantity_on_hand |     3318960.0 | NULL                  | 0.05000479948914432 | NULL      | 0         | 1000    >
 inv_date_sk          |        6932.0 | NULL                  |                 0.0 | NULL      | 2450815   | 2452635 >
 NULL                 | NULL          | NULL                  | NULL                |    7.83E8 | NULL      | NULL    >
(5 rows)

Query 20240312_081357_00014_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 452ms, server-side: 347ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpcds_sf1000_parquet_varchar_iceberg_part> show create table inventory;
                                    Create Table
------------------------------------------------------------------------------------
 CREATE TABLE iceberg.tpcds_sf1000_parquet_varchar_iceberg_part.inventory (
    "inv_item_sk" integer,
    "inv_warehouse_sk" integer,
    "inv_quantity_on_hand" integer,
    "inv_date_sk" integer
 )
 WITH (
    delete_mode = 'copy-on-write',
    format = 'PARQUET',
    format_version = '2',
    location = 's3a://presto-workload/tpcds-sf1000-parquet-iceberg-part/inventory',
    partitioning = ARRAY['inv_date_sk']
 )
(1 row)

Query 20240312_081417_00015_65hxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 209ms, server-side: 104ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpcds_sf1000_parquet_varchar_iceberg_part>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized we don't return NDV stats for partitioned tables. See #22168

@ZacBlanco
Copy link
Contributor Author

Closing this. I think the current documentation is fine for now

@ZacBlanco ZacBlanco closed this Mar 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants