
[Iceberg] variable-width column data sizes are generally wrong #22208

Closed · ZacBlanco opened this issue Mar 14, 2024 · 2 comments · Fixed by #22327
Assignee: ZacBlanco
Labels: bug, iceberg (Apache Iceberg related)

ZacBlanco (Contributor) commented:

Your Environment

Any Iceberg table

Expected Behavior

Data size statistics should accurately reflect the size in memory when operating in Presto.

Current Behavior

Iceberg's TableStatisticsMaker tries to use the Iceberg manifest file information to calculate the data size for each column. The optimizer uses this information for decisions such as choosing the join distribution type based on row size. However, per the Iceberg spec, this field is actually the on-disk data size, not the in-memory size, which is what we care about.

It turns out that most, if not all, of the data size statistics Iceberg reports are off by a factor of roughly 3-5x. The exact amount varies with the on-disk storage format, compression, encryption, etc.

Possible Solution

There are two dimensions to the solution:

  1. Do we want to allow ANALYZE on iceberg tables to generate stats for data size?
  2. Can we still utilize the manifest information, even when stats don't exist?
These combine into four possibilities:

                             | without ANALYZE                              | with ANALYZE
-----------------------------+----------------------------------------------+---------------------------------------------------
 don't change manifest info  | don't change the code at all                 | just add data size collection support to Iceberg
 do change manifest info     | add some heuristics to TableStatisticsMaker  | add support for correct data size with ANALYZE and try to improve TableStatisticsMaker with heuristics

Concretely, the options are:

  1. Don't return data size stats at all, except after an ANALYZE.
  2. Return the incorrect stats, but allow ANALYZE to overwrite and improve them.
  3. Keep the incorrect stats, but apply an adjustment factor based on file type/configuration (see the sketch below).
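
A minimal sketch of what option 3 could look like, assuming a hypothetical helper class with illustrative expansion ratios (none of these names or numbers come from Presto or Iceberg):

import java.util.Map;

// Hypothetical heuristic for option 3: scale the on-disk column size from the
// Iceberg manifest by an assumed expansion ratio so the optimizer sees
// something closer to the in-memory size.
public final class DataSizeHeuristic
{
    // Illustrative ratios only; real values vary with codec, encoding, and data.
    private static final Map<String, Double> FORMAT_EXPANSION = Map.of(
            "PARQUET", 4.5,
            "ORC", 4.0,
            "AVRO", 2.0);

    private DataSizeHeuristic() {}

    public static double estimateInMemoryBytes(String fileFormat, long onDiskBytes)
    {
        // Fall back to a conservative middle-of-the-road factor for unknown formats.
        return onDiskBytes * FORMAT_EXPANSION.getOrDefault(fileFormat, 3.0);
    }
}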

Steps to Reproduce

Use the IcebergQueryRunner and execute SHOW STATS on any tpch/ds table.

SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  167815.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240314_192652_00012_nmth9, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 124ms, server-side: 83ms] [0 rows, 0B] [0 rows/s, 0B/s]

Then check the value returned by the aggregation function used to calculate data size (sum_data_size_for_stats):

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)

Query 20240314_192726_00013_nmth9, FINISHED, 4 nodes
Splits: 9 total, 9 done (100.00%)
[Latency: client-side: 295ms, server-side: 244ms] [15K rows, 228KB] [61.5K rows/s, 934KB/s]

727364 > 167815 by a factor of about 4.3.

Context

Can cause query slowdowns if incorrect join distribution type is chosen

ZacBlanco added the bug and iceberg (Apache Iceberg related) labels on Mar 14, 2024
ZacBlanco self-assigned this on Mar 14, 2024
ZacBlanco (Contributor, Author) commented:

cc: @aaneja @ClarenceThreepwood

ZacBlanco (Contributor, Author) commented on Mar 16, 2024:

Some more follow-up information:

Comparison to Trino

I also tested Trino's Iceberg implementation and found that its data sizes for variable-width Iceberg columns are also generally wrong.

trino:tpch> SHOW STATS FOR (SELECT comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  554688.0 |               14756.0 |            0.0 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |   15000.0 | NULL      | NULL
(2 rows)

----

trino:tpch> select "$internal$sum_data_size_for_stats"(comment) from orders;
 _col0
--------
 727364
(1 row)

Understanding the Iceberg code

After digging into the Iceberg library, I found that (at least for the Parquet format) the column stats are generated by this line:

https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127

The call used here is the Parquet file footer's getTotalSize method, which represents the bytes on disk. However, Parquet footers also expose a getTotalUncompressedSize method. I tested how the column statistics compare when generated from this value instead and found that it is much closer to the true value.
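
A minimal sketch of the substitution, assuming the org.apache.parquet.hadoop.metadata API (both size methods exist on ColumnChunkMetaData; the helper itself is illustrative, not the actual Iceberg code):

import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Sketch: accumulate per-column sizes from a Parquet footer, as Iceberg's
// ParquetUtil does, but using the uncompressed size.
public final class FooterSizes
{
    private FooterSizes() {}

    public static Map<String, Long> columnSizes(ParquetMetadata footer)
    {
        Map<String, Long> sizes = new HashMap<>();
        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
                // column.getTotalSize() is the compressed on-disk size that
                // Iceberg records today; getTotalUncompressedSize() is much
                // closer to the in-memory size.
                sizes.merge(column.getPath().toDotString(),
                        column.getTotalUncompressedSize(), Long::sum);
            }
        }
        return sizes;
    }
}

With a change along these lines applied locally, the reproduction from above reports: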

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240316_002952_00016_muzsz, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 76ms, server-side: 61ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)

IMO we should be able to contribute this change to the Iceberg community, but it might be hard to get them to accept it, for two reasons:

  1. The spec states that the data size field in the manifest should be the on-disk size. I don't think this is a very useful metric for the optimizer unless we can consistently convert it to a "true" size value.
  2. Other file formats don't support reading the uncompressed size - e.g. Iceberg's ORC implementation calls getBytesOnDisk, and there is no corresponding method for the uncompressed or "in-memory" size (see the sketch below).
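
For contrast, a minimal sketch of the ORC side, assuming the org.apache.orc reader API (the comment above already notes getBytesOnDisk; the helper itself is illustrative):

import java.io.IOException;

import org.apache.orc.ColumnStatistics;
import org.apache.orc.Reader;

// Sketch: the only per-column size ORC statistics expose is the on-disk byte
// count; there is no uncompressed counterpart to switch to.
public final class OrcSizes
{
    private OrcSizes() {}

    public static long columnBytesOnDisk(Reader reader, int columnId) throws IOException
    {
        ColumnStatistics[] stats = reader.getStatistics();
        return stats[columnId].getBytesOnDisk();
    }
}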
