
[Iceberg] variable-width column data sizes are generally wrong #22208

Closed · ZacBlanco opened this issue Mar 14, 2024 · 2 comments · Fixed by #22327
Assignee: ZacBlanco
Labels: bug, iceberg (Apache Iceberg related)

ZacBlanco (Contributor) commented:

Your Environment

Any Iceberg table

Expected Behavior

Data size statistics should accurately reflect the size in memory when operating in Presto.

Current Behavior

Iceberg's TableStatisticsMaker tries to use the Iceberg manifest file information to calculate the data size for each column. The optimizer uses this information for decisions such as choosing the join distribution type based on row size. However, per the Iceberg spec, this field is actually the on-disk data size, not the in-memory size, which is what we care about.

It turns out that most, if not all, of the data size statistics Iceberg reports are off by a factor of roughly 3-5x. The exact amount varies with the on-disk storage format, compression, encryption, etc.

Possible Solution

There are two dimensions to the solution:

  1. Do we want to allow ANALYZE on iceberg tables to generate stats for data size?
  2. Can we still utilize the manifest information, even when stats don't exist?
These combine into four possibilities:

                             | without ANALYZE                              | with ANALYZE
-----------------------------+----------------------------------------------+---------------------------------------------------
 don't change manifest info  | don't change the code at all                 | just add data size collection support to Iceberg
 do change manifest info     | add some heuristics to TableStatisticsMaker  | add support for correct data size with ANALYZE and try to improve TableStatisticsMaker with heuristics

Concretely, the options are:

  1. Don't return data size stats at all, except after an ANALYZE.
  2. Return the incorrect stats, but allow ANALYZE to overwrite and improve them.
  3. Keep the incorrect stats, but apply an adjustment factor based on file type/configuration (see the sketch below).
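
A minimal sketch of what option 3 could look like, assuming a hypothetical helper class with illustrative expansion ratios (none of these names or numbers come from Presto or Iceberg):

import java.util.Map;

// Hypothetical heuristic for option 3: scale the on-disk column size from the
// Iceberg manifest by an assumed expansion ratio so the optimizer sees
// something closer to the in-memory size.
public final class DataSizeHeuristic
{
    // Illustrative ratios only; real values vary with codec, encoding, and data.
    private static final Map<String, Double> FORMAT_EXPANSION = Map.of(
            "PARQUET", 4.5,
            "ORC", 4.0,
            "AVRO", 2.0);

    private DataSizeHeuristic() {}

    public static double estimateInMemoryBytes(String fileFormat, long onDiskBytes)
    {
        // Fall back to a conservative middle-of-the-road factor for unknown formats.
        return onDiskBytes * FORMAT_EXPANSION.getOrDefault(fileFormat, 3.0);
    }
}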

Steps to Reproduce

Use the IcebergQueryRunner and execute SHOW STATS on any tpch/ds table.

SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  167815.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240314_192652_00012_nmth9, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 124ms, server-side: 83ms] [0 rows, 0B] [0 rows/s, 0B/s]

Then check the value returned by the aggregation function used to calculate data size (sum_data_size_for_stats):

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)

Query 20240314_192726_00013_nmth9, FINISHED, 4 nodes
Splits: 9 total, 9 done (100.00%)
[Latency: client-side: 295ms, server-side: 244ms] [15K rows, 228KB] [61.5K rows/s, 934KB/s]

727364 > 167815 by a factor of about 4.3.

Context

Can cause query slowdowns if incorrect join distribution type is chosen

ZacBlanco added the bug and iceberg (Apache Iceberg related) labels on Mar 14, 2024
ZacBlanco self-assigned this on Mar 14, 2024
ZacBlanco (Contributor, Author) commented:

cc: @aaneja @ClarenceThreepwood

ZacBlanco (Contributor, Author) commented on Mar 16, 2024:

Some more follow-up information:

Comparison to Trino

I also tested Trino's Iceberg implementation and found that its data sizes for variable-width Iceberg columns are also generally wrong.

trino:tpch> SHOW STATS FOR (SELECT comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  554688.0 |               14756.0 |            0.0 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |   15000.0 | NULL      | NULL
(2 rows)

----

trino:tpch> select "$internal$sum_data_size_for_stats"(comment) from orders;
 _col0
--------
 727364
(1 row)

Understanding the Iceberg code

After digging into the Iceberg library, I found that (at least for the Parquet format) the column stats are generated by this line:

https://github.com/apache/iceberg/blob/560b72344350816eb31f9a165c2947caa7381a9b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L127

The call used here is the Parquet file footer's getTotalSize method, which represents the bytes on disk. However, Parquet footers also expose a getTotalUncompressedSize method. I tested how the column statistics compare when generated from this value instead and found that it is much closer to the true value.
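
A minimal sketch of the substitution, assuming the org.apache.parquet.hadoop.metadata API (both size methods exist on ColumnChunkMetaData; the helper itself is illustrative, not the actual Iceberg code):

import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Sketch: accumulate per-column sizes from a Parquet footer, as Iceberg's
// ParquetUtil does, but using the uncompressed size.
public final class FooterSizes
{
    private FooterSizes() {}

    public static Map<String, Long> columnSizes(ParquetMetadata footer)
    {
        Map<String, Long> sizes = new HashMap<>();
        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData column : block.getColumns()) {
                // column.getTotalSize() is the compressed on-disk size that
                // Iceberg records today; getTotalUncompressedSize() is much
                // closer to the in-memory size.
                sizes.merge(column.getPath().toDotString(),
                        column.getTotalUncompressedSize(), Long::sum);
            }
        }
        return sizes;
    }
}

With a change along these lines applied locally, the reproduction from above reports: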

presto:tpch> SHOW STATS FOR (select comment from orders);
 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 comment     |  745598.0 | NULL                  |            0.0 | NULL      | NULL      | NULL
 NULL        | NULL      | NULL                  | NULL           |   15000.0 | NULL      | NULL
(2 rows)

Query 20240316_002952_00016_muzsz, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
[Latency: client-side: 76ms, server-side: 61ms] [0 rows, 0B] [0 rows/s, 0B/s]

presto:tpch> select sum_data_size_for_stats(comment) from orders;
 _col0
--------
 727364
(1 row)

IMO we should be able to contribute this change to the Iceberg community, but it might be hard to get them to accept it, for two reasons:

  1. The spec states that the data size field in the manifest should be the on-disk size. I don't think this is a very useful metric for the optimizer unless we can consistently convert it to a "true" size value.
  2. Other file formats don't support reading the uncompressed size - e.g. Iceberg's ORC implementation calls getBytesOnDisk, and there is no corresponding method for the uncompressed or "in-memory" size (see the sketch below).
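
For contrast, a minimal sketch of the ORC side, assuming the org.apache.orc reader API (the comment above already notes getBytesOnDisk; the helper itself is illustrative):

import java.io.IOException;

import org.apache.orc.ColumnStatistics;
import org.apache.orc.Reader;

// Sketch: the only per-column size ORC statistics expose is the on-disk byte
// count; there is no uncompressed counterpart to switch to.
public final class OrcSizes
{
    private OrcSizes() {}

    public static long columnBytesOnDisk(Reader reader, int columnId) throws IOException
    {
        ColumnStatistics[] stats = reader.getStatistics();
        return stats[columnId].getBytesOnDisk();
    }
}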
