[Iceberg] variable-width column data sizes are generally wrong #22208
Some more follow-up information:

Comparison to Trino
I also tested Trino's Iceberg implementation and found that their data sizes for variable-width Iceberg columns are also generally wrong.
Understanding the Iceberg code
After digging into the Iceberg library, I found that (at least for the Parquet format) the column stats are generated by this line. The call they use here is the Parquet file footer's
IMO we should be able to contribute this change to the Iceberg community, but it might be hard to get them to accept such a change.
Your Environment
Any Iceberg table
Expected Behavior
Data size statistics should accurately reflect the size in memory when operating in Presto.
Current Behavior
Iceberg's TableStatisticsMaker tries to use the Iceberg manifest file information to calculate the data size for each column. This information is used in the optimizer for things like determining the join distribution type based on row size. However, per the Iceberg spec, this data is actually the on-disk data size, not necessarily the in-memory size, which is what we care about. It turns out that most, if not all, of the data size statistics Iceberg reports are incorrect by a factor of 3-5x. This amount can change depending on the on-disk storage format, compression, encryption, etc.
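A quick stdlib-only sketch of why the on-disk and in-memory sizes of a variable-width column diverge. The column values, encoding, and compressor here are invented for illustration (zlib standing in for whatever the file format actually uses); the point is only that encoded + compressed bytes on disk can be several times smaller than the raw bytes held in memory:

```python
import zlib

# Hypothetical variable-width string column: repetitive values compress well,
# so the on-disk (compressed) size is far below the in-memory size.
values = [f"customer-{i % 100:04d}@example.com" for i in range(10_000)]

in_memory_bytes = sum(len(v.encode("utf-8")) for v in values)
on_disk_bytes = len(zlib.compress("".join(values).encode("utf-8")))

ratio = in_memory_bytes / on_disk_bytes
print(f"in-memory: {in_memory_bytes}, on-disk: {on_disk_bytes}, ratio: {ratio:.1f}x")
```

A planner that reads `on_disk_bytes` but reasons about memory and network cost will underestimate this column by the printed ratio.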
Possible Solution
There are two dimensions to the solution: fixing the data size estimates derived from Iceberg's manifest metadata, and allowing ANALYZE to overwrite and improve them.
Steps to Reproduce
Use the IcebergQueryRunner and execute SHOW STATS on any tpch/ds table. Check the value returned by the aggregation function used to calculate data size (sum_data_size_for_stats): 727364 vs. 167815, a discrepancy of more than 4x.
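For reference, the factor implied by the two numbers above, assuming 727364 is the in-memory size computed by sum_data_size_for_stats and 167815 is the manifest-derived figure:

```python
# Numbers taken from the reproduction above; the labels are an assumption
# about which figure is the in-memory size and which is manifest-derived.
in_memory_size = 727364
manifest_reported_size = 167815

factor = in_memory_size / manifest_reported_size
print(f"manifest stats underestimate by {factor:.2f}x")  # roughly 4.3x
```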
Context
Can cause query slowdowns if an incorrect join distribution type is chosen
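To illustrate that context, here is a minimal sketch, not Presto's actual optimizer code, of how an underestimated build-side data size can flip a join from partitioned to broadcast. The threshold and sizes are made up:

```python
# Hypothetical cost-based choice between broadcast and partitioned joins.
# Presto's real logic lives in its optimizer; this only shows how a bad
# data size estimate changes the outcome.
BROADCAST_THRESHOLD_BYTES = 100 * 1024 * 1024  # made-up 100 MB limit

def choose_join_distribution(build_side_bytes: int) -> str:
    # Small build sides get replicated to every worker; large ones are partitioned.
    if build_side_bytes <= BROADCAST_THRESHOLD_BYTES:
        return "BROADCAST"
    return "PARTITIONED"

true_size = 300 * 1024 * 1024    # actual in-memory size: 300 MB
underestimated = true_size // 4  # manifest-derived on-disk size: 75 MB

print(choose_join_distribution(true_size))       # PARTITIONED
print(choose_join_distribution(underestimated))  # BROADCAST
```

With the underestimate, a 300 MB build side is replicated to every node as if it were 75 MB, which is exactly the kind of slowdown described above.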