
Fix reading Iceberg stats when table has nested fields #8647

Merged (3 commits into trinodb:master on Jul 27, 2021)

Conversation

findepi (Member) commented Jul 23, 2021

Fixes #7999

findepi (Member Author) commented Jul 23, 2021

cc @hashhar @losipiuk @electrum

.projected(0, 2, 3, 4, 5, 6) // ignore data size which is available for Parquet, but not for ORC
.skippingTypesCheck()
.matches("VALUES " +
// TODO (https://github.com/trinodb/trino/issues/8648) the NDV numbers are wrong, should be 1, not 0

phd3 (Member):

doesn't projected index 3 map to nulls-fraction which is correctly 0?

spec mentions that distinct values are deprecated anyway, so keeping them null (index 2) seems fine to me.

findepi (Member Author):

@phd3 good catch. I fell into my own trap here.

electrum (Member):

I was surprised to see that these were deprecated, but it looks like they were just added back: apache/iceberg#2805

findepi (Member Author):

@electrum I don't think per-file NDV is that useful unless the query is very selective.

Do we return them today, if they are present?

Since things like apache/iceberg#2805 depend on the writer, I guess we need SHOW STATS coverage with both Trino and Spark writers.
cc @joshthoward

The method is a one-liner.

This also makes `PartitionTable#getPartitions` even more similar to
analogous code inside `TableStatisticsMaker#makeTableStatistics` method.
@findepi findepi force-pushed the findepi/iceberg-nested branch 2 times, most recently from be06564 to e2e1d03 Compare July 23, 2021 14:28
alexjo2144 (Member) commented:

Looks good to me, just checking that the other changes in the existing PR to fix this issue wound up not being necessary? https://github.com/trinodb/trino/pull/8337/files#diff-b610df3211e6549a57e131f03074eb1b133c6ab227cf4a0ae3868758d4c40491R294

findepi (Member Author) commented Jul 23, 2021

I think PartitionTable.convert changes are not needed (and I don't think they are sound).

- this.idToTypeMapping = icebergTable.schema().columns().stream()
-         .filter(column -> column.type().isPrimitiveType())
-         .collect(Collectors.toMap(Types.NestedField::fieldId, (column) -> column.type().asPrimitiveType()));
+ this.idToTypeMapping = primitiveFieldTypes(icebergTable.schema());
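The helper that replaces the inline stream is what makes nested fields work: the old code filtered to top-level primitive columns only, silently dropping any primitive field inside a struct, which is what broke stats reading (and caused the NPE in #7999). A minimal, self-contained sketch of the recursive flattening, using hypothetical stand-in types rather than Iceberg's real org.apache.iceberg.types classes:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Iceberg's Type/NestedField hierarchy,
// for illustration only.
interface Type {}
record Primitive(String name) implements Type {}
record Struct(List<Field> fields) implements Type {}
record Field(int id, Type type) {}

public class PrimitiveFieldTypes
{
    // Recursively flatten a schema into fieldId -> primitive type,
    // descending into struct fields instead of skipping them.
    public static Map<Integer, Primitive> primitiveFieldTypes(List<Field> fields)
    {
        Map<Integer, Primitive> result = new LinkedHashMap<>();
        collect(fields, result);
        return result;
    }

    private static void collect(List<Field> fields, Map<Integer, Primitive> result)
    {
        for (Field field : fields) {
            if (field.type() instanceof Primitive primitive) {
                result.put(field.id(), primitive);
            }
            else if (field.type() instanceof Struct struct) {
                collect(struct.fields(), result); // the step the old code was missing
            }
        }
    }

    public static void main(String[] args)
    {
        List<Field> schema = List.of(
                new Field(1, new Primitive("long")),
                new Field(2, new Struct(List.of(
                        new Field(3, new Primitive("string")),
                        new Field(4, new Primitive("int"))))));
        // With the recursion, the nested ids 3 and 4 are included:
        System.out.println(primitiveFieldTypes(schema).keySet()); // prints [1, 3, 4]
    }
}
```

A top-level-only filter over the same schema would produce only {1}, so any later lookup of a nested field's type would return null.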
hashhar (Member) commented Jul 26, 2021:

We should somehow indicate/note that the idToTypeMapping contains the source fields instead of transformed ones. Not sure how though.

Only PartitionField refers to the transformed field, and to get to the source field you need to use PartitionField#sourceId. The inconsistency is a bit confusing to me, but that all happens within the Iceberg API.

findepi (Member Author):

> We should somehow indicate/note that the idToTypeMapping contains the source fields instead of transformed ones.

this didn't change, right?

> Only PartitionField refers to transformed field and to get to source field you need to use PartitionField#sourceId.

We could narrow the map to only partition fields (and rename it appropriately). Is that what you mean here?

hashhar (Member):

Nothing changed - I'm just expressing frustration at something which I had to spend a bit of time fighting when debugging a past issue. 🙂

I don't think we can do anything other than people being aware of the difference between PartitionField and NestedField even though they sound the same.

hashhar (Member) left a comment:

Looks good to me.


@@ -227,7 +227,7 @@ public void updateNullCount(Map<Integer, Long> nullCounts)
this.nullCounts.merge(key, counts, Long::sum));
}

- public static Map<Integer, Object> toMap(Map<Integer, Type.PrimitiveType> idToTypeMapping, Map<Integer, ByteBuffer> idToMetricMap)
+ public static Map<Integer, Object> convertBounds(Map<Integer, Type.PrimitiveType> idToTypeMapping, Map<Integer, ByteBuffer> idToMetricMap)
Member:

optional: the method doesn't have anything specific to "bounds". Maybe convertToValues, getValuesFromByteBuffers, or convertMetrics, but I don't have a suggestion that sounds obviously better.

findepi (Member Author):

> the method doesn't have anything specific to "bounds".

I think it actually does, because bounds come as byte buffers, while ordinary partition values do not.

Member:

Keeping current iceberg data structures aside, is there a reason why logically byte buffers would be more suitable for representing bounds, and hence this naming is more appropriate?

(either way convertBounds reads more easily than toMap, so feel free to resolve this.)

findepi (Member Author):

> is there a reason why logically byte buffers would be more suitable for representing bounds

I don't see why we would use a different representation in different code paths. Some performance reason? Or some unintentional API difference? I need to find out.

(Also seen here: // Partition values are passed as String, but min/max values are passed as a CharBuffer)

> (either way convertBounds reads more easily than toMap, so feel free to resolve this.)

👍
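For context on why the bounds path deals in ByteBuffers at all: Iceberg's manifest metrics store per-column lower/upper bounds in the spec's single-value binary serialization (an int, for example, is 4 little-endian bytes), so reading a bound requires a decode step that typed partition values do not. A minimal JDK-only sketch of decoding an int bound; in the real code this is what Iceberg's Conversions.fromByteBuffer handles for all primitive types:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BoundDecode
{
    // Decode an int bound from Iceberg's single-value binary representation
    // (4 bytes, little-endian, per the Iceberg spec). duplicate() keeps the
    // caller's buffer position untouched.
    static int decodeIntBound(ByteBuffer buffer)
    {
        return buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args)
    {
        // Simulate a serialized lower bound of 42 as it would appear in a manifest.
        ByteBuffer bound = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(42);
        bound.flip();
        System.out.println(decodeIntBound(bound)); // prints 42
    }
}
```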

@findepi findepi merged commit 669a41e into trinodb:master Jul 27, 2021
@findepi findepi deleted the findepi/iceberg-nested branch July 27, 2021 07:22
@findepi findepi mentioned this pull request Jul 27, 2021
11 tasks
@findepi findepi added this to the 360 milestone Jul 27, 2021
Development

Successfully merging this pull request may close these issues.

NullPointerException In Iceberg when joining a partitioned table with nested fields
5 participants