Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prestosql 327 queries (parquet on s3) scanning 2~3x more data than prestodb 215 #3031

Closed
tooptoop4 opened this issue Mar 7, 2020 · 2 comments

Comments

@tooptoop4
Copy link
Contributor

tooptoop4 commented Mar 7, 2020

from slack:
Hi team, we have recently performed an upgrade from Facebook Presto 215 to PrestoSQL 327, after the migration, we found one of the queries scanning 2~3x more data. Any idea why?

martin 1 month ago
Can you provide the output of EXPLAIN for both queries?

david 1 month ago
And EXPLAIN ANALYZE as well, if possible, as that should show where it is scanning more data. It's possible this is an issue with how the accounting/reporting happens (but we should understand why)

martin 1 month ago
Also, what format are you using for your data? What’s your storage backend, S3, HDFS or other?

Jiawei Zhang 1 month ago
We are using parquet data from S3, metastore is hive.

Jiawei Zhang 1 month ago
explain analyze 215

Fragment 1 [SINGLE]
    CPU: 5.26ms, Scheduled: 11.02ms, Input: 1 row (71B); per task: avg.: 1.00 std.dev.: 0.00, Output: 1 row (71B)
    Output layout: [device_id, platform, count_4, count]
    Output partitioning: SINGLE []
    Grouped Execution: false
    - TopN[100 by (count DESC_NULLS_LAST)] => [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
            CPU: 4.00ms (0.00%), Scheduled: 10.00ms (0.00%), Output: 1 row (71B)
            Input avg.: 1.00 rows, Input std.dev.: 0.00%
        - LocalExchange[SINGLE] () => device_id:varchar, platform:varchar, count_4:bigint, count:bigint
                CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
                Input avg.: 0.06 rows, Input std.dev.: 387.30%
            - RemoteSource[2] => [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
                    CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
                    Input avg.: 0.06 rows, Input std.dev.: 387.30%

Fragment 2 [HASH]
    CPU: 203.53ms, Scheduled: 282.37ms, Input: 6 rows (480B); per task: avg.: 0.07 std.dev.: 0.63, Output: 1 row (71B)
    Output layout: [device_id, platform, count_4, count]
    Output partitioning: SINGLE []
    Grouped Execution: false
    - TopNPartial[100 by (count DESC_NULLS_LAST)] => [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
            CPU: 4.00ms (0.00%), Scheduled: 21.00ms (0.00%), Output: 1 row (71B)
            Input avg.: 0.00 rows, Input std.dev.: 3793.42%
        - Project[] => [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
                CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Input: 1 row (80B), Output: 1 row (71B), Filtered: 0.00%
                Input avg.: 0.00 rows, Input std.dev.: 3793.42%
            - Aggregate(FINAL)[device_id, platform][$hashvalue] => [device_id:varchar, platform:varchar, $hashvalue:bigint, count_4:bigint, count:bigint]
                    CPU: 1.00ms (0.00%), Scheduled: 1.00ms (0.00%), Output: 1 row (80B)
                    Input avg.: 0.00 rows, Input std.dev.: 3793.42%
                    count_4 := "count"("count_21")
                    count := "count"("count_22")
                - LocalExchange[HASH][$hashvalue] ("device_id", "platform") => device_id:varchar, platform:varchar, count_22:bigint, count_21:bigint, $hashvalue:bigint
                        CPU: 0.00ns (0.00%), Scheduled: 1.00ms (0.00%), Output: 6 rows (480B)
                        Input avg.: 0.00 rows, Input std.dev.: 2681.42%
                    - RemoteSource[3] => [device_id:varchar, platform:varchar, count_22:bigint, count_21:bigint, $hashvalue_23:bigint]
                            CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 6 rows (480B)
                            Input avg.: 0.00 rows, Input std.dev.: 2681.42%

Fragment 3 [HASH]
    CPU: 151.38ms, Scheduled: 201.55ms, Input: 604 rows (75.00kB); per task: avg.: 6.71 std.dev.: 28.42, Output: 6 rows (480B)
    Output layout: [device_id, platform, count_22, count_21, $hashvalue_24]
    Output partitioning: HASH [device_id, platform][$hashvalue_24]
    Grouped Execution: false
    - Aggregate(PARTIAL)[device_id, platform][$hashvalue_24] => [device_id:varchar, platform:varchar, $hashvalue_24:bigint, count_22:bigint, count_21:bigint]
            CPU: 6.00ms (0.00%), Scheduled: 6.00ms (0.00%), Output: 6 rows (480B)
            Input avg.: 0.42 rows, Input std.dev.: 1737.55%
            count_22 := "count"("username") (mask = expr$distinct)
            count_21 := "count"("mapped_username") (mask = expr$distinct_20)
        - MarkDistinct[distinct=platform:varchar, mapped_username:varchar marker=expr$distinct_20][$hashvalue_25] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_24:bigint, $hashvalue_25:bigint, $hashvalue_26:bigint, expr$distinct_20:boolean]
                CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 604 rows (76.19kB)
                Input avg.: 0.42 rows, Input std.dev.: 1737.55%
            - LocalExchange[HASH][$hashvalue_26] ("device_id", "platform", "mapped_username") => device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_24:bigint, $hashvalue_25:bigint, $hashvalue_26:bigint
                    CPU: 0.00ns (0.00%), Scheduled: 8.00ms (0.00%), Output: 604 rows (75.00kB)
                    Input avg.: 0.42 rows, Input std.dev.: 1143.80%
                - RemoteSource[4] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_27:bigint, $hashvalue_28:bigint, $hashvalue_29:bigint]
                        CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 604 rows (75.00kB)
                        Input avg.: 0.42 rows, Input std.dev.: 1143.80%

Fragment 4 [HASH]
    CPU: 191.45ms, Scheduled: 206.36ms, Input: 604 rows (84.44kB); per task: avg.: 6.71 std.dev.: 29.43, Output: 604 rows (75.02kB)
    Output layout: [device_id, platform, username, mapped_username, expr$distinct, $hashvalue_30, $hashvalue_31, $hashvalue_32]
    Output partitioning: HASH [device_id, platform, mapped_username][$hashvalue_32]
    Grouped Execution: false
    - Project[] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, expr$distinct:boolean]
            CPU: 6.00ms (0.00%), Scheduled: 6.00ms (0.00%), Input: 604 rows (85.63kB), Output: 604 rows (75.02kB), Filtered: 0.00%
            Input avg.: 0.42 rows, Input std.dev.: 1796.41%
        - MarkDistinct[distinct=platform:varchar, username:varchar marker=expr$distinct][$hashvalue_33] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, $hashvalue_33:bigint, $hashvalue_34:bigint, expr$distinct:boolean]
                CPU: 1.00ms (0.00%), Scheduled: 1.00ms (0.00%), Output: 604 rows (85.63kB)
                Input avg.: 0.42 rows, Input std.dev.: 1796.41%
            - LocalExchange[HASH][$hashvalue_34] ("device_id", "platform", "username") => device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, $hashvalue_33:bigint, $hashvalue_34:bigint
                    Cost: {rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
                    CPU: 3.00ms (0.00%), Scheduled: 6.00ms (0.00%), Output: 604 rows (84.44kB)
                    Input avg.: 0.42 rows, Input std.dev.: 1273.70%
                - RemoteSource[5] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_35:bigint, $hashvalue_36:bigint, $hashvalue_37:bigint, $hashvalue_38:bigint, $hashvalue_39:bigint]
                        Cost: {rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
                        CPU: 4.00ms (0.00%), Scheduled: 4.00ms (0.00%), Output: 604 rows (84.44kB)
                        Input avg.: 0.42 rows, Input std.dev.: 1273.70%

Fragment 5 [SOURCE]
    CPU: 30.20m, Scheduled: 1.40h, Input: 3819046365 rows (145.47GB); per task: avg.: 42433848.50 std.dev.: 2674830.76, Output: 604 rows (84.41kB)
    Output layout: [device_id, platform, username, mapped_username, $hashvalue_40, $hashvalue_41, $hashvalue_42, $hashvalue_43, $hashvalue_44]
    Output partitioning: HASH [device_id, platform, username][$hashvalue_44]
    Grouped Execution: false
    - ScanFilterProject[table = hive:risk:device_check, grouped = false, filterPredicate = ((("device_id" = CAST('b8528345-b827-4e55-a087-0a24973024fe' AS varchar)) AND (NOT ("username" IS NULL))) AND ("date_diff"('DAY', "timestamp", "$literal$timestamp"(1580601600000)) <= BIGINT '90'))] => [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_40:bigint, $hashvalue_41:bigint, $hashvalue_42:bigint, $hashvalue_43:bigint, $hashvalue_44:bigint]
            Cost: {rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}/{rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}/{rows: 0 (0B), cpu: 0.00, memory: 0.00, network: 0.00}
            CPU: 30.20m (100.00%), Scheduled: 1.40h (100.00%), Input: 3819046365 rows (145.47GB), Output: 604 rows (84.41kB), Filtered: 100.00%
            Input avg.: 216917.32 rows, Input std.dev.: 59.44%
            $hashvalue_40 := "combine_hash"("combine_hash"(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0))
            $hashvalue_41 := "combine_hash"("combine_hash"(bigint '0', COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("mapped_username"), 0))
            $hashvalue_42 := "combine_hash"("combine_hash"("combine_hash"(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("mapped_username"), 0))
            $hashvalue_43 := "combine_hash"("combine_hash"(bigint '0', COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("username"), 0))
            $hashvalue_44 := "combine_hash"("combine_hash"("combine_hash"(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("username"), 0))
            LAYOUT: risk.device_check
            device_id := device_id:string:4:REGULAR
            mapped_username := mapped_username:string:17:REGULAR
            platform := platform:string:1:REGULAR
            username := username:string:8:REGULAR
            timestamp := timestamp:timestamp:5:REGULAR
            dt:string:-1:PARTITION_KEY
                :: [[2019-10-31, 2020-02-01]]

Jiawei Zhang 1 month ago
explain analyze 327

Fragment 1 [SINGLE]
    CPU: 933.42us, Scheduled: 996.67us, Input: 1 row (71B); per task: avg.: 1.00 std.dev.: 0.00, Output: 1 row (71B)
    Output layout: [device_id, platform, count_4, count]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    TopN[100 by (count DESC_NULLS_LAST)]
    │   Layout: [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
    │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
    │   Input avg.: 1.00 rows, Input std.dev.: 0.00%
    └─ LocalExchange[SINGLE] ()
       │   Layout: [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
       │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
       │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
       │   Input avg.: 0.06 rows, Input std.dev.: 387.30%
       └─ RemoteSource[2]
              Layout: [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
              CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
              Input avg.: 0.06 rows, Input std.dev.: 387.30%

Fragment 2 [HASH]
    CPU: 214.55ms, Scheduled: 255.95ms, Input: 6 rows (480B); per task: avg.: 0.06 std.dev.: 0.60, Output: 1 row (71B)
    Output layout: [device_id, platform, count_4, count]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    TopNPartial[100 by (count DESC_NULLS_LAST)]
    │   Layout: [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
    │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
    │   Input avg.: 0.00 rows, Input std.dev.: 3998.75%
    └─ Project[]
       │   Layout: [device_id:varchar, platform:varchar, count_4:bigint, count:bigint]
       │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
       │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 1 row (71B)
       │   Input avg.: 0.00 rows, Input std.dev.: 3998.75%
       └─ Aggregate(FINAL)[device_id, platform][$hashvalue]
          │   Layout: [device_id:varchar, platform:varchar, $hashvalue:bigint, count_4:bigint, count:bigint]
          │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
          │   CPU: 0.00ns (0.00%), Scheduled: 2.00ms (0.00%), Output: 1 row (80B)
          │   Input avg.: 0.00 rows, Input std.dev.: 3998.75%
          │   count_4 := count("count_21")
          │   count := count("count_22")
          └─ LocalExchange[HASH][$hashvalue] ("device_id", "platform")
             │   Layout: [device_id:varchar, platform:varchar, count_22:bigint, count_21:bigint, $hashvalue:bigint]
             │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
             │   CPU: 0.00ns (0.00%), Scheduled: 13.00ms (0.00%), Output: 6 rows (480B)
             │   Input avg.: 0.00 rows, Input std.dev.: 1882.96%
             └─ RemoteSource[3]
                    Layout: [device_id:varchar, platform:varchar, count_22:bigint, count_21:bigint, $hashvalue_23:bigint]
                    CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 6 rows (480B)
                    Input avg.: 0.00 rows, Input std.dev.: 1882.96%

Fragment 3 [HASH]
    CPU: 183.29ms, Scheduled: 215.40ms, Input: 604 rows (75.00kB); per task: avg.: 6.04 std.dev.: 27.04, Output: 6 rows (480B)
    Output layout: [device_id, platform, count_22, count_21, $hashvalue_24]
    Output partitioning: HASH [device_id, platform][$hashvalue_24]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Aggregate(PARTIAL)[device_id, platform][$hashvalue_24]
    │   Layout: [device_id:varchar, platform:varchar, $hashvalue_24:bigint, count_22:bigint, count_21:bigint]
    │   CPU: 2.00ms (0.00%), Scheduled: 2.00ms (0.00%), Output: 6 rows (480B)
    │   Input avg.: 0.38 rows, Input std.dev.: 1831.84%
    │   count_22 := count("username") (mask = expr$distinct)
    │   count_21 := count("mapped_username") (mask = expr$distinct_20)
    └─ MarkDistinct[distinct=platform:varchar, mapped_username:varchar marker=expr$distinct_20][$hashvalue_25]
       │   Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_24:bigint, $hashvalue_25:bigint, $hashvalue_26:bigint, expr$distinct_20:boolean]
       │   CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 604 rows (76.19kB)
       │   Input avg.: 0.38 rows, Input std.dev.: 1831.84%
       └─ LocalExchange[HASH][$hashvalue_26] ("device_id", "platform", "mapped_username")
          │   Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_24:bigint, $hashvalue_25:bigint, $hashvalue_26:bigint]
          │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
          │   CPU: 1.00ms (0.00%), Scheduled: 5.00ms (0.00%), Output: 604 rows (75.00kB)
          │   Input avg.: 0.38 rows, Input std.dev.: 1132.90%
          └─ RemoteSource[4]
                 Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, expr$distinct:boolean, $hashvalue_27:bigint, $hashvalue_28:bigint, $hashvalue_29:bigint]
                 CPU: 0.00ns (0.00%), Scheduled: 0.00ns (0.00%), Output: 604 rows (75.00kB)
                 Input avg.: 0.38 rows, Input std.dev.: 1132.90%

Fragment 4 [HASH]
    CPU: 238.96ms, Scheduled: 273.73ms, Input: 604 rows (84.44kB); per task: avg.: 6.04 std.dev.: 27.99, Output: 604 rows (75.02kB)
    Output layout: [device_id, platform, username, mapped_username, expr$distinct, $hashvalue_30, $hashvalue_31, $hashvalue_32]
    Output partitioning: HASH [device_id, platform, mapped_username][$hashvalue_32]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Project[]
    │   Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, expr$distinct:boolean]
    │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
    │   CPU: 7.00ms (0.00%), Scheduled: 10.00ms (0.00%), Output: 604 rows (75.02kB)
    │   Input avg.: 0.38 rows, Input std.dev.: 1893.88%
    └─ MarkDistinct[distinct=platform:varchar, username:varchar marker=expr$distinct][$hashvalue_33]
       │   Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, $hashvalue_33:bigint, $hashvalue_34:bigint, expr$distinct:boolean]
       │   CPU: 2.00ms (0.00%), Scheduled: 4.00ms (0.00%), Output: 604 rows (85.63kB)
       │   Input avg.: 0.38 rows, Input std.dev.: 1893.88%
       └─ LocalExchange[HASH][$hashvalue_34] ("device_id", "platform", "username")
          │   Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_30:bigint, $hashvalue_31:bigint, $hashvalue_32:bigint, $hashvalue_33:bigint, $hashvalue_34:bigint]
          │   Estimates: {rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}
          │   CPU: 5.00ms (0.00%), Scheduled: 6.00ms (0.00%), Output: 604 rows (84.44kB)
          │   Input avg.: 0.38 rows, Input std.dev.: 1133.63%
          └─ RemoteSource[5]
                 Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_35:bigint, $hashvalue_36:bigint, $hashvalue_37:bigint, $hashvalue_38:bigint, $hashvalue_39:bigint]
                 CPU: 6.00ms (0.00%), Scheduled: 6.00ms (0.00%), Output: 604 rows (84.44kB)
                 Input avg.: 0.38 rows, Input std.dev.: 1133.63%

Fragment 5 [SOURCE]
    CPU: 1.39h, Scheduled: 2.13h, Input: 5400689339 rows (281.18GB); per task: avg.: 36004595.59 std.dev.: 12814237.66, Output: 604 rows (84.41kB)
    Output layout: [device_id, platform, username, mapped_username, $hashvalue_40, $hashvalue_41, $hashvalue_42, $hashvalue_43, $hashvalue_44]
    Output partitioning: HASH [device_id, platform, username][$hashvalue_44]
    Stage Execution Strategy: UNGROUPED_EXECUTION
    ScanFilterProject[table = hive:risk:device_check, grouped = false, filterPredicate = (((NOT ("username" IS NULL)) AND ("device_id" = CAST('b8528345-b827-4e55-a087-0a24973024fe' AS varchar))) AND ("@date_diff|bigint|varchar(3)|timestamp|timestamp@date_diff(varchar(x),timestamp,timestamp):bigint"('DAY', "timestamp", "@$literal$|timestamp|bigint@$literal$<f,t>(f):t"(1580601600000)) <= BIGINT '90'))]
        Layout: [device_id:varchar, platform:varchar, username:varchar, mapped_username:varchar, $hashvalue_40:bigint, $hashvalue_41:bigint, $hashvalue_42:bigint, $hashvalue_43:bigint, $hashvalue_44:bigint]
        Estimates: {rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}/{rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}/{rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}
        CPU: 1.39h (100.00%), Scheduled: 2.98h (100.00%), Output: 604 rows (84.41kB)
        Input avg.: 306752.77 rows, Input std.dev.: 456.34%
        $hashvalue_40 := combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0))
        $hashvalue_41 := combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("mapped_username"), 0))
        $hashvalue_42 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("mapped_username"), 0))
        $hashvalue_43 := combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("username"), 0))
        $hashvalue_44 := combine_hash(combine_hash(combine_hash(bigint '0', COALESCE("$operator$hash_code"("device_id"), 0)), COALESCE("$operator$hash_code"("platform"), 0)), COALESCE("$operator$hash_code"("username"), 0))
        device_id := device_id:string:4:REGULAR
        mapped_username := mapped_username:string:17:REGULAR
        platform := platform:string:1:REGULAR
        timestamp := timestamp:timestamp:5:REGULAR
        username := username:string:8:REGULAR
        dt:string:-1:PARTITION_KEY
            :: [[2019-10-31, 2020-02-01]]
        Input: 5400689339 rows (281.18GB), Filtered: 100.00%

Jiawei Zhang 1 month ago
Cluster is using all default config

sopel39 1 month ago
In PrestoSQL INPUT data is actually data after decompression. You can check Query JSON to actually see how much physical data was read

Jiawei Zhang 1 month ago
I checked the query json, decompressed data size is also doubled in the 327 version:

Jiawei Zhang 1 month ago
JavaScript/JSON
327.json
327.json.txt

Jiawei Zhang 1 month ago
JavaScript/JSON
215.json
215.json.txt

Jiawei Zhang 1 month ago
We also noticed that some tasks scan rows with a speed of 1/4 or less than other tasks. That does not happen on specific worker node.

sopel39 1 month ago
indeed, 327 reads more data (409.45GB vs 243.32GB) . Is your file format Parquet?

sopel39 1 month ago
what is the query duration?

Jiawei Zhang 1 month ago
Yes file is parquet format. Query lasted ~40s on FB v215 and ~10min on 327

martin 1 month ago
Can you try with Presto 324? I’d like to see if this might be related to some changes that went into version 325.

Jiawei Zhang 1 month ago
I could try it in next few days.

Jiawei Zhang 17 days ago
Not yet. Still getting a new cluster.

@tooptoop4
Copy link
Contributor Author

Regression potentially introduced by #1925 from 325 release onwards

@hashhar
Copy link
Member

hashhar commented Sep 3, 2024

A new baseline needs to be established with most recent Trino to see if the gap still exists and how much.

Closing since the data in the issue is no longer relevant.

@hashhar hashhar closed this as not planned Won't fix, can't repro, duplicate, stale Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants