Native seems to generate larger data size than Java #22184
Comments
Experiment with and without CAST

I was looking at possible causes for the latency difference observed between Native & Java clusters for Q23 and observed the below w.r.t. performance of the CAST operator.

On a Native cluster:
- Measure read speed of the column
- Compare this against the read speed when we are forced to CAST the column to a decimal

On a Java cluster:
- Measure read speed of the column
- Compare this against the read speed when we are forced to CAST the column to a decimal

Possible cause(s)
@karteekmurthys: Let's write a micro-benchmark for this code. https://github.com/facebookincubator/velox/blob/main/velox/expression/CastExpr-inl.h#L424
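To get a rough upper bound on the per-row cost outside of Velox, a standalone timing loop can help. This is a minimal sketch, not the Velox benchmark harness and not the code in `CastExpr-inl.h`; the rescale step (multiply by 10^scale and round to an int64) is only an assumed stand-in for what a DOUBLE-to-DECIMAL cast has to do.

```cpp
// Standalone sketch (not Velox) comparing the cost of reading a column of
// doubles vs. reading + rescaling each value to a DECIMAL(x, 2)-style int64.
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
  constexpr size_t kRows = 10'000'000;
  std::mt19937_64 gen(42);
  std::uniform_real_distribution<double> dist(0.0, 1000.0);

  std::vector<double> column(kRows);
  for (auto& v : column) {
    v = dist(gen);
  }

  auto time = [](auto&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start)
        .count();
  };

  // Baseline: just read the column.
  double sum = 0;
  double readSecs = time([&] {
    for (double v : column) {
      sum += v;
    }
  });

  // With "cast": read and rescale each value to a scaled int64 (scale = 2).
  int64_t decimalSum = 0;
  double castSecs = time([&] {
    for (double v : column) {
      decimalSum += static_cast<int64_t>(std::llround(v * 100.0));
    }
  });

  std::cout << "read only: " << readSecs << "s, read+cast: " << castSecs
            << "s (sums " << sum << ", " << decimalSum << ")\n";
}
```

If the gap seen here is small relative to the cluster-level difference, that would point at the exchange/data-size issue rather than the cast kernel itself.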
@ZacBlanco the bug seems specific to decimal types. A related bug was fixed in Velox recently. facebookincubator/velox#8859
The build where this occurred was based on the following commit: https://github.com/prestodb/presto/commits/80d2c79. This branch has a velox submodule at https://github.com/facebookincubator/velox/tree/b6e3aad817bb45fa7f2a665840683b0bc27bdd32. That commit (
This fix is in aggregates, @majetideepak. Zac is referring to the size of the data being read, which is too big:
Presto:
Zac is referring to this
I found this similar to a query we saw where exchange received 15TB and turned that into 1.69PB. Internally we have a task tracking this issue: exchange->getOutput() would return a reusable vector which has an "inaccurate" size. We should read and register 'allocated' instead of 'used' bytes from it. cc: @mbasmanova
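To illustrate why a reusable output vector's size can be "inaccurate" for per-batch stats, here is a minimal sketch with a hypothetical `ReusableBuffer` type; it is not the Velox exchange or vector API, and how its two size figures map onto the 'used'/'allocated' bytes mentioned above is an assumption. The point is only that the two numbers diverge once a buffer is reused across batches, so summing the wrong one inflates a cumulative data-size stat.

```cpp
// Hypothetical reusable output buffer (not the Velox API). Its retained
// capacity survives across batches, so per-batch bytes and retained bytes
// diverge after the first large batch.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct ReusableBuffer {
  std::vector<uint8_t> storage; // capacity is kept across reuse

  // Write `bytes` bytes for the current batch; capacity only grows.
  void fill(size_t bytes) {
    storage.assign(bytes, 0);
  }

  size_t currentBatchBytes() const { return storage.size(); }
  size_t retainedBytes() const { return storage.capacity(); }
};

int main() {
  ReusableBuffer buffer;
  uint64_t statPerBatch = 0;
  uint64_t statRetained = 0;

  // One large batch followed by many small ones, reusing the same buffer.
  const size_t batches[] = {64 << 20, 1 << 20, 1 << 20, 1 << 20, 1 << 20};
  for (size_t bytes : batches) {
    buffer.fill(bytes);
    statPerBatch += buffer.currentBatchBytes();
    // Summing the retained figure charges the large allocation to every
    // subsequent batch, inflating the cumulative stat.
    statRetained += buffer.retainedBytes();
  }

  std::cout << "per-batch stat: " << statPerBatch << " bytes\n"
            << "retained stat:  " << statRetained << " bytes\n";
}
```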
@aaneja Is the CAST perf on native clusters stable? Can you please run it a few times and pick the stable numbers? I see the CPU time is about the same on decimal, but elapsed time doubled. We need to rule out perf variance.
@yingsu00 Yes, I could repro this consistently; see https://gist.github.com/aaneja/dc70f655695933b6ff11978120bbebab
We will add encoding in shuffle around April 15. We have tried preserving constants, replacing flat values with constants when the values are all the same, and making string dictionaries in the case of few distinct values. These together drop the data on the wire by 25% on a slice of our batch workload. Adding LZ4 to that drops another 25%, so we end up at half the network traffic.

A slightly different question is the reporting of operator output data size. Presto uses retainedSize(). Velox uses estimatedFlatSize. These mean different things. Neither is really representative. We could define a "used retained size" by taking estimated flat size but counting non-scalar data wrapped in dictionaries for distinct uses, not all uses. So, if there is a column where all rows refer to the same wide value via dictionary, the wide value should be counted once, not for every row. This is Ke's example. The DS examples do not involve encoding opportunities; they are high-cardinality fixed-width data.

The combination of reencoding and more intuitive size accounting will resolve this. We'll see if we get there in mid-April.
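To make the "count a shared wide value once per distinct use" idea concrete, here is a small sketch using a hypothetical dictionary-encoded string column (not Velox's vector classes, and the metric names are assumptions). It compares an estimated flat size, which charges the wide value to every row, against a distinct-aware size that charges it once plus the indices.

```cpp
// Hypothetical dictionary-encoded string column comparing two ways of
// accounting its size. Only payload bytes are counted, as an approximation.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct DictionaryStringColumn {
  std::vector<std::string> dictionary; // distinct values
  std::vector<int32_t> indices;        // one entry per row

  // "Estimated flat size": what the column would occupy if flattened,
  // i.e. every row is charged the full width of the value it points to.
  size_t estimatedFlatSize() const {
    size_t bytes = 0;
    for (int32_t idx : indices) {
      bytes += dictionary[idx].size();
    }
    return bytes;
  }

  // Distinct-aware size: indices plus each distinct value counted once.
  size_t usedRetainedSize() const {
    size_t bytes = indices.size() * sizeof(int32_t);
    for (const auto& value : dictionary) {
      bytes += value.size();
    }
    return bytes;
  }
};

int main() {
  // All 1M rows reference the same 1KB value via the dictionary.
  DictionaryStringColumn column;
  column.dictionary.push_back(std::string(1024, 'x'));
  column.indices.assign(1'000'000, 0);

  std::cout << "estimated flat size: " << column.estimatedFlatSize()
            << " bytes\n"
            << "distinct-aware size: " << column.usedRetainedSize()
            << " bytes\n";
}
```

In this example the flat accounting reports roughly 1GB while the distinct-aware accounting reports a few MB, which is the kind of gap being discussed for the reported operator output sizes.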
@oerling: Do you have any open PRs for the improvements you are suggesting? Would be great to follow them.
Created #22346 to decouple the CAST issue from this one. cc: @karteekmurthys
Environment
These tests were performed against Presto native and Java on 0.287 with Hive and TPC-DS SF10k on a 16-node cluster.
Issue
Originally, we found this issue using TPC-DS Q23. We noticed that the Presto native execution took significantly longer to complete compared to Java.
`Baseline` here is native while `Target` is Java. `wall_ms` for the native query is over 5x longer.

This discrepancy was traced in the Q23 query plan to a join in native which has significantly more data input than the corresponding Java execution. Notice that the data generated by fragment 10 in the native execution is small (519GB), but when read by the remote source as the input to the inner join, the input is 5TB in size! Java in comparison has fragment 10 outputting 734GB and calculates `RemoteSource[10]` as having 858GB input.

Java Execution
Native Execution
We suspect this severe increase in data size is the likely cause of the query slowdown. I was able to extract the relevant query from the plan and confirm that this issue persists even in simpler queries.
I ran this query on SF1k with a 2-node cluster and found that there is still an issue with large data transfer size. You'll see here that the Java execution has an input size of 86.03GB for fragment 1, while native has an input of 335.08GB.
Java EXPLAIN ANALYZE
Native EXPLAIN ANALYZE
My first thought is that there's something wonky with the block encodings, since the query results are correct, but I haven't been able to confirm this.
cc: @aditi-pandit @majetideepak @yingsu00