I find that Blaze has no acceleration effect. Why? #426
Comments
Did you run the benchmark on an isolated cluster? The SQL times vary greatly; it may be a resource issue.
Thanks, this may be the reason. On the other hand, there is a LIMIT 100 at the end of each query's SQL. Could this also cause the unstable running times? The LIMIT 100 only needs to take 100 rows from the final result, and the 100 rows obtained may be different on every run.
I suggest testing with a larger dataset on an isolated cluster to get a stable benchmark result. Most of the time is spent on the driver side, and performance is not stable if the dataset is too small.
By the way, what were the num-executors and executor-cores values for the Spark jobs in the following official benchmark? https://github.com/kwai/blaze/blob/master/benchmark-results/20240202.md
We use the setting
spark.executor.cores is 5, but what is the number of executors?
I use 15 executors × 2 cores + 1 driver core = 31 CPUs, and 15 × 8 g of executor memory + 2 g of driver memory = 122 g of memory, to run the Spark jobs. The resources are sufficient. Each job is run three times, but the running time is still different each time, and the speed with and without Blaze is almost the same. Does this plugin really work?
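For reference, the resource totals quoted above can be sanity-checked with a few lines of arithmetic. The per-executor memory of 8 g is an inference (it is the value that reproduces the stated 122 g total); the rest comes straight from the comment:

```python
# Sanity-check of the cluster resource totals quoted in the comment above.
# executor_mem_g = 8 is inferred from the stated 122 g total, not given directly.
executors = 15
cores_per_executor = 2
driver_cores = 1
total_cores = executors * cores_per_executor + driver_cores

executor_mem_g = 8
driver_mem_g = 2
total_mem_g = executors * executor_mem_g + driver_mem_g

print(total_cores, total_mem_g)  # 31 CPUs, 122 g of memory
```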
I built an environment with Spark 3.3.3 and Blaze 2.0.8, then ran some tests on 100 GB of TPC-DS data; however, I also saw no benefit compared to not using Blaze. This is my launch command:
By the way, the plan tree shows that the plugin is indeed in effect: plans are converted to native plans, but the query time does not decrease.
It's likely related to some hard-coded configurations. For example, shuffle compression is fixed to zstd in Blaze, while Spark uses lz4 by default. In a low-IO-latency environment, Blaze will spend more time on compression and slow down overall.
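One way to isolate the compression effect is to make the vanilla-Spark baseline use zstd as well, so both runs pay a similar compression cost. A minimal sketch using standard Spark properties (the codec level shown is illustrative, not a recommendation):

```properties
# spark-defaults.conf fragment for the non-Blaze baseline run
spark.io.compression.codec        zstd
spark.io.compression.zstd.level   1
spark.shuffle.compress            true
```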
Spark version: 3.3.3
This is the Spark conf with Blaze:
spark.executor.memory 5g
spark.executor.memoryOverhead 3072
spark.blaze.memoryFraction 0.7
spark.blaze.enable.caseconvert.functions true
spark.blaze.enable.smjInequalityJoin false
spark.blaze.enable.bhjFallbacksToSmj false
This is the Spark conf without Blaze:
spark.executor.memory 6g
spark.executor.memoryOverhead 2048
driver-memory 4G
num-executors 6
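For anyone reproducing this, the "with Blaze" configuration above maps onto a launch command along these lines. The jar path is a placeholder, and the SQL-extension class name is an assumption based on recent Blaze releases; verify both against the README of the Blaze version you built:

```shell
# Sketch of a spark-submit invocation for the "with Blaze" run above.
# /path/to/blaze-engine.jar and the extensions class are placeholders/assumptions.
spark-submit \
  --num-executors 6 \
  --driver-memory 4g \
  --conf spark.executor.memory=5g \
  --conf spark.executor.memoryOverhead=3072 \
  --conf spark.blaze.memoryFraction=0.7 \
  --conf spark.blaze.enable.caseconvert.functions=true \
  --conf spark.blaze.enable.smjInequalityJoin=false \
  --conf spark.blaze.enable.bhjFallbacksToSmj=false \
  --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
  --jars /path/to/blaze-engine.jar \
  ...
```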
I find there is not much difference between the Spark SQL jobs with and without Blaze.
(Screenshot: Spark job run times for query4, query5, query16, and query17, with and without Blaze.)
Using 100 GB of TPC-DS Parquet data and running spark-sql queries such as query4, query5, query16, and query17 (shown in the screenshot), the speed difference between using Blaze and not using it is not significant. Even the same spark-sql job, with or without Blaze, takes a different amount of time on each of three runs.