Why Spark produces performance data based on csv dataset #235
Comments
Do you mean that Spark is not dealing well with CSV data?
Thanks for the investigation. You should try to cache the data in memory before running the first query, as we do in the benchmark script. As it stands, your code is probably including the time to load the data from CSV in the query timing.
I don't think your code actually caches the data in memory. Spark's interface is lazy, and AFAIR calling persist alone does not force computation.
See my code in the attachment; you can look at it as a whole.
It may be easier if you just run the db-benchmark script with the data source replaced and compare timings.
Spark's performance on the CSV dataset is not optimal, yet the published Spark performance data is based on that CSV dataset. What was the author's reasoning here?