
Why is Spark's performance data based on the csv dataset? #235

Closed

Tom-Deng opened this issue Nov 2, 2021 · 8 comments

Tom-Deng commented Nov 2, 2021

Spark's performance on a csv dataset is not optimal, yet the current Spark performance data is based on the csv dataset. What was the author's reasoning behind this choice?

jangorecki (Contributor) commented:

Do you mean that Spark does not deal well with CSV data?
It loads the data into memory, so the CSV is just the source of the data. Later on, Spark operates on an in-memory object, not on the CSV.
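
A minimal PySpark sketch of that distinction (the file path and column names here are hypothetical, not taken from the benchmark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-source").getOrCreate()

# The CSV file is only the on-disk source; reading it yields a DataFrame,
# and every later query is expressed against that DataFrame.
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical path
df.groupBy("id1").sum("v1").show()  # operates on the DataFrame, not the raw CSV
```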

Tom-Deng (Author) commented Nov 2, 2021

> Do you mean that Spark does not deal well with CSV data? It loads the data into memory, so the CSV is just the source of the data. Later on, Spark operates on an in-memory object, not on the CSV.
Spark has specific optimizations for Parquet files. Take q1 of the 5 GB groupby data as an example: Spark takes 8.42 s on average to process the csv dataset, but only 555 ms on average to process the parquet dataset.

csv data:
[screenshot: q1 timings on the csv dataset]
parquet data:
[screenshot: q1 timings on the parquet dataset]
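
A rough sketch of a comparison along these lines (paths, column names, and the timing harness are assumptions; the attached notebook may differ):

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-vs-parquet").getOrCreate()

def time_q1(df):
    # q1-style groupby: sum of v1 grouped by id1
    start = time.time()
    df.groupBy("id1").sum("v1").collect()
    return time.time() - start

# Without caching, each query re-reads its source, so the csv timing
# includes parsing the whole file, while parquet reads only two columns.
csv_df = spark.read.csv("groupby_5GB.csv", header=True, inferSchema=True)
parquet_df = spark.read.parquet("groupby_5GB.parquet")

print("csv:     %.3fs" % time_q1(csv_df))
print("parquet: %.3fs" % time_q1(parquet_df))
```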

jangorecki (Contributor) commented Nov 3, 2021

Thanks for the investigation. You should try to cache the data in memory before running the first query, as we do in the benchmark script. As of now, your code is probably including the time to load the data from csv in the query time.
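
Something along these lines, with count() assumed as the materializing action (the exact calls in the benchmark script may differ):

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("groupby_5GB.csv", header=True, inferSchema=True)  # hypothetical path

# Mark the DataFrame for caching, then force materialization with an
# action *before* timing anything, so query times exclude CSV parsing.
df.cache()
df.count()  # action: reads and parses the CSV once, populating the cache

# Queries timed from here on hit the in-memory data, not the CSV.
start = time.time()
df.groupBy("id1").sum("v1").collect()
print("q1: %.3fs" % (time.time() - start))
```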

Tom-Deng (Author) commented Nov 5, 2021

> Thanks for the investigation. You should try to cache the data in memory before running the first query, as we do in the benchmark script. As of now, your code is probably including the time to load the data from csv in the query time.

Following the suggestion above, I tried caching the data in memory before running the first query, but the result is basically the same.

[screenshot: timings with caching attempted]

jangorecki (Contributor) commented:

I don't think your code actually caches the data in memory. Spark's interface is lazy, and AFAIR calling persist alone does not force computation.
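
A short sketch of why persist alone is not enough (file path hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical path

df.persist()  # lazy: only marks the DataFrame to be cached
# Nothing has been read or computed yet; no Spark job has run so far.

df.count()    # action: triggers the read and actually fills the cache

# Only queries issued after a materializing action benefit from the cache.
df.groupBy("id1").sum("v1").collect()
```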

Tom-Deng (Author) commented Nov 9, 2021

> I don't think your code actually caches the data in memory. Spark's interface is lazy, and AFAIR calling persist alone does not force computation.

See my code in the attachment; you can take a look at it as a whole.
spark_csv_vs_parquet.ipynb.zip

jangorecki (Contributor) commented:

It may be easier if you just run the db-benchmark script with the data source replaced and compare the timings.

Tom-Deng (Author) commented:

> It may be easier if you just run the db-benchmark script with the data source replaced and compare the timings.

With the data cached in memory, the performance of csv and parquet is indeed similar in the db-benchmark scenario.
