Parallel csv writer #3652

Merged
merged 1 commit into master from par_csv_write on Jun 10, 2022

Conversation

@ritchie46 (Member) commented Jun 10, 2022

Locally I measured a 3.5x speedup writing 5 million rows and 18 columns to an internal buffer.

closes #3004
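
The PR description doesn't include the benchmark script; a minimal sketch of that kind of measurement, timing write_csv into an in-memory buffer, might look as follows. The row and column counts match the description above, but the schema and contents are placeholders:

import io
import time

import polars as pl

# Placeholder frame matching only the stated shape: 5 million rows, 18 columns.
n = 5_000_000
df = pl.DataFrame({f"col{i}": range(n) for i in range(18)})

# Write to an internal buffer so disk speed does not enter the measurement.
buf = io.BytesIO()
t0 = time.perf_counter()
df.write_csv(buf)
print(f"wrote {buf.tell() / 1e9:.2f} GB in {time.perf_counter() - t0:.2f} s")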

@github-actions github-actions bot added the rust Related to Rust Polars label Jun 10, 2022
@github-actions github-actions bot added the python Related to Python Polars label Jun 10, 2022
@ritchie46 ritchie46 changed the title WIP → Parallel csv writer Jun 10, 2022
@ritchie46 ritchie46 merged commit 24d0cd0 into master Jun 10, 2022
@ritchie46 ritchie46 deleted the par_csv_write branch June 10, 2022 14:47
@cbilot commented Jun 12, 2022

I thought it might be helpful to benchmark the new parallel write_csv against a known, very fast csv file-writer: fwrite from the data.table package in R.

Note: I realize further work on write_csv may be planned, but all the same, it may be useful to see how write_csv stacks up against fwrite at this point.

Setup

Test data

From the "Benchmarking data.table" guide:

I’m very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

The test dataset is 500 million records with 29 columns: mostly strings, a few dates, and one integer. The dataset was stored as a feather/IPC file to preserve datatypes across data.table and Polars.
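
The original IPC file isn't shared; as a rough, hypothetical stand-in, something like the following could generate a similarly-shaped (if much smaller) file to run the scripts below against. All names and sizes here are placeholders.

# Hypothetical stand-in for the benchmark file; the real dataset
# (500 million rows, 29 columns) is not public. Shrink or grow n as needed.
from datetime import date, timedelta

import polars as pl

n = 1_000_000  # placeholder row count

cities = ["Springfield", "Shelbyville", "Ogdenville"]
df = pl.DataFrame({
    "city": [cities[i % 3] for i in range(n)],                             # string
    "dt": [date(2020, 1, 1) + timedelta(days=i % 365) for i in range(n)],  # date
    "qty": list(range(n)),                                                 # integer
})
df.write_ipc("write_csv_benchmark.ipc")  # IPC/feather preserves dtypes across R and Python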

Platform

System: Linux Mint 20.3 Una base: Ubuntu 20.04 focal
CPU: 32-Core (4-Die) model: AMD Ryzen Threadripper PRO 3975WX
RAM: 512 GB

Scripts

The R code used to benchmark fwrite:

library(data.table)
library(arrow)

# Read the IPC/feather file, then convert it to a data.table in place
df <- arrow::read_feather('write_csv_benchmark.ipc')
setDT(df)

# Time the write; fwrite is multithreaded by default
system.time(fwrite(df, '/tmp/write_csv_benchmark.R.csv'))

The corresponding Python code to benchmark write_csv:

import os

import polars as pl

df = pl.read_ipc('write_csv_benchmark.ipc', use_pyarrow=True)

# os.times() reports user and system CPU time plus wall-clock (elapsed) time
start = os.times()
df.write_csv('/tmp/write_csv_benchmark.Polars.csv')
end = os.times()

print("user", ":", end.user - start.user)
print("system", ":", end.system - start.system)
print("elapsed", ":", end.elapsed - start.elapsed)

Results

Both fwrite and write_csv produced an 80 GB csv file. A diff comparing the two files showed no differences. (I was somewhat surprised, as I had expected some minor differences.)
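
An equivalent byte-for-byte check can also be scripted; a minimal sketch using the standard library, assuming the output paths from the scripts above:

# Compare the two outputs' contents, not just their os.stat() metadata.
import filecmp

same = filecmp.cmp(
    '/tmp/write_csv_benchmark.R.csv',
    '/tmp/write_csv_benchmark.Polars.csv',
    shallow=False,  # shallow=True would only compare stat signatures
)
print("identical:", same)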

The times listed below are the elapsed times in seconds to write the csv file. To gauge the impact of I/O bottlenecks, I ran the benchmarks against three different storage media.
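
Each mean and minimum below can be recomputed from the raw run list shown beneath it; for example, for the first fwrite result:

# Recompute the summary statistics from the raw elapsed times (Media 1, fwrite).
from statistics import mean

runs = [104.039, 103.524, 96.734, 104.636, 101.24]
print(f"mean: {mean(runs):.1f} sec, min: {min(runs):.1f} sec")  # 102.0 / 96.7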

Media 1: a single 1 TB Samsung 980 Pro PCIe Gen4 NVMe

The partition written to is encrypted with LUKS, which adds some write overhead. The output file was deleted after each run, and fstrim was run to prevent possible degradation in write times from repeatedly writing and deleting large files. I also allowed about 1 minute between runs for heat to dissipate from the NVMe drive, as heat build-up might lead to throttling.

data.table fwrite: 102.0 sec (mean), 96.7 sec (min)
(104.039, 103.524, 96.734, 104.636, 101.24)

Polars write_csv: 95.6 sec (mean), 93.4 sec (min)
(95.55, 97.84, 94.86, 96.07, 93.44)

Polars write_csv is faster by 6%.

Media 2: four 1 TB PCIe Gen4 NVMe drives in RAID 0

This is a project folder designed for very high I/O throughput (and yes, backed up nightly). As with Media 1, the output file was deleted after each run, fstrim was run, and I allowed about 1 minute between runs. (Since the workload is spread across 4 drives, all housed in an enclosure with heat sinks and a dedicated fan, thermal throttling is far less likely.)

data.table fwrite: 80.9 sec (mean), 80.2 sec (min)
(81.648, 80.768, 81.161, 80.219, 80.468)

Polars write_csv: 76.8 sec (mean), 76.0 sec (min)
(77.17, 77.02, 76.5, 76.02, 77.17)

Polars write_csv is faster by 5%, which is similar to the single NVMe result.

Media 3: an in-memory file buffer

The Threadripper Pro platform has 8-channel memory, giving tremendous throughput between memory and CPU. As such, writing to an in-memory buffer should remove most of the I/O blocking.
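
The comment doesn't spell out how the in-memory target was set up; one common approach on Linux, shown here as an assumption rather than the author's exact method, is to point the writer at a tmpfs path such as /dev/shm:

# Hypothetical setup: /dev/shm is a tmpfs (RAM-backed) mount on most Linux
# distributions, so the csv bytes never touch a physical disk.
import polars as pl

df = pl.read_ipc('write_csv_benchmark.ipc', use_pyarrow=True)
df.write_csv('/dev/shm/write_csv_benchmark.Polars.csv')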

data.table fwrite: 62.9 sec (mean), 61.5 sec (min)
(61.538, 66.242, 63.063, 61.978, 62.595, 62.383, 62.622)

Polars write_csv: 55.1 sec (mean), 53.4 sec (min)
(53.44, 53.99, 59.33, 54.72, 54.67, 54.67, 54.8)

Polars is faster by 12%. Compared to the smaller gaps above, this may indicate that some of write_csv's advantage is masked by I/O blocking when writing to physical media.

Additional Notes

Version Info

R: 4.2, data.table: 1.14.2
Python: 3.10.4, Polars: compiled from commit 24d0cd0
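
For a release build, the installed version can be confirmed directly; note that a source build like the one used here reports the version string it was compiled from, not the commit hash:

import polars as pl

print(pl.__version__)  # e.g. '0.13.45' for the release mentioned below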

@cbilot commented Jun 12, 2022

I find this feature highly relevant and necessary, as I constantly have to read, manipulate, and save 100 GB+ of data. It's a bit odd that reading supports full-fledged multithreading but saving does not. With my 36-core system, saving the same-sized data takes about 20-50 times longer than reading it.

@momentlost I designed the above benchmark partly with your use case in mind: a multi-core system writing large csv files.

Given the massive performance improvement from parallelizing write_csv, I think you'll want to upgrade to Polars 0.13.45, released today.

@momentlost

This is beautiful. Thank you for the update.
