Parallel csv writer #3652

Merged
merged 1 commit into master from par_csv_write on Jun 10, 2022

Conversation

@ritchie46 (Member) commented Jun 10, 2022

Locally I measured a 3.5x speedup writing 5 million rows and 18 columns to an internal buffer.

closes #3004
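
The PR description doesn't include the benchmark script; a minimal sketch of that kind of measurement, timing write_csv into an in-memory buffer, might look as follows. The row and column counts match the description above, but the schema and contents are placeholders:

import io
import time

import polars as pl

# Placeholder frame matching only the stated shape: 5 million rows, 18 columns.
n = 5_000_000
df = pl.DataFrame({f"col{i}": range(n) for i in range(18)})

# Write to an internal buffer so disk speed does not enter the measurement.
buf = io.BytesIO()
t0 = time.perf_counter()
df.write_csv(buf)
print(f"wrote {buf.tell() / 1e9:.2f} GB in {time.perf_counter() - t0:.2f} s")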

@github-actions github-actions bot added the rust Related to Rust Polars label Jun 10, 2022
@github-actions github-actions bot added the python Related to Python Polars label Jun 10, 2022
@ritchie46 ritchie46 changed the title WIP → Parallel csv writer Jun 10, 2022
@ritchie46 ritchie46 merged commit 24d0cd0 into master Jun 10, 2022
@ritchie46 ritchie46 deleted the par_csv_write branch June 10, 2022 14:47
@cbilot commented Jun 12, 2022

I thought it might be helpful to benchmark the new parallel write_csv against a known, very fast csv file-writer: fwrite from the data.table package in R.

Note: I realize further work on write_csv may be planned, but all the same, it may be useful to see how write_csv stacks up against fwrite at this point.

Setup

Test data

From the "Benchmarking data.table" guide:

I’m very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

The test dataset is 500 million records with 29 columns: mostly strings, a few dates, and one integer. The dataset was stored as a feather/IPC file to preserve datatypes across data.table and Polars.
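
The original IPC file isn't shared; as a rough, hypothetical stand-in, something like the following could generate a similarly-shaped (if much smaller) file to run the scripts below against. All names and sizes here are placeholders.

# Hypothetical stand-in for the benchmark file; the real dataset
# (500 million rows, 29 columns) is not public. Shrink or grow n as needed.
from datetime import date, timedelta

import polars as pl

n = 1_000_000  # placeholder row count

cities = ["Springfield", "Shelbyville", "Ogdenville"]
df = pl.DataFrame({
    "city": [cities[i % 3] for i in range(n)],                             # string
    "dt": [date(2020, 1, 1) + timedelta(days=i % 365) for i in range(n)],  # date
    "qty": list(range(n)),                                                 # integer
})
df.write_ipc("write_csv_benchmark.ipc")  # IPC/feather preserves dtypes across R and Python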

Platform

System: Linux Mint 20.3 Una base: Ubuntu 20.04 focal
CPU: 32-Core (4-Die) model: AMD Ryzen Threadripper PRO 3975WX
RAM: 512 GB

Scripts

The R code used to benchmark fwrite:

library(data.table)
library(arrow)

# Read the IPC/feather file, then convert it to a data.table in place
df <- arrow::read_feather('write_csv_benchmark.ipc')
setDT(df)

# Time the write; fwrite is multithreaded by default
system.time(fwrite(df, '/tmp/write_csv_benchmark.R.csv'))

The corresponding Python code to benchmark write_csv:

import os

import polars as pl

df = pl.read_ipc('write_csv_benchmark.ipc', use_pyarrow=True)

# os.times() reports user and system CPU time plus wall-clock (elapsed) time
start = os.times()
df.write_csv('/tmp/write_csv_benchmark.Polars.csv')
end = os.times()

print("user", ":", end.user - start.user)
print("system", ":", end.system - start.system)
print("elapsed", ":", end.elapsed - start.elapsed)

Results

Both fwrite and write_csv produced an 80 GB csv file. A diff comparing the two files showed no differences. (I was somewhat surprised, as I had expected some minor differences.)
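
An equivalent byte-for-byte check can also be scripted; a minimal sketch using the standard library, assuming the output paths from the scripts above:

# Compare the two outputs' contents, not just their os.stat() metadata.
import filecmp

same = filecmp.cmp(
    '/tmp/write_csv_benchmark.R.csv',
    '/tmp/write_csv_benchmark.Polars.csv',
    shallow=False,  # shallow=True would only compare stat signatures
)
print("identical:", same)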

The times listed below are the elapsed times in seconds to write the csv file. To gauge the impact of I/O bottlenecks, I ran the benchmarks against three different storage media.
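
Each mean and minimum below can be recomputed from the raw run list shown beneath it; for example, for the first fwrite result:

# Recompute the summary statistics from the raw elapsed times (Media 1, fwrite).
from statistics import mean

runs = [104.039, 103.524, 96.734, 104.636, 101.24]
print(f"mean: {mean(runs):.1f} sec, min: {min(runs):.1f} sec")  # 102.0 / 96.7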

Media 1: a single 1 TB Samsung 980 Pro PCIe Gen4 NVMe

The partition written to is encrypted with LUKS, which adds some write overhead. The output file was deleted after each run, and fstrim was run to prevent possible degradation in write times from repeatedly writing and deleting large files. I also allowed about 1 minute between runs for heat to dissipate from the NVMe drive, as heat build-up might lead to throttling.

data.table fwrite: 102.0 sec (mean), 96.7 sec (min)
(104.039, 103.524, 96.734, 104.636, 101.24)

Polars write_csv: 95.6 sec (mean), 93.4 sec (min)
(95.55, 97.84, 94.86, 96.07, 93.44)

Polars write_csv is faster by 6%.

Media 2: four 1 TB PCIe Gen4 NVMe drives in RAID 0

This is a project folder designed for very high I/O throughput (and yes, backed up nightly). As with Media 1, the output file was deleted after each run, fstrim was run, and I allowed about 1 minute between runs. (Since the workload is spread across 4 drives, all housed in an enclosure with heat sinks and a dedicated fan, thermal throttling is far less likely.)

data.table fwrite: 80.9 sec (mean), 80.2 sec (min)
(81.648, 80.768, 81.161, 80.219, 80.468)

Polars write_csv: 76.8 sec (mean), 76.0 sec (min)
(77.17, 77.02, 76.5, 76.02, 77.17)

Polars write_csv is faster by 5%, which is similar to the single NVMe result.

Media 3: an in-memory file buffer

The Threadripper Pro platform has 8-channel memory, giving tremendous throughput between memory and CPU. As such, writing to an in-memory buffer should remove most of the I/O blocking.
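
The comment doesn't spell out how the in-memory target was set up; one common approach on Linux, shown here as an assumption rather than the author's exact method, is to point the writer at a tmpfs path such as /dev/shm:

# Hypothetical setup: /dev/shm is a tmpfs (RAM-backed) mount on most Linux
# distributions, so the csv bytes never touch a physical disk.
import polars as pl

df = pl.read_ipc('write_csv_benchmark.ipc', use_pyarrow=True)
df.write_csv('/dev/shm/write_csv_benchmark.Polars.csv')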

data.table fwrite: 62.9 sec (mean), 61.5 sec (min)
(61.538, 66.242, 63.063, 61.978, 62.595, 62.383, 62.622)

Polars write_csv: 55.1 sec (mean), 53.4 sec (min)
(53.44, 53.99, 59.33, 54.72, 54.67, 54.67, 54.8)

Polars is faster by 12%. Compared to the smaller gaps above, this may indicate that some of write_csv's advantage is masked by I/O blocking when writing to physical media.

Additional Notes

Version Info

R: 4.2, data.table: 1.14.2
Python: 3.10.4, Polars: compiled from commit 24d0cd0
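
For a release build, the installed version can be confirmed directly; note that a source build like the one used here reports the version string it was compiled from, not the commit hash:

import polars as pl

print(pl.__version__)  # e.g. '0.13.45' for the release mentioned below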

@cbilot commented Jun 12, 2022

I find this feature highly relevant and necessary, as I constantly have to read, manipulate, and save 100 GB+ of data. It's a bit odd that reading supports full-fledged multithreading but saving does not. With my 36-core system, saving the same-sized data takes about 20-50 times longer than reading it.

@momentlost I designed the above benchmark partly with your use case in mind: a multi-core system writing large csv files.

Given the massive performance improvement from parallelizing write_csv, I think you'll want to upgrade to Polars 0.13.45, released today.

@momentlost

This is beautiful. Thank you for the update.
