Parallel csv writer #3652
I thought it might be helpful to benchmark the new parallel csv writer.
Note: I realize you may have further work on it.
Set up
Test data
From the "Benchmarking data.table" guide:
The test dataset is 500 million records with 29 columns: mostly strings, a few dates, and one integer. The dataset was stored as a feather/IPC file to preserve datatypes across R and Python.
Platform
System: Linux Mint 20.3 Una base: Ubuntu 20.04 focal
Scripts
The R code used to benchmark fwrite:
library(data.table)
library(arrow)
df <- arrow::read_feather('write_csv_benchmark.ipc')
setDT(df)
system.time(fwrite(df, '/tmp/write_csv_benchmark.R.csv'))
The corresponding Python code to benchmark write_csv:
import polars as pl
import os
df = pl.read_ipc('write_csv_benchmark.ipc', use_pyarrow=True)
start = os.times()
df.write_csv('/tmp/write_csv_benchmark.Polars.csv')
end = os.times()
print("user", ":", end.user - start.user)
print("system", ":", end.system - start.system)
print("elapsed", ":", end.elapsed - start.elapsed)
Results
Both
The times listed below are the elapsed time in seconds to write the csv file. To consider the impact of I/O bottlenecks, I ran benchmarks against three different storage media.
Media 1: a single 1 TB Samsung 980 Pro PCIe Gen4 NVMe
The partition written to is encrypted with LUKS, which adds some overhead to writes. The output file was deleted after each run, and
data.table Polars Polars
Media 2: four 1-TB PCIe Gen4 NVMe in RAID0
This is a project folder designed for very high I/O throughput (and yes, backed up nightly). The output file was deleted after each run, and
data.table Polars Polars
Media 3: In-Memory file buffer
The Threadripper Pro platform has 8-channel memory, creating tremendous throughput between memory and CPU. As such, the in-memory buffer should reduce much of the I/O blocking.
data.table Polars
Polars is faster by 12%. Compared to the results above, this may indicate some small headroom for dealing with the effects of I/O blocking.
Additional Notes
Version Info
R: 4.2, data.table: 1.14.2
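The timing pattern in the Python script above can be wrapped in a small helper, which also makes it easy to time a write into an in-memory buffer, as in the Media 3 case. This is only a minimal sketch: `time_call` and `write_rows_to_buffer` are hypothetical names, and the standard-library `csv` module stands in for the real writer so that the snippet runs without Polars or the benchmark file.

```python
import csv
import io
import os


def time_call(func, *args, **kwargs):
    """Run func and return (result, timings) with the same three
    fields the benchmark script prints: user, system, elapsed."""
    start = os.times()
    result = func(*args, **kwargs)
    end = os.times()
    timings = {
        "user": end.user - start.user,
        "system": end.system - start.system,
        "elapsed": end.elapsed - start.elapsed,
    }
    return result, timings


def write_rows_to_buffer(rows):
    """Serialize rows to CSV text held entirely in memory (no disk I/O)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Synthetic stand-in data; the real benchmark used a feather/IPC file.
    rows = [("id", "name")] + [(i, f"row-{i}") for i in range(100_000)]
    text, timings = time_call(write_rows_to_buffer, rows)
    for key, value in timings.items():
        print(key, ":", value)
```

Note that `os.times()` reports CPU time for the current process, so the user figure can exceed the elapsed figure when the work is spread across several threads.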
@momentlost I designed the above benchmark partly with your use case in mind: a multi-core system writing large csv files. Looking at the massive performance improvement and parallelization in
This is beautiful. Thank you for the update.
Locally had a 3.5x speedup writing 5 million rows and 18 columns to an internal buffer.
closes #3004
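For readers wondering how a csv writer can be parallelized at all, one common approach is to split the rows into chunks, serialize each chunk independently, and concatenate the pieces in their original order. The sketch below illustrates that general idea only; it is not taken from this PR and makes no claim about how Polars implements it. The names `serialize_chunk` and `parallel_csv`, the chunk size, and the worker count are all assumptions chosen for illustration.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def serialize_chunk(rows):
    """Serialize one chunk of rows to CSV text in its own private buffer."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parallel_csv(header, rows, chunk_size=50_000, max_workers=4):
    """Serialize rows chunk by chunk on a thread pool and join the results.

    Hypothetical helper: chunk_size and max_workers are arbitrary defaults.
    """
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order,
        # so the original row order is preserved.
        parts = list(pool.map(serialize_chunk, chunks))
    return serialize_chunk([header]) + "".join(parts)


if __name__ == "__main__":
    rows = [(i, f"row-{i}") for i in range(200_000)]
    text = parallel_csv(("id", "name"), rows)
    print(len(text.splitlines()))
```

In CPython the global interpreter lock keeps this pure-Python version from using several cores, so it shows the structure of the technique rather than the speedup; a native implementation can run the per-chunk serialization truly in parallel.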