
Multithreaded write_csv() #3004

Closed
momentlost opened this issue Mar 30, 2022 · 24 comments · Fixed by #3652

Comments

@momentlost

Hi,

It appears that the write_csv() method currently does not exploit multithreading the way read_csv() does. I am not sure how easy it would be to implement, but the current write speed tops out at around 20 MB/s. The system I use has 30+ physical cores and storage that can do 4,000 MB/s.

Is there any way to speed up writing using multiple threads, and to specify n_threads in write_csv() just as in read_csv()?

Thanks!

@momentlost
Author

BTW, I used to use R's data.table for my workflow, and it fully exploits multiple cores/threads in its fwrite() function. I was expecting the same from polars.

@momentlost
Author

Apologies for nagging. Any chance this feature will be added/supported in the near future? I find it highly relevant and necessary, as I have to read, manipulate, and save 100 GB+ of data constantly. It's a bit odd that reading supports full-fledged multithreading but saving does not. On my 36-core system, it takes about 20-50 times longer to save data than to read the same amount.

@ritchie46
Member

I am spending a lot of time on this project. It will come eventually. Please don't push. Have you considered opening a PR yourself?

On the subject: a binary file format is probably better suited for these quantities of data. CSV is very expensive.
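
For example, if whoever consumes the files can read Arrow or Parquet, something along these lines skips the expensive per-value text formatting entirely (paths are just placeholders):

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

# Columnar binary formats avoid formatting every value as text,
# which is where most of the CSV write time goes.
df.write_ipc("out.arrow")        # Arrow IPC / Feather
df.write_parquet("out.parquet")  # Parquet, compressed by default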

@momentlost
Author

I do appreciate all your efforts on this great project and I'm grateful. I just wondered whether it was going to come eventually or not. I understand this may take time, since it may require changing the fundamental way .csv saving is handled.

Since I have zero experience in Rust, I doubt I could be helpful in improving this, which is why I did not consider opening a PR myself.

Thanks again.

@cbilot

cbilot commented Apr 6, 2022

I too have a multi-core system (Threadripper Pro with 32 cores and 512 GB of RAM). Indeed, I have four 1-terabyte PCIe Gen4 NVMe drives in RAID0 as my work directory. (Yep, those drives make it blazingly fast to move arrow/ipc files in and out of memory.) I also come from R and its impressive data.table package, but now prefer Python with Polars.

May I ask why you are creating such large csv files of 100GB+?

Are you creating these huge csv files as input to another system (e.g., a database loader)?

Or perhaps you are using these as input files for programs you previously wrote in R/data.table (leveraging data.table's fread method for fast parsing)?

@momentlost
Author

The raw data are terabytes, and after some aggregation, extraction, and cleaning they are still 10-100 GB. They are then passed on to my collaborators and/or used for subsequent analyses. Even if not hundreds of GB, having to write 10 GB+ of data is pretty common, I would say.

@ritchie46
Member

#3086 should lead to a ~2x perf improvement in writing csv, and #3085 should improve the performance of writing datetimes/dates.

@cbilot

cbilot commented Apr 7, 2022

Great! I'll compile polars through bbb8703, which includes #3085 and #3086, and take a look. I have a dataset of 100 million voter records handy. Lots of string and date columns.

@ritchie46
Member

Great! I'll compile polars through bbb8703, which includes #3085 and #3086, and take a look. I have a dataset of 100 million voter records handy. Lots of string and date columns.

Release is on its way. 1 hour. :)

@cbilot

cbilot commented Apr 7, 2022

Ok, a sneak preview of the results. As a test case, I'm using a dataset of 97,468,922 (publicly available) voter records with 30 columns: mostly string fields and a few date fields. This results in an uncompressed csv file of 20.5 GB. My computing platform is as described above. I am writing the csv file to a RAM disk (memory).

start = time.perf_counter()
df.write_csv("/tmp/tmp.csv")
end = time.perf_counter()
print(end - start)

Polars 0.13.18: 228 seconds
Polars (through bbb8703): 156 seconds

The enhancements cut the time by about a third for my particular platform and string-heavy dataset.

@cbilot

cbilot commented Apr 7, 2022

Just for benchmark purposes regarding the file format, writing this to an Arrow IPC (Feather) file:

start = time.perf_counter()
df.write_ipc("/tmp/tmp.ipc")
end = time.perf_counter()
print(end - start)

Polars (through bbb8703): 26 seconds

@ritchie46
Member

Just for benchmark purposes regarding the file format, writing this to an Arrow IPC (Feather) file:

start = time.perf_counter()
df.write_ipc("/tmp/tmp.ipc")
end = time.perf_counter()
print(end - start)

Polars (through bbb8703): 26 seconds

That should also have gotten faster compared to the previous release? Has it? :)

@cbilot

cbilot commented Apr 7, 2022

As an aside, I also tried another approach yesterday: writing the data to a Parquet file, then running R and data.table as follows:

library(data.table)
library(arrow)
system.time(df <- read_parquet('/tmp/tmp.parquet'))
system.time(setDT(df))
system.time(fwrite(df, '/tmp/fwrite.csv'))

The hope was to leverage the insanely-fast fwrite method of the data.table package, at the minimal cost of writing and reading a Parquet file.

The results were hugely disappointing - for an unexpected reason. The reading/writing of a Parquet file was very fast. But the bottleneck turned out to be the setDT method - which converts an R data.frame to a data.table. Oddly enough, the conversion from an R data.frame to a data.table took over 150 seconds on my dataset.

In my experience with R and data.table, setDT is typically instantaneous (which is why I was hopeful of a good result). I'm not sure why it took so long. I also tried as.data.table, just to see if I was doing something wrong, but that yielded no improvement.

Perhaps data.table uses a row-oriented format internally, and thus must make a conversion from column-oriented data. I have no idea.

@cbilot

cbilot commented Apr 7, 2022

That should also have gotten faster compared to previous release? Has it? :)

Oddly, it hasn't improved on my system.

Doing a more careful benchmark of write_ipc over 5 iterations for each version:

Polars 0.13.18: 28.0 seconds
Polars (through bbb8703): 27.6 seconds
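
Roughly, the timing loop looks like this (a minimal sketch; df is the voter dataset already in memory):

import time

def bench_write_ipc(df, path="/tmp/tmp.ipc", n=5):
    # Time n consecutive writes and return each duration in seconds.
    times = []
    for _ in range(n):
        start = time.perf_counter()
        df.write_ipc(path)
        times.append(time.perf_counter() - start)
    return times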

That said, it is blazingly fast for loading huge datasets. Indeed, I've converted to IPC/Feather format for file-based storage for all my datasets. I just loaded a 30-column dataset of 49 million records in a fraction of a second.

@ritchie46
Member

Yeah.. as I understand it, Feather/IPC is just Arrow on disk. So we are maximizing the IO bus capacity, as we don't have to compute anything. No compression, serialisation, etc.

So I guess writing can be ~25x more expensive than reading. 🤔

@cbilot

cbilot commented Apr 10, 2022

There may be a regression regarding this issue in 0.13.20 released today. I noticed it yesterday when I compiled polars from source code. (Perhaps I should have mentioned it yesterday, but I wasn't sure if it was a mistake on my end compiling the Polars code.)

Polars 0.13.19 vs 0.13.20

Using the same files and set-up as described above, the write_csv and write_ipc (re-tested today):

Polars 0.13.19:
write_csv: 154 seconds (average of 155, 155, 153)
write_ipc: 27.9 seconds (27.9, 28.0, 28.0, 27.9, 27.8)

Polars 0.13.20
write_csv: 228 seconds (227, 230, 227)
write_ipc: 16.3 seconds (15.4, 16.5, 16.5, 16.7, 16.6)

On the one hand, we gained significant performance in the write_ipc method (love that), but seemingly at the expense of the write_csv method.

Polars 0.13.20, using the new unset_large_os_pages method

When I noticed this yesterday, I suspected it might have something to do with #3094. (None of the other commits between 0.13.19 and yesterday seemed relevant.) So I ran some tests this morning and discovered the following.

Polars 0.13.20: calling the new unset_large_os_pages method directly before the write_csv method or write_ipc method:
write_csv: 227 seconds (227, 227, 228)
write_ipc: 27.9 seconds (27.9, 27.8, 28.0, 27.9, 27.9)

Polars 0.13.20: calling the new unset_large_os_pages method directly after the import polars statement:
write_csv: 152 seconds (151, 152, 153)
write_ipc: 28.0 seconds (27.9, 28.0, 28.1, 28.0, 28.0)

Thus, it would seem that to maintain the level of performance of the write_csv method in Polars 0.13.20, the unset_large_os_pages method must be called early in the program, before any data is loaded (and long before the write_csv method is called).
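
In code, the ordering that keeps write_csv fast looks roughly like this (assuming the setting is exposed as a top-level function with this name; see #3094 for the exact spelling and location):

import polars as pl

# The call only helps when made before any data is loaded; calling it
# right before write_csv left the slow ~228 s timings unchanged.
pl.unset_large_os_pages()  # assumed name/location -- see #3094

df = pl.read_ipc("/tmp/tmp.ipc")  # load the 20.5 GB dataset
df.write_csv("/tmp/tmp.csv")      # ~152 s instead of ~228 s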

Not sure what to make of the above. But I wanted to give a heads up.

@ritchie46
Member

Hmm.. maybe we should not set large OS pages by default. It showed a significant performance improvement on the db-benchmark queries. But I think its performance really depends on the memory load, so I will yank that release and setting until we have more data on this.

@cbilot

cbilot commented Apr 10, 2022

Are there other areas where the new large OS page setting is expected to affect performance? It might give me a chance to benchmark them. (Filter performance, sorts, writing other file formats, etc.)

The new setting certainly improved the writing of IPC/Feather files.

@ritchie46
Member

Every operation where we interact with memory allocation. So that's everywhere.

@ghuls
Collaborator

ghuls commented Apr 14, 2022

@ritchie46 A description of how data.table's fwrite() writes with multiple threads:
https://h2o.ai/blog/fast-csv-writing-for-r/
https://rdrr.io/cran/data.table/man/fwrite.html
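
For reference, the approach described there boils down to formatting the rows in chunks and appending the chunks to the file in their original order. A rough Python-level sketch of the idea (in fwrite the per-chunk formatting itself runs on multiple native threads, which a plain Python loop does not; the header-flag keyword may be spelled differently depending on the polars version):

import io
import polars as pl

def chunked_write_csv(df: pl.DataFrame, path: str, chunk_rows: int = 1_000_000) -> None:
    # Serialize the frame in row chunks; each chunk could in principle be
    # formatted by its own thread, as long as the resulting buffers are
    # written to the file in order.
    with open(path, "wb") as f:
        for offset in range(0, df.height, chunk_rows):
            buf = io.BytesIO()
            df.slice(offset, chunk_rows).write_csv(buf, has_header=(offset == 0))
            f.write(buf.getvalue())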

@universalmind303
Collaborator

If we introduced a partitioned write feature like the one mentioned in #2257, we should be able to parallelize the writers, as each one writes to a unique file.

something like

let df = df!(
  "foo" => [1, 2, 3],
  "bar" => ["a", "b", "c"]
)?;

// base output path of ./out.csv
let dir = Path::new("./out.csv");
let partition_key = "bar";
df.partition_by(&[partition_key])
  .par_iter()
  .for_each(|df| {
    // you could just read `df[partition_key][0]`, since each frame already holds a single key value
    let path = resolve_partitioned_path(&dir, partition_key); // hypothetical helper
    // resulting paths: [./out/a.csv, ./out/b.csv, ./out/c.csv]
    CsvWriter::from_path(path).finish(&df).unwrap();
  });

It should be pretty easy, as we already have the partition_by function.
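
The same idea can already be approximated from Python, since DataFrame.partition_by returns one frame per key value; a rough sketch (the output naming is made up, and real parallel gains depend on the writes releasing the GIL):

import os
from concurrent.futures import ThreadPoolExecutor
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
os.makedirs("out", exist_ok=True)

def write_partition(part: pl.DataFrame) -> None:
    key = part["bar"][0]              # each partition holds a single key value
    part.write_csv(f"out/{key}.csv")  # e.g. out/a.csv, out/b.csv, out/c.csv

with ThreadPoolExecutor() as pool:
    list(pool.map(write_partition, df.partition_by("bar")))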

@jangorecki

jangorecki commented Nov 17, 2023

The results were hugely disappointing - for an unexpected reason. The reading/writing of a Parquet file was very fast. But the bottleneck turned out to be the setDT method - which converts an R data.frame to a data.table. Oddly enough, the conversion from an R data.frame to a data.table took over 150 seconds on my dataset.

In my experience with R and data.table, setDT is typically instantaneous (which is why I was hopeful of a good result). I'm not sure why it took so long. I also tried as.data.table, just to see if I was doing something wrong, but that yielded no improvement.

setDT is instantaneous, and not just typically, but by design. It doesn't alter the data; it just changes the class attribute and over-allocates (empty!) pointers for new columns that can be added in the future without changing the pointer to the DT.
The root cause, I believe, was that the object supplied to setDT was not a data.frame but some kind of lazy data.frame. R evaluates lazily, so when the object was about to be used in setDT, the actual conversion to a (non-lazy) data.frame happened.

@cbilot do you remember what the fwrite time was then?

A data.table, like a data.frame, is just a list of vectors that are all required to have the same length, plus some attributes (class and row names). Therefore it is a column-oriented memory model.

@cbilot

cbilot commented Nov 18, 2023

#3652 provides some timings.

@jangorecki

Well done then, happy to see such good timings. That keeps the csv format competitive with modern binary formats.
