
Multithreaded write_csv() #3004

Closed
momentlost opened this issue Mar 30, 2022 · 24 comments · Fixed by #3652

Comments

@momentlost

Hi,

It appears that the write_csv() method currently does not exploit multithreading the way read_csv() does. I am not sure how easy it would be to implement, but the current write speed tops out at around 20 MB/s. The system I use has 30+ physical cores and storage that can do 4,000 MB/s.

Is there any way to speed up writing using multiple threads, and to specify n_threads in write_csv() just as in read_csv()?

Thanks!

@momentlost
Author

BTW, I used to use R's data.table for my workflow, and it fully exploits multiple cores/threads in its fwrite() function. I was expecting the same from polars.

@momentlost
Author

Apologies for nagging. Any chance this feature will be added/supported in the near future? I find it highly relevant and necessary, as I have to read, manipulate, and save 100 GB+ of data constantly. It's a bit odd that reading supports full-fledged multithreading but saving does not. On my 36-core system, it takes about 20-50 times longer to save data than to read the same amount.

@ritchie46
Member

I am spending a lot of time on this project. It will come eventually. Please don't push. Have you considered opening a PR yourself?

On the subject: a binary file format is probably better suited for these quantities of data. CSV is very expensive.
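
For example, if whoever consumes the files can read Arrow or Parquet, something along these lines skips the expensive per-value text formatting entirely (paths are just placeholders):

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})

# Columnar binary formats avoid formatting every value as text,
# which is where most of the CSV write time goes.
df.write_ipc("out.arrow")        # Arrow IPC / Feather
df.write_parquet("out.parquet")  # Parquet, compressed by default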

@momentlost
Author

I do appreciate all your efforts on this great project and I'm grateful. I just wondered whether it was going to come eventually or not. I understand this may take time, since it may require changing the fundamental way .csv saving is handled.

Since I have zero experience in Rust, I doubt I could be helpful in improving this, which is why I did not consider opening a PR myself.

Thanks again.

@cbilot

cbilot commented Apr 6, 2022

I too have a multi-core system (Threadripper Pro with 32 cores and 512 GB of RAM). Indeed, I have four 1-terabyte PCIe Gen4 NVMe drives in RAID0 as my work directory. (Yep, those drives make it blazingly fast to move arrow/ipc files in and out of memory.) I also come from R and its impressive data.table package, but now prefer Python with Polars.

May I ask why you are creating such large csv files of 100GB+?

Are you creating these huge csv files as input to another system (e.g., a database loader)?

Or perhaps you are using these as input files for programs you previously wrote in R/data.table (leveraging data.table's fread method for fast parsing)?

@momentlost
Author

The raw data are terabytes, and after some aggregation, extraction, and cleaning they are still 10-100 GB. They are then passed on to my collaborators and/or used for subsequent analyses. Even if not hundreds of GB, having to write 10 GB+ of data is pretty common, I would say.

@ritchie46
Member

#3086 should lead to a ~2x perf improvement in writing csv, and #3085 should improve the performance of writing datetimes/dates.

@cbilot

cbilot commented Apr 7, 2022

Great! I'll compile polars through bbb8703, which includes #3085 and #3086, and take a look. I have a dataset of 100 million voter records handy. Lots of string and date columns.

@ritchie46
Member

Great! I'll compile polars through bbb8703, which includes #3085 and #3086, and take a look. I have a dataset of 100 million voter records handy. Lots of string and date columns.

Release is on its way. 1 hour. :)

@cbilot

cbilot commented Apr 7, 2022

Ok, a sneak preview of the results. As a test case, I'm using a dataset of 97,468,922 (publicly available) voter records with 30 columns: mostly string fields and a few date fields. This results in an uncompressed csv file of 20.5 GB. My computing platform is as described above. I am writing the csv file to a RAM disk (memory).

start = time.perf_counter()
df.write_csv("/tmp/tmp.csv")
end = time.perf_counter()
print(end - start)

Polars 0.13.18: 228 seconds
Polars (through bbb8703): 156 seconds

The enhancements cut the time by about a third for my particular platform and string-heavy dataset.

@cbilot

cbilot commented Apr 7, 2022

Just for benchmark purposes regarding the file format, writing this to an Arrow IPC (Feather) file:

start = time.perf_counter()
df.write_ipc("/tmp/tmp.ipc")
end = time.perf_counter()
print(end - start)

Polars (through bbb8703): 26 seconds

@ritchie46
Member

Just for benchmark purposes regarding the file format, writing this to an Arrow IPC (Feather) file:

start = time.perf_counter()
df.write_ipc("/tmp/tmp.ipc")
end = time.perf_counter()
print(end - start)

Polars (through bbb8703): 26 seconds

That should also have gotten faster compared to the previous release? Has it? :)

@cbilot

cbilot commented Apr 7, 2022

As an aside, I also tried another approach yesterday: writing the data to a Parquet file, then running R and data.table as follows:

library(data.table)
library(arrow)
system.time(df <- read_parquet('/tmp/tmp.parquet'))
system.time(setDT(df))
system.time(fwrite(df, '/tmp/fwrite.csv'))

The hope was to leverage the insanely-fast fwrite method of the data.table package, at the minimal cost of writing and reading a Parquet file.

The results were hugely disappointing - for an unexpected reason. The reading/writing of a Parquet file was very fast. But the bottleneck turned out to be the setDT method - which converts an R data.frame to a data.table. Oddly enough, the conversion from an R data.frame to a data.table took over 150 seconds on my dataset.

In my experience with R and data.table, setDT is typically instantaneous (which is why I was hopeful of a good result). I'm not sure why it took so long. I also tried as.data.table, just to see if I was doing something wrong, but that yielded no improvement.

Perhaps data.table uses a row-oriented format internally, and thus must make a conversion from column-oriented data. I have no idea.

@cbilot

cbilot commented Apr 7, 2022

That should also have gotten faster compared to previous release? Has it? :)

Oddly, it hasn't improved on my system.

Doing a more careful benchmark of write_ipc over 5 iterations for each version:

Polars 0.13.18: 28.0 seconds
Polars (through bbb8703): 27.6 seconds
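
Roughly, the timing loop looks like this (a minimal sketch; df is the voter dataset already in memory):

import time

def bench_write_ipc(df, path="/tmp/tmp.ipc", n=5):
    # Time n consecutive writes and return each duration in seconds.
    times = []
    for _ in range(n):
        start = time.perf_counter()
        df.write_ipc(path)
        times.append(time.perf_counter() - start)
    return times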

That said, it is blazingly fast for loading huge datasets. Indeed, I've converted to IPC/Feather format for file-based storage for all my datasets. I just loaded a 30-column dataset of 49 million records in a fraction of a second.

@ritchie46
Member

Yeah.. as I understand it, Feather/IPC is just Arrow on disk. So we are maximizing the IO bus capacity, as we don't have to compute anything. No compression, serialisation, etc.

So I guess writing can be ~25x more expensive than reading. 🤔

@cbilot

cbilot commented Apr 10, 2022

There may be a regression regarding this issue in 0.13.20 released today. I noticed it yesterday when I compiled polars from source code. (Perhaps I should have mentioned it yesterday, but I wasn't sure if it was a mistake on my end compiling the Polars code.)

Polars 0.13.19 vs 0.13.20

Using the same files and set-up as described above, the write_csv and write_ipc (re-tested today):

Polars 0.13.19:
write_csv: 154 seconds (average of 155, 155, 153)
write_ipc: 27.9 seconds (27.9, 28.0, 28.0, 27.9, 27.8)

Polars 0.13.20
write_csv: 228 seconds (227, 230, 227)
write_ipc: 16.3 seconds (15.4, 16.5, 16.5, 16.7, 16.6)

On the one hand, we gained significant performance in the write_ipc method (love that), but seemingly at the expense of the write_csv method.

Polars 0.13.20, using the new unset_large_os_pages method

When I noticed this yesterday, I suspected it might have something to do with #3094. (None of the other commits between 0.13.19 and yesterday seemed relevant.) So I ran some tests this morning and discovered the following.

Polars 0.13.20: calling the new unset_large_os_pages method directly before the write_csv method or write_ipc method:
write_csv: 227 seconds (227, 227, 228)
write_ipc: 27.9 seconds (27.9, 27.8, 28.0, 27.9, 27.9)

Polars 0.13.20: calling the new unset_large_os_pages method directly after the import polars statement:
write_csv: 152 seconds (151, 152, 153)
write_ipc: 28.0 seconds (27.9, 28.0, 28.1, 28.0, 28.0)

Thus, it would seem that to maintain the level of performance of the write_csv method in Polars 0.13.20, the unset_large_os_pages method must be called early in the program, before any data is loaded (and long before the write_csv method is called).
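
In code, the ordering that keeps write_csv fast looks roughly like this (assuming the setting is exposed as a top-level function with this name; see #3094 for the exact spelling and location):

import polars as pl

# The call only helps when made before any data is loaded; calling it
# right before write_csv left the slow ~228 s timings unchanged.
pl.unset_large_os_pages()  # assumed name/location -- see #3094

df = pl.read_ipc("/tmp/tmp.ipc")  # load the 20.5 GB dataset
df.write_csv("/tmp/tmp.csv")      # ~152 s instead of ~228 s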

Not sure what to make of the above. But I wanted to give a heads up.

@ritchie46
Member

Hmm.. maybe we should not set large OS pages by default. It showed a significant performance improvement on the db-benchmark queries. But I think its performance really depends on the memory load, so I will yank that release and setting until we have more data on this.

@cbilot

cbilot commented Apr 10, 2022

Are there other areas where the new large OS page setting is expected to affect performance? It might give me a chance to benchmark them. (Filter performance, sorts, writing other file formats, etc.)

The new setting certainly improved the writing of IPC/Feather files.

@ritchie46
Member

Every operation where we interact with memory allocation. So that's everywhere.

@ghuls
Collaborator

ghuls commented Apr 14, 2022

@ritchie46 A description of how data.table's fwrite() writes with multiple threads:
https://h2o.ai/blog/fast-csv-writing-for-r/
https://rdrr.io/cran/data.table/man/fwrite.html
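
For reference, the approach described there boils down to formatting the rows in chunks and appending the chunks to the file in their original order. A rough Python-level sketch of the idea (in fwrite the per-chunk formatting itself runs on multiple native threads, which a plain Python loop does not; the header-flag keyword may be spelled differently depending on the polars version):

import io
import polars as pl

def chunked_write_csv(df: pl.DataFrame, path: str, chunk_rows: int = 1_000_000) -> None:
    # Serialize the frame in row chunks; each chunk could in principle be
    # formatted by its own thread, as long as the resulting buffers are
    # written to the file in order.
    with open(path, "wb") as f:
        for offset in range(0, df.height, chunk_rows):
            buf = io.BytesIO()
            df.slice(offset, chunk_rows).write_csv(buf, has_header=(offset == 0))
            f.write(buf.getvalue())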

@universalmind303
Collaborator

If we introduced a partitioned write feature like the one mentioned in #2257, we should be able to parallelize the writers, as each one writes to a unique file.

something like

let df = df!(
  "foo" => [1, 2, 3],
  "bar" => ["a", "b", "c"]
)?;

// base output path of ./out.csv
let dir = Path::new("./out.csv");
let partition_key = "bar";
df.partition_by(&[partition_key])
  .par_iter()
  .for_each(|df| {
    // you could just read `df[partition_key][0]`, since each frame already holds a single key value
    let path = resolve_partitioned_path(&dir, partition_key); // hypothetical helper
    // resulting paths: [./out/a.csv, ./out/b.csv, ./out/c.csv]
    CsvWriter::from_path(path).finish(&df).unwrap();
  });

It should be pretty easy, as we already have the partition_by function.
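
The same idea can already be approximated from Python, since DataFrame.partition_by returns one frame per key value; a rough sketch (the output naming is made up, and real parallel gains depend on the writes releasing the GIL):

import os
from concurrent.futures import ThreadPoolExecutor
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["a", "b", "c"]})
os.makedirs("out", exist_ok=True)

def write_partition(part: pl.DataFrame) -> None:
    key = part["bar"][0]              # each partition holds a single key value
    part.write_csv(f"out/{key}.csv")  # e.g. out/a.csv, out/b.csv, out/c.csv

with ThreadPoolExecutor() as pool:
    list(pool.map(write_partition, df.partition_by("bar")))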

@jangorecki

jangorecki commented Nov 17, 2023

The results were hugely disappointing - for an unexpected reason. The reading/writing of a Parquet file was very fast. But the bottleneck turned out to be the setDT method - which converts an R data.frame to a data.table. Oddly enough, the conversion from an R data.frame to a data.table took over 150 seconds on my dataset.

In my experience with R and data.table, setDT is typically instantaneous (which is why I was hopeful of a good result). I'm not sure why it took so long. I also tried as.data.table, just to see if I was doing something wrong, but that yielded no improvement.

setDT is instantaneous, and not just typically, but by design. It doesn't alter the data; it just changes the class attribute and over-allocates (empty!) pointers for new columns that can be added in the future without changing the pointer to the DT.
The root cause, I believe, was that the object supplied to setDT was not a data.frame but some kind of lazy data.frame. R evaluates lazily, so when the object was about to be used in setDT, the actual conversion to a (non-lazy) data.frame happened.

@cbilot do you remember what the fwrite time was then?

A data.table, like a data.frame, is just a list of vectors that are all required to have the same length, plus some attributes (class and row names). Therefore it is a column-oriented memory model.

@cbilot

cbilot commented Nov 18, 2023

#3652 provides some timings.

@jangorecki

Well done then, happy to see such good timings. That keeps the csv format competitive with modern binary formats.
