perf: Refactor CSV serialization to not go through `AnyValue` #15576

ChayimFriedman2 · 2024-04-10T10:22:02Z

While convenient, going through `AnyValue` harms performance. Avoiding it gives a boost of 40% on my machine, and even more if I limit the run to a single thread.
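To illustrate the idea, here is a minimal sketch, not the code from this PR. `Value` is a hypothetical stand-in for a dynamically typed value such as `AnyValue`, and both function names are made up for the example. The dynamic path matches on the value's type for every cell, while the typed path knows the column type up front and has no per-value dispatch in its loop.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a dynamically typed value such as `AnyValue`.
enum Value<'a> {
    Int(i32),
    Str(&'a str),
}

/// Dynamic path: each cell is wrapped in the enum and matched on again
/// when it is written, so the type check is repeated per value.
fn write_dynamic<W: Write>(out: &mut W, values: &[Value<'_>]) -> io::Result<()> {
    for v in values {
        match v {
            Value::Int(i) => writeln!(out, "{i}")?,
            Value::Str(s) => writeln!(out, "{s}")?,
        }
    }
    Ok(())
}

/// Typed path: the column type is known once, up front, so the loop body
/// contains no per-value dispatch.
fn write_typed_ints<W: Write>(out: &mut W, values: &[i32]) -> io::Result<()> {
    for i in values {
        writeln!(out, "{i}")?;
    }
    Ok(())
}
```

Both paths produce the same bytes for an integer column; the difference is only in how much work is done per value.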

I also did several other things to improve performance, such as precompiling the date/time formatting and speeding up the handling of strings that contain quotes and need escaping. Those aren't strictly related, but they are nice to have.
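One possible way to speed up quote escaping is sketched below. This is an assumption for illustration, not necessarily what the PR does: `write_quoted` is a hypothetical helper that assumes a single-byte quote character and copies each quote-free run with one `write_all` call instead of pushing bytes one at a time.

```rust
use std::io::{self, Write};

/// Writes `field` wrapped in `quote`, doubling every embedded quote byte.
/// Assumes `quote` is a single-byte (ASCII) character.
fn write_quoted<W: Write>(out: &mut W, field: &str, quote: u8) -> io::Result<()> {
    let bytes = field.as_bytes();
    out.write_all(&[quote])?;
    // Start of the run that has not been written yet.
    let mut start = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == quote {
            // Copy the run up to and including the quote, then repeat the quote.
            out.write_all(&bytes[start..=i])?;
            out.write_all(&[quote])?;
            start = i + 1;
        }
    }
    // Copy whatever follows the last quote (the whole field if it has none).
    out.write_all(&bytes[start..])?;
    out.write_all(&[quote])
}
```

A field without quotes is written with a single bulk copy, so the common case stays cheap.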

Benchmark (the `POOL.install()` is there because otherwise the switch to Polars' thread pool has a measurable overhead, but this should only show up in benchmarks and not in real-world use):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use mimalloc::MiMalloc;
use polars::prelude::*;
use polars_core::POOL;

#[global_allocator]
static MIMALLOC: MiMalloc = MiMalloc;

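fn my_benchmark(_c: &mut Criterion) {
    // Run everything inside Polars' thread pool so the pool switch is not measured.
    POOL.install(|| {
        // Build a Criterion instance from the CLI args; the `_c` passed in is unused.
        let mut c = Criterion::default().configure_from_args();
        const SIZE: i32 = 10_000;
        // One integer column and one string column.
        let mut df = df![
            "a" => (0..SIZE).collect::<Vec<_>>(),
            "b" => (0..SIZE).map(|v| v.to_string()).collect::<Vec<_>>(),
        ]
        .unwrap();

        // Reuse a single output buffer across iterations.
        let mut csv = Vec::new();
        c.bench_function("CSV serialization", |b| {
            b.iter(|| {
                csv.clear();
                CsvWriter::new(&mut csv).finish(&mut df).unwrap();
                black_box(&csv);
            })
        });
    });
}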

criterion_group!(benches, my_benchmark);
criterion_main!(benches);

codecov · 2024-04-10T11:19:46Z

Codecov Report

Attention: Patch coverage is 95.30387%, with 34 lines in your changes missing coverage. Please review.

Project coverage is 81.19%. Comparing base (97c61fe) to head (2388a22).
Report is 10 commits behind head on main.

Files	Patch %	Lines
crates/polars-io/src/csv/write_impl/serializer.rs	96.33%	19 Missing ⚠️
crates/polars-io/src/csv/write_impl/mod.rs	92.71%	15 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15576      +/-   ##
==========================================
+ Coverage   81.08%   81.19%   +0.11%     
==========================================
  Files        1367     1368       +1     
  Lines      174872   175289     +417     
  Branches     2530     2530              
==========================================
+ Hits       141799   142333     +534     
+ Misses      32598    32482     -116     
+ Partials      475      474       -1

codspeed-hq · 2024-04-10T11:24:54Z

CodSpeed Performance Report

Merging #15576 will not alter performance

Comparing ChayimFriedman2:speedup-csv-writer (2388a22) with main (05d980f)

Summary

✅ 22 untouched benchmarks

ritchie46 · 2024-04-10T18:23:55Z

Thanks for the PR. This certainly looks interesting. Will take a look tomorrow.

ritchie46 · 2024-04-12T07:23:08Z

I went through this carefully and this looks great! 🙌 Really neat monomorphization and constant usage to ensure we get an optimal serializer. Thanks a lot.
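As a rough illustration of what monomorphization over constants can look like, here is a minimal sketch. The function names and the `QUOTE_ALWAYS` parameter are hypothetical and not taken from the PR, and escaping is left out to keep it short. The option is a const generic, so each value gets its own compiled instantiation and the branch inside the loop can be resolved at compile time.

```rust
use std::io::{self, Write};

/// Hypothetical serializer for a string column. `QUOTE_ALWAYS` is a
/// compile-time constant, so `true` and `false` are compiled as two
/// separate functions. No escaping is done here.
fn write_str_column<const QUOTE_ALWAYS: bool, W: Write>(
    out: &mut W,
    values: &[&str],
) -> io::Result<()> {
    for v in values {
        if QUOTE_ALWAYS {
            out.write_all(b"\"")?;
            out.write_all(v.as_bytes())?;
            out.write_all(b"\"\n")?;
        } else {
            out.write_all(v.as_bytes())?;
            out.write_all(b"\n")?;
        }
    }
    Ok(())
}

/// Chooses the specialized instantiation once, outside the hot loop.
fn write_str_column_dyn<W: Write>(
    out: &mut W,
    values: &[&str],
    quote_always: bool,
) -> io::Result<()> {
    if quote_always {
        write_str_column::<true, W>(out, values)
    } else {
        write_str_column::<false, W>(out, values)
    }
}
```

The runtime option is checked a single time per column rather than once per value.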

ChayimFriedman2 requested review from ritchie46, stinodego, orlp and c-peters as code owners April 10, 2024 10:22

ChayimFriedman2 changed the title ~~Refactor CSV serialization to not go through AnyValue~~ perf: Refactor CSV serialization to not go through AnyValue Apr 10, 2024

github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Apr 10, 2024

ChayimFriedman2 marked this pull request as draft April 10, 2024 10:56

ChayimFriedman2 marked this pull request as ready for review April 10, 2024 13:03

ChayimFriedman2 added 2 commits April 10, 2024 23:31

Fix Write::write() to write_all()

2388a22
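For context on why `write_all()` is generally the safer call, here is a minimal sketch using only the standard library. The commit's actual diff is not shown here, so this is a general illustration rather than the change itself. `ShortWriter` is a hypothetical writer that accepts at most 3 bytes per `write` call, imitating a short write.

```rust
use std::io::{self, Write};

/// Hypothetical writer that accepts at most 3 bytes per `write` call.
struct ShortWriter {
    buf: Vec<u8>,
}

impl Write for ShortWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let n = data.len().min(3);
        self.buf.extend_from_slice(&data[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// A single `write` call may accept only part of the input; ignoring the
/// returned count silently drops the rest.
fn emit_with_write(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut w = ShortWriter { buf: Vec::new() };
    let _ = w.write(data)?;
    Ok(w.buf)
}

/// `write_all` keeps calling `write` until every byte has been accepted.
fn emit_with_write_all(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut w = ShortWriter { buf: Vec::new() };
    w.write_all(data)?;
    Ok(w.buf)
}
```

With this writer, `emit_with_write` keeps only the first 3 bytes of its input, while `emit_with_write_all` keeps all of them.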

ritchie46 approved these changes Apr 12, 2024

ritchie46 merged commit 002de7b into pola-rs:main Apr 12, 2024
19 checks passed

ChayimFriedman2 deleted the speedup-csv-writer branch April 14, 2024 09:19
