perf: Refactor CSV serialization to not go through AnyValue #15576

Merged
merged 2 commits into from
Apr 12, 2024

Conversation

ChayimFriedman2
Contributor

While convenient, going through AnyValue harms performance. Avoiding it gives a 40% boost on my machine, and even more when I limit it to a single thread.
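
To illustrate where the cost comes from, here is a minimal sketch and not the code from this PR. The `Value` enum is a hypothetical stand-in for a dynamically typed cell value (it is not Polars' actual `AnyValue`), and `write_dynamic` / `write_ints` are made-up names. The dynamic path wraps and matches on every cell, while the typed path knows the column type once and runs a loop with no per-cell type branch.

```rust
use std::io::Write;

/// Hypothetical stand-in for a dynamically typed cell value (assumption, not Polars' type).
enum Value<'a> {
    Int(i32),
    Str(&'a str),
}

/// Dynamic path: every cell is matched on its type before it is written.
fn write_dynamic(values: &[Value<'_>], out: &mut Vec<u8>) {
    for v in values {
        match v {
            Value::Int(i) => write!(out, "{i}").unwrap(),
            Value::Str(s) => out.extend_from_slice(s.as_bytes()),
        }
        out.push(b'\n');
    }
}

/// Typed path: the column type is fixed up front, so the loop body has no type dispatch.
fn write_ints(values: &[i32], out: &mut Vec<u8>) {
    for v in values {
        write!(out, "{v}").unwrap();
        out.push(b'\n');
    }
}
```

Both functions produce the same bytes for an integer column; the difference is only in how much work is done per cell.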

I did several other things to improve performance, like pre-compilation of date/time formatting and speeding up strings with quotes that need to be escaped. Those aren't strictly related, but they are nice.
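
As a rough sketch of what a faster path for strings with embedded quotes can look like (an assumption for illustration, not necessarily how this PR does it), the hypothetical `write_quoted` below copies whole runs of bytes between quote characters instead of pushing one byte at a time. It uses the usual CSV convention of doubling an embedded quote.

```rust
/// Appends `field` to `out` as a quoted CSV field, doubling any embedded quote characters.
/// Hypothetical helper for illustration only.
fn write_quoted(field: &str, out: &mut Vec<u8>) {
    const QUOTE: u8 = b'"';
    let bytes = field.as_bytes();
    out.push(QUOTE);
    let mut start = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == QUOTE {
            // Copy the run up to and including this quote, then emit the doubling quote.
            out.extend_from_slice(&bytes[start..=i]);
            out.push(QUOTE);
            start = i + 1;
        }
    }
    // Copy whatever follows the last quote (the whole field if it has none).
    out.extend_from_slice(&bytes[start..]);
    out.push(QUOTE);
}
```

A field without quotes is copied with a single `extend_from_slice` call. Deciding whether a field needs quoting at all is left out of this sketch.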

Benchmark (it runs inside POOL.install() because otherwise the switch to Polars' thread pool has a measurable overhead; that overhead should only be present when benchmarking, not in real-world cases):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use mimalloc::MiMalloc;
use polars::prelude::*;
use polars_core::POOL;

#[global_allocator]
static MIMALLOC: MiMalloc = MiMalloc;

fn my_benchmark(_c: &mut Criterion) {
    POOL.install(|| {
        let mut c = Criterion::default().configure_from_args();
        const SIZE: i32 = 10_000;
        let mut df = df![
            "a" => (0..SIZE).collect::<Vec<_>>(),
            "b" => (0..SIZE).map(|v| v.to_string()).collect::<Vec<_>>(),
        ]
        .unwrap();

        let mut csv = Vec::new();
        c.bench_function("CSV serialization", |b| {
            b.iter(|| {
                csv.clear();
                CsvWriter::new(&mut csv).finish(&mut df).unwrap();
                black_box(&csv);
            })
        });
    });
}

criterion_group!(benches, my_benchmark);
criterion_main!(benches);

@ChayimFriedman2 ChayimFriedman2 changed the title Refactor CSV serialization to not go thorough AnyValue perf: Refactor CSV serialization to not go thorough AnyValue Apr 10, 2024
@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Apr 10, 2024
@ChayimFriedman2 ChayimFriedman2 marked this pull request as draft April 10, 2024 10:56

codecov bot commented Apr 10, 2024

Codecov Report

Attention: Patch coverage is 95.30387%, with 34 lines in your changes missing coverage. Please review.

Project coverage is 81.19%. Comparing base (97c61fe) to head (2388a22).
Report is 10 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| crates/polars-io/src/csv/write_impl/serializer.rs | 96.33% | 19 Missing ⚠️ |
| crates/polars-io/src/csv/write_impl/mod.rs | 92.71% | 15 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15576      +/-   ##
==========================================
+ Coverage   81.08%   81.19%   +0.11%     
==========================================
  Files        1367     1368       +1     
  Lines      174872   175289     +417     
  Branches     2530     2530              
==========================================
+ Hits       141799   142333     +534     
+ Misses      32598    32482     -116     
+ Partials      475      474       -1     



codspeed-hq bot commented Apr 10, 2024

CodSpeed Performance Report

Merging #15576 will not alter performance

Comparing ChayimFriedman2:speedup-csv-writer (2388a22) with main (05d980f)

Summary

✅ 22 untouched benchmarks

@ChayimFriedman2 ChayimFriedman2 marked this pull request as ready for review April 10, 2024 13:03
@ritchie46
Member

Thanks for the PR. This certainly looks interesting. Will take a look tomorrow.

@ritchie46
Member

I went through this carefully and this looks great! 🙌 Really neat monomorphization and constant usage to ensure we get an optimal serializer. Thanks a lot.
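
For readers unfamiliar with the pattern mentioned here, the following is a minimal sketch of monomorphization through const generics. It is not taken from the PR; `write_column`, `write_column_dyn`, and the `QUOTE_NON_NULL` parameter are hypothetical names. The idea is that a runtime option is checked once per column and then selects a specialized instantiation in which the flag is a compile-time constant.

```rust
use std::io::Write;

/// Hypothetical serializer: `QUOTE_NON_NULL` is a compile-time constant, so each
/// instantiation is compiled separately and the `if` on it can be resolved statically.
fn write_column<const QUOTE_NON_NULL: bool>(values: &[Option<i64>], out: &mut Vec<u8>) {
    for v in values {
        // Nulls are written as an empty field in this sketch.
        if let Some(v) = v {
            if QUOTE_NON_NULL {
                out.push(b'"');
            }
            write!(out, "{v}").unwrap();
            if QUOTE_NON_NULL {
                out.push(b'"');
            }
        }
        out.push(b'\n');
    }
}

/// The runtime option is inspected once per column rather than once per value.
fn write_column_dyn(values: &[Option<i64>], quote: bool, out: &mut Vec<u8>) {
    if quote {
        write_column::<true>(values, out)
    } else {
        write_column::<false>(values, out)
    }
}
```

The trade-off is more generated code (one copy per combination of constants) in exchange for tighter inner loops.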

@ritchie46 ritchie46 merged commit 002de7b into pola-rs:main Apr 12, 2024
19 checks passed
@ChayimFriedman2 ChayimFriedman2 deleted the speedup-csv-writer branch April 14, 2024 09:19
2 participants