Cut parse time almost in half #40

Closed · dtolnay wants to merge 2 commits into base: rust

dtolnay commented Sep 4, 2018

I used the following program to generate a 1.6 gigabyte sample input that resembles what you show in your presentation (https://www.youtube.com/watch?v=aWbGPMxs0AM). Hopefully this is representative enough for useful parsing optimizations. In the presentation it looked like these StatemapInputDatum values make up the bulk of the data.

fn main() {
    // Emit 30 million concatenated JSON objects, roughly 1.6 GB of output.
    for i in 0..30_000_000u64 {
        let time = 20_000_000_000 + i;
        let entity = 50_000 + i % 1000;
        let state = i % 3;
        println!(r#"{{ "time": "{}", "entity": "{}", "state": {} }}"#, time, entity, state);
    }
}

First commit

Remove homegrown JSON lexing code

Much of the ingest time is being spent in your json_start and json_end functions, not in serde_json. I replaced them with serde_json::StreamDeserializer, which reads one JSON value at a time from the concatenated input without needing to compute each element's length in advance.

On my sample input this commit drops parse time from 10.1 seconds to 6.8 seconds. I isolated parse time by inserting an early return just below the match statement that contains the "illegal time value" error.
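
For reference, here is a minimal sketch of the StreamDeserializer approach; the StatemapInputDatum fields below are assumptions based on the sample input, not necessarily the struct as it exists in this repository:

    use serde::Deserialize;

    // Assumed shape of one input record; the real struct may differ.
    #[derive(Deserialize)]
    struct StatemapInputDatum {
        time: String,
        entity: String,
        state: u32,
    }

    fn parse(input: &str) -> Result<Vec<StatemapInputDatum>, serde_json::Error> {
        // StreamDeserializer yields one JSON value at a time from concatenated
        // input, so there is no need to scan for object boundaries by hand.
        serde_json::Deserializer::from_str(input)
            .into_iter::<StatemapInputDatum>()
            .collect()
    }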


Second commit

Avoid allocating String for datum time

A huge amount of the parse time is spent allocating Strings. The string associated with each datum's time is never used as a string; it is only parsed to a u64. This commit avoids allocating those Strings in the first place.

On my sample input this commit drops parse time from 6.8 seconds to 5.7 seconds. I isolated parse time by inserting an early return just below the line that contains let nstates: u32.
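
A minimal sketch of one way to do this, again assuming the field layout above; the u64_from_quoted helper is a hypothetical name, not necessarily how the commit implements it:

    use serde::{de, Deserialize, Deserializer};

    #[derive(Deserialize)]
    struct StatemapInputDatum {
        // Parse the quoted time straight into a u64, no intermediate String.
        #[serde(deserialize_with = "u64_from_quoted")]
        time: u64,
        entity: String,
        state: u32,
    }

    // Hypothetical helper: deserialize a string slice borrowed from the input
    // buffer (possible because these fields contain no escape sequences) and
    // parse it in place, with no allocation.
    fn u64_from_quoted<'de, D>(deserializer: D) -> Result<u64, D::Error>
    where
        D: Deserializer<'de>,
    {
        let s: &str = Deserialize::deserialize(deserializer)?;
        s.parse().map_err(de::Error::custom)
    }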


dtolnay added some commits Sep 4, 2018

Remove homegrown JSON lexing code
Avoid allocating String for datum time

dtolnay commented Sep 4, 2018

Serde and serde_json author here by the way. 😜 Feel free to just close the PR if the new code is too crazy. I know I used one or two Rust features that were not present in the code before.

bcantrill commented Sep 4, 2018

This is awesome -- thank you! For whatever it's worth, here is the exact input file I used in my demo (~30MB compressed). As I mentioned there, I really haven't done any analysis on this, so it's not a surprise that there is some low-hanging fruit. (And I definitely missed serde_json's ability to handle concatenated JSON!) That said, I would love to know the methodology here: if you don't mind me asking, what tools do you/did you use for profiling? Thanks again for both serde and serde_json (truly awesome work) -- and for your interest in my little goober!

dtolnay commented Sep 4, 2018

Just callgrind in this case.

$ valgrind --tool=callgrind ./target/release/statemap sample.out >rust.svg
$ callgrind_annotate --auto=yes callgrind.out.*

Here is the top of the callgrind output before this PR. (Again isolated to parsing only.)

3,532,171,764  PROGRAM TOTALS

1,093,993,901  statemap::statemap::Statemap::json_end
  486,996,711  <serde_json::read::StrRead<'a> as serde_json::Read<'a>>::parse_str
  458,993,115  <&'a mut serde_json::de::Deserializer<R> as serde::de::Deserializer<'de>>::deserialize_struct
  138,999,100  statemap::statemap::Statemap::ingest
  111,998,799  <&'a mut serde_json::de::Deserializer<R> as serde::de::Deserializer<'de>>::deserialize_string
   90,998,635  core::num::<impl core::str::FromStr for u64>::from_str
   90,002,430  jemalloc.c:mallocx
   72,001,944  jemalloc.c:sdallocx
   65,896,974  jemalloc.c:isfree
   60,158,832  tcache.h:isfree
   60,071,035  tcache.h:mallocx
   56,999,145  serde_json::de::from_str
   51,999,220  <core::marker::PhantomData<T> as serde::de::DeserializeSeed<'de>>::deserialize
   49,999,328  statemap::statemap::Statemap::json_start

From this I know:

  • 30% of instruction reads are in json_end. The json_end and json_start functions aren't even necessary, so when they show up this high, a different approach is warranted. *can fix*
  • parse_str is serde_json tokenizing JSON strings. Makes sense: practically the entire input consists of strings, so this should be the bulk of the actual useful work.
  • The output of deserialization consists almost entirely of structs and strings, so deserialize_struct and deserialize_string are not surprising.
  • u64::from_str is slow. If your numbers weren't quoted, serde_json would parse them natively without u64::from_str and be faster (see the sketch after this list). I could change your dtrace scripts, but meh.
  • There really shouldn't be any malloc in this workload. serde_json doesn't allocate internally unless a string contains escape sequences, so malloc only shows up because the output object holds a String, which forces an allocation and a memcpy. *can fix*
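
To illustrate the quoted-numbers point, here is a hypothetical variant of the input where time is a bare JSON number, which serde_json parses in its own lexer instead of going through u64::from_str:

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Datum {
        // As a bare JSON number, this deserializes directly into a u64.
        time: u64,
    }

    fn main() {
        // Hypothetical unquoted form of the "time" field.
        let json = r#"{ "time": 20000000000 }"#;
        let datum: Datum = serde_json::from_str(json).unwrap();
        assert_eq!(datum.time, 20_000_000_000);
    }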

Then I confirm that what callgrind suggests likely to help does actually help (it doesn't always). The old-fashioned way:

$ time ./target/release/statemap sample.out >rust.svg

bcantrill commented Sep 18, 2018

Just wanted to follow up over here to let you know that I haven't dropped this! I've been looking at the performance of your suggestions (both of which improve performance; I'm trying to quantify them separately and in a lab environment), and taking apart why Rust performs so much better than C. ;) Anyway, more to come -- and thank you again for your time and help!

gnzlbg commented Sep 19, 2018

@dtolnay

The old-fashioned way:

$ time ./target/release/statemap sample.out >rust.svg

You are probably already aware of this, but @bcantrill might not be: there is a neat tool for benchmarking CLI applications called hyperfine that's available in some package repositories (brew install hyperfine). It's written in Rust, so cargo install hyperfine will do as well. I found it significantly more reliable than time when benchmarking CLI programs recently (you can do a warmup, it quantifies shell startup time, you can control how often the program gets executed, etc.).
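
For example, a run against this workload might look like this (the redirection works because hyperfine executes the command through a shell):

    $ cargo install hyperfine
    $ hyperfine --warmup 1 --runs 5 './target/release/statemap sample.out > rust.svg'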

bcantrill commented Sep 22, 2018

Analysis of these improvements can be found in #42; I have integrated them under the aegis of #43, and am therefore closing this pull request. Thank you again, @dtolnay! (And thanks, too, to @gnzlbg for the hyperfine pointer!)

bcantrill closed this Sep 22, 2018
