Performance degradation with noddy csv parser #269
Comments
As reported in #273, this is believed to be due to regex recompilation. I can presumably cache these. |
OK, this is a problem we really have to fix. There are two options. One option, which preserves the API, is to create a thread-local value containing the compiled regular expressions, so we can reuse them between executions. The other option would be to modify the API so that the user is responsible for instantiating the tokenizer and handing it to us (giving them control over how long it lives). I'm inclined towards the first option, or perhaps, ultimately, a hybrid. That is, we could generate an "easy" entry point that uses the thread-local, and later add a second entry point that takes an explicit tokenizer, and you can call whichever you prefer. So let's start with the thread-local. The internal tokenizer is created here:

lalrpop/lalrpop/src/lexer/intern_token/mod.rs Lines 46 to 51 in 5246910

You can see that it generates a bunch of calls into the

lalrpop/lalrpop/src/lexer/intern_token/mod.rs Lines 104 to 105 in 5246910

We could modify this code to use

```rust
thread_local! {
    static REGEX_SET: RefCell<Option<RegexSet>> = RefCell::new(None);
}
```

then, if it is `None`, compile and store the regex set before using it. The only downside of this is that the cached regexes will never get freed. But of course people always have the option of making their own lexer (and we can also add the more flexible option later). |
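A minimal sketch of the thread-local caching pattern described above. To keep the example self-contained, a hypothetical `Matcher` struct stands in for the compiled `RegexSet` (so it compiles without the `regex` crate); `with_matcher` is likewise an illustrative name, not an actual LALRPOP API:

```rust
use std::cell::RefCell;

// Stand-in for the compiled regex set; in the generated code this
// would be a `regex::RegexSet` built from the grammar's token patterns.
struct Matcher {
    patterns: Vec<String>,
}

impl Matcher {
    fn compile(patterns: &[&str]) -> Matcher {
        // The expensive step we want to run at most once per thread.
        Matcher {
            patterns: patterns.iter().map(|s| s.to_string()).collect(),
        }
    }
}

thread_local! {
    // `None` until the first parse on this thread.
    static MATCHER: RefCell<Option<Matcher>> = RefCell::new(None);
}

fn with_matcher<R>(f: impl FnOnce(&Matcher) -> R) -> R {
    MATCHER.with(|m| {
        let mut slot = m.borrow_mut();
        // Compile on first use; every later call on this thread reuses it.
        let matcher = slot.get_or_insert_with(|| Matcher::compile(&[r"\d+", r","]));
        f(matcher)
    })
}

fn main() {
    println!("{}", with_matcher(|m| m.patterns.len()));
}
```

As noted above, the cached value lives as long as the thread does; that is the trade-off this design accepts.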
Just curious, why was this faster before 13.1? |
If I may suggest something:
(The only case where I think it's worthwhile to actually free the regexps is if a program were generating parsers dynamically at runtime... an interactive program to let you explore grammars?) :) |
I tried to write use

Do we know what changed? It seems like we have two options: cache regexes in `lazy_static`, or cache regexes in `thread_local`. Is there a third option: revert whatever made 13.1 slower?

I also started trying to implement the `lazy_static` approach, but ran into a question: where is the crate root for the generated code? Is it the main file of the user's crate? I'm asking because master...willmurphyscode:try-staticdiff-aaa35664da81cd904ea0c482dca7dd60R52 won't compile, failing with, for example:
|
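For comparison, here is a sketch of the process-wide caching idea mentioned above. It uses the standard library's `std::sync::OnceLock` as a stand-in for the `lazy_static!` macro discussed in the thread (so no external crate is needed), and the same hypothetical `Matcher` struct in place of the real compiled regex set:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the compiled regex set.
struct Matcher {
    patterns: Vec<String>,
}

// Process-wide cache: initialized at most once, on first access,
// and shared by all threads thereafter.
static MATCHER: OnceLock<Matcher> = OnceLock::new();

fn matcher() -> &'static Matcher {
    MATCHER.get_or_init(|| Matcher {
        patterns: vec![r"\d+".to_string(), r",".to_string()],
    })
}

fn main() {
    // Every call returns a reference to the same cached instance.
    println!("{}", matcher().patterns.len());
}
```

Unlike the thread-local variant, this compiles the patterns once per process rather than once per thread, at the cost of requiring the cached value to be `Sync`.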
There's actually another option, which is to change the interface to the parser. Like:

```rust
pub mod csv;

use std::io::BufReader;
use std::io::BufRead;
use std::fs::File;

fn main() {
    let fpath = ::std::env::args().nth(1).unwrap();
    let f = File::open(fpath).unwrap();
    let file = BufReader::new(&f);
    let mut sum = 0;
    let parser = csv::Parser::new();
    for line in file.lines() {
        let l = line.unwrap();
        let rec = parser.parse_Record(&l).unwrap();
        sum += rec.len();
    }
    println!("{}", sum);
}
```

After seeing the contentiousness of the choice to use TLS in futures, I think it's best to avoid that in favor of 'simpler' solutions that more directly address the issue. That way people don't show up in two years confused about why they have random bugs or can't understand why the parser doesn't work the way they expect. |
Ah, hmm, that is an obstacle. We'd have to generate our own code.

Yes, I've been wanting to avoid that, just because I continue to want writing LALRPOP code to be "drop dead" simple where possible. But to be honest, now that I see it written out, it doesn't seem "more complex" per se. And I do like giving users the control. (It might still be nice to offer a convenience form, but that can wait.) Of course, if we do this, we have to update the docs and tests and so forth. It might also be a good idea to offer the existing methods as wrappers, I suppose, perhaps with a deprecation warning:

```rust
#[deprecated(note = "prefer the new `FooParser::new().parse(...)` style of invocation")]
fn parse_Foo(...) {
    FooParser::new().parse(...)
}
```
|
@willmurphyscode would you be interested in changing to match what @ahmedcharles is suggesting? This is the rough area of the code that would have to be changed. The

lalrpop/lalrpop/src/lr1/codegen/base.rs Line 101 in d740f50

Similarly the

lalrpop/lalrpop/src/lr1/codegen/base.rs Lines 151 to 179 in d740f50

They are invoked from both the recursive-ascent and parse-table-driven code, roughly around here. Finally, the code that defines the

lalrpop/lalrpop/src/lexer/intern_token/mod.rs Lines 18 to 23 in d740f50

So I imagine we would want to do the following:
|
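The refactoring being discussed can be sketched as follows. All names here are hypothetical, and a trivial split-on-comma stands in for the real generated parser; the point is only the shape of the API, where the expensive lexer setup moves into a constructor so callers control how long the parser lives and how often it is rebuilt:

```rust
// Stand-in for the compiled token regexes the generated lexer would hold.
struct CompiledLexer {
    token_patterns: Vec<String>,
}

// Generated per-nonterminal parser struct; owns the compiled lexer.
pub struct RecordParser {
    lexer: CompiledLexer,
}

impl RecordParser {
    pub fn new() -> RecordParser {
        // The expensive step happens once, here, instead of on
        // every call to the old free-function `parse_Record`.
        RecordParser {
            lexer: CompiledLexer {
                token_patterns: vec![r"[^,]*".to_string()],
            },
        }
    }

    pub fn parse(&self, input: &str) -> Result<Vec<String>, String> {
        // Trivial stand-in for the generated parse routine.
        let _ = &self.lexer.token_patterns;
        Ok(input.split(',').map(|s| s.to_string()).collect())
    }
}

fn main() {
    let parser = RecordParser::new();
    let rec = parser.parse("a,b,c").unwrap();
    println!("{}", rec.len());
}
```

With this shape, the CSV benchmark's per-line loop calls `parser.parse(...)` against a single long-lived parser, which is exactly the usage pattern shown in the earlier `main` example.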
I imagine the turning point was actually release 0.13; that is when we changed from generating our own regex to using the regex library. I imagine this is what made it slower (but it also vastly improved compilation time). |
Ah, that makes sense. |
All of that said, I do feel like some convenience that uses I am imagining that even with a |
|
Well, my general concern is that integrating LALRPOP doesn't involve a lot of ceremony to achieve decent performance etc. That said, I definitely agree that caching values across invocations is an "orthogonal concern", and ideally we would not force the use of thread-local or other means to achieve it (though I'd be ok with offering it as a convenience measure). The main thing is to do the refactoring I described earlier. Whether or not we offer |
I agree. @willmurphyscode If you want to work on this, respond here. If not, I'll take a look when I get a chance. |
Hi. Any updates here? |
@ehiggs I believe you can create the |
What's the status of this? |
Regexes can be reused now so this ought to be fixed. Please re-open or file a new issue for this (or any other performance problems). |
The performance of this csv parser code went from reading 1M lines in ~1 second to taking minutes with 13.1.
The grammar is:
The test program counts the number of fields encountered:
Results: