line_col method N times slow in large file. #560

huacnlee · 2021-10-25T10:41:59Z

When use pest in a large content file (about 700 KB JSON file), the Pair<R> to call line_col looks like has performance issue O(N).

Source code ref:
https://github.com/huacnlee/autocorrect/blob/v1.4.4/src/lib/src/code.rs#L56

The large file case:

https://github.com/microsoft/vscode-loc/blob/release/1.62.0/i18n/vscode-language-pack-zh-hans/translations/main.i18n.json

// 200ns
let span = item.as_span();
// 48ns
let start_pos = span.start_pos();
let start = std::time::Instant::now();
let (part_line, part_col) = start_pos.line_col();
println!(
    "line_col: {}ns, {}ms",
    start.elapsed().as_nanos(),
    start.elapsed().as_millis()
);

The benchmark result:

line_col: 1545291ns, 1ms
line_col: 1533083ns, 1ms
line_col: 1581541ns, 1ms
line_col: 1537250ns, 1ms
line_col: 1543208ns, 1ms
....
line_col: 2226208ns, 2ms
line_col: 2244875ns, 2ms
line_col: 2229541ns, 2ms
line_col: 2299000ns, 2ms
line_col: 2235958ns, 2ms
....
line_col: 3397791ns, 3ms
line_col: 3398291ns, 3ms
line_col: 3402250ns, 3ms
line_col: 3433666ns, 3ms
line_col: 3409250ns, 3ms
....
line_col: 4089000ns, 4ms
line_col: 4078750ns, 4ms
line_col: 4084083ns, 4ms
line_col: 4128250ns, 4ms
line_col: 4095708ns, 4ms
....
line_col: 5021375ns, 5ms
line_col: 5118083ns, 5ms
line_col: 5028333ns, 5ms
line_col: 5065166ns, 5ms
line_col: 5030750ns, 5ms
...
line_col: 6076458ns, 6ms
line_col: 6124375ns, 6ms
line_col: 6077041ns, 6ms
line_col: 6078041ns, 6ms
line_col: 6081791ns, 6ms
line_col: 6082208ns, 6ms
line_col: 6119625ns, 6ms
...
line_col: 7365125ns, 7ms
line_col: 7368541ns, 7ms
line_col: 7374416ns, 7ms
line_col: 7373250ns, 7ms
line_col: 7366041ns, 7ms
line_col: 7372750ns, 7ms
...
line_col: 8363541ns, 8ms
line_col: 8379583ns, 8ms
line_col: 8375083ns, 8ms
line_col: 8373750ns, 8ms
line_col: 8378791ns, 8ms
...
line_col: 9096541ns, 9ms
line_col: 9101041ns, 9ms
line_col: 9097375ns, 9ms
line_col: 9101500ns, 9ms
line_col: 9092708ns, 9ms
line_col: 9107291ns, 9ms
....
line_col: 10150208ns, 10ms
line_col: 10144625ns, 10ms
line_col: 10311083ns, 10ms
line_col: 10617958ns, 10ms
line_col: 10224125ns, 10ms
line_col: 10167583ns, 10ms
line_col: 10229416ns, 10ms
line_col: 10168041ns, 10ms

The text was updated successfully, but these errors were encountered:

pest-parser/pest#560

huacnlee · 2021-10-25T14:44:27Z

I have rewritten a way to get line, col for improve a lot of speed.

This new way just save the cursor (line, col) on iter the pairs.

huacnlee/autocorrect@6c9da4b

By a large file (5000 lines JSON case, duration from 12595ms down to 400ms)

matthew-dean · 2021-12-12T22:30:21Z

@huacnlee Are you planning to turn that into a PR for Pest?

Kilerd · 2022-07-07T10:33:45Z

any update?

CAD97 · 2022-07-07T15:23:56Z

It is currently a design choice of Position to just contain a usize index, rather than additionally the (usize, usize) line/col pair. With the current implementation, this IIRC results in significant space savings, as Position is stored (and expected to be stored by the user) in a significant number of places (everywhere Token is).

This does necessarily make line_col() $O(N)$. The expectation here is that line_col is a relatively operation, though; that most if not all internal operations deal in the usize byte offset and line:col is only used for rendering positions for user-facing diagnostics.

The $O(N)$ could be improved for large strings by using something like memrchr to find the preceding newline, then bytecount to calculate the line:col. However, this would likely increase the overhead for small strings to improve the growth factor for large strings.

(I suspect a marginally better approach is to rfind to the newline and count columns in one normal loop (e.g. rather than separate memrchr and num_chars calls), since the column length is usually small, and then a bytecount for the line number.)

I would, however, be happy to r+ a patch which incrementally tracked line:col in cursors (e.g. Pair) so long as the underlying token buffer only stores byte offsets. (But see also #606 w.r.t. review+merge+publish.)

note that this may have extra overhead for small inputs and requires two extra dependencies, hence the faster `line_col` was put under the optional `fast-line-col` feature flag. closes pest-parser#560

note that this may have extra overhead for small inputs and requires two extra dependencies, hence the faster `line_col` was put under the optional `fast-line-col` feature flag. closes #560 Co-authored-by: Tomas Tauber <me@tomtau.be>

…terator. ref: pest-parser#707, pest-parser#560

…terator. (#754) * Improve line, col calculate performance by use move cursor on Pairs Iterator. ref: #707, #560 * Add benchmark for pair.line_col vs position.line_cole * Fix flat_pairs and pairs.next_back to use position.line_col * Merge line_col method to use `position::line_col`. * Fix `pair.line_col` for supports skiped characters, and add test for rev iter.

huacnlee added a commit to huacnlee/autocorrect that referenced this issue Oct 25, 2021

Fix slow performance issues for get line and col.

c44dc8c

pest-parser/pest#560

huacnlee added a commit to huacnlee/autocorrect that referenced this issue Oct 25, 2021

Fix slow performance issues for get line and col.

6c9da4b

pest-parser/pest#560

Kilerd mentioned this issue Jul 8, 2022

Tracing issue of low efficiency of calculating line_col info with pest zhang-accounting/zhang#67

Closed

tomtau assigned huacnlee Jul 14, 2022

tomtau added bug runtime-performance labels Jul 14, 2022

This was referenced Jul 14, 2022

Calculate line, col from another Position #414

Closed

Refactor/optimize validation #559

Closed

tomtau mentioned this issue Sep 10, 2022

fix: speed up Position::line_col for large inputs using SIMD #707

Merged

tomtau closed this as completed in #707 Sep 12, 2022

huacnlee added a commit to huacnlee/pest that referenced this issue Dec 27, 2022

Improve line, col calculate performance by use move cursor on Pairs I…

6434877

…terator. ref: pest-parser#707, pest-parser#560

huacnlee mentioned this issue Dec 27, 2022

Improve line, col calculate performance by use move cursor on Pairs Iterator. #754

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

line_col method N times slow in large file. #560

line_col method N times slow in large file. #560

huacnlee commented Oct 25, 2021 •

edited

Loading

huacnlee commented Oct 25, 2021 •

edited

Loading

matthew-dean commented Dec 12, 2021

Kilerd commented Jul 7, 2022

CAD97 commented Jul 7, 2022

line_col method N times slow in large file. #560

line_col method N times slow in large file. #560

Comments

huacnlee commented Oct 25, 2021 • edited Loading

huacnlee commented Oct 25, 2021 • edited Loading

matthew-dean commented Dec 12, 2021

Kilerd commented Jul 7, 2022

CAD97 commented Jul 7, 2022

huacnlee commented Oct 25, 2021 •

edited

Loading

huacnlee commented Oct 25, 2021 •

edited

Loading