Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

line_col method N times slow in large file. #560

Closed
huacnlee opened this issue Oct 25, 2021 · 4 comments · Fixed by #707
Closed

line_col method N times slow in large file. #560

huacnlee opened this issue Oct 25, 2021 · 4 comments · Fixed by #707

Comments

@huacnlee
Copy link
Member

huacnlee commented Oct 25, 2021

When use pest in a large content file (about 700 KB JSON file), the Pair<R> to call line_col looks like has performance issue O(N).

Source code ref:
https://github.com/huacnlee/autocorrect/blob/v1.4.4/src/lib/src/code.rs#L56

The large file case:

https://github.com/microsoft/vscode-loc/blob/release/1.62.0/i18n/vscode-language-pack-zh-hans/translations/main.i18n.json

// 200ns
let span = item.as_span();
// 48ns
let start_pos = span.start_pos();
let start = std::time::Instant::now();
let (part_line, part_col) = start_pos.line_col();
println!(
    "line_col: {}ns, {}ms",
    start.elapsed().as_nanos(),
    start.elapsed().as_millis()
);

The benchmark result:

line_col: 1545291ns, 1ms
line_col: 1533083ns, 1ms
line_col: 1581541ns, 1ms
line_col: 1537250ns, 1ms
line_col: 1543208ns, 1ms
....
line_col: 2226208ns, 2ms
line_col: 2244875ns, 2ms
line_col: 2229541ns, 2ms
line_col: 2299000ns, 2ms
line_col: 2235958ns, 2ms
....
line_col: 3397791ns, 3ms
line_col: 3398291ns, 3ms
line_col: 3402250ns, 3ms
line_col: 3433666ns, 3ms
line_col: 3409250ns, 3ms
....
line_col: 4089000ns, 4ms
line_col: 4078750ns, 4ms
line_col: 4084083ns, 4ms
line_col: 4128250ns, 4ms
line_col: 4095708ns, 4ms
....
line_col: 5021375ns, 5ms
line_col: 5118083ns, 5ms
line_col: 5028333ns, 5ms
line_col: 5065166ns, 5ms
line_col: 5030750ns, 5ms
...
line_col: 6076458ns, 6ms
line_col: 6124375ns, 6ms
line_col: 6077041ns, 6ms
line_col: 6078041ns, 6ms
line_col: 6081791ns, 6ms
line_col: 6082208ns, 6ms
line_col: 6119625ns, 6ms
...
line_col: 7365125ns, 7ms
line_col: 7368541ns, 7ms
line_col: 7374416ns, 7ms
line_col: 7373250ns, 7ms
line_col: 7366041ns, 7ms
line_col: 7372750ns, 7ms
...
line_col: 8363541ns, 8ms
line_col: 8379583ns, 8ms
line_col: 8375083ns, 8ms
line_col: 8373750ns, 8ms
line_col: 8378791ns, 8ms
...
line_col: 9096541ns, 9ms
line_col: 9101041ns, 9ms
line_col: 9097375ns, 9ms
line_col: 9101500ns, 9ms
line_col: 9092708ns, 9ms
line_col: 9107291ns, 9ms
....
line_col: 10150208ns, 10ms
line_col: 10144625ns, 10ms
line_col: 10311083ns, 10ms
line_col: 10617958ns, 10ms
line_col: 10224125ns, 10ms
line_col: 10167583ns, 10ms
line_col: 10229416ns, 10ms
line_col: 10168041ns, 10ms
huacnlee added a commit to huacnlee/autocorrect that referenced this issue Oct 25, 2021
huacnlee added a commit to huacnlee/autocorrect that referenced this issue Oct 25, 2021
@huacnlee
Copy link
Member Author

huacnlee commented Oct 25, 2021

I have rewritten a way to get line, col for improve a lot of speed.

This new way just save the cursor (line, col) on iter the pairs.

huacnlee/autocorrect@6c9da4b

By a large file (5000 lines JSON case, duration from 12595ms down to 400ms)

@matthew-dean
Copy link

@huacnlee Are you planning to turn that into a PR for Pest?

@Kilerd
Copy link

Kilerd commented Jul 7, 2022

any update?

@CAD97
Copy link
Contributor

CAD97 commented Jul 7, 2022

It is currently a design choice of Position to just contain a usize index, rather than additionally the (usize, usize) line/col pair. With the current implementation, this IIRC results in significant space savings, as Position is stored (and expected to be stored by the user) in a significant number of places (everywhere Token is).

This does necessarily make line_col() $O(N)$. The expectation here is that line_col is a relatively operation, though; that most if not all internal operations deal in the usize byte offset and line:col is only used for rendering positions for user-facing diagnostics.

The $O(N)$ could be improved for large strings by using something like memrchr to find the preceding newline, then bytecount to calculate the line:col. However, this would likely increase the overhead for small strings to improve the growth factor for large strings.

(I suspect a marginally better approach is to rfind to the newline and count columns in one normal loop (e.g. rather than separate memrchr and num_chars calls), since the column length is usually small, and then a bytecount for the line number.)

I would, however, be happy to r+ a patch which incrementally tracked line:col in cursors (e.g. Pair) so long as the underlying token buffer only stores byte offsets. (But see also #606 w.r.t. review+merge+publish.)

tomtau pushed a commit to tomtau/pest that referenced this issue Sep 10, 2022
note that this may have extra overhead for small inputs and
requires two extra dependencies, hence the faster `line_col`
was put under the optional `fast-line-col` feature flag.

closes pest-parser#560
tomtau added a commit that referenced this issue Sep 12, 2022
note that this may have extra overhead for small inputs and
requires two extra dependencies, hence the faster `line_col`
was put under the optional `fast-line-col` feature flag.

closes #560

Co-authored-by: Tomas Tauber <me@tomtau.be>
tomtau pushed a commit that referenced this issue Dec 29, 2022
…terator. (#754)

* Improve line, col calculate performance by use move cursor on Pairs Iterator.

ref: #707, #560

* Add benchmark for pair.line_col vs position.line_cole

* Fix flat_pairs and pairs.next_back to use position.line_col

* Merge line_col method to use `position::line_col`.

* Fix `pair.line_col` for supports skiped characters, and add test for rev iter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants