Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex separators for split row/list/column #8707

Merged
merged 1 commit into from Apr 7, 2023
Merged

Add regex separators for split row/list/column #8707

merged 1 commit into from Apr 7, 2023

Conversation

fnordpig
Copy link
Contributor

@fnordpig fnordpig commented Apr 2, 2023

Description

Verified on discord with maintainer

Change adds regex separators in split rows/column/list. The primary motivating reason was to make it easier to split on separators with unbounded whitespace without requiring a lot of trim jiggery. But, secondary motivation is the same as the set of all motivations for adding split regex features to most languages.

User-Facing Changes

Adds -r option to split rows/column/list.

Tests + Formatting

Ran tests, however tests.nu fails with unrelated errors:

~/src/nushell> cargo run -- crates/nu-utils/standard_library/tests.nu                                                                                                                                                          04/02/2023 02:07:25 AM
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/nu crates/nu-utils/standard_library/tests.nu`
INF|2023-04-02T02:07:27.060|Running tests in test_asserts
INF|2023-04-02T02:07:27.141|Running tests in test_dirs
Error:
  × list is just pwd after initialization

INF|2023-04-02T02:07:27.167|Running tests in test_logger
INF|2023-04-02T02:07:27.286|Running tests in test_std
Error:
  × some tests did not pass (see complete errors above):
  │
  │       test_asserts test_assert
  │       test_asserts test_assert_equal
  │       test_asserts test_assert_error
  │       test_asserts test_assert_greater
  │       test_asserts test_assert_greater_or_equal
  │       test_asserts test_assert_length
  │       test_asserts test_assert_less
  │       test_asserts test_assert_less_or_equal
  │       test_asserts test_assert_not_equal
  │     ⨯ test_dirs test_dirs_command
  │       test_logger test_critical
  │       test_logger test_debug
  │       test_logger test_error
  │       test_logger test_info
  │       test_logger test_warning
  │       test_std test_path_add
  │

Upon investigating seeing this difference:

╭───┬─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 0 │ /var/folders/1f/ltbr1m8s5s1811k6n1rhpc0r0000gn/T/test_dirs_c1ed89d6-19f7-47c7-9e1f-74c39f3623b5         │
│ 1 │ /private/var/folders/1f/ltbr1m8s5s1811k6n1rhpc0r0000gn/T/test_dirs_c1ed89d6-19f7-47c7-9e1f-74c39f3623b5 │
╰───┴─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

This seems unrelated to my changes, but can investigate further if desired.

After Submitting

If your PR had any user-facing changes, update the documentation after the PR is merged, if necessary. This will help us keep the docs up to date.

@fdncred
Copy link
Collaborator

fdncred commented Apr 2, 2023

I've seen those test failures before. I got them myself on my Mac M1. Are you sure you were on the latest main?

@fdncred
Copy link
Collaborator

fdncred commented Apr 2, 2023

I'm wondering if this is a breaking change now that it appears to exclusively use regex for matching? I see the tests pass but I wonder if there's a scenario where some usage wouldn't work.

@fnordpig
Copy link
Contributor Author

fnordpig commented Apr 2, 2023

This is on a M2 Max 16 pro, so it's possible related to the arch.

Re: single character, the escape call removes any interaction with the regex parser. It should behave as a single character regex split. I can add some more tests to verify regex control characters don't trip it up if desired. It could be re-done similar to the Matcher I did to distinguish between the literal and the regex case, but it felt a little heavy weight. The Matcher made sense given the generalness of the Value comparison. A more substantial change would be either restructuring how the function is applied in the map - ultimately I didn't want to make the regex inside the function as it's relatively expensive to compile the regex, and the surgery to pass in a regex or a character was a bit more complicated than simply escaping character. But, yeah, same thoughts crossed my mind.

@rgwood
Copy link
Contributor

rgwood commented Apr 2, 2023

This is a fantastic first PR, thanks!

We'll take a close look at the code soon. We're a little too close to the next release (0.78, 2 days away) to merge this now but I'm not seeing any obvious reasons why this couldn't be merged shortly after the release.

Copy link
Member

@sholderbach sholderbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!

I think your PR looks good from a correctness perspective also thanks to your added tests.

I can see two areas where the performance could be possibly improved:

  1. (needs benchmarking) I am not sure if the use of regex::escape for just literal matching is as efficient as the bare matching for the non regex case.
  2. (not as relevant as the most likely usage of split .. is doing an operation once and leveraging the command internal broadcasting over a list) we have a global regex cache in the engine state we use to store precompiled regexes for tight loops.

But that shouldn't block this PR if we don't regress dramatically on the non regex example.

@fdncred
Copy link
Collaborator

fdncred commented Apr 6, 2023

+1 to landing

@fdncred
Copy link
Collaborator

fdncred commented Apr 7, 2023

(needs benchmarking) I am not sure if the use of regex::escape for just literal matching is as efficient as the bare matching for the non regex case.

"as efficient" - I'm guessing that if the regex::escape still has to compile a regex, then it's not as efficient. However, I'm no regex crate expert so I'm not sure what it's doing under the hood.

I think we should give this a spin and see if we notice a difference.

Thanks for the PR!

@fdncred fdncred merged commit 4fda6d7 into nushell:main Apr 7, 2023
15 checks passed
@fnordpig
Copy link
Contributor Author

fnordpig commented Apr 7, 2023

Reading the implementation regex::escape replaces any command character with its escaped version - I.e., [ is replaced with [, and \ with \ unless it's already \. Note this is compiled outside the loop. Almost certainly it'll be slower than a direct str split, but as a fully anchored character it should be relatively efficient. I didn't get a chance to trace through what it compiles to (I'm traveling through SE Asia at the moment with spotty internet) but there's a regex-debugger you can compile out of the regex crate. I'll wager the escaping itself compiles away into a single character match machine - after all the control characters lead to a complex machine, and the escaping is to inform the regex compiler to not be more complex than a single character.

I can do some benchmarking if desired.

Could you provide an example of where the global regex cache is utilized? I don't think it's unreasonable to use it, although realistically I imagine in a pipeline the regex compile step is greatly dwarfed by the pipeline evaluation step.

@sholderbach
Copy link
Member

Yeah I wholeheartedly believe that the resulting compiled version for the literal will still be pretty efficient. Just a bit heavier than the effectively just memchr.

Could you provide an example of where the global regex cache is utilized?

We primarily use it through the Value::regex_match() API which is relevant for the ~= operator and friends. But in this case the Value coercion is not relevant.
https://docs.rs/nu-protocol/latest/src/nu_protocol/value/mod.rs.html#2992-3066

@fnordpig
Copy link
Contributor Author

Hi, I wrote a little Criterion benchmark test that tested the single character replacements using an escaped character with various sizes of strings. The upshot is the regex version is roughly 45-50% slower - for example, for a 1mb string split into 100 pieces, it takes 40ns for the str.split() version vs the regex.split(str) version at 80ns.

https://gist.github.com/fnordpig/4a5268d937413e2435e99fd904815621

@fnordpig fnordpig deleted the split-regex branch April 10, 2023 12:54
@fdncred
Copy link
Collaborator

fdncred commented Apr 10, 2023

@fnordpig Thanks for the investigation and work on this. If I understand correctly, we now know the impact of using regex to split. However, since are defaulting to the faster split and only use the regex split when called upon with -r, we don't have to make any changes to this. Does that sound right?

@fnordpig
Copy link
Contributor Author

No, I split with regex in all cases for column split. It simplified the code path significantly.

While it does decrease performance overall I suspect in practical reality the degradation is imperceptible.

But if folks think it's worth it I'm happy to redress with a different more efficient if less parsimonious version.

@fdncred
Copy link
Collaborator

fdncred commented Apr 11, 2023

No, ...

oh, ya. Regex::new() vs regex::escape, now I remember, thanks. 50% slower is kind of a big deal. My preference is to default to non-regex and only invoke regex with the -r flag. I'm not sure how others feel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants