Add regex separators for split row/list/column #8707

fnordpig · 2023-04-02T09:29:55Z

Description

Verified on discord with maintainer

Change adds regex separators in split rows/column/list. The primary motivating reason was to make it easier to split on separators with unbounded whitespace without requiring a lot of trim jiggery. But, secondary motivation is the same as the set of all motivations for adding split regex features to most languages.

User-Facing Changes

Adds -r option to split rows/column/list.

Tests + Formatting

Ran tests, however tests.nu fails with unrelated errors:

~/src/nushell> cargo run -- crates/nu-utils/standard_library/tests.nu                                                                                                                                                          04/02/2023 02:07:25 AM
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/nu crates/nu-utils/standard_library/tests.nu`
INF|2023-04-02T02:07:27.060|Running tests in test_asserts
INF|2023-04-02T02:07:27.141|Running tests in test_dirs
Error:
  × list is just pwd after initialization

INF|2023-04-02T02:07:27.167|Running tests in test_logger
INF|2023-04-02T02:07:27.286|Running tests in test_std
Error:
  × some tests did not pass (see complete errors above):
  │
  │       test_asserts test_assert
  │       test_asserts test_assert_equal
  │       test_asserts test_assert_error
  │       test_asserts test_assert_greater
  │       test_asserts test_assert_greater_or_equal
  │       test_asserts test_assert_length
  │       test_asserts test_assert_less
  │       test_asserts test_assert_less_or_equal
  │       test_asserts test_assert_not_equal
  │     ⨯ test_dirs test_dirs_command
  │       test_logger test_critical
  │       test_logger test_debug
  │       test_logger test_error
  │       test_logger test_info
  │       test_logger test_warning
  │       test_std test_path_add
  │

Upon investigating seeing this difference:

╭───┬─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ 0 │ /var/folders/1f/ltbr1m8s5s1811k6n1rhpc0r0000gn/T/test_dirs_c1ed89d6-19f7-47c7-9e1f-74c39f3623b5         │
│ 1 │ /private/var/folders/1f/ltbr1m8s5s1811k6n1rhpc0r0000gn/T/test_dirs_c1ed89d6-19f7-47c7-9e1f-74c39f3623b5 │
╰───┴─────────────────────────────────────────────────────────────────────────────────────────────────────────╯

This seems unrelated to my changes, but can investigate further if desired.

After Submitting

If your PR had any user-facing changes, update the documentation after the PR is merged, if necessary. This will help us keep the docs up to date.

fdncred · 2023-04-02T12:20:53Z

I've seen those test failures before. I got them myself on my Mac M1. Are you sure you were on the latest main?

fdncred · 2023-04-02T12:29:08Z

I'm wondering if this is a breaking change now that it appears to exclusively use regex for matching? I see the tests pass but I wonder if there's a scenario where some usage wouldn't work.

fnordpig · 2023-04-02T14:30:25Z

This is on a M2 Max 16 pro, so it's possible related to the arch.

Re: single character, the escape call removes any interaction with the regex parser. It should behave as a single character regex split. I can add some more tests to verify regex control characters don't trip it up if desired. It could be re-done similar to the Matcher I did to distinguish between the literal and the regex case, but it felt a little heavy weight. The Matcher made sense given the generalness of the Value comparison. A more substantial change would be either restructuring how the function is applied in the map - ultimately I didn't want to make the regex inside the function as it's relatively expensive to compile the regex, and the surgery to pass in a regex or a character was a bit more complicated than simply escaping character. But, yeah, same thoughts crossed my mind.

rgwood · 2023-04-02T15:19:41Z

This is a fantastic first PR, thanks!

We'll take a close look at the code soon. We're a little too close to the next release (0.78, 2 days away) to merge this now but I'm not seeing any obvious reasons why this couldn't be merged shortly after the release.

sholderbach

Thank you very much!

I think your PR looks good from a correctness perspective also thanks to your added tests.

I can see two areas where the performance could be possibly improved:

(needs benchmarking) I am not sure if the use of regex::escape for just literal matching is as efficient as the bare matching for the non regex case.
(not as relevant as the most likely usage of split .. is doing an operation once and leveraging the command internal broadcasting over a list) we have a global regex cache in the engine state we use to store precompiled regexes for tight loops.

But that shouldn't block this PR if we don't regress dramatically on the non regex example.

fdncred · 2023-04-06T21:57:34Z

+1 to landing

fdncred · 2023-04-07T11:46:07Z

(needs benchmarking) I am not sure if the use of regex::escape for just literal matching is as efficient as the bare matching for the non regex case.

"as efficient" - I'm guessing that if the regex::escape still has to compile a regex, then it's not as efficient. However, I'm no regex crate expert so I'm not sure what it's doing under the hood.

I think we should give this a spin and see if we notice a difference.

Thanks for the PR!

fnordpig · 2023-04-07T13:41:12Z

Reading the implementation regex::escape replaces any command character with its escaped version - I.e., [ is replaced with [, and \ with \ unless it's already \. Note this is compiled outside the loop. Almost certainly it'll be slower than a direct str split, but as a fully anchored character it should be relatively efficient. I didn't get a chance to trace through what it compiles to (I'm traveling through SE Asia at the moment with spotty internet) but there's a regex-debugger you can compile out of the regex crate. I'll wager the escaping itself compiles away into a single character match machine - after all the control characters lead to a complex machine, and the escaping is to inform the regex compiler to not be more complex than a single character.

I can do some benchmarking if desired.

Could you provide an example of where the global regex cache is utilized? I don't think it's unreasonable to use it, although realistically I imagine in a pipeline the regex compile step is greatly dwarfed by the pipeline evaluation step.

sholderbach · 2023-04-07T21:09:22Z

Yeah I wholeheartedly believe that the resulting compiled version for the literal will still be pretty efficient. Just a bit heavier than the effectively just memchr.

Could you provide an example of where the global regex cache is utilized?

We primarily use it through the Value::regex_match() API which is relevant for the ~= operator and friends. But in this case the Value coercion is not relevant.
https://docs.rs/nu-protocol/latest/src/nu_protocol/value/mod.rs.html#2992-3066

fnordpig · 2023-04-10T12:54:03Z

Hi, I wrote a little Criterion benchmark test that tested the single character replacements using an escaped character with various sizes of strings. The upshot is the regex version is roughly 45-50% slower - for example, for a 1mb string split into 100 pieces, it takes 40ns for the str.split() version vs the regex.split(str) version at 80ns.

https://gist.github.com/fnordpig/4a5268d937413e2435e99fd904815621

fdncred · 2023-04-10T13:41:15Z

@fnordpig Thanks for the investigation and work on this. If I understand correctly, we now know the impact of using regex to split. However, since are defaulting to the faster split and only use the regex split when called upon with -r, we don't have to make any changes to this. Does that sound right?

fnordpig · 2023-04-11T04:26:38Z

No, I split with regex in all cases for column split. It simplified the code path significantly.

While it does decrease performance overall I suspect in practical reality the degradation is imperceptible.

But if folks think it's worth it I'm happy to redress with a different more efficient if less parsimonious version.

fdncred · 2023-04-11T12:05:44Z

No, ...

oh, ya. Regex::new() vs regex::escape, now I remember, thanks. 50% slower is kind of a big deal. My preference is to default to non-regex and only invoke regex with the -r flag. I'm not sure how others feel.

Add regex separators for split row/list/column

5b4df81

rgwood added the wait-until-after-nushell-release label Apr 2, 2023

sholderbach reviewed Apr 6, 2023

View reviewed changes

fdncred merged commit 4fda6d7 into nushell:main Apr 7, 2023
15 checks passed

fnordpig deleted the split-regex branch April 10, 2023 12:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add regex separators for split row/list/column #8707

Add regex separators for split row/list/column #8707

fnordpig commented Apr 2, 2023

fdncred commented Apr 2, 2023 •

edited

fdncred commented Apr 2, 2023

fnordpig commented Apr 2, 2023

rgwood commented Apr 2, 2023

sholderbach left a comment

fdncred commented Apr 6, 2023

fdncred commented Apr 7, 2023

fnordpig commented Apr 7, 2023

sholderbach commented Apr 7, 2023

fnordpig commented Apr 10, 2023

fdncred commented Apr 10, 2023

fnordpig commented Apr 11, 2023

fdncred commented Apr 11, 2023

Add regex separators for split row/list/column #8707

Add regex separators for split row/list/column #8707

Conversation

fnordpig commented Apr 2, 2023

Description

User-Facing Changes

Tests + Formatting

After Submitting

fdncred commented Apr 2, 2023 • edited

fdncred commented Apr 2, 2023

fnordpig commented Apr 2, 2023

rgwood commented Apr 2, 2023

sholderbach left a comment

Choose a reason for hiding this comment

fdncred commented Apr 6, 2023

fdncred commented Apr 7, 2023

fnordpig commented Apr 7, 2023

sholderbach commented Apr 7, 2023

fnordpig commented Apr 10, 2023

fdncred commented Apr 10, 2023

fnordpig commented Apr 11, 2023

fdncred commented Apr 11, 2023

fdncred commented Apr 2, 2023 •

edited