New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add regex separators for split row/list/column #8707
Conversation
I've seen those test failures before. I got them myself on my Mac M1. Are you sure you were on the latest main? |
I'm wondering if this is a breaking change now that it appears to exclusively use regex for matching? I see the tests pass but I wonder if there's a scenario where some usage wouldn't work. |
This is on a M2 Max 16 pro, so it's possible related to the arch. Re: single character, the escape call removes any interaction with the regex parser. It should behave as a single character regex split. I can add some more tests to verify regex control characters don't trip it up if desired. It could be re-done similar to the Matcher I did to distinguish between the literal and the regex case, but it felt a little heavy weight. The Matcher made sense given the generalness of the Value comparison. A more substantial change would be either restructuring how the function is applied in the map - ultimately I didn't want to make the regex inside the function as it's relatively expensive to compile the regex, and the surgery to pass in a regex or a character was a bit more complicated than simply escaping character. But, yeah, same thoughts crossed my mind. |
This is a fantastic first PR, thanks! We'll take a close look at the code soon. We're a little too close to the next release (0.78, 2 days away) to merge this now but I'm not seeing any obvious reasons why this couldn't be merged shortly after the release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much!
I think your PR looks good from a correctness perspective also thanks to your added tests.
I can see two areas where the performance could be possibly improved:
- (needs benchmarking) I am not sure if the use of
regex::escape
for just literal matching is as efficient as the bare matching for the non regex case. - (not as relevant as the most likely usage of
split ..
is doing an operation once and leveraging the command internal broadcasting over a list) we have a global regex cache in the engine state we use to store precompiled regexes for tight loops.
But that shouldn't block this PR if we don't regress dramatically on the non regex example.
+1 to landing |
"as efficient" - I'm guessing that if the regex::escape still has to compile a regex, then it's not as efficient. However, I'm no regex crate expert so I'm not sure what it's doing under the hood. I think we should give this a spin and see if we notice a difference. Thanks for the PR! |
Reading the implementation regex::escape replaces any command character with its escaped version - I.e., [ is replaced with [, and \ with \ unless it's already \. Note this is compiled outside the loop. Almost certainly it'll be slower than a direct str split, but as a fully anchored character it should be relatively efficient. I didn't get a chance to trace through what it compiles to (I'm traveling through SE Asia at the moment with spotty internet) but there's a regex-debugger you can compile out of the regex crate. I'll wager the escaping itself compiles away into a single character match machine - after all the control characters lead to a complex machine, and the escaping is to inform the regex compiler to not be more complex than a single character. I can do some benchmarking if desired. Could you provide an example of where the global regex cache is utilized? I don't think it's unreasonable to use it, although realistically I imagine in a pipeline the regex compile step is greatly dwarfed by the pipeline evaluation step. |
Yeah I wholeheartedly believe that the resulting compiled version for the literal will still be pretty efficient. Just a bit heavier than the effectively just
We primarily use it through the |
Hi, I wrote a little Criterion benchmark test that tested the single character replacements using an escaped character with various sizes of strings. The upshot is the regex version is roughly 45-50% slower - for example, for a 1mb string split into 100 pieces, it takes 40ns for the str.split() version vs the regex.split(str) version at 80ns. https://gist.github.com/fnordpig/4a5268d937413e2435e99fd904815621 |
@fnordpig Thanks for the investigation and work on this. If I understand correctly, we now know the impact of using regex to split. However, since are defaulting to the faster split and only use the regex split when called upon with -r, we don't have to make any changes to this. Does that sound right? |
No, I split with regex in all cases for column split. It simplified the code path significantly. While it does decrease performance overall I suspect in practical reality the degradation is imperceptible. But if folks think it's worth it I'm happy to redress with a different more efficient if less parsimonious version. |
oh, ya. Regex::new() vs regex::escape, now I remember, thanks. 50% slower is kind of a big deal. My preference is to default to non-regex and only invoke regex with the -r flag. I'm not sure how others feel. |
Description
Verified on discord with maintainer
Change adds regex separators in split rows/column/list. The primary motivating reason was to make it easier to split on separators with unbounded whitespace without requiring a lot of trim jiggery. But, secondary motivation is the same as the set of all motivations for adding split regex features to most languages.
User-Facing Changes
Adds -r option to split rows/column/list.
Tests + Formatting
Ran tests, however tests.nu fails with unrelated errors:
Upon investigating seeing this difference:
This seems unrelated to my changes, but can investigate further if desired.
After Submitting
If your PR had any user-facing changes, update the documentation after the PR is merged, if necessary. This will help us keep the docs up to date.