[\s\S] doesn't seem to work #4

Closed
Twigpig opened this issue Jun 16, 2017 · 9 comments

@Twigpig

Twigpig commented Jun 16, 2017

Using the regex "Test:\s*([\s\S]*?)\s*;" (without quotes, obviously) with an input of "Test: hello ;" correctly returns "hello" in other regex tools (e.g. http://www.regexr.com/) but returns no results with TRegExpr.

image

Using "Test:\s*(.*?)\s*;" works for this case in TRegExpr, but obviously wouldn't do the same job with a multi-line input string.

image

vs

image

Unless I'm mistaken, the below should return "hell\nlo":

image
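For readers without the screenshots, the behaviour expected from other engines can be reproduced with a minimal sketch. It uses Python's `re` module purely as a stand-in for a Perl-style engine (an assumption; the thread itself used regexr.com), and the input string `Test: hel\nlo ;` is taken from the owner's example later in the thread. It says nothing about what TRegExpr returns.

```python
import re

# Multi-line input: the captured text contains a line break.
text = "Test: hel\nlo ;"

# [\s\S] means "whitespace or non-whitespace", i.e. any character, "\n" included.
any_char = re.compile(r"Test:\s*([\s\S]*?)\s*;")
# "." does not match "\n" unless dot-all mode is switched on.
dot_only = re.compile(r"Test:\s*(.*?)\s*;")

m_any = any_char.search(text)
m_dot = dot_only.search(text)

print(repr(m_any.group(1)))  # 'hel\nlo'
print(m_dot)                 # None
```

In this engine the `[\s\S]` form captures across the line break, while the plain `.` form finds no match at all.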

andgineer added a commit that referenced this issue Oct 22, 2018
@andgineer
Owner

It looks like the same issue as https://bugs.freepascal.org/view.php?id=34130

In this implementation of regexpr you cannot use 'not-space' inside a character class.
The compiler expects just simple chars or intervals (like [a-z]).

You can invert a character class, as in [^a-z], but I do not understand what you want to do.

For example, the expression '(?m)Test:\s*(.*?)\s;' will be found in the input text 'Test: hel'#$d#$a'lo ;' and returns 'Test: hel'#$d#$a'lo ;'
3bf2b36

@Twigpig
Author

Twigpig commented Oct 22, 2018

The problem with using (.*?) instead of ([\s\S]*?) is that it doesn't return results that include line breaks, even with the multiline flag enabled (as demonstrated in my second to last screenshot above).
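A possible source of the confusion: in Perl-style flavours the multiline flag only changes where `^` and `$` may match, and it is the separate single-line (dot-all) flag that lets `.` cross line breaks. The sketch below shows this with Python's `re` module as an assumed stand-in. Whether the TRegExpr version discussed here honours an equivalent modifier is not confirmed anywhere in this thread.

```python
import re

text = "Test: hel\nlo ;"
pattern = r"Test:\s*(.*?)\s*;"

# Multiline mode affects only the anchors ^ and $; "." still stops at "\n".
with_multiline = re.search(pattern, text, re.MULTILINE)

# Dot-all (single-line) mode is what allows "." to match "\n".
with_dotall = re.search(pattern, text, re.DOTALL)

# The same switch written inline, as many engines also allow.
with_inline_s = re.search(r"(?s)" + pattern, text)

print(with_multiline)                 # None
print(repr(with_dotall.group(1)))     # 'hel\nlo'
print(repr(with_inline_s.group(1)))   # 'hel\nlo'
```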

@andgineer
Owner

andgineer commented Oct 22, 2018 via email

@Twigpig
Author

Twigpig commented Oct 23, 2018

> [\s\S] just won't work because TRegExpr does not work with metachars inside a char class (square brackets '[]'). TRegExpr expects just chars or intervals (like [a-z]) inside a char class.

I was classing this as a bug because I wrongly expected this kind of syntax to be consistent across all variations of RegEx, and other implementations of RegEx allow it. However, it sounds like you're suggesting it's intentionally not supported in TRegExpr. If you don't mind me asking, is there a reason? Was it by design?

My aim was to produce a variation of the above RegEx that could be used in both TRegExpr and http://www.regexr.com/ but it seems that the required syntax for each is incompatible with the other (and I suppose that's okay).
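One candidate for a pattern that avoids metachars inside square brackets is to move the "any character" idea into an alternation, `(.|\s)`, since a line break is itself whitespace. The sketch below only checks this form in Python's `re` module. That it also compiles and behaves the same way in TRegExpr is an untested assumption, and the extra inner capture group is a side effect of avoiding non-capturing syntax that an older engine may not support.

```python
import re

text = "Test: hel\nlo ;"

# Group 1 is the wanted text; group 2 is the inner (.|\s) helper group.
portable = re.compile(r"Test:\s*((.|\s)*?)\s*;")

m = portable.search(text)
print(repr(m.group(1)))  # 'hel\nlo'
```

Alternation of this kind tends to backtrack more than a character class, so it is better suited to short inputs.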

Thank you for taking the time to look into this and getting back to me. Much appreciated.

@andgineer
Owner

andgineer commented Oct 23, 2018

This library I wrote 20 years ago, and at the time I implemented just the regex subset that I needed.
You know, at that time there were no other libraries for regular expressions in Delphi.
So I implemented it myself, but just for my own tasks.

Now I do not use Pascal in my everyday life, so if we are going to continue development we need somebody with current Pascal skills who wishes to join TRegExpr development.

In fact that's the main reason why I published it on github.

Meanwhile, I am going to fix bugs... if you can find one after 20 years of the library's exposure ;)

@Twigpig
Author

Twigpig commented Oct 23, 2018

Ah, I see. So it's just something it doesn't currently handle, rather than something it shouldn't allow on principle? If that's the case, perhaps I'll implement the feature myself if I can find the time.

Thanks again.

@andgineer
Owner

andgineer commented Oct 23, 2018

As far as I can see, there is no such thing in the POSIX basic standard:
https://en.wikipedia.org/wiki/Regular_expression#POSIX_basic_and_extended

As for extensions, this is [:blank:] in POSIX, \s in vim and no such thing in perl.

Or it is [:digit:] for POSIX and \d for vim and perl.

Maybe the perl way is better because POSIX is too verbose.
But with perl you have to understand that, for example, \r means one character and \w means a character class.
I do not see any problems with that.

@totyaxy

totyaxy commented Jul 22, 2019

> This library I wrote 20 years ago

It's very interesting, because the latest trunk version works correctly with UTF8 chars in Lazarus 2.0.3 / fpc 3.0.5, and there is no longer any need for the complicated UTF8<->unicodestring conversion.

Thank you for this library!

@Alexey-T
Collaborator

Alexey-T commented Nov 15, 2019

\S, \D and \W are not allowed in []; escapes inside a class are handled here:

                case regparse^ of // r.e.extensions
                  'd': EmitRangeStr ('0123456789');
                  'w': EmitRangeStr (WordChars);
                  's': EmitRangeStr (SpaceChars);
                  else EmitSimpleRangeC (UnQuoteChar (regparse));
                end; { of case}

We cannot add handling of \D, \W and \S here: too many chars would be needed in the parameter (65k minus a few).
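The "65k minus a few" figure can be made concrete with a small sketch. It assumes, as the fragment above suggests, that a class is stored as an explicit string of member characters, and it assumes 16-bit characters (65,536 possible values). The helper `expand_negated` is hypothetical and written in Python only for illustration; it is not part of TRegExpr.

```python
# Assumption: 16-bit characters, so 0x10000 possible values.
CODE_UNITS = 0x10000

DIGITS = "0123456789"  # what \d expands to in the fragment above

def expand_negated(chars: str) -> str:
    """Hypothetical helper: list every character NOT in `chars` explicitly."""
    excluded = set(chars)
    return "".join(chr(cp) for cp in range(CODE_UNITS) if chr(cp) not in excluded)

not_digits = expand_negated(DIGITS)  # what \D would have to expand to

print(len(DIGITS))      # 10
print(len(not_digits))  # 65526
```

Under these assumptions \d costs ten characters in the emitted range string, while \D would cost 65,526, which is why an explicit-list representation does not extend well to the negated escapes.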

andgineer pushed a commit that referenced this issue Jun 3, 2020