Fix PCRE with UTF-8 data on Windows #145

Open
cdornan opened this issue Jun 4, 2017 · 2 comments

@cdornan
Contributor

@cdornan cdornan commented Jun 4, 2017

  • regex-pcre has never worked with UTF-8 data due to #141 (and it was never guaranteed to).

  • Currently it is not working on Windows (at least on AppVeyor) and the Windows UTF-8/PCRE tests have been suspended.

  • The current method of fixing up the offsets in regex is hacky and inefficient.
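
For context on the offsets point: PCRE reports match positions as byte offsets into the UTF-8 buffer, while Text and String are indexed by character, so every offset has to be translated. Below is a minimal sketch of that translation using only base. The names `encodeChar`, `isContinuation` and `byteToCharOffset` are hypothetical and chosen for illustration; this is not the code regex actually uses.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | UTF-8 encode a single character (hypothetical helper for this sketch).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- | Continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- | Translate a byte offset into a character offset by counting the
-- bytes before it that start a character. This is O(offset) per call.
byteToCharOffset :: [Word8] -> Int -> Int
byteToCharOffset bytes off =
  length (filter (not . isContinuation) (take off bytes))
```

For the string `"a fir\64262 hello to everyone"`, `hello` starts at byte 9 but at character 7. Using 9 directly as a character index selects `"llo t"` instead of `"hello"`.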

@cdornan cdornan changed the title Fix PCRE with UTF-8 data Fix PCRE with UTF-8 data on Windows Jun 8, 2017
@cdornan
Contributor Author

@cdornan cdornan commented Jun 8, 2017

BTW, this issue was reported at regex-pcre-builtin.

@cdornan cdornan modified the milestone: v1.0.2.0 Jun 10, 2017
@cdornan cdornan added the in progress label Jun 10, 2017
@cdornan cdornan self-assigned this Jun 10, 2017
@cdornan cdornan closed this Dec 14, 2018
@cdornan cdornan reopened this Jan 16, 2019
@cdornan cdornan removed the stale label Jan 16, 2019
@goertzenator

@goertzenator goertzenator commented Aug 20, 2020

I ran into issues using PCRE.Text in the presence of Unicode ligatures. The platform is Windows.

*Main Lib Text.RE.PCRE.Text> "a first hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a first \"hello\" to everyone"  -- OK
*Main Lib Text.RE.PCRE.Text> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a fir\64262 \"llo t\" to everyone"  -- Uh oh

ByteString sort of works, but it looks like it chews up my ligature:

*Main Lib Text.RE.PCRE.ByteString> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"a fir\ACK \"hello\" to everyone"

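The `\ACK` in that output is consistent with the literal being packed one byte per character rather than UTF-8 encoded. The `IsString` instance for `ByteString`, like `Data.ByteString.Char8.pack`, keeps only the low 8 bits of each `Char`, and U+FB06 (decimal 64262) truncates to 0x06, which is ACK. A minimal sketch follows, assuming the session used OverloadedStrings for the ByteString literal; `truncated` and `encoded` are illustrative names only.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8.pack truncates every Char to 8 bits, so U+FB06 becomes byte 0x06.
truncated :: B.ByteString
truncated = BC.pack "fir\64262"

-- Encoding through Text produces the three UTF-8 bytes EF AC 86 instead.
encoded :: B.ByteString
encoded = TE.encodeUtf8 (T.pack "fir\64262")
```

Here `B.unpack truncated` is `[0x66,0x69,0x72,0x06]`, while `B.unpack encoded` is `[0x66,0x69,0x72,0xEF,0xAC,0x86]`. If that is what happened, the ligature was lost before PCRE ever saw the data.
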
And String just crashes:

*Main Lib Text.RE.PCRE.String> "a firﬆ hello to everyone"  *=~/ [ed|$(hello)///"$1"|]
"*** Exception: utf8_correct_bs: UTF-8 decoding error
CallStack (from HasCallStack):
  error, called at .\Text\RE\ZeInternals\Types\Match.lhs:248:13 in regex-1.1.0.0-H1FPxX1khLGKIhuhwowTFL:Text.RE.ZeInternals.Types.Match

This does work correctly in the TDFA module; however, my use case requires non-greedy matching, which appears to be supported only by PCRE. My current workaround is to use TDFA where I can, and manual non-regex search and replace where I require non-greedy behavior.
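
For anyone needing the same workaround, the manual step can be fairly small when the non-greedy pattern amounts to "shortest text between two delimiters". Below is a rough sketch using `Data.Text.breakOn`. `firstBetween` is a hypothetical helper, not part of regex, and it assumes both delimiters are non-empty (`breakOn` raises an error on an empty needle).

```haskell
import qualified Data.Text as T

-- | Split around the first @open ... close@ pair, taking the shortest
-- possible inner text (roughly what a non-greedy @open(.*?)close@ captures).
-- Returns (text before, inner text, text after), or Nothing if no pair exists.
firstBetween :: T.Text -> T.Text -> T.Text -> Maybe (T.Text, T.Text, T.Text)
firstBetween open close t = do
  let (before, rest) = T.breakOn open t
  afterOpen <- T.stripPrefix open rest
  let (inner, rest') = T.breakOn close afterOpen
  afterClose <- T.stripPrefix close rest'
  pure (before, inner, afterClose)
```

For example, `firstBetween (T.pack "<") (T.pack ">") (T.pack "a <b> c <d> e")` yields `Just ("a ","b"," c <d> e")`. Because `Text` is indexed by character, no byte offsets are involved.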
