Regex inconsistencies #28

Open
PgBiel opened this issue Jan 10, 2024 · 2 comments

PgBiel commented Jan 10, 2024

Hello,
I've observed several inconsistencies between the regex pandoc uses when reading Typst documents and the regex Typst uses.

Here are a few of them:

  1. Not all flags are supported. Typst regex supports the flags i, m, s, u, x. Of those, only i appears to be supported by Pandoc.
    For example, #(regex("(?m)a") in "A") compiles in Typst, but doesn't in Pandoc (3.1.11.1 via try.pandoc.org), with the error (line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}m)a" (line 1, column 4): unexpected '0' expecting an atom.
    • I especially miss the m (multiline) flag in order to be able to match the start of a line with ^ and the end of a line with $.
  2. Non-capturing groups are not supported: #(regex("(?:x)") in "x") compiles in Typst, but not in Pandoc ((line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}:x)" (line 1, column 4): unexpected '0' expecting an atom).
    • This is needed to avoid unnecessary capture groups in the output, and is frequently used across my packages.
  3. Explicitly named capture groups are not supported: #(regex("(?P<a>x)") in "x") compiles in Typst, but not in Pandoc ((line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}P<a>x)" (line 1, column 4): unexpected '0' expecting an atom).
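For reference, here is a small sketch of what the three constructs above are expected to do. It uses Python's `re` module purely as a stand-in, since it happens to accept the same `(?m)`, `(?:...)` and `(?P<name>...)` syntax; it is neither Typst's nor Pandoc's engine, and the variable names are only illustrative.

```python
import re

text = "ab\ncd"

# Without (?m), ^ and $ anchor only at the ends of the whole string,
# so no single run of word characters spans the entire input.
no_flag = re.findall(r"^\w+$", text)

# With (?m), ^ and $ also anchor at every line boundary.
multiline = re.findall(r"(?m)^\w+$", text)

# (?:...) groups without capturing, so only the second group is reported.
groups = re.match(r"(?:x)(y)", "xy").groups()

# (?P<name>...) captures under an explicit name.
named = re.match(r"(?P<a>x)", "x").group("a")

print(no_flag, multiline, groups, named)
```

The multiline case is the one that matters most here: only with the flag does each line produce its own match.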

Besides non-compilation, there are inconsistencies in the results of regex matching as well.

  1. #(regex("[\s\S]+") in "x") returns true in Typst, but false in Pandoc.
  2. #("a \n b" == "a \n b".match(regex("[^.]+")).text) returns true in Typst, but false in Pandoc. In general, [ ] seems to be unable to match newlines, when it should.
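The expected results for these two cases can be checked against another engine. The sketch below again uses Python's `re` module as a stand-in (an assumption made for illustration, not the engine Typst or Pandoc uses); it treats bracket expressions the way the Typst results above suggest.

```python
import re

text = "a \n b"

# [\s\S] is "whitespace or non-whitespace", i.e. any character at all.
any_char_found = re.search(r"[\s\S]+", "x") is not None

# Inside a bracket expression "." is a literal dot, so [^.] means
# "anything except a dot", which includes the newline.
whole_match = re.search(r"[^.]+", text).group(0)

print(any_char_found, whole_match == text)
```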

There are probably inconsistencies I haven't found yet as well, but they could be added to this issue as they are found.

Owner

jgm commented Jan 10, 2024

Yes. Problem is that we can't just use the regex engine typst uses. We are limited to the Haskell ecosystem. So what I do is use the regex-tdfa package for the basics, and try to supplement it when possible for things it is missing. E.g. it is missing \d \w \s, ?, and +, so I just replace these with equivalents. Of course, this isn't 100% reliable, and we can already see a place where it produces bad results in your #1 -- (?m) is a special construction; ? here doesn't mean "0 or 1", but my hack just replaces the ? with {0,1} with terrible results.
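To make the failure mode concrete, here is a minimal sketch of the substitution problem. It is written in Python for illustration only and is not the actual Haskell code; `naive_rewrite` and `group_aware_rewrite` are hypothetical names. The first function reproduces the mangled patterns quoted in the error messages above; the second shows one possible way to leave the `(?` group syntax untouched.

```python
def naive_rewrite(pattern: str) -> str:
    """Hypothetical stand-in for the hack: every '?' becomes '{0,1}'."""
    return pattern.replace("?", "{0,1}")


def group_aware_rewrite(pattern: str) -> str:
    """Rewrite '?' only where it is a quantifier.

    Sketch only: it skips escape pairs and the '(?' group marker, but does
    not handle '?' inside bracket expressions or lazy quantifiers like '*?'.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy an escape pair verbatim
            i += 2
            continue
        if ch == "(" and pattern[i + 1:i + 2] == "?":
            out.append("(?")  # group syntax such as (?m), (?:...), (?P<a>...)
            i += 2
            continue
        out.append("{0,1}" if ch == "?" else ch)
        i += 1
    return "".join(out)


print(naive_rewrite("(?m)a"))        # the mangled pattern from the error message
print(group_aware_rewrite("(?m)a"))  # group marker left intact
print(group_aware_rewrite("ab?"))    # ordinary quantifier still rewritten
```

Leaving `(?m)` intact only avoids the mangling. An engine without support for flags or special groups would still have to translate or reject those constructs in some other way.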

I could switch to using another regex engine. Hackage has regex-pcre-builtin, which comes with the C sources so that an external dependency isn't introduced. I've tried to avoid using wrapped C libraries in pandoc, but maybe could reconsider in this case. I imagine pcre would be pretty close.

I also reimplemented as much as I needed of the regex engine used by KDE for my skylighting library. This isn't currently published as a separate package, though.

Owner

jgm commented Jan 10, 2024

Oh, I see there is now https://hackage.haskell.org/package/regex-rure
But this would make pandoc depend on an installation of librure.so/dylib somewhere; I want to avoid that and have a perfectly self-contained static binary.
