Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: regexp/syntax: improve printing of flags #57950

Open
theseion opened this issue Jan 22, 2023 · 3 comments
Open

proposal: regexp/syntax: improve printing of flags #57950

theseion opened this issue Jan 22, 2023 · 3 comments
Labels
Milestone

Comments

@theseion
Copy link

Parsing and printing expressions with flags produces pretty bad output. Example: (?i:ab*c|d?e) -> (?i:A)(?i:B)*(?i:C)|(?i:D)?(?i:E). While the output is semantically correct, it causes the resulting expression to grow in size by approximately factor 3.

I am a core developer of https://github.com/coreruleset and we use Go to generate regular expressions that can become very large (hundreds of KB). A growth of an expression by factor 3 is not acceptable for us. The issue for us is mostly about the fold case flag (i) but other flags are affected as well, e.g., ab*c.|d?e. -> ab*c(?-s:.)|d?e(?-s:.) (better: (?:ab*c|d?e)(?-s:.), or (?-s)(?:ab*c|d?e).).

I propose to improve printing of regular expressions, so that flags are only repeated if necessary.

@seankhliao
Copy link
Member

why wouldn't you just keep the original, like regexp.Regex ?

cc @rsc

@robpike
Copy link
Contributor

robpike commented Jan 22, 2023

How are you printing it? The existing code does exactly what you want: https://go.dev/play/p/R5sPE02FYKJ

package main

import (
	"fmt"
	"regexp"
)

func main() {
	fmt.Println(regexp.MustCompile(`(?i:ab*c|d?e)`))
}

@theseion
Copy link
Author

theseion commented Jan 23, 2023

We generate a single regular expression out of multiple smaller expressions. Here's an example:

\s+__\$
command1
command2

Each line is treated as an alternation. The result would be \s+__\$|command1|command2. We do more complicated things too, such as substitutions and concatenation. In the end, we always end up with a list of alternations as above. We then feed these to a library (https://github.com/itchyny/rassemble-go) for final assembly and optimisation.

This process is also recursive in some cases, so the output from rassemble-go may become one of the alternations.

rassemble-go uses regex/syntax to parse and process the expression:

import "regexp/syntax"

r, _ := syntax.Parse("(?i:ab*c|d?e)", syntax.ClassNL|syntax.PerlX)
println(r.String())

The result is (?i:A)(?i:B)*(?i:C)|(?i:D)?(?i:E).

To finalize:

  1. the "original" expression may be embedded in another expression and I don't know the position in the string where I would need to insert the flags
  2. rassemble-go uses regexp/syntax to parse and manipulate the regular expression, hence the output differs from the one produced by regexp.Regexp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

4 participants