Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: make optional rules mandatory #21

Open
SKalt opened this issue Aug 28, 2021 · 4 comments
Open

feature request: make optional rules mandatory #21

SKalt opened this issue Aug 28, 2021 · 4 comments

Comments

@SKalt
Copy link
Contributor

SKalt commented Aug 28, 2021

I've got a grammar where some rules match the empty string. I'd like to transform those rules such that they no longer match the empty string and have all references to the rules made optional. Example:

// before
my_rule: foo opt_bar baz;
opt_bar: bar | ;
// after
my_rule: foo opt_bar? baz;
opt_bar: bar;

More difficult example:

// before
foo: bar | opt_baz | quux; // <- foo matches the empty string too
opt_baz: baz | ;
// after
foo: bar | opt_baz? | quux;
opt_baz: baz; 

If you point me to where this feature should go, I'd be happy to take a stab at it.

@SKalt
Copy link
Contributor Author

SKalt commented Sep 26, 2021

Update: I'm able to identify the rules which include | ; using

setup
cat <<EOF > ./temp.g4
grammar temp;
foo: bar | opt_baz | quux;
opt_baz: baz | ; // <-- I'd like to delete `| ;`
bar: 'bar';
baz: 'baz';
quux: 'quux';
EOF
#!/bin/bash
xpath='//parserRuleSpec//alternative[not(element)]'
trparse ./temp.g4 | trxgrep "$xpath" | trxml
# Result size 1
# <alternative>
# </alternative>

and the names of rules which have an alternate branch | ; by

xpath='//parserRuleSpec//alternative[not(element)]/ancestor::parserRuleSpec/RULE_REF'
trparse ./temp.g4 | trxgrep "$xpath" | trxml
# Result size 1
# opt_baz

EDIT: I'm working with the following bash function:

# an xpath query to seek and destroy `| ;` tokens that weren't grouped into
# populated `<alternative>` nodes
xpath_del="//parserRuleSpec[RULE_REF/text() = '<RULE_NAME>']//alternative[not(element)]/../../*[text() = '|']"
# an XPath query to find all relevant references to a rule 
xpath_rename="//parserRuleSpec//ruleBlock//RULE_REF[text()='<RULE_NAME>']"

make_mandatory() {
    local rule_name="${1:?rule name required}"
    # read IR json from stdin
    trdelete "${xpath_del//<RULE_NAME>/$rule_name}" |                     # delete the offending `| ;`
        trreplace "${xpath_rename//<RULE_NAME>/$rule_name}" "$rule_name?" # rename references to `rule_name` to `rule_name?`
}

# which works as follows:
trparse ./temp.g4 | make_mandatory "opt_baz" | trprint
# grammar temp;
# foo: bar | opt_baz? | quux;
# opt_baz: baz ;
# bar: 'bar';
# baz: 'baz';
# quux: 'quux'; 

@kaby76
Copy link
Owner

kaby76 commented Sep 26, 2021

I think one way to do this would be to (1) find a LHS symbol of a rule that contains an empty top-level alt. (2) Construct an xpath expression that finds all uses of any of the symbol names found in step 1. Then, use trinsert to add a "?" after the symbol name. I don't think I have an "insert after" option for trinsert, so this would need to be added. (3) After you made the inserts, go back to find all the rules with empty top-level alts, and use trdelete to nuke those empty alts.

So, there would be a little bit of work fixing trinsert. I already added some function to the "TrashBase" to have InsertBefore() and InsertAfter(). I'm not sure what the best xpath expression would be for step (1).

@kaby76
Copy link
Owner

kaby76 commented Sep 27, 2021

Hi @SKalt ,

I finally figured out how to do this, and wrote this Bash script to do it.

names=`trparse temp.g4 | trxgrep ' //parserRuleSpec[*//alternative[count(*)=0]]/RULE_REF' | trtext`
cond=`echo "$names" | tr '\n' ' ' | tr -d '\r' | sed "s:\([a-zA-Z_][a-zA-Z_]*\):or text()='\1':g" | sed "s:^or ::"`
trparse temp.g4 | trinsert -a " //ruleBlock//RULE_REF[$cond]" '?' | trsponge -o out -c

In order to follow along at what this does, you have to really look at the parse tree (trparse temp.g4 | trtree).

( grammarSpec
  ( grammarDecl
    ( grammarType
      ( GRAMMAR i=0 txt=grammar tt=19 DEFAULT_TOKEN_CHANNEL
    ) ) 
    ( identifier
      ( RULE_REF i=8 txt=temp tt=2 DEFAULT_TOKEN_CHANNEL
    ) ) 
    ( SEMI i=12 txt=; tt=32 DEFAULT_TOKEN_CHANNEL
  ) ) 
  ( rules
    ( ruleSpec
      ( parserRuleSpec
	( RULE_REF i=15 txt=my_rule tt=2 DEFAULT_TOKEN_CHANNEL
	) 
	( COLON i=22 txt=: tt=29 DEFAULT_TOKEN_CHANNEL
	) 
	( ruleBlock
	  ( ruleAltList
	    ( labeledAlt
	      ( alternative
		( element
		  ( atom
		    ( ruleref
		      ( RULE_REF i=24 txt=foo tt=2 DEFAULT_TOKEN_CHANNEL
		) ) ) ) 
		( element
		  ( atom
		    ( ruleref
		      ( RULE_REF i=28 txt=opt_bar tt=2 DEFAULT_TOKEN_CHANNEL
		) ) ) ) 
		( element
		  ( atom
		    ( ruleref
		      ( RULE_REF i=36 txt=baz tt=2 DEFAULT_TOKEN_CHANNEL
	) ) ) ) ) ) ) ) 
	( SEMI i=39 txt=; tt=32 DEFAULT_TOKEN_CHANNEL
	) 
	( exceptionGroup
    ) ) ) 
    ( ruleSpec
      ( parserRuleSpec
	( RULE_REF i=42 txt=opt_bar tt=2 DEFAULT_TOKEN_CHANNEL
	) 
	( COLON i=49 txt=: tt=29 DEFAULT_TOKEN_CHANNEL
	) 
	( ruleBlock
	  ( ruleAltList
	    ( labeledAlt
	      ( alternative
		( element
		  ( atom
		    ( ruleref
		      ( RULE_REF i=51 txt=bar tt=2 DEFAULT_TOKEN_CHANNEL
	    ) ) ) ) ) ) 
	    ( OR i=55 txt=| tt=45 DEFAULT_TOKEN_CHANNEL
	    ) 
	    ( labeledAlt
	      ( alternative
	) ) ) ) 
	( SEMI i=57 txt=; tt=32 DEFAULT_TOKEN_CHANNEL
	) 
	( exceptionGroup
  ) ) ) ) 
  ( EOF i=62 txt= tt=-1 DEFAULT_TOKEN_CHANNEL
) ) 

The first line of this script parses the grammar then finds the names of rules that have an alternative that is empty. One of the problems was that I forgot what // means. Unless it is immediately preceeded with a name as in parserRuleSpec//alternative, or a symbol as in *//alternative, the step always starts back at the root of the tree. So, //parserRuleSpec[//alternative] would find all parser rules irrelevant of whether it contained a descendant of node type alternative, because the test [//alternative] would always be true--there are plent of rules in the entire grammar with alternative. But, //parserRuleSpec[*//alternative] would find all parser rules with an alternative descendant. We want to actually limit this further with the test [count(*)=0], which filters the alternative nodes that have no children.

The second line is fairly grungy because I don't know of a great way in Bash to take a string of rule names separated by spaces, and turn it into an xpath test. E.g., in the grammar cond would become text()='opt_bar'.

The last line finds all occurrences of the symbol opt_bar on the right hand side of a rule (because it is descended from a ruleBlock). and inserts the string ? after the symbol in the rule. The parse tree is serialized into a string and reparsed by trinsert itself. The output is piped to trsponge, which then outputs the file to the directory out/.

To get this to work, I had to modify trinsert. I noticed that some of the tools were crashing because dependent assemblies where not loading, so I checked and fixed all the tools, and have a new release tomorrow (v0.11.3).

This is a wonderful example of what the tool chain can do. Thanks so much for bringing it up.

@SKalt
Copy link
Contributor Author

SKalt commented Sep 28, 2021

Nice use of tr '\n' $replacement and tr -d; I didn't know think of that.

The second line is fairly grungy

Do you think it's worth writing this transform out in C# rather than bash?


In the meantime I found a new, interesting variation of this problem: how to

  1. take a rule that has two alternatives, one of which is the empty string, and
  2. replace all references to the rule with an optional reference to the populated arm.

So

grammar temp
foo: bar opt_baz quux;
opt_baz: bar | ;
bar: 'bar';
quux: 'quux';

would be transformed like

+foo: bar baz? quux;
-foo: bar opt_baz quux;
-opt_baz: bar | ;

I solved that with

# an xpath to get the text of the non-optional branch of a trivially optional rule
populated="//parserRuleSpec[RULE_REF/text()='<RULE_NAME>']" # find a specific rule
populated="${populated}/ruleBlock/ruleAltList"              # with a top-level list of alternatives
populated="${populated}[count(labeledAlt) = 2]"             # check the alt list has only 2 branches
populated="${populated}/labeledAlt[alternative[element]]"   # select the populated branch
# ^ this is largely paranoia; in my pipeline I've already used a variation of this
# query to find the names of the relevant rules. 

trivial_ref="//parserRuleSpec//ruleref[RULE_REF/text()='<RULE_NAME>']"
trivial_def="//parserRuleSpec[RULE_REF/text()='<RULE_NAME>']"
dissolve_trivially_opt_rule() {
    local rule_name="${1:?rule name required}"
    local buffer
    local text

    buffer="$(cat -)"
    text="$(
        echo "$buffer" |
            trxgrep "${populated//<RULE_NAME>/$rule_name}" |
            trtext
    )"
    echo "$buffer" |
        trreplace "${trivial_ref//<RULE_NAME>/$rule_name}" "($text)?" |
        trdelete "${trivial_def//<RULE_NAME>/$rule_name}"
    echo "done: $rule_name" >&2
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants