Scrape Dart Spec

This project extracts the Context-Free Grammar (CFG) for Dart from the Dart2 language specification, which is written in LaTeX, and outputs a working, high-quality Antlr4 grammar for Dart. The purpose of the tool is to ease construction of the Antlr grammar as the Spec changes.

The scraper, implemented in refactor.sh, works in two phases.

In the first phase, the Spec is scanned for \begin{grammar} ... \end{grammar} blocks, which contain groups of EBNF rules. The raw CFG from the Spec is written out as orig.g4. This grammar is in Antlr4 syntax, but it does not work as-is.
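The extraction itself amounts to pulling those blocks out of the LaTeX source. A minimal sketch of the idea is shown below; the file name dartLangSpec.tex is an assumption, and refactor.sh performs additional clean-up beyond this:

# Sketch of phase one: copy the grammar blocks out of the LaTeX spec into orig.g4.
# dartLangSpec.tex is a hypothetical name for the Spec source file.
sed -n '/\\begin{grammar}/,/\\end{grammar}/p' dartLangSpec.tex \
  | grep -v -e '\\begin{grammar}' -e '\\end{grammar}' \
  > orig.g4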

In the second phase, this grammar is transformed into a working Antlr4 grammar for Dart, split into Dart2Parser.g4 and Dart2Lexer.g4. There are approximately three dozen modifications, described below. These refactorings are mostly performed with Trash, but a few require Bash shell glue to complete the operation.

Refactoring of the CFG in the Spec

In order to get a working Antlr4 grammar for Dart, the extracted EBNF rules require a number of edits.

1. "fragment" lexer rules

The Spec describes the lexical structure of a Dart program according to the Antlr4 naming convention: lowercase names in the CFG are parser rules, and uppercase names are lexer rules. However, the Spec does not differentiate lexer rules that can be used in a parser rule from lexer rules that should never be used in a parser rule.

In order to have a functioning Antlr4 grammar for Dart, lexer rules that should never be used in a parser rule must be marked as "fragment". Otherwise, it would be possible for the lexer to recognize those strings and return a token that the parser cannot use. The following lexer rules are marked as fragments:

  1. BUILT_IN_IDENTIFIER
  2. DIGIT
  3. ESCAPE_SEQUENCE
  4. EXPONENT
  5. HEX_DIGIT
  6. HEX_DIGIT_SEQUENCE
  7. IDENTIFIER_NO_DOLLAR
  8. IDENTIFIER_PART
  9. IDENTIFIER_PART_NO_DOLLAR
  10. IDENTIFIER_START
  11. IDENTIFIER_START_NO_DOLLAR
  12. LETTER
  13. NEWLINE
  14. OTHER_IDENTIFIER

These rules are marked as fragments in Trash using the trinsert command:

trparse orig.g4 | trinsert "//ruleSpec/lexerRuleSpec/TOKEN_REF[text()='BUILT_IN_IDENTIFIER']" "fragment " | trsponge -c
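The thirteen remaining tokens in the list above can be marked the same way; a small loop covers them all (a sketch; refactor.sh may issue the commands individually):

# Mark the remaining lexer rules as fragments, one trinsert pass per token.
for t in DIGIT ESCAPE_SEQUENCE EXPONENT HEX_DIGIT HEX_DIGIT_SEQUENCE \
         IDENTIFIER_NO_DOLLAR IDENTIFIER_PART IDENTIFIER_PART_NO_DOLLAR \
         IDENTIFIER_START IDENTIFIER_START_NO_DOLLAR LETTER NEWLINE OTHER_IDENTIFIER
do
    trparse orig.g4 | trinsert "//ruleSpec/lexerRuleSpec/TOKEN_REF[text()='$t']" "fragment " | trsponge -c
done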

2. WHITESPACE

The Spec describes "whitespace" (Section 5) as strings that should be ignored. The WHITESPACE rule itself appears in Section 20.1.1, but Antlr4 requires that the rule be marked so that the matched tokens are skipped:

trparse orig.g4 | trinsert "//ruleSpec/lexerRuleSpec[TOKEN_REF/text()='WHITESPACE']/SEMI" " -> skip" | trsponge -c

3. SINGLE_LINE_COMMENT and MULTI_LINE_COMMENT

The Dart Language Specification has rules for single-line and multi-line comments (Section 20.1.2). These rules must be marked so that the comments are skipped and never reach the parser:

trparse orig.g4 | trreplace "//ruleSpec/lexerRuleSpec[TOKEN_REF/text()='SINGLE_LINE_COMMENT']/lexerRuleBlock/lexerAltList/lexerAlt/lexerElements" "'//' ~[\r\n]* -> skip" | trsponge -c

For MULTI_LINE_COMMENT, the rule in the Spec does not work in Antlr because it is "greedy", meaning that it recognizes too much input as a single multi-line comment. The loop is therefore made non-greedy, and the rule is skipped:

MULTI_LINE_COMMENT : '/*' ( MULTI_LINE_COMMENT | ~ '*/' )* '*/' ;
=>
MULTI_LINE_COMMENT : '/*' ( MULTI_LINE_COMMENT | . )*? '*/'  -> skip ;
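The same kind of trreplace edit used for SINGLE_LINE_COMMENT can apply this rewrite; the command below is a sketch and may differ from what refactor.sh actually runs:

trparse orig.g4 | trreplace "//ruleSpec/lexerRuleSpec[TOKEN_REF/text()='MULTI_LINE_COMMENT']/lexerRuleBlock/lexerAltList/lexerAlt/lexerElements" "'/*' ( MULTI_LINE_COMMENT | . )*? '*/' -> skip" | trsponge -c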

4. Refactoring problematic tokens

In the Spec, there are rules that contain string literals such as '[]' and '>>'. In Antlr, these are declared as tokens. This can cause parsing problems when literals that are prefixes of them, e.g. '[' and '>', are also referenced. For example, '>>' can appear either at the end of a nested generic type argument list or as a shift operator in an expression:

    final typedCharCodes = unsafeCast<List<int>>(charCodes);
    ...
    sliceStart = bits >> _lengthBits;

The usual workaround for these problems is to split each such string literal into a sequence of shorter literals:

  1. '>>' => '>' '>'
  2. '>>>' => '>' '>' '>'
  3. '>>=' => '>' '>' '='
  4. '>>>=' => '>' '>' '>' '='
  5. '>=' => '>' '='
  6. '[]' => '[' ']'
  7. '[]=' => '[' ']' '='

For example, the '>>' literal in the shiftOperator rule is rewritten with trreplace:

trparse orig.g4 | trreplace "//ruleSpec/parserRuleSpec[RULE_REF/text()='shiftOperator']//STRING_LITERAL[text()=\"'>>'\"]" "'>' '>'" | trsponge -c
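The remaining literals in the list can be split the same way. As a sketch, a loop over a literal-to-replacement table applies all of them at once; note that this XPath rewrites every parser-rule occurrence, whereas refactor.sh targets specific rules such as shiftOperator:

# Sketch: split multi-character literals into sequences of shorter literals.
declare -A split=(
  ["'>>'"]="'>' '>'"
  ["'>>>'"]="'>' '>' '>'"
  ["'>>='"]="'>' '>' '='"
  ["'>>>='"]="'>' '>' '>' '='"
  ["'>='"]="'>' '='"
  ["'[]'"]="'[' ']'"
  ["'[]='"]="'[' ']' '='"
)
for lit in "${!split[@]}"
do
    trparse orig.g4 | trreplace "//parserRuleSpec//STRING_LITERAL[text()=\"$lit\"]" "${split[$lit]}" | trsponge -c
done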

5. Missing token rules for split grammar

The grammar for Dart requires a few semantic predicates to examine lookahead. These are implemented in Antlr's "target agnostic format", which uses a split grammar with the semantic predicates implemented in a base class. However, Antlr does not allow string literals in a parser grammar without a corresponding lexer rule for each literal.

The script adds punctuation and keyword rules to the lexer grammar. The following punctuation rules are added:

A: '&';
AA: '&&';
AE: '&=';
AT: '@';
C: ',';
CB: ']';
CBC: '}';
CIR: '^';
CIRE: '^=';
CO: ':';
CP: ')';
D: '.';
DD: '..';
DDD: '...';
DDDQ: '...?';
EE: '==';
EG: '=>';
EQ: '=';
GT: '>';
LT: '<';
LTE: '<=';
LTLT: '<<';
LTLTE: '<<=';
ME: '-=';
MINUS: '-';
MM: '--';
NE: '!=';
NOT: '!';
OB: '[';
OBC: '{';
OP: '(';
P: '|';
PC: '%';
PE: '%=';
PL: '+';
PLE: '+=';
PLPL: '++';
PO: '#';
POE: '|=';
PP: '||';
QU: '?';
QUD: '?.';
QUDD: '?..';
QUQU: '??';
QUQUEQ: '??=';
SC: ';';
SE: '/=';
SL: '/';
SQS: '~/';
SQSE: '~/=';
SQUIG: '~';
ST: '*';
STE: '*=';

The script then collects all keywords and creates a lexer rule for each. String literals inside lexer "fragment" rules are not considered, except for those in BUILT_IN_IDENTIFIER and OTHER_IDENTIFIER.

trparse orig.g4 | trxgrep "//STRING_LITERAL[not(ancestor::lexerRuleSpec/FRAGMENT) or ancestor::lexerRuleSpec/TOKEN_REF/text()='BUILT_IN_IDENTIFIER' or ancestor::lexerRuleSpec/TOKEN_REF/text()='OTHER_IDENTIFIER']/text()" | grep -E "'[a-zA-Z]+'" > temporary.txt
cat temporary.txt | sed "s/'//g" | sed 's/$/_/' | tr '[:lower:]' '[:upper:]' > temporary2.txt
paste -d ": " temporary2.txt temporary.txt | sed 's/$/;/' | sort -u > lexer_prods.txt
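The generated file lexer_prods.txt contains one rule per keyword, with entries along the lines of ABSTRACT_:'abstract';. One way these rules and the punctuation rules above could be folded into the split lexer grammar is sketched below; apart from lexer_prods.txt and Dart2Lexer.g4, the file names are hypothetical, and the actual assembly in refactor.sh may differ:

# Hypothetical assembly of the split lexer grammar from its pieces (a sketch only).
{
  echo 'lexer grammar Dart2Lexer;'
  cat punct_prods.txt       # the punctuation rules listed above (hypothetical file)
  cat lexer_prods.txt       # generated keyword rules, e.g. ABSTRACT_:'abstract';
  cat spec_lexer_rules.g4   # lexer rules carried over from orig.g4 (hypothetical file)
} > Dart2Lexer.g4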

6. Delete problematic rules

The following rules are deleted from the grammar; replacement string literal rules are added in step 9.

  1. MULTI_LINE_STRING_DQ_BEGIN_END
  2. MULTI_LINE_STRING_DQ_BEGIN_MID
  3. MULTI_LINE_STRING_DQ_MID_END
  4. MULTI_LINE_STRING_DQ_MID_MID
  5. MULTI_LINE_STRING_SQ_BEGIN_END
  6. MULTI_LINE_STRING_SQ_BEGIN_MID
  7. MULTI_LINE_STRING_SQ_MID_END
  8. MULTI_LINE_STRING_SQ_MID_MID
  9. QUOTES_DQ
  10. QUOTES_SQ
  11. RAW_MULTI_LINE_STRING
  12. RAW_SINGLE_LINE_STRING
  13. SIMPLE_STRING_INTERPOLATION
  14. SINGLE_LINE_STRING_DQ_BEGIN_END
  15. SINGLE_LINE_STRING_DQ_BEGIN_MID
  16. SINGLE_LINE_STRING_DQ_MID_END
  17. SINGLE_LINE_STRING_DQ_MID_MID
  18. SINGLE_LINE_STRING_SQ_BEGIN_END
  19. SINGLE_LINE_STRING_SQ_BEGIN_MID
  20. SINGLE_LINE_STRING_SQ_MID_END
  21. SINGLE_LINE_STRING_SQ_MID_MID
  22. STRING_CONTENT_COMMON
  23. STRING_CONTENT_DQ
  24. STRING_CONTENT_SQ
  25. STRING_CONTENT_TDQ
  26. STRING_CONTENT_TSQ
  27. multilineString
  28. scriptTag
  29. singleLineString
  30. stringInterpolation

7. Nuke references to EOF

For all rules, remove references to EOF, since EOF should appear only in the start rule (which is added in the next step).

trparse orig.g4 | trdelete "//TOKEN_REF[text()='EOF']" | trsponge -c

8. Add start rule

The Spec does not give a start rule for the grammar. A start rule is added:

trparse orig.g4 | trinsert "//parserRuleSpec[RULE_REF/text()='letExpression']" "compilationUnit: (libraryDeclaration | partDeclaration | expression | statement) EOF ;" | trsponge -c

9. Add replacement string literal rules

The Spec's string literal rules are lexer rules, but they reference parser rules. This cannot be done directly in Antlr (one would need to call the parser from an action inside a lexer rule), so the string-related rules deleted in step 6 are replaced with the following:

multilineString : MultiLineString;
singleLineString : SingleLineString;
MultiLineString : '\"\"\"' StringContentTDQ*? '\"\"\"' | '\'\'\'' StringContentTSQ*? '\'\'\'' | 'r\"\"\"' (~'\"' | '\"' ~'\"' | '\"\"' ~'\"')* '\"\"\"' | 'r\'\'\'' (~'\'' | '\'' ~'\'' | '\'\'' ~'\'')* '\'\'\'' ;
SingleLineString : StringDQ | StringSQ | 'r\'' (~('\'' | '\n' | '\r'))* '\'' | 'r\"' (~('\"' | '\n' | '\r'))* '\"' ;
fragment StringDQ : '\"' StringContentDQ*? '\"' ;
fragment StringContentDQ : ~('\\\\' | '\"' | '\n' | '\r' | '\$') | '\\\\' ~('\n' | '\r') | StringDQ | '\${' StringContentDQ*? '}' | '\$' { CheckNotOpenBrace() }? ;
fragment StringSQ : '\'' StringContentSQ*? '\'' ;
fragment StringContentSQ : ~('\\\\' | '\'' | '\n' | '\r' | '\$') | '\\\\' ~('\n' | '\r') | StringSQ | '\${' StringContentSQ*? '}' | '\$' { CheckNotOpenBrace() }? ;
fragment StringContentTDQ : ~('\\\\' | '\"') | '\"' ~'\"' | '\"\"' ~'\"' ;
fragment StringContentTSQ : '\'' ~'\'' | '\'\'' ~'\'' | . ;

10. partDeclaration

partDeclaration : partHeader topLevelDeclaration* EOF ;
=>
partDeclaration : partHeader  (metadata topLevelDeclaration)*  ;

11. declaration

declaration : 'external' factoryConstructorSignature | 'external' constantConstructorSignature | 'external' constructorSignature | ( 'external' 'static'? )? getterSignature | ( 'external' 'static'? )? setterSignature | ( 'external' 'static'? )? functionSignature | 'external'? operatorSignature | 'static' 'const' type? staticFinalDeclarationList | 'static' 'final' type? staticFinalDeclarationList | 'static' 'late' 'final' type? initializedIdentifierList | 'static' 'late'? varOrType initializedIdentifierList | 'covariant' 'late' 'final' type? identifierList | 'covariant' 'late'? varOrType initializedIdentifierList | 'late'? 'final' type? initializedIdentifierList | 'late'? varOrType initializedIdentifierList | redirectingFactoryConstructorSignature | constantConstructorSignature ( redirection | initializers )? | constructorSignature ( redirection | initializers )? ;
=>
declaration :ABSTRACT_? ( EXTERNAL_ factoryConstructorSignature | EXTERNAL_ constantConstructorSignature | EXTERNAL_ constructorSignature | ( EXTERNAL_ STATIC_? )? getterSignature | ( EXTERNAL_ STATIC_? )? setterSignature | ( EXTERNAL_ STATIC_? )? functionSignature | EXTERNAL_? operatorSignature | STATIC_ CONST_ type? staticFinalDeclarationList | STATIC_ FINAL_ type? staticFinalDeclarationList | STATIC_ LATE_ FINAL_ type? initializedIdentifierList | STATIC_ LATE_? varOrType initializedIdentifierList | COVARIANT_ LATE_ FINAL_ type? identifierList | COVARIANT_ LATE_? varOrType initializedIdentifierList | LATE_? FINAL_ type? initializedIdentifierList | LATE_? varOrType initializedIdentifierList | redirectingFactoryConstructorSignature | constantConstructorSignature ( redirection | initializers )? | constructorSignature ( redirection | initializers )? );

12. functionBody

functionBody : 'async'? '=>' expression ';' | ( 'async' '*'? | 'sync' '*' )? block ;
=>
functionBody : NATIVE_ stringLiteral? SC | ASYNC_? EG expression SC | ( ASYNC_ ST? | SYNC_ ST )? block ;

13. reserved_word

RESERVED_WORD : 'assert' | 'break' | 'case' | 'catch' | 'class' | 'const' | 'continue' | 'default' | 'do' | 'else' | 'enum' | 'extends' | 'false' | 'final' | 'finally' | 'for' | 'if' | 'in' | 'is' | 'new' | 'null' | 'rethrow' | 'return' | 'super' | 'switch' | 'this' | 'throw' | 'true' | 'try' | 'var' | 'void' | 'while' | 'with' ;
=>
reserved_word : ASSERT_ | BREAK_ | CASE_ | CATCH_ | CLASS_ | CONST_ | CONTINUE_ | DEFAULT_ | DO_ | ELSE_ | ENUM_ | EXTENDS_ | FALSE_ | FINAL_ | FINALLY_ | FOR_ | IF_ | IN_ | IS_ | NEW_ | NULL_ | RETHROW_ | RETURN_ | SUPER_ | SWITCH_ | THIS_ | THROW_ | TRUE_ | TRY_ | VAR_ | VOID_ | WHILE_ | WITH_ ;

14. identifier

identifier : IDENTIFIER | BUILT_IN_IDENTIFIER | OTHER_IDENTIFIER ;
=>
identifier : IDENTIFIER | ABSTRACT_ | AS_ | COVARIANT_ | DEFERRED_ | DYNAMIC_ | EXPORT_ | EXTERNAL_ | EXTENSION_ | FACTORY_ | FUNCTION_ | GET_ | IMPLEMENTS_ | IMPORT_ | INTERFACE_ | LATE_ | LIBRARY_ | MIXIN_ | OPERATOR_ | PART_ | REQUIRED_ | SET_ | STATIC_ | TYPEDEF_ | FUNCTION_ | ASYNC_ | HIDE_ | OF_ | ON_ | SHOW_ | SYNC_ | AWAIT_ | YIELD_ | DYNAMIC_ | NATIVE_ ;

15. typeIdentifier

typeIdentifier : IDENTIFIER | OTHER_IDENTIFIER ;
=>
typeIdentifier : IDENTIFIER | ASYNC_ | HIDE_ | OF_ | ON_ | SHOW_ | SYNC_ | AWAIT_ | YIELD_ | DYNAMIC_ | NATIVE_ | FUNCTION_;
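Once Dart2Parser.g4 and Dart2Lexer.g4 have been produced, the grammar can be exercised with the standard Antlr4 toolchain. A minimal smoke test for the Java target might look like the following sketch; it assumes the usual antlr4 and grun aliases are configured, that test.dart is some Dart source file, and that the base class supplying the semantic predicates (e.g. CheckNotOpenBrace()) is compiled alongside the generated code:

# Sketch of a smoke test for the generated split grammar (Java target assumed).
antlr4 Dart2Lexer.g4 Dart2Parser.g4          # generate the lexer and parser sources
javac *.java                                 # compile them together with the predicate base class
grun Dart2 compilationUnit -tree test.dart   # parse a Dart file starting from compilationUnit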
