This tool converts the formal syntax section of an IEEE language standard document into an ANTLR4 grammar file.
pip install git+https://github.com/msagca/syntax-scraper.git
scrape-ieee [-h] -n grammar_name [-s start_page] [-e end_page] [--split] input_file
positional arguments:
  input_file       IEEE language standard document (format: PDF)

options:
  -h, --help       show this help message and exit
  -n grammar_name  ANTLR4 grammar name
  -s start_page    formal syntax start page (default: first page)
  -e end_page      formal syntax end page (default: last page)
  --split          create a split grammar
scrape-ieee -n SystemVerilog -s 1136 -e 1180 1800-2017.pdf
To complete the resulting grammar, please follow these steps:
- Open the generated .g4 file in Visual Studio Code.
- Install the ANTLR4 extension for Visual Studio Code.
- Address the highlighted issues or errors indicated by the ANTLR4 extension.
- Remove any title text, such as 'A.8.7' 'Numbers', from the rules.
- Remove trailing numbers in rule identifiers, for example, change time_literal44 to time_literal, unless the spec specifically mentions such a rule.
- Identify the rules that span multiple pages in the spec, and manually add any missing parts if necessary.
- Make sure to append EOF to the start rule(s), such as library_text and source_text, to mark the end of input.
- Locate the rules that describe lexical tokens like white space, comments, identifiers, numbers, etc., and convert them to lexer rules for proper tokenization.
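Some of the mechanical steps in this checklist can be scripted. The following is a minimal Python sketch and is not part of syntax-scraper. It assumes that title remnants show up as quoted section numbers such as 'A.8.7', that stray footnote markers appear as digits at the end of lowercase rule names, and that the grammar contains no comments. All function names and the sample grammar are hypothetical.

```python
import re

# A single-quoted ANTLR literal, e.g. 'supply0' or ';'.
LITERAL = r"'(?:\\.|[^'\\])*'"
# Assumed shape of a title remnant: a quoted section number such as 'A.8.7'.
SECTION_LITERAL = re.compile(r"'A\.\d+(?:\.\d+)*'")
# Either a quoted literal (skipped) or a lowercase identifier ending in digits.
SUFFIXED = re.compile(LITERAL + r"|\b([a-z](?:[a-z0-9_]*[a-z_])?)(\d+)\b")
# Either a quoted literal (skipped) or the ';' that terminates a rule.
RULE_END = re.compile(LITERAL + r"|;")


def find_title_remnants(grammar: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) pairs containing a quoted section number."""
    return [
        (number, line.strip())
        for number, line in enumerate(grammar.splitlines(), start=1)
        if SECTION_LITERAL.search(line)
    ]


def strip_trailing_digits(grammar: str, keep: frozenset = frozenset()) -> str:
    """Drop trailing digits from rule identifiers, except names listed in keep."""
    def rename(match: re.Match) -> str:
        if match.group(1) is None:       # quoted literal: leave untouched
            return match.group(0)
        if match.group(0) in keep:       # rule name that really ends in digits
            return match.group(0)
        return match.group(1)
    return SUFFIXED.sub(rename, grammar)


def append_eof(grammar: str, start_rule: str) -> str:
    """Wrap the body of start_rule in parentheses and append EOF to it."""
    header = re.search(rf"^{re.escape(start_rule)}\s*:", grammar, flags=re.MULTILINE)
    if header is None:
        return grammar
    for token in RULE_END.finditer(grammar, header.end()):
        if token.group() == ";":         # first ';' outside a quoted literal
            body = grammar[header.end():token.start()].strip()
            return (grammar[:header.end()] + " ( " + body + " ) EOF "
                    + grammar[token.start():])
    return grammar


# Hypothetical excerpt of a generated grammar.
SAMPLE = """\
source_text
  : timeunits_declaration? description*
  ;

time_literal44
  : 'A.8.7' 'Numbers' unsigned_number time_unit
  | fixed_point_number time_unit
  ;

strength0
  : 'supply0' | 'strong0'
  ;
"""

remnants = find_title_remnants(SAMPLE)
cleaned = strip_trailing_digits(SAMPLE, keep=frozenset({"strength0"}))
cleaned = append_eof(cleaned, "source_text")
```

Title remnants are only reported, not deleted, because the number of title words that follow a section number varies and is easier to judge by eye. The keep set covers rule names that legitimately end in a digit, such as strength0.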
If the --split option is specified, the tool will create a lexer rule for each keyword or punctuation symbol encountered during the parse tree walk. It will also generate rules for common lexical structures like identifiers, white space, and comments. These rules can be extended by the user later. However, due to limitations in the tool or parse errors, some of these automatically generated rules may be incorrect or invalid. For example, most tokens that begin with a capital letter need to be removed or modified. Additionally, certain symbols highlighted in bold in the specification document may not directly correspond to lexical tokens and require manual handling.

For instance, in the autogenerated lexer grammar for SystemVerilog, the rule LPASRP: '(*)'; does not accept white space between the characters. However, the input ( * ) should also be valid, since the parentheses act as argument delimiters in this context. To address this, LPASRP should be replaced by three separate rules: LP: '(';  AS: '*';  RP: ')'; Consequently, occurrences of the literal '(*)' in the parser grammar should be changed to '(' '*' ')' to match the changes made to the lexer grammar.
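The LPASRP fix can be applied with a plain text substitution. The sketch below is illustrative Python and is not provided by syntax-scraper. The helper names and the sample parser rule are hypothetical, and it assumes the parser grammar refers to the symbol as the quoted literal '(*)'.

```python
def split_composite_literal(parser_grammar: str, literal: str, parts: list) -> str:
    """Replace one quoted literal with a sequence of quoted single-token literals."""
    quoted = " ".join(f"'{part}'" for part in parts)
    return parser_grammar.replace(f"'{literal}'", quoted)


def lexer_rules(tokens: dict) -> str:
    """Render one lexer rule per token name, in insertion order."""
    return "\n".join(f"{name}: '{text}';" for name, text in tokens.items())


# Hypothetical parser rule that uses the composite literal.
parser_rule = "some_rule : '@' '(*)' ;"
fixed_rule = split_composite_literal(parser_rule, "(*)", ["(", "*", ")"])

# Replacement for the single LPASRP rule in the lexer grammar.
replacement = lexer_rules({"LP": "(", "AS": "*", "RP": ")"})
```

Because the three tokens are now matched separately, any white space skipped by the lexer may appear between them, which is what allows ( * ) to parse.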