
Production separators on multiple lines #16

Closed
hembergerik opened this issue Sep 1, 2016 · 5 comments
@hembergerik (Collaborator)

I have an edit of `read_bnf` to support production separators on multiple lines. If you want to accept it, you can check whether it passes the tests you write (it works on my computer...). I find grammars a lot easier to read when at least each production can be on a separate line.

```python
import re

def read_bnf_file(self, file_name):
    """Read a grammar file in BNF format.

    :param file_name: BNF grammar file
    :type file_name: str
    """
    assert file_name.endswith('.bnf')

    rule_separator = "::="
    # Don't allow space in NTs, and use lookbehind to match "<"
    # and ">" only if not preceded by backslash. Group the whole
    # thing with capturing parentheses so that split() will return
    # all NTs and Ts. TODO does this handle quoted NT symbols?
    non_terminal_pattern = r"((?<!\\)<\S+?(?<!\\)>)"
    # Use lookbehind again to match "|" only if not preceded by
    # backslash. Don't group, so split() will return only the
    # productions, not the separators.
    production_separator = r"(?<!\\)\|"
    lhs = None
    last_character = None

    # Read the grammar file
    for line in open(file_name, 'r'):
        if not line.startswith("#") and line.strip() != "":
            # Split rules.
            has_rule_separator = line.find(rule_separator) > 0
            if has_rule_separator:
                lhs, productions = line.split(rule_separator, 1)  # 1 split
                lhs = lhs.strip()
                if not re.search(non_terminal_pattern, lhs):
                    raise ValueError("lhs is not a NT:", lhs)

                self.non_terminals.add(lhs)
                if self.start_rule is None:
                    self.start_rule = (lhs, self.NT)
            else:
                productions = line.strip()

            # Find terminals and non-terminals
            tmp_productions = []
            # TODO make handling of multilines nicer; some edge cases are
            # passing, e.g. a line ends with | and the next starts with |
            production_split = re.split(production_separator, productions)
            for production in production_split:
                production = production.strip().replace(r"\|", "|")
                tmp_production = []
                for symbol in re.split(non_terminal_pattern, production):
                    symbol = symbol.replace(r"\<", "<").replace(r"\>", ">")
                    if len(symbol) == 0:
                        continue
                    elif re.match(non_terminal_pattern, symbol):
                        tmp_production.append((symbol, self.NT))
                    else:
                        self.terminals.add(symbol)
                        tmp_production.append((symbol, self.T))

                if tmp_production:
                    tmp_productions.append(tmp_production)

            assert lhs is not None, "No lhs"

            # Create a rule
            if lhs not in self.rules:
                self.rules[lhs] = tmp_productions
            elif len(production_split) > 1 or last_character == "|":
                self.rules[lhs].extend(tmp_productions)

            # Remember the last character of the line
            last_character = productions[-1]

        print(line, end='')  # debug output

    # Debug: dump the parsed rules
    for k, v in self.rules.items():
        print("%s; %s" % (k, v))
```
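As a quick illustration of the escaped-separator handling the patch relies on, here is a minimal sketch using only the `production_separator` pattern on a made-up productions string:

```python
import re

# "|" splits productions only when not preceded by a backslash
production_separator = r"(?<!\\)\|"

productions = r" <e> + <e> | <e> - <e> | \| "
parts = [p.strip().replace(r"\|", "|")
         for p in re.split(production_separator, productions)]
print(parts)  # ['<e> + <e>', '<e> - <e>', '|']
```

Note that the escaped `\|` survives the split and is then unescaped to a literal `|` terminal.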
@t-h-e (Collaborator) commented Sep 9, 2016

Wouldn't it be easiest to let regular expressions do the dirty work? I use this approach in a C# implementation.

You can have:

  • production separators on multiple lines
  • comments at the end of any line
  • single quotation marks within double quotation marks, and vice versa
  • any characters inside quotations, even the separator '|'

Additionally, the code becomes more readable and maintainable, and it is less error-prone.
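For instance, a grammar (a made-up example) exercising all four of these cases might look like:

```
<expr> ::= <expr> '+' <expr>    # a comment at the end of a line
         | "it's x"             # a single quote inside double quotes
         | '|'                  # the separator used inside quotes
```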

Examples of parsing some of my grammars can be found here:

Feel free to improve the regular expressions and give feedback.

```python
from re import finditer, DOTALL, MULTILINE

# Class-level regexes (raw strings, to avoid invalid-escape warnings)
ruleregex = r'(?P<rulename><\S+>)\s*::=\s*(?P<production>(?:(?=\#)\#[^\r\n]*|(?!<\S+>\s*::=).+?)+)'
productionregex = r'(?=\#)(?:\#.*$)|(?!\#)\s*(?P<production>(?:[^\'\"\|\#]+|\'.*?\'|".*?")+)'
productionpartsregex = r'\ *([\r\n]+)\ *|([^\'"<\r\n]+)|\'(.*?)\'|"(.*?)"|(?P<subrule><[^>|\s]+>)|([<]+)'

def read_bnf_file(self, file_name):
    with open(file_name, 'r') as bnf:
        content = bnf.read()
        for rule in finditer(self.ruleregex, content, DOTALL):
            if self.start_rule is None:
                self.start_rule = (rule.group('rulename'), self.NT)
            self.non_terminals[rule.group('rulename')] = {
                'id': rule.group('rulename'),
                'min_steps': 9999999999999,
                'expanded': False,
                'recursive': True,
                'permutations': None,
                'b_factor': 0}
            tmp_productions = []
            for p in finditer(self.productionregex, rule.group('production'), MULTILINE):
                if p.group('production') is None or p.group('production').isspace():
                    continue
                tmp_production = []
                terminalparts = ''
                for sub_p in finditer(self.productionpartsregex, p.group('production').strip()):
                    if sub_p.group('subrule'):
                        # Flush any accumulated terminal before the non-terminal
                        if terminalparts:
                            symbol = [terminalparts, self.T, 0, False]
                            tmp_production.append(symbol)
                            self.terminals.append(terminalparts)
                            terminalparts = ''
                        tmp_production.append([sub_p.group('subrule'), self.NT])
                    else:
                        terminalparts += ''.join([part for part in sub_p.groups() if part])

                if terminalparts:
                    symbol = [terminalparts, self.T, 0, False]
                    tmp_production.append(symbol)
                    self.terminals.append(terminalparts)
                tmp_productions.append(tmp_production)

            if rule.group('rulename') not in self.rules:
                self.rules[rule.group('rulename')] = tmp_productions
                if len(tmp_productions) == 1:
                    print("Warning: Grammar contains unit production "
                          "for production rule", rule.group('rulename'))
                    print("       Unit productions consume GE codons.")
            else:
                raise ValueError("lhs should be unique", rule.group('rulename'))
```
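A minimal sketch of these regexes in action, outside the class (the grammar string is a made-up example; the regexes are copied from the snippet above as raw strings):

```python
from re import finditer, DOTALL, MULTILINE

ruleregex = r'(?P<rulename><\S+>)\s*::=\s*(?P<production>(?:(?=\#)\#[^\r\n]*|(?!<\S+>\s*::=).+?)+)'
productionregex = r'(?=\#)(?:\#.*$)|(?!\#)\s*(?P<production>(?:[^\'\"\|\#]+|\'.*?\'|".*?")+)'

grammar = """\
<expr> ::= <expr> '+' <expr>
         | 'x'  # comment at the end of a line
<op> ::= '|'   # the separator, safely quoted
"""

rules = {}
for rule in finditer(ruleregex, grammar, DOTALL):
    # Split each rule body into productions, skipping comments and whitespace
    prods = [m.group('production').strip()
             for m in finditer(productionregex, rule.group('production'), MULTILINE)
             if m.group('production') and not m.group('production').isspace()]
    rules[rule.group('rulename')] = prods

print(rules)
```

This shows the three behaviors from the bullet list: productions spanning multiple lines, end-of-line comments stripped, and a quoted `'|'` kept as a terminal rather than treated as a separator.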

@mikefenton (Collaborator) commented Sep 15, 2016
Any consensus on these suggestions? James/Dave, what are your thoughts?

@dvpfagan (Collaborator) commented Sep 15, 2016
Stefan's seems good


@jmmcd (Collaborator) commented Sep 15, 2016

If we can replace some manual parsing with an RE then we should.


@mikefenton (Collaborator)
Stefan's RegEx version is now implemented. The old system still exists under the name `old_read_bnf_file()`; it can be removed easily.
