
Production separators on multiple lines #16

Closed
hembergerik opened this issue Sep 1, 2016 · 5 comments
@hembergerik (Collaborator)

I have an edit of `read_bnf` to support production separators on multiple lines. If you want to accept it, you can check whether it passes the tests you write (it works on my computer...). I find grammars a lot easier to read when at least each production can be on a separate line.

```python
import re

def read_bnf_file(self, file_name):
    """Read a grammar file in BNF format.

    :param file_name: BNF grammar file
    :type file_name: str
    """
    assert file_name.endswith('.bnf')

    rule_separator = "::="
    # Don't allow space in NTs, and use lookbehind to match "<"
    # and ">" only if not preceded by backslash. Group the whole
    # thing with capturing parentheses so that split() will return
    # all NTs and Ts. TODO does this handle quoted NT symbols?
    non_terminal_pattern = r"((?<!\\)<\S+?(?<!\\)>)"
    # Use lookbehind again to match "|" only if not preceded by
    # backslash. Don't group, so split() will return only the
    # productions, not the separators.
    production_separator = r"(?<!\\)\|"
    lhs = None
    last_character = None

    # Read the grammar file
    for line in open(file_name, 'r'):
        if not line.startswith("#") and line.strip() != "":
            # Split rules.
            has_rule_separator = line.find(rule_separator) > 0
            if has_rule_separator:
                lhs, productions = line.split(rule_separator, 1)  # 1 split
                lhs = lhs.strip()
                if not re.search(non_terminal_pattern, lhs):
                    raise ValueError("lhs is not a NT:", lhs)

                self.non_terminals.add(lhs)
                if self.start_rule is None:
                    self.start_rule = (lhs, self.NT)
            else:
                productions = line.strip()

            # Find terminals and non-terminals
            tmp_productions = []
            # TODO make handling of multilines nicer; some edge cases are
            # passing, e.g. a line ends with | and the next starts with |
            production_split = re.split(production_separator, productions)
            for production in production_split:
                production = production.strip().replace(r"\|", "|")
                tmp_production = []
                for symbol in re.split(non_terminal_pattern, production):
                    symbol = symbol.replace(r"\<", "<").replace(r"\>", ">")
                    if len(symbol) == 0:
                        continue
                    elif re.match(non_terminal_pattern, symbol):
                        tmp_production.append((symbol, self.NT))
                    else:
                        self.terminals.add(symbol)
                        tmp_production.append((symbol, self.T))

                if tmp_production:
                    tmp_productions.append(tmp_production)

            assert lhs is not None, "No lhs"

            # Create a rule
            if lhs not in self.rules:
                self.rules[lhs] = tmp_productions
            elif len(production_split) > 1 or last_character == "|":
                self.rules[lhs].extend(tmp_productions)

            # Remember the last character of the line
            last_character = productions[-1]

        print(line, end='')  # debug output

    # Debug: dump the parsed rules
    for k, v in self.rules.items():
        print("%s; %s" % (k, v))
```
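As a quick illustration of the escaped-separator handling the patch relies on, here is a minimal sketch using only the `production_separator` pattern on a made-up productions string:

```python
import re

# "|" splits productions only when not preceded by a backslash
production_separator = r"(?<!\\)\|"

productions = r" <e> + <e> | <e> - <e> | \| "
parts = [p.strip().replace(r"\|", "|")
         for p in re.split(production_separator, productions)]
print(parts)  # ['<e> + <e>', '<e> - <e>', '|']
```

Note that the escaped `\|` survives the split and is then unescaped to a literal `|` terminal.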
@t-h-e (Collaborator) commented Sep 9, 2016

Wouldn't it be easiest to let regular expressions do the dirty work? I use this approach in a C# implementation.

You can have:

  • production separators on multiple lines
  • comments at the end of any line
  • single quotation marks within double quotation marks, and vice versa
  • any characters inside quotations, even the separator '|'

Additionally, the code becomes more readable and maintainable, and it is less error-prone.
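For instance, a grammar (a made-up example) exercising all four of these cases might look like:

```
<expr> ::= <expr> '+' <expr>    # a comment at the end of a line
         | "it's x"             # a single quote inside double quotes
         | '|'                  # the separator used inside quotes
```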

Examples of parsing some of my grammars can be found here:

Feel free to improve the regular expressions and give feedback.

```python
from re import finditer, DOTALL, MULTILINE

# Class-level regexes (raw strings, to avoid invalid-escape warnings)
ruleregex = r'(?P<rulename><\S+>)\s*::=\s*(?P<production>(?:(?=\#)\#[^\r\n]*|(?!<\S+>\s*::=).+?)+)'
productionregex = r'(?=\#)(?:\#.*$)|(?!\#)\s*(?P<production>(?:[^\'\"\|\#]+|\'.*?\'|".*?")+)'
productionpartsregex = r'\ *([\r\n]+)\ *|([^\'"<\r\n]+)|\'(.*?)\'|"(.*?)"|(?P<subrule><[^>|\s]+>)|([<]+)'

def read_bnf_file(self, file_name):
    with open(file_name, 'r') as bnf:
        content = bnf.read()
        for rule in finditer(self.ruleregex, content, DOTALL):
            if self.start_rule is None:
                self.start_rule = (rule.group('rulename'), self.NT)
            self.non_terminals[rule.group('rulename')] = {
                'id': rule.group('rulename'),
                'min_steps': 9999999999999,
                'expanded': False,
                'recursive': True,
                'permutations': None,
                'b_factor': 0}
            tmp_productions = []
            for p in finditer(self.productionregex, rule.group('production'), MULTILINE):
                if p.group('production') is None or p.group('production').isspace():
                    continue
                tmp_production = []
                terminalparts = ''
                for sub_p in finditer(self.productionpartsregex, p.group('production').strip()):
                    if sub_p.group('subrule'):
                        # Flush any accumulated terminal before the non-terminal
                        if terminalparts:
                            symbol = [terminalparts, self.T, 0, False]
                            tmp_production.append(symbol)
                            self.terminals.append(terminalparts)
                            terminalparts = ''
                        tmp_production.append([sub_p.group('subrule'), self.NT])
                    else:
                        terminalparts += ''.join([part for part in sub_p.groups() if part])

                if terminalparts:
                    symbol = [terminalparts, self.T, 0, False]
                    tmp_production.append(symbol)
                    self.terminals.append(terminalparts)
                tmp_productions.append(tmp_production)

            if rule.group('rulename') not in self.rules:
                self.rules[rule.group('rulename')] = tmp_productions
                if len(tmp_productions) == 1:
                    print("Warning: Grammar contains unit production "
                          "for production rule", rule.group('rulename'))
                    print("       Unit productions consume GE codons.")
            else:
                raise ValueError("lhs should be unique", rule.group('rulename'))
```
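A minimal sketch of these regexes in action, outside the class (the grammar string is a made-up example; the regexes are copied from the snippet above as raw strings):

```python
from re import finditer, DOTALL, MULTILINE

ruleregex = r'(?P<rulename><\S+>)\s*::=\s*(?P<production>(?:(?=\#)\#[^\r\n]*|(?!<\S+>\s*::=).+?)+)'
productionregex = r'(?=\#)(?:\#.*$)|(?!\#)\s*(?P<production>(?:[^\'\"\|\#]+|\'.*?\'|".*?")+)'

grammar = """\
<expr> ::= <expr> '+' <expr>
         | 'x'  # comment at the end of a line
<op> ::= '|'   # the separator, safely quoted
"""

rules = {}
for rule in finditer(ruleregex, grammar, DOTALL):
    # Split each rule body into productions, skipping comments and whitespace
    prods = [m.group('production').strip()
             for m in finditer(productionregex, rule.group('production'), MULTILINE)
             if m.group('production') and not m.group('production').isspace()]
    rules[rule.group('rulename')] = prods

print(rules)
```

This shows the three behaviors from the bullet list: productions spanning multiple lines, end-of-line comments stripped, and a quoted `'|'` kept as a terminal rather than treated as a separator.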

@mikefenton (Collaborator) commented Sep 15, 2016
Any consensus on these suggestions? James/Dave, what are your thoughts?

@dvpfagan (Collaborator) commented Sep 15, 2016
Stefan's seems good


@jmmcd (Collaborator) commented Sep 15, 2016

If we can replace some manual parsing with an RE then we should.


@mikefenton (Collaborator)
Stefan's RegEx version is now implemented. The old system still exists under the name `old_read_bnf_file()`; it can be removed easily.
