syntax for multiline comments #78

goulu · 2018-02-01T09:47:07Z

What would be the grammar to parse comments like these (Delphi)?

(* this is a multiline comment
// containing a single line comment
{ and possibly another
multiline comment, and so on recursively}
*)

thanks !
(I'll share the Delphi grammar as soon as it works)

erezsh · 2018-02-01T09:57:48Z

Something along the lines of:

COMMENT: "(*" /(.|\n)+/ "*)"
       | "{" /(.|\n)+/ "}"

Does that answer your question?
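
A note on the regex itself: as a plain Python regex, a greedy body such as `(.|\n)+` runs on to the last closing delimiter in the input, so two separate comments can be swallowed as one. The sketch below uses Python's `re` module directly rather than Lark, so it only illustrates the regex behaviour, not how a particular Lark lexer resolves matches. The name `COMMENT_RE` and the sample text are assumptions made for illustration. It does not handle same-style nesting such as `{ { } }`.

```python
import re

# Hypothetical pattern (not from the thread): one alternative per comment style.
# The lazy quantifier *? stops at the first closing delimiter of the same style,
# and [\s\S] matches any character, including newlines.
COMMENT_RE = re.compile(r"\(\*[\s\S]*?\*\)|\{[\s\S]*?\}|//[^\r\n]*")

source = """(* this is a multiline comment
// containing a single line comment
{ and possibly another
multiline comment, and so on recursively}
*)
x := 1; { first } y := 2; { second }
"""

# Remove every comment; only the code between them should remain.
stripped = COMMENT_RE.sub("", source)
print(stripped)

# For contrast: a greedy body runs on to the LAST closing brace.
greedy = re.compile(r"\{(.|\n)+\}")
greedy_match = greedy.search("{ first } y := 2; { second }").group(0)
print(greedy_match)
```

The outer `(* ... *)` comment is removed as a whole because the `//` and `{ ... }` inside it are just content once the first alternative has started matching.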

goulu · 2018-02-01T14:39:44Z

it helped, thanks ;-) but it's still not ok ...

on my fork https://github.com/goulu/lark you will see a pascal/delphi grammar here : https://github.com/goulu/lark/blob/master/lark/grammars/delphi.g , with:

COMMENT                 :  "(*" /(.|\n|\r)+/ "*)"     
                        |  "{" /(.|\n|\r)+/ "}"      
                        |  "//" /(.)+/ NEWLINE

%ignore COMMENT

which works in simple cases, BUT doesn't with my test file https://github.com/goulu/lark/blob/master/tests/test_delphi/cclasses.pas line 3135 :

procedure tbitset.include(index: longint);
      var
        dataindex: longint;
      begin
        { don't use bitpacked array, not endian-safe }
        dataindex:=index shr 3;
        if (dataindex>=datasize) then
          grow(dataindex+16);
        fdata[dataindex]:=fdata[dataindex] or (1 shl (index and 7));
      end;

causes a parse error:

File "c:\dev\python\lark\lark\parsers\xearley.py", line 115, in scan
raise UnexpectedInput(stream, i, text_line, text_column, to_scan)
lark.lexer.UnexpectedInput: No token defined for: 'd' in 'datai' at line 3136 col 8

but if I convert it to a single line comment like this :
// don't use bitpacked array, not endian-safe }
it is parsed correctly

Any idea ?

erezsh · 2018-02-01T17:35:48Z

Maybe try removing the terminal definition of LCURLY.

goulu · 2018-02-02T09:45:41Z

did it, but no change ...

is there a way to get a clearer error message, like a context, or which rules/terminals are expected in the "unexpected input" ?

erezsh · 2018-02-02T10:57:13Z

Okay, you're right! I added some more information. It now says:

lark.lexer.UnexpectedInput: No token defined for: 'd' in 'datai' at line 3136 col 8

Expecting: {'FUNCTION', 'LBRACK', 'THREADVAR', 'PROCEDURE', 'CONST', 'PACKAGE', 'BEGIN', 'LIBRARY', 'DOT', 'CONSTRUCTOR', 'CLASS', 'DESTRUCTOR', 'LABEL', 'USES', 'VAR', 'RESOURCESTRING', 'UNIT', 'PROGRAM', 'ASM', 'TYPE', 'EXPORTS'}

I'm not sure why this happens. It might be a bug in the parser? See if you can get it to happen for a smaller input and grammar. It will help you debug, and if it looks like a bug, it will help me debug too.

goulu · 2018-02-02T15:06:53Z

Ok I just modified the "Hello World ! " example like this:

from lark import Lark
l = Lark('''start: WORD "," WORD "!"
            %import common.WORD
            %ignore " "
            
            COMMENT :  "{" /(.|\n)+/ "}"
            %ignore COMMENT 
         ''')
print( l.parse("Hello, {comment} World!") )

but it crashes :

> Traceback (most recent call last):
>   File "C:\Dev\Python\lark\tests\test_hello.py", line 8, in <module>
>     ''')
>   File "c:\dev\python\lark\lark\lark.py", line 153, in __init__
>     self.grammar = load_grammar(grammar, source)
>   File "c:\dev\python\lark\lark\load_grammar.py", line 574, in load_grammar
>     raise GrammarError("Unexpected input %r at line %d column %d in %s" % (e.context, e.line, e.column, name))
> lark.common.GrammarError: Unexpected input '/(.|\n' at line 5 column 27 in <string>

If I remove the \n from COMMENT : "{" /(.|\n)+/ "}" then it works.

So I suspected the \n should be in some way escaped and tried \\n instead : IT WORKS ! :-)

from lark import Lark
l = Lark('''start: WORD "," WORD "!"
            %import common.WORD
            %import common.NEWLINE
            %ignore " "
            
            COMMENT     : "{" /(.|\\n|\\r)+/ "}"    
                        | "(*" /(.|\\n|\\r)+/ "*)"  
                        |  "//" /(.)+/ NEWLINE
            %ignore COMMENT 
         ''')
print( l.parse("Hello, {comment} World!") )
print( l.parse("Hello, {multiline \n comment} World!") )
print( l.parse("{------header-------}Hello, (* comment *) World!//footer\n") )

mcondarelli · 2018-02-06T16:54:08Z

I'm having the same kind of problems with a basic Token definition:
WHATEVER: /[^\r\n]+/
raises an exception (lark.common.GrammarError: Unexpected input '/[^\r\n' at line 8 column 14 in <string>) while the following works as expected.
WHATEVER: /[^\\r\\n]+/
I am unsure if this is a bug or a feature.
Note that things like:
HEX_NUMBER: /0x[\da-f]*l?/i
seem to work OK, so I would suggest this is a problem with normal string escapes (and definitely an inconsistent handling).
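
The difference between the two cases can be checked without Lark. In a normal (non-raw) Python string literal, `\r` and `\n` are recognized escapes and become real control characters before the grammar text reaches the library, whereas `\d` is not a recognized escape and is currently passed through unchanged. The sketch below evaluates the source text of each literal; the helper name `cooked` is made up for this illustration. That a real line break inside the regexp is what the grammar loader rejects is an assumption based on the error messages in this thread.

```python
import ast
import warnings


def cooked(literal_source: str) -> str:
    """Return the value Python produces for a non-raw string literal's source text."""
    with warnings.catch_warnings():
        # Unrecognized escapes such as \d emit a warning in recent Python versions.
        warnings.simplefilter("ignore")
        return ast.literal_eval(literal_source)


# Source text as typed in a .py file, written here as raw strings so it stays intact.
newline_src = r"'/[^\r\n]+/'"
hex_src = r"'/0x[\da-f]*l?/i'"

newline_class = cooked(newline_src)
hex_number = cooked(hex_src)

# \r and \n were converted to real CR and LF characters: the text got shorter.
print(len(newline_src) - 2, "->", len(newline_class))
print("contains a real newline:", "\n" in newline_class)

# \d survived as a backslash followed by 'd': the text is unchanged.
print(len(hex_src) - 2, "->", len(hex_number))
print("still contains a backslash:", "\\" in hex_number)
```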

erezsh · 2018-02-09T08:11:57Z

@mcondarelli If you're defining the grammar in a Python file, you should define it as a raw string:

parser = Lark( r""" grammar """, ...)

erezsh · 2018-02-09T08:25:28Z

@goulu I'm glad you figured it out. You can also use raw strings (r""" grammar """)

Btw, how is the performance? Using Earley with such a big grammar and big input files can lead to some serious runtime! Generally I recommend using LALR for such tasks.

delzac · 2019-01-23T14:07:07Z

Hi, just want to give feedback that having to define the grammar as a raw string in a Python file was a gotcha for me. Perhaps it would be helpful to mention it in the examples or something.

Anyhow, lark is a joy to use, thanks for maintaining it :)

erezsh added a commit that referenced this issue Feb 2, 2018

Added more information in UnexpectedInput exception (Issue #78)

710cb6d

erezsh closed this as completed Mar 20, 2018
