syntax for multiline comments #78

Closed
goulu opened this issue Feb 1, 2018 · 10 comments


goulu commented Feb 1, 2018

What would the grammar be to parse comments like this (Delphi)?

(* this is a multiline comment
// containing a single line comment
{ and possibly another
multiline comment, and so on recursively}
*)

thanks !
(I'll share the Delphi grammar as soon as it works)

Member

erezsh commented Feb 1, 2018

Something along the lines of:

COMMENT: "(*" /(.|\n)+/ "*)"
       | "{" /(.|\n)+/ "}"

Does that answer your question?

Author

goulu commented Feb 1, 2018

It helped, thanks ;-) but it's still not OK...

On my fork, https://github.com/goulu/lark, there is a Pascal/Delphi grammar at https://github.com/goulu/lark/blob/master/lark/grammars/delphi.g, with:

COMMENT                 :  "(*" /(.|\n|\r)+/ "*)"     
                        |  "{" /(.|\n|\r)+/ "}"      
                        |  "//" /(.)+/ NEWLINE

%ignore COMMENT 

This works in simple cases, but not with my test file https://github.com/goulu/lark/blob/master/tests/test_delphi/cclasses.pas, line 3135:

procedure tbitset.include(index: longint);
      var
        dataindex: longint;
      begin
        { don't use bitpacked array, not endian-safe }
        dataindex:=index shr 3;
        if (dataindex>=datasize) then
          grow(dataindex+16);
        fdata[dataindex]:=fdata[dataindex] or (1 shl (index and 7));
      end;

causes a parse error:

  File "c:\dev\python\lark\lark\parsers\xearley.py", line 115, in scan
    raise UnexpectedInput(stream, i, text_line, text_column, to_scan)
lark.lexer.UnexpectedInput: No token defined for: 'd' in 'datai' at line 3136 col 8

But if I convert it to a single-line comment like this:
// don't use bitpacked array, not endian-safe }
it is parsed correctly.

Any ideas?

Member

erezsh commented Feb 1, 2018

Maybe try removing the terminal definition of LCURLY.

Author

goulu commented Feb 2, 2018

Did it, but no change...

Is there a way to get a clearer error message, such as some context, or which rules/terminals are expected at the "unexpected input"?

Member

erezsh commented Feb 2, 2018

Okay, you're right! I added some more information. It now says:

lark.lexer.UnexpectedInput: No token defined for: 'd' in 'datai' at line 3136 col 8

Expecting: {'FUNCTION', 'LBRACK', 'THREADVAR', 'PROCEDURE', 'CONST', 'PACKAGE', 'BEGIN', 'LIBRARY', 'DOT', 'CONSTRUCTOR', 'CLASS', 'DESTRUCTOR', 'LABEL', 'USES', 'VAR', 'RESOURCESTRING', 'UNIT', 'PROGRAM', 'ASM', 'TYPE', 'EXPORTS'}

I'm not sure why this happens. It might be a bug in the parser? See if you can get it to happen for a smaller input and grammar. It will help you debug, and if it looks like a bug, it will help me debug too.

Author

goulu commented Feb 2, 2018

OK, I just modified the "Hello World!" example like this:

from lark import Lark
l = Lark('''start: WORD "," WORD "!"
            %import common.WORD
            %ignore " "
            
            COMMENT :  "{" /(.|\n)+/ "}"
            %ignore COMMENT 
         ''')
print( l.parse("Hello, {comment} World!") ) 

But it crashes:

> Traceback (most recent call last):
>   File "C:\Dev\Python\lark\tests\test_hello.py", line 8, in <module>
>     ''')
>   File "c:\dev\python\lark\lark\lark.py", line 153, in __init__
>     self.grammar = load_grammar(grammar, source)
>   File "c:\dev\python\lark\lark\load_grammar.py", line 574, in load_grammar
>     raise GrammarError("Unexpected input %r at line %d column %d in %s" % (e.context, e.line, e.column, name))
> lark.common.GrammarError: Unexpected input '/(.|\n' at line 5 column 27 in <string>

If I remove the \n from COMMENT : "{" /(.|\n)+/ "}" then it works.

So I suspected the \n needed to be escaped somehow and tried \\n instead: it works! :-)

from lark import Lark
l = Lark('''start: WORD "," WORD "!"
            %import common.WORD
            %import common.NEWLINE
            %ignore " "
            
            COMMENT     : "{" /(.|\\n|\\r)+/ "}"    
                        | "(*" /(.|\\n|\\r)+/ "*)"  
                        |  "//" /(.)+/ NEWLINE
            %ignore COMMENT 
         ''')
print( l.parse("Hello, {comment} World!") )
print( l.parse("Hello, {multiline \n comment} World!") )
print( l.parse("{------header-------}Hello, (* comment *) World!//footer\n") )
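
One caveat with `/(.|\n)+/`: in plain Python `re`, that pattern is greedy, so two `{...}` comments on either side of real code can be swallowed as a single match. The sketch below uses only the standard-library `re` module, not Lark, and Lark's Earley lexer may treat the greedy form differently. `COMMENT_RE`, `GREEDY_RE` and `strip_comments` are hypothetical names introduced for illustration. It assumes comments of the same style do not nest, and it ignores comment markers that appear inside string literals.

```python
import re

# Hypothetical pattern (not from the thread): one alternative per comment style.
COMMENT_RE = re.compile(
    r"""
    \{[^}]*\}        # { ... } up to the first closing brace
  | \(\*.*?\*\)      # (* ... *) up to the first closing *)
  | //[^\n]*         # // ... up to the end of the line
    """,
    re.VERBOSE | re.DOTALL,
)

# The greedy form, written as a plain regex for comparison.
GREEDY_RE = re.compile(r"\{(.|\n)+\}")


def strip_comments(source: str) -> str:
    """Remove every comment matched by COMMENT_RE from the source text."""
    return COMMENT_RE.sub("", source)


sample = "{header}Hello, (* outer\n// inner\n{ nested } *) World!//footer\n"
print(repr(strip_comments(sample)))            # 'Hello,  World!\n'

text = "{a} keep {b}"
print(repr(GREEDY_RE.sub("", text)))           # '' (everything is swallowed)
print(repr(strip_comments(text)))              # ' keep '
```

The `(* ... *)` alternative stops at the first `*)`, so a `{ ... }` or `//` comment inside it is treated as ordinary comment text, which covers the example in the original question.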


mcondarelli commented Feb 6, 2018

I'm having the same kind of problem with a basic token definition:
WHATEVER: /[^\r\n]+/
raises an exception (lark.common.GrammarError: Unexpected input '/[^\r\n' at line 8 column 14 in ) while the following works as expected:
WHATEVER: /[^\\r\\n]+/
I am unsure whether this is a bug or a feature.
Note that things like:
HEX_NUMBER: /0x[\da-f]*l?/i
seem to work OK, so I would suggest this is a problem with normal string escapes (and definitely inconsistent handling).
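
The difference between the two cases is most likely explained by Python's own string-literal rules rather than by Lark: `\r` and `\n` are recognized escapes and are replaced by real control characters before the grammar text ever reaches the parser, while `\d` is not a recognized escape, so the backslash is left in place. A small sketch using only the standard library shows this; `evaluate_literal` is a hypothetical helper, and the warning filter is there because newer Python versions warn about unrecognized escapes such as `\d`.

```python
import ast
import warnings


def evaluate_literal(source: str) -> str:
    """Evaluate string-literal source text the way the Python parser would."""
    with warnings.catch_warnings():
        # Silence the warning newer Pythons emit for unrecognized escapes.
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# Both arguments are the *source text* of a normal (non-raw) string literal.
hex_pattern = evaluate_literal(r"'/0x[\da-f]*l?/i'")
line_pattern = evaluate_literal(r"'/[^\r\n]+/'")

print(repr(hex_pattern))       # the backslash before "d" is still there
print(repr(line_pattern))      # contains real CR and LF characters
print("\\d" in hex_pattern)    # True
print("\n" in line_pattern)    # True
```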

Member

erezsh commented Feb 9, 2018

@mcondarelli If you're defining the grammar in a Python file, you should define it as a raw string:

parser = Lark( r""" grammar """, ...)
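
As a rough illustration of why the raw string matters (standard library only, no Lark call involved): in a normal triple-quoted string, `\n` becomes a real line break in the middle of the regexp, which is consistent with the `Unexpected input '/(.|\n'` error reported above. The variable names here are hypothetical.

```python
# The same grammar line written three ways.
regular = '''COMMENT: "{" /(.|\n)+/ "}"'''   # \n is turned into a newline
raw = r'''COMMENT: "{" /(.|\n)+/ "}"'''      # backslash and "n" are kept
escaped = '''COMMENT: "{" /(.|\\n)+/ "}"'''  # manual escaping, same result as raw

print(regular.splitlines())  # two pieces: the regexp is cut in half
print(raw.splitlines())      # one piece: the regexp is intact
print(raw == escaped)        # True
```

Doubling the backslashes and using a raw string give the grammar loader the same text; the raw string is simply harder to get wrong.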

Member

erezsh commented Feb 9, 2018

@goulu I'm glad you figured it out. You can also use raw strings (r""" grammar """).

Btw, how is the performance? Using Earley with such a big grammar and big input files can lead to some serious runtime! Generally I recommend using LALR for such tasks.

erezsh closed this as completed Mar 20, 2018

delzac commented Jan 23, 2019

Hi, just want to give feedback that having to define the grammar as a raw string in a Python file was a gotcha for me. Perhaps it would be helpful to add this to the examples or something.

Anyhow, lark is a joy to use, thanks for maintaining it :)
