
parsing time seems to increase dramatically with depth of parse tree #38

Closed
till314 opened this issue Nov 11, 2017 · 17 comments

till314 commented Nov 11, 2017

I am trying to parse Fortran 95 programs with lark. The grammar is medium-sized but mostly modeled after the Fortran 95 specification (using the same names for non-terminal symbols). I use a separate lexer to deal with all the weird stuff like Hollerith constants. The programs I want to parse are considerably larger than the test program below.

The sample Fortran program

PROGRAM TEST
do i=1,10
if (i<5) write(*,*) i;
write(*,*) i*i;
end do
END

is converted to token list C below. To compare against slightly simpler programs, I also generated reduced lists: B omits the first write statement and A omits the second.

A
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS ENDDO EOS END EOS'
B
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'
C
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'

To give an idea of the resulting parse trees I include the complete output for C below. Parse times are

start=dt.now(); L.parse(A); print(dt.now()-start)
...
0:00:01.807732

start=dt.now(); L.parse(B); print(dt.now()-start)
...
0:00:17.732535

start=dt.now(); L.parse(C); print(dt.now()-start)
...
0:13:35.003853
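(As an aside, `time.perf_counter` gives somewhat more reliable wall-clock deltas than `datetime.now` for measurements like these. A minimal stdlib sketch, where `L.parse` above would take the place of the example callable:)

```python
import time

def timed(fn, *args):
    # Monotonic, high-resolution wall-clock timing around a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Usage with any callable, e.g. timed(L.parse, token_string)
value, elapsed = timed(sum, range(1000))
print(value, elapsed >= 0.0)
```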

EDIT START:
It seems that the size of the parse trees may be to blame.
If ambiguity='resolve' is used, then the respective sizes (len(str(...))) of the results are
A => 2728, B => 2978, C => 3528
while ambiguity='explicit' results in
A => 2582677, B => 16834357, C => 252801325

So it may be that a more space-efficient method to store ambiguous (intermediate) parsing results is needed.
EDIT END.

Parse tree for C:
Tree(executable_program, [Tree(program_unit, [Tree(main_program, [Tree(program_stmt, [Token(PROGRAM, 'PROGRAM'), Tree(program_name, [Token(NAME, 'NAME')]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])]), Tree(execution_part, [Tree(executable_construct, [Tree(block_do_construct, [Tree(do_stmt, [Tree(nonlabel_do_stmt, [Token(DO, 'DO'), Tree(loop_control, [Tree(do_variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])]), Token(EQUAL, 'EQUAL'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(COMMA, 'COMMA'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS')])])]), Tree(do_block, [Tree(block, [Tree(execution_part_construct, [Tree(if_stmt, [Token(IF, 'IF'), Token(LPAREN, 'LPAREN'), Tree(scalar_logical_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])])])])])])])]), Tree(rel_op, [Token(LT, 'LT')]), Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(RPAREN, 'RPAREN'), Tree(action_stmt, [Tree(write_stmt, 
[Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(execution_part_construct, [Tree(write_stmt, [Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(io_unit, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])]), Tree(mult_op, [Token(TIMES, 'TIMES')]), Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(end_do, [Tree(end_do_stmt, [Token(ENDDO, 'ENDDO'), Tree(eos, [Token(EOS, 'EOS')])])])])])]), Tree(end_program_stmt, [Token(END, 'END'), Tree(eos, [Token(EOS, 'EOS')])])])])])

erezsh commented Nov 11, 2017

There are definitely a few space (and run-time) issues with explicit ambiguity, that I need to work on.

In the mean time, is it really necessary? I didn't know Fortran is an ambiguous language.

till314 commented Nov 11, 2017

Fortran is ambiguous. Without semantic analysis there are many ambiguities: function calls look like array element accesses, definitions of one-line statement functions look like array assignments, keywords are not reserved (they can be used as variable names), etc.

The advantage of Earley parsers is that you do not have to waste time analyzing and perfecting the grammar but can resolve ambiguities during semantic analysis. Also, there are many incompatible extensions from older compilers, which makes managing (and fixing) several similar but subtly different grammars rather painful.

In any case, just collecting the minimal amount of information needed to construct parse forests during parsing may be an idea.

I haven't checked your code in detail, but if you have a rule of the form A : B_1 B_2 ... B_n where each of the B_i allows two parse trees, then building the parse forest immediately means building 2**n trees.
The size of the parse forests in the examples above seems to indicate that something like that is happening.
One way around this is to keep pointers from A to each of the subtrees B_i and to insert a step that resolves at least some of the ambiguities before building complete trees.
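The blow-up just described can be illustrated with a small stdlib-only sketch (hypothetical counts for illustration, not lark's actual data structures): eagerly enumerating one tree per combination of ambiguous children is exponential in the number of children, while a shared forest that keeps one pointer per alternative of each child grows only linearly.

```python
from itertools import product

def enumerated_tree_count(n_children, alts_per_child=2):
    # Eagerly materializing one tree per combination of child
    # alternatives yields alts_per_child ** n_children trees.
    return sum(1 for _ in product(range(alts_per_child), repeat=n_children))

def shared_forest_node_count(n_children, alts_per_child=2):
    # A shared forest stores the parent node once, plus one node per
    # alternative of each child: linear in n_children.
    return 1 + n_children * alts_per_child

for n in (2, 8, 16):
    print(n, enumerated_tree_count(n), shared_forest_node_count(n))
```

For n = 16 two-way-ambiguous children this is 65536 enumerated trees versus 33 shared nodes, which matches the order-of-magnitude gap between the 'resolve' and 'explicit' sizes reported above.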

erezsh commented Nov 12, 2017

I have a slightly different version of Earley that might solve your issue. I haven't tested it thoroughly, but it's worth giving it a try.

Check out the branch earley_fix2 and let me know if it improved anything.

till314 commented Nov 12, 2017

Thank you.

The simple test cases all run very fast now (0.03~0.14s).
However, for one of my larger test files (which can be parsed with another Earley parser) I get the exception below. You might be pruning too severely somewhere.

Traceback (most recent call last):
File "", line 1, in
File " ..lark_parser-0.4.1-py3.6.egg/lark/lark.py", line 188, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parser_frontends.py", line 142, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parsers/xearley.py", line 147, in parse
raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input

erezsh commented Nov 12, 2017

That's very likely. This suspicion is the main reason I haven't merged this fix into main yet. If you can provide more details, I'll appreciate it; it might help me narrow down the issue.

till314 commented Nov 12, 2017

I'll give it a try. It may take a few days.

erezsh commented Nov 12, 2017

Great, thanks.

till314 commented Nov 13, 2017

I have mixed news.

  1. The incomplete parse was caused by a mistake in my grammar conversion script: a missing space (which caused problems for very long token names).
  2. earley_fix2 version 0155d3d works OK:
  • simple programs are much faster than my parser (<0.1s vs 1s)
  • slightly more complicated programs are about the same (2s vs 2s)
  • but larger programs either take much longer (5s vs 2.5s, 16s vs 4s, 100s vs 5s) or run out of memory (16GB) after a while.
  3. I then tried to set up version ec7da3b (0155... does not exist any longer, it seems) on a 64GB desktop. I'll let it run overnight and see. EDIT: ran out of memory after 2 hours.

EDIT2:
I ran the parser on some 600 Fortran files with about 200 include files (altogether about 12k lines), using a 2-minute timeout:

  • there were about 50 timeouts
  • one file containing 7 consecutive function definitions failed, but once I split it into 2 files with 3 and 4 functions respectively, both files worked fine. Two relevant rules are given below (executable_program is the start symbol). It seems that parsing 3 or 4 program units works fine but 7 program units fail. Since memory usage stayed below 1% the whole time according to top, it is unlikely that memory was the problem.

Two grammar rules:
executable_program : comment* program_unit program_unit*
program_unit : block_data
| function_subprogram
| main_program
| module
| subroutine_subprogram
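As a side note, `program_unit program_unit*` accepts the same language as a single `+` repetition, so the first rule could be written more compactly (a stylistic sketch, assuming the EBNF dialect here supports lark's `+` operator):

```
executable_program : comment* program_unit+
```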

erezsh commented Nov 14, 2017

Simple programs are much faster than my parser

What do you mean? What are you comparing?

It seems that parsing 3 or 4 program units works fine but 7 program units fail

If you can provide an example (ideally, a minimal grammar + input) that behaves badly, I'll be more than happy to look into it.

till314 commented Nov 15, 2017

Sorry, I should have been clearer. I have a working Python Earley parser which is part of a larger in-house system (not for distribution), and that is what I use for comparison. I am trying to build a similar system using open-source tools, which can then be open-sourced. As far as the parser goes, there are two requirements: support for all context-free grammars, plus proper handling of ambiguous parses. So far lark is my main choice.

Unfortunately, the example which triggers the problem is covered by an NDA.

I'll see what I can do in terms of coming up with a simple example that triggers the problem, but it may take a while. So far all simple examples work fine.

till314 commented Nov 15, 2017

The problem seems size related.

I have a string to be parsed of length 131070 (a stream of tokens generated from Fortran code). Trying to parse it generates the 'incomplete parse' message. If I simplify the original Fortran code just a tiny bit, for example by removing a single argument from a function call or a function definition, or by simplifying a condition in an if statement, or by removing an arbitrary assignment somewhere, and regenerate the token stream (resulting in a string of length ~131060), then parsing works. I tried about 20 different simplifications and they all worked.

So it seems there is a hidden size limit somewhere. My first guess was that there is a limit for match objects in Python, and 131072 = 2**17 looks suspiciously like a 'large constant' people would use, but I checked the Python source code of the re module and did not find anything suspicious (which of course does not mean much). I suspected the re module because of the 99-group limit, which I have run into before.
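As a quick sanity check (my own stdlib sketch, not anything from lark's code), CPython's re module happily matches inputs longer than 2**17 characters, so a hard length limit in re itself seems unlikely:

```python
import re

# A token-stream-like string well past 2**17 = 131072 characters.
text = "NAME " * 40000          # 200000 characters
assert len(text) > 2 ** 17

# A simple repeated-token pattern still matches the whole string.
m = re.fullmatch(r"(?:NAME )+", text)
print(m is not None, m.end())
```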

erezsh commented Nov 15, 2017

That's a very strange error. I can't think of anything in the implementation that would care about the length of the input string (except when it comes to run-time and memory, of course).

You should try to run it in pypy if you can. They have their own regexp implementation, and it seems better than CPython's.

till314 commented Nov 15, 2017

I tried replacing re with regex, as well as using pypy, but neither changes anything.

till314 commented Nov 23, 2017

I'm closing the issue for now. I played around a bit more without gaining any new insights. The only next step I can think of is either to study the Python regex library properly or to reimplement it, and I don't really have the time/energy to do so at the moment.

till314 closed this as completed Nov 23, 2017
erezsh commented Nov 23, 2017

Well, another possible step is to provide me with an example that demonstrates this behavior, so that I can try to solve it.

till314 commented Nov 23, 2017

As I said before, the only examples I have which trigger the problem are covered by NDAs. I have not been able to construct smaller examples. Also, all the examples I tried which I could have shared (with some extra effort) run out of memory before the problem shows up.

So all I can say is that some large examples (token strings generated from Fortran code) fail to parse although all types of very minor reductions (removing an assignment somewhere, reducing the number of variables in a function call, removing a declaration) parse correctly.

Sorry for not being able to provide you with such an example.

erezsh commented Nov 25, 2017

No worries, I'll get to it one way or another. When I make meaningful improvements to the algorithm, I hope I can ask you to check again.
