
parsing time seems to increase dramatically with depth of parse tree #38

Closed
till314 opened this issue Nov 11, 2017 · 17 comments

till314 commented Nov 11, 2017

I am trying to parse Fortran 95 programs with lark. The grammar is medium-sized but mostly modeled after the Fortran 95 specification (using the same names for non-terminal symbols). I use a separate lexer to deal with all the weird stuff like Hollerith constants. The programs I want to parse are considerably larger than the test program below.

The sample Fortran program

PROGRAM TEST
do i=1,10
if (i<5) write(*,*) i;
write(*,*) i*i;
end do
END

is converted to token list C below. To compare against slightly simpler programs, I also generated reduced lists: B omits the first write statement and A omits the second.

A
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS ENDDO EOS END EOS'
B
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'
C
'PROGRAM NAME EOS EOS DO NAME EQUAL NATNUMBER COMMA NATNUMBER EOS IF LPAREN NAME LT NATNUMBER RPAREN WRITE LPAREN TIMES COMMA TIMES RPAREN NAME EOS EOS WRITE LPAREN TIMES COMMA TIMES RPAREN NAME TIMES NAME EOS EOS ENDDO EOS END EOS'

To give an idea of the resulting parse trees I include the complete output for C below. Parse times are

start=dt.now(); L.parse(A); print(dt.now()-start)
...
0:00:01.807732

start=dt.now(); L.parse(B); print(dt.now()-start)
...
0:00:17.732535

start=dt.now(); L.parse(C); print(dt.now()-start)
...
0:13:35.003853
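(As an aside, `time.perf_counter` gives somewhat more reliable wall-clock deltas than `datetime.now` for measurements like these. A minimal stdlib sketch, where `L.parse` above would take the place of the example callable:)

```python
import time

def timed(fn, *args):
    # Monotonic, high-resolution wall-clock timing around a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Usage with any callable, e.g. timed(L.parse, token_string)
value, elapsed = timed(sum, range(1000))
print(value, elapsed >= 0.0)
```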

EDIT START:
It seems that the size of the parse trees may be to blame.
If ambiguity='resolve' is used, then the respective sizes (len(str(...))) of the results are
A => 2728, B => 2978, C => 3528
while ambiguity='explicit' results in
A => 2582677, B => 16834357, C => 252801325

So it may be that a more space-efficient method to store ambiguous (intermediate) parsing results is needed.
EDIT END.

Parse tree for C:
Tree(executable_program, [Tree(program_unit, [Tree(main_program, [Tree(program_stmt, [Token(PROGRAM, 'PROGRAM'), Tree(program_name, [Token(NAME, 'NAME')]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])]), Tree(execution_part, [Tree(executable_construct, [Tree(block_do_construct, [Tree(do_stmt, [Tree(nonlabel_do_stmt, [Token(DO, 'DO'), Tree(loop_control, [Tree(do_variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])]), Token(EQUAL, 'EQUAL'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(COMMA, 'COMMA'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS')])])]), Tree(do_block, [Tree(block, [Tree(execution_part_construct, [Tree(if_stmt, [Token(IF, 'IF'), Token(LPAREN, 'LPAREN'), Tree(scalar_logical_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])])])])])])])]), Tree(rel_op, [Token(LT, 'LT')]), Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(RPAREN, 'RPAREN'), Tree(action_stmt, [Tree(write_stmt, 
[Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(execution_part_construct, [Tree(write_stmt, [Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(io_unit, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])]), Tree(mult_op, [Token(TIMES, 'TIMES')]), Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(end_do, [Tree(end_do_stmt, [Token(ENDDO, 'ENDDO'), Tree(eos, [Token(EOS, 'EOS')])])])])])]), Tree(end_program_stmt, [Token(END, 'END'), Tree(eos, [Token(EOS, 'EOS')])])])])])

erezsh commented Nov 11, 2017

There are definitely a few space (and run-time) issues with explicit ambiguity, that I need to work on.

In the mean time, is it really necessary? I didn't know Fortran is an ambiguous language.

till314 commented Nov 11, 2017

Fortran is ambiguous. Without semantic analysis there are many ambiguities: function calls look like array element accesses, definitions of one-line statement functions look like array assignments, keywords are not reserved (they can be used as variable names), etc.

The advantage of Earley parsers is that you do not have to waste time analyzing and perfecting the grammar but can resolve ambiguities during semantic analysis. Also, there are many incompatible extensions from older compilers, which makes managing (and fixing) several similar but subtly different grammars rather painful.

In any case, just collecting the minimal amount of information needed to construct parse forests during parsing may be an idea.

I haven't checked your code in detail, but if you have a rule of the form A : B_1 B_2 ... B_n where each of the B_i allows two parse trees, then building the parse forest immediately means building 2**n trees.
The size of the parse forests in the examples above seems to indicate that something like that is happening.
One way around this is to keep pointers from A to each of the subtrees B_i and to insert a step that resolves at least some of the ambiguities before building complete trees.
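The blow-up just described can be illustrated with a small stdlib-only sketch (hypothetical counts for illustration, not lark's actual data structures): eagerly enumerating one tree per combination of ambiguous children is exponential in the number of children, while a shared forest that keeps one pointer per alternative of each child grows only linearly.

```python
from itertools import product

def enumerated_tree_count(n_children, alts_per_child=2):
    # Eagerly materializing one tree per combination of child
    # alternatives yields alts_per_child ** n_children trees.
    return sum(1 for _ in product(range(alts_per_child), repeat=n_children))

def shared_forest_node_count(n_children, alts_per_child=2):
    # A shared forest stores the parent node once, plus one node per
    # alternative of each child: linear in n_children.
    return 1 + n_children * alts_per_child

for n in (2, 8, 16):
    print(n, enumerated_tree_count(n), shared_forest_node_count(n))
```

For n = 16 two-way-ambiguous children this is 65536 enumerated trees versus 33 shared nodes, which matches the order-of-magnitude gap between the 'resolve' and 'explicit' sizes reported above.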

erezsh commented Nov 12, 2017

I have a slightly different version of Earley that might solve your issue. I haven't tested it thoroughly, but it's worth giving it a try.

Check out the branch earley_fix2 and let me know if it improved anything.

till314 commented Nov 12, 2017

Thank you.

The simple test cases all run very fast now (0.03~0.14s).
However, for one of my larger test files (which can be parsed with another Earley parser) I get the exception below. You might be pruning too severely somewhere.

Traceback (most recent call last):
File "", line 1, in
File " ..lark_parser-0.4.1-py3.6.egg/lark/lark.py", line 188, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parser_frontends.py", line 142, in parse
return self.parser.parse(text)
File "..lark_parser-0.4.1-py3.6.egg/lark/parsers/xearley.py", line 147, in parse
raise ParseError('Incomplete parse: Could not find a solution to input')
lark.common.ParseError: Incomplete parse: Could not find a solution to input

erezsh commented Nov 12, 2017

That's very likely. This suspicion is the main reason I haven't merged this fix into main yet. If you can provide more details, I'll appreciate it; it might help me narrow down the issue.

till314 commented Nov 12, 2017

I'll give it a try. It may take a few days.

erezsh commented Nov 12, 2017

Great, thanks.

till314 commented Nov 13, 2017

I have mixed news.

  1. The incomplete parse was caused by a mistake in my grammar conversion script: a missing space (which caused problems for very long token names).
  2. earley_fix2 version 0155d3d works OK:
  • simple programs are much faster than my parser (<0.1s vs 1s)
  • slightly more complicated programs are about the same (2s vs 2s)
  • but larger programs either take much longer (5s vs 2.5s, 16s vs 4s, 100s vs 5s) or run out of memory (16GB) after a while.
  3. I then tried to set up version ec7da3b (0155... does not exist any longer, it seems) on a 64GB desktop. I'll let it run overnight and see. EDIT: ran out of memory after 2 hours.

EDIT2:
I ran the parser on some 600 Fortran files with about 200 include files (altogether about 12k lines), using a 2-minute timeout:

  • there were about 50 timeouts
  • one file containing 7 consecutive function definitions failed, but once I split it into 2 files with 3 and 4 functions respectively, both files worked fine. Two relevant rules are given below (executable_program is the start symbol). It seems that parsing 3 or 4 program units works fine but 7 program units fail. Since memory usage stayed below 1% the whole time according to top, it is unlikely that memory was the problem.

Two grammar rules:
executable_program : comment* program_unit program_unit*
program_unit : block_data
| function_subprogram
| main_program
| module
| subroutine_subprogram
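As a side note, `program_unit program_unit*` accepts the same language as a single `+` repetition, so the first rule could be written more compactly (a stylistic sketch, assuming the EBNF dialect here supports lark's `+` operator):

```
executable_program : comment* program_unit+
```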

erezsh commented Nov 14, 2017

Simple programs are much faster than my parser

What do you mean? What are you comparing?

It seems that parsing 3 or 4 program units works fine but 7 program units fail

If you can provide an example (ideally, a minimal grammar + input) that behaves badly, I'll be more than happy to look into it.

till314 commented Nov 15, 2017

Sorry, I should have been clearer. I have a working Python Earley parser which is part of a larger in-house system (not for distribution), and that is what I use for comparison. I am trying to build a similar system using open-source tools, which can then be open-sourced. As far as the parser goes, there are two requirements: support for all context-free grammars, plus proper handling of ambiguous parses. So far lark is my main choice.

Unfortunately, the example which triggers the problem is covered by an NDA.

I'll see what I can do in terms of coming up with a simple example that triggers the problem, but it may take a while. So far all simple examples work fine.

till314 commented Nov 15, 2017

The problem seems size related.

I have a string to be parsed of length 131070 (a stream of tokens generated from Fortran code). Trying to parse it generates the 'incomplete parse' message. If I simplify the original Fortran code just a tiny bit, for example by removing a single argument from a function call or a function definition, or by simplifying a condition in an if statement, or by removing an arbitrary assignment somewhere, and regenerate the token stream (resulting in a string of length ~131060), then parsing works. I tried about 20 different simplifications and they all worked.

So it seems there is a hidden size limit somewhere. My first guess was that there is a limit for match objects in Python, and 131072 = 2**17 looks suspiciously like a 'large constant' people would use, but I checked the Python source code of the re module and did not find anything suspicious (which of course does not mean much). I suspected the re module because of the 99-group limit, which I have run into before.
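As a quick sanity check (my own stdlib sketch, not anything from lark's code), CPython's re module happily matches inputs longer than 2**17 characters, so a hard length limit in re itself seems unlikely:

```python
import re

# A token-stream-like string well past 2**17 = 131072 characters.
text = "NAME " * 40000          # 200000 characters
assert len(text) > 2 ** 17

# A simple repeated-token pattern still matches the whole string.
m = re.fullmatch(r"(?:NAME )+", text)
print(m is not None, m.end())
```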

erezsh commented Nov 15, 2017

That's a very strange error. I can't think of anything in the implementation that would care about the length of the input string (except when it comes to run-time and memory, of course).

You should try to run it in pypy if you can. They have their own regexp implementation, and it seems better than CPython's.

till314 commented Nov 15, 2017

I tried replacing re with regex, as well as using pypy, but neither changes anything.

till314 commented Nov 23, 2017

I'm closing the issue for now. I played around a bit more without gaining any new insights. The only next step I can think of is either to study the Python regex library properly or to reimplement it, and I don't really have the time/energy to do so at the moment.

till314 closed this as completed Nov 23, 2017
erezsh commented Nov 23, 2017

Well, another possible step is to provide me with an example that demonstrates this behavior, so that I can try to solve it.

till314 commented Nov 23, 2017

As I said before, the only examples I have which trigger the problem are covered by NDAs. I have not been able to construct smaller examples. Also, all the examples I tried which I could have shared (with some extra effort) run out of memory before the problem shows up.

So all I can say is that some large examples (token strings generated from Fortran code) fail to parse although all types of very minor reductions (removing an assignment somewhere, reducing the number of variables in a function call, removing a declaration) parse correctly.

Sorry for not being able to provide you with such an example.

erezsh commented Nov 25, 2017

No worries, I'll get to it one way or another. When I make meaningful improvements to the algorithm, I hope I can ask you to check again.
