parsing time seems to increase dramatically with depth of parse tree #38
There are definitely a few space (and run-time) issues with explicit ambiguity that I need to work on. In the meantime, is it really necessary? I didn't know Fortran is an ambiguous language.
Fortran is ambiguous. Without semantic analysis there are many ambiguities: function calls look like array element accesses, definitions of one-line internal functions look like array assignments, keywords are not protected (they can be used as variable names), etc. The advantage of Earley parsers is that you do not have to waste time analyzing and perfecting the grammar but can resolve ambiguities during semantic analysis. Also, there are many incompatible extensions of older compilers, which makes managing (and fixing) several similar but subtly different grammars rather painful.

In any case, collecting only the minimal amount of information needed to construct parse forests during parsing may be an idea. I haven't checked your code in detail, but if you have a rule of the form A : B_1 B_2 ... B_n where each of the B_i allows two parse trees, then building the parse forest immediately means building 2**n trees.
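The blow-up described in this comment can be sketched in plain Python. The tuple-shaped trees and names below are hypothetical (not lark's internal representation): for a rule A : B_1 ... B_n where each B_i has two alternative subtrees, enumerating full trees yields 2**n objects, while a packed, SPPF-style forest stores each alternative only once.

```python
from itertools import product

def enumerate_trees(n):
    # Naive expansion: one full tree per combination of child alternatives,
    # so the number of trees doubles with every ambiguous child.
    alternatives = [("B%d_alt0" % i, "B%d_alt1" % i) for i in range(n)]
    return [("A",) + combo for combo in product(*alternatives)]

def packed_forest(n):
    # Shared ("packed") representation: each child node lists its two
    # alternatives exactly once, so storage grows linearly in n.
    return {"A": [{"B%d" % i: ["B%d_alt0" % i, "B%d_alt1" % i]}
                  for i in range(n)]}

print(len(enumerate_trees(10)))      # 2**10 = 1024 enumerated trees
print(len(packed_forest(10)["A"]))   # 10 shared child slots
```

Deferring enumeration until after parsing (or never enumerating at all, and instead walking the shared forest during semantic analysis) is the usual way to keep ambiguity tractable.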
I have a slightly different version of Earley that might solve your issue. I didn't test it thoroughly, but it's worth giving it a try. Check out the branch
Thank you. The simple test cases all work very fast now (0.03~0.14s).
Traceback (most recent call last):
That's very likely. This suspicion is the main reason I haven't merged this fix to main yet. If you can provide more details, I'd appreciate it. It might help me narrow down the issue.
I'll give it a try. It may take a few days.
Great, thanks.
I have mixed news.
EDIT2:
Two grammar rules:
What do you mean? What are you comparing?
If you can provide an example (ideally, a minimal grammar + input) that behaves badly, I'll be more than happy to look into it.
Sorry, I should have been clearer. I have a working Python Earley parser that is part of a larger in-house system (not for distribution), which is what I use for comparison. I am trying to build a similar system from open-source tools, which can then be open-sourced. As far as the parser goes, there are two requirements: support for all context-free grammars, plus proper handling of ambiguous parses. So far lark is my main choice. Unfortunately, the example that triggers the problem is covered by an NDA. I'll see what I can do about coming up with a simple example that triggers the problem, but it may take a while. So far all simple examples work fine.
The problem seems size-related. I have a string to be parsed of length 131070 (a stream of tokens generated from Fortran code). Trying to parse it generates the 'incomplete parse' message. If I simplify the original Fortran code just a tiny bit, for example by removing a single argument from a function call or a function definition, by simplifying a condition in an if statement, or by removing an arbitrary assignment somewhere, and regenerate the token stream (resulting in a string of length ~131060), then parsing works. I tried about 20 different simplifications and they all worked. So it seems there is a hidden size limit somewhere. My first guess was that there is a limit for match objects in Python; 131072 = 2**17 looks suspiciously like a 'large constant' people would use, but I checked the Python source code of the re module and did not find anything suspicious (which of course does not mean much). I suspected the re module because of the 99-group limit, which I have run into before.
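For what it's worth, the 2**17 suspicion is easy to smoke-test in isolation. The pattern below is purely illustrative (it is not the actual Fortran tokenizer's regex); it just checks that CPython's re module matches cleanly on both sides of the 131072-character boundary:

```python
import re

# Hypothetical smoke test: match strings just below, at, and above 2**17
# characters with a trivial pattern, to see whether the length alone trips
# any hidden limit in the re module.
pattern = re.compile(r"[ab]+")
for n in (2**17 - 2, 2**17, 2**17 + 2):
    s = "ab" * (n // 2)
    m = pattern.fullmatch(s)
    assert m is not None and m.end() == n
print("no length limit observed around 2**17")
```

If this passes on the affected machine, the failure is more likely a property of the specific pattern + input combination (e.g. backtracking depth or group count) than of the raw string length.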
That's a very strange error. I can't think of anything in the implementation that would care about the length of the input string (except when it comes to run-time and memory, of course). You should try to run it in
I tried replacing re with regex, as well as using PyPy, but neither changes anything.
I'm closing the issue for now. I played around a bit more without gaining any new insights. The only next step I can think of is either to study the Python regex library properly or to reimplement it, and I don't really have the time/energy to do so at the moment.
Well, another possible step is to provide me with an example that demonstrates this behavior, so that I can try to solve it. |
As I said before, the only examples I have that trigger the problem are covered by NDAs. I have not been able to construct smaller examples. Also, all the examples I tried that I could have shared (with some extra effort) run out of memory before the problem shows up. So all I can say is that some large examples (token strings generated from Fortran code) fail to parse, although all kinds of very minor reductions (removing an assignment somewhere, reducing the number of variables in a function call, removing a declaration) parse correctly. Sorry for not being able to provide you with such an example.
No worries, I'll get to it one way or another. When I make meaningful improvements to the algorithm, I hope I can ask you to check again. |
I am trying to parse Fortran 95 programs with lark. The grammar is medium-sized but mostly modeled after the Fortran 95 specification (using the same names for non-terminal symbols). I use a separate lexer to deal with all the weird stuff like Hollerith constants. The programs I want to parse are considerably larger than the test program below.
The sample Fortran program
PROGRAM TEST
do i=1,10
if (i<5) write(*,*) i;
write(*,*) i*i;
end do
END
is converted to token list C below. To compare with slightly simpler programs, I also generated token list B by omitting the first write statement and token list A by omitting the second.
To give an idea of the resulting parse trees, I include the complete output for C below. Parse times are (dt is datetime.datetime, L is the constructed lark parser, and A, B, C are the token strings):
start=dt.now(); L.parse(A); print(dt.now()-start)
...
0:00:01.807732
start=dt.now(); L.parse(B); print(dt.now()-start)
...
0:00:17.732535
start=dt.now(); L.parse(C); print(dt.now()-start)
...
0:13:35.003853
EDIT START:
It seems that the size of the parse trees may be to blame.
If ambiguity='resolve' is used, then the respective sizes (len(str(...))) of the results are
A => 2728, B => 2978, C => 3528
while ambiguity='explicit' results in
A => 2582677, B => 16834357, C => 252801325
So it may be that a more space-efficient method of storing ambiguous (intermediate) parsing results is needed.
EDIT END.
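The len(str(...)) growth pattern in the edit above is consistent with enumerated ambiguity. As a rough illustration (tuple-shaped stand-in trees, not lark's actual Tree class): dumping every enumerated tree repeats shared substructure in the output, so the text grows roughly like 2**n in the number of independent binary ambiguities, while a shared representation grows linearly.

```python
from itertools import product

def dump_explicit(n):
    # One textual tree per combination of n binary ambiguities:
    # the dump contains ~2**n trees, each of length ~n.
    return str([("A",) + bits for bits in product((0, 1), repeat=n)])

def dump_shared(n):
    # Each ambiguity serialized exactly once: output grows linearly in n.
    return str([("B%d" % i, (0, 1)) for i in range(n)])

for n in (4, 8, 12):
    print(n, len(dump_explicit(n)), len(dump_shared(n)))
```

This matches the observation that the 'explicit' dumps for A, B, and C are five to six orders of magnitude larger than the 'resolve' dumps, even though the resolved trees differ only modestly in size.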
Parse tree for C:
Tree(executable_program, [Tree(program_unit, [Tree(main_program, [Tree(program_stmt, [Token(PROGRAM, 'PROGRAM'), Tree(program_name, [Token(NAME, 'NAME')]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])]), Tree(execution_part, [Tree(executable_construct, [Tree(block_do_construct, [Tree(do_stmt, [Tree(nonlabel_do_stmt, [Token(DO, 'DO'), Tree(loop_control, [Tree(do_variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])]), Token(EQUAL, 'EQUAL'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(COMMA, 'COMMA'), Tree(scalar_numeric_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS')])])]), Tree(do_block, [Tree(block, [Tree(execution_part_construct, [Tree(if_stmt, [Token(IF, 'IF'), Token(LPAREN, 'LPAREN'), Tree(scalar_logical_expr, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Tree(array_section, [Tree(data_ref, [Tree(part_ref, [Tree(part_name, [Token(NAME, 'NAME')])])])])])])])])])])]), Tree(rel_op, [Token(LT, 'LT')]), Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NATNUMBER, 'NATNUMBER')])])])])])])])])])])])])])]), Token(RPAREN, 'RPAREN'), Tree(action_stmt, [Tree(write_stmt, 
[Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(variable, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(execution_part_construct, [Tree(write_stmt, [Token(WRITE, 'WRITE'), Token(LPAREN, 'LPAREN'), Tree(io_control_spec_list, [Tree(io_control_spec, [Tree(format, [Token(TIMES, 'TIMES')])]), Token(COMMA, 'COMMA'), Tree(io_control_spec, [Tree(io_unit, [Token(TIMES, 'TIMES')])])]), Token(RPAREN, 'RPAREN'), Tree(output_item_list, [Tree(output_item, [Tree(expr, [Tree(level_5_expr, [Tree(equiv_operand, [Tree(or_operand, [Tree(and_operand, [Tree(level_4_expr, [Tree(level_3_expr, [Tree(level_2_expr, [Tree(add_operand, [Tree(add_operand, [Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])]), Tree(mult_op, [Token(TIMES, 'TIMES')]), Tree(mult_operand, [Tree(level_1_expr, [Tree(primary, [Tree(constant, [Token(NAME, 'NAME')])])])])])])])])])])])])])])]), Tree(eos, [Token(EOS, 'EOS'), Token(EOS, 'EOS')])])])])]), Tree(end_do, [Tree(end_do_stmt, [Token(ENDDO, 'ENDDO'), Tree(eos, [Token(EOS, 'EOS')])])])])])]), Tree(end_program_stmt, [Token(END, 'END'), Tree(eos, [Token(EOS, 'EOS')])])])])])