gvpr script to find max required lookahead for lx #116

Merged (1 commit) on Feb 22, 2019


jameysharp
Contributor

This script analyzes the output of lx -l dot to compute how much input a lexer must be able to buffer after reaching an accepting state before it can tell whether there is a longer match that it should return instead. I wrote it to investigate issue #111, but I think it's a useful analysis in its own right.
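
For readers who want the idea without reading gvpr, here is a minimal Python sketch of the kind of analysis described above. It is not the script from this PR: the dict-based DFA encoding, the function name `max_lookahead`, and the two toy automata are assumptions made for illustration. It walks from each accepting state through non-accepting states, records the longest path to another accepting state, and reports an unbounded result when that walk can loop.

```python
import math


def max_lookahead(transitions, accepting):
    """Most characters read past an accept before reaching another accept.

    transitions: {state: {symbol: next_state}} (hypothetical encoding)
    accepting:   set of accepting states
    Returns an int, or math.inf when the distance is unbounded.
    """
    # States that can still reach an accepting state. Dead states never
    # lead to a longer match, so they are ignored.
    live = set(accepting)
    changed = True
    while changed:
        changed = False
        for state, edges in transitions.items():
            if state not in live and any(t in live for t in edges.values()):
                live.add(state)
                changed = True

    memo = {}
    on_stack = set()

    def longest(state):
        # Longest run of edges from `state` to the next accepting state,
        # passing only through non-accepting states on the way.
        if state in memo:
            return memo[state]
        if state in on_stack:
            return math.inf  # cycle among non-accepting live states
        on_stack.add(state)
        best = 0
        for target in transitions.get(state, {}).values():
            if target in accepting:
                best = max(best, 1)
            elif target in live:
                best = max(best, 1 + longest(target))
        on_stack.discard(state)
        memo[state] = best
        return best

    return max((longest(s) for s in accepting), default=0)


# Toy DFA assumed to resemble '.' -> $dot and '...' -> $ellipsis.
DOTS = {0: {".": 1}, 1: {".": 2}, 2: {".": 3}}
# Toy DFA with a non-accepting loop between two accepting states.
LOOP = {0: {"a": 1}, 1: {"b": 2}, 2: {"b": 2, "c": 3}}

print(max_lookahead(DOTS, {1, 3}))  # 2
print(max_lookahead(LOOP, {1, 3}))  # inf
```

The first toy automaton gives 2, the same shape as the `$dot .. $ellipsis` case in the sample output; the second shows how a loop through non-accepting states makes the bound infinite.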

Sample output from the lx examples:

$ for f in examples/lx/*.lx; do echo "$f:"; build/bin/lx -l dot < "$f" | gvpr -f share/bin/lookahead.gvpr; echo; done
examples/lx/act.lx:
after accepting, a longest pattern leading to another accept is:
- $act_assign [A-Z_a-z] $act_assign
- $act_ident [A-Z_a-z] $act_ident
- $act_label [A-Z_a-z] $act_label
- $act_ref [A-Z_a-z] $act_ref
- $act_str [^@] $act_str
- $act_token [A-Z_a-z] $act_token
- $ident [-A-Z_a-z] $ident
- discard [\t\n ] discard
requires 1 characters of lookahead

examples/lx/a.lx:
after accepting, a longest pattern leading to another accept is:
- $a bc $token
- $int [0-9] $int
- $str_char 0[0-9] $str_octal
- $str_octal [0-9] $str_octal
- discard / discard
unbounded lookahead required after these tokens:
- $string

examples/lx/c11-pp.lx:
after accepting, a longest pattern leading to another accept is:
- $chr / $block_comment_end
- $escape_hex [0-9A-Fa-f] $escape_hex
- $escape_octal [0-7] $escape_octal
- $identifier \\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] $identifier
- $newline \n $newline
- $other U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] $identifier
- $pp_number \\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] $pp_number
- $punctuator %: $punctuator
- $whitespace [\t\v\f ] $whitespace
requires 10 characters of lookahead

examples/lx/c90-pp.lx:
ambiguous mappings to $q_char_sequence, $h_char_sequence; for example on input '<a'
ambiguous mappings to $h_char_sequence, $q_char_sequence; for example on input '<a>'
after accepting, a longest pattern leading to another accept is:
- $h_char_sequence [!"#%-=?A-Z\-_a-z|}~] $h_char_sequence
- $q_char_sequence [!#%-?A-Z\-_a-z|}~] $q_char_sequence
- discard [!#%-?A-Z\-_a-z|}~] $q_char_sequence
requires 1 characters of lookahead

examples/lx/literals.lx:
after accepting, a longest pattern leading to another accept is:
- $float [Ee][+-][0-9] $float
- $int [0-9] $int
- $str_lit [^"\] $str_lit
- discard [\t\n\r ] discard
requires 3 characters of lookahead

examples/lx/longest.lx:
after accepting, a longest pattern leading to another accept is:
- $dot .. $ellipsis
requires 2 characters of lookahead

examples/lx/scheme.lx:
after accepting, a longest pattern leading to another accept is:
- $ident /./ $ident
- discard [\t\n ] discard
requires 1 characters of lookahead

examples/lx/trie.lx:
after accepting, a longest pattern leading to another accept is:
- $bye e $bye
requires 1 characters of lookahead
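
To connect these numbers to what a lexer does at run time, here is a small Python sketch of a longest-match scanner that reports how many characters it read past its last accepting state. The function name `scan_token`, the DFA encoding, and the toy automaton (assumed to resemble `.` and `...` as in longest.lx) are illustrative assumptions, not code from lx.

```python
def scan_token(transitions, accepting, start, text):
    """Scan one longest-match token from the front of `text`.

    Returns (lexeme, overread), where overread is the number of characters
    read beyond the end of the returned lexeme. Those characters are the
    lookahead the lexer had to buffer and must push back.
    """
    state = start
    last_accept = 0  # length of the longest match seen so far (0 = none)
    examined = 0     # characters read from the input, including lookahead
    for ch in text:
        examined += 1
        state = transitions.get(state, {}).get(ch)
        if state is None:
            break    # no transition: the character just read was lookahead
        if state in accepting:
            last_accept = examined
    return text[:last_accept], examined - last_accept


# Toy DFA: state 1 accepts '.', state 3 accepts '...'.
DOTS = {0: {".": 1}, 1: {".": 2}, 2: {".": 3}}
ACCEPTING = {1, 3}

print(scan_token(DOTS, ACCEPTING, 0, "..x"))  # ('.', 2)
print(scan_token(DOTS, ACCEPTING, 0, "..."))  # ('...', 0)
print(scan_token(DOTS, ACCEPTING, 0, ".x"))   # ('.', 1)
```

On the input `..x` the scanner accepts after the first `.`, reads two more characters hoping for an ellipsis, and then has to give both back, which is the 2 characters of lookahead reported for longest.lx.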

@katef
Owner

katef commented Feb 22, 2019

I love it! Thank you!

@katef katef merged commit 9ec9b14 into katef:master Feb 22, 2019
@jameysharp jameysharp deleted the lookahead-analysis branch February 22, 2019 03:19