👉 Note: common parsing libraries include combinators like
chainl1
which explicitly handle left-recursive grammars without needing to refactor the grammar as shown here. I recommend using such combinators where provided / feasible. Leaving this repo up for reference's sake.
Parser combinators are expressive and easy to embed and reuse within an application. However, they implement recursive descent parsing algorithms, which cannot parse left-recursive grammars. Thankfully, there exists a simple technique to eliminate left recursion in most grammars.
These concepts are detailed here and elsewhere, but typically in the academic jargon of context-free grammars and parsing theory. In contrast, this codebase aims to demonstrate the problem and fix for those familiar with Haskell fundamentals.
Imagine we have a small data structure representing a potentially recursive tree of subtraction expressions (similar to Hutton's Razor).
data Expr = Lit Int | Sub Expr Expr
exampleExpr = Sub (Lit 4) (Sub (Lit 3) (Lit 0))
The string language we may want to parse into this data structure could consist of digits, parens, subtraction symbols and so on:
str1 = "1"
str2 = "1-3"
str3 = "(9)"
str4 = "0-(3-8)-(((2))-(2-1))"
This is just a toy example, but it demonstrates the idea that our language (the set of legal strings) may include more tokens than are explicitly represented in our target (the result of parsing).
To organize our thoughts we might try to draft a grammar, describing the production rules that can generate all legal strings. Grammars are often written in BNF or EBNF syntax, but this example is hopefully understandable without prior knowledge:
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
GROUP = "(", EXPR, ")"
LIT = "0" | "1" | "2" | ... | "9"
In other words,
- "An
EXPR
ession is either aSUB
traction,GROUP
, orLIT
eral" - "A
SUB
traction is anEXPR
ession, followed by '-', followed by anEXPR
ession" - "A
GROUP
is '(', followed by anEXPR
ession, followed by ')'" - "A
LIT
eral is either '0', or '1', or... (etc.)"
Grammars consist of terminal symbols like "-" and "3", which actually appear in the language strings, and nonterminal placeholders like LIT
, which do not. To build an arbitrary legal string by hand, start from the EXPR
placeholder, and replace it with anything on the right side of its corresponding rule (SUB
, GROUP
, or LIT
). Proceed with replacing nonterminal placeholders with permitted substitutions until your string consists solely of terminal symbols. For example:
EXPR
LIT
4
Or:
EXPR
GROUP
(EXPR)
(SUB)
(EXPR-EXPR)
(LIT-EXPR)
(5-EXPR)
(5-LIT)
(5-2)
Remarkably, defining a valid grammar for a language (that is, the set of rules that can generate any legal string in the language) is almost the same as defining a working set of parsers for the language (that is, the functions which can analyze an existing string for its structure). Even though these activities (generating vs. consuming strings) are in some ways opposite, their forms are comparable.
So, the context-free production rule:
GROUP = "(", EXPR, ")"
Corresponds directly to the Haskell (via trifecta
) parser:
group :: Parser Expr
group = char '(' *> expr <* char ')'
(Or in monadic style with do
notation, if you prefer:)
group :: Parser Expr
group = do
char '('
e <- expr
char ')'
pure e
The grammar shown earlier is in fact a 100% valid grammar for the expression language we wish to parse. That is, the grammar is capable of producing any arbitrary string of the language, including examples like "0-(3-8)-(((2))-(2-1))"
.
We want to go backwards – analyze an existing string. In src/Broken.hs
, we attempt to structure our parser combinator outline according to this grammar. However, if you attempt to use that parser (not recommended!) on a simple string like "1", it will result in an infinite loop. Why? Let's review the first two lines of the grammar, and their corresponding parsers:
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
-- Grammar rule: EXPR = SUB | GROUP | LIT
expr :: Parser Expr
expr = sub <|> group <|> lit -- first try `sub`...
-- Grammar rule: SUB = EXPR, "-", EXPR
sub :: Parser Expr
sub = do
e1 <- expr -- now do `expr`. WARNING: infinite recursion!
char '-'
e2 <- expr
pure $ Sub e1 e2
The sequence of events when parsing a string like "1" via the expr
parser is as follows:
- Hm, an
expr
might be asub
, let's try that parser. - Ok, a
sub
begins with anexpr
, let's try that parser. (GOTO 1)
At this point the issue becomes quite clear! Even though this grammar is a valid one for producing arbitrary strings, it is not a useful one for parsing strings via recursive descent; it immediately enters into an infinite loop. This is because the grammar is left-recursive. Informally, a left-recursive grammar:
- features a production rule of the form
A = A ... | ...
which loops on itself immediately, or... - a set of production rules
A = B ... | ...
,B = C ... | ...
,C = A ... | ...
which loop around eventually.
Parsers are allowed to be recursive, so long as there exists the possibility for the parser to exit the loop. A parser cannot unconditionally recurse on itself – that is recursion without a base case, a classic programming error.
A naive attempt at solving the problem might just change the order of rules without modifying their structure. For example, perhaps we place the SUB
rule at the end of EXPR
?
EXPR = GROUP | LIT | SUB
GROUP = "(", EXPR, ")"
LIT = "0" | "1" | "2" | ... | "9"
SUB = EXPR, "-", EXPR
expr :: Parser Expr
expr = group <|> lit <|> sub -- first try `group`...
This is again a valid grammar, but does us no good for parsing. A string like "(1)-2"
would be parsed as the group "(1)"
yielding Lit 1
, and then stop - failing to consume the remaining "-2"
string. Our parser now terminates, but without ever attempting the recursive case! We will need a different approach.
The technique, which will work in most cases, is to identify the left-recursive path A => A ...
and split it up into two stages: a "start" and "end" step. The "start" step will be mandatory; the "end" step will be effectively optional, by allowing empty results.
Before:
EXPR = SUB | GROUP | LIT
SUB = EXPR, "-", EXPR
...
After:
EXPR = START, END
START = GROUP | LIT
END = "-", EXPR | NOTHING
...
(This "NOTHING" result is typically written in grammars using the Greek letter epsilon 𝜀
, and it corresponds to the empty string.)
Notice that the misbehaving SUB
rule disappears entirely! It has instead been split up across the START
rule (which parses a chunk of information) and the END
rule (which might parse the continuation of a subtraction, with recursive right-hand expression, or might give up).
In Haskell, we can represent this "successful parse of nothing" using the famous Maybe
datatype.
-- Grammar rule: EXPR = START, END
expr :: Parser Expr
expr = do
e1 <- start
mE2 <- end
case mE2 of
Nothing -> pure e1
Just e2 -> pure $ Sub e1 e2
-- Grammar rule: START = GROUP | LIT
start :: Parser Expr
start = group <|> lit
-- Grammar rule: END = "-", EXPR | NOTHING
end :: Parser (Maybe Expr)
end = getEnd <|> pure Nothing where
getEnd = do
char '-'
e <- expr
pure $ Just e
Because end
is recursive – the expr
it parses itself consists of a new start
and end
– you can keep parsing an indefinite chain of subtractions, exactly analogous to a cons list. And just like the famous cons list, that chain of nested parses ends when you hit an empty case (<|> pure Nothing
, when no -
symbol is encountered).
Bubbling the information back up, our expr
parser has to now react to both possibilities:
- Either no
end
was encountered (i.e., no-
symbol), meaning this is NOT a subtraction expression; or, - An
end
was built, in case this WAS a subtraction expression.
Let's trace through parsing the string "1" again:
- Hm, an
expr
begins withstart
- The
start
is either agroup
(nope) or alit
(yep!) - Continuing where we left off, the
expr
ends withend
end
either begins with "-" (nope) or it's nothing (yep!)- So we have a successful
e1
expression, andNothing
fore2
; guess we just returne1
(which isLit 1
).
What about parsing a subtraction like "1-1"?
- Hm, an
expr
begins withstart
- The
start
is either agroup
(nope) or alit
(yep!) - Continuing where we left off, the
expr
ends withend
end
either begins with "-" (yep!) or it's nothing (nope)- Since we matched "-",
end
now continues on with a newexpr
- RECURSE: The new
expr
follows the same path as "1" above - We have a successful
e1
expression, and also a successfule2
expression; time to return aSub e1 e2
.
This is meant as a Haskeller-approachable introduction to the elimination of left recursion for recursive descent parsers. The full set of techniques as explained here includes additional examples and variations. I hope you find it helpful, and please let me know if I've made any mistakes.