Basics of Combinators

This wiki is now deprecated, please consult the new-style wiki here: https://j-mie6.github.io/parsley/guide/basics-of-combinators

Parsley is a parser combinator library. In contrast to a parser generator library, like ANTLR, this allows users to build their parsers as part of the host language: in this case Scala. This is called being an embedded Domain Specific Language, or eDSL. Practically, this is useful because it allows you to factor out repeated parts of your grammars and make them reusable, as well as use all the language features normally at your disposal to create the grammars. This page will touch on both of those ideas. Another advantage is that there is less boilerplate compared with some parser generators: you don't need to convert between the AST the generator produces and your own, you can parse straight into the desired type.

Part I: Basic Combinators and Sequencing

We'll start really basic: just reading a character or two and seeing how to combine the results using combinators. For a first look, we will just parse one of any character. If you are familiar with regex, this would match the pattern (.).

import parsley.Parsley
import parsley.character.item

val p: Parsley[Char] = item
p.parse("a") // returns Success('a')
p.parse("1") // returns Success('1')
p.parse("")  // returns Failure(..)

The Parsley type is the type of parsers. The type parameter Char here represents what type the parser will return when it has been executed using parse(input). Soon we will see an example with a different type. Parsers, when executed, return a Result[Err, A] for whatever A the parser produces: this is either a Success containing the value or a Failure containing an error message of type Err (by default this is String). This is the basic workflow when using parsers. The item parser will read any single character, no matter what (so long as there is one to read). It isn't particularly useful though, so let's match specific characters instead and parse two of them this time. The regex for this would be (ab).

import parsley.Parsley
import parsley.character.char

val ab: Parsley[(Char, Char)] = char('a') <~> char('b')

ab.parse("ab") // returns Success(('a', 'b'))
ab.parse("a") // returns Failure(..)

A few new things have appeared in this new example. The char combinator is new: given a specific character, it will parse that character only. We'll see how you can define this and item in terms of another, more general, combinator soon. Notice that the type of ab is no longer just a Parsley[Char], but a Parsley[(Char, Char)]: this is due to the <~> combinator, which has the following type (in a pseudo-syntax, for simplicity).

(_ <~> _): (p: Parsley[A], q: Parsley[B]) => Parsley[(A, B)]

This combinator (pronounced "zip") first parses p, then q, and then combines their results into a tuple. Suppose we had char('a') <~> char('b') <~> char('c'): this would have type Parsley[((Char, Char), Char)]. This is the first example of a sequencing combinator. There are two other combinators that look similar:

(_ ~> _): (p: Parsley[A], q: Parsley[B]) => Parsley[B]
(_ <~ _): (p: Parsley[A], q: Parsley[B]) => Parsley[A]

They are pronounced then and then discard respectively. Again, they both parse p and then q, but they only return the result they are "pointing" at. Notice that <~> points at both results. These are more commonly known as *> and <*, but they are otherwise identical, so use whichever resonates more strongly with you. We'll see soon how we can define them in terms of <~> to get a sense of how combinators can be built up from more primitive ones.
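To make the difference concrete, here is a small sketch of all three combinators on the same input, reusing the char combinator from above:

import parsley.Parsley
import parsley.character.char

// all three read 'a' and then 'b'; they differ only in which results are kept
(char('a') <~> char('b')).parse("ab") // returns Success(('a', 'b'))
(char('a') ~> char('b')).parse("ab")  // returns Success('b')
(char('a') <~ char('b')).parse("ab")  // returns Success('a')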

What ties char and item together

We've seen that both char and item can be used to read characters; there is, however, a more primitive combinator which can be used to implement them both. This combinator is called satisfy and has the following type:

def satisfy(predicate: Char => Boolean): Parsley[Char]

def char(c: Char): Parsley[Char] = satisfy(_ == c)
val item: Parsley[Char] = satisfy(_ => true)

The satisfy combinator takes a predicate; it will read a character when the predicate returns true on that character, failing otherwise. This makes satisfy a bit more versatile and it can be used to implement a wide range of functionality. For example, we can implement a parser that reads digits using satisfy:

import parsley.Parsley
import parsley.character.satisfy

val digit = satisfy(_.isDigit)
digit.parse("1") // returns Success('1')
digit.parse("2") // returns Success('2')
digit.parse("a") // returns Failure(..)

This is, however, already implemented by parsley.character.digit. Parsley is very rich in terms of the combinators it supports out of the box, so do hunt around before reinventing the wheel!

Changing the result type

It's all well and good being able to sequence together reading single characters, but this doesn't exactly scale well to larger, more complex, parsers. Indeed, it's likely we aren't interested in an increasingly deeply nested tuple! A good starting point for this is the humble map combinator:

(_.map(_)): (p: Parsley[A], f: A => B) => Parsley[B]

This can be used to change the result of a parser p with the function f, presumably into something more useful. Let's see a couple of examples of this in action! Firstly, let's suppose we wanted our digit combinator from before to return an Int instead of a Char...

import parsley.Parsley
import parsley.character.satisfy

val digit: Parsley[Int] = satisfy(_.isDigit).map(_.asDigit)
digit.parse("1") // returns Success(1)

Here we can see that the digit parser is no longer type Parsley[Char] but type Parsley[Int]. This is because the asDigit method on Char returns an Int. To reinforce how this works, let's see how ~> and <~ can be made out of a combination of <~> and map:

p ~> q == (p <~> q).map(_._2)
p <~ q == (p <~> q).map(_._1)

The first definition pairs p and q together, and then takes the second element of the pair with map, and the second definition does the same but instead takes the first element of the pair. Now, using this tupling approach paired with map, we can do a lot of stuff! However, there is a more general strategy to do this:

def lift2[A, B, C](f: (A, B) => C, p: Parsley[A], q: Parsley[B]): Parsley[C]

// pairs p and q's results together
p <~> q = lift2[A, B, (A, B)]((_, _), p, q)
// adds the result of p onto the list result of ps
p <::> ps = lift2[A, List[A], List[A]](_ :: _, p, ps)
// applies a function from pf onto the value from px
pf <*> px = lift2[A => B, A, B]((f, x) => f(x), pf, px)
...

The lift family of combinators is great for combining n parsers with an arity-n function. For instance, map is actually the same as lift1, and above we can see that lift2 can implement a bunch of useful combinators. A quick standalone sketch of lift2 is given below; after that, let's see how we can use <::> to implement a way of reading Strings instead of just Chars!
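This is only a sketch, assuming the Int-returning digit parser built earlier and that lift2 is imported from parsley.lift:

import parsley.Parsley
import parsley.character.satisfy
import parsley.lift.lift2

val digit: Parsley[Int] = satisfy(_.isDigit).map(_.asDigit)

// combine two digit parsers with an arity-2 function: tens then units
val twoDigits: Parsley[Int] = lift2[Int, Int, Int]((t, u) => t * 10 + u, digit, digit)

twoDigits.parse("42") // returns Success(42)
twoDigits.parse("4")  // returns Failure(..)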

Putting the pieces together: Building string

Our new challenge is going to be making an implementation of the string combinator. Obviously, this combinator already exists in the library, so we can play around with it first to see how it works:

import parsley.Parsley
import parsley.character.string

val abc = string("abc")
abc.parse("abc") // returns Success("abc")
abc.parse("abcd") // returns Success("abc")
abc.parse("ab") // returns Failure(..)
abc.parse("a bc") // returns Failure(..)
abc.parse("dabc") // returns Failure(..)

Notice how the result of the parser is a string. The string combinator reads a specific string exactly. Here are a couple more examples to help you get your head around everything we've seen so far:

import parsley.Parsley
import parsley.character.{char, string}

(string("abc") <* char('d')).parse("abcd") // returns Success("abc")
(string("abc") ~> char('d')).parse("abcd") // returns Success('d')
(string("abc") <~> char('d')).parse("abcd") // returns Success(("abc", 'd'))

Now let's start building the string combinator from scratch! Bear in mind that, unlike in Haskell, a Scala string is not a List[Char] but the Java String. This makes it a little more annoying to implement, since we'll have to convert a List[Char] into a String at the end, with map.

import parsley.Parsley

def string(str: String): Parsley[String] = {
    def helper(cs: List[Char]): Parsley[List[Char]] = ???
    helper(str.toList).map(_.mkString)
}

We've started here by defining the string function, and made the skeleton of an internal helper function that will turn a list of characters into a parser that reads that list of characters and returns them all. The main body of the function uses this, and afterwards uses map with the .mkString method on lists to convert the result back into a string. Now we need to focus on the helper. The first step is to consider how to handle the empty string. For this we need another very handy combinator called pure, which reads no input and returns whatever is given to it:

import parsley.Parsley, Parsley.pure

// def pure[A](x: A): Parsley[A]
// pure(7).parse("") returns Success(7)
def helper(cs: List[Char]): Parsley[List[Char]] = cs match {
    case Nil     => pure(Nil)
    case c :: cs => ???
}

Now the question is: how do we handle the recursive case? Well, in the base case we transformed the empty list into a parser that returns the empty list. We'll follow that same shape here and use <::>!

import parsley.Parsley, Parsley.pure
import parsley.character.char

def helper(cs: List[Char]): Parsley[List[Char]] = cs match {
    case Nil     => pure(Nil)
    case c :: cs => char(c) <::> helper(cs)
}

What happens here is that we take each character in the string, convert it to a parser that reads that specific character, and then add that onto the front of reading the rest of the characters. In full:

import parsley.Parsley, Parsley._
import parsley.character.char

def string(str: String): Parsley[String] = {
    def helper(cs: List[Char]): Parsley[List[Char]] = cs match {
        case Nil     => pure(Nil)
        case c :: cs => char(c) <::> helper(cs)
    }
    helper(str.toList).map(_.mkString)
}

// string "abc" == (char('a') <::> (char ('b') <::> (char 'c' <::> pure(Nil)))).map(_.mkString)

Hopefully, this gives some intuition about how we can start to sequence together larger and larger building blocks out of smaller ones. It's also a lesson in how Scala can be used to help you build your parsers up! Again, the string combinator is already provided to you (and optimised) so be sure to check around in parsley.character and parsley.combinator for combinators that might already fit your needs. That's about everything there is to say about sequencing and combining results, so next up is looking at choice.

Takeaways

  • Characters can be read using combinators found in parsley.character
  • To sequence two things but only use the result of one, you'll want *>/~> or <*/<~
  • The result of a parser can be changed using map
  • N parsers' results can be combined using the liftN combinators with an arity-N function
  • Larger combinators are built out of smaller ones using regular Scala functionality

Part II: Choice and Handling Failure

Most practical parsers aren't just a straight line like string or reading a bunch of characters; usually there are choices to be made along the way.

Alternatives

When parsers fail to recognise some input, most of the time, there is an alternative branch that could have been taken instead. Let's take a simple example again, say matching the regex (a|b). From now on, I'm going to use some syntactic sugar from parsley.implicits so I don't have to write char or string.

import parsley.Parsley
import parsley.implicits.character.charLift

val aOrB = 'a' <|> 'b'

aOrB.parse("a") // returns Success('a')
aOrB.parse("b") // returns Success('b')
aOrB.parse("c") // returns Failure(..)

Here, the <|> combinator (pronounced "alt" or "or") allows the parser to try an alternative branch when the first fails (and so on for a longer chain of them). To be well typed, the <|> combinator requires both branches to return the same type (or at least share a common super-type). For this specific use case, character.oneOf('a', 'b') would probably have been more appropriate.
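For completeness, here is a quick sketch of that oneOf version (aOrB2 is just a made-up name to avoid clashing with the parser above):

import parsley.character.oneOf

val aOrB2 = oneOf('a', 'b')

aOrB2.parse("b") // returns Success('b')
aOrB2.parse("c") // returns Failure(..)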

Let's carry on reinforcing the connections with what we've seen so far, and see how sequencing and branching interact:

import parsley.Parsley
import parsley.implicits.character.{charLift, stringLift}

val p = 'a' *> ("a" <|> "bc") <* 'd'

p.parse("bcd") // fails, there isn't an 'a'
p.parse("ad") // fails, why?
p.parse("aae") // fails, why?
p.parse("abcde") // what happens here, what does it return (and the type)?
p.parse("aad") // what happens here, what is the type it returns?
p.parse(" aad") // what about this

Think about what results you expect from each of these, and then try them out in a REPL to see if you're right: also check out the error messages from the failing ones! Recall that string basically reads a bunch of characters in sequence, one after the other. Let's see what happens when we put longer strings inside the branches:

import parsley.Parsley
import parsley.implicits.character.stringLift

val p = "abc" <|> "def" <|> "dead"

p.parse("abc") // returns Success("abc")
p.parse("def") // returns Success("def")
p.parse("dead") // returns Failure(..)

Ah, we have a problem! The first two alternatives parse fine, but the last one does not. The answer to this is fairly simple, but I want to illustrate how we can take steps towards diagnosing the problem ourselves using the combinators found in parsley.debug:

import parsley.Parsley
import parsley.implicits.character.stringLift
import parsley.debug._

val p = ("abc".debug("reading abc") <|>
            ("def".debug("reading def") <|> "dead".debug("reading dead")).debug("second branch")
        ).debug("first branch")

p.parse("abc") // returns Success("abc")
p.parse("def") // returns Success("def")
p.parse("dead") // returns Failure(..)

The debug combinator can be attached to any operation (by default Parsley associates <|> to the right, which is why I've bracketed them this way round). It will provide printouts when it enters the debug and when it exits, along with information about the state of the parser. Let's see the three printouts:

scala> p.parse("abc")
>first branch> (1, 1): abc•
                       ^
  >reading abc> (1, 1): abc•
                        ^
  <reading abc< (1, 4): abc• Good
                           ^
<first branch< (1, 4): abc• Good
                          ^

scala> p.parse("def")
>first branch> (1, 1): def•
                       ^
  >reading abc> (1, 1): def•
                        ^
  <reading abc< (1, 1): def• Fail
                        ^
  >second branch> (1, 1): def•
                          ^
    >reading def> (1, 1): def•
                          ^
    <reading def< (1, 4): def• Good
                             ^
  <second branch< (1, 4): def• Good
                             ^
<first branch< (1, 4): def• Good
                          ^

scala> p.parse("dead")
>first branch> (1, 1): dead•
                       ^
  >reading abc> (1, 1): dead•
                        ^
  <reading abc< (1, 1): dead• Fail
                        ^
  >second branch> (1, 1): dead•
                          ^
    >reading def> (1, 1): dead•
                          ^
    <reading def< (1, 3): dead• Fail
                            ^
  <second branch< (1, 3): dead• Fail
                            ^
<first branch< (1, 3): dead• Fail
                         ^

Crucially, in the last printout, we can see the trace of the parser as it went wrong. It started by executing the outermost branch, and tried to read "abc" and failed, but the caret is still pointing at the first character. Then the second branch is taken as the alternative to the first: this time the parser tries to read "def" and gets two characters of the way through it before failing at the 'a'. Notice, though, that the second branch immediately exited without attempting the third alternative! This is the key: when a parser fails having consumed input, the <|> combinator will not try the alternative. The reason for this is to improve the quality of error messages, as well as keeping parsers efficient. The "best" solution to this problem is to change our parser slightly to remove the common leading string of the last two alternatives, like so:

import parsley.Parsley
import parsley.implicits.character.{charLift, stringLift}

val p = "abc" <|> ("de" *> ('f' #> "def" <|> "ad" #> "dead"))

p.parse("abc") // returns Success("abc")
p.parse("def") // returns Success("def")
p.parse("dead") // returns Success("dead")

In this version of the parser, the leading "de" has been factored out, leaving an alternative of "f" <|> "ad" remaining. But the original parser returned the full string, and this wouldn't: it would return "abc", "f", or "ad". The #> (pronounced "as") combinator replaces the result of the parser on the left with the value found on the right (e.g. "123" #> 123 <|> "456" #> 456 would be a Parsley[Int]). You can think of it as a map with a constant function, or as p *> pure(x). Using #> we can replace the results of the last two parsers with the results we expected from before. This is the most efficient way of dealing with our problem, because the parser remains linear time in the worst case. But sometimes this transformation isn't so straightforward, especially in deeply nested grammars. In that case we can reach for another, easier to read, tool.
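As a quick aside before reaching for that tool, here is a minimal sketch of the #> equivalences just mentioned (n1, n2 and n3 are made-up names):

import parsley.Parsley, Parsley.pure
import parsley.character.string

// these all behave the same way: read "123" and return the Int 123
val n1: Parsley[Int] = string("123") #> 123
val n2: Parsley[Int] = string("123").map(_ => 123)
val n3: Parsley[Int] = string("123") *> pure(123)

n1.parse("123") // returns Success(123)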

Backtracking

In the last section, we saw that the <|> doesn't proceed with the second alternative if the first consumed input before failing. That is to say it doesn't backtrack. There is, however, a combinator that permits backtracking to happen, called attempt. Let's see it in action:

import parsley.Parsley, Parsley.attempt
import parsley.implicits.character.stringLift
import parsley.debug._

val p = "abc" <|> attempt("def") <|> "dead"
val q = "abc" <|> (attempt("def".debug("reading def")).debug("backtrack!") <|>
                   "dead".debug("reading dead")).debug("branch")

p.parse("dead") // returns Success("dead")
q.parse("dead")
/*
>branch> (1, 1): dead•
                 ^
  >backtrack!> (1, 1): dead•
                       ^
    >reading def> (1, 1): dead•
                          ^
    <reading def< (1, 3): dead• Fail
                            ^
  <backtrack!< (1, 1): dead• Fail
                       ^
  >reading dead> (1, 1): dead•
                         ^
  <reading dead< (1, 5): dead• Good
                             ^
<branch< (1, 5): dead• Good
                     ^
*/

Here we can see attempt in action, as well as a debug trace so you can see what's going on. When we wrap the left-hand side of a branch in attempt, any input it consumes is rolled back when it fails, which then allows the branch to accept the alternative. We can see that in the debug trace. You only need to use attempt where you know that two branches share a common leading edge; knowing when to do this just comes with practice. Adding an attempt never makes a parser wrong, but it can make the error messages worse, and excessive backtracking can also increase the time complexity of the parser significantly. If you know that, when a branch consumes input and fails, its alternatives wouldn't succeed either, then you shouldn't be using attempt. It is also useful to make a specific sub-parser behave as if it were indivisible: think reading keywords, which are all or nothing.
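As a sketch of that last point (keyword is a made-up parser, and a real keyword parser would also need to rule out inputs like "iffy", but that's beside the point here):

import parsley.Parsley, Parsley.attempt
import parsley.character.string

// "import" and "if" share a leading 'i': without the attempt, failing partway
// through "import" consumes that 'i' and the "if" branch would never be tried
val keyword = attempt(string("import")) <|> string("if")

keyword.parse("import") // returns Success("import")
keyword.parse("if")     // returns Success("if")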

Lookahead

Usually, a good ordering of your alternatives is most of what you need to write a functioning parser for just about anything. However, every now and then it's convenient to look-ahead at what is to come (in either a positive or a negative way): for instance checking if there is no remaining input with the eof combinator is an example of negative look-ahead. There are two combinators for doing this, which we'll explore now:

import parsley.Parsley, Parsley.{notFollowedBy, lookAhead}
import parsley.character.item
import parsley.combinator.eof
import parsley.implicits.character.stringLift
import parsley.debug._

// def lookAhead[A](p: Parsley[A]): Parsley[A]
// def notFollowedBy(p: Parsley[_]): Parsley[Unit]

val eof = notFollowedBy(item)
val abcOnly = "abc" <* eof

abcOnly.parse("abc") // returns Success("abc")
abcOnly.parse("abcd") // returns Failure(..)

val p = "abc".debug("abc") <* lookAhead("!!".debug("!!")).debug("lookahead")

p.parse("abc!!")
/*
>abc> (1, 1): abc!•
              ^
<abc< (1, 4): abc!• Good
                 ^
>lookahead> (1, 4): abc!•
                       ^
  >!!> (1, 4): abc!!•
                  ^
  <!!< (1, 6): abc!!• Good
                    ^
<lookahead< (1, 4): abc!!• Good
                       ^
*/

p.parse("abc!")
/*
>abc> (1, 1): abc!•
              ^
<abc< (1, 4): abc!• Good
                 ^
>lookahead> (1, 4): abc!•
                       ^
  >!!> (1, 4): abc!•
                  ^
  <!!< (1, 5): abc!• Fail
                   ^
<lookahead< (1, 5): abc!• Fail
                        ^
*/

Some key things to note here: the result of notFollowedBy is always (). This is because the parser has to fail for notFollowedBy to succeed, so it can't return the result of the look-ahead! On the other hand, lookAhead can return the result of the parser that was run. You can see from the debug traces that when it succeeds, lookAhead does not consume input, but if it fails having consumed input, that input remains consumed. However, notFollowedBy never consumes input.

Takeaways

  • Alternative branches of a grammar are encoded by <|>
  • Within a branch, you are free to do whatever you want, but you must ensure both branches' types match
  • When a branch fails having consumed input it won't take the second branch.
  • The attempt combinator can be used to enable backtracking so that consumed input is undone when passing back through (it doesn't affect any <|>s that execute inside it, however)
  • When parsers go wrong, debug is a fantastic tool to investigate it with: use it early and often!
  • Negative and positive look-ahead can be done with lookAhead and notFollowedBy

Interlude: Regex Parser Examples

Before we move on to slightly more realistic parsing problems that can't just be captured by regex, we'll consolidate what we've seen so far by writing a few parsers for some regular expressions. For all of these examples, I'll simplify it by using void, which ignores the result of a parser and replaces it with (): Unit. This turns our parsers into recognisers. I'll also be introducing a couple of new ideas so we can complete the functionality we need to properly capture regex (namely the equivalents of optional ?, zero-or-more * and one-or-more +). Let's start there:

// This is the regex *
// it will perform `p` zero or more times (greedily) and collect all its results into a list
def many[A](p: Parsley[A]): Parsley[List[A]]
def skipMany(p: Parsley[_]): Parsley[Unit] = void(many(p)) // ideally, it wouldn't build the list

// This is the regex +
// similar to many, except it requires at least 1 `p` to succeed
def some[A](p: Parsley[A]): Parsley[List[A]] = p <::> many(p)
def skipSome(p: Parsley[_]): Parsley[Unit] = p *> skipMany(p)

// This is the regex ?
// it will either parse `p` or will return `x` otherwise
def optionally[A](p: Parsley[A], x: A): Parsley[A] = p #> x <|> pure(x)
def optional(p: Parsley[_]): Parsley[Unit] = optionally(p, ())
def option[A](p: Parsley[A]): Parsley[Option[A]] = p.map(Some(_)) <|> pure(None)

// This is the regex [^ .. ]
// it will parse any character _not_ passed to it
def noneOf(cs: Char*): Parsley[Char] = satisfy(c => !cs.contains(c))

With the exception of many, which we can't define just yet, all of these handy combinators are implemented with everything we've seen so far. You can find them all, and many more, in parsley.combinator and parsley.character.

import parsley.Parsley, Parsley.attempt
import parsley.combinator.{many, some, optional, eof}
import parsley.implicits.character.{charLift, stringLift}
import parsley.implicits.combinator.voidImplicitly
import parsley.character.{noneOf, oneOf, item}

// regex .at
val r1: Parsley[Unit] = item *> "at"

// regex [hc]at
val r2: Parsley[Unit] = oneOf('h', 'c') *> "at"

// regex [^hc]at
val r3: Parsley[Unit] = noneOf('h', 'c') *> "at"

// regex [hc]at$
val r4: Parsley[Unit] = oneOf('h', 'c') *> "at" *> eof

// regex s.*
val r5: Parsley[Unit] = 's' *> many(item)

// regex [hc]?at
val r6: Parsley[Unit] = optional(oneOf('h', 'c')) *> "at"

// regex [hc]+at
val r7: Parsley[Unit] = some(oneOf('h', 'c')) *> "at"

// regex h(i|ello|ey)( world)?(\!|\.)?
val r8: Parsley[Unit] = 'h' *> ("i" <|> attempt("ello") <|> "ey") *>
                        optional(" world") *>
                        optional('!' <|> '.')

Have a play around with those in a REPL and make sure you understand how they work and what inputs they will succeed on. I've written the types explicitly here so that the voidImplicitly conversion will fire and make them all the right type.

Part III: Recursive Context-Free Parsers

Everything we've seen so far has been as powerful as a regular expression. While regular expressions are certainly useful for writing lexers, they are normally not powerful enough to parse the syntax of a programming language. It's worth noting here that, usually, parser combinators don't make a distinction between lexing and parsing: you build your lexers out of the combinators and then use them in the parser. That has some advantages, but it does mean that backtracking is more expensive than it would otherwise be in a parser with a dedicated lexing phase. The reason this is considered good style is because of the higher-order nature of parser combinators: parsers are first-class values that can be manipulated freely by the combinators, as opposed to more rigid grammar rules and terminals.

In this section, we'll explore how to extend the knowledge we've already built to start writing more complex parsers with recursive control-flow: this is sufficient to parse context-free grammars. It just takes the addition of the flatMap combinator to recover context-sensitive grammars from this, but grammars that require flatMap are rare in practice, so I won't touch on it here.

Recursion via Laziness

Writing recursive parsers is, fortunately, quite straightforward, but it does rely on Scala's lazy value feature. Let's start with the classic matching-brackets parser (which famously cannot be parsed with a regex):

import parsley.Parsley
import parsley.implicits.character.charLift
import parsley.combinator.{skipMany, eof}

lazy val matching: Parsley[Unit] = skipMany('(' *> matching <* ')')
val onlyMatching = matching <* eof

onlyMatching.parse("") // succeeds
onlyMatching.parse("(") // fails
onlyMatching.parse("()") // succeeds
onlyMatching.parse("()()()") // succeeds
onlyMatching.parse("(((())))") // succeeds
onlyMatching.parse("(((()))()()(()))") // succeeds
onlyMatching.parse("(((()))(()()(()))") // fails

There's a bit to unpack here! Firstly, the type annotation on matching isn't optional: when we write recursive parsers, at least one thing in the recursive group needs an explicit type, since Scala cannot infer the types of recursive functions (or, in this case, values). The lazy keyword here makes the matching parser only evaluated on demand. In this case that will be in the onlyMatching parser, which is only a val. The reason this is important is that it means that the reference to matching inside the parser isn't forced immediately, which would cause an infinite loop. That being said, every combinator in Parsley is defined using lazy function arguments (including a lazy this for methods), so sometimes it isn't actually necessary to get the right behaviour.

My advice is to use lazy val whenever you write a parser that references itself, even indirectly. If your parser infinitely loops before you even run it, you'll know you've missed one: but there are other, more obscure symptoms. Laziness is also important when you need to forward reference another parser. Here is an example:

lazy val q = ??? *> p
lazy val p = ???

// vs

val p = ???
val q = ??? *> p

Usually, the parsers can be re-ordered so that their definitions do not "cross over" each other, in the words of scalac. But when writing recursive parsers with multiple parts, you may need to use lazy val to get scalac to accept the ordering. This can also depend on whether they are defined locally in a function or as attributes of a class. Indeed, the above example is fine if p and q are both attributes of a class, even without lazy val!

Much more important, however, is noticing that when parsers are recursive, they absolutely must be references (i.e. vals). A recursive parser using def will, without question, loop forever. This is because Parsley needs the recursive point to be the same physical object as the original definition. When using a def, each time recursion is encountered it will create a new object and generate a truly infinite parser, instead of a cyclic one. We'll see an example of where we need to be careful about this later.
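To make that gotcha concrete, here is a sketch contrasting the two forms of the matching parser from above (matchingDef is a made-up name: don't write this one!):

import parsley.Parsley
import parsley.combinator.skipMany
import parsley.implicits.character.charLift

// fine: one object that refers back to itself, giving a cyclic (finite) parser
lazy val matching: Parsley[Unit] = skipMany('(' *> matching <* ')')

// not fine: each recursive use constructs a brand new parser, so this
// unfolds forever as soon as the parser is processed
def matchingDef: Parsley[Unit] = skipMany('(' *> matchingDef <* ')')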

Before we move on with a more fleshed out example, I want to annotate the matching parser with debug to give you a sense of how recursive parsing works out (I marked both parentheses and the matching parser itself):

Trace of onlyMatchingDebug.parse("(()(()))")

scala> onlyMatchingDebug.parse("(()(()))")
>matching> (1, 1): (()(()
                   ^
  >left> (1, 1): (()(()
                 ^
  <left< (1, 2): (()(()) Good
                  ^
  >matching> (1, 2): (()(())
                      ^
    >left> (1, 2): (()(())
                    ^
    <left< (1, 3): (()(()))• Good
                     ^
    >matching> (1, 3): (()(()))•
                         ^
      >left> (1, 3): (()(()))•
                       ^
      <left< (1, 3): (()(()))• Fail
                       ^
    <matching< (1, 3): (()(()))• Good
                         ^
    >right> (1, 3): (()(()))•
                      ^
    <right< (1, 4): (()(()))• Good
                       ^
    >left> (1, 4): (()(()))•
                      ^
    <left< (1, 5): (()(()))• Good
                       ^
    >matching> (1, 5): (()(()))•
                           ^
      >left> (1, 5): (()(()))•
                         ^
      <left< (1, 6): (()(()))• Good
                          ^
      >matching> (1, 6): (()(()))•
                              ^
        >left> (1, 6): (()(()))•
                            ^
        <left< (1, 6): (()(()))• Fail
                            ^
      <matching< (1, 6): (()(()))• Good
                              ^
      >right> (1, 6): (()(()))•
                           ^
      <right< (1, 7): ()(()))• Good
                           ^
      >left> (1, 7): ()(()))•
                          ^
      <left< (1, 7): ()(()))• Fail
                          ^
    <matching< (1, 7): ()(()))• Good
                            ^
    >right> (1, 7): ()(()))•
                         ^
    <right< (1, 8): )(()))• Good
                         ^
    >left> (1, 8): )(()))•
                        ^
    <left< (1, 8): )(()))• Fail
                        ^
  <matching< (1, 8): )(()))• Good
                          ^
  >right> (1, 8): )(()))•
                       ^
  <right< (1, 9): (()))• Good
                       ^
  >left> (1, 9): (()))•
                      ^
  <left< (1, 9): (()))• Fail
                      ^
<matching< (1, 9): (()))• Good
                        ^

Take a moment to just absorb that and be comfortable with how recursive parsing works out.

Example: Parsing Boolean Expressions

The matching bracket parser was a simple example of a recursive parser. For our second to last example on this page (phew), we are going to parse (and evaluate!) boolean expressions with right associative operators. I'm going to start by giving the EBNF for this grammar to give a sense of what we are working with (and for you to be able to compare the approaches).

<expr> ::= <term> '||' <expr> | <term>
<term> ::= <not> '&&' <term> | <not>
<not> ::= '!' <not> | <atom>
<atom> ::= 'true' | 'false' | '(' <expr> ')'

We can see from this already that it is a very recursive grammar, with almost every rule being recursive, as well as a recursive call to <expr> in <atom>. Now, it's perfectly possible to translate this grammar almost directly, but notice in <expr> and <term> that both alternatives in the grammar share a common leading prefix: as we identified earlier, this would require us to enable backtracking with attempt and will affect the time-complexity of the parse (here it would be exponential!). So, as a quick refactor, we will extract the common edge and represent the grammar as follows (where square brackets indicate an optional component):

<expr> ::= <term> ['||' <expr>]
<term> ::= <not> ['&&' <term>]
<not> ::= '!' <not> | <atom>
<atom> ::= 'true' | 'false' | '(' <expr> ')'

Now, this grammar can be parsed in linear time, even when translated directly. This is much better! However, I'll make the inefficient parser first, as it has the simpler translation (even if it's less efficient) and will give a sense of how the solution works out.

import parsley.Parsley, Parsley.attempt
import parsley.implicits.character.stringLift
import parsley.implicits.lift.Lift2
import parsley.implicits.zipped.Zipped2

val or = (x: Boolean, y: Boolean) => x || y

//      <expr>                ::=                <term>   '||' <expr>   | <term>
lazy val expr: Parsley[Boolean] = attempt(or.lift(term <* "||", expr)) <|> term
//      <term>                ::=         <not> '&&'   <term>                  | <not>
lazy val term: Parsley[Boolean] = attempt((not, "&&" *> term).zipped(_ && _)) <|> not
//      <not>                ::= '!'   <not>         | <atom>
lazy val not: Parsley[Boolean] = "!" *> not.map(!_) <|> atom
// <atom> ::= 'true'          |  'false'           |   '('   <expr>   ')'
val atom    = "true" #> true <|> "false" #> false <|> ("(" *> expr <* ")")

expr.parse("!false") // returns Success(true)
expr.parse("true&&!(false||true)") // returns Success(false)

Here I've introduced a tiny bit of sugar: by importing implicits.lift.Lift2, I've enabled the lift method on two-argument functions: essentially the same as lift2 itself: (or.lift _): (Parsley[Boolean], Parsley[Boolean]) => Parsley[Boolean]. The lift is used to actually perform our desired actions: when we read || we want to actually combine the results with ||! However, you'll notice I had to define it as a function with an explicit type signature: this is because Scala is reluctant to perform inference on the lambda when the argument types aren't known. To mitigate this, implicits.zipped.Zipped2 provides the same functionality, but where .zipped is called on a tuple of parsers: notice how this means that a raw unannotated lambda can be used, because Scala has already learnt information about what types the arguments should have! Try to use the tupled zipped notation sparingly, however: the backwards application makes it a little trickier to notice the ,s in the brackets. Use it only when the parsers given as arguments are small, and prefer lift when it works fine.
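To see the two notations side by side on a smaller example, here is a sketch (bools, orP1 and orP2 are made-up names):

import parsley.Parsley
import parsley.implicits.character.stringLift
import parsley.implicits.lift.Lift2
import parsley.implicits.zipped.Zipped2

val bools = "true" #> true <|> "false" #> false
val or = (x: Boolean, y: Boolean) => x || y

// lift: the function's argument types must be spelled out up front
val orP1: Parsley[Boolean] = or.lift(bools <* "||", bools)
// zipped: the lambda's argument types are inferred from the parsers
val orP2: Parsley[Boolean] = (bools <* "||", bools).zipped(_ || _)

orP1.parse("true||false") // returns Success(true)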

The parser itself has a close resemblance to the original grammar, just with the extra processing of the result. Notice, of course, that because expr, term and not are self-recursive, they all need explicit type signatures, and have been marked as lazy. This also allows me to use atom before it's defined in the lazy not. However, as I mentioned before, this is not ideal because of the heavy backtracking implied by the use of attempt. The solution, as I've said, is to implement the second grammar. This is, as we'll see, a little trickier:

import parsley.Parsley
import parsley.implicits.character.stringLift
import parsley.implicits.lift.Lift2
import parsley.combinator.option

val and = (y: Boolean) => (x: Boolean) => x && y
// This is possible here, because false is the "zero" of ||, but more generally we'd want the other definition
// val or = (x: Boolean, y: Option[Boolean]) => x || y.getOrElse(false)
val or = (x: Boolean, y: Option[Boolean]) => y.foldLeft(x)(_ || _)

// <expr>                     ::=        <term>       ['||'   <expr> ]
lazy val expr: Parsley[Boolean] = or.lift(term, option("||" *> expr  ))
// <term>                     ::= <not>     ('&&'   <term>          |      epsilon      )
lazy val term: Parsley[Boolean] =  not <**> ("&&" *> term.map(and) </> identity[Boolean])
//      <not>                ::= '!'   <not>         | <atom>
lazy val not: Parsley[Boolean] = "!" *> not.map(!_) <|> atom
// <atom> ::= 'true'          |  'false'           |   '('   <expr>   ')'
val atom    = "true" #> true <|> "false" #> false <|> ("(" *> expr <* ")")

expr.parse("!false") // returns Success(true)
expr.parse("true&&!(false||true)") // returns Success(false)

The new example is the more efficient, linear-time form of the parser. Here I've employed two different approaches of combining the first term/not with the possible result after a possible operator. In the expr case, I've used a form very similar to our original parser, except by using the option combinator, I can try and parse what's inside and return None if it's not there. The or function will then have to handle both the case where it was only a term and the case where it was a term with another expr. Here I've used Option.foldLeft to do this, but there are many other ways of writing the function.

In the second case, with term, I've adopted an approach using the <**> combinator, which has the following type:

(_ <**> _): (Parsley[A], Parsley[A => B]) => Parsley[B]

It is essentially <*> but with the function and the argument reversed. So, inside the brackets, I have to read a term, and if I'm successful I can apply the and function to it (notice that the order of the arguments in the and function has been deliberately reversed and it has been curried). If I'm not successful, I should return the identity function on Boolean, identity[Boolean]. The p </> x combinator is the same as saying p <|> pure(x). This means that either the identity function or the partially applied and function will be applied to our initial not result.
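Here is a tiny sketch of </> and <**> on their own, assuming the stringLift sugar from before (greeting and shout are made-up names):

import parsley.Parsley, Parsley.pure
import parsley.implicits.character.stringLift

// p </> x is p <|> pure(x): if p fails without consuming input, produce x
val greeting: Parsley[String] = "hello" </> "nothing"
greeting.parse("hello") // returns Success("hello")
greeting.parse("")      // returns Success("nothing")

// <**> applies a parsed function to a previously parsed result
val shout: Parsley[String] = "hi" <**> pure((s: String) => s.toUpperCase)
shout.parse("hi") // returns Success("HI")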

The reason I've shown both styles is so that you can decide for yourself which you prefer: Option or curried. This really isn't the best we could have done though! The page on building expression parsers will show you how you can write this parser without having to fiddle with the and and or functions at all! (spoiler: it involves a combinator that builds the expression parsers for you).

Higher-Order Example: Defining many

As a final exercise, I want to show how we can implement the many combinator: recall that it will execute its argument zero or more times and collect all the results. It's a nice exercise in how the concepts we've already seen apply to an example where the parser we are constructing has a parameter of its own: in other words, a higher-order parser. It will also highlight a gotcha when writing your own combinators, just in case you become comfortable enough to do so.

The first step will be to think about what many(p) should do: if the parser p fails without consuming input, then we shouldn't fail but instead stop and return the results we've collected so far. If no ps were parsed, then the empty list should be returned. This gives us a hint about a use of </>, where we want to handle a failure by returning a known value. If a p is parsed, we need to try reading yet more ps, so this is an indication of recursion creeping in. So, with this in mind, let's see the definition:

import parsley.Parsley

// many p = p <:> many p <|> pure []
def many[A](p: =>Parsley[A]): Parsley[List[A]] = {
    lazy val go: Parsley[List[A]] = (p <::> go) </> Nil
    go
}

The definition isn't so complex, but comparing it with the Haskell equivalent in the comments does shed light on what extra things we need to be careful of here. The first is noticing that the argument p has been marked as =>Parsley[A]. This is because, like all of Parsley's combinators, we ideally want the argument to be lazy: this is what =>A means (except, unlike Haskell, the argument isn't memoised). Then we can see the familiar lazy val with explicit type signature that we expect from recursive parsers by now. What might seem a bit strange, however, is why I created the go value in the first place. You may be tempted to write something like this instead:

import parsley.Parsley

def many[A](p: =>Parsley[A]): Parsley[List[A]] = (p <::> many(p)) </> Nil

And the answer ties back to what I mentioned earlier: there is a difference in quite how recursive each of the two is. In the first example, go physically refers back to itself, and so that is a morally finite parser. The second example creates a new parser at each recursive step, so it is morally infinite. In Parsley, we must be careful to only work with finite parsers, as they are actually represented by abstract syntax trees. So the solution here is to create a value that can reference the parameter p, without needing to pass it around itself. You might wonder if it's possible to make parsers that, say, have a value they pass around. The answer is yes, but it's quite uncommon to need to do that. For those circumstances, the functionality in parsley.registers is useful, but it's certainly out of scope for this page!

Takeaways

  • Recursive parsers are where the real work happens
  • Using lazy val with any parser that is recursive is the safest way of writing them
  • Recursive parsers need explicit types, as Scala can't infer them
  • Parameterised recursion must be avoided: if the argument doesn't change then hoist it out!
  • With expression grammars in particular, we should be mindful about the adverse effects of backtracking

What Next?

The next logical step once you've digested this page, is to go and have a play around yourself! When you feel ready, you should take a look at the Building Expression Parsers page to start seeing how recursive parsers can go wrong, and what the typical strategies are of addressing this.