site/parsing-expressions.html

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
<title>Parsing Expressions &middot; Crafting Interpreters</title>

<!-- Tell mobile browsers we're optimized for them and they don't need to crop
     the viewport. -->
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<link rel="stylesheet" type="text/css" href="style.css" />

<!-- Oh, God, Source Code Pro is so beautiful it makes me want to cry. -->
<link href='https://fonts.googleapis.com/css?family=Source+Code+Pro:400|Source+Sans+Pro:300,400,600' rel='stylesheet' type='text/css'>

<link rel="icon" type="image/png" href="image/favicon.png" />
<script src="jquery-3.4.1.min.js"></script>
<script src="script.js"></script>

<!-- Google analytics -->
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-42804721-2', 'auto');
  ga('send', 'pageview');
</script>

</head>
<body id="top">

<!-- <div class="scrim"></div> -->
<nav class="wide">
  <a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
  <div class="contents">
<!-- If there is a part, it must be a chapter within a part. -->
<h3><a href="#top">Parsing Expressions<small>6</small></a></h3>

<ul>
    <li><a href="#ambiguity-and-the-parsing-game"><small>6.1</small> Ambiguity and the Parsing Game</a></li>
    <li><a href="#recursive-descent-parsing"><small>6.2</small> Recursive Descent Parsing</a></li>
    <li><a href="#syntax-errors"><small>6.3</small> Syntax Errors</a></li>
    <li><a href="#wiring-up-the-parser"><small>6.4</small> Wiring up the Parser</a></li>
    <li class="divider"></li>
    <li class="end-part"><a href="#challenges">Challenges</a></li>
    <li class="end-part"><a href="#design-note"><small>note</small>Logic Versus History</a></li>
</ul>


<div class="prev-next">
    <a href="representing-code.html" title="Representing Code" class="left">&larr;&nbsp;Previous</a>
    <a href="a-tree-walk-interpreter.html" title="A Tree-Walk Interpreter">&uarr;&nbsp;Up</a>
    <a href="evaluating-expressions.html" title="Evaluating Expressions" class="right">Next&nbsp;&rarr;</a>
</div>  </div>
</nav>

<nav class="narrow">
<a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
<a href="representing-code.html" title="Representing Code" class="prev">←</a>
<a href="evaluating-expressions.html" title="Evaluating Expressions" class="next">→</a>
</nav>

<div class="page">
<div class="nav-wrapper">
<nav class="floating">
  <a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
  <div class="expandable">
<!-- If there is a part, it must be a chapter within a part. -->
<h3><a href="#top">Parsing Expressions<small>6</small></a></h3>

<ul>
    <li><a href="#ambiguity-and-the-parsing-game"><small>6.1</small> Ambiguity and the Parsing Game</a></li>
    <li><a href="#recursive-descent-parsing"><small>6.2</small> Recursive Descent Parsing</a></li>
    <li><a href="#syntax-errors"><small>6.3</small> Syntax Errors</a></li>
    <li><a href="#wiring-up-the-parser"><small>6.4</small> Wiring up the Parser</a></li>
    <li class="divider"></li>
    <li class="end-part"><a href="#challenges">Challenges</a></li>
    <li class="end-part"><a href="#design-note"><small>note</small>Logic Versus History</a></li>
</ul>


<div class="prev-next">
    <a href="representing-code.html" title="Representing Code" class="left">&larr;&nbsp;Previous</a>
    <a href="a-tree-walk-interpreter.html" title="A Tree-Walk Interpreter">&uarr;&nbsp;Up</a>
    <a href="evaluating-expressions.html" title="Evaluating Expressions" class="right">Next&nbsp;&rarr;</a>
</div>  </div>
  <a id="expand-nav">≡</a>
</nav>
</div>

<article class="chapter">

  <div class="number">6</div>
  <h1>Parsing Expressions</h1>

<div class="sign-up closable">
  <h1>This book is a work in progress!</h1>
    <span class="dismiss">&times;</span>
    <p>If you see a mistake, find something unclear, or have a suggestion, please <a href="https://github.com/munificent/craftinginterpreters/issues" target="_blank">let me know</a>. To follow its progress, please join the mailing list:</p>

  <!-- Begin MailChimp Signup Form -->
  <div id="mc_embed_signup">
  <form action="//gameprogrammingpatterns.us7.list-manage.com/subscribe/post?u=0952ca43ed2536d6717766b88&amp;id=6e96334109" method="post" id="mc-embedded-subscribe-form" name="mc-embedded-subscribe-form" class="validate" target="_blank" novalidate>
    <input type="email" value="" name="EMAIL" class="email" id="mce-EMAIL" placeholder="Your email address" required>
    <!-- real people should not fill this in and expect good things - do not remove this or risk form bot signups-->
    <div style="position: absolute; left: -5000px;" aria-hidden="true"><input type="text" name="b_0952ca43ed2536d6717766b88_6e96334109" tabindex="-1" value=""></div>
    <input type="submit" value="Sign me up!" name="subscribe" id="mc-embedded-subscribe" class="button">
  </form>
  </div>
  <!--End mc_embed_signup-->
  <p class="small">(I post about once a month. Don&#8217;t worry, I won&#8217;t spam you.)</p>
</div>

<blockquote>
<p>Grammar, which knows how to control even kings.
<cite>Molière</cite></p>
</blockquote>
<p><span name="parse">This</span> chapter marks the first major milestone of the
book. Many of us have cobbled together a mishmash of regular expressions and
substring operations to extract some sense out of a pile of text. The code was
probably riddled with bugs and a beast to maintain. Writing a <em>real</em> parser<span class="em">&mdash;</span>one with decent error-handling, a coherent internal structure, and the ability
to robustly chew through a sophisticated syntax<span class="em">&mdash;</span>is considered a rare,
impressive skill. In this chapter, you will <span name="attain">attain</span>
it.</p>
<aside name="parse">
<p>&ldquo;Parse&rdquo; comes to English from the Old French &ldquo;pars&rdquo; for &ldquo;part of speech&rdquo;. It
means to take a text and map each word to the grammar of the language. We use it
here in the same sense, except that our language is a little more modern than
Old French.</p>
</aside>
<aside name="attain">
<p>Like many rites of passage, you&rsquo;ll probably find it looks a little smaller, a
little less daunting when it&rsquo;s behind you than when it loomed ahead.</p>
</aside>
<p>It&rsquo;s easier than you think, partially because we front-loaded a lot of the hard
work in the <a href="representing-code.html">last chapter</a>. You already know your way around a formal grammar.
You&rsquo;re familiar with syntax trees, and we have some Java classes to represent
them. The only remaining piece is parsing<span class="em">&mdash;</span>transmogrifying a sequence of
tokens into one of those syntax trees.</p>
<p>Some CS textbooks make a big deal out of parsers. In the 60s, computer
scientists<span class="em">&mdash;</span>understandably tired of programming in assembly language<span class="em">&mdash;</span>started designing more sophisticated, <span name="human">human</span>-friendly
languages like FORTRAN and ALGOL. Alas, they weren&rsquo;t very <em>machine</em>-friendly,
for the primitive machines at the time.</p>
<aside name="human">
<p>Consider how harrowing assembly programming on those old machines must have been
for <em>FORTRAN</em> to be an improvement.</p>
</aside>
<p>They designed languages that they honestly weren&rsquo;t even sure how to write
compilers for, and then did ground-breaking work inventing parsing and compiling
techniques that could handle these new big languages on those old tiny machines.</p>
<p>Classic compiler books read like fawning hagiographies of these pioneers and
their tools. The cover of &ldquo;Compilers: Principles, Techniques, and Tools&rdquo;
literally has a dragon labeled &ldquo;complexity of compiler design&rdquo; being slain by a
knight bearing a sword and shield branded &ldquo;LALR parser generator&rdquo; and &ldquo;syntax
directed translation&rdquo;. They laid it on thick.</p>
<p>A little self-congratulation is well-deserved, but the truth is you don&rsquo;t need
to know most of that stuff to bang out a high quality parser for a modern
machine. As always, I encourage you to broaden your education and take it in
later, but this book omits the trophy case.</p>
<h2><a href="#ambiguity-and-the-parsing-game" name="ambiguity-and-the-parsing-game"><small>6&#8202;.&#8202;1</small>Ambiguity and the Parsing Game</a></h2>
<p>In the last chapter, I said you can &ldquo;play&rdquo; a context free grammar like a game in
order to <em>generate</em> strings. Now, we play that game in reverse. Given a
string<span class="em">&mdash;</span>a series of tokens<span class="em">&mdash;</span>we map those tokens to terminals in the grammar
to figure out which rules could have generated that string.</p>
<p>The &ldquo;could have&rdquo; part is interesting. It&rsquo;s entirely possible to create a grammar
that is <em>ambiguous</em>, where different choices of productions can lead to the same
string. When you&rsquo;re using the grammar to <em>generate</em> strings, that doesn&rsquo;t matter
much. Once you have the string, who cares how you got to it?</p>
<p>When parsing, ambiguity means the parser may misunderstand the user&rsquo;s code. As
we parse, we aren&rsquo;t just determining if the string is valid Lox code, we&rsquo;re
also tracking which rules match which parts of it so that we know what part of
the language each token belongs to. Here&rsquo;s the Lox expression grammar we put
together in the last chapter:</p>
<div class="codehilite"><pre>
<span class="i">expression</span> → <span class="i">literal</span>
           | <span class="i">unary</span>
           | <span class="i">binary</span>
           | <span class="i">grouping</span> ;

<span class="i">literal</span>    → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">&quot;false&quot;</span> | <span class="s">&quot;true&quot;</span> | <span class="s">&quot;nil&quot;</span> ;
<span class="i">grouping</span>   → <span class="s">&quot;(&quot;</span> <span class="i">expression</span> <span class="s">&quot;)&quot;</span> ;
<span class="i">unary</span>      → ( <span class="s">&quot;-&quot;</span> | <span class="s">&quot;!&quot;</span> ) <span class="i">expression</span> ;
<span class="i">binary</span>     → <span class="i">expression</span> <span class="i">operator</span> <span class="i">expression</span> ;
<span class="i">operator</span>   → <span class="s">&quot;==&quot;</span> | <span class="s">&quot;!=&quot;</span> | <span class="s">&quot;&lt;&quot;</span> | <span class="s">&quot;&lt;=&quot;</span> | <span class="s">&quot;&gt;&quot;</span> | <span class="s">&quot;&gt;=&quot;</span>
           | <span class="s">&quot;+&quot;</span>  | <span class="s">&quot;-&quot;</span>  | <span class="s">&quot;*&quot;</span> | <span class="s">&quot;/&quot;</span> ;
</pre></div>
<p>This is a valid string in that grammar:</p><img src="image/parsing-expressions/tokens.png" alt="6 / 3 - 1" />
<p>But there are two ways we could have generated it. One way is:</p>
<ol>
<li>Starting at <code>expression</code>, pick <code>binary</code>.</li>
<li>For the left-hand <code>expression</code>, pick <code>NUMBER</code>, and use <code>6</code>.</li>
<li>For the operator, pick <code>"/"</code>.</li>
<li>For the right-hand <code>expression</code>, pick <code>binary</code> again.</li>
<li>In that nested <code>binary</code> expression, pick <code>3 - 1</code>.</li>
</ol>
<p>Another is:</p>
<ol>
<li>Starting at <code>expression</code>, pick <code>binary</code>.</li>
<li>For the left-hand <code>expression</code>, pick <code>binary</code> again.</li>
<li>In that nested <code>binary</code> expression, pick <code>6 / 3</code>.</li>
<li>Back at the outer <code>binary</code>, for the operator, pick <code>"-"</code>.</li>
<li>For the right-hand <code>expression</code>, pick <code>NUMBER</code>, and use <code>1</code>.</li>
</ol>
<p>Those produce the same <em>strings</em>, but not the same <em>syntax trees</em>:</p><img src="image/parsing-expressions/syntax-trees.png" alt="Two valid syntax trees: (6 / 3) - 1 and 6 / (3 - 1)" />
<p>In other words, the grammar allows seeing the expression as <code>(6 / 3) - 1</code> or <code>6 / (3 - 1)</code>. That in turn affects the result of evaluating it. The way
mathematicians have solved this ambiguity since blackboards were first invented
is by defining rules for precedence and associativity.</p>
<ul>
<li>
<p><span name="nonassociative"><strong>Precedence</strong></span> determines which operator
is evaluated first in an expression containing a mixture of different
operators. Precedence rules tell us that we evaluate the <code>/</code> before the <code>-</code>
in the above example. Operators with higher precedence are evaluated
before operators with lower precedence. Equivalently, higher precedence
operators are said to &ldquo;bind tighter&rdquo;.</p>
</li>
<li>
<p><strong>Associativity</strong> determines which operator is evaluated first in a series
of the <em>same</em> operator. When an operator is <strong>left-associative</strong> (think
&ldquo;left-to-right&rdquo;), operators on the left evaluate before those on the right.
Since <code>-</code> is left-associative, this expression:</p>
<div class="codehilite"><pre>
<span class="n">5</span> - <span class="n">3</span> - <span class="n">1</span>
</pre></div>
<p>is equivalent to:</p>
<div class="codehilite"><pre>
(<span class="n">5</span> - <span class="n">3</span>) - <span class="n">1</span>
</pre></div>
<p>Assignment, on the other hand, is <strong>right-associative</strong>. This:</p>
<div class="codehilite"><pre>
<span class="i">a</span> = <span class="i">b</span> = <span class="i">c</span>
</pre></div>
<p>is equivalent to:</p>
<div class="codehilite"><pre>
<span class="i">a</span> = (<span class="i">b</span> = <span class="i">c</span>)
</pre></div>
</li>
</ul>
<aside name="nonassociative">
<p>While not common these days, some languages specify that certain pairs of
operators have <em>no</em> relative precedence. That makes it a syntax error to mix
those operators in an expression without using explicit grouping.</p>
<p>Likewise, some operators are <strong>non-associative</strong>. That means it&rsquo;s an error to
use that operator more than once in a sequence. For example, Perl&rsquo;s range
operator isn&rsquo;t associative, so <code>a .. b</code> is OK, but <code>a .. b .. c</code> is an error.</p>
</aside>
<p>Without well-defined precedence and associativity, an expression that uses
multiple operators is ambiguous<span class="em">&mdash;</span>it can be parsed into different syntax trees,
which could in turn evaluate to different results. We&rsquo;ll fix that in Lox by
applying the same precedence rules as C, going from lowest to highest:</p><table>
<thead>
<tr>
  <td>Name</td>
  <td>Operators</td>
  <td>Associates</td>
</tr>
</thead>
<tbody>
<tr>
  <td>Equality</td>
  <td><code>==</code> <code>!=</code></td>
  <td>Left</td>
</tr>
<tr>
  <td>Comparison</td>
  <td><code>&gt;</code> <code>&gt;=</code>
      <code>&lt;</code> <code>&lt;=</code></td>
  <td>Left</td>
</tr>
<tr>
  <td>Addition</td>
  <td><code>-</code> <code>+</code></td>
  <td>Left</td>
</tr>
<tr>
  <td>Multiplication</td>
  <td><code>/</code> <code>*</code></td>
  <td>Left</td>
</tr>
<tr>
  <td>Unary</td>
  <td><code>!</code> <code>-</code></td>
  <td>Right</td>
</tr>
</tbody>
</table>
<p>Right now, the grammar stuffs all expression types into a single <code>expression</code>
rule. That same rule is used as the non-terminal for subexpressions, which lets
the grammar accept any kind of expression as an operand, regardless of whether
the precedence rules allow it.</p>
<p>We fix that by <span name="massage">stratifying</span> the grammar. We define a
separate rule for each precedence level:</p>
<div class="codehilite"><pre>
<span class="i">expression</span>     → ...
<span class="i">equality</span>       → ...
<span class="i">comparison</span>     → ...
<span class="i">addition</span>       → ...
<span class="i">multiplication</span> → ...
<span class="i">unary</span>          → ...
<span class="i">primary</span>        → ...
</pre></div>
<aside name="massage">
<p>Instead of baking precedence right into the grammar rules, some parser
generators let you keep the same ambiguous-but-simple grammar and then add in a
little explicit operator precedence metadata on the side in order to
disambiguate.</p>
</aside>
<p>Each rule here only matches expressions at its precedence level or higher. For
example, <code>unary</code> matches a unary expression like <code>!negated</code> or a primary
expression like <code>1234</code>. And <code>addition</code> can match <code>1 + 2</code> but also <code>3 * 4 / 5</code>.
The final <code>primary</code> rule covers the highest-precedence expressions<span class="em">&mdash;</span>literals
and parenthesized grouping expressions.</p>
<p>We just need to fill in the productions for each of those rules. We&rsquo;ll do the
easy ones first. The top <code>expression</code> rule matches any expression at any
precedence level. Since <span name="equality"><code>equality</code></span> has the lowest
precedence, if we match that, then it covers everything:</p>
<aside name="equality">
<p>We could eliminate <code>expression</code> and simply use <code>equality</code> in the other rules
that contain expressions, but using <code>expression</code> makes those other rules read a
little better.</p>
<p>Also, in later chapters, when we expand the grammar to include assignment and
logical operators, we&rsquo;ll only need to change <code>expression</code> instead of touching
every rule that contains an expression.</p>
</aside>
<div class="codehilite"><pre>
<span class="i">expression</span>     → <span class="i">equality</span>
</pre></div>
<p>Over at the other end of the precedence table, a primary expression contains
all the literals and grouping expressions:</p>
<div class="codehilite"><pre>
<span class="i">primary</span>        → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">&quot;false&quot;</span> | <span class="s">&quot;true&quot;</span> | <span class="s">&quot;nil&quot;</span>
               | <span class="s">&quot;(&quot;</span> <span class="i">expression</span> <span class="s">&quot;)&quot;</span> ;
</pre></div>
<p>A unary expression starts with a unary operator followed by the operand. Since
unary operators can nest<span class="em">&mdash;</span><code>!!true</code> is a valid if weird expression<span class="em">&mdash;</span>the
operand can itself be a unary operator. A recursive rule handles that nicely:</p>
<div class="codehilite"><pre>
<span class="i">unary</span>          → ( <span class="s">&quot;!&quot;</span> | <span class="s">&quot;-&quot;</span> ) <span class="i">unary</span> ;
</pre></div>
<p>But this rule has a problem. It never terminates. Remember, each rule needs to
match expressions at that precedence level <em>or higher</em>, so we also need to let
this match a primary expression:</p>
<div class="codehilite"><pre>
<span class="i">unary</span>          → ( <span class="s">&quot;!&quot;</span> | <span class="s">&quot;-&quot;</span> ) <span class="i">unary</span>
               | <span class="i">primary</span> ;
</pre></div>
<p>That works.</p>
<p>The remaining rules are all binary operators. We&rsquo;ll start with the rule for
multiplication and division. Here&rsquo;s a first try:</p>
<div class="codehilite"><pre>
<span class="i">multiplication</span> → <span class="i">multiplication</span> ( <span class="s">&quot;/&quot;</span> | <span class="s">&quot;*&quot;</span> ) <span class="i">unary</span>
               | <span class="i">unary</span> ;
</pre></div>
<p>The rule recurses to match the left operand. That enables the rule to match a
series of multiplication and division expressions like <code>1 * 2 / 3</code>. Putting the
recursive production on the left side and <code>unary</code> on the right makes the rule
<span name="mult">left-associative</span> and unambiguous.</p>
<aside name="mult">
<p>In principle, it doesn&rsquo;t matter whether you treat multiplication as left- or
right-associative<span class="em">&mdash;</span>you get the same result either way. Alas, in the real world
with limited precision, roundoff and overflow mean that associativity can affect
the result of a sequence of multiplications. Consider:</p>
<div class="codehilite"><pre>
<span class="k">print</span> <span class="n">0.1</span> * (<span class="n">0.2</span> * <span class="n">0.3</span>);
<span class="k">print</span> (<span class="n">0.1</span> * <span class="n">0.2</span>) * <span class="n">0.3</span>;
</pre></div>
<p>In languages like Lox that use <a href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format">IEEE 754</a> double-precision floating-point
numbers, the first evaluates to <code>0.006</code>, while the second yields
<code>0.006000000000000001</code>. Sometimes that tiny difference matters.
<a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">This</a> is a good place to learn more.</p>
</aside>
<p>All of this is correct, but the fact that the first nonterminal in the body of
the rule is the same as the head of the rule means this production is
<strong>left-recursive</strong>. Some parsing techniques, including the one we&rsquo;re going to
use, have trouble with left recursion. (Recursion elsewhere, like we have in
<code>unary</code> and the indirect recursion for grouping in <code>primary</code> are not a problem.)</p>
<p>There are many grammars you can define that match the same language. The choice
for how to model a particular language is partially a matter of taste and
partially a pragmatic one. This rule is correct, but not optimal for how we
intend to parse it. Instead of a left recursive rule, we&rsquo;ll use a different one:</p>
<div class="codehilite"><pre>
<span class="i">multiplication</span> → <span class="i">unary</span> ( ( <span class="s">&quot;/&quot;</span> | <span class="s">&quot;*&quot;</span> ) <span class="i">unary</span> )* ;
</pre></div>
<p>We define a multiplication expression as a flat <em>sequence</em> of multiplications
and divisions. This matches the same syntax as the previous rule, but better
mirrors the code we&rsquo;ll write to parse code. We use the same structure for all of
other binary operator precedence levels giving us this complete expression
grammar:</p>
<div class="codehilite"><pre>
<span class="i">expression</span>     → <span class="i">equality</span> ;
<span class="i">equality</span>       → <span class="i">comparison</span> ( ( <span class="s">&quot;!=&quot;</span> | <span class="s">&quot;==&quot;</span> ) <span class="i">comparison</span> )* ;
<span class="i">comparison</span>     → <span class="i">addition</span> ( ( <span class="s">&quot;&gt;&quot;</span> | <span class="s">&quot;&gt;=&quot;</span> | <span class="s">&quot;&lt;&quot;</span> | <span class="s">&quot;&lt;=&quot;</span> ) <span class="i">addition</span> )* ;
<span class="i">addition</span>       → <span class="i">multiplication</span> ( ( <span class="s">&quot;-&quot;</span> | <span class="s">&quot;+&quot;</span> ) <span class="i">multiplication</span> )* ;
<span class="i">multiplication</span> → <span class="i">unary</span> ( ( <span class="s">&quot;/&quot;</span> | <span class="s">&quot;*&quot;</span> ) <span class="i">unary</span> )* ;
<span class="i">unary</span>          → ( <span class="s">&quot;!&quot;</span> | <span class="s">&quot;-&quot;</span> ) <span class="i">unary</span>
               | <span class="i">primary</span> ;
<span class="i">primary</span>        → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">&quot;false&quot;</span> | <span class="s">&quot;true&quot;</span> | <span class="s">&quot;nil&quot;</span>
               | <span class="s">&quot;(&quot;</span> <span class="i">expression</span> <span class="s">&quot;)&quot;</span> ;
</pre></div>
<p>This grammar is more complex than the one we had before, but in return we have
eliminated the previous one&rsquo;s ambiguity. It&rsquo;s just what we need to make a
parser.</p>
<h2><a href="#recursive-descent-parsing" name="recursive-descent-parsing"><small>6&#8202;.&#8202;2</small>Recursive Descent Parsing</a></h2>
<p>There is a whole pack of parsing techniques whose names mostly seem to be
combinations of &ldquo;L&rdquo; and &ldquo;R&rdquo;<span class="em">&mdash;</span><a href="https://en.wikipedia.org/wiki/LL_parser">LL(k)</a>, <a href="https://en.wikipedia.org/wiki/LR_parser">LR(1)</a>, <a href="https://en.wikipedia.org/wiki/LALR_parser">LALR</a><span class="em">&mdash;</span>along with
more exotic beasts like <a href="https://en.wikipedia.org/wiki/Parser_combinator">parser combinators</a>, <a href="https://en.wikipedia.org/wiki/Earley_parser">Earley parsers</a>,
<a href="https://en.wikipedia.org/wiki/Shunting-yard_algorithm">the shunting yard algorithm</a>, and <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">packrat parsing</a>. For our first
interpreter, one technique is more than sufficient: <strong>recursive descent</strong>.</p>
<p>Recursive descent is the simplest way to build a parser, and doesn&rsquo;t require
using complex parser generator tools like Yacc, Bison or ANTLR. All you need is
straightforward hand-written code. Don&rsquo;t be fooled by its simplicity, though.
Recursive descent parsers are fast, robust, and can support sophisticated
error-handling. In fact, GCC, V8 (the JavaScript VM in Chrome), Roslyn (the C#
compiler written in C#) and many other heavyweight production language
implementations use recursive descent. It kicks ass.</p>
<p>It is considered a <strong>top-down parser</strong> because it starts from the top or
outermost grammar rule (here <code>expression</code>) and works its way <span
name="descent">down</span> into the nested subexpressions before finally
reaching the leaves of the syntax tree. This is in contrast with bottom-up
parsers like LR that start with primary expressions and compose them into larger
and larger chunks of syntax.</p>
<aside name="descent">
<p>It&rsquo;s called &ldquo;recursive <em>descent</em>&rdquo; because it walks <em>down</em> the grammar.
Confusingly, we also use direction metaphorically when talking about &ldquo;high&rdquo; and
&ldquo;low&rdquo; precedence, but the orientation is reversed. In a top-down parser, you
reach the lowest-precedence expressions first because they may in turn contain
subexpressions of higher precedence.</p><img src="image/parsing-expressions/direction.png" alt="Top-down grammar rules in order of increasing precedence.">
<p>CS people really need to get together and straighten out their metaphors. Don&rsquo;t
even get me started on which direction the stack is supposed to grow.</p>
</aside>
<p>A recursive descent parser is a literal translation of the grammar&rsquo;s rules
straight into imperative code. Each rule becomes a function. The body of the
rule translates to code roughly like:</p><table>
<thead>
<tr>
  <td>Grammar notation</td>
  <td>Code representation</td>
</tr>
</thead>
<tbody>
  <tr><td>Terminal</td><td>Code to match and consume a token</td></tr>
  <tr><td>Nonterminal</td><td>Call to that rule&rsquo;s function</td></tr>
  <tr><td><code>|</code></td><td><code>if</code> or <code>switch</code> statement</td></tr>
  <tr><td><code>*</code> or <code>+</code></td><td><code>while</code> or <code>for</code> loop</td></tr>
  <tr><td><code>?</code></td><td><code>if</code> statement</td></tr>
</tbody>
</table>
<p>It&rsquo;s called &ldquo;<em>recursive</em> descent&rdquo; because when a grammar rule refers to itself<span class="em">&mdash;</span>directly or indirectly<span class="em">&mdash;</span>that translates to recursive method calls.</p>
<h3><a href="#the-parser-class" name="the-parser-class"><small>6&#8202;.&#8202;2&#8202;.&#8202;1</small>The parser class</a></h3>
<p>Each grammar rule becomes a method inside this new class:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
create new file</div>
<pre>
<span class="k">package</span> <span class="i">com.craftinginterpreters.lox</span>;

<span class="k">import</span> <span class="i">java.util.List</span>;

<span class="k">import static</span> <span class="i">com.craftinginterpreters.lox.TokenType.*</span>;

<span class="k">class</span> <span class="t">Parser</span> {
  <span class="k">private</span> <span class="k">final</span> <span class="t">List</span>&lt;<span class="t">Token</span>&gt; <span class="i">tokens</span>;
  <span class="k">private</span> <span class="t">int</span> <span class="i">current</span> = <span class="n">0</span>;

  <span class="t">Parser</span>(<span class="t">List</span>&lt;<span class="t">Token</span>&gt; <span class="i">tokens</span>) {
    <span class="k">this</span>.<span class="i">tokens</span> = <span class="i">tokens</span>;
  }
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, create new file</div>

<p>Like the scanner, it consumes a sequence, only now we&rsquo;re working at the level of
entire tokens. It takes in a list of tokens and uses <code>current</code> to point to the
next token eagerly waiting to be used.</p>
<p>We&rsquo;re going to run straight through the expression grammar now and translate
each rule to Java code. The first rule, <code>expression</code>, simply expands to the
<code>equality</code> rule, so that&rsquo;s straightforward:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>Parser</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Expr</span> <span class="i">expression</span>() {
    <span class="k">return</span> <span class="i">equality</span>();
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>Parser</em>()</div>

<p>Each method for parsing a grammar rule produces a syntax tree for that rule and
returns it to the caller. When the body of the rule contains a nonterminal<span class="em">&mdash;</span>a
reference to another rule<span class="em">&mdash;</span>we <span name="left">call</span> that rule&rsquo;s
method.</p>
<aside name="left">
<p>This is why left recursion is problematic for recursive descent. The function
for a left-recursive rule immediately calls itself, which calls itself again,
and so on, until the parser hits a stack overflow and dies.</p>
</aside>
<p>The rule for equality is a little more complex:</p>
<div class="codehilite"><pre>
<span class="i">equality</span> → <span class="i">comparison</span> ( ( <span class="s">&quot;!=&quot;</span> | <span class="s">&quot;==&quot;</span> ) <span class="i">comparison</span> )* ;
</pre></div>
<p>In Java, that becomes:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>expression</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Expr</span> <span class="i">equality</span>() {
    <span class="t">Expr</span> <span class="i">expr</span> = <span class="i">comparison</span>();

    <span class="k">while</span> (<span class="i">match</span>(<span class="i">BANG_EQUAL</span>, <span class="i">EQUAL_EQUAL</span>)) {
      <span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
      <span class="t">Expr</span> <span class="i">right</span> = <span class="i">comparison</span>();
      <span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
    }

    <span class="k">return</span> <span class="i">expr</span>;
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>expression</em>()</div>

<p>Let&rsquo;s step through it. The left <code>comparison</code> nonterminal in the body is
translated to the first call to <code>comparison()</code> and we store that in a local
variable.</p>
<p>Then, the <code>( ... )*</code> loop in the rule is mapped to a <code>while</code> loop. We need to
know when to exit that loop. We can see that inside the rule, we must first find
either a <code>!=</code> or <code>==</code> token. So, if we <em>don&rsquo;t</em> see one of those, we must be done
with the sequence of equality operators. We express that check using a handy
<code>match()</code> method:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>equality</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">boolean</span> <span class="i">match</span>(<span class="t">TokenType</span>... <span class="i">types</span>) {
    <span class="k">for</span> (<span class="t">TokenType</span> <span class="i">type</span> : <span class="i">types</span>) {
      <span class="k">if</span> (<span class="i">check</span>(<span class="i">type</span>)) {
        <span class="i">advance</span>();
        <span class="k">return</span> <span class="k">true</span>;
      }
    }

    <span class="k">return</span> <span class="k">false</span>;
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>equality</em>()</div>

<p>This checks to see if the current token is any of the given types. If so, it
consumes the token and returns <code>true</code>. Otherwise, it returns <code>false</code> and leaves
the token as the current one.</p>
<p>The <code>match()</code> method is defined in terms of two more fundamental operations:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>match</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">boolean</span> <span class="i">check</span>(<span class="t">TokenType</span> <span class="i">type</span>) {
    <span class="k">if</span> (<span class="i">isAtEnd</span>()) <span class="k">return</span> <span class="k">false</span>;
    <span class="k">return</span> <span class="i">peek</span>().<span class="i">type</span> == <span class="i">type</span>;
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>match</em>()</div>

<p>This returns <code>true</code> if the current token is of the given type. Unlike <code>match()</code>,
it doesn&rsquo;t consume the token, it only looks at it.</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>check</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Token</span> <span class="i">advance</span>() {
    <span class="k">if</span> (!<span class="i">isAtEnd</span>()) <span class="i">current</span>++;
    <span class="k">return</span> <span class="i">previous</span>();
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>check</em>()</div>

<p>This consumes the current token and returns it, similar to how our scanner&rsquo;s
<code>advance()</code> method did with characters.</p>
<p>These methods bottom out on the last handful of primitive operations:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>advance</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">boolean</span> <span class="i">isAtEnd</span>() {
    <span class="k">return</span> <span class="i">peek</span>().<span class="i">type</span> == <span class="i">EOF</span>;
  }

  <span class="k">private</span> <span class="t">Token</span> <span class="i">peek</span>() {
    <span class="k">return</span> <span class="i">tokens</span>.<span class="i">get</span>(<span class="i">current</span>);
  }

  <span class="k">private</span> <span class="t">Token</span> <span class="i">previous</span>() {
    <span class="k">return</span> <span class="i">tokens</span>.<span class="i">get</span>(<span class="i">current</span> - <span class="n">1</span>);
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>advance</em>()</div>

<p><code>isAtEnd()</code> checks if we&rsquo;ve run out of tokens to parse. <code>peek()</code> returns the
current token we have yet to consume and <code>previous()</code> returns the most recently
consumed token. The latter makes it easier to use <code>match()</code> and then access the
just-matched token.</p>
<p>That&rsquo;s most of the parsing infrastructure we need. Where were we? Right, so if
we are inside the <code>while</code> loop in <code>equality()</code>, then the parser knows it found a
<code>!=</code> or <code>==</code> operator and must be parsing an equality expression.</p>
<p>It grabs the token that was matched for the operator so we can track which kind
of binary expression this is. Then it calls <code>comparison()</code> again to parse the
right-hand operand. It combines the operator and the two operands into a new
<code>Expr.Binary</code> syntax tree node, and then loops around. Each time, it stores the
expression back in the same <code>expr</code> local variable. As it zips through a sequence
of equality expressions, that creates a left-associative nested tree of binary
operator nodes.</p>
<p><span name="sequence"></span></p><img src="image/parsing-expressions/sequence.png" alt="The syntax tree created by parsing 'a == b == c == d == e'" />
<aside name="sequence">
<p>Parsing <code>a == b == c == d == e</code>. Each iteration, we create a new binary
expression using the previous one as the left operand.</p>
</aside>
<p>The parser falls out of the loop once it hits a token that&rsquo;s not an equality
operator. Finally, it returns the expression. Note that if it doesn&rsquo;t encounter
a single equality operator, then it never enters the loop. In that case, the
<code>equality()</code> method effectively calls and returns <code>comparison()</code>. In that way,
this method matches an equality operator <em>or anything of higher precedence</em>.</p>
<p>Moving on to the next rule&hellip;</p>
<div class="codehilite"><pre>
<span class="i">comparison</span> → <span class="i">addition</span> ( ( <span class="s">&quot;&gt;&quot;</span> | <span class="s">&quot;&gt;=&quot;</span> | <span class="s">&quot;&lt;&quot;</span> | <span class="s">&quot;&lt;=&quot;</span> ) <span class="i">addition</span> )* ;
</pre></div>
<p>Translated to Java:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>equality</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Expr</span> <span class="i">comparison</span>() {
    <span class="t">Expr</span> <span class="i">expr</span> = <span class="i">addition</span>();

    <span class="k">while</span> (<span class="i">match</span>(<span class="i">GREATER</span>, <span class="i">GREATER_EQUAL</span>, <span class="i">LESS</span>, <span class="i">LESS_EQUAL</span>)) {
      <span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
      <span class="t">Expr</span> <span class="i">right</span> = <span class="i">addition</span>();
      <span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
    }

    <span class="k">return</span> <span class="i">expr</span>;
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>equality</em>()</div>

<p>The grammar rule is virtually identical to <code>equality</code> and so is the
corresponding code. The only <span name="handle">differences</span> are the
token types for the operators we match, and the method we call for the operands,
now <code>addition()</code> instead of <code>comparison()</code>. The remaining two binary operator
rules follow the same pattern:</p>
<aside name="handle">
<p>If you wanted to do some clever Java 8, you could create a helper method for
parsing a left-associative series of binary operators given a list of token
types and an operand method handle and unify some of this redundant code.</p>
</aside>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>comparison</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Expr</span> <span class="i">addition</span>() {
    <span class="t">Expr</span> <span class="i">expr</span> = <span class="i">multiplication</span>();

    <span class="k">while</span> (<span class="i">match</span>(<span class="i">MINUS</span>, <span class="i">PLUS</span>)) {
      <span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
      <span class="t">Expr</span> <span class="i">right</span> = <span class="i">multiplication</span>();
      <span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
    }

    <span class="k">return</span> <span class="i">expr</span>;
  }

  <span class="k">private</span> <span class="t">Expr</span> <span class="i">multiplication</span>() {
    <span class="t">Expr</span> <span class="i">expr</span> = <span class="i">unary</span>();

    <span class="k">while</span> (<span class="i">match</span>(<span class="i">SLASH</span>, <span class="i">STAR</span>)) {
      <span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
      <span class="t">Expr</span> <span class="i">right</span> = <span class="i">unary</span>();
      <span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
    }

    <span class="k">return</span> <span class="i">expr</span>;
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>comparison</em>()</div>

<p>That&rsquo;s all of the binary operators, parsed with the correct precedence and
associativity. We&rsquo;re crawling up the precedence hierarchy and now we&rsquo;ve reached
the unary operators:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">&quot;!&quot;</span> | <span class="s">&quot;-&quot;</span> ) <span class="i">unary</span>
      | <span class="i">primary</span> ;
</pre></div>
<p>The code for this is a little different:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>multiplication</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Expr</span> <span class="i">unary</span>() {
    <span class="k">if</span> (<span class="i">match</span>(<span class="i">BANG</span>, <span class="i">MINUS</span>)) {
      <span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
      <span class="t">Expr</span> <span class="i">right</span> = <span class="i">unary</span>();
      <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Unary</span>(<span class="i">operator</span>, <span class="i">right</span>);
    }

    <span class="k">return</span> <span class="i">primary</span>();
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>multiplication</em>()</div>

<p>Again, we look at the <span name="current">current<span> token to see how to
parse. If it&rsquo;s a <code>!</code> or <code>-</code>, we must have a unary expression. In that case, we
grab the token, and then recursively call <code>unary()</code> again to parse the operand.
Wrap that all up in a unary expression syntax tree and we&rsquo;re done.</p>
<aside name="current">
<p>The fact that the parser looks ahead at upcoming tokens to decide how to parse
puts recursive descent into the category of <strong>predictive parsers</strong>.</p>
</aside>
<p>Otherwise, we must have reached the highest level of precedence, primary
expressions.</p>
<div class="codehilite"><pre>
<span class="i">primary</span> → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">&quot;false&quot;</span> | <span class="s">&quot;true&quot;</span> | <span class="s">&quot;nil&quot;</span>
        | <span class="s">&quot;(&quot;</span> <span class="i">expression</span> <span class="s">&quot;)&quot;</span> ;
</pre></div>
<p>Most of the cases for the rule are single terminals, so it&rsquo;s pretty
straightforward:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>unary</em>()</div>
<pre>

  <span class="k">private</span> <span class="t">Expr</span> <span class="i">primary</span>() {
    <span class="k">if</span> (<span class="i">match</span>(<span class="i">FALSE</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">false</span>);
    <span class="k">if</span> (<span class="i">match</span>(<span class="i">TRUE</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">true</span>);
    <span class="k">if</span> (<span class="i">match</span>(<span class="i">NIL</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">null</span>);

    <span class="k">if</span> (<span class="i">match</span>(<span class="i">NUMBER</span>, <span class="i">STRING</span>)) {
      <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="i">previous</span>().<span class="i">literal</span>);
    }

    <span class="k">if</span> (<span class="i">match</span>(<span class="i">LEFT_PAREN</span>)) {
      <span class="t">Expr</span> <span class="i">expr</span> = <span class="i">expression</span>();
      <span class="i">consume</span>(<span class="i">RIGHT_PAREN</span>, <span class="s">&quot;Expect &#39;)&#39; after expression.&quot;</span>);
      <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Grouping</span>(<span class="i">expr</span>);
    }
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>unary</em>()</div>

<p>The interesting branch is the one for handling parentheses. After we match an
opening <code>(</code> and parse the expression inside it, we <em>must</em> find a <code>)</code> token. If
we don&rsquo;t, that&rsquo;s an error.</p>
<h2><a href="#syntax-errors" name="syntax-errors"><small>6&#8202;.&#8202;3</small>Syntax Errors</a></h2>
<p>A parser really has two jobs:</p>
<ol>
<li>
<p>Given a valid sequence of tokens, produce a corresponding syntax tree.</p>
</li>
<li>
<p>Given an <em>invalid</em> sequence of tokens, detect any errors and tell the
user about their mistakes.</p>
</li>
</ol>
<p>Don&rsquo;t underestimate how important the second job is! In modern IDEs and editors,
the parser is constantly reparsing code<span class="em">&mdash;</span>often while the user is still editing
it<span class="em">&mdash;</span>in order to syntax highlight and support things like auto-complete. That
means it will encounter code in incomplete, half-wrong states <em>all the time.</em></p>
<p>When the user doesn&rsquo;t realize the syntax is wrong, it is up to the parser to
help guide them back onto the right path. The way it reports errors is a large
part of your language&rsquo;s user interface. Good syntax error handling is hard. By
definition, the code isn&rsquo;t in a well-defined state, so there&rsquo;s no infallible way
to know what the user <em>meant</em> to write. The parser can&rsquo;t read your <span
name="telepathy">mind</span>.</p>
<aside name="telepathy">
<p>Not yet at least. With the way things are going in machine learning these days,
who knows what the future will bring?</p>
</aside>
<p>There are a couple of hard requirements for when the parser runs into a syntax
error:</p>
<ul>
<li>
<p><strong>It must detect and report the error.</strong> If it doesn&rsquo;t detect the <span
name="error">error</span> and passes the resulting malformed syntax tree on
to the interpreter, all manner of horrors may be summoned.</p>
<aside name="error">
<p>Philosophically speaking, if an error isn&rsquo;t detected and the interpreter
runs the code, is it <em>really</em> an error?</p>
</aside></li>
<li>
<p><strong>It must not crash or hang.</strong> Syntax errors are a fact of life and language
tools have to be robust in the face of them. Segfaulting or getting stuck in
an infinite loop isn&rsquo;t allowed. While the source may not be valid <em>code</em>,
it&rsquo;s still a valid <em>input to the parser</em> because users use the parser to
learn what syntax is allowed.</p>
</li>
</ul>
<p>Those are the table stakes if you want to get in the parser game at all, but you
really want to raise the ante beyond that. A decent parser should:</p>
<ul>
<li>
<p><strong>Be fast.</strong> Computers are thousands of times faster than they were when
parser technology was first invented. The days of needing to optimize your
parser so that it could get through an entire source file during a coffee
break are over. But programmer expectations have risen as quickly, if not
faster. They expect their editors to reparse files in milliseconds after
every keystroke.</p>
</li>
<li>
<p><strong>Report as many distinct errors as there are.</strong> Aborting after the first
error is easy to implement, but it&rsquo;s annoying for users if every time they
fix what they think is the one error in a file, a new one appears. They
want to see them all.</p>
</li>
<li>
<p><strong>Minimize <em>cascaded</em> errors.</strong> Once a single error is found, the parser no
longer really knows what&rsquo;s going on. It tries to get itself back on track
and keep going, but if it gets confused, it may report a slew of ghost
errors that don&rsquo;t indicate other real problems in the code. When the first
error is fixed, they disappear, because they merely represent the parser&rsquo;s
own confusion. These are annoying because they can scare the user into
thinking their code is in a worse state than it is.</p>
</li>
</ul>
<p>The last two points are in tension. We want to report as many separate errors as
we can, but we don&rsquo;t want to report ones that are merely side effects of an
earlier one.</p>
<p>The way a parser responds to an error and keeps going to look for later errors
is called <strong>&ldquo;error recovery&rdquo;</strong>. It was a hot research topic in the 60s. Back
then, you&rsquo;d hand a stack of punch cards to the secretary and come back the next
day to see if the compiler succeeded. With an iteration loop that slow, you
<em>really</em> wanted to find every single error in your code in one pass.</p>
<p>Today, when parsers complete before you&rsquo;ve even finished typing, it&rsquo;s less of an
issue. Simple, fast error recovery is fine.</p>
<h3><a href="#panic-mode-error-recovery" name="panic-mode-error-recovery"><small>6&#8202;.&#8202;3&#8202;.&#8202;1</small>Panic mode error recovery</a></h3>
<aside name="panic">
<p>You know you want to push it.</p><img src="image/parsing-expressions/panic.png" alt="A big shiny 'PANIC' button.">
</aside>
<p>Of all the recovery techniques devised in yesteryear, the one that best stood
the test of time is called<span class="em">&mdash;</span>somewhat alarmingly<span class="em">&mdash;</span><span name="panic">&ldquo;panic
mode&rdquo;</span>. As soon as the parser detects an error, it enters panic mode. It
knows at least one token doesn&rsquo;t make sense given its current state in the
middle of some stack of grammar productions.</p>
<p>Before it can get back to parsing, it needs to get its state and the sequence of
forthcoming tokens aligned such that the next token does match the rule being
parsed. This process is called <strong>synchronization</strong>.</p>
<p>To do that, we select some rule in the grammar that will mark the
synchronization point. The parser fixes its parsing state by jumping out of any
nested productions until it gets back to that rule. Then it synchronizes the
token stream by discarding tokens until it reaches one that can appear at that
point in the rule.</p>
<p>Any additional real syntax errors hiding in those discarded tokens aren&rsquo;t
reported, but it also means that any mistaken cascaded errors that are side
effects of the initial error aren&rsquo;t <em>falsely</em> reported either, which is a decent
trade-off.</p>
<p>The traditional place in the grammar to synchronize is between statements. We
don&rsquo;t have those yet, so we won&rsquo;t actually synchronize in this chapter, but
we&rsquo;ll get the machinery in place for later.</p>
<h3><a href="#entering-panic-mode" name="entering-panic-mode"><small>6&#8202;.&#8202;3&#8202;.&#8202;2</small>Entering panic mode</a></h3>
<p>Back before we went on this side trip around error recovery, we were writing the
code to parse a parenthesized expression. After parsing the expression, it
looks for the closing <code>)</code> by calling <code>consume()</code>. Here, finally, is that method:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>match</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">Token</span> <span class="i">consume</span>(<span class="t">TokenType</span> <span class="i">type</span>, <span class="t">String</span> <span class="i">message</span>) {
    <span class="k">if</span> (<span class="i">check</span>(<span class="i">type</span>)) <span class="k">return</span> <span class="i">advance</span>();

    <span class="k">throw</span> <span class="i">error</span>(<span class="i">peek</span>(), <span class="i">message</span>);
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>match</em>()</div>

<p>It&rsquo;s similar to <code>match()</code> in that it checks to see if the next token is of the
expected type. If so, it consumes it and everything is groovy. If some other
token is there, then we&rsquo;ve hit an error. We report it by calling this:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>previous</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">ParseError</span> <span class="i">error</span>(<span class="t">Token</span> <span class="i">token</span>, <span class="t">String</span> <span class="i">message</span>) {
    <span class="t">Lox</span>.<span class="i">error</span>(<span class="i">token</span>, <span class="i">message</span>);
    <span class="k">return</span> <span class="k">new</span> <span class="t">ParseError</span>();
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>previous</em>()</div>

<p>First, that shows the error to the user by calling:</p>
<div class="codehilite"><div class="source-file"><em>lox/Lox.java</em><br>
add after <em>report</em>()</div>
<pre>
  <span class="k">static</span> <span class="t">void</span> <span class="i">error</span>(<span class="t">Token</span> <span class="i">token</span>, <span class="t">String</span> <span class="i">message</span>) {
    <span class="k">if</span> (<span class="i">token</span>.<span class="i">type</span> == <span class="t">TokenType</span>.<span class="i">EOF</span>) {
      <span class="i">report</span>(<span class="i">token</span>.<span class="i">line</span>, <span class="s">&quot; at end&quot;</span>, <span class="i">message</span>);
    } <span class="k">else</span> {
      <span class="i">report</span>(<span class="i">token</span>.<span class="i">line</span>, <span class="s">&quot; at &#39;&quot;</span> + <span class="i">token</span>.<span class="i">lexeme</span> + <span class="s">&quot;&#39;&quot;</span>, <span class="i">message</span>);
    }
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Lox.java</em>, add after <em>report</em>()</div>

<p>This reports an error at a given token. It shows the token&rsquo;s location and the
token itself. This will come in handy later since we use tokens throughout the
interpreter to track locations in code.</p>
<p>After this is called, the user knows about the syntax error, but what does the
<em>parser</em> do next? Back in <code>error()</code>, it creates and returns a ParseError, an
instance of:</p>
<div class="codehilite"><pre class="insert-before">
class Parser {
</pre><div class="source-file"><em>lox/Parser.java</em><br>
nest inside class <em>Parser</em></div>
<pre class="insert">
  <span class="k">private</span> <span class="k">static</span> <span class="k">class</span> <span class="t">ParseError</span> <span class="k">extends</span> <span class="t">RuntimeException</span> {}

</pre><pre class="insert-after">
  private final List&lt;Token&gt; tokens;
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, nest inside class <em>Parser</em></div>

<p>This is a simple sentinel class we use to unwind the parser. The <code>error()</code>
method <em>returns</em> it instead of <em>throwing</em> because we want to let the caller
decide whether to unwind or not.</p>
<p>Some parse errors occur in places where the parser isn&rsquo;t likely to get into a
weird state and we don&rsquo;t need to <span name="production">synchronize</span>. In
those places, we simply report the error and keep on truckin&rsquo;. For example, Lox
limits the number of arguments you can pass to a function. If you pass too many,
the parser needs to report that error, but it can and should simply keep on
parsing the extra arguments instead of freaking out and going into panic mode.</p>
<aside name="production">
<p>Another way to handle common syntax errors is with <strong>error productions</strong>. You
augment the grammar with a rule that matches the erroneous syntax. The parser
safely parses it but then reports it as an error instead of producing a syntax
tree.</p>
<p>For example, some languages have a unary <code>+</code> operator, like <code>+123</code>, but Lox does
not. Instead of getting confused when the parser stumbles onto a <code>+</code> at the
beginning of an expression, we could extend the unary rule to allow it:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">&quot;!&quot;</span> | <span class="s">&quot;-&quot;</span> | <span class="s">&quot;+&quot;</span> ) <span class="i">unary</span>
        | <span class="i">primary</span> ;
</pre></div>
<p>This lets the parser consume <code>+</code> without going into panic mode or leaving the
parser in a weird state.</p>
<p>Error productions work well because you, the parser author, know <em>how</em> the code
is wrong and what the user was likely trying to do. That means you can give a
more helpful message to get the user back on track, like, &ldquo;Unary &lsquo;+&rsquo; expressions
are not supported.&rdquo; Mature parsers tend to accumulate error productions like
barnacles since they help users fix common mistakes.</p>
</aside>
<p>In our case, though, the syntax error is nasty enough that we want to panic and
synchronize. Discarding tokens is pretty easy, but how do we synchronize the
parser&rsquo;s own state?</p>
<h3><a href="#synchronizing-a-recursive-descent-parser" name="synchronizing-a-recursive-descent-parser"><small>6&#8202;.&#8202;3&#8202;.&#8202;3</small>Synchronizing a recursive descent parser</a></h3>
<p>With recursive descent, the parser&rsquo;s state<span class="em">&mdash;</span>which rules it is in the middle of
recognizing<span class="em">&mdash;</span>is not stored explicitly in fields. Instead, we use Java&rsquo;s
own call stack to track what the parser is doing. Each rule in the process of
being parsed is a callframe on the stack. In order to reset that state, we need
to clear out those callframes.</p>
<p>The natural way to do that in Java is exceptions. When we want to synchronize,
we <em>throw</em> that ParseError object. Higher up in the method for the grammar rule
we are synchronizing to, we&rsquo;ll catch it. Since we are synchronizing on statement
boundaries, we&rsquo;ll catch the exception there. After the exception is caught, the
parser is in the right state. All that&rsquo;s left is to synchronize the tokens.</p>
<p>We want to discard tokens until we&rsquo;re right at the beginning of the next
statement. That boundary is pretty easy to spot<span class="em">&mdash;</span>it&rsquo;s one of the main reasons
we picked it. <em>After</em> a semicolon, we&rsquo;re <span name="semicolon">probably</span>
finished with a statement. Most statements start with a keyword<span class="em">&mdash;</span><code>for</code>, <code>if</code>,
<code>return</code>, <code>var</code>, etc. When the <em>next</em> token is any of those, we&rsquo;re probably
about to start a statement.</p>
<aside name="semicolon">
<p>I say &ldquo;probably&rdquo; because we could hit a semicolon separating clauses in a for
loop. Our synchronization isn&rsquo;t perfect, but that&rsquo;s OK. We&rsquo;ve already reported
the first error precisely, so everything after that is kind of &ldquo;best effort&rdquo;.</p>
</aside>
<p>This method encapsulates that logic:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>error</em>()</div>
<pre>
  <span class="k">private</span> <span class="t">void</span> <span class="i">synchronize</span>() {
    <span class="i">advance</span>();

    <span class="k">while</span> (!<span class="i">isAtEnd</span>()) {
      <span class="k">if</span> (<span class="i">previous</span>().<span class="i">type</span> == <span class="i">SEMICOLON</span>) <span class="k">return</span>;

      <span class="k">switch</span> (<span class="i">peek</span>().<span class="i">type</span>) {
        <span class="k">case</span> <span class="i">CLASS</span>:
        <span class="k">case</span> <span class="i">FUN</span>:
        <span class="k">case</span> <span class="i">VAR</span>:
        <span class="k">case</span> <span class="i">FOR</span>:
        <span class="k">case</span> <span class="i">IF</span>:
        <span class="k">case</span> <span class="i">WHILE</span>:
        <span class="k">case</span> <span class="i">PRINT</span>:
        <span class="k">case</span> <span class="i">RETURN</span>:
          <span class="k">return</span>;
      }

      <span class="i">advance</span>();
    }
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>error</em>()</div>

<p>It discards tokens until it thinks it found a statement boundary. After catching
a ParseError, we&rsquo;ll call this and then we are hopefully back in sync. When it
works well, we have discarded tokens that would have likely caused cascaded
errors anyway and now we can parse the rest of the file starting at the next
statement.</p>
<p>Alas, we don&rsquo;t get to see this method in action, since we don&rsquo;t have statements
yet. We&rsquo;ll get to that <a href="statements-and-state.html">in a couple of chapters</a>. For now, if an
error occurs, we&rsquo;ll panic and unwind all the way to the top and stop parsing.
Since we can only parse a single expression anyway, that&rsquo;s no big loss.</p>
<h2><a href="#wiring-up-the-parser" name="wiring-up-the-parser"><small>6&#8202;.&#8202;4</small>Wiring up the Parser</a></h2>
<p>We are mostly done parsing expressions now. There is one other place where we
need to add a little error handling. As the parser descends through the parsing
methods for each grammar rule, it eventually hits <code>primary()</code>. If none of the
cases in there match, it means we are sitting on a token that can&rsquo;t start an
expression. We need to handle that error too:</p>
<div class="codehilite"><pre class="insert-before">
    if (match(LEFT_PAREN)) {
      Expr expr = expression();
      consume(RIGHT_PAREN, &quot;Expect ')' after expression.&quot;);
      return new Expr.Grouping(expr);
    }
</pre><div class="source-file"><em>lox/Parser.java</em><br>
in <em>primary</em>()</div>
<pre class="insert">

    <span class="k">throw</span> <span class="i">error</span>(<span class="i">peek</span>(), <span class="s">&quot;Expect expression.&quot;</span>);
</pre><pre class="insert-after">
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, in <em>primary</em>()</div>

<p>With that, all that remains in the parser is to define an initial method to kick
it off. It&rsquo;s called, naturally enough, <code>parse()</code>:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>Parser</em>()</div>
<pre>
  <span class="t">Expr</span> <span class="i">parse</span>() {
    <span class="k">try</span> {
      <span class="k">return</span> <span class="i">expression</span>();
    } <span class="k">catch</span> (<span class="t">ParseError</span> <span class="i">error</span>) {
      <span class="k">return</span> <span class="k">null</span>;
    }
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>Parser</em>()</div>

<p>We&rsquo;ll revisit this method later when we add statements to the language. For now,
it parses a single expression and returns it. We also have some temporary code
to exit out of panic mode. Syntax error recovery is the parser&rsquo;s job, so we
don&rsquo;t want the ParseError exception to escape into the rest of the interpreter.</p>
<p>When a syntax error does occur, this method returns <code>null</code>. That&rsquo;s OK. The
parser promises not to crash or hang on invalid syntax, but it doesn&rsquo;t promise
to return a <em>usable syntax tree</em> if an error is found. As soon as the parser
reports an error, <code>hadError</code> gets set, and subsequent phases are skipped.</p>
<p>Finally, we can hook up our brand new parser to the main Lox class and try it
out. We still don&rsquo;t have an interpreter so, for now, we&rsquo;ll parse to a syntax
tree and then use the AstPrinter class from the <a href="representing-code.html#a-(not-very)-pretty-printer">last chapter</a> to
display it.</p>
<p>Delete the old code to print the scanned tokens and replace it with this:</p>
<div class="codehilite"><pre class="insert-before">
    List&lt;Token&gt; tokens = scanner.scanTokens();
</pre><div class="source-file"><em>lox/Lox.java</em><br>
in <em>run</em>()<br>
replace 5 lines</div>
<pre class="insert">
    <span class="t">Parser</span> <span class="i">parser</span> = <span class="k">new</span> <span class="t">Parser</span>(<span class="i">tokens</span>);
    <span class="t">Expr</span> <span class="i">expression</span> = <span class="i">parser</span>.<span class="i">parse</span>();

    <span class="c">// Stop if there was a syntax error.</span>
    <span class="k">if</span> (<span class="i">hadError</span>) <span class="k">return</span>;

    <span class="t">System</span>.<span class="i">out</span>.<span class="i">println</span>(<span class="k">new</span> <span class="t">AstPrinter</span>().<span class="i">print</span>(<span class="i">expression</span>));
</pre><pre class="insert-after">
  }
</pre></div>
<div class="source-file-narrow"><em>lox/Lox.java</em>, in <em>run</em>(), replace 5 lines</div>

<p>Congratulations, you have crossed the <span name="harder">threshold</span>! That
really is all there is to hand-writing a parser. We&rsquo;ll extend the grammar in
later chapters with assignment, statements, and other stuff, but none of that is
any more complex than the binary operators we tackled here.</p>
<aside name="harder">
<p>It is possible to define a grammar that&rsquo;s more difficult than Lox&rsquo;s to parse
using recursive descent. Predictive parsing gets tricky when you may need to
look ahead a large number of tokens to figure out what you&rsquo;re sitting on.</p>
<p>In practice, most languages are designed to avoid that. Even in cases where they
aren&rsquo;t, you can usually hack around it without too much pain. If you can parse
C++ using recursive descent, you can parse anything.</p>
</aside>
<p>Fire up the interpreter and type in some expressions. See how it handles
precedence and associativity correctly? Not bad for less than 200 lines of code.</p>
<div class="challenges">
<h2><a href="#challenges" name="challenges">Challenges</a></h2>
<ol>
<li>
<p>In C, a block is a statement form that allows you to pack a series of
statements where a single one is expected. The <a href="https://en.wikipedia.org/wiki/Comma_operator">comma operator</a> is an
analogous syntax for expressions. A comma-separated series of expressions
can be given where a single expression is expected (except inside a function
call&rsquo;s argument list). At runtime, the comma operator evaluates the left
operand and discards the result. Then it evaluates and returns the right
operand.</p>
<p>Add support for comma expressions. Give them the same precedence and
associativity as in C. Write the grammar, and then implement the necessary
parsing code.</p>
</li>
<li>
<p>Likewise, add support for the C-style conditional or &ldquo;ternary&rdquo; operator
<code>?:</code>. What precedence level is allowed between the <code>?</code> and <code>:</code>? Is the whole
operator left-associative or right-associative?</p>
</li>
<li>
<p>Add error productions to handle each binary operator appearing without a
left-hand operand. In other words, detect a binary operator appearing at the
beginning of an expression. Report that as an error, but also parse and
discard a right-hand operand with the appropriate precedence.</p>
</li>
</ol>
</div>
<div class="design-note">
<h2><a href="#design-note" name="design-note">Design Note: Logic Versus History</a></h2>
<p>Let&rsquo;s say we decide to add bitwise <code>&amp;</code> and <code>|</code> operators to Lox. Where should we
put them in the precedence hierarchy? C<span class="em">&mdash;</span>and most languages that follow in C&rsquo;s
footsteps<span class="em">&mdash;</span>place them below <code>==</code>. This is widely considered a mistake because
it means common operations like testing a flag require parentheses:</p>
<div class="codehilite"><pre>
<span class="k">if</span> (<span class="i">flags</span> &amp; <span class="a">FLAG_MASK</span> == <span class="a">SOME_FLAG</span>) { ... } <span class="c">// Wrong.</span>
<span class="k">if</span> ((<span class="i">flags</span> &amp; <span class="a">FLAG_MASK</span>) == <span class="a">SOME_FLAG</span>) { ... } <span class="c">// Right.</span>
</pre></div>
<p>Should we fix this for Lox and put bitwise operators higher up the precedence
table than C does? There are two strategies we can take.</p>
<p>You almost never want to use the result of an <code>==</code> expression as the operand to
a bitwise operator. By making bitwise bind tighter, users don&rsquo;t need to
parenthesize as often. So if we do that, and users assume the precedence is
chosen logically to minimize parentheses, they&rsquo;re likely to infer it correctly.</p>
<p>This kind of internal consistency makes the language easier to learn because
there are fewer edge cases and exceptions users have to stumble into and then
correct. That&rsquo;s good, because before users can use our language, they have to
load all of that syntax and semantics into their heads. A simpler, more rational
language <em>makes sense</em>.</p>
<p>But, for many users there is an even faster shortcut to getting our language&rsquo;s
ideas into their wetware<span class="em">&mdash;</span><em>use concepts they already know.</em> Many newcomers to
our language will be coming from some other language or languages. If our
language uses some of the same syntax or semantics as those, there is much less
for the user to learn (and <em>unlearn</em>).</p>
<p>This is particularly helpful with syntax. You may not remember it well today,
but way back when you learned your very first programming language, code
probably looked alien and unapproachable. Only through painstaking effort did
you learn to read and accept it. If you design a novel syntax for your new
language, you force users to start that process all over again.</p>
<p>Taking advantage of what users already know is one of the most powerful tools
you can use to ease adoption of your language. It&rsquo;s almost impossible to
underestimate how useful this is. But it faces you with a nasty problem: What
happens when the thing the users all know <em>kind of sucks?</em> C&rsquo;s bitwise operator
precedence is a mistake that doesn&rsquo;t make sense. But it&rsquo;s a <em>familiar</em> mistake
that millions have already gotten used to and learned to live with.</p>
<p>Do you stay true to your language&rsquo;s own internal logic and ignore history? Do
you start from a blank slate and first principles? Or do you weave your language
into the rich tapestry of programming history and give your users a leg up by
starting from something they already know?</p>
<p>There is no perfect answer here, only trade-offs. You and I are obviously biased
towards liking novel languages, so our natural inclination is to burn the
history books and start our own story.</p>
<p>In practice, it&rsquo;s often better to make the most of what users already know.
Getting them to come to your language requires a big leap. The smaller you can
make that chasm, the more people will be willing to cross it. But you can&rsquo;t
<em>always</em> stick to history, or your language won&rsquo;t have anything new and
compelling to get people to <em>want</em> to jump over.</p>
</div>

<footer>
<a href="evaluating-expressions.html" class="next">
  Next Chapter: &ldquo;Evaluating Expressions&rdquo; &rarr;
</a>
Hand-crafted by Robert Nystrom&ensp;&mdash;&ensp;<a href="https://github.com/munificent/craftinginterpreters/blob/master/LICENSE" target="_blank">&copy; 2015&hairsp;&ndash;&hairsp;2020</a>
</footer>
</article>

</div>
</body>
</html>