/
parsing-expressions.html
1164 lines (1094 loc) · 72.5 KB
/
parsing-expressions.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
<title>Parsing Expressions · Crafting Interpreters</title>
<!-- Tell mobile browsers we're optimized for them and they don't need to crop
the viewport. -->
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<link rel="stylesheet" type="text/css" href="style.css" />
<!-- Oh, God, Source Code Pro is so beautiful it makes me want to cry. -->
<link href='https://fonts.googleapis.com/css?family=Source+Code+Pro:400|Source+Sans+Pro:300,400,600' rel='stylesheet' type='text/css'>
<link rel="icon" type="image/png" href="image/favicon.png" />
<script src="jquery-3.4.1.min.js"></script>
<script src="script.js"></script>
<!-- Google analytics -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-42804721-2', 'auto');
ga('send', 'pageview');
</script>
</head>
<body id="top">
<!-- <div class="scrim"></div> -->
<nav class="wide">
<a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
<div class="contents">
<!-- If there is a part, it must be a chapter within a part. -->
<h3><a href="#top">Parsing Expressions<small>6</small></a></h3>
<ul>
<li><a href="#ambiguity-and-the-parsing-game"><small>6.1</small> Ambiguity and the Parsing Game</a></li>
<li><a href="#recursive-descent-parsing"><small>6.2</small> Recursive Descent Parsing</a></li>
<li><a href="#syntax-errors"><small>6.3</small> Syntax Errors</a></li>
<li><a href="#wiring-up-the-parser"><small>6.4</small> Wiring up the Parser</a></li>
<li class="divider"></li>
<li class="end-part"><a href="#challenges">Challenges</a></li>
<li class="end-part"><a href="#design-note"><small>note</small>Logic Versus History</a></li>
</ul>
<div class="prev-next">
<a href="representing-code.html" title="Representing Code" class="left">← Previous</a>
<a href="a-tree-walk-interpreter.html" title="A Tree-Walk Interpreter">↑ Up</a>
<a href="evaluating-expressions.html" title="Evaluating Expressions" class="right">Next →</a>
</div> </div>
</nav>
<nav class="narrow">
<a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
<a href="representing-code.html" title="Representing Code" class="prev">←</a>
<a href="evaluating-expressions.html" title="Evaluating Expressions" class="next">→</a>
</nav>
<div class="page">
<div class="nav-wrapper">
<nav class="floating">
<a href="/"><img src="image/logotype-small.png" title="Crafting Interpreters"></a>
<div class="expandable">
<!-- If there is a part, it must be a chapter within a part. -->
<h3><a href="#top">Parsing Expressions<small>6</small></a></h3>
<ul>
<li><a href="#ambiguity-and-the-parsing-game"><small>6.1</small> Ambiguity and the Parsing Game</a></li>
<li><a href="#recursive-descent-parsing"><small>6.2</small> Recursive Descent Parsing</a></li>
<li><a href="#syntax-errors"><small>6.3</small> Syntax Errors</a></li>
<li><a href="#wiring-up-the-parser"><small>6.4</small> Wiring up the Parser</a></li>
<li class="divider"></li>
<li class="end-part"><a href="#challenges">Challenges</a></li>
<li class="end-part"><a href="#design-note"><small>note</small>Logic Versus History</a></li>
</ul>
<div class="prev-next">
<a href="representing-code.html" title="Representing Code" class="left">← Previous</a>
<a href="a-tree-walk-interpreter.html" title="A Tree-Walk Interpreter">↑ Up</a>
<a href="evaluating-expressions.html" title="Evaluating Expressions" class="right">Next →</a>
</div> </div>
<a id="expand-nav">≡</a>
</nav>
</div>
<article class="chapter">
<div class="number">6</div>
<h1>Parsing Expressions</h1>
<div class="sign-up closable">
<h1>This book is a work in progress!</h1>
<span class="dismiss">×</span>
<p>If you see a mistake, find something unclear, or have a suggestion, please <a href="https://github.com/munificent/craftinginterpreters/issues" target="_blank">let me know</a>. To follow its progress, please join the mailing list:</p>
<!-- Begin MailChimp Signup Form -->
<div id="mc_embed_signup">
<form action="//gameprogrammingpatterns.us7.list-manage.com/subscribe/post?u=0952ca43ed2536d6717766b88&id=6e96334109" method="post" id="mc-embedded-subscribe-form" name="mc-embedded-subscribe-form" class="validate" target="_blank" novalidate>
<input type="email" value="" name="EMAIL" class="email" id="mce-EMAIL" placeholder="Your email address" required>
<!-- real people should not fill this in and expect good things - do not remove this or risk form bot signups-->
<div style="position: absolute; left: -5000px;" aria-hidden="true"><input type="text" name="b_0952ca43ed2536d6717766b88_6e96334109" tabindex="-1" value=""></div>
<input type="submit" value="Sign me up!" name="subscribe" id="mc-embedded-subscribe" class="button">
</form>
</div>
<!--End mc_embed_signup-->
<p class="small">(I post about once a month. Don’t worry, I won’t spam you.)</p>
</div>
<blockquote>
<p>Grammar, which knows how to control even kings.
<cite>Molière</cite></p>
</blockquote>
<p><span name="parse">This</span> chapter marks the first major milestone of the
book. Many of us have cobbled together a mishmash of regular expressions and
substring operations to extract some sense out of a pile of text. The code was
probably riddled with bugs and a beast to maintain. Writing a <em>real</em> parser<span class="em">—</span>one with decent error-handling, a coherent internal structure, and the ability
to robustly chew through a sophisticated syntax<span class="em">—</span>is considered a rare,
impressive skill. In this chapter, you will <span name="attain">attain</span>
it.</p>
<aside name="parse">
<p>“Parse” comes to English from the Old French “pars” for “part of speech”. It
means to take a text and map each word to the grammar of the language. We use it
here in the same sense, except that our language is a little more modern than
Old French.</p>
</aside>
<aside name="attain">
<p>Like many rites of passage, you’ll probably find it looks a little smaller, a
little less daunting when it’s behind you than when it loomed ahead.</p>
</aside>
<p>It’s easier than you think, partially because we front-loaded a lot of the hard
work in the <a href="representing-code.html">last chapter</a>. You already know your way around a formal grammar.
You’re familiar with syntax trees, and we have some Java classes to represent
them. The only remaining piece is parsing<span class="em">—</span>transmogrifying a sequence of
tokens into one of those syntax trees.</p>
<p>Some CS textbooks make a big deal out of parsers. In the 60s, computer
scientists<span class="em">—</span>understandably tired of programming in assembly language<span class="em">—</span>started designing more sophisticated, <span name="human">human</span>-friendly
languages like FORTRAN and ALGOL. Alas, they weren’t very <em>machine</em>-friendly,
for the primitive machines at the time.</p>
<aside name="human">
<p>Consider how harrowing assembly programming on those old machines must have been
for <em>FORTRAN</em> to be an improvement.</p>
</aside>
<p>They designed languages that they honestly weren’t even sure how to write
compilers for, and then did ground-breaking work inventing parsing and compiling
techniques that could handle these new big languages on those old tiny machines.</p>
<p>Classic compiler books read like fawning hagiographies of these pioneers and
their tools. The cover of “Compilers: Principles, Techniques, and Tools”
literally has a dragon labeled “complexity of compiler design” being slain by a
knight bearing a sword and shield branded “LALR parser generator” and “syntax
directed translation”. They laid it on thick.</p>
<p>A little self-congratulation is well-deserved, but the truth is you don’t need
to know most of that stuff to bang out a high quality parser for a modern
machine. As always, I encourage you to broaden your education and take it in
later, but this book omits the trophy case.</p>
<h2><a href="#ambiguity-and-the-parsing-game" name="ambiguity-and-the-parsing-game"><small>6 . 1</small>Ambiguity and the Parsing Game</a></h2>
<p>In the last chapter, I said you can “play” a context free grammar like a game in
order to <em>generate</em> strings. Now, we play that game in reverse. Given a
string<span class="em">—</span>a series of tokens<span class="em">—</span>we map those tokens to terminals in the grammar
to figure out which rules could have generated that string.</p>
<p>The “could have” part is interesting. It’s entirely possible to create a grammar
that is <em>ambiguous</em>, where different choices of productions can lead to the same
string. When you’re using the grammar to <em>generate</em> strings, that doesn’t matter
much. Once you have the string, who cares how you got to it?</p>
<p>When parsing, ambiguity means the parser may misunderstand the user’s code. As
we parse, we aren’t just determining if the string is valid Lox code, we’re
also tracking which rules match which parts of it so that we know what part of
the language each token belongs to. Here’s the Lox expression grammar we put
together in the last chapter:</p>
<div class="codehilite"><pre>
<span class="i">expression</span> → <span class="i">literal</span>
| <span class="i">unary</span>
| <span class="i">binary</span>
| <span class="i">grouping</span> ;
<span class="i">literal</span> → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">"false"</span> | <span class="s">"true"</span> | <span class="s">"nil"</span> ;
<span class="i">grouping</span> → <span class="s">"("</span> <span class="i">expression</span> <span class="s">")"</span> ;
<span class="i">unary</span> → ( <span class="s">"-"</span> | <span class="s">"!"</span> ) <span class="i">expression</span> ;
<span class="i">binary</span> → <span class="i">expression</span> <span class="i">operator</span> <span class="i">expression</span> ;
<span class="i">operator</span> → <span class="s">"=="</span> | <span class="s">"!="</span> | <span class="s">"<"</span> | <span class="s">"<="</span> | <span class="s">">"</span> | <span class="s">">="</span>
| <span class="s">"+"</span> | <span class="s">"-"</span> | <span class="s">"*"</span> | <span class="s">"/"</span> ;
</pre></div>
<p>This is a valid string in that grammar:</p><img src="image/parsing-expressions/tokens.png" alt="6 / 3 - 1" />
<p>But there are two ways we could have generated it. One way is:</p>
<ol>
<li>Starting at <code>expression</code>, pick <code>binary</code>.</li>
<li>For the left-hand <code>expression</code>, pick <code>NUMBER</code>, and use <code>6</code>.</li>
<li>For the operator, pick <code>"/"</code>.</li>
<li>For the right-hand <code>expression</code>, pick <code>binary</code> again.</li>
<li>In that nested <code>binary</code> expression, pick <code>3 - 1</code>.</li>
</ol>
<p>Another is:</p>
<ol>
<li>Starting at <code>expression</code>, pick <code>binary</code>.</li>
<li>For the left-hand <code>expression</code>, pick <code>binary</code> again.</li>
<li>In that nested <code>binary</code> expression, pick <code>6 / 3</code>.</li>
<li>Back at the outer <code>binary</code>, for the operator, pick <code>"-"</code>.</li>
<li>For the right-hand <code>expression</code>, pick <code>NUMBER</code>, and use <code>1</code>.</li>
</ol>
<p>Those produce the same <em>strings</em>, but not the same <em>syntax trees</em>:</p><img src="image/parsing-expressions/syntax-trees.png" alt="Two valid syntax trees: (6 / 3) - 1 and 6 / (3 - 1)" />
<p>In other words, the grammar allows seeing the expression as <code>(6 / 3) - 1</code> or <code>6 / (3 - 1)</code>. That in turn affects the result of evaluating it. The way
mathematicians have solved this ambiguity since blackboards were first invented
is by defining rules for precedence and associativity.</p>
<ul>
<li>
<p><span name="nonassociative"><strong>Precedence</strong></span> determines which operator
is evaluated first in an expression containing a mixture of different
operators. Precedence rules tell us that we evaluate the <code>/</code> before the <code>-</code>
in the above example. Operators with higher precedence are evaluated
before operators with lower precedence. Equivalently, higher precedence
operators are said to “bind tighter”.</p>
</li>
<li>
<p><strong>Associativity</strong> determines which operator is evaluated first in a series
of the <em>same</em> operator. When an operator is <strong>left-associative</strong> (think
“left-to-right”), operators on the left evaluate before those on the right.
Since <code>-</code> is left-associative, this expression:</p>
<div class="codehilite"><pre>
<span class="n">5</span> - <span class="n">3</span> - <span class="n">1</span>
</pre></div>
<p>is equivalent to:</p>
<div class="codehilite"><pre>
(<span class="n">5</span> - <span class="n">3</span>) - <span class="n">1</span>
</pre></div>
<p>Assignment, on the other hand, is <strong>right-associative</strong>. This:</p>
<div class="codehilite"><pre>
<span class="i">a</span> = <span class="i">b</span> = <span class="i">c</span>
</pre></div>
<p>is equivalent to:</p>
<div class="codehilite"><pre>
<span class="i">a</span> = (<span class="i">b</span> = <span class="i">c</span>)
</pre></div>
</li>
</ul>
<aside name="nonassociative">
<p>While not common these days, some languages specify that certain pairs of
operators have <em>no</em> relative precedence. That makes it a syntax error to mix
those operators in an expression without using explicit grouping.</p>
<p>Likewise, some operators are <strong>non-associative</strong>. That means it’s an error to
use that operator more than once in a sequence. For example, Perl’s range
operator isn’t associative, so <code>a .. b</code> is OK, but <code>a .. b .. c</code> is an error.</p>
</aside>
<p>Without well-defined precedence and associativity, an expression that uses
multiple operators is ambiguous<span class="em">—</span>it can be parsed into different syntax trees,
which could in turn evaluate to different results. We’ll fix that in Lox by
applying the same precedence rules as C, going from lowest to highest:</p><table>
<thead>
<tr>
<td>Name</td>
<td>Operators</td>
<td>Associates</td>
</tr>
</thead>
<tbody>
<tr>
<td>Equality</td>
<td><code>==</code> <code>!=</code></td>
<td>Left</td>
</tr>
<tr>
<td>Comparison</td>
<td><code>></code> <code>>=</code>
<code><</code> <code><=</code></td>
<td>Left</td>
</tr>
<tr>
<td>Addition</td>
<td><code>-</code> <code>+</code></td>
<td>Left</td>
</tr>
<tr>
<td>Multiplication</td>
<td><code>/</code> <code>*</code></td>
<td>Left</td>
</tr>
<tr>
<td>Unary</td>
<td><code>!</code> <code>-</code></td>
<td>Right</td>
</tr>
</tbody>
</table>
<p>Right now, the grammar stuffs all expression types into a single <code>expression</code>
rule. That same rule is used as the non-terminal for subexpressions, which lets
the grammar accept any kind of expression as an operand, regardless of whether
the precedence rules allow it.</p>
<p>We fix that by <span name="massage">stratifying</span> the grammar. We define a
separate rule for each precedence level:</p>
<div class="codehilite"><pre>
<span class="i">expression</span> → ...
<span class="i">equality</span> → ...
<span class="i">comparison</span> → ...
<span class="i">addition</span> → ...
<span class="i">multiplication</span> → ...
<span class="i">unary</span> → ...
<span class="i">primary</span> → ...
</pre></div>
<aside name="massage">
<p>Instead of baking precedence right into the grammar rules, some parser
generators let you keep the same ambiguous-but-simple grammar and then add in a
little explicit operator precedence metadata on the side in order to
disambiguate.</p>
</aside>
<p>Each rule here only matches expressions at its precedence level or higher. For
example, <code>unary</code> matches a unary expression like <code>!negated</code> or a primary
expression like <code>1234</code>. And <code>addition</code> can match <code>1 + 2</code> but also <code>3 * 4 / 5</code>.
The final <code>primary</code> rule covers the highest-precedence expressions<span class="em">—</span>literals
and parenthesized grouping expressions.</p>
<p>We just need to fill in the productions for each of those rules. We’ll do the
easy ones first. The top <code>expression</code> rule matches any expression at any
precedence level. Since <span name="equality"><code>equality</code></span> has the lowest
precedence, if we match that, then it covers everything:</p>
<aside name="equality">
<p>We could eliminate <code>expression</code> and simply use <code>equality</code> in the other rules
that contain expressions, but using <code>expression</code> makes those other rules read a
little better.</p>
<p>Also, in later chapters, when we expand the grammar to include assignment and
logical operators, we’ll only need to change <code>expression</code> instead of touching
every rule that contains an expression.</p>
</aside>
<div class="codehilite"><pre>
<span class="i">expression</span> → <span class="i">equality</span>
</pre></div>
<p>Over at the other end of the precedence table, a primary expression contains
all the literals and grouping expressions:</p>
<div class="codehilite"><pre>
<span class="i">primary</span> → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">"false"</span> | <span class="s">"true"</span> | <span class="s">"nil"</span>
| <span class="s">"("</span> <span class="i">expression</span> <span class="s">")"</span> ;
</pre></div>
<p>A unary expression starts with a unary operator followed by the operand. Since
unary operators can nest<span class="em">—</span><code>!!true</code> is a valid if weird expression<span class="em">—</span>the
operand can itself be a unary operator. A recursive rule handles that nicely:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">"!"</span> | <span class="s">"-"</span> ) <span class="i">unary</span> ;
</pre></div>
<p>But this rule has a problem. It never terminates. Remember, each rule needs to
match expressions at that precedence level <em>or higher</em>, so we also need to let
this match a primary expression:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">"!"</span> | <span class="s">"-"</span> ) <span class="i">unary</span>
| <span class="i">primary</span> ;
</pre></div>
<p>That works.</p>
<p>The remaining rules are all binary operators. We’ll start with the rule for
multiplication and division. Here’s a first try:</p>
<div class="codehilite"><pre>
<span class="i">multiplication</span> → <span class="i">multiplication</span> ( <span class="s">"/"</span> | <span class="s">"*"</span> ) <span class="i">unary</span>
| <span class="i">unary</span> ;
</pre></div>
<p>The rule recurses to match the left operand. That enables the rule to match a
series of multiplication and division expressions like <code>1 * 2 / 3</code>. Putting the
recursive production on the left side and <code>unary</code> on the right makes the rule
<span name="mult">left-associative</span> and unambiguous.</p>
<aside name="mult">
<p>In principle, it doesn’t matter whether you treat multiplication as left- or
right-associative<span class="em">—</span>you get the same result either way. Alas, in the real world
with limited precision, roundoff and overflow mean that associativity can affect
the result of a sequence of multiplications. Consider:</p>
<div class="codehilite"><pre>
<span class="k">print</span> <span class="n">0.1</span> * (<span class="n">0.2</span> * <span class="n">0.3</span>);
<span class="k">print</span> (<span class="n">0.1</span> * <span class="n">0.2</span>) * <span class="n">0.3</span>;
</pre></div>
<p>In languages like Lox that use <a href="https://en.wikipedia.org/wiki/Double-precision_floating-point_format">IEEE 754</a> double-precision floating-point
numbers, the first evaluates to <code>0.006</code>, while the second yields
<code>0.006000000000000001</code>. Sometimes that tiny difference matters.
<a href="https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html">This</a> is a good place to learn more.</p>
</aside>
<p>All of this is correct, but the fact that the first nonterminal in the body of
the rule is the same as the head of the rule means this production is
<strong>left-recursive</strong>. Some parsing techniques, including the one we’re going to
use, have trouble with left recursion. (Recursion elsewhere, like we have in
<code>unary</code> and the indirect recursion for grouping in <code>primary</code> are not a problem.)</p>
<p>There are many grammars you can define that match the same language. The choice
for how to model a particular language is partially a matter of taste and
partially a pragmatic one. This rule is correct, but not optimal for how we
intend to parse it. Instead of a left recursive rule, we’ll use a different one:</p>
<div class="codehilite"><pre>
<span class="i">multiplication</span> → <span class="i">unary</span> ( ( <span class="s">"/"</span> | <span class="s">"*"</span> ) <span class="i">unary</span> )* ;
</pre></div>
<p>We define a multiplication expression as a flat <em>sequence</em> of multiplications
and divisions. This matches the same syntax as the previous rule, but better
mirrors the code we’ll write to parse code. We use the same structure for all of
other binary operator precedence levels giving us this complete expression
grammar:</p>
<div class="codehilite"><pre>
<span class="i">expression</span> → <span class="i">equality</span> ;
<span class="i">equality</span> → <span class="i">comparison</span> ( ( <span class="s">"!="</span> | <span class="s">"=="</span> ) <span class="i">comparison</span> )* ;
<span class="i">comparison</span> → <span class="i">addition</span> ( ( <span class="s">">"</span> | <span class="s">">="</span> | <span class="s">"<"</span> | <span class="s">"<="</span> ) <span class="i">addition</span> )* ;
<span class="i">addition</span> → <span class="i">multiplication</span> ( ( <span class="s">"-"</span> | <span class="s">"+"</span> ) <span class="i">multiplication</span> )* ;
<span class="i">multiplication</span> → <span class="i">unary</span> ( ( <span class="s">"/"</span> | <span class="s">"*"</span> ) <span class="i">unary</span> )* ;
<span class="i">unary</span> → ( <span class="s">"!"</span> | <span class="s">"-"</span> ) <span class="i">unary</span>
| <span class="i">primary</span> ;
<span class="i">primary</span> → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">"false"</span> | <span class="s">"true"</span> | <span class="s">"nil"</span>
| <span class="s">"("</span> <span class="i">expression</span> <span class="s">")"</span> ;
</pre></div>
<p>This grammar is more complex than the one we had before, but in return we have
eliminated the previous one’s ambiguity. It’s just what we need to make a
parser.</p>
<h2><a href="#recursive-descent-parsing" name="recursive-descent-parsing"><small>6 . 2</small>Recursive Descent Parsing</a></h2>
<p>There is a whole pack of parsing techniques whose names mostly seem to be
combinations of “L” and “R”<span class="em">—</span><a href="https://en.wikipedia.org/wiki/LL_parser">LL(k)</a>, <a href="https://en.wikipedia.org/wiki/LR_parser">LR(1)</a>, <a href="https://en.wikipedia.org/wiki/LALR_parser">LALR</a><span class="em">—</span>along with
more exotic beasts like <a href="https://en.wikipedia.org/wiki/Parser_combinator">parser combinators</a>, <a href="https://en.wikipedia.org/wiki/Earley_parser">Earley parsers</a>,
<a href="https://en.wikipedia.org/wiki/Shunting-yard_algorithm">the shunting yard algorithm</a>, and <a href="https://en.wikipedia.org/wiki/Parsing_expression_grammar">packrat parsing</a>. For our first
interpreter, one technique is more than sufficient: <strong>recursive descent</strong>.</p>
<p>Recursive descent is the simplest way to build a parser, and doesn’t require
using complex parser generator tools like Yacc, Bison or ANTLR. All you need is
straightforward hand-written code. Don’t be fooled by its simplicity, though.
Recursive descent parsers are fast, robust, and can support sophisticated
error-handling. In fact, GCC, V8 (the JavaScript VM in Chrome), Roslyn (the C#
compiler written in C#) and many other heavyweight production language
implementations use recursive descent. It kicks ass.</p>
<p>It is considered a <strong>top-down parser</strong> because it starts from the top or
outermost grammar rule (here <code>expression</code>) and works its way <span
name="descent">down</span> into the nested subexpressions before finally
reaching the leaves of the syntax tree. This is in contrast with bottom-up
parsers like LR that start with primary expressions and compose them into larger
and larger chunks of syntax.</p>
<aside name="descent">
<p>It’s called “recursive <em>descent</em>” because it walks <em>down</em> the grammar.
Confusingly, we also use direction metaphorically when talking about “high” and
“low” precedence, but the orientation is reversed. In a top-down parser, you
reach the lowest-precedence expressions first because they may in turn contain
subexpressions of higher precedence.</p><img src="image/parsing-expressions/direction.png" alt="Top-down grammar rules in order of increasing precedence.">
<p>CS people really need to get together and straighten out their metaphors. Don’t
even get me started on which direction the stack is supposed to grow.</p>
</aside>
<p>A recursive descent parser is a literal translation of the grammar’s rules
straight into imperative code. Each rule becomes a function. The body of the
rule translates to code roughly like:</p><table>
<thead>
<tr>
<td>Grammar notation</td>
<td>Code representation</td>
</tr>
</thead>
<tbody>
<tr><td>Terminal</td><td>Code to match and consume a token</td></tr>
<tr><td>Nonterminal</td><td>Call to that rule’s function</td></tr>
<tr><td><code>|</code></td><td><code>if</code> or <code>switch</code> statement</td></tr>
<tr><td><code>*</code> or <code>+</code></td><td><code>while</code> or <code>for</code> loop</td></tr>
<tr><td><code>?</code></td><td><code>if</code> statement</td></tr>
</tbody>
</table>
<p>It’s called “<em>recursive</em> descent” because when a grammar rule refers to itself<span class="em">—</span>directly or indirectly<span class="em">—</span>that translates to recursive method calls.</p>
<h3><a href="#the-parser-class" name="the-parser-class"><small>6 . 2 . 1</small>The parser class</a></h3>
<p>Each grammar rule becomes a method inside this new class:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
create new file</div>
<pre>
<span class="k">package</span> <span class="i">com.craftinginterpreters.lox</span>;
<span class="k">import</span> <span class="i">java.util.List</span>;
<span class="k">import static</span> <span class="i">com.craftinginterpreters.lox.TokenType.*</span>;
<span class="k">class</span> <span class="t">Parser</span> {
<span class="k">private</span> <span class="k">final</span> <span class="t">List</span><<span class="t">Token</span>> <span class="i">tokens</span>;
<span class="k">private</span> <span class="t">int</span> <span class="i">current</span> = <span class="n">0</span>;
<span class="t">Parser</span>(<span class="t">List</span><<span class="t">Token</span>> <span class="i">tokens</span>) {
<span class="k">this</span>.<span class="i">tokens</span> = <span class="i">tokens</span>;
}
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, create new file</div>
<p>Like the scanner, it consumes a sequence, only now we’re working at the level of
entire tokens. It takes in a list of tokens and uses <code>current</code> to point to the
next token eagerly waiting to be used.</p>
<p>We’re going to run straight through the expression grammar now and translate
each rule to Java code. The first rule, <code>expression</code>, simply expands to the
<code>equality</code> rule, so that’s straightforward:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>Parser</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">expression</span>() {
<span class="k">return</span> <span class="i">equality</span>();
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>Parser</em>()</div>
<p>Each method for parsing a grammar rule produces a syntax tree for that rule and
returns it to the caller. When the body of the rule contains a nonterminal<span class="em">—</span>a
reference to another rule<span class="em">—</span>we <span name="left">call</span> that rule’s
method.</p>
<aside name="left">
<p>This is why left recursion is problematic for recursive descent. The function
for a left-recursive rule immediately calls itself, which calls itself again,
and so on, until the parser hits a stack overflow and dies.</p>
</aside>
<p>The rule for equality is a little more complex:</p>
<div class="codehilite"><pre>
<span class="i">equality</span> → <span class="i">comparison</span> ( ( <span class="s">"!="</span> | <span class="s">"=="</span> ) <span class="i">comparison</span> )* ;
</pre></div>
<p>In Java, that becomes:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>expression</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">equality</span>() {
<span class="t">Expr</span> <span class="i">expr</span> = <span class="i">comparison</span>();
<span class="k">while</span> (<span class="i">match</span>(<span class="i">BANG_EQUAL</span>, <span class="i">EQUAL_EQUAL</span>)) {
<span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
<span class="t">Expr</span> <span class="i">right</span> = <span class="i">comparison</span>();
<span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
}
<span class="k">return</span> <span class="i">expr</span>;
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>expression</em>()</div>
<p>Let’s step through it. The left <code>comparison</code> nonterminal in the body is
translated to the first call to <code>comparison()</code> and we store that in a local
variable.</p>
<p>Then, the <code>( ... )*</code> loop in the rule is mapped to a <code>while</code> loop. We need to
know when to exit that loop. We can see that inside the rule, we must first find
either a <code>!=</code> or <code>==</code> token. So, if we <em>don’t</em> see one of those, we must be done
with the sequence of equality operators. We express that check using a handy
<code>match()</code> method:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>equality</em>()</div>
<pre>
<span class="k">private</span> <span class="t">boolean</span> <span class="i">match</span>(<span class="t">TokenType</span>... <span class="i">types</span>) {
<span class="k">for</span> (<span class="t">TokenType</span> <span class="i">type</span> : <span class="i">types</span>) {
<span class="k">if</span> (<span class="i">check</span>(<span class="i">type</span>)) {
<span class="i">advance</span>();
<span class="k">return</span> <span class="k">true</span>;
}
}
<span class="k">return</span> <span class="k">false</span>;
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>equality</em>()</div>
<p>This checks to see if the current token is any of the given types. If so, it
consumes the token and returns <code>true</code>. Otherwise, it returns <code>false</code> and leaves
the token as the current one.</p>
<p>The <code>match()</code> method is defined in terms of two more fundamental operations:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>match</em>()</div>
<pre>
<span class="k">private</span> <span class="t">boolean</span> <span class="i">check</span>(<span class="t">TokenType</span> <span class="i">type</span>) {
<span class="k">if</span> (<span class="i">isAtEnd</span>()) <span class="k">return</span> <span class="k">false</span>;
<span class="k">return</span> <span class="i">peek</span>().<span class="i">type</span> == <span class="i">type</span>;
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>match</em>()</div>
<p>This returns <code>true</code> if the current token is of the given type. Unlike <code>match()</code>,
it doesn’t consume the token, it only looks at it.</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>check</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Token</span> <span class="i">advance</span>() {
<span class="k">if</span> (!<span class="i">isAtEnd</span>()) <span class="i">current</span>++;
<span class="k">return</span> <span class="i">previous</span>();
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>check</em>()</div>
<p>This consumes the current token and returns it, similar to how our scanner’s
<code>advance()</code> method did with characters.</p>
<p>These methods bottom out on the last handful of primitive operations:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>advance</em>()</div>
<pre>
<span class="k">private</span> <span class="t">boolean</span> <span class="i">isAtEnd</span>() {
<span class="k">return</span> <span class="i">peek</span>().<span class="i">type</span> == <span class="i">EOF</span>;
}
<span class="k">private</span> <span class="t">Token</span> <span class="i">peek</span>() {
<span class="k">return</span> <span class="i">tokens</span>.<span class="i">get</span>(<span class="i">current</span>);
}
<span class="k">private</span> <span class="t">Token</span> <span class="i">previous</span>() {
<span class="k">return</span> <span class="i">tokens</span>.<span class="i">get</span>(<span class="i">current</span> - <span class="n">1</span>);
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>advance</em>()</div>
<p><code>isAtEnd()</code> checks if we’ve run out of tokens to parse. <code>peek()</code> returns the
current token we have yet to consume and <code>previous()</code> returns the most recently
consumed token. The latter makes it easier to use <code>match()</code> and then access the
just-matched token.</p>
<p>That’s most of the parsing infrastructure we need. Where were we? Right, so if
we are inside the <code>while</code> loop in <code>equality()</code>, then the parser knows it found a
<code>!=</code> or <code>==</code> operator and must be parsing an equality expression.</p>
<p>It grabs the token that was matched for the operator so we can track which kind
of binary expression this is. Then it calls <code>comparison()</code> again to parse the
right-hand operand. It combines the operator and the two operands into a new
<code>Expr.Binary</code> syntax tree node, and then loops around. Each time, it stores the
expression back in the same <code>expr</code> local variable. As it zips through a sequence
of equality expressions, that creates a left-associative nested tree of binary
operator nodes.</p>
<p><span name="sequence"></span></p><img src="image/parsing-expressions/sequence.png" alt="The syntax tree created by parsing 'a == b == c == d == e'" />
<aside name="sequence">
<p>Parsing <code>a == b == c == d == e</code>. Each iteration, we create a new binary
expression using the previous one as the left operand.</p>
</aside>
<p>The parser falls out of the loop once it hits a token that’s not an equality
operator. Finally, it returns the expression. Note that if it doesn’t encounter
a single equality operator, then it never enters the loop. In that case, the
<code>equality()</code> method effectively calls and returns <code>comparison()</code>. In that way,
this method matches an equality operator <em>or anything of higher precedence</em>.</p>
<p>Moving on to the next rule…</p>
<div class="codehilite"><pre>
<span class="i">comparison</span> → <span class="i">addition</span> ( ( <span class="s">">"</span> | <span class="s">">="</span> | <span class="s">"<"</span> | <span class="s">"<="</span> ) <span class="i">addition</span> )* ;
</pre></div>
<p>Translated to Java:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>equality</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">comparison</span>() {
<span class="t">Expr</span> <span class="i">expr</span> = <span class="i">addition</span>();
<span class="k">while</span> (<span class="i">match</span>(<span class="i">GREATER</span>, <span class="i">GREATER_EQUAL</span>, <span class="i">LESS</span>, <span class="i">LESS_EQUAL</span>)) {
<span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
<span class="t">Expr</span> <span class="i">right</span> = <span class="i">addition</span>();
<span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
}
<span class="k">return</span> <span class="i">expr</span>;
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>equality</em>()</div>
<p>The grammar rule is virtually identical to <code>equality</code> and so is the
corresponding code. The only <span name="handle">differences</span> are the
token types for the operators we match, and the method we call for the operands,
now <code>addition()</code> instead of <code>comparison()</code>. The remaining two binary operator
rules follow the same pattern:</p>
<aside name="handle">
<p>If you wanted to do some clever Java 8, you could create a helper method for
parsing a left-associative series of binary operators given a list of token
types and an operand method handle and unify some of this redundant code.</p>
</aside>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>comparison</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">addition</span>() {
<span class="t">Expr</span> <span class="i">expr</span> = <span class="i">multiplication</span>();
<span class="k">while</span> (<span class="i">match</span>(<span class="i">MINUS</span>, <span class="i">PLUS</span>)) {
<span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
<span class="t">Expr</span> <span class="i">right</span> = <span class="i">multiplication</span>();
<span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
}
<span class="k">return</span> <span class="i">expr</span>;
}
<span class="k">private</span> <span class="t">Expr</span> <span class="i">multiplication</span>() {
<span class="t">Expr</span> <span class="i">expr</span> = <span class="i">unary</span>();
<span class="k">while</span> (<span class="i">match</span>(<span class="i">SLASH</span>, <span class="i">STAR</span>)) {
<span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
<span class="t">Expr</span> <span class="i">right</span> = <span class="i">unary</span>();
<span class="i">expr</span> = <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Binary</span>(<span class="i">expr</span>, <span class="i">operator</span>, <span class="i">right</span>);
}
<span class="k">return</span> <span class="i">expr</span>;
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>comparison</em>()</div>
<p>That’s all of the binary operators, parsed with the correct precedence and
associativity. We’re crawling up the precedence hierarchy and now we’ve reached
the unary operators:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">"!"</span> | <span class="s">"-"</span> ) <span class="i">unary</span>
| <span class="i">primary</span> ;
</pre></div>
<p>The code for this is a little different:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>multiplication</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">unary</span>() {
<span class="k">if</span> (<span class="i">match</span>(<span class="i">BANG</span>, <span class="i">MINUS</span>)) {
<span class="t">Token</span> <span class="i">operator</span> = <span class="i">previous</span>();
<span class="t">Expr</span> <span class="i">right</span> = <span class="i">unary</span>();
<span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Unary</span>(<span class="i">operator</span>, <span class="i">right</span>);
}
<span class="k">return</span> <span class="i">primary</span>();
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>multiplication</em>()</div>
<p>Again, we look at the <span name="current">current<span> token to see how to
parse. If it’s a <code>!</code> or <code>-</code>, we must have a unary expression. In that case, we
grab the token, and then recursively call <code>unary()</code> again to parse the operand.
Wrap that all up in a unary expression syntax tree and we’re done.</p>
<aside name="current">
<p>The fact that the parser looks ahead at upcoming tokens to decide how to parse
puts recursive descent into the category of <strong>predictive parsers</strong>.</p>
</aside>
<p>Otherwise, we must have reached the highest level of precedence, primary
expressions.</p>
<div class="codehilite"><pre>
<span class="i">primary</span> → <span class="t">NUMBER</span> | <span class="t">STRING</span> | <span class="s">"false"</span> | <span class="s">"true"</span> | <span class="s">"nil"</span>
| <span class="s">"("</span> <span class="i">expression</span> <span class="s">")"</span> ;
</pre></div>
<p>Most of the cases for the rule are single terminals, so it’s pretty
straightforward:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>unary</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Expr</span> <span class="i">primary</span>() {
<span class="k">if</span> (<span class="i">match</span>(<span class="i">FALSE</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">false</span>);
<span class="k">if</span> (<span class="i">match</span>(<span class="i">TRUE</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">true</span>);
<span class="k">if</span> (<span class="i">match</span>(<span class="i">NIL</span>)) <span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="k">null</span>);
<span class="k">if</span> (<span class="i">match</span>(<span class="i">NUMBER</span>, <span class="i">STRING</span>)) {
<span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Literal</span>(<span class="i">previous</span>().<span class="i">literal</span>);
}
<span class="k">if</span> (<span class="i">match</span>(<span class="i">LEFT_PAREN</span>)) {
<span class="t">Expr</span> <span class="i">expr</span> = <span class="i">expression</span>();
<span class="i">consume</span>(<span class="i">RIGHT_PAREN</span>, <span class="s">"Expect ')' after expression."</span>);
<span class="k">return</span> <span class="k">new</span> <span class="t">Expr</span>.<span class="t">Grouping</span>(<span class="i">expr</span>);
}
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>unary</em>()</div>
<p>The interesting branch is the one for handling parentheses. After we match an
opening <code>(</code> and parse the expression inside it, we <em>must</em> find a <code>)</code> token. If
we don’t, that’s an error.</p>
<h2><a href="#syntax-errors" name="syntax-errors"><small>6 . 3</small>Syntax Errors</a></h2>
<p>A parser really has two jobs:</p>
<ol>
<li>
<p>Given a valid sequence of tokens, produce a corresponding syntax tree.</p>
</li>
<li>
<p>Given an <em>invalid</em> sequence of tokens, detect any errors and tell the
user about their mistakes.</p>
</li>
</ol>
<p>Don’t underestimate how important the second job is! In modern IDEs and editors,
the parser is constantly reparsing code<span class="em">—</span>often while the user is still editing
it<span class="em">—</span>in order to syntax highlight and support things like auto-complete. That
means it will encounter code in incomplete, half-wrong states <em>all the time.</em></p>
<p>When the user doesn’t realize the syntax is wrong, it is up to the parser to
help guide them back onto the right path. The way it reports errors is a large
part of your language’s user interface. Good syntax error handling is hard. By
definition, the code isn’t in a well-defined state, so there’s no infallible way
to know what the user <em>meant</em> to write. The parser can’t read your <span
name="telepathy">mind</span>.</p>
<aside name="telepathy">
<p>Not yet at least. With the way things are going in machine learning these days,
who knows what the future will bring?</p>
</aside>
<p>There are a couple of hard requirements for when the parser runs into a syntax
error:</p>
<ul>
<li>
<p><strong>It must detect and report the error.</strong> If it doesn’t detect the <span
name="error">error</span> and passes the resulting malformed syntax tree on
to the interpreter, all manner of horrors may be summoned.</p>
<aside name="error">
<p>Philosophically speaking, if an error isn’t detected and the interpreter
runs the code, is it <em>really</em> an error?</p>
</aside></li>
<li>
<p><strong>It must not crash or hang.</strong> Syntax errors are a fact of life and language
tools have to be robust in the face of them. Segfaulting or getting stuck in
an infinite loop isn’t allowed. While the source may not be valid <em>code</em>,
it’s still a valid <em>input to the parser</em> because users use the parser to
learn what syntax is allowed.</p>
</li>
</ul>
<p>Those are the table stakes if you want to get in the parser game at all, but you
really want to raise the ante beyond that. A decent parser should:</p>
<ul>
<li>
<p><strong>Be fast.</strong> Computers are thousands of times faster than they were when
parser technology was first invented. The days of needing to optimize your
parser so that it could get through an entire source file during a coffee
break are over. But programmer expectations have risen as quickly, if not
faster. They expect their editors to reparse files in milliseconds after
every keystroke.</p>
</li>
<li>
<p><strong>Report as many distinct errors as there are.</strong> Aborting after the first
error is easy to implement, but it’s annoying for users if every time they
fix what they think is the one error in a file, a new one appears. They
want to see them all.</p>
</li>
<li>
<p><strong>Minimize <em>cascaded</em> errors.</strong> Once a single error is found, the parser no
longer really knows what’s going on. It tries to get itself back on track
and keep going, but if it gets confused, it may report a slew of ghost
errors that don’t indicate other real problems in the code. When the first
error is fixed, they disappear, because they merely represent the parser’s
own confusion. These are annoying because they can scare the user into
thinking their code is in a worse state than it is.</p>
</li>
</ul>
<p>The last two points are in tension. We want to report as many separate errors as
we can, but we don’t want to report ones that are merely side effects of an
earlier one.</p>
<p>The way a parser responds to an error and keeps going to look for later errors
is called <strong>“error recovery”</strong>. It was a hot research topic in the 60s. Back
then, you’d hand a stack of punch cards to the secretary and come back the next
day to see if the compiler succeeded. With an iteration loop that slow, you
<em>really</em> wanted to find every single error in your code in one pass.</p>
<p>Today, when parsers complete before you’ve even finished typing, it’s less of an
issue. Simple, fast error recovery is fine.</p>
<h3><a href="#panic-mode-error-recovery" name="panic-mode-error-recovery"><small>6 . 3 . 1</small>Panic mode error recovery</a></h3>
<aside name="panic">
<p>You know you want to push it.</p><img src="image/parsing-expressions/panic.png" alt="A big shiny 'PANIC' button.">
</aside>
<p>Of all the recovery techniques devised in yesteryear, the one that best stood
the test of time is called<span class="em">—</span>somewhat alarmingly<span class="em">—</span><span name="panic">“panic
mode”</span>. As soon as the parser detects an error, it enters panic mode. It
knows at least one token doesn’t make sense given its current state in the
middle of some stack of grammar productions.</p>
<p>Before it can get back to parsing, it needs to get its state and the sequence of
forthcoming tokens aligned such that the next token does match the rule being
parsed. This process is called <strong>synchronization</strong>.</p>
<p>To do that, we select some rule in the grammar that will mark the
synchronization point. The parser fixes its parsing state by jumping out of any
nested productions until it gets back to that rule. Then it synchronizes the
token stream by discarding tokens until it reaches one that can appear at that
point in the rule.</p>
<p>Any additional real syntax errors hiding in those discarded tokens aren’t
reported, but it also means that any mistaken cascaded errors that are side
effects of the initial error aren’t <em>falsely</em> reported either, which is a decent
trade-off.</p>
<p>The traditional place in the grammar to synchronize is between statements. We
don’t have those yet, so we won’t actually synchronize in this chapter, but
we’ll get the machinery in place for later.</p>
<h3><a href="#entering-panic-mode" name="entering-panic-mode"><small>6 . 3 . 2</small>Entering panic mode</a></h3>
<p>Back before we went on this side trip around error recovery, we were writing the
code to parse a parenthesized expression. After parsing the expression, it
looks for the closing <code>)</code> by calling <code>consume()</code>. Here, finally, is that method:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>match</em>()</div>
<pre>
<span class="k">private</span> <span class="t">Token</span> <span class="i">consume</span>(<span class="t">TokenType</span> <span class="i">type</span>, <span class="t">String</span> <span class="i">message</span>) {
<span class="k">if</span> (<span class="i">check</span>(<span class="i">type</span>)) <span class="k">return</span> <span class="i">advance</span>();
<span class="k">throw</span> <span class="i">error</span>(<span class="i">peek</span>(), <span class="i">message</span>);
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>match</em>()</div>
<p>It’s similar to <code>match()</code> in that it checks to see if the next token is of the
expected type. If so, it consumes it and everything is groovy. If some other
token is there, then we’ve hit an error. We report it by calling this:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>previous</em>()</div>
<pre>
<span class="k">private</span> <span class="t">ParseError</span> <span class="i">error</span>(<span class="t">Token</span> <span class="i">token</span>, <span class="t">String</span> <span class="i">message</span>) {
<span class="t">Lox</span>.<span class="i">error</span>(<span class="i">token</span>, <span class="i">message</span>);
<span class="k">return</span> <span class="k">new</span> <span class="t">ParseError</span>();
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>previous</em>()</div>
<p>First, that shows the error to the user by calling:</p>
<div class="codehilite"><div class="source-file"><em>lox/Lox.java</em><br>
add after <em>report</em>()</div>
<pre>
<span class="k">static</span> <span class="t">void</span> <span class="i">error</span>(<span class="t">Token</span> <span class="i">token</span>, <span class="t">String</span> <span class="i">message</span>) {
<span class="k">if</span> (<span class="i">token</span>.<span class="i">type</span> == <span class="t">TokenType</span>.<span class="i">EOF</span>) {
<span class="i">report</span>(<span class="i">token</span>.<span class="i">line</span>, <span class="s">" at end"</span>, <span class="i">message</span>);
} <span class="k">else</span> {
<span class="i">report</span>(<span class="i">token</span>.<span class="i">line</span>, <span class="s">" at '"</span> + <span class="i">token</span>.<span class="i">lexeme</span> + <span class="s">"'"</span>, <span class="i">message</span>);
}
}
</pre></div>
<div class="source-file-narrow"><em>lox/Lox.java</em>, add after <em>report</em>()</div>
<p>This reports an error at a given token. It shows the token’s location and the
token itself. This will come in handy later since we use tokens throughout the
interpreter to track locations in code.</p>
<p>After this is called, the user knows about the syntax error, but what does the
<em>parser</em> do next? Back in <code>error()</code>, it creates and returns a ParseError, an
instance of:</p>
<div class="codehilite"><pre class="insert-before">
class Parser {
</pre><div class="source-file"><em>lox/Parser.java</em><br>
nest inside class <em>Parser</em></div>
<pre class="insert">
<span class="k">private</span> <span class="k">static</span> <span class="k">class</span> <span class="t">ParseError</span> <span class="k">extends</span> <span class="t">RuntimeException</span> {}
</pre><pre class="insert-after">
private final List<Token> tokens;
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, nest inside class <em>Parser</em></div>
<p>This is a simple sentinel class we use to unwind the parser. The <code>error()</code>
method <em>returns</em> it instead of <em>throwing</em> because we want to let the caller
decide whether to unwind or not.</p>
<p>Some parse errors occur in places where the parser isn’t likely to get into a
weird state and we don’t need to <span name="production">synchronize</span>. In
those places, we simply report the error and keep on truckin’. For example, Lox
limits the number of arguments you can pass to a function. If you pass too many,
the parser needs to report that error, but it can and should simply keep on
parsing the extra arguments instead of freaking out and going into panic mode.</p>
<aside name="production">
<p>Another way to handle common syntax errors is with <strong>error productions</strong>. You
augment the grammar with a rule that matches the erroneous syntax. The parser
safely parses it but then reports it as an error instead of producing a syntax
tree.</p>
<p>For example, some languages have a unary <code>+</code> operator, like <code>+123</code>, but Lox does
not. Instead of getting confused when the parser stumbles onto a <code>+</code> at the
beginning of an expression, we could extend the unary rule to allow it:</p>
<div class="codehilite"><pre>
<span class="i">unary</span> → ( <span class="s">"!"</span> | <span class="s">"-"</span> | <span class="s">"+"</span> ) <span class="i">unary</span>
| <span class="i">primary</span> ;
</pre></div>
<p>This lets the parser consume <code>+</code> without going into panic mode or leaving the
parser in a weird state.</p>
<p>Error productions work well because you, the parser author, know <em>how</em> the code
is wrong and what the user was likely trying to do. That means you can give a
more helpful message to get the user back on track, like, “Unary ‘+’ expressions
are not supported.” Mature parsers tend to accumulate error productions like
barnacles since they help users fix common mistakes.</p>
</aside>
<p>In our case, though, the syntax error is nasty enough that we want to panic and
synchronize. Discarding tokens is pretty easy, but how do we synchronize the
parser’s own state?</p>
<h3><a href="#synchronizing-a-recursive-descent-parser" name="synchronizing-a-recursive-descent-parser"><small>6 . 3 . 3</small>Synchronizing a recursive descent parser</a></h3>
<p>With recursive descent, the parser’s state<span class="em">—</span>which rules it is in the middle of
recognizing<span class="em">—</span>is not stored explicitly in fields. Instead, we use Java’s
own call stack to track what the parser is doing. Each rule in the process of
being parsed is a callframe on the stack. In order to reset that state, we need
to clear out those callframes.</p>
<p>The natural way to do that in Java is exceptions. When we want to synchronize,
we <em>throw</em> that ParseError object. Higher up in the method for the grammar rule
we are synchronizing to, we’ll catch it. Since we are synchronizing on statement
boundaries, we’ll catch the exception there. After the exception is caught, the
parser is in the right state. All that’s left is to synchronize the tokens.</p>
<p>We want to discard tokens until we’re right at the beginning of the next
statement. That boundary is pretty easy to spot<span class="em">—</span>it’s one of the main reasons
we picked it. <em>After</em> a semicolon, we’re <span name="semicolon">probably</span>
finished with a statement. Most statements start with a keyword<span class="em">—</span><code>for</code>, <code>if</code>,
<code>return</code>, <code>var</code>, etc. When the <em>next</em> token is any of those, we’re probably
about to start a statement.</p>
<aside name="semicolon">
<p>I say “probably” because we could hit a semicolon separating clauses in a for
loop. Our synchronization isn’t perfect, but that’s OK. We’ve already reported
the first error precisely, so everything after that is kind of “best effort”.</p>
</aside>
<p>This method encapsulates that logic:</p>
<div class="codehilite"><div class="source-file"><em>lox/Parser.java</em><br>
add after <em>error</em>()</div>
<pre>
<span class="k">private</span> <span class="t">void</span> <span class="i">synchronize</span>() {
<span class="i">advance</span>();
<span class="k">while</span> (!<span class="i">isAtEnd</span>()) {
<span class="k">if</span> (<span class="i">previous</span>().<span class="i">type</span> == <span class="i">SEMICOLON</span>) <span class="k">return</span>;
<span class="k">switch</span> (<span class="i">peek</span>().<span class="i">type</span>) {
<span class="k">case</span> <span class="i">CLASS</span>:
<span class="k">case</span> <span class="i">FUN</span>:
<span class="k">case</span> <span class="i">VAR</span>:
<span class="k">case</span> <span class="i">FOR</span>:
<span class="k">case</span> <span class="i">IF</span>:
<span class="k">case</span> <span class="i">WHILE</span>:
<span class="k">case</span> <span class="i">PRINT</span>:
<span class="k">case</span> <span class="i">RETURN</span>:
<span class="k">return</span>;
}
<span class="i">advance</span>();
}
}
</pre></div>
<div class="source-file-narrow"><em>lox/Parser.java</em>, add after <em>error</em>()</div>
<p>It discards tokens until it thinks it found a statement boundary. After catching
a ParseError, we’ll call this and then we are hopefully back in sync. When it
works well, we have discarded tokens that would have likely caused cascaded
errors anyway and now we can parse the rest of the file starting at the next
statement.</p>
<p>Alas, we don’t get to see this method in action, since we don’t have statements
yet. We’ll get to that <a href="statements-and-state.html">in a couple of chapters</a>. For now, if an
error occurs, we’ll panic and unwind all the way to the top and stop parsing.
Since we can only parse a single expression anyway, that’s no big loss.</p>
<h2><a href="#wiring-up-the-parser" name="wiring-up-the-parser"><small>6 . 4</small>Wiring up the Parser</a></h2>
<p>We are mostly done parsing expressions now. There is one other place where we
need to add a little error handling. As the parser descends through the parsing
methods for each grammar rule, it eventually hits <code>primary()</code>. If none of the
cases in there match, it means we are sitting on a token that can’t start an
expression. We need to handle that error too:</p>
<div class="codehilite"><pre class="insert-before">
if (match(LEFT_PAREN)) {