diff --git a/asset/sass/chapter.scss b/asset/sass/chapter.scss index 167c84967..f436fb328 100644 --- a/asset/sass/chapter.scss +++ b/asset/sass/chapter.scss @@ -99,9 +99,13 @@ article.chapter { .design-note { background: hsl(80, 30%, 96%); - code, pre { + code, .codehilite { background: hsl(80, 20%, 93%); } + + .codehilite { + margin: -12px 0 -12px -12px; + } } } diff --git a/asset/style.scss b/asset/style.scss index a2f4ae435..b73e807f2 100644 --- a/asset/style.scss +++ b/asset/style.scss @@ -357,7 +357,7 @@ aside { padding: 1px 2px; } - pre { + .codehilite { padding: 6px; margin: -12px 0; } diff --git a/book/scanning.md b/book/scanning.md index 9e8101644..ac5a05191 100644 --- a/book/scanning.md +++ b/book/scanning.md @@ -1,461 +1,437 @@ ^title Scanning ^part A Tree-Walk Interpreter in Java

+**TODO: explain snippet notation in introduction**

-**TODO: context lines aren't correct. showing content from later or earlier snippets.**

+**TODO: consider reorganizing the headers and subheaders**

----

+The first step in any compiler or interpreter is scanning. The scanner takes in the raw source code as a series of characters and groups them into meaningful chunks -- the "words" and "punctuation" that make up the language's artificial grammar.

-- first step of lang is scanning -- also great first chapter because pretty easy -- by end of chapter, be able to take any string of lox code and chunk into - tokens to later feed into parser

+This is a good starting point for us too because the code isn't very hard -- pretty much a switch statement with delusions of grandeur. It will let us get warmed up before we tackle some of the more interesting material later.

+By the end of this chapter, we'll have a full-featured, fast scanner that can handle any string of Lox source code and produce the tokens that we'll feed into the parser in the next chapter.

+## The Interpreter Framework

+Since this is our first real chapter, before we get to actually scanning some code, we need to tie together the basic application framework for our interpreter, jlox.

+Because everything starts with a class in Java, we'll begin like so:

^code lox-class

-- doesn't do much -- still, makes you get your project set up and figure out ide and stuff -- also explains notation here for code snippets -- note file name this should go in

+Stick that in a text file, and go get your IDE and Makefile or whatever set up. I'll be right here when you're ready. Good? OK, let's keep going.

-- lox is scripting lang -- two ways to run code -- if give jlox path to file -- loads and runs it

+Lox is a scripting language, which means it executes directly from source. There are actually two ways you can run some code. If you start jlox from the command line and give it a path to a file, it reads it and executes it:

^code run-file

-- other way is "interactive prompt" -- lets user incrementally build up program one line at time -- if run jlox with no arg, enters this mode

+If you like a closer emotional connection to your interpreter, you can also run it interactively. If you fire up jlox without any arguments, it drops you into a prompt where you can enter and execute code one line at a time.

-- reads line of input, executes it and loops -- (ctrl-c exits) -- called "repl" -- lisp history -- both use:

+^code prompt

+To escape that infinite loop in jlox, hit Control-C or yank the plug out of your machine if you have anger management problems.
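For reference, here's a minimal sketch of what that `prompt` snippet expands to, pieced together from the `runPrompt()` method in the Lox.java portion of this diff (the `hadError` reset gets added later in the chapter):

```java
private static void runPrompt() throws IOException {
  InputStreamReader input = new InputStreamReader(System.in);
  BufferedReader reader = new BufferedReader(input);

  for (;;) {
    // Show a prompt, read one line, and run it. Forever.
    System.out.print("> ");
    run(reader.readLine());
  }
}
```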
+
+Both the prompt and the file runner are thin wrappers around this core function:

^code run

-- eventually, this will plumb through parser and interpreter -- for now, just prints tokens so we can see what scanner produces

+It's not super useful yet since we haven't written the interpreter, but baby steps, you know? Right now, we'll just have it print out the tokens our forthcoming scanner spits out so that we can see if we're making progress. Feel free to run this periodically as you work your way through the chapter to see how it does.

+### Error handling

+While we're setting things up, another key piece of infrastructure is *error handling*. Textbooks sometimes gloss over this because it's more an engineering concern than a formal computer science-y problem. But if you care about making a language that's actually *usable*, then how your interpreter handles errors is a vital concern.

+The tools your language provides for dealing with errors make up a large portion of your language's user interface. When a user's code is working, they aren't thinking about your language at all -- their headspace is all about *their program*. It's only when things go sideways that they focus on your beautiful language and its implementation.

-### error handling

-- another key part of interpreter is how it manages errors -- often left out of textbooks, but vital when comes to real impl -- users need help them most when program isn't doing what they want -- error handling pervasive part of job -- sooner start, the better -- confess: pretty simple here, though

+When that happens, it's up to us to give the user all of the information they need to understand what went wrong and guide them tactfully back to where they are trying to go. Doing that well means thinking about error handling all through the implementation of our interpreter, starting now.

-- put in framework now and use it later

+Having said all that, for *this* interpreter, what we'll build is pretty bare bones. I'd love to talk about interactive debuggers, static analyzers, and other fun stuff, but there's only so much time.

+For now, we'll start with:

^code lox-error

-- tells user error occurred on given line -- telling user that error occurred not very useful -- need to tell them where -- if get error trying to add num to bool, not helpful to say "some + somewhere - in prog is bad, good luck finding it!" -- better would be line, column and length -- best would allow multiple source locations since some errors involved multiple - points in code

+This tells users some syntax error occurred on a given line. This is really the bare minimum to be able to claim you even *have* error reporting. Imagine you accidentally leave a dangling comma in some function call and the interpreter prints out:

+```
+Error: Unexpected "," somewhere in your program. Good luck finding it!
+```

+So we at least need to point them to the right line. Even better would be the beginning and end column so they know where in the line to look. Even better than *that* is to *show* the user the offending line, like:

-- just line to keep book simpler

-- reason defining in lox class is because of hadError

+```text
+Error: Unexpected "," in argument list.
+
+    15 | function(first, second,);
+                               ^-- Here.
+```

+I'd love to implement something like that in this book, but the honest truth is that it's a lot of grungy string munging code. Very useful for users, but not super fun to read in a book and not very technically interesting. So we'll stick with just a line number.
In your interpreters, please do kick it up a notch.

+The primary reason we're sticking this error reporting function in the main Lox class is because of that `hadError` field. It's defined here:

^code had-error (1 before)

-- when error occurs while loading script, want to set exit code -- nice to be good command line citizen -- [yes static is hacky]

+If a syntax error occurs when running a script, we want to exit with a non-zero exit code like a good command line citizen should:

^code exit-code (1 before, 1 after)

-- also generally code to separate error *reporting* code from error *generating* - code -- scanner detects error, but not really its job to know how best to present it - to user -- in prod language, should pass in some kind of ErrorReporter interface to - scanner -- abstract how error displayed -- can print to stderr, show popup on screen, add errors to ide's error log, etc. -- to keep simple, don't have actual abstraction here, but do at least split it - out some - -- shell is in place -- once have scanner class with scanTokens() working, can start using -- before get to that, talk about tokens

+We also need to reset it in the prompt. If the user makes a mistake, they should be able to keep going:

+^code reset-had-error (1 before, 1 after)

+The other reason I pulled the error reporting out here instead of stuffing it into the scanner and other phases where the error occurs is to remind you that it's a good engineering practice to separate the code that *generates* the errors from the code that *reports* them.

+The various phases of the front end will detect errors, but it's not really their job to know how to present that to a user. In a full-featured language implementation, you will likely have multiple ways errors can get displayed: on stderr, in an IDE's error window, logged to a file, etc. You don't want that code jammed in your parser.

+Ideally, we would have an actual abstraction, some kind of "ErrorReporter" interface that gets passed to the scanner and parser so that we can swap out different reporting strategies. For our simple interpreter here, I didn't do that, but I did at least separate out the code for error reporting.

+And with that, our basic shell is in place. Once we have a Scanner class with some `scanTokens()` method, we can start running it. Before we get to that, let's talk about these mysterious "tokens" and the prizes they may or may not be redeemed for.

## Tokens and Lexemes

-- what is token? -- smallest sequence of chars that is meaningful -- in `name = "lox";` "name", "=", `"lox"` and `;` all meaningful -- `na` is not, neither is `ox"`. -- scanner's just is to go through string of chars, find meaningful units -- each is called lexeme -- lexeme just raw sequence of chars - -- in process of recognizing lexemes, also figure out other useful stuff - -### token type - -- if lexeme is a word, like `while` can also recognize that it's keyword `while` -- since keywords affect grammar of language, parser will often need logic like, - "if next token is `while` then ..." -- technically, is redundant with lexeme -- could compare strings -- but very slow and kind of ugly -- so at point when we recognize lexeme, which also store which *type* of token - it represents -- which keyword, punctuation, operator, or literal -- simple enum

+Here's a line of Lox code:

+```lox
+var language = "lox";
+```

+Here, `var` is the keyword for declaring a variable. That three-character sequence *means* something.
If we yank, say, `gua` out of the middle of `language`, those three characters don't mean anything on their own.

+That's what lexical analysis is really about. Our job is to scan through the list of characters and group them together into the smallest possible sequences that still have a well-defined meaning. Each of these is called a **lexeme**.

+In that line of code, the lexemes are:

+```
+var
+language
+=
+"lox"
+;
+```

+**TODO: illustrate**

+The lexemes are just the raw substrings of the source code. However, in the process of recognizing those and drawing boundaries between each one, we also stumble upon some other useful information. Things like:

+### Lexeme type

+If the lexeme is an identifier whose name matches one of the language's reserved words, like `while` or `if`, we can recognize that now. Since keywords are part of the grammatical structure of the language, the parser often has logic like, "If the next token is `while`, then parse a while statement."

+Technically, the parser can determine that right from the lexeme by comparing the strings. But that's slow and kind of ugly. Instead, at the point that we recognize a lexeme, we'll also remember which *kind* of lexeme it represents. We'll have a different type for each keyword, operator, bit of punctuation, and literal value.

+It's a simple enum:

^code token-type

-### literal

+### Literal value

+Some lexemes represent literal values -- numbers and strings and the like.
+Since the scanner has to walk each character in the literal to correctly identify it, it can also convert it to its actual runtime value as used by the interpreter later.

+For example, after the scanner walks over the characters `123` in a number literal, we can convert them to the actual numeric value 123.

+### Location information

+Back when I was on my soapbox about error handling, we saw that we need to tell users *where* errors occurred. We have to keep track of that information through every phase of the interpreter, starting here.

+In our simple interpreter, we just track which line the token appears on, but more sophisticated implementations would track the column and length too.

-- some lexemes for literals -- at point that scanner detects literal, can also produce runtime value -- if lexeme is number `123` can convert to actual number value 123

+We take all of this and wrap it up in a class:

^code token-class

-- java really verbose for dumb data object, but that's it

+And *that's* what a **token** is -- a bundle containing the raw lexeme along with the other things the scanner knows about it.
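For the curious, the `token-class` snippet is little more than a bag of those fields. This sketch mirrors the Token.java changes later in this diff:

```java
class Token {
  final TokenType type;    // Which kind of lexeme this is.
  final String lexeme;     // The raw substring of source code.
  final Object literal;    // The runtime value, for literals.
  final int line;          // Where the token appears, for error reporting.

  Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
  }
}
```

Java is pretty verbose for a dumb data object like this, but that's all there is to it.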
## Regular Languages and Expressions

-- now know what we need to produce, let's produce it -- core of scanner is loop -- starting at beginning of source, figure out what lexeme first char is part of -- consume as many chars as belong to that lexeme -- produce token -- repeat with rest of string -- when whole string is done, done scanning - -- process of matching chars might seem familiar -- if ever used regular expression, might consider using regex to do it -- ex: if source is `breakfast = "croissant";`, first lexeme is `breakfast` - identifier -- could use regex like `[a-zA-Z_][a-zA-Z_0-9]*` to match it -- captures underlying rule that identifier start with letter or underscore - followed by zero or more letters, underscores or digits -- you have deep insight here - -- rules that determine what chars are allowed in each kind of lexeme are called - "lexical grammar" -- in lox, as in most languages, rules are simple enough to fit within - restriction called "regular language" -- lot of interesting theory here about what makes language regular, how it - ties to fsms -- most other pl books cover well, not getting into here -- same "regular" in "regular" expression -- you *can* make a scanner that uses regexs to match lexemes -- could use java's regex lib -- also tools like lex/flex that will take whole file of regex rules and - generate scanner - -- want to understand how they work -- hand-build scanner for our language's rules -- basic scan loop - -## The Scanner - -- let's sketch out class

+Now that we know what we're trying to produce, let's, well, produce it. The core of the scanner is a loop. Starting at the beginning of the source code, it figures out what lexeme the first character belongs to. Then it consumes any following characters that belong to that same lexeme.

+When it hits the end of that lexeme, it emits a token. Then it loops back and does it again, starting from the very next character in the source code. It keeps doing that, eating characters and occasionally, uh, excreting tokens, until it runs out of characters.

+**TODO: illustrate**

+That first step inside the loop -- looking at the first couple of characters to figure out which kind of lexeme they *match* -- might sound familiar. If you know regular expressions, you might consider defining a regex for each kind of lexeme and using those to match characters. For example, Lox has the same identifier rules as C, and the regex `[a-zA-Z_][a-zA-Z_0-9]*` matches one.

+If you did think of regular expressions by now, your intuition is a deep one. The rules that determine how characters are associated with different lexemes for a language are called its **lexical grammar**. In Lox, as in most languages, the rules of that grammar are simple enough to fit within a restricted class called a **[regular language][]**. That's the same "regular" as in regular expressions.

+[regular language]: https://en.wikipedia.org/wiki/Regular_language

+You really *can* recognize all of the different lexemes for Lox using
+regexes if you want to, and there's a pile of interesting theory underlying why that is and what that means. For
+jlox, we could even use Java's regex library for our scanner. Or we could break
+out a tool like [Lex][] or [Flex][] to take all of the regular expressions for
+Lox's lexical grammar and spit out an entire scanner for us.
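To make that concrete, here's a quick sketch of matching a single Lox identifier with Java's built-in regex library -- purely an illustration of the idea, not how our scanner will work:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {
  public static void main(String[] args) {
    Pattern identifier = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");
    Matcher matcher = identifier.matcher("language = \"lox\";");

    // lookingAt() anchors the match at the start of the input.
    if (matcher.lookingAt()) {
      System.out.println(matcher.group()); // Prints "language".
    }
  }
}
```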
+
+
+
+[lex]: http://dinosaur.compilertools.net/lex/
+[flex]: https://github.com/westes/flex

+But our goal is to understand how a scanner works inside, so we won't be
+outsourcing that task. We're all about hand-crafted goods here.

+## A Scanner for Lox

+Without further ado, let's make ourselves a scanner.

^code scanner-class

-- (seems like creating awful lot of files. early chapters do lot of framework - gets better later)
+

-- like said, scanner walks string -- finds range of chars that map to lexeme -- when hits end, creates token -- loop until reach end -- then done -- core looks like:

+We store the raw source code as a simple string, and we have an empty list that we will fill with tokens as we generate them. The aforementioned loop that does that looks like this:

^code scan-tokens

-- eat way through string emitting tokens as we go -- when done, add special eof token to end of list -- not strictly needed, but makes parser little cleaner -- few fields to keep track of where we are

+It works its way through the source code until it runs out of characters. When it's done, it adds one final special "end of file" token to the end. That isn't strictly needed, but it will make our parser a little cleaner.

+This loop depends on a couple of fields to keep track of where in the source code we are:

^code scan-state (1 before, 2 after)

-- current index of next char to consume in string -- tokenStart beginning of next token -- since later tokens will be more than one char long, need to remember beginning -- when produce token, will be substring from start to current -- line is current line number

+The `start` and `current` fields are indexes into the string -- the first character in the current lexeme being scanned, and the character we're currently considering. The other field tracks the line that contains `current`. We'll keep that updated as we go so we can produce tokens that know what line they occur on.

-- little helper fn

+Then we have one little helper function:

^code is-at-end

-## Recognizing Lexemes

+### Recognizing lexemes

+Each turn of the loop, we scan the next token. This is the real heart of the scanner. We'll start simple. Imagine if every lexeme was only a single character long. To implement that, you can just consume the next character and pick a token type for it.

-- finally get to real heart of scanner -- start simple -- imagine all tokens only single character -- how implement? -- easy, just consume next char -- based on what it is, produce token of right type -- lox does have few single-char tokens, so let's do those

+This works fine for several of Lox's real lexemes, so let's start there:

^code scan-token

-- uses couple of helper fns

+Again, we need a couple of helper methods:

^code advance-and-add-token

-- advance adds next char to current lexeme and returns it -- can call even when don't know what lexeme is yet, do know it's going to be - some lexeme

+The `advance()` method adds the next character to the lexeme we're currently building and then returns it. We can call this even before we know what kind of lexeme we're building since we know the character is going into *some* lexeme.

-- addtoken grabs current lexeme and line info and adds new token to end of list -- also have one for literal we'll use later

+If `advance()` is the input, then `addToken()` is the output. It grabs the text of the current lexeme and creates a new token for it. Later, we'll use the other overloaded version here to handle tokens with literal values.
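If you'd like to see those two helpers spelled out, here's a sketch consistent with the Scanner.java changes elsewhere in this diff, which build each lexeme with `source.substring(start, current)`:

```java
private char advance() {
  current++;
  return source.charAt(current - 1);
}

private void addToken(TokenType type) {
  addToken(type, null);
}

private void addToken(TokenType type, Object literal) {
  // The lexeme is the raw text between start and current.
  String text = source.substring(start, current);
  tokens.add(new Token(type, text, literal, line));
}
```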
-### invalid tokens

+### Lexical errors

-- before fill in rest of language, what happens if next character doesn't - match any lexeme? -- lox doesn't use `@` -- what if user types that in -- add little error handling

+Before we get too far in, let's take a moment to think about errors at the lexical level. Lox doesn't use the `@` character. What happens if a user throws a source file containing that at our interpreter? Right now, it just gets silently discarded. That ain't right.

+Let's fix that:

^code char-error (1 before, 1 after)

-- note still called advance(), so still consume char and move on -- important so don't get stuck in infinite loop -- note also don't stop scanning after this -- even though error occurred, want to keep going -- source file may have more than one error -- if possible, should find them all in one go -- no fun if user gets one error, fixes, then another appears and so on -- so strategy is to report error and try to keep truckin' -- won't actually *execute* code though -- if any compile error occurs, no running -- just want to find as many compile errors as we can -- would be better here to consume multiple unrecognized chars in one go so - don't shotgun errors if user has pile of bad chars

+Note that the erroneous character was still consumed by the call to `advance()`. That's important so that we don't get stuck in an infinite loop.

+Note also that we keep scanning. There may be other errors later in the program. It gives our users a better experience if we can detect as many of those as possible in one go. It's no fun if they see one tiny error and fix it, only to have the next error appear, and so on. Syntax error whack-a-mole is a drag.

+Even though we keep scanning, though, we won't try to *execute* this code. As soon as the first error is hit, `hadError` gets set so we know not to try to run it.

-### operators

-- ok, get back to filling in grammar -- have punctuation and single-char operators - -- not all operators -- what about `!`? -- it's a single char token too, right? -- not if followed by `=` -- if `!` char appears by itself, should be `!` token -- but if very next char is `!=`, then should be `!=` token -- likewise `<`, `>`, and `=` -- all can be followed by `=` - -- for those, need to check next char too

+### Operators

+We have single-character lexemes covered, but that doesn't cover all of Lox's operators. What about `!`? It's a single character, right? Sometimes, yes, but not if it's followed by a `=`. In that case, it should be a `!=` lexeme. Likewise, `<`, `>`, and `=` can all be followed by `=`.

+For those, we need to look at the second character:

^code two-char-tokens (1 before, 2 after)

-- uses helper fn

+Those use this new method:

^code match

-### maximal munch

-- our language doesn't have '--', but if it did, how would handle: a---b -- could be valid, if scanner broke it up like: a --b -- or even a - - b -- but means when scanner is looking at first two `--`, would have to know what - grammar context it is in to know where to split up -- do scanners do that? -- no: adds way too much entanglement -- basically destroys separation between scanner and parser -- instead, simple rule, called "maximal munch" -- scanner always eats as many characters as it can when forming current token -- so above is scanned as `a -- - b` even though that causes later parse error -- simple is better

+It's like a conditional `advance()`.
It only consumes the current character if it's what we're looking for.

-### comments and whitespace

+### Comments and whitespace

-- missing one operator, `/` -- little trickier because of `//`

+We're still missing one operator, `/`. That one needs a little special handling because we use `//` to begin a comment.

-^code slash

+^code slash (1 before, 2 after)

-- same general idea, after match one `/` if next char is `/`, need to handle - it differently -- comment consumes any char to end of line

+This is roughly similar to the other two-character operators. However, when we match a second `/`, then we know we're in a comment. At that point, we keep consuming characters until we hit the end of the line.

-^code comment

+This is our general strategy for handling longer lexemes. Once we've detected the beginning of a lexeme, we shunt off to some code specific to that kind of lexeme that keeps eating characters until it sees the end.

+We've got another helper:

^code peek

-- sort of like advance, but doesn't consume -- **lookahead** -- only looks at current unconsumed char, so *1* char lookahead -- important to keep this number small, affects perf of scanner -- grammar of language defines how small it can be -- if not *constant* adds lot of complexity to grammar

+It's sort of like `advance()`, but doesn't consume the character. This is called **lookahead**. Since it only looks at the current unconsumed character, we have *one character of lookahead*. The smaller this number is, generally, the faster our scanner will run. The lexical grammar dictates how much lookahead we need. Fortunately, most languages in wide use are designed to be scanned with one or two characters of lookahead.

-- don't want to use match() because want to handle newline later to keep track - of line -- note, don't emit token -- comments consumed, but not turned into token -- just discard -- no addToken() call -- that way rest of pipeline doesn't have to worry about them -- brings to other thing can discard, whitespace:

+After we consume all of the characters in the comment, we don't call `addToken()`. The lexeme just gets discarded the next time we loop around and start a new one. This is deliberate. Comments, by design, aren't interpreted by the language. By discarding them now, we simplify the parser since it won't have to worry about them.

+Newlines and other whitespace are also ignored, and now is a good time to do that:

^code whitespace (1 before, 3 after)

-- just line comments, spaces and other whitespace are consumed -- (remember, already advanced c) -- but emit no tokens -- newline char little special -- also discarded, but increment line -- that's all that's needed to keep track of what line we're on

+For spaces and tabs, we simply go back to the beginning of the scan loop. That will start a new lexeme after the whitespace character. For newlines, we do the same thing, but we also increment the line counter. (This is why we used `peek()` to find the newline after a comment instead of `match()`. We want that newline to get here and update `line`.)

-- code more free-form now -- can correctly scan

+Our scanner is starting to feel more real now. It can handle fairly free-form code like:

```lox
// this is a comment
-(()){} // grouping stuff
-!*+-/=<> // operators
+(( )){} // grouping stuff
+!*+-/=<> <= == // operators
```

-- making progress!
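To spell out what the scanner does with that sample, here are the tokens the operator line should produce, assuming the obvious names in our `TokenType` enum. Note how `match()` turns `<=` and `==` into single two-character tokens while both comments vanish entirely:

```text
BANG STAR PLUS MINUS SLASH EQUAL LESS GREATER LESS_EQUAL EQUAL_EQUAL
```

After the last real token, the list gets capped off with `EOF`.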
-
### String literals

-- if can handle tokens like `!=` that are two chars, ready to tackle longer - ones like number and string literals -- start with strings -- string token always starts with `"`, so begin there

+Now that we know how to handle multiple-character lexemes, we can add support for literals. We'll do strings first, since they always begin with a specific character, `"`:

^code string-start (1 before, 2 after)

-- calls:

+That calls:

^code string

-- like two-char operators, consume additional characters after first -- but here, do it in a loop until we hit closing `"` -- also need to safely handle hitting the end of the source without finding the - closing quote -- report unterminated string error

+Like with comments, we keep consuming characters until we hit the `"` that ends the string. We also need to gracefully handle running out of input before the string is closed and report an error for that.

-- otherwise, produce actual string literal value by stripping off quotes -- store that in token -- lox has no string escape sequences -- if did, would handle them here so literal had real chars

+For no particular reason, Lox supports multi-line strings. There are pros and cons to that, but prohibiting them was a little more complex than allowing them, so I left them in. This does mean we also need to update `line` when we hit a newline inside a string.

-- note allow multiline strings -- need to update line when newline appears in string too -- could make this an error but most langs seem to end up supporting

+Finally, the last interesting bit is that when we create the token, we also produce the actual *value* of the string literal that will be used later by the interpreter. Here, that conversion just requires a `substring()` to strip off the quotes. That's because we don't support escape sequences in Lox. If we did, we'd process those here.

### Number literals

-- lox has one number type, but both int and floating point literals -- int is series of digits -- floating point is series of digits, followed by `.` and more digits -- unlike some other langs, don't allow leading or trailing dots -- just

+All numbers in Lox are floating point at runtime, but it supports both integer and decimal literals. That means a number literal is a series of digits optionally followed by a `.` and one or more digits:

```lox
1234
12.34
```

+We don't allow a leading or trailing decimal point, so these are both invalid:

+```lox
+.1234
+1234.
+```

+We could easily support the former, but I left it out to keep things simple. The latter gets weird if we ever want to allow methods on numbers like `123.sqrt()`.

-- don't want cases for every digit in switch, so stuff in default

+To recognize the beginning of a number lexeme, we look for any digit. It's kind of tedious to add cases for every decimal digit, so we'll stuff it in the default instead:

^code digit-start (1 before, 1 after)

-- little helper

-- (not use `Character.isDigit()` has stuff like devangari)

+This relies on:

^code is-digit

-- like string, goes to separate fun to scan rest of number

+Once we know we are in a number, we branch to a separate method to consume the rest of the literal, like we do with strings:

^code number

-- scan sequential digits -- when run out of digits, look for fractional part -- only allow it if there is digit after dot -- (could allow trailing dot, but little weird in language with `.` method syntax.
don't currently allow methods on number literals like `123.abs()`, but wouldn't want to rule out.) -- need two chars of lookahead

+It consumes as many digits as it finds for the integer part of the literal.

+Then it looks for a fractional part, which is a decimal point (`.`) followed by at least one digit. This requires another character of lookahead since we don't want to consume the `.` until we're sure there is a digit *after* it. So we'll add:

^code peek-next

-- then use Java to convert number to string -- could do this ourselves -- common interview question -- but kind of silly

+If we have a fractional part, again, we consume as many digits as we can find.

+Finally, we convert the lexeme to its numeric value. Our interpreter will use Java's `double` type to represent numbers, so we produce a value of that type. We're using Java's own parsing method to convert the lexeme to a real Java double. We could implement this ourselves, but really, unless you're trying to cram for an upcoming programming interview, it's not worth our time.

-- last literals are boolean and null, but handle those as keywords, which - gets us too...

+The remaining literals are Booleans and `nil`, but we'll handle those as keywords, which gets us to...

### Identifiers and keywords

-- almost everything -- in beginning was word, but for us its at the end

-- might think we could handle reserved words like we handle multi-char operators like `<=` -- like:

+Our scanner is almost done. The only things left in the lexical grammar to implement are identifiers and their close cousins, the reserved words. You might think we could match keywords like we handle multiple-character operators like `<=`. Something like:

```java
case 'o':
@@ -465,63 +441,156 @@ case 'o':
    break;
```

-- consider if user types in 'orchid' -- [fortran aside] -- remember maximal munch!

+But now consider if a user names a variable `orchid`. We don't want the scanner to see the first two letters `or` and immediately emit an `or` keyword token. This gets us to an important principle in lexical grammars called **maximal munch**. When two lexical grammar rules can both match a chunk of code the scanner is looking at, *the longest one wins*.

+That rule states that if we match `orchid` as an identifier and `or` as a keyword, the former wins. It's also why we tacitly assume above that `<=` should be scanned as a single `<=` token and not `<` followed by `=`.

+This means we can't easily detect a reserved word until we've reached the end of what might instead be an identifier. After all, a reserved word *is* an identifier, it's just one that has been claimed by the language for its own use. That's where the term **"reserved word"** comes from.

+Instead, we'll assume any lexeme starting with a letter or underscore is an identifier:

**TODO: make sure context lines look right here**

^code identifier-start (3 before, 3 after)

-- like number, put in default so don't need cases for every char that can start identifier -- calls

+That calls:

^code identifier

-- both of those use these obvious helpers

+Those use these helpers:

^code is-alpha

-- to handle keywords, after finish identifier, see if lexeme matches any of - known set of reserved words -- [if use flex to generate lexer, rolls keywords into main fsm.
advanced hand-written lexers still sometimes use fsm for this since perf critical] -- each keyword has its own token type -- makes parser simpler -- so need to associate keywords with token types -- use a map +Now identifiers are working. To handle keywords, we just see if the identifier's lexeme is one of the reserved words. If so, we'll use a token type specific to that keyword. (That will make parsing easier later.) We'll define this set of reserved words in a map: ^code keyword-map -- static since immutable global property of lox language -- [may be first time ever used static block in java] - -- then when scanning ident, see if keyword +Then, after we scan an identifier, we check to see if it matches one of these keywords: ^code keyword-type (2 before, 1 after) -- after scanning ident, look up lexeme in map -- if found, use that type -- otherwise, must be ident - -## challenges - -- challenge - - many langs use newlines as statement separator - - have to handle case where newline occurs in place that should not end - statement - - explain how js, ruby, python, go, and lua handle that - - which do you prefer? - - python's lexer isn't regular, why not? - - aside from separating tokens, spaces aren't used for much. it does - come into play in CoffeeScript, Ruby, and C preprocessor. where? - - scanner discards whitespace and comments - - not meaningful to lang, so don't need - - some scanners keep them - - why? +If so, we use that keyword's token type. Otherwise, it's a regular user-defined identifier. + +And with that, we now have a complete scanner for the entire Lox lexical grammar. Fire up the REPL and type some valid and invalid code in. Does it produce the tokens you expect? Try to come up with some interesting edge cases and see if it handles them as it should. + +
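For example, assuming you give Token a simple `toString()` that prints the type, lexeme, and literal (the exact format is up to you), a quick session might look like this:

```text
> var beverage = "espresso";
VAR var null
IDENTIFIER beverage null
EQUAL = null
STRING "espresso" espresso
SEMICOLON ; null
EOF  null
```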
+
+## Challenges

+1. The lexical grammars of Python and Haskell are not *regular*. What does that
+   mean, and why aren't they?

+1. Aside from separating tokens -- distinguishing `print foo` from
+   `printfoo` -- spaces aren't used for much in most languages. However, they
+   do affect how code is parsed in CoffeeScript, Ruby, and the C preprocessor.
+   Where and what effect do they have in each language?

+1. Our scanner here, like most, discards comments and whitespace since those
+   aren't needed by the parser. Why might you want to write a scanner that does
+   *not* discard those? What would it be useful for?

+1. Add support to Lox's scanner for C-style `/* ... */` block comments. Make
+   sure to handle newlines in them. Consider allowing them to nest. Is adding
+   support for nesting more work than you expected? Why?
+
+ +
+
+## Design Note: Implicit Semicolons

+Programmers today are spoiled for choice in languages and have gotten picky about the look and feel of their language's syntax. They want their language to look clean and modern. One bit of syntactic lichen that almost every new language eliminates (and some ancient ones like BASIC never had) is `;` as an explicit statement separator.

+Instead, they treat a newline as a statement separator where it makes sense to do so. The "where it makes sense" part is the interesting bit. While *most* statements are on their own line, sometimes you need to spread a single statement across a couple of lines. Those embedded newlines should not be treated as separators.

+Most of the obvious cases are easy to detect, but there are a handful of nasty cases you run into:

+* A return value on the next line:

    :::js
    return
    "value"

  Is "value" the value being returned, or do we have a return statement with no value followed by an expression statement containing a string literal?

+* A parenthesized expression on the next line:

    :::js
    func
    (parenthesized)

  Is this a call to `func(parenthesized)`, or two expression statements, one for `func` and one for a parenthesized expression?

+* A `-` on the next line:

    :::js
    first
    -second

  Is this `first - second` -- an infix subtraction -- or two expression statements, one for `first` and one to negate `second`?

+In all of these, either treating the newline as a separator or not would produce valid code, but possibly not the code the user wants. Across languages, there is an unsettling variety in the rules they use to decide which newlines are separators. Here are a few:

+* [Lua][] completely ignores newlines, but carefully controls its grammar such that no separator between statements is needed at all in most cases. This is perfectly valid:

    :::lua
    a = 1 b = 2

  It avoids the `return` problem above by requiring a `return` statement to be the very last statement in a block. If there is a value after `return` before the keyword `end`, it *must* be for the return. For the other two cases, Lua allows an explicit `;` and expects users to use that. In practice, that almost never happens because there's no point in a parenthesized or unary negation expression statement.

+* [Go][] handles it in the scanner. If a newline appears following one of a handful of token types that are known to potentially end a statement, the newline is treated like a semicolon. The Go team provides a canonical code formatter, [gofmt][], and the ecosystem is fervent about its use, which ensures that idiomatic styled code works well with this simple rule.

+* [Python][] treats all newlines as significant unless an explicit backslash is used at the end of a line to continue it to the next line. Also, newlines anywhere inside a pair of brackets (`()`, `[]`, or `{}`) are ignored. Idiomatic style strongly prefers the latter.

  This rule works well for Python because it is a strongly statement-oriented language. In particular, Python's grammar disallows a statement ever appearing inside an expression. This is also true of C, but not true of many other languages which have a "lambda" or function literal syntax.

  For example, in JavaScript, you can have:

    :::js
    console.log(function() {
      statementInAnExpression();
    });

  Python would need a different set of rules for implicitly joining lines if you could get back *into* a statement where newlines should become meaningful while still nested inside brackets.
+
+
+* JavaScript's "[automatic semicolon insertion][asi]" rules are the outliers. Where other languages assume most newlines *are* meaningful and there are just a few in multi-line statements that should be ignored, JS assumes the opposite. It treats all of your newlines as meaningless whitespace unless that generates a parse error. If it does, it goes back and figures out the minimal set of newlines to turn into semicolons to get to something grammatically valid.

  This design note would turn into a design essay if I went into complete detail about how that even works, much less all the various ways that it's a bad idea. It's a mess. JavaScript is the only language I know where many style guides demand explicit semicolons after every statement even though the language theoretically lets you elide them.

+[lua]: https://www.lua.org/pil/1.1.html
+[go]: https://golang.org/ref/spec#Semicolons
+[gofmt]: https://golang.org/cmd/gofmt/
+[python]: https://docs.python.org/3.5/reference/lexical_analysis.html#implicit-line-joining
+[asi]: https://www.ecma-international.org/ecma-262/5.1/#sec-7.9

+If you're designing a new language, you almost surely *should* avoid an explicit statement separator. Programmers are creatures of fashion like other humans, and semicolons are as passé as ALL CAPS KEYWORDS. Just make sure you pick a set of rules that make sense for your language's particular grammar and idioms. And, uh, don't do what JavaScript did.
+
\ No newline at end of file diff --git a/java/com/craftinginterpreters/lox/Lox.java b/java/com/craftinginterpreters/lox/Lox.java index 84b100eb7..47502eb3b 100644 --- a/java/com/craftinginterpreters/lox/Lox.java +++ b/java/com/craftinginterpreters/lox/Lox.java @@ -27,7 +27,7 @@ public static void main(String[] args) throws IOException { } else if (args.length == 1) { runFile(args[0]); } else { - repl(); + runPrompt(); } } //> run-file @@ -44,17 +44,23 @@ private static void runFile(String path) throws IOException { //< Evaluating Expressions not-yet } //< run-file -//> repl - private static void repl() throws IOException { +//> prompt + private static void runPrompt() throws IOException { InputStreamReader input = new InputStreamReader(System.in); BufferedReader reader = new BufferedReader(input); - for (;;) { + for (;;) { // [repl] System.out.print("> "); run(reader.readLine()); +//> reset-had-error + hadError = false; +//< reset-had-error +//> Evaluating Expressions not-yet + hadRuntimeError = false; +//< Evaluating Expressions not-yet } } -//< repl +//< prompt //> run private static void run(String source) { Scanner scanner = new Scanner(source); diff --git a/java/com/craftinginterpreters/lox/Scanner.java b/java/com/craftinginterpreters/lox/Scanner.java index 68649583a..2ca0ef53d 100644 --- a/java/com/craftinginterpreters/lox/Scanner.java +++ b/java/com/craftinginterpreters/lox/Scanner.java @@ -8,7 +8,7 @@ import static com.craftinginterpreters.lox.TokenType.*; -class Scanner { +class Scanner { // [files] //> keyword-map private static final Map keywords; @@ -78,7 +78,8 @@ private void scanToken() { //> slash case '/': if (match('/')) { - comment(); + // A comment goes until the end of the line. + while (peek() != '\n' && !isAtEnd()) advance(); } else { addToken(SLASH); } @@ -122,12 +123,6 @@ private void scanToken() { } } //< scan-token -//> comment - private void comment() { - // A comment goes until the end of the line. - while (peek() != '\n' && !isAtEnd()) advance(); - } -//< comment //> identifier private void identifier() { while (isAlphaNumeric(peek())) advance(); @@ -157,9 +152,8 @@ private void number() { while (isDigit(peek())) advance(); } - double value = Double.parseDouble( - source.substring(start, current)); - addToken(NUMBER, value); + addToken(NUMBER, + Double.parseDouble(source.substring(start, current))); } //< number //> string @@ -218,7 +212,7 @@ private boolean isAlphaNumeric(char c) { //> is-digit private boolean isDigit(char c) { return c >= '0' && c <= '9'; - } + } // [is-digit] //< is-digit //> is-at-end private boolean isAtEnd() { diff --git a/java/com/craftinginterpreters/lox/Token.java b/java/com/craftinginterpreters/lox/Token.java index ac6188c0d..dad6ecca4 100644 --- a/java/com/craftinginterpreters/lox/Token.java +++ b/java/com/craftinginterpreters/lox/Token.java @@ -5,7 +5,7 @@ class Token { final TokenType type; final String lexeme; final Object literal; - final int line; + final int line; // [location] Token(TokenType type, String lexeme, Object literal, int line) { this.type = type; diff --git a/note/chapter/parsing-expressions.md b/note/chapter/parsing-expressions.md index 09055de7e..cd4293174 100644 --- a/note/chapter/parsing-expressions.md +++ b/note/chapter/parsing-expressions.md @@ -38,3 +38,11 @@ https://www.cs.cmu.edu/~pattis/misc/ebnf.pdf - is an empty source file a valid program? - write the bnf for a language you know. if the language is complex, just do an interesting subset. what parts are hard? 
+
+--
+
+error recovery:
+
+The tricky part, of course, is that the first error may *cause* later **cascaded errors**. For example, if they accidentally started a string with `'` instead of `"`, then the rest of the string literal will likely cause a number of bogus syntax errors when the scanner and parser try to treat it like code.
+
+There is an art, called **error recovery**, to getting back to a good state after an error is found, minimizing the number of later spurious errors. We'll talk more about it during parsing. diff --git a/note/log.txt b/note/log.txt index 54f736622..8f9d5a644 100644 --- a/note/log.txt +++ b/note/log.txt @@ -1,3 +1,9 @@ +2017/01/01 - 772 words design note for scanner, aside markers in code +2016/12/31 - 1081 words first draft scanner, mostly done +2016/12/30 - 1085 words first draft scanner +2016/12/29 - 722 words first draft scanner +2016/12/28 - 561 words first draft scanner +2016/12/27 - 1127 words first draft scanner 2016/12/26 - finish outlining and splitting, reallow multiline strings 2016/12/25 - fix some bugs in chapter splitting, make multiline strings and error 2016/12/24 - slice up more scanning code into snippets diff --git a/site/contents.html b/site/contents.html index f389c81fe..f1c5a1f61 100644 --- a/site/contents.html +++ b/site/contents.html @@ -118,7 +118,6 @@

[site/contents.html hunks: regenerated table-of-contents markup. The "1. Scanning" chapter entry drops its "(coming soon!)" label, and a "Design Note: Implicit Semicolons" entry is added under its section list.]
diff --git a/site/scanning.html index bd29dfa69..ce3cced9e 100644 --- a/site/scanning.html +++ b/site/scanning.html @@ -23,7 +23,14 @@ @@ -48,7 +55,14 @@ @@ -66,9 +80,9 @@
[site/scanning.html hunks: regenerated navigation links and section table of contents for the Scanning chapter page.]