The LEXER Package
The lexer package is a tokenizer for Common Lisp that makes heavy use of regular expressions.
Creating a Lexer Function
The lexer package allows you to use regular expressions to pattern match against an input buffer and return tokens, much in the same way that Lex does. Using the define-lexer macro, you can define a function that will attempt to match a list of patterns against a buffer.
(define-lexer lexer (state-var) &body patterns)
The state-var is the current lexical state: it tracks the lexical buffer being parsed (the lexbuf) as well as a stack of lexer functions (defined with define-lexer). More on this later.
Here is a simple example:
CL-USER > (define-lexer my-lexer (state)
            ("%s+"   (values :next-token))
            ("="     (values :eq))
            ("%a%w*" (values :ident $$))
            ("%d+"   (values :int (parse-integer $$))))
MY-LEXER
NOTE: If you don't understand the $$ symbol in the example above, please see this README.
Each pattern should return either nil, indicating that the end of the input buffer has been reached, or up to two values: the class of the token and the value of the token. Returning :next-token for the class is special: it indicates that this token should just be skipped.
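The return convention described above can be modeled in plain Common Lisp, without the lexer package loaded. In this sketch, collect-tokens and its canned input list are hypothetical stand-ins: each entry plays the role of what one pattern body returned. It only illustrates the convention and is not the package's actual implementation.

```lisp
;; Hypothetical model of the return convention; not part of the lexer package.
(defun collect-tokens (results)
  "RESULTS is a list of (class value) lists, with NIL marking end of input."
  (loop for result in results
        ;; NIL means the end of the input buffer was reached
        while result
        ;; :next-token means "matched, but produce nothing" (e.g. whitespace)
        unless (eq (first result) :next-token)
          collect (list (first result) (second result))))

(collect-tokens '((:ident "x") (:next-token) (:eq) (:next-token) (:int 10) nil))
;; => ((:IDENT "x") (:EQ NIL) (:INT 10))
```

A class with no value, such as :eq here, simply carries NIL as its value.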
Now that we have a lexer function, the tokenize function can be called to parse a source string.
(tokenize lexer string &optional source)
Here, lexer is our function and string is what will be tokenized. The source argument can be used to identify where string came from (e.g. a pathname), as it will be used in error reporting.
NOTE: tokenize always creates a lexical state object for you, so there is no need to create one yourself when calling it.
Let's give it a try:
CL-USER > (tokenize 'my-lexer "x = 10")
(#<LEXER::TOKEN IDENT "x"> #<LEXER::TOKEN EQ> #<LEXER::TOKEN INT 10>)
If all the patterns in your lexer function fail to match, then a lex-error condition is signaled, letting you know exactly where the problem is located.
CL-USER > (tokenize 'my-lexer "x = $10" "REPL")
Error: Lexing error on line 1 of "REPL"
  1 (abort) Return to level 0.
  2 Return to top loop level 0.
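Some inputs follow different lexical rules depending on context: HTML embeds another language between <script> tags, and in many languages quoted strings are a mini-DSL unto themselves.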
The lexer functions you create with define-lexer all take a lexstate object as a parameter. The lexstate actually contains a stack of lexers, the top-most of which is the one currently being used to tokenize the input source. Within a lexer, you can push, pop, and swap to different lexers, while also returning tokens.
;; push a new lexer, return a token
(push-lexer state lexer class &optional value)

;; pop the current lexer, return a token
(pop-lexer state class &optional value)

;; swap to a different lexer, return a token
(swap-lexer state lexer class &optional value)
Each of these will change the current lexer, and also return a token at the same time! This is very useful as the token can be used to signal to the grammar to change parsing rules.
NOTE: Remember that :next-token is treated specially. If you return :next-token while also changing lexers, the new lexer will not be called until after a complete token has been returned from your current lexer! You'll almost never want to do this.
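To make the stack behavior concrete, here is a small model in plain Common Lisp. Everything in it (model-state, model-push, model-pop, model-swap) is a hypothetical name invented for illustration; it mirrors the documented argument order of push-lexer, pop-lexer, and swap-lexer, but says nothing about how the package implements them.

```lisp
;; Hypothetical model of the lexer stack; names are illustrative only.
(defstruct model-state
  (stack '()))   ; the top-most (current) lexer is the first element

(defun model-push (state lexer class &optional value)
  "Make LEXER current, keep the old one underneath, and return a token."
  (push lexer (model-state-stack state))
  (values class value))

(defun model-pop (state class &optional value)
  "Drop the current lexer, resume the one beneath it, and return a token."
  (pop (model-state-stack state))
  (values class value))

(defun model-swap (state lexer class &optional value)
  "Replace the current lexer without growing the stack, and return a token."
  ;; assumes the stack is non-empty
  (setf (first (model-state-stack state)) lexer)
  (values class value))

(let ((state (make-model-state :stack (list 'csv-lexer))))
  (model-push state 'string-lexer :quote)   ; stack: (string-lexer csv-lexer)
  (model-swap state 'other-lexer :switch)   ; stack: (other-lexer csv-lexer)
  (model-pop state :quote)                  ; stack: (csv-lexer)
  (model-state-stack state))
;; => (CSV-LEXER)
```

The point of the model is that every operation both changes which lexer is on top and hands back a class and value, so the caller always receives a token alongside the switch.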
Let's give this a spin by creating a simple CSV parser. It should be able to parse integers and strings, and strings should be able to escape characters and contain commas.
First, let's define the CSV lexer:
CL-USER > (define-lexer csv-lexer (state)
            ("%s+" (values :next-token))

            ;; tokens
            (","      (values :comma))
            ("%-?%d+" (values :int (parse-integer $$)))

            ;; string lexer
            ("\"" (push-lexer state #'string-lexer :quote)))
CSV-LEXER
Notice how, when we hit a " character, we push a new lexer onto the lexstate and return a :quote token. The :quote token will signal to the grammar that we're now parsing a string.
Next, let's define our string lexer.
CL-USER > (define-lexer string-lexer (state)
            ("\"" (pop-lexer state :quote))

            ;; characters
            ("\\n"      (values :chars #\newline))
            ("\\t"      (values :chars #\tab))
            ("\\(.)"    (values :chars $1))
            ("[^\\\"]+" (values :chars $$))

            ;; end of line/source
            ("%n|$" (error "Unterminated string")))
STRING-LEXER
This lexer has several interesting things going on. First, when we find the next " character, we pop the lexer (returning to the CSV lexer) and also return a :quote token. Next, it handles escaped characters, and then any run of characters up until the next backslash (\) or quote ("). Finally, if it reaches the end of the line or file, it will signal an error.
Let's try tokenizing to see what we get.
CL-USER > (tokenize #'csv-lexer "1,\"hello, world\",2")
(#<LEXER::TOKEN INT 1>
 #<LEXER::TOKEN COMMA>
 #<LEXER::TOKEN QUOTE>
 #<LEXER::TOKEN CHARS "hello, world">
 #<LEXER::TOKEN QUOTE>
 #<LEXER::TOKEN COMMA>
 #<LEXER::TOKEN INT 2>)
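On the grammar side, the :quote tokens mark where a string starts and ends, and the :chars tokens in between need to be joined. The sketch below shows one way that could look. csv-fields is a hypothetical helper, and tokens are modeled as plain (class value) lists, since the accessors of the package's token objects are not covered here.

```lisp
;; Hypothetical helper; tokens are modeled as (class value) lists because the
;; accessors of the real token objects are not shown in this document.
(defun csv-fields (tokens)
  "Fold a CSV token stream into a list of integers and strings."
  (let ((fields '())
        (in-string nil)
        (pieces '()))
    (dolist (token tokens (nreverse fields))
      (destructuring-bind (class &optional value) token
        (ecase class
          (:comma nil)                        ; field separator, nothing to keep
          (:int (push value fields))
          (:quote
           (if in-string
               ;; closing quote: join the collected pieces into one string
               (push (apply #'concatenate 'string (nreverse pieces)) fields)
               ;; opening quote: start collecting
               (setf pieces '()))
           (setf in-string (not in-string)))
          ;; :chars values may be strings or single characters (escapes)
          (:chars (push (string value) pieces)))))))

(csv-fields '((:int 1) (:comma) (:quote) (:chars "hello, world") (:quote)
              (:comma) (:int 2)))
;; => (1 "hello, world" 2)
```

Because the string lexer returns characters for \n and \t but strings for everything else, the helper normalizes each piece with string before concatenating.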
Creating a Lexer State
Until now, we've been using the tokenize function to implicitly create a lexer state, which is used by our define-lexer functions to generate tokens. However, we can do this ourselves and read tokens on-demand as well.
with-lexer will create a lexer state for us, which can then be passed to read-token to fetch the next token from the lexbuf in the state object.
(with-lexer (var lexer string &key source start end) &body body)
With this macro, we can read just a portion of the input string, and there's no need to generate a list of all the tokens. We can just read until we get what we want and then stop.
Using our CSV lexer above...
CL-USER > (with-lexer (lexer 'csv-lexer "1,2,3")
            (print (read-token lexer))
            (print (read-token lexer))
            (print (read-token lexer)))
#<TOKEN INT 1>
#<TOKEN COMMA>
#<TOKEN INT 2>
A Generic Token Reader
In addition to working with token objects, sometimes it's easier to just work with the parsed token class and value. The with-token-reader macro allows you to do just that. You give it a lexical state (created with with-lexer) and it creates a function you can use to read tokens one by one, returning the class and value as multiple values.
(with-token-reader (var lexer) &body body)
The token-reader will also wrap body in a handler-case, which will signal an error, providing you with the line, source, and lexeme of the error.
The reason for this macro is that most parsing libraries in Common Lisp expect a lexer function that takes no arguments and returns both the class and value of the next token. Using this macro you can provide it easily.
Here are a few libraries that parse this way:
And an example usage:
(with-lexer (lexer 'csv-lexer string)
  (with-token-reader (token-reader lexer)
    (parse 'my-parser token-reader)))
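The calling convention a parser relies on here (a function of no arguments that returns the class and value as two values) can be tried out without any parsing library. In this sketch both functions are hypothetical stand-ins: make-list-reader imitates the kind of function with-token-reader supplies, and count-classes imitates a very simple consumer. It assumes the reader returns NIL once the input is exhausted.

```lisp
;; Hypothetical stand-ins: MAKE-LIST-READER imitates the function that
;; with-token-reader would supply, and COUNT-CLASSES imitates a parser.
(defun make-list-reader (tokens)
  "Return a zero-argument function yielding (values class value) per call."
  (lambda ()
    (let ((token (pop tokens)))
      ;; returns NIL once the tokens are exhausted
      (when token
        (values (first token) (second token))))))

(defun count-classes (reader)
  "Call READER until it returns NIL, tallying how often each class appears."
  (let ((counts '()))
    (loop
      (multiple-value-bind (class value) (funcall reader)
        (declare (ignore value))
        (unless class (return (nreverse counts)))
        (let ((entry (assoc class counts)))
          (if entry
              (incf (cdr entry))
              (push (cons class 1) counts)))))))

(count-classes (make-list-reader '((:int 1) (:comma) (:int 2))))
;; => ((:INT . 2) (:COMMA . 1))
```

Any consumer written against this convention only needs funcall and multiple-value-bind, which is why a reader function is easy to hand to a parser.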
More (Usable) Examples
Here are some lexers used to parse various file formats. As with this package, they are released under the Apache 2.0 license and are free to use in your own projects.
More examples coming as I need them...
If you get some good use out of this package, please let me know; it's nice to know your work is valued by others. If you find a bug or make a nice addition (especially if you improve performance!), please feel free to drop me a line at firstname.lastname@example.org.