Smithereen is a library for building Top-Down Operator Precedence parsers.
If you're already familiar with TDOP, you will notice that I'm using nonstandard names for the methods that Pratt called "nud" and "led". Part of my goal in writing this library was to understand the algorithm better, and to make it easier for others to understand. I found that Pratt's names were a barrier to understanding, so I'm using "prefix" and "infix" instead (although I still think there are better names out there waiting to be discovered).
I was intrigued when I read Crockford's chapter in Beautiful Code. It seemed to me that top-down operator precedence promised parsers as easy to write and understand as recursive descent, but with much better performance and space-efficiency. Also, as a Rubyist, I was drawn by Crockford's claim that TDOP "is most effective when used in a dynamic, functional programming language." He goes on to say that TDOP:
wants a dynamic language, but dynamic language communities historically have had no use for the syntax that Pratt's technique conveniently realizes.
Clearly, if ever there was a counterexample to that trend, it's the Ruby community.
- Many of the important method names (most of which originated with Pratt) are poor, hindering understanding to someone who does not already grasp the algorithm.
A Python implementation by Fredrik Lundh is easier to follow simply because he starts small and builds up the pieces bit-by-bit, rather than presenting a full parser for a complex grammar right from the start. But of course the Python style of implementation differs in many respects from what you would expect to see in Ruby.
In truth, the core of TDOP is very small, and there's simply not a lot there that is independent of particular grammars. However, I've tried to provide a well-factored set of tools to help with the core TDOP algorithm and several related problems, including scope, symbol-table management, tree building, block structure, and the distinction between expression- and statement-oriented languages.
Smithereen has one distinctive design characteristic, the full implications of which are still unclear. It may have to be changed if it proves to have too great a performance impact.
One interesting characteristic of TDOP in an object-oriented language is that it makes sense for tokens to actively participate in the parsing process: tokens parse their own subexpressions.
It seemed to me that the best way to deal with this in Ruby was for
the parser to extend tokens with modules. So a Smithereen parser's
symbol table contains a module for each token type, and the parser
token.extend(token_module) as each token is received from
I like that design, but it's unusual even in Ruby for such things to be done in the inner loop of an algorithm that strives to perform well. I plan to do some profiling to assess the performance of this design before committing to it for the long term.
- Add some kind of tracing output to make it easier to understand (and debug) the algorithm.
- Reusable example groups for lexers and lexer tokens.
- Refactor scoping.
- Investigate separating parsing from tree building.
- Figure out a good way to unit-test prefix, infix, and stmt methods.
- Address all the TODO and ??? comments
- Whether to hide the
- Whether to raise the token or have another error strategy
- Names: prefix, infix, extend_with_infixes, symbolize, symbol_module, lexer token, take_token, deftoken.
- The names advance_if_whatever and how those methods work.
- Whether to hide the
- Add notes and stats rake tasks.
- Measure performance:
- stack depth
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don't break it in a future version unintentionally.
- Commit, do not mess with Rakefile, VERSION, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
- Send me a pull request. Bonus points for topic branches.
Copyright (c) 2009, 2010 Glenn Vanderburg. See LICENSE for details.