Skip to content
Find file
Fetching contributors…
Cannot retrieve contributors at this time
304 lines (213 sloc) 8.72 KB


Canopy is a parser compiler for JavaScript, based on Parsing Expression Grammars and heavily influenced by Treetop.


git clone git://
cd canopy
gem install jake
npm install
npm test


Parsers are generated from grammar definitions; Canopy uses a fairly standard notation for PEGs. You can create a parser by calling Canopy.compile() with a string containing the grammar for the parser. The parser's name will be the name of the grammar followed by Parser. For example:

Canopy.compile('grammar Integer\
  integer <- [1-9] [0-9]*')


// -> {textValue: '3178', offset: 0, elements: [
//      {textValue: '3', offset: 0},
//      {textValue: '178', offset: 1, elements: [
//        {textValue: '1', offset: 1},
//        {textValue: '7', offset: 2},
//        {textValue: '8', offset: 3}
//      ]}
//    ]}

Canopy.compile() both evaluates and returns the source code for the parser it has generated, so you can take that source and save it in a file for later use. The bin directory contains a Node script that generates parser files from grammar files, for example to compile Canopy's own grammar parser, I do this:

node bin/canopy source/canopy/meta_grammar.peg

Grammar definitions

The grammar definition syntax of Canopy is very straighforward and uses common PEG idioms. In contrast to other parser compilers, it does not allow inline code but instead allows you to specify named types for syntax nodes; these types can refer to modules containing methods you want to attach to the nodes. A good example grammar is Canopy's metagrammar, which defines the grammar syntax using the syntax itself. See:


A PEG file should contain the name of the grammar, followed by a list of parsing rules. Each rule maps a name to a parsing expression, the allowed elements of which are listed below. The first rule is assumed to be the root of the parse tree, and can have any name you like.


A string match produces a single syntax node when the input exactly matches the string. Strings are enclosed in double quotes and may contain escape characters.

grammar StringTest
  root <- "Hello, world"

StringTestParser.parse("Hello, world")
// -> {textValue: "Hello, world", offset: 0}


The wildcard character . matches any single character.

grammar WildcardTest
  root <- .

// -> {textValue: "x", offset: 0}

Character classes

A character class is a list of characters enclosed in brackets, just like you'd use in regular expressions. They can contain ranges like [a-z] and are case sensitive.

grammar CharClassTest
  root <- [0-9]

// -> {textValue: "5", offset: 0}


Any expression may be followed by one of the repetition characters, ?, * or +, meaning "zero or one", "zero or more" and "one or more" occurences of the preceeding expression. The ? operator passes through the match if one is found, or an empty-string node otherwise. The others return nodes containing lists of their matches.

grammar MaybeTest
  root <- "a"?

// -> {textValue: "a", offset: 0}

// -> {textValue: "", offset: 0}

// -> null

grammar ZeroOrMoreTest
  root <- "a"*

// -> {textValue: "", offset: 0, elements: []}

// -> {textValue: "aa", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: "a", offset: 1},
//    ]}

grammar OneOrMoreTest
  root <- "a"+

// -> null

// -> {textValue: "aa", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: "a", offset: 1},
//    ]}


Positive and negative lookaheads are denoted by placing & or ! in front of an expression. A lookahead works by inspecting the input for a match and then proceeding without consuming input. For example, the word foo followed a space could be written "foo" &" ", whereas a non-space character could be written !" " .. If the lookahead fails then null is returned, otherwise an empty-string node is produced.

Note that, for example, when the expression "foo" &" " is matched against the string "foo ", the match succeeds but only the string "foo" is consumed. The space remains unconsumed in the input.


A sequence is two or more expressions separated by spaces. The input matches a sequence only if it matches each expression in the sequence in turn.

grammar SequenceTest
  root <- "foo" "bar"

// -> {textValue: "foobar", offset: 0, elements: [
//      {textValue: "foo", offset: 0},
//      {textValue: "bar", offset: 3}
//    ]}

// -> null

Items in sequences can be labelled to make the resulting parse tree easier to navigate. To label an expression, prefix it with a name and a colon (:). The elements array is still available, but some data will be copied to named fields on the sequence node:

grammar LabelTest
  root <- first:[a-z] (", " letter:[a-z])*

LabelTestParser.parse("a, b, c")
// -> {textValue: "a, b, c", offset: 0, elements: [
//      {textValue: "a", offset: 0},
//      {textValue: ", b, c", offset: 1, elements: [
//        {textValue: ", b", offset: 1, elements: [
//          {textValue: ", ", offset: 1},
//          {textValue: "b", offset: 3}
//        ],
//          letter: {textValue: "b", offset: 3}
//        },
//        {textValue: ", c", offset: 4, elements: [
//          {textValue: ", ", offset: 4},
//          {textValue: "c", offset: 6}
//        ],
//          letter: {textValue: "c", offset: 6}
//        }
//      ]}
//    ],
//      first: {textValue: "a", offset: 0}
//    }


Referring to other rules is what gives grammars their expressive power, allowing them to express recursive structures. To refer to another rule, you just enter its name as part of an expression. When used as part of a sequence, a reference creates labelled nodes in the output.

grammar ReferenceTest
  first   <- second "done."
  second  <- "start."

// -> {textValue: "start.done.", offset: 0, elements: [
//      {textValue: "start." offset: 0},
//      {textValue: "done.", offset: 6}
//    ],
//      second: {textValue: "start." offset: 0}
//    }


Choices are lists of expressions separated by " / " and denote a prioritized choice between several alternatives. The input is matched against each alternative in turn, and the first match to succeed is kept.

grammar ChoiceTest
  root <- "foo" / "bar"

// -> {textValue: "foo", offset: 0}

// -> {textValue: "bar", offset: 0}

// -> null

Note that once a choice has succeeded, no further inputs are tried, so choices are unambiguous. This is true even if the choice causes later matching to fail, so for example this grammar will never parse the string "foobar" since we always pick the first branch of the choice.

grammar ChoiceTest
  root <- ("f" / "foo") "bar"

Choices bind more loosely than sequences, for example the expression

"foo" "bar" / "abc" "123"

should be interpreted as

("foo" "bar") / ("abc" "123")

Type annotations

Expressions can be tagged with types, allowing you to use mixins to add methods to syntax nodes. Type labels bind to sequences; if you want to tag a choice then wrap the choice in parentheses.

grammar TypeTest
  root <- first:"foo" second:"bar" <Extension>

Extension = new JS.Module({
  convert: function() {
    return this.first.textValue +

// -> "fooBAR"

grammar ChoiceTypeTest
  root    <- (alpha / beta) <Extension>
  alpha   <- first:"a" second:"z"
  beta    <- first:"j" second:"c"

// -> "aZ"

// -> "jC"
Jump to Line
Something went wrong with that request. Please try again.