Skip to content
master
Go to file
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
bin
 
 
es6
 
 
lib
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Occam Parsers

Occam's parsers.

Contents

Introduction

Three parsers are documented:

  • A BNF parser, actually a variant of extended BNF.
  • A basic parser, for illustrative purposes, and for developing new grammars.
  • The common parser, which can be extended.

All parsers share common functionality. The last two parse content according to rules defined in the aforementioned variant of extended BNF. The BNF parser on the other hand has its rules hard-coded. These rules can be defined in the self same variant that they implement:

  document              ::=  ( rule | error )+ ;

  rule                  ::=  name "::=" definitions ";" ;

  name                  ::=  [name] ;

  definitions           ::=  definition ( "|" definition )* ;

  definition            ::=  part+ ;

  part                  ::=  nonTerminalPart quantifier*

                          |  terminalPart quantifier*

                          |  noWhitespacePart

                          ;

  nonTerminalPart       ::=  choiceOfParts

                          |  sequenceOfParts

                          |  ruleName lookAheadModifier?

                          ;

  terminalPart          ::=  significantTokenType

                          |  regularExpression

                          |  terminalSymbol

                          |  endOfLine

                          |  epsilon

                          |  wildcard

                          ;

  noWhitespacePart      ::=  "<NO_WHITESPACE>" ;

  sequenceOfParts       ::=  "(" part part+ ")" ;

  choiceOfParts         ::=  "(" part ( "|" part )+ ")" ;

  ruleName              ::=  [name] ;

  significantTokenType  ::=  [type] ;

  regularExpression     ::=  [regular-expression] ;

  terminalSymbol        ::=  [string-literal] ;

  endOfLine             ::=  "<END_OF_LINE>" ;

  epsilon               ::=  "ε" ;

  wildcard              ::=  "." ;

  quantifier            ::=  optionalQuantifier

                          |  oneOrMoreQuantifier

                          |  zeroOrMoreQuantifier

                          ;

  lookAheadModifier     ::=  <NO_WHITESPACE>"!" ;

  optionalQuantifier    ::=  <NO_WHITESPACE>"?" ;

  oneOrMoreQuantifier   ::=  <NO_WHITESPACE>"+" ;

  zeroOrMoreQuantifier  ::=  <NO_WHITESPACE>"*" ;

  error                 ::=  . ;

Installation

With npm:

npm install occam-parsers

You can also clone the repository with Git...

git clone https://github.com/jecs-imperial/occam-parsers.git

...and then install the dependencies with npm from within the project's root directory:

npm install

You can also run a development server, see the section on building later on.

Usage

Import the requisite parser and its corresponding lexer from this package and the occam-lexers package, respectively. Then call their fromNothing(...) factory methods.

import { BasicLexer } from "occam-lexers";
import { BasicParser } from "occam-parsers"

const basicLexer = BasicLexer.fromNothing(),
      basicParser = BasicParser.fromNothing();

const content = `

        ...

      `,
      tokens = basicLexer.tokenise(content),
      node = basicParser.parse(tokens);

...

The tokens returned from the lexers's tokenise(...) method can be passed directly to the parser's parse(...) method, which itself returns a node or null.

Examples

There are three examples, one for each parser. To view them, open the index.html file in the root of the repository. Each example shows a representation of the parse tree, which is useful for debugging.

Features

Quantifiers

  • * zero or more
  • + one or more
  • ? optional

These bind tightly to the symbols to their left and can be chained. Take note that both the *+ and ?+ chains will cause an infinite loop and must be avoided.

Regular expressions

A regular expression is distinguished by the usual leading and trailing forward slashes:

/\d+/

Matching significant token types

This can be done with a symbol that is identical to the significant token type in question, contained within square brackets. For example:

 name                       ::=   [unassigned] ;

Matching end of line tokens

This can be done with a special <END_OF_LINE> special symbol. For example:

 verticalSpace              ::=   <END_OF_LINE>+ ;

Matching no whitespace

This can be done with the <NO_WHITESPACE> special symbol. For example:

 parenthesisedTerms         ::=   <NO_WHITESPACE>"(" terms <NO_WHITESPACE>")" ;

It is conventional to leave no whitespace between the symbol and its subsequent part.

Sequences of parts

This can be done with the brackets. For example:

 terms                      ::=   term ( "," term )* ;

Choosing between parts

The vertical bar symbol | is overloaded and can be used in conjunction with brackets to choose between parts as opposed to definitions. For example:

 justifiedStatement         ::=   statement ( "by" | "from" ) reference <END_OF_LINE> ;

Look-ahead

Consider the following rules:

  ABC  ::=  AAB BC ;

  AAB  ::=  "a" "b" | "a";

   BC  ::=  "b" "c" ;

These will not parse the tokens a, b, c because the first definition of the AAB rule will parse the a and b tokens, leaving only the c token for the BC rule to parse. This situation can be addressed by making the AAB rule look ahead, that is, try each of its definitions in turn until one is found that allows the remainder of the parent rule to parse. The look-ahead modifier is an exclamation mark, thus the rules above become:

  ABC  ::=  AAB! BC ;

  AAB  ::=  "a" "b" | "a";

   BC  ::=  "b" "c" ;

Now the ABC rule will indeed parse the tokens a, b, c, because the second definition of the AAB rule will be tried after the first definition fails to allow the BC rule name part to parse.

It seems that the parser parses with roughly linear complexity as a function of the length of the input, however it is most likely that look-ahead parses take exponential time. For this reason, look-ahead should be used sparingly. Also bear in mind that look-ahead is carried out to arbitrary depth and this it affects the behaviour of the ?, *, + quantifiers, which become lazy. Another reason to use it with caution.

Building

Automation is done with npm scripts, have a look at the package.json file. The pertinent commands are:

npm run build-debug
npm run watch-debug

You can also start a small development server:

npm start

The example will then be available at http://localhost:8888/ and will reload automatically when changes are made.

Contact

About

Occam's parsers.

Resources

License

Releases

No releases published

Packages

No packages published
You can’t perform that action at this time.