Support for implicit whitespace handling? #22

Open
bendiken opened this Issue April 29, 2011 · 2 comments

Arto Bendiken

I couldn't find this mentioned in the documentation, and the only previous discussion on this that I've found was in issue #3, so asking here...

In my experience, writing Citrus grammars is very productive, except for one thing: whitespace handling. Transcribing standard BNF grammars (SQL, SPARQL, etc.) into Citrus form is presently more painful than it could be, because these grammars all assume that the input has already been tokenized, so every terminal needs an explicit `space` appended to it.

To keep the production rule definitions sane, as well as to keep them consistent with those in the standard grammar being transcribed, this incentivizes workarounds like the following:

grammar Keywords
  rule space [ \t\n\r] end
  rule ALL      `ALL`      space* end
  rule AS       `AS`       space* end
  rule BY       `BY`       space* end
  rule DISTINCT `DISTINCT` space* end
  rule FROM     `FROM`     space* end
  rule GROUP    `GROUP`    space* end
  rule HAVING   `HAVING`   space* end
  rule JOIN     `JOIN`     space* end
  rule SELECT   `SELECT`   space* end
  rule UNION    `UNION`    space* end
  rule WHERE    `WHERE`    space* end
  ...
end

grammar Tokens
  rule space [ \t\n\r] end
  rule digit                 [0-9] space* end
  rule double_quote          '"' space* end
  rule percent               '%' space* end
  rule ampersand             '&' space* end
  rule left_paren            '(' space* end
  rule right_paren           ')' space* end
  rule asterisk              '*' space* end
  rule plus_sign             '+' space* end
  ...
end

grammar MyGrammar
  include Keywords
  include Tokens

  rule query_specification
    SELECT set_quantifier? select_list table_expression
  end

  rule set_quantifier
    DISTINCT | ALL
  end

  ...
end

The above approach works fine, of course, but seems rather redundant and not a little laborious.

Is there by any chance a magic option I've missed somewhere that would automatically consume any trailing whitespace after recognizing a terminal? Alternatively, is there perhaps a way to feed the #parse method with a sequence of tokens (at its simplest, an Enumerable of strings) instead of giving it an input string?
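To make the second question concrete, here is a minimal sketch of the kind of pre-tokenization meant: plain Ruby using only `strscan` from the standard library. The `tokenize` helper is hypothetical and is not part of Citrus; it only shows how an input string could be reduced to an Enumerable of strings with the whitespace already discarded.

```ruby
require 'strscan'

# Hypothetical pre-tokenizer (not Citrus API): splits the input into
# words/numbers and single-character punctuation, dropping whitespace.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/)                   # discard whitespace
    tokens << (scanner.scan(/\w+/) || scanner.getch) # a word, else one character
  end
  tokens
end

tokenize("SELECT DISTINCT a, b FROM t")
# => ["SELECT", "DISTINCT", "a", ",", "b", "FROM", "t"]
```

A real tokenizer for SQL or SPARQL would also need to handle string literals, multi-character operators, and comments; this sketch ignores all of those.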

Thanks for taking the time to read this, and kudos for the awesome job you've done on Citrus so far: the documentation is superb and the source code is a pleasure to read.

Michael Jackson
Owner

I would love to add an "ignore whitespace" feature to Citrus, but I'm not yet sure what the best approach would be. Ideally, the grammar would support an ignore directive that takes an arbitrary rule and checks for it between successful matches.
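As a rough illustration of that idea (not Citrus code, and not a proposed API), the sketch below matches a sequence of terminal patterns and skips an arbitrary "ignore" pattern after each successful match. The `match_sequence` name and its `ignore:` keyword are assumptions made up for this example.

```ruby
require 'strscan'

# Hypothetical illustration, not Citrus API: match terminals in order,
# consuming an arbitrary "ignore" pattern after each successful match.
def match_sequence(input, terminals, ignore: /\s*/)
  scanner = StringScanner.new(input)
  matched = []
  terminals.each do |terminal|
    text = scanner.scan(terminal)
    return nil unless text   # one terminal failed, so the sequence fails
    matched << text
    scanner.skip(ignore)     # consume ignorable input between matches
  end
  scanner.eos? ? matched : nil # require that all input was consumed
end

match_sequence("SELECT   DISTINCT\n*", [/SELECT/i, /DISTINCT/i, /\*/])
# => ["SELECT", "DISTINCT", "*"]
```

Because the ignore pattern is a parameter rather than hard-coded whitespace, the same mechanism could skip comments as well.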

Arto Bendiken

For the time being, what I ended up using is an approach like the following:

Citrus.require(File.join(File.dirname(__FILE__), 'my_grammar'))

module MyGrammar
  KEYWORDS = %w(ALL AND ANY AS ASC BY CROSS DESC DISTINCT) # etc.

  TERMINALS = {
    :ampersand             => '&',
    :asterisk              => '*',
    :colon                 => ':',
    :comma                 => ',',
    :digit                 => /[0-9]/,
    # etc.
  }

  # Define grammar rules for all keywords:
  KEYWORDS.each do |keyword|
    rule keyword.to_sym do
      all(/\b#{keyword}\b/i, zero_or_one(:separator)) do
        keyword.to_sym
      end
    end
  end

  # Define grammar rules for all terminals:
  TERMINALS.each do |rule_name, token|
    rule rule_name do
      all(token, zero_or_one(:separator)) { token.to_sym }
    end
  end
end

In the preceding, the referenced separator is then a rule in the Citrus grammar definition (my_grammar.citrus) that gobbles up any whitespace and comments, in essence tokenizing the input.
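For illustration only, the input such a separator rule consumes can be approximated with a plain Ruby regular expression. The comment syntax here (`--` to end of line, SQL style) is an assumption chosen for the example; the actual rule lives in the `.citrus` file and is not reproduced here.

```ruby
# Approximation of a "separator": one or more runs of whitespace or
# line comments. Assumed comment syntax: "--" up to the end of the line.
SEPARATOR = /(?:\s+|--[^\n]*)+/

"SELECT -- pick columns\n  a".gsub(SEPARATOR, ' ')
# => "SELECT a"
```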

This approach works fine for my present purposes, but if it could be somehow more directly supported by Citrus core, that'd still be useful; glad to hear it's on your radar. I'm not sure, either, what the best approach to this could be, but perhaps one or another of the other packrat parsers out there might have inspiration to share?
