Skip to content

BNF grammar and validation

Zarquan edited this page Apr 18, 2017 · 14 revisions

ADQL grammar

The following analysis is based on the BNF grammar definition from the ADQL-2.1 working draft dated 2nd May 2016.

This is the version of the BNF grammar definition that was imported into the Lyonetia project on the 15th June 2016.

For clarity we will use the term current grammar to refer to this version of the BNF language definition.

For clarity we will use the term new grammar to refer to the new, machine readable and verifiable, version of the ADQL grammar that we are attempting to develop.

For simplicity, the following discussion is based on the simple definitions and examples given in the Wikipedia entries for BNF, ABNF and EBNF.

We know that more technical and definitive definitions must exist for all of these, but for now the Wikipedia entries are enough for our purposes.

If you know of alternative descriptions for any of these terms please add them to this document as additional section or just as links or footnotes.

Standard BNF

It is clear that the current grammar does not use a standard form of BNF. The following list is not intended to be exhaustive, but it does highlight some of the key issues.

  • Standard BNF does not support multi-line rules.
  • String literals should be enclosed in quotes
  • Non-standard syntax for denoting optional elements.
  • Non-standard syntax for denoting repeated elements.
  • Missing declaration of where white space is required.
  • Missing declaration of where optional white space is allowed.

Alternative forms

Before we try to re-write the current grammar to match a strict definition of BNF, it is worth looking at some of the BNF variants or extensions that have been developed to make it simpler to use, or to adapt it to a specific application.

Augmented BNF

Augmented BNF, define by RFC 5234 is used for defining IETF communication protocols. Augmented BNF itself is probably not a good fit for our purpose, however it is included here as a comparison to show how the rules of BNF can be interpreted in a different form.

Extended BNF

Extended BNF is better suited to defining a formal language, such as a programming language or in our case a query language.

There are many different variants of EBNF. However, the International Organization for Standardization has defined an ISO standard form ISO/IEC 14977:1996 used in part for defining other ISO standards.

The full text of the standard for ISO 14977 can be downloaded as part of the standards that are publicly available under JTC1

Advantages over BNF

The following features of EBNF may be relevant to our purpose:

  • multi-line rules, using ; as the terminating character.
  • multi-line comments, using (* and *).
  • optional elements, using [...].
  • repeated elements, using {...}.

However, it is worth noting that in general it is possible to translate from one form of EBNF into another, and that all forms of EBNF can be translated into BNF.

In which case may be better for us to define our own form of EBNF, that matches our own application, rather than adopting the ISO 14977 definition completely.

Next steps

Based on what we have found so far, we can see two main approaches we could take to creating the machine readable new grammar for ADQL.

The first approach would be to start with the current grammar and attempt to modify it produce a machine readable form that can be validated as an EBNF grammar.

The problem with this method is that it is uncertain whether the current grammar is even consistent within itself.

The current grammar was developed over a period of time with different people contributing different sections, with no formal method in place for verifying that a rule change did not cause unintended side effects.

The danger of using the current grammar as the starting basis for the developing the new grammar is that we get tied up in loops trying to untangle unintended contradictions inherent in the current grammar.

An alternative approach be to start with a blank slate, and develop the new grammar for ADQL from the ground up.

Although this sounds like a lot of work, we can still make use the current grammar as a basis for adding rules by taking one rule at a time from the current grammar and adding it to the new grammar, along with a set of test queries that validates the rule.

The advantage of this approach is that we can ensure that the new grammar is always machine readable and verifiable right from the start.

Collaborative development

Starting with a clean slate and adding the rules one at a time also lends itself better to collaborative development.

We can make use of the GitHub pull request mechanism for merging each new rule into the main specification as it is developed.

If we start with a reference copy of the current grammar and comment out each rule, or part of rule, as it is transferred to the new grammar, this will enable us to keep track of how much has been done and what remains to be validated.

In contrast, attempting to manage a collaboration where everyone has a go at hacking the current grammar at the same time feels like a recipe for chaos.

Initial experiments

bnfparser2

This sub-project uses the bnfparser2 BNF parser to test a simple EBNF definition of a query language.

We intend to develop this example further to include a majority of the rules from the current grammar.

Other projects ..

We invite others to suggest alternatives tools and method that we can use to develop the new grammar.

We can then present these to the working group, either on the mailing list or at the May interop meeting in Shanghai.

Sources