EfraimFeinstein edited this page Jun 20, 2012 · 1 revision

This page describes the use of the [XSLT and XQuery](https://github.com/opensiddur/opensiddur/tree/master/code/grammar-parser) grammar parsers, which are necessary for parsing XPointer and XPointer schemes in XSLT and XQuery. They may also be used for any other type of text parsing.

The idea for the implementation is based on [YAPP](http://www.o-xml.org/yapp), a parser written in XSLT 1.0, although the two share no code. The new parser was written in XSLT 2.0 to take advantage of the language's native support for regular expressions. It was later translated to XQuery so it could be used directly in XQuery code without a serialization step. Most grammars that can be represented in [EBNF](http://en.wikipedia.org/wiki/Extended_Backus–Naur_Form) can be parsed.

We now introduce new XML namespaces and the conventional prefixes used in the documentation:

To use the parser, you must first define a grammar in the language described below. The grammar is stored as XML.

In your XSLT stylesheet, include the parser code (grammar2.xsl2) or, for XQuery, import grammar2.xqm.

The parser is called using the function call (XSLT):

 func:grammar-parse($string as xs:string, $start-term as xs:string, $grammar as node()) as element()

or (XQuery):

 grammar:parse($string as xs:string, $start-term as xs:string, $grammar as node()) as element()

$string is the string to be parsed, $start-term is the named term where parsing should begin, and $grammar is the XML grammar document; it may be element(p:grammar) or document-node().

The result of func:grammar-parse() may be passed to:

 func:grammar-clean($parsed-grammar as element()) as element()

This function returns the parse result with the r:anonymous wrapper elements removed, leaving their content in place. The r:anonymous elements represent matches of anonymous terms (p:expAnon, p:termRefAnon, as defined in the grammar).
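As a rough illustration of the XSLT calls described above, the following stylesheet fragment parses a string and then cleans the result. It is only a sketch: the grammar file name `my-grammar.xml`, the start term `Pair`, and the template name are hypothetical, and the `xsl` and `func` namespace declarations are omitted because they must match the ones declared in grammar2.xsl2.

```xml
<!-- Fragment of an XSLT 2.0 stylesheet (namespace declarations omitted).
     The func prefix must be bound to the function namespace used by grammar2.xsl2. -->
<xsl:include href="grammar2.xsl2"/>

<xsl:template name="parse-example">
  <!-- Hypothetical inputs: the string 'width=100', the start term 'Pair',
       and a grammar stored in my-grammar.xml (passed as a document node). -->
  <xsl:variable name="parsed" as="element()"
      select="func:grammar-parse('width=100', 'Pair', doc('my-grammar.xml'))"/>

  <!-- Strip the r:anonymous wrappers before using the result. -->
  <xsl:sequence select="func:grammar-clean($parsed)"/>
</xsl:template>
```

In XQuery, the equivalent calls would be grammar:parse() followed by grammar:clean(), after importing grammar2.xqm.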

Example grammars from our project are: a partial [XPointer](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/xpointer.xml) implementation, the [extended XPointer range() function](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/xptr-tei.xml) defined by the TEI, and a grammar for our extended version of [Sacred Texts Markup Language](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/stml-grammar.xml).

#Defining a grammar

##Root element

A grammar is defined in an XML file with root element p:grammar. The root element may include other sub-grammars, also contained in p:grammar elements. If more than one grammar is included in the same hierarchy, all the grammars are combined in each parsing run.

##Terms

Each grammar is composed of one or more named terms, represented by the p:term element. Each p:term element is given a unique name using the name attribute. Terms are composed of an ordered list of content matchers:

  • Regular expressions (p:exp, p:expAnon)
  • References to other named terms (p:termRef, p:termRefAnon)
  • Choice grouping constructs (p:choice)
  • Cardinality groupings (p:zeroOrMore, p:oneOrMore, p:zeroOrOne)
  • At most one end-of-data indicator (p:end)
  • Empty or nothing (p:empty)

The list of elements (content matchers) defines the expected values of a string that matches the term. A string that conforms to the list is said to match.
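To make the structure concrete, here is a minimal sketch of a grammar for strings such as `width=100`. The term and expression names (`Pair`, `Key`, `Value`, `Name`, `Digits`) are invented for this example, and the namespace URI bound to `p` is an assumption that should be checked against the parser source.

```xml
<!-- Hypothetical grammar for "key=value" strings.
     ASSUMPTION: the namespace URI below; verify it against grammar2.xsl2. -->
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <!-- Start term: a key, a literal "=", a value, then end of data -->
  <p:term name="Pair">
    <p:termRef name="Key"/>
    <p:expAnon>=</p:expAnon>
    <p:termRef name="Value"/>
    <p:end/>
  </p:term>

  <!-- A key is one or more lowercase letters -->
  <p:term name="Key">
    <p:exp name="Name">[a-z]+</p:exp>
  </p:term>

  <!-- A value is one or more digits -->
  <p:term name="Value">
    <p:exp name="Digits">[0-9]+</p:exp>
  </p:term>
</p:grammar>
```

Parsing would begin at the term passed as $start-term, here `Pair`. The `=` separator is written with p:expAnon because it carries no information worth keeping in the result.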

When run through the parser, each p:term or p:exp element named by a name attribute will result in one of:

  • (r:{name}, r:remainder?)
  • (r:no-match, r:remainder?)

r:{name} contains the part of the string that matched the term. If the term was not matched, r:no-match is returned. r:remainder contains the remaining part of the string that could not be matched with the given term.

When anonymous elements are defined in the grammar with p:termRefAnon and p:expAnon, they return r:anonymous instead of r:{name}. These may be removed by passing the result of the parse run to func:grammar-clean() (XSLT) or grammar:clean() (XQuery).

##Content matchers

Content matchers attempt to match the text at the current position in a string against their defined pattern.

###Regular Expressions

Regular expressions may be matched using the p:exp element. The regular expression is the text content of the element. All special characters must be escaped using the normal conventions of regular expressions.

A matched p:exp element returns an element in the r namespace whose node name is defined by the p:exp element's name attribute.

Note: with the XSLT parser, it is not possible to use an expression that can match the empty string (e.g., \s*). Instead, use:

<p:choice>
 <p:exp>\s+</p:exp>
 <p:empty/>
</p:choice>

###Term references

Term references (p:termRef) are how named terms are matched inside other named terms. The name attribute identifies the term to match. The alias attribute may be used to reuse the term named by name while giving it a different result element name, r:{alias}.
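For instance, a term that uses the same pattern twice can give each occurrence its own result name with alias. The fragment below is a sketch with invented names (`Range`, `Number`, `Start`, `End`, `Digits`); it is meant to sit inside a p:grammar element.

```xml
<!-- Hypothetical terms for strings such as "10-20" -->
<p:term name="Range">
  <!-- Both references match the Number term, but report r:Start and r:End -->
  <p:termRef name="Number" alias="Start"/>
  <p:expAnon>-</p:expAnon>
  <p:termRef name="Number" alias="End"/>
</p:term>

<p:term name="Number">
  <p:exp name="Digits">[0-9]+</p:exp>
</p:term>
```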

##Anonymous content matchers

The p:termRefAnon and p:expAnon elements work like their named counterparts, except that matches are returned as r:anonymous elements. Because the result is not named, p:termRefAnon does not support the alias attribute.

Running the func:grammar-clean() (XSLT) or grammar:clean() (XQuery) function on the return value of the grammar parser will remove all r:anonymous elements, leaving their content.

##Choice groupings

Choice groupings (the p:choice element) indicate that their position in the term may contain any one of the listed alternatives. The p:choice fails to match if none of the alternatives match. If more than one alternative matches the text, the one with the longest match is chosen. If multiple matches are of equal length, the first one listed in the grammar is chosen.

In addition to any of the contents of p:term, p:choice may also include two other elements:

  • p:group - an anonymous ordered grouping of content matchers.
  • p:empty - The possibility that the choice matches to the empty string.
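The following fragment sketches both extra elements in use. All names (`Duration`, `Minutes`, `Seconds`, `SecondsOnly`, `Unit`) are hypothetical and chosen only for illustration.

```xml
<!-- Hypothetical term for strings such as "12:30", "45" or "45s" -->
<p:term name="Duration">
  <p:choice>
    <!-- Alternative 1: an ordered group, minutes ":" seconds -->
    <p:group>
      <p:exp name="Minutes">[0-9]+</p:exp>
      <p:expAnon>:</p:expAnon>
      <p:exp name="Seconds">[0-9]+</p:exp>
    </p:group>
    <!-- Alternative 2: seconds alone -->
    <p:exp name="SecondsOnly">[0-9]+</p:exp>
  </p:choice>
  <!-- Optional unit suffix: either "s" or nothing -->
  <p:choice>
    <p:exp name="Unit">s</p:exp>
    <p:empty/>
  </p:choice>
</p:term>
```

For the input `12:30`, both alternatives of the first choice can match (the group matches `12:30`, the bare expression matches `12`), so under the longest-match rule described above the group would be expected to win.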

##Cardinality

An ordered list of term references, regular expressions, and choice groupings may also be grouped under the p:zeroOrMore, p:zeroOrOne, or p:oneOrMore elements. These match if the whole group occurs 0 or more times (absent, present, or repeated), 0 or 1 times (absent or present), or 1 or more times (present or repeated), respectively.
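A common use of cardinality groupings is a separated list. The sketch below, using the invented names `List`, `Item`, and `Word`, describes one item followed by any number of comma-separated items.

```xml
<!-- Hypothetical terms for strings such as "a, b, c" -->
<p:term name="List">
  <p:termRef name="Item"/>
  <!-- The separator and the following item repeat together, 0 or more times -->
  <p:zeroOrMore>
    <p:expAnon>,\s*</p:expAnon>
    <p:termRef name="Item"/>
  </p:zeroOrMore>
</p:term>

<p:term name="Item">
  <p:exp name="Word">[a-z]+</p:exp>
</p:term>
```

Replacing p:zeroOrMore with p:oneOrMore would require at least two items, and p:zeroOrOne would allow at most two.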

##License

The grammar parser is released under the [GNU Lesser GPL 3 (or later)](http://www.gnu.org/licenses/lgpl.html).

##Questions/Bug reports

Questions may be addressed to the [opensiddur-tech](http://groups.google.com/group/opensiddur-tech/) email list. Please discuss bugs there before reporting them to [our issue tracker](https://github.com/opensiddur/opensiddur/issues).