Grammar Parser
This page describes the use of the [XSLT and XQuery](https://github.com/opensiddur/opensiddur/tree/master/code/grammar-parser) grammar parsers, which are necessary for parsing XPointer and XPointer schemes in XSLT and XQuery. They may also be used for any other type of text parsing.
The idea for the implementation is based on [YAPP](http://www.o-xml.org/yapp), a parser written in XSLT 1.0, although the two share no code. The new parser was written in XSLT 2.0 to take advantage of the language's native support for regular expressions. It was later translated to XQuery so it could be used directly in XQuery code without a serialization step. Most grammars that can be represented in [EBNF](http://en.wikipedia.org/wiki/Extended_Backus–Naur_Form) can be parsed.
The documentation uses the following XML namespaces and conventional prefixes:
- http://jewishliturgy.org/ns/functions/xslt (func) - used in the XSLT code
- http://jewishliturgy.org/transform/grammar (grammar) - used in the XQuery code
- http://jewishliturgy.org/ns/parser (p)
- http://jewishliturgy.org/ns/parser-result (r)
To use the parser, you must first define a grammar in the language described below. The grammar is stored as XML.
In your XSLT stylesheet, include the parser code (grammar2.xsl2) or, for XQuery, import grammar2.xqm.
The parser is called using the function call (XSLT):
`func:grammar-parse($string as xs:string, $start-term as xs:string, $grammar as node()) as element()`
or (XQuery):
`grammar:parse($string as xs:string, $start-term as xs:string, $grammar as node()) as element()`
$string is the string to be parsed, $start-term is the named term where parsing should begin, and $grammar is the XML grammar document; it may be element(p:grammar) or document-node().
The result of func:grammar-parse() may be passed to:
`func:grammar-clean($parsed-grammar as element()) as element()`
This function returns the parse result with the r:anonymous elements, which represent anonymous terms (p:expAnon, p:termRefAnon, as defined in the grammar), removed.
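As an illustration of the calling convention above, here is a minimal XSLT 2.0 sketch. The relative location of grammar2.xsl2, the grammar file name `my-grammar.xml`, the start term `greeting`, and the template name `main` are all assumptions made for this example, not part of the project.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:func="http://jewishliturgy.org/ns/functions/xslt">

  <!-- Assumption: grammar2.xsl2 sits in the same directory as this stylesheet -->
  <xsl:include href="grammar2.xsl2"/>

  <!-- Hypothetical grammar document that defines a term named "greeting" -->
  <xsl:variable name="grammar" select="doc('my-grammar.xml')"/>

  <xsl:template name="main">
    <!-- Parse the string, starting at the hypothetical term "greeting" -->
    <xsl:variable name="parsed"
        select="func:grammar-parse('hello world', 'greeting', $grammar)"/>
    <!-- Remove the r:anonymous elements from the parse result -->
    <xsl:sequence select="func:grammar-clean($parsed)"/>
  </xsl:template>

</xsl:stylesheet>
```

The grammar is passed here as a document node; passing the p:grammar element itself should work equally well, since both are accepted for `$grammar`.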
Example grammars from our project are: a partial [XPointer](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/xpointer.xml) implementation, the [extended XPointer range() function](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/xptr-tei.xml) defined by the TEI, and a grammar for our extended version of [Sacred Texts Markup Language](https://github.com/opensiddur/opensiddur/blob/master/code/grammar-parser/stml-grammar.xml).
#Defining a grammar
##Root element
A grammar is defined in an XML file with root element `p:grammar`. The root element may include other sub-grammars, also contained in `p:grammar` elements. If more than one grammar is included in the same hierarchy, all the grammars are combined in each parsing run.
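A minimal sketch of a grammar file with one nested sub-grammar follows. The term names and regular expressions are invented for illustration, and placing the sub-grammar as a direct child of the root is an assumption based on the description above.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <!-- A term defined directly in the root grammar (hypothetical) -->
  <p:term name="word">
    <p:exp name="letters">[a-z]+</p:exp>
  </p:term>

  <!-- A nested sub-grammar; its terms are combined with those of the root -->
  <p:grammar>
    <p:term name="number">
      <p:exp name="digits">[0-9]+</p:exp>
    </p:term>
  </p:grammar>
</p:grammar>
```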
##Terms
Each grammar is composed of one or more named terms, represented by the `p:term` element. Each `p:term` element is given a unique name using the `name` attribute. Terms are composed of an ordered list of content matchers:
- Regular expressions (`p:exp`, `p:expAnon`)
- References to other named terms (`p:termRef`, `p:termRefAnon`)
- Choice grouping constructs (`p:choice`)
- Cardinality groupings (`p:zeroOrMore`, `p:oneOrMore`, `p:zeroOrOne`)
- At most one end-of-data indicator (`p:end`)
- Empty or nothing (`p:empty`)
The list of elements (content matchers) defines the expected values of a string that matches the term. A string that conforms to the list is said to match.
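To make the idea of an ordered list of content matchers concrete, the sketch below defines a hypothetical term `pair` intended to match strings such as `width=42`. The term and expression names are assumptions chosen for the example.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="pair">
    <!-- First, one or more lowercase letters, reported under the name "key" -->
    <p:exp name="key">[a-z]+</p:exp>
    <!-- Then a literal equals sign, matched anonymously -->
    <p:expAnon>=</p:expAnon>
    <!-- Then one or more digits, reported under the name "value" -->
    <p:exp name="value">[0-9]+</p:exp>
    <!-- Finally, require the end of the input -->
    <p:end/>
  </p:term>
</p:grammar>
```

The matchers are tried in document order, so a string conforms to `pair` only if each piece appears in this sequence.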
When run through the parser, each `p:term` or `p:exp` element named by a `name` attribute will result in one of:
- (`r:{name}`, `r:remainder`?)
- (`r:no-match`, `r:remainder`?)

`r:{name}` contains the part of the string that matched the term.
If the term was not matched, `r:no-match` is returned.
`r:remainder` contains the remaining part of the string that could not be matched with the given term.
When anonymous elements are defined in the grammar with `p:termRefAnon` and `p:expAnon`, they return `r:anonymous` instead of `r:{name}`. These may be removed by passing the result of the parse run to `func:grammar-clean()` (XSLT) or `grammar:clean()` (XQuery).
##Content matchers
Content matchers attempt to match the current position in a string to their defined pattern.
###Regular Expressions
Regular expressions may be matched using the `p:exp` element. The regular expression is the text content of the element. All special characters must be escaped using the normal conventions of regular expressions.
A matched `p:exp` element returns an element in the `r` namespace whose node name is defined by the `p:exp` element's `name` attribute.
Note that, with the XSLT parser, it is not possible to match an expression that can evaluate entirely to an empty string (e.g., `\s*`). Instead, use:

    <p:choice>
      <p:exp>\s+</p:exp>
      <p:empty/>
    </p:choice>

###Term references
Term references (`p:termRef`) are how named terms are matched inside other named terms. The `name` attribute identifies the term being referenced. The `alias` attribute may be used to reuse that term while giving the result a different element name, `r:{alias}`.
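The following sketch shows one way the `alias` attribute might be used: a hypothetical term `number` is referenced twice inside a hypothetical term `range`, so that the two matches can be told apart in the result. All term names are assumptions for this example.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <p:term name="range">
    <!-- Same pattern as "number", but expected to be reported as r:from -->
    <p:termRef name="number" alias="from"/>
    <!-- Literal hyphen separating the two numbers, matched anonymously -->
    <p:expAnon>-</p:expAnon>
    <!-- Same pattern again, expected to be reported as r:to -->
    <p:termRef name="number" alias="to"/>
  </p:term>
</p:grammar>
```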
##Anonymous content matchers
The `p:termRefAnon` and `p:expAnon` elements work like their named counterparts, except that matches are returned as `r:anonymous` elements. Because the result is not named, `p:termRefAnon` does not support the `alias` attribute.
Running the `func:grammar-clean()` (XSLT) or `grammar:clean()` (XQuery) function on the return value of the grammar parser will remove all `r:anonymous` elements, leaving their content.
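As a small illustration, the sketch below references a hypothetical whitespace term anonymously, so that separators should not appear as named elements once the result has been cleaned. The term names `ws`, `word`, and `two-words` are assumptions.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <!-- Whitespace; uses \s+ because a pattern that can match the empty string is not allowed in the XSLT parser -->
  <p:term name="ws">
    <p:expAnon>\s+</p:expAnon>
  </p:term>

  <p:term name="word">
    <p:exp name="letters">[a-z]+</p:exp>
  </p:term>

  <p:term name="two-words">
    <p:termRef name="word"/>
    <!-- Anonymous reference: the match is returned as r:anonymous, not r:ws -->
    <p:termRefAnon name="ws"/>
    <p:termRef name="word"/>
  </p:term>
</p:grammar>
```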
##Choice groupings
Choice groupings (the `p:choice` element) indicate that their position in the term may contain any one of the listed alternatives. The `p:choice` fails to match if none of the alternatives match. If more than one alternative matches the text, the one with the longest match is chosen. If multiple matches are of equal length, the first one listed in the grammar is chosen.
In addition to any of the contents of `p:term`, `p:choice` may also include two other elements:
- `p:group` - an anonymous ordered grouping of content matchers.
- `p:empty` - the possibility that the choice matches the empty string.
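A sketch of a choice with three alternatives is given below: a bare number, a parenthesized number wrapped in `p:group`, or nothing at all. The term names are hypothetical.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <p:term name="value">
    <p:choice>
      <!-- Alternative 1: a bare number -->
      <p:termRef name="number"/>
      <!-- Alternative 2: a parenthesized number, grouped so the three matchers act as one alternative -->
      <p:group>
        <p:expAnon>\(</p:expAnon>
        <p:termRef name="number"/>
        <p:expAnon>\)</p:expAnon>
      </p:group>
      <!-- Alternative 3: match the empty string -->
      <p:empty/>
    </p:choice>
  </p:term>
</p:grammar>
```

Without `p:group`, the parentheses and the number would be three separate alternatives rather than one sequence.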
##Cardinality
An ordered list of term references, regular expressions, and choice groupings may also be grouped under the `p:zeroOrMore`, `p:zeroOrOne`, or `p:oneOrMore` elements. These match if the group as a whole occurs zero or more times (absent, present, or repeated), zero or one time (absent or present), or one or more times (present or repeated), respectively.
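For example, a comma-separated list of numbers could be sketched with `p:zeroOrMore` as follows. The term names are assumptions for illustration.

```xml
<p:grammar xmlns:p="http://jewishliturgy.org/ns/parser">
  <p:term name="number">
    <p:exp name="digits">[0-9]+</p:exp>
  </p:term>

  <p:term name="list">
    <!-- The first item is required -->
    <p:termRef name="number"/>
    <!-- Each further item is a comma followed by a number; the pair may repeat any number of times -->
    <p:zeroOrMore>
      <p:expAnon>,</p:expAnon>
      <p:termRef name="number"/>
    </p:zeroOrMore>
    <p:end/>
  </p:term>
</p:grammar>
```

The comma and the number repeat together because the cardinality element applies to the whole group it contains.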
##License
The grammar parser is released under the [GNU Lesser GPL 3 (or later)](http://www.gnu.org/licenses/lgpl.html).
##Questions/Bug reports
Questions may be addressed to the [opensiddur-tech](http://groups.google.com/group/opensiddur-tech/) email list. Please discuss bugs there before reporting them to [our issue tracker](https://github.com/opensiddur/opensiddur/issues).