This POC investigates and demonstrates how to create a custom Domain Specific Language (DSL) that allows FHIR resources to be defined using a procedural definition language. The POC is fairly complex, but it is covered by unit and integration tests, and transitioning it to production code should be "simple".
This is the FDL grammar definition in a modified version of Backus-Naur form (BNF), with a simplified syntax (since the FDL parser is not automatically generated).
program -> statement+ EOF ;
statement -> expression ";" ;
expression -> assignment ;
assignment -> deference "=" primitive ;
deference -> receiver "." IDENTIFIER ("[" NUMBER "]")? ;
receiver -> declaration | STRING ;
primitive -> typedef | assignment ;
typedef -> "(" STRING "as" (type | ("date" ("=>" STRING)?)) ")" ;
declaration -> ELEMENT ("[" matcher "]")? ;
type -> "boolean" | "date" | "decimal" | "integer" ;
matcher -> NUMBER | STRING ;
An FDL listing comprises one or more statements. The simplest statement is the instantiation of a FHIR resource:
Patient;
This will produce an empty Patient FHIR resource.
Every statement ends with a semicolon (;) and is insensitive to spaces or newlines, so it can be formatted freely.
Comments can be marked using //. All the characters from the comment mark to the end of the line will be ignored:
// This is a comment.
Patient.active="true";
A single listing can declare multiple resources. Resources are matched by appending brackets with a matcher to the resource name. The following produces two Patient resources, because the repeated Patient[1] refers to the same resource:
Patient[0];
Patient[1];
Patient[1];
The matcher can be a number, or a string, so the following are all valid declarations and will produce three resources:
Practitioner[123];
Practitioner["specialist"];
Practitioner["nurse"];
If the resource is declared without a matcher, it defaults to [0], so Patient is equivalent to Patient[0]. If using a number, bear in mind that it does not represent a cardinality: it is a matcher, so the sequence does not matter, and the declarations can start at any number.
Internally, a number matcher is handled as a string, so Patient[1] and Patient["1"] reference the same resource.
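The matcher rules above can be sketched as a lookup table keyed by resource type and matcher. This is only an illustration under stated assumptions: `ResourceRegistry`, its key format and the placeholder `Object` instances are hypothetical and are not taken from the FDL code base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry illustrating matcher lookup; not FDL's actual class.
final class ResourceRegistry {
    // Key is "<resourceType>[<matcher>]", with the matcher always stored as text.
    private final Map<String, Object> resources = new LinkedHashMap<>();

    // Returns the resource for the given type and matcher, creating a placeholder on first access.
    Object resolve(String resourceType, Object matcher) {
        // A missing matcher defaults to 0; numbers and strings collapse to the same textual key.
        String matcherText = (matcher == null) ? "0" : String.valueOf(matcher);
        String key = resourceType + "[" + matcherText + "]";
        return resources.computeIfAbsent(key, k -> new Object());
    }

    int size() {
        return resources.size();
    }

    public static void main(String[] args) {
        ResourceRegistry registry = new ResourceRegistry();
        registry.resolve("Patient", 1);
        registry.resolve("Patient", "1");   // same resource as Patient[1]
        registry.resolve("Patient", null);  // same resource as Patient[0]
        System.out.println(registry.size());
    }
}
```

Because the key is built from the textual form of the matcher, the declaration order and the starting number have no influence on which resource is returned.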
Resources can have attributes, which can be initialized:
Patient.active="true";
This will create a Patient resource with its active attribute initialized to true. Note that "true" is a string here, which means it is evaluated at runtime when the attribute expects a boolean value. Currently, true, yes and y are translated to true, while false, no and n are translated to false; the comparison is case-insensitive.
The value assigned must be a string, so Patient.active=true; will raise an error.
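The boolean translation described above could look roughly like the following. It is a minimal sketch: the class name `BooleanLiterals` is hypothetical, and trimming surrounding whitespace is an assumption that the text does not confirm.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; the accepted words come from the description above.
final class BooleanLiterals {
    private static final Set<String> TRUE_WORDS = Set.of("true", "yes", "y");
    private static final Set<String> FALSE_WORDS = Set.of("false", "no", "n");

    // Translates a string value into a boolean, ignoring case.
    static boolean parse(String raw) {
        // Assumption: surrounding whitespace is not significant.
        String normalized = raw.trim().toLowerCase(Locale.ROOT);
        if (TRUE_WORDS.contains(normalized)) {
            return true;
        }
        if (FALSE_WORDS.contains(normalized)) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean value: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(parse("Yes"));
        System.out.println(parse("N"));
    }
}
```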
Some fields have a cardinality, which can be expressed using a list access notation: field[index]. The index must be an integer and counting starts at 0 (do not confuse it with the resource matcher). Since this maps directly to the position in the internal list representation, the sequence of numbers does matter.
It is important to respect the order of the indexed fields. Not starting at 0 or skipping numbers in the sequence will raise a runtime error and produce an empty Bundle (see Errors).
Since a human can have multiple given names, the following is a valid statement:
Practitioner["nurse"].name[0]=HumanName.given[0]="Emily";
If the index is not specified, it defaults to [0].
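The ordering rule for indexed fields can be illustrated with a small list wrapper that accepts only an existing position or the next free one. This is a sketch, not FDL's implementation: `IndexedField` is a hypothetical name, and overwriting an already populated index is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the internal list of a repeating field.
final class IndexedField<T> {
    private final List<T> items = new ArrayList<>();

    // Accepts an existing index (assumed to overwrite) or the next free index; anything else is a gap.
    void set(int index, T value) {
        if (index < 0 || index > items.size()) {
            throw new IllegalStateException(
                    "Index " + index + " leaves a gap; next free index is " + items.size());
        }
        if (index == items.size()) {
            items.add(value);
        } else {
            items.set(index, value);
        }
    }

    T get(int index) {
        return items.get(index);
    }

    int size() {
        return items.size();
    }

    public static void main(String[] args) {
        IndexedField<String> given = new IndexedField<>();
        given.set(0, "Emily");
        given.set(1, "Rose");
        System.out.println(given.size());
    }
}
```

In this sketch, calling `set(3, ...)` on a field with two entries throws, which mirrors the runtime error described above for skipped positions.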
FDL offers a simplified notation. The standard notation requires that the FHIR element type be specified in recursive assignments, such as:
Practitioner["nurse"].name[0]=HumanName.given[0]="Emily";
The simplified notation allows the assigned element types to be omitted. The following is equivalent to the statement above:
Practitioner["nurse"].name[0].given[0]="Emily";
// since both fields are at position zero, the following is even simpler
Practitioner["nurse"].name.given="Emily";
What happens here is that FDL infers the type of name from the context, and the HumanName element is instantiated implicitly if it doesn't already exist.
Both notations can be mixed and are compatible. It is up to the user to decide which one to use.
The FHIR specification marks as 'required' those attributes that have restricted values, assignable from a list of predefined elements. From the specification:
To be conformant, codes in this element SHALL be from the specified value set.
In programming terms they correspond to enumerations, and that is their actual internal representation in FDL.
From a syntax point of view, there is no difference from a normal assignment, but the interpreter will raise an error if the value is not allowed.
For example, Immunization.status="completed"; is syntactically and semantically valid; Immunization.status="refused"; is syntactically valid but semantically wrong (since "refused" is not a valid value) and will raise a runtime error.
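The enumeration check can be sketched as a lookup from code to enum constant. The names `RequiredBinding` and `ImmunizationStatus` are hypothetical, and the three codes are assumed from the FHIR R4 value set for Immunization.status; they should be checked against the FHIR version in use.

```java
import java.util.Arrays;

// Hypothetical illustration of a 'required' binding represented as an enumeration.
final class RequiredBinding {
    // Assumed codes (FHIR R4 immunization status); verify against the targeted FHIR version.
    enum ImmunizationStatus {
        COMPLETED("completed"),
        ENTERED_IN_ERROR("entered-in-error"),
        NOT_DONE("not-done");

        final String code;

        ImmunizationStatus(String code) {
            this.code = code;
        }
    }

    // Returns the constant for the code, or fails like a runtime (semantic) error would.
    static ImmunizationStatus fromCode(String code) {
        return Arrays.stream(ImmunizationStatus.values())
                .filter(status -> status.code.equals(code))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException(
                        "Value not allowed for Immunization.status: " + code));
    }

    public static void main(String[] args) {
        System.out.println(fromCode("completed"));
    }
}
```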
In FHIR, BackboneElements are defined as:
The base definition for complex elements defined as part of a resource definition - that is, elements that have children that are defined in the resource. Data Type elements do not use this type, though a few data types specialize it (Timing, Dosage, ElementDefinition). For instance, Patient.contact is an element that is defined as part of the patient resource, so it automatically has the type BackboneElement.
In FDL they are defined simply by referring to them as sub-elements:
Patient.contact.telecom=ContactPoint.value="12345678";
Patient.contact.telecom=ContactPoint.use="home";
Under the hood, the handling is complex, but what is worth knowing from a user standpoint is that the first time a BackboneElement is accessed, FDL will create its instance.
Of course, simplified notation works here as well:
Patient.contact.telecom.value="12345678";
Patient.contact.telecom.use="home";
FDL supports dates. If a specific property expects a date, for example Patient.birthDate, FDL tries its best to parse the date string.
The following (and many more) formats are parsable:
Patient.birthDate="8/27/1969";
Patient.birthDate="Aug 27, 1969";
Patient.birthDate="27 august 1969";
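One common way to obtain this kind of lenient parsing is to try a list of candidate formats in order. The sketch below uses only `java.time` and covers just the three formats shown above. `LenientDates`, the pattern list and the month-first reading of numeric dates are assumptions, and FDL's actual strategy may differ.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper that tries several date patterns until one matches.
final class LenientDates {
    // Assumed candidate patterns, tried in this order.
    private static final List<DateTimeFormatter> CANDIDATES = List.of(
            formatter("M/d/uuuu"),      // 8/27/1969
            formatter("MMM d, uuuu"),   // Aug 27, 1969
            formatter("d MMMM uuuu"));  // 27 august 1969

    // Builds a case-insensitive formatter with English month names.
    private static DateTimeFormatter formatter(String pattern) {
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendPattern(pattern)
                .toFormatter(Locale.ENGLISH);
    }

    // Returns the first successful parse, or empty when no candidate matches.
    static Optional<LocalDate> parse(String raw) {
        for (DateTimeFormatter candidate : CANDIDATES) {
            try {
                return Optional.of(LocalDate.parse(raw.trim(), candidate));
            } catch (DateTimeParseException ignored) {
                // Try the next candidate pattern.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("27 august 1969"));
    }
}
```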
FHIR supports various data types. FDL is designed to receive only strings as values (as extracted by the OCR); for this reason, it does a lot of guesswork to interpret the type correctly.
When assigning to a field, it is possible to specify the type of the value. This is optional, but doing so helps the interpreter avoid type mismatches and makes processing faster, since the guesswork is skipped.
Note that if the wrong type is declared, the interpreter will still try to guess the type before giving up.
The syntax to specify a type is ("value" as type). Supported types are:
boolean
date
decimal
integer
To be implemented (refer to the FHIR specification to decide which are worth implementing):
instant
dateTime
time
To specify a boolean:
Patient.active=("yes" as boolean);
To specify a date, the syntax is as follows:
Patient.birthDate=("8/7/1978" as date);
Optionally, the date format can also be specified (e.g. if it uses uncommon separators) using the fat arrow operator '=>':
Patient.birthDate=("7.8/1979" as date => "M.d/yyyy");
If the FHIR attribute takes a decimal or an integer, it can be specified with:
Goal.target.detail=("10" as integer);
Quantity.value=("11.2" as decimal);
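When the type is declared, the conversion can go straight to the right parser instead of guessing. The following sketch shows that dispatch for the four supported types. `TypedValues`, the `Type` enum and the default date pattern are hypothetical, and the fallback to guessing mentioned above is intentionally left out.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical converter for values written as ("value" as type).
final class TypedValues {
    enum Type { BOOLEAN, DATE, DECIMAL, INTEGER }

    // Assumption: month-first numeric dates when no '=>' pattern is given.
    private static final String DEFAULT_DATE_PATTERN = "M/d/uuuu";

    // Converts the raw string according to the declared type; datePattern may be null.
    static Object convert(String raw, Type type, String datePattern) {
        switch (type) {
            case BOOLEAN:
                return parseBoolean(raw);
            case DATE: {
                String pattern = (datePattern != null) ? datePattern : DEFAULT_DATE_PATTERN;
                return LocalDate.parse(raw, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
            }
            case DECIMAL:
                return new BigDecimal(raw);
            case INTEGER:
                return Integer.valueOf(raw);
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Accepts the boolean words listed earlier in this document, ignoring case.
    private static Boolean parseBoolean(String raw) {
        String value = raw.toLowerCase(Locale.ROOT);
        if (value.equals("true") || value.equals("yes") || value.equals("y")) {
            return Boolean.TRUE;
        }
        if (value.equals("false") || value.equals("no") || value.equals("n")) {
            return Boolean.FALSE;
        }
        throw new IllegalArgumentException("Not a boolean value: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(convert("7.8/1979", Type.DATE, "M.d/yyyy"));
        System.out.println(convert("11.2", Type.DECIMAL, null));
    }
}
```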
FDL raises two kinds of errors: static errors and runtime errors.
Static errors are raised during the lexing and parsing phases. These are mostly syntax errors, for example mismatched brackets, unexpected elements at certain positions and so on. These errors do not block the parsing phase, but are collected and reported all together to make fixing them easier.
Runtime errors are raised during the interpretation phase and are semantic errors, for example declarations of invalid resources, wrong assignment types etc. Any of these errors stops the interpretation: an empty Bundle is produced and the error is reported.
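The difference between the two error policies can be sketched as follows: the static phase keeps going and accumulates every problem, while the runtime phase aborts on the first failure and returns an empty result. Everything here is hypothetical (`ErrorPolicy`, the two toy checks, and the list of strings standing in for a Bundle); it only illustrates the collect-versus-abort behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of collect-all static errors versus fail-fast runtime errors.
final class ErrorPolicy {
    // Static phase: inspect every statement and collect all problems instead of stopping early.
    static List<String> collectStaticErrors(List<String> statements) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < statements.size(); i++) {
            String statement = statements.get(i);
            long open = statement.chars().filter(c -> c == '[').count();
            long close = statement.chars().filter(c -> c == ']').count();
            if (open != close) {
                errors.add("statement " + i + ": mismatched brackets");
            }
            if (!statement.endsWith(";")) {
                errors.add("statement " + i + ": missing ';'");
            }
        }
        return errors;
    }

    // Runtime phase: the first failure stops everything and yields an empty result.
    static List<String> interpret(List<String> statements, Function<String, String> evaluator) {
        List<String> bundle = new ArrayList<>();
        try {
            for (String statement : statements) {
                bundle.add(evaluator.apply(statement));
            }
            return bundle;
        } catch (RuntimeException e) {
            System.err.println("runtime error: " + e.getMessage());
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(collectStaticErrors(List.of("Patient[0;", "Patient")));
    }
}
```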
FDL is implemented as a generic embeddable language, but it is primarily intended to be integrated into SpitFHIR.
SpitFHIR will receive a list of key/value pairs where the key is the FDL statement and the value is the extracted data. To reference the value in the statement, FDL uses a $ (dollar sign).
For example:
{
"Patient.active=$;": "yes",
"Patient.name[0]=HumanName.family=$;": "Smith",
"Patient.birthDate=($ as date);": "8/7/1980"
}
will be hydrated into:
Patient.active="yes";
Patient.name[0]=HumanName.family="Smith";
Patient.birthDate=("8/7/1980" as date);
and then interpreted as explained above.
Note that $ automatically wraps the value in double quotes, so it is not necessary to add them (adding them will produce unexpected results).
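The hydration step amounts to a literal substitution of the $ placeholder with the quoted value. A minimal sketch, assuming a hypothetical `Hydrator` class and no escaping of quotes inside the value (the text does not say how FDL handles that case):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that turns statement templates into concrete FDL statements.
final class Hydrator {
    // Replaces every '$' with the value wrapped in double quotes (literal replace, no regex).
    static String hydrate(String template, String value) {
        return template.replace("$", "\"" + value + "\"");
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("Patient.active=$;", "yes");
        input.put("Patient.name[0]=HumanName.family=$;", "Smith");
        input.put("Patient.birthDate=($ as date);", "8/7/1980");
        // Print each hydrated statement in insertion order.
        input.forEach((template, value) -> System.out.println(hydrate(template, value)));
    }
}
```

The sketch also shows why quoting $ by hand misbehaves: a template such as `Patient.active="$";` would end up with doubled quotes around the value.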
FDL is implemented in the classic Lexer -> Parser -> StaticAnalyzer -> Interpreter style.
The Lexer scans a source string and produces a list of Token objects; the Parser receives the list of Token objects and builds an Abstract Syntax Tree. The parsing produces an AST meant to be interpreted in recursive descent (other approaches may be evaluated, such as the Pratt algorithm, possibly combined with a bytecode-based VM).
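To make the Lexer stage more concrete, here is a small hand-written scanner for a subset of FDL. It is a sketch under assumptions: `MiniLexer`, its `Kind` enum and the `Token` shape are hypothetical, keywords such as as or date are not distinguished from identifiers, and string escapes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for a subset of FDL; not the project's Lexer.
final class MiniLexer {
    enum Kind { IDENTIFIER, NUMBER, STRING, SYMBOL, EOF }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Scans the source left to right and returns the tokens, ending with EOF.
    static List<Token> scan(String source) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // spaces and newlines are ignored
            } else if (source.startsWith("//", i)) {
                while (i < source.length() && source.charAt(i) != '\n') {
                    i++;                               // skip the comment up to the end of the line
                }
            } else if (c == '"') {
                int end = source.indexOf('"', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated string at " + i);
                }
                tokens.add(new Token(Kind.STRING, source.substring(i + 1, end)));
                i = end + 1;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Kind.NUMBER, source.substring(start, i)));
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Kind.IDENTIFIER, source.substring(start, i)));
            } else if (source.startsWith("=>", i)) {
                tokens.add(new Token(Kind.SYMBOL, "=>"));
                i += 2;
            } else if ("=.;[]()".indexOf(c) >= 0) {
                tokens.add(new Token(Kind.SYMBOL, String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        tokens.add(new Token(Kind.EOF, ""));
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : scan("// comment\nPatient[1].active=\"true\";")) {
            System.out.println(token.kind + " " + token.text);
        }
    }
}
```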
The Abstract Syntax Tree is a tree data structure where each node represents a statement or an expression. The AST allows keeping track of the order of the evaluation of the expression, enforcing operator precedence and associativity.
Writing an AST data structure by hand is tedious and error-prone. The AST data structure of FDL is generated via metaprogramming. A grammar definition YAML file is included in the project and, upon compilation, the Expr and Stmt classes are generated. The plugin that generates them is a submodule of this project.
To navigate the AST and perform the operation required for each node, FDL makes use of the Visitor pattern. The Visitor pattern makes it possible to put the logic into a separate class, instead of integrating it into each node.
In the FDL implementation, the Visitor is coded in two analogous abstract classes: Expr and Stmt. Each of them represents the Element of the Visitor pattern, and each of them also declares Visitor as a sub-interface. Also, each AST node class is declared as a sub-class of Expr or Stmt.
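The shape described above, an abstract Element class that nests its own Visitor interface and its node sub-classes, might look like the sketch below. Only the names Expr and Visitor come from the text; the node classes `Literal` and `Assign`, their fields and the `Printer` visitor are hypothetical, and the generated classes in the project will differ.

```java
// Sketch of an Element class with a nested Visitor; node classes are hypothetical.
abstract class Expr {
    // One visit method per node type.
    interface Visitor<R> {
        R visitLiteral(Literal expr);
        R visitAssign(Assign expr);
    }

    // Each node dispatches to the visitor method that matches its own type.
    abstract <R> R accept(Visitor<R> visitor);

    static final class Literal extends Expr {
        final String value;

        Literal(String value) {
            this.value = value;
        }

        @Override
        <R> R accept(Visitor<R> visitor) {
            return visitor.visitLiteral(this);
        }
    }

    static final class Assign extends Expr {
        final String target;
        final Expr value;

        Assign(String target, Expr value) {
            this.target = target;
            this.value = value;
        }

        @Override
        <R> R accept(Visitor<R> visitor) {
            return visitor.visitAssign(this);
        }
    }
}

// A visitor that keeps its logic outside the nodes: here it prints the tree back as FDL text.
final class Printer implements Expr.Visitor<String> {
    @Override
    public String visitLiteral(Expr.Literal expr) {
        return "\"" + expr.value + "\"";
    }

    @Override
    public String visitAssign(Expr.Assign expr) {
        return expr.target + "=" + expr.value.accept(this);
    }

    public static void main(String[] args) {
        Expr tree = new Expr.Assign("Patient.active", new Expr.Literal("true"));
        System.out.println(tree.accept(new Printer()));
    }
}
```

A second visitor, for example one that evaluates the tree instead of printing it, can be added without touching the node classes, which is the property the two passes described next rely on.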
Both StaticAnalyzer and Interpreter implement the Visitor and define each method required to visit each node, performing their own logic.
The StaticAnalyzer is the first pass on the AST and performs the static analysis. In particular, it resolves the generated objects in a way that the Interpreter can leverage to know exactly which object each reference denotes.
The Interpreter is the second pass on the AST and performs the actual interpretation and production of the FHIR resources and the final Bundle. It makes heavy use of Java reflection to generate FHIR resources using the HapiFHIR library.
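The reflection-based approach can be illustrated without the HapiFHIR dependency by setting an attribute on a plain stand-in class. This is a sketch: `ReflectiveSetter` and `DemoPatient` are hypothetical, and the real Interpreter works against HapiFHIR model classes with logic that is certainly more involved (type conversion, lists, nested elements).

```java
import java.lang.reflect.Method;

// Hypothetical illustration of assigning an attribute by name through reflection.
final class ReflectiveSetter {
    // Stand-in for a FHIR model class; not part of HapiFHIR.
    public static final class DemoPatient {
        private boolean active;

        public void setActive(boolean active) {
            this.active = active;
        }

        public boolean isActive() {
            return active;
        }
    }

    // Finds a one-argument public setter named after the attribute and invokes it.
    static void set(Object target, String attribute, Object value) {
        String setterName = "set" + Character.toUpperCase(attribute.charAt(0)) + attribute.substring(1);
        for (Method method : target.getClass().getMethods()) {
            if (method.getName().equals(setterName) && method.getParameterCount() == 1) {
                try {
                    method.invoke(target, value);
                    return;
                } catch (ReflectiveOperationException | IllegalArgumentException e) {
                    throw new IllegalStateException("Cannot assign " + attribute, e);
                }
            }
        }
        throw new IllegalStateException(
                "No attribute '" + attribute + "' on " + target.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        DemoPatient patient = new DemoPatient();
        set(patient, "active", true);
        System.out.println(patient.isActive());
    }
}
```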
The plugin is a Mojo that hooks into the generate-sources phase, parses the source YAML grammar definition file and generates the AST data structure (the definition of Visitor, the two Element classes Stmt and Expr, and the definition of each supported node).