Skip to content

Example 5. Dynamic schema on read

jameskrobinson edited this page Aug 8, 2021 · 5 revisions

This final set of examples outline an approach allowing dynamic schema on read.

Hive is a "schema on read" data store - when you query data, Hive projects a given schema onto the data. However, this schema is typically known ahead of query execution, as it's part of the DDL of the table that Hive projects onto the data. (I say "typically" as Avro schema's can be separate from a tables DDL, but this is outside of the scope of this discussion). When working with complex XML, it is desirable to be able to project a schema onto the data at query time. We've seen the UDTF doing this with the generic xml to csv template, but as outlined at the end of example 4, this approach is not particularly robust.

The approach I've taken instead is to use an XML schema file (an XSD) to allow for the UDTF's resultset column population to be determined at query time.

XSD is typically used to validate XML. As such, it defines the structure that an XML string must conform to, and therefore it's possible to use it to determine the hierarchy of a given XML string. This approach assumes that your XML does actually conform to an XSD. If it does not (this is indicative of a wider problem!) it's possible for you to generate a token XSD from samples of your XML.

You will recall we used this mocked up Summit ERD earlier, to illustrate the kinds of data we'll be querying:

Simplified swap trade schema

We have XML that has this hierarchy in place, and we've queried it in a number of ways. But, each time we did this, we either told the UDTF what the schema was (via the HiveHeader string) or we let the XSL figure out the schema from the data itself. The XSD approach differs from this, and works as follows:

  • User submits a select query, including a call to the UTDF:

select xsltable(xml_text,'/home/james_k_robinson/xsd/summit/swap-with-flows.xsd', 'NODE_LEVEL=EVENT', 'DEEP_COPY=YES') from summit.trade_xml;

  • UDTF process - step 1: - parse the supplied XSD file, to find the definition of the node supplied in the NODE_LEVEL parameter. In this case, the XSD data looks like below, for the requested EVENT node:

event_xsd

  • UDTF process - step 2: - use the definition of EVENT in the XSD to construct a set of columns and data types, and the XSL required to retrieve them from XML. This process uses a template called xsd_to_xsl.xsl which resides in the same directory as the XSD file.
  • UDTF process - step 3: - using the dynamically created XSL from step 2, transform the given XML data into tabular data, and then return this data to Hive;
  • Note that each XSD will have its own xsd_to_xsl.xsl file - this file is the real workhorse of this process. For Summit data, which is very relational, this XSL file is quite simple. The XSL file for ISO20022 data is more complex (due to increased nesting) and the XSL file for FPML is the most complex. It should be possible (it's in my backlog) to have a single, source-agnostic XSL file, but between the three that exist presently most bases are covered.
  • Also on the backlog - pulling column types from the XSD. Presently they're all strings, but it should be straightforward to sort this out...
  • Lets look at the output of this process for the EVENT example:

XSD approach for event

  • Unlike the previous example using the generic csv approach, we now have real columns;
  • The columns correspond with the definition of EVENT in the XSD;

Lets look at a few more examples:

Asset data:

select xsltable(xml_text,'/home/james_k_robinson/xsd/summit/swap-with-flows.xsd', 'NODE_LEVEL=ASSET', 'DEEP_COPY=NO') from summit.trade_xml;

XSD approach for asset

SWIFT ISO20022 data - each ISO message has its own XSD, which are publicly available. The example here is a CAMT053 (a statement of account) message. Note that for ISO and FPML data, we have to pass two node parameters - one stating which node level we want to retrieve from the XML, and one stating what data type this corresponds to the XSD. For Summit, these match up, but for ISO and FPML they generally do not, hence the requirement for two parameters.

Get the transaction level data from the CAMT053 statement:

select xsltable(xml_text,'/home/james_k_robinson/xsd/iso20022/camt.053.001.02.xsd', 'NODE_LEVEL=Ntry','NODE_TYPE=ReportEntry2', 'DEEP_COPY=NO') from iso.camt53_xml;

ISO Ntry

And, get the balance data, with no need for any code changes:

select xsltable(xml_text,'/home/james_k_robinson/xsd/iso20022/camt.053.001.02.xsd', 'NODE_LEVEL=Bal','NODE_TYPE=CashBalance3', 'DEEP_COPY=NO') from iso.camt53_xml;

iso bal

The XSL which parses the ISO and FPML XSD files is more complex that that used for the Summit XSD, as it contains logic for processing sub-nodes, as long as those nodes are not repeating. For example, the CAMT053 XSD defines a type called EntryTransaction2 as per below. Notice that each element is itself a complex type. The XSL loops through each complex type, assuming it's not repeating, in order to form a single data row.

iso xsd

So, when querying at the EntryTransaction level, you'll get results for each of the nested types above, for example:

select xsltable(xml_text,'/home/james_k_robinson/xsd/iso20022/camt.053.001.02.xsd', 'NODE_LEVEL=TxDtls','NODE_TYPE=EntryTransaction2') from iso.camt53_xml;

iso explode

The column names are prefixed with the sub-type names from the XSD, eg refs, amtdtls, etc.

Finally, lets look at the most complex of our XML types, FPML. This is heavily nested and didn't lend itself to the generic csv approach as a result. However, using the "loop through sub-nodes" approach adopted for ISO processing, it's possible to flatten out FPML successfully:

For a swap trade, the majority of the data resides in two "swapStream" nodes. The query below pulls back the swapStream data for a given trade, according to the definition of the "InterestRateStream" type in the XSD. Where there are sub-types that are defined as repeating groups (as shown for the business day calendars on one of the legs), a deep copy is undertaken.

select xsltable(xml_text,'/home/james_k_robinson/xsd/fpml/fpml_ird_complete.xsd', 'NODE_LEVEL=swapStream','NODE_TYPE=InterestRateStream', 'DEEP_COPY=YES') from fpml.trade_xml where trade_id = 'TW9235';

fpml_xsd_swapstream

fpml deep copy

As with the ISO example, you can query at other node levels (in this case, the cashflow schedule) without any config changes:

fpml xsd flows

This concludes the examples of the UDTF's functionality.

Feel free to have a look at the backlog.