Skip to content
Convert XML to avro file with xml2avro groovy processor in NiFi
Branch: master
Clone or download
maxbback Add files via upload
XML and XSD files that are more representing a real use case
Latest commit d4bbd1f Jan 17, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
CustomersOrder.xml Add files via upload Jan 17, 2019
CustomersOrder.xsd Add files via upload Jan 17, 2019
README.md
xml2avro.groovy Add files via upload Jan 7, 2019
xml2csv_xml2avro.groovy

README.md

xml2avro

Flatten out an XML files into avro files

XML to avro processing with NiFi and Groovy

This NiFi processor written in Groovy is converting a XML tree to tables by flatening it out.

All root branches are converted to tables, if this contains new branches they are coverted to separate tables if they are of the type one to many if not it is flattened out into the parent to minimize number of tables.

For more information and description read the articles behind this code on;

http://max.bback.se/index.php/2018/07/16/xml2avro-nifi-processor/

It is very easy to combine this script with my avro script to write avro records instead of csv records.

Xml to table (CSV) transformation with Groovy and NiFi

XMl is a common format used by many applications, so it will be one of the formats you need to support in your Big Data platform.

XML is a structured format that is good for application but not so convenient if you are used to work with SQL and tables so I decided to design and write code for transforming XML to CSV.

You might thing that converting from XML to CSV is a very simple thing, but it is not XML is a hierarchical tree structured data format while CSV is a flat column based format.

A branch in XML can key value pairs and branches, branches is of type one2one or ont2many

When we convert an XML the best is if we have a schema so we know how the XML file can look like as we then has the description for how the DB table shall be defined.

My design is based on that we have a XSD schema as a base to secure that we know which table columns we need.

In below schema we have example of many of the patterns we often see in a schema, lets discuss them. Element person has an attribute name and can exist in any number of instances and needs to be its own table. Element person has direct element childs;

Full_name is a simple element that is of type string and is a column in person table Child_name is a simple element of type string that can exist in 0-5 times and can either be its own table och be 5 columns in person table “child_name1-5” Child_name_complex is unbound and has to be its own table Element child_name_complex has direct element childs; Length is a simple element of type decimal and is a column in child_name_complex table Age is also a simple element and will also be a column in child_name_complex Mother is a complex type but can only exist 1 time and can then be flattened out to be columns in child_name_complex <xs:element name="persons"> xs:complexType xs:sequence <xs:element name="person" maxOccurs="unbounded"> xs:complexType xs:sequence <xs:element name="full_name" type="xs:string"/> <xs:element name="child_name" type="xs:string" minOccurs="0" maxOccurs="5"/> <xs:element name="child_name_complex" minOccurs="0" maxOccurs="unbounded"> xs:complexType xs:sequence <xs:element name="length" type="xs:decimal" minOccurs="0" nillable="true"/> <xs:element name="age" type="xs:decimal" minOccurs="0" nillable="true"/> <xs:element name="mother" minOccurs="0" maxOccurs="1"> xs:complexType xs:sequence <xs:element name="name" type="xs:string" minOccurs="0" nillable="true"/> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element>

Rules for XML to tables We need simple rules to follow when we are flattening out the XML tree to tables and columns.

Element with maxOccurs=”unbound” must be its own table Element with maxOccurs=”1” can be flattened out to be column in its parent but child elements must be renamed to include the parent name a. Example mother_name for mother element name Attributes is column in its parent Element of type simple that has a limited maxOccurs can be flattened out to columns, but if the value is high it’s better to generate a new table Tables of child’s need to be unique this can be accomplished be including parent table name in the table name or make each table unique by adding a suffix to them. Sometimes XSD schema is very open which will result in many tables, if you know your data and the alternatives of its look, it is better to make a more strict schema which will dramatically lower the number of tables, if you use many 1t1 branches remove them and make them simple elements instead as you then will get shorter column names.

Read the full story in the word document.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.