# Validation and Transformation of XML with Python using LXML

In [1]:
# Standard Includes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# "magic" to display graphs in the notebook
%matplotlib inline


If we are validating XML, then we need to use LXML instead of ElementTree (xml.etree.ElementTree). LXML supports validation using Document Type Definition (DTD) and XML Schema Definition (XSD). LXML also supports transformations using eXtensible Stylesheet Language Transformations (XSLT).

It does this by "wrapping" a standard library called libxml2. Wrapping just means that the developers have provided access to a library written in one language, in this case C, from a different language, in this case Python. Why does this matter? Since libxml2 is written in C, a compiled language rather than an interpreted language like Python, it is faster. For a lot of the interactive-style programming we are doing in this class, speed doesn't matter as much as it would on a server with lots of users. We are more interested in libxml2's more complete set of features for handling XML.

In [2]:
# Import lxml; this format allows us to type less to use the library.
from lxml import etree
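
If you are curious which C libraries your copy of LXML is wrapping, the etree module exposes their versions as tuples. A minimal sketch; the numbers you see depend on your installation:

```python
from lxml import etree

# Version of the Python wrapper itself
print("lxml:    ", etree.LXML_VERSION)
# Versions of the C libraries that lxml is wrapping
print("libxml2: ", etree.LIBXML_VERSION)
print("libxslt: ", etree.LIBXSLT_VERSION)
```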

The variable 'books_xml' contains a very simple XML document in the format of a string. Triple single quotes (''') are the way you create strings that span multiple lines in Python.

You should notice some things about the XML:
1) The first line is weird. <?xml version="1.0"?> is the XML declaration. It is optional in XML 1.0, but when it is present it must be the very first thing in the document. (You may have seen it at the start of XHTML files before)
2) x:books is a weird name for a tag. The "x" is a namespace prefix; xmlns:x="urn:books" binds it to the namespace urn:books. This is important for validation because the XML Schema declares the same namespace as its target.


(books.xml and books.xsd are from https://msdn.microsoft.com/en-us/library/ms762271(v=vs.85).aspx).

In [3]:
books_xml = '''<?xml version="1.0"?>
<x:books xmlns:x="urn:books"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="urn:books books.xsd">
   <book id="bk001">
      <author>Hightower, Kim</author>
      <title>The First Book</title>
      <genre>Fiction</genre>
      <price>44.95</price>
      <pub_date>2000-10-01</pub_date>
      <review>An amazing story of nothing.</review>
   </book>
   <book id="bk003">
      <author>Nagata, Suanne</author>
      <title>Becoming Somebody</title>
      <genre>Biography</genre>
      <review>A masterpiece of the fine art of gossiping.</review>
   </book>
   <book id="bk002">
      <author>Oberg, Bruce</author>
      <title>The Poet's First Poem</title>
      <genre>Poem</genre>
      <price>24.95</price>
      <review>The least poetic poems of the decade.</review>
   </book>
</x:books>'''
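
To see what the "x" prefix turns into once the document is parsed, here is a small sketch. It uses a cut-down stand-in for books_xml (the name sample_xml is made up for this example) so that it runs on its own:

```python
from lxml import etree

# A cut-down stand-in for books_xml (sample_xml is a hypothetical name)
sample_xml = '''<x:books xmlns:x="urn:books">
   <book id="bk001"><author>Hightower, Kim</author></book>
</x:books>'''

root = etree.fromstring(sample_xml)

# The prefix is expanded to the full namespace name: {namespace}tag
print(root.tag)      # {urn:books}books
print(root.prefix)   # x
print(root.nsmap)    # {'x': 'urn:books'}

# <book> has no prefix and there is no default namespace,
# so it does not belong to any namespace
print(root[0].tag)   # book
```

Notice that only the root element is in the urn:books namespace. That is why the later cells can say find('book/author') without mentioning a namespace at all.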

What do you notice about 'books_xsd'?

First off, it looks like XML. That is because XML Schemas are valid XML documents. You should also notice that its targetNamespace, urn:books, is the same namespace that the XML file uses. That is how we know, and a computer program could know, that the two documents belong together.

You can learn all you ever want to know about XML Schema here: https://www.w3schools.com/xml/schema_intro.asp

In [4]:
books_xsd = '''<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:books"
            xmlns:bks="urn:books">
  <xsd:element name="books" type="bks:BooksForm"/>
  <xsd:complexType name="BooksForm">
    <xsd:sequence>
      <xsd:element name="book" 
                  type="bks:BookForm" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
      </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="BookForm">
    <xsd:sequence>
      <xsd:element name="author"   type="xsd:string"/>
      <xsd:element name="title"    type="xsd:string"/>
      <xsd:element name="genre"    type="xsd:string"/>
      <xsd:element name="price"    type="xsd:float" />
      <xsd:element name="pub_date" type="xsd:date" />
      <xsd:element name="review"   type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="id"   type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>'''
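
Since a schema is just XML, you can parse it and poke at it like any other document. A minimal sketch, using a tiny made-up schema (mini_xsd is a hypothetical name, not the books.xsd above):

```python
from lxml import etree

# A tiny stand-in schema; mini_xsd is a hypothetical name for this sketch
mini_xsd = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:books"
            xmlns:bks="urn:books">
  <xsd:element name="books" type="xsd:string"/>
</xsd:schema>'''

xsd_root = etree.fromstring(mini_xsd)

# The schema's own root tag lives in the XML Schema namespace
print(xsd_root.tag)                       # {http://www.w3.org/2001/XMLSchema}schema
# The namespace this schema describes
print(xsd_root.get('targetNamespace'))    # urn:books

# List the element names the schema declares
XSD_NS = {'xsd': 'http://www.w3.org/2001/XMLSchema'}
declared = xsd_root.xpath('//xsd:element/@name', namespaces=XSD_NS)
print(declared)                           # ['books']
```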

Here we create a Python XML object:

In [5]:
books_obj = etree.fromstring(books_xml)

Now that we have that object, we can test if it is valid.
First we have to read the XML Schema. We do that just like it was XML, because it is XML.
Then we convert the XML object into an XML Schema object.

In [6]:
xsd_obj = etree.fromstring(books_xsd)
books_schema = etree.XMLSchema(xsd_obj)

Now that it is an XML Schema object, we have a new method "validate". This will tell us if our XML is valid or not.

In [7]:
books_schema.validate(books_obj)

False

WHAT? It is invalid?

Since it is a small file, we could read both the schema and the XML to figure out the problem. But, that would become tiresome on a really large file.

Wouldn't it be nice to see some errors to figure out the issue?

In [8]:
# Create an XMLParser object that validates against the schema
books_parser1 = etree.XMLParser(schema=books_schema)
# This parses and validates the XML in one step, so when we run it we should see an explosion!
books1 = etree.fromstring(books_xml, books_parser1)

XMLSyntaxError: Element 'review': This element is not expected. Expected is ( price ). (<string>, line 0)

Aha, look up there! The element 'review' wasn't expected where it was found; the parser was expecting 'price' instead. Hmm...
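
If you would rather read the errors without the explosion, the schema object keeps an error_log after a failed validate, and assertValid raises an exception you can catch. A minimal sketch with a cut-down schema and document (mini_xsd, mini_xml, schema and doc are names made up for this example; the BookForm here only has three of the six tags):

```python
from lxml import etree

# Cut-down stand-ins for books_xsd and books_xml (hypothetical names)
mini_xsd = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:books" xmlns:bks="urn:books">
  <xsd:element name="books" type="bks:BooksForm"/>
  <xsd:complexType name="BooksForm">
    <xsd:sequence>
      <xsd:element name="book" type="bks:BookForm"
                   minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="BookForm">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="price"  type="xsd:float"/>
      <xsd:element name="review" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>'''

mini_xml = '''<x:books xmlns:x="urn:books">
  <book><author>Nagata, Suanne</author><review>No price here.</review></book>
</x:books>'''

schema = etree.XMLSchema(etree.fromstring(mini_xsd))
doc = etree.fromstring(mini_xml)

# validate() only returns True/False; the details land in error_log
is_valid = schema.validate(doc)
messages = [error.message for error in schema.error_log]
for error in schema.error_log:
    print("line", error.line, ":", error.message)

# assertValid() raises instead of returning False
try:
    schema.assertValid(doc)
except etree.DocumentInvalid as exc:
    print("Invalid:", exc)
```

The error_log approach is handy when a document has several problems, since you can loop over whatever the validator recorded instead of stopping at one exception.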

Scroll up and look at where we created 'books_xml'. Seriously, go do it.

Ok. Did you see anything?

No... Did you scroll?

Ok fine. Look here:

In [None]:
books_xml = '''<?xml version="1.0"?>
<x:books xmlns:x="urn:books"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="urn:books books.xsd">
   <book id="bk001">
      <author>Hightower, Kim</author>
      <title>The First Book</title>
      <genre>Fiction</genre>
      <price>44.95</price>
      <pub_date>2000-10-01</pub_date>
      <review>An amazing story of nothing.</review>
   </book>
   <book id="bk003">
      <author>Nagata, Suanne</author>
      <title>Becoming Somebody</title>
      <genre>Biography</genre>
      <review>A masterpiece of the fine art of gossiping.</review>
   </book>
   <book id="bk002">
      <author>Oberg, Bruce</author>
      <title>The Poet's First Poem</title>
      <genre>Poem</genre>
      <price>24.95</price>
      <review>The least poetic poems of the decade.</review>
   </book>
</x:books>'''

Let's look at the book elements, since those are the repeating things that have all the data in them.

The first goes: author, title, genre, price, pub_date, review
The second goes: author, title, genre, review
The third goes: author, title, genre, price, review

Hmm... I think it is the second one that is causing the problem, because the error says that it wasn't expecting to see 'review' yet, and the other two books have a price tag that comes before the review tag.

If you look at the XML Schema above, you will see that those elements are contained in a sequence tag, which means that the tags have to appear in that specific order.

Our problem isn't that they are out of order; instead, we have tags that are missing.
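
To see the sequence rule on its own, here is a minimal sketch with a made-up two-tag schema (order_xsd and the three little documents are hypothetical, and there is no namespace, to keep it short). Both a swapped order and a missing tag fail:

```python
from lxml import etree

# A tiny schema: <book> must contain <author> then <title>, in that order
order_xsd = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="author" type="xsd:string"/>
        <xsd:element name="title"  type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>'''
order_schema = etree.XMLSchema(etree.fromstring(order_xsd))

in_order = etree.fromstring('<book><author>A</author><title>T</title></book>')
swapped = etree.fromstring('<book><title>T</title><author>A</author></book>')
missing = etree.fromstring('<book><author>A</author></book>')

print(order_schema.validate(in_order))   # True
print(order_schema.validate(swapped))    # False: right tags, wrong order
print(order_schema.validate(missing))    # False: a required tag is absent
```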

Well, if you were getting that XML from a vendor, you'd probably call up the vendor and tell them that you have to have the price or you can't sell their book. Right? So problem solved...

What if all you really needed to know were those 4 pieces of information in the second book? Then you could change the schema to be like below. This new schema says that those last 3 tags are optional (you can have 0 to 1 instances of them).

In [10]:
books_xsd2 = '''<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:books"
            xmlns:bks="urn:books">
  <xsd:element name="books" type="bks:BooksForm"/>
  <xsd:complexType name="BooksForm">
    <xsd:sequence>
      <xsd:element name="book" 
                  type="bks:BookForm" 
                  minOccurs="0" 
                  maxOccurs="unbounded"/>
      </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="BookForm">
    <xsd:sequence>
      <xsd:element name="author"   type="xsd:string"/>
      <xsd:element name="title"    type="xsd:string"/>
      <xsd:element name="genre"    type="xsd:string"/>
      <xsd:element name="price"    type="xsd:float" minOccurs="0" maxOccurs="1"  />
      <xsd:element name="pub_date" type="xsd:date" minOccurs="0" maxOccurs="1" />
      <xsd:element name="review"   type="xsd:string" minOccurs="0" maxOccurs="1" />
    </xsd:sequence>
    <xsd:attribute name="id"   type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>'''

In [11]:
# create a schema object with this new XML Schema
xsd2_obj = etree.fromstring(books_xsd2)
books_schema2 = etree.XMLSchema(xsd2_obj)
books_schema2.validate(books_obj)


True

And now it is valid. Yay!

In [12]:
books_parser = etree.XMLParser(schema = books_schema2)
books = etree.fromstring(books_xml, books_parser)

So we'd think that since the XML Schema specifies what types the tags should contain, our lives would be made easier and we wouldn't have to do conversions ourselves...

Not the case:

In [13]:
type(books_obj.find('book/price').text)

str

Well drat...

What is the point then?

In [14]:
for b in books.findall('book'):
    print(b.find('author').text)
    price = b.find('price')
    # Compare with None explicitly: an element with no children is falsy
    if price is not None:
        print(" " + price.text, type(price.text))

Hightower, Kim
 44.95 <class 'str'>
Nagata, Suanne
Oberg, Bruce
 24.95 <class 'str'>


Parsing XML (or anything) is a lot of work. You write a lot of code, and that code depends on the format of the document you are parsing. If the document's structure changes, then your code breaks. An XML Schema acts as a contract between the two systems: a document that breaks the contract fails validation up front, so your parsing code never has to deal with a structure it wasn't written for.
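
The conversions themselves are still on us, but once a document has passed validation they are short, because we know the text really is a float or a date. A minimal sketch; book_values is a hypothetical helper, the sample element is made up, and it assumes dates without a timezone suffix (xsd:date allows one, which date.fromisoformat may not accept):

```python
from datetime import date
from lxml import etree

# A made-up, already "validated" book element for the sketch
sample = etree.fromstring('''<book id="bk001">
  <price>44.95</price>
  <pub_date>2000-10-01</pub_date>
</book>''')

def book_values(book):
    """Hypothetical helper: convert the optional price and pub_date by hand."""
    price = book.findtext('price')          # None when the tag is absent
    pub_date = book.findtext('pub_date')
    return {
        'id': book.get('id'),
        'price': float(price) if price is not None else None,
        'pub_date': date.fromisoformat(pub_date) if pub_date is not None else None,
    }

values = book_values(sample)
print(values)
```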

## XPATH

I mentioned that we could identify or name a node by its path, just like we did in the command line. ElementTree supports a light version of this that we've used when we say something like:

In [15]:
books.find('book/author').text

'Hightower, Kim'

But, Donal, you say, you are using LXML, you say... 
That line above is 100% compatible between ElementTree and LXML.

See:

In [16]:
import xml.etree.ElementTree as ET
ET_book = ET.fromstring(books_xml)
ET_book.find('book/author').text

'Hightower, Kim'

But this is where things get different:

In [17]:
for node in books.xpath("//book"):
    print(node.xpath('author/text()'), node.xpath('title/text()'))

['Hightower, Kim'] ['The First Book']
['Nagata, Suanne'] ['Becoming Somebody']
['Oberg, Bruce'] ["The Poet's First Poem"]


You will notice that ET_book doesn't have an xpath method:

In [18]:
for node in ET_book.xpath("//book"):
    print(node.xpath('author/text()'), node.xpath('title/text()'))

AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'

XPATH is very powerful and you can do a lot with it.

If you want to know more about it then: https://www.w3schools.com/xml/xpath_intro.asp
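
As a small taste of that power, here is a sketch of a few XPATH features: filtering with predicates, pulling out attributes, calling functions, and using a namespace prefix. The document xp_books is a made-up, cut-down version of the books above so the example runs on its own:

```python
from lxml import etree

# A cut-down stand-in for the books document (xp_books is a hypothetical name)
xp_books = etree.fromstring('''<x:books xmlns:x="urn:books">
   <book id="bk001"><title>The First Book</title><price>44.95</price></book>
   <book id="bk003"><title>Becoming Somebody</title></book>
   <book id="bk002"><title>The Poet's First Poem</title><price>24.95</price></book>
</x:books>''')

# Predicate: only books whose price is greater than 30
expensive = xp_books.xpath('//book[price > 30]/title/text()')
print(expensive)    # ['The First Book']

# Predicate with not(): the ids of books that have no price at all
no_price = xp_books.xpath('//book[not(price)]/@id')
print(no_price)     # ['bk003']

# XPATH functions return plain values; numbers come back as floats
n_books = xp_books.xpath('count(//book)')
print(n_books)      # 3.0

# Prefixed names need a prefix-to-namespace mapping
roots = xp_books.xpath('/x:books', namespaces={'x': 'urn:books'})
print(len(roots))   # 1
```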

## XSLT

What does it mean to transform something?

It means to change one thing into another, right?

So with XSLT we can change an XML document into another kind of document. This can be a different kind of XML document like below:

In [19]:
xslt_root = etree.XML('''\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="/">
        <authors>
            <xsl:for-each select="//book">
                <author>
                    <xsl:value-of select="author" />
                </author>
            </xsl:for-each>
        </authors>
        </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_root)
authors = transform(books).getroot()

In [20]:
authors.tag

'authors'

In [21]:
for t in authors:  # iterating an element yields its children; getchildren() is deprecated
    print(t.tag, t.text)

author Hightower, Kim
author Nagata, Suanne
author Oberg, Bruce


Now we have an XML file that is just a list of authors. That is neat, I guess...

Where it becomes really powerful is if you are trying to present a document multiple ways. You could change the document into HTML, PDF (with some extra tools), Word...
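
For example, the same kind of stylesheet can produce an HTML list instead of XML. A minimal sketch, assuming a made-up two-book document (src) and a made-up stylesheet (to_html); the xsl:output line asks for HTML serialization:

```python
from lxml import etree

# A made-up, cut-down source document for the sketch
src = etree.fromstring('''<books>
   <book><author>Hightower, Kim</author><title>The First Book</title></book>
   <book><author>Oberg, Bruce</author><title>The Poet's First Poem</title></book>
</books>''')

# A stylesheet that writes one <li> per book inside a <ul>
to_html = etree.XSLT(etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <ul>
      <xsl:for-each select="//book">
        <li><xsl:value-of select="title"/> by <xsl:value-of select="author"/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>'''))

# str() serializes the result the way the stylesheet's xsl:output asks
html = str(to_html(src))
print(html)
```

The source document never changes; a different stylesheet gives you a different presentation of the same data.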

We don't use them much in data analysis. But out in the world, these tools are all over the place. If you come across them, then you will know what they are doing and how to take advantage of them.