# Parsing XML with Python

Python provides several libraries for working with XML files, such as:

- **`lxml`**: A clean, fast and strict library for dealing with XML files. It is also one of the most widely used, and it supports `xpath` and `xslt`.

- **`BeautifulSoup`**: It is flexible but a bit slower than `lxml`. The good thing is that if your XML markup is messed up, it will try to correct it. It's perfect for dealing with web-scraped data in HTML formats. For clean XML, it might be too slow. It is discussed in detail in chapter `Chapter S2.05 - REST API - Server & Clients`.

- **`xml`** : It ships with Python's standard library and is fast and clean, but it supports only a limited subset of XPath and has no XSLT support. We discuss it in more detail later in this chapter.

Read about others on the Python [official wiki](https://wiki.python.org/moin/PythonXml).

## `lxml`

Based on my experience, `lxml` will meet most of our needs in handling clean data. Here **"`Clean`"** is the keyword: by default, it will not handle invalid `html` or `xml` and will simply raise an error.

> This is a third party library and thus needs to be installed using `pip`

### From file to XML object

Opening an xml file is actually quite simple : you open it and you parse it. Who would have guessed ?

In [1]:
from lxml import etree

with open("code/data/books.xml") as file:    
    parsed = etree.parse(file)

print(parsed)

<lxml.etree._ElementTree object at 0x7f82285bc4c8>


As you can see, we obtained an instance of type `lxml.etree._ElementTree`

In [2]:
# New etree parser, with empty text nodes removed

parser = etree.XMLParser(remove_blank_text=True)

with open("code/data/books.xml") as file:
    parsed = etree.parse(file, parser)

print(parsed)


<lxml.etree._ElementTree object at 0x7f82285d19c8>


A few useful arguments of `XMLParser` are as follows:

- *attribute_defaults* : Use DTD (if available) to add the default attributes
- *dtd_validation* : Validate against DTD while parsing
- *load_dtd* : Load and parse the DTD while parsing
- *ns_clean* : Clean up redundant namespace declarations
- *recover* : Try to fix ill-formed xml
- *remove_blank_text* : Removes blank text nodes
- *resolve_entities* : Replace entities by their value (Default : on)

You can create a parser configured with any of these options, for example to clean up redundant namespace declarations. In this context, *ns_clean* would transform


`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>`

into

`<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag /><tag /></root>`
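As a rough sketch of how such options are passed, the example below compares the default parser with one using `remove_blank_text`, and then uses `recover` on a truncated document. The input strings are made up for illustration.

```python
from lxml import etree

# Hypothetical input: the indentation creates whitespace-only text nodes.
pretty = "<root>\n    <child>text</child>\n</root>"

default_root = etree.fromstring(pretty)
compact_root = etree.fromstring(pretty, etree.XMLParser(remove_blank_text=True))

print(repr(default_root.text))   # the whitespace before <child> is kept
print(repr(compact_root.text))   # the whitespace-only node is dropped
print(etree.tostring(compact_root))

# recover=True asks the parser to build whatever tree it can from broken input.
broken = '<start attr="test">'
recovered = etree.fromstring(broken, etree.XMLParser(recover=True))
print(recovered.tag, dict(recovered.attrib))
```

A recovering parser is convenient for salvaging data, but it hides errors, so the strict default is usually the safer choice for clean sources.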

### String to XML object

`lxml` parses strings with the `fromstring` function, which is the counterpart of `parse` for files, as shown in the example below. Note that it returns the root `Element` rather than an `_ElementTree`.

In [3]:
xml = '<root xmlns:a="xmlns1" xmlns:b="xmlns2"><tag xmlns:c="xmlns3" /><tag xmlns:a="xmlns1" /><tag /></root>'
parsed = etree.fromstring(xml)
print(parsed)

<Element root at 0x7f82285d1888>


**Questions:**

Can you parse an xml document made of one tag "humanities" with two children "field" named "classics" and "history"?

In [4]:
# Put your code here





### Errors handling

As previously stated, `lxml` is quite strict about `xml` validity. Let's try the following example to understand this in detail.
Below we have three invalid xml strings; let's see how `lxml` handles them.

In [5]:
from lxml.etree import XMLSyntaxError
xml = """
"""
# 
xml2 = """
<start>this is a text</start1>
"""
#
xml3 = """
<start attr="test">
"""
for x in [xml, xml2, xml3]:
    try:
        print("*** XML:", x)
        etree.fromstring(x)
    except XMLSyntaxError as e:
        print(e)

*** XML: 

Start tag expected, '<' not found, line 2, column 1 (<string>, line 2)
*** XML: 
<start>this is a text</start1>

expected '>', line 2, column 29 (<string>, line 2)
*** XML: 
<start attr="test">

Premature end of data in tag start line 2, line 3, column 1 (<string>, line 3)


As we can see, most of the errors are detailed enough so we can correct the XML file manually.
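In a script, the same `try`/`except` pattern can be wrapped in a small helper. The sketch below is only an illustration: `parse_or_none` is a hypothetical name, not part of lxml.

```python
from lxml import etree


def parse_or_none(text):
    """Return the parsed root element, or None if `text` is not well-formed."""
    try:
        return etree.fromstring(text)
    except etree.XMLSyntaxError as error:
        # XMLSyntaxError derives from SyntaxError, hence the `lineno` attribute.
        print("line", error.lineno, "->", error)
        return None


good = parse_or_none("<start>this is a text</start>")
bad = parse_or_none("<start>this is a text</start1>")
print(good, bad)
```

Returning `None` keeps the calling code simple, at the cost of losing the error details unless they are logged as above.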

### Node properties and methods

> **!! Warning : namespaces !!** : 
> ----
> In lxml, namespaces are expressed using Clark notation. This means that, if a node belongs to a namespace, it is named using the syntax `{namespace}tagname`. Here is an example :

In [6]:
# With no namespace
print(etree.fromstring("<root />"))
# With namespace
print(etree.fromstring("<root xmlns='http://localhost' />"))

<Element root at 0x7f82285d1ac8>
<Element {http://localhost}root at 0x7f82285d1ac8>


Below is the chart detailing the functionalities offered by `lxml`

![Cheatsheet](images/CheatsheetElement.svg)

Let's see what that means in real life :

In [7]:
# First, we will need some xml
xml = """
<div type="Book" n="1">
    <l n="1">The Rigveda is undoubtedly the oldest literary monument of the Indo-European languages.</l>
    <mj:l n="2" xmlns:mj="http://www.mayankjohri.org/ns/1.0">But the exact period when the hymns were composed is a matter of conjecture.</mj:l>
    <l n="3">All that we can say with any approach to certainty is that the oldest of them cannot </l>
    <l n="4">date from later than the thirteenth century bc.</l>
    <l n="5">This assertion is based on the following grounds. B</l>
    <l n="6">uddhism, which began to spread in India about 500 bc, </l>
    <l n="7">presupposes the existence not only of the Vedas, </l>
</div>
"""
book = etree.fromstring(xml)
print(book)


<Element div at 0x7f8228334288>


If we want to retrieve the attributes of our div, we can do as follows :

In [8]:
type_div = book.get("type")
print(type_div)
print(book.get("n"))

print(book.attrib)
attributes_div = dict(book.attrib)
print(attributes_div)

list_attributes_div = book.items()
print(list_attributes_div)

Book
1
{'type': 'Book', 'n': '1'}
{'type': 'Book', 'n': '1'}
[('type', 'Book'), ('n', '1')]


To get more details from the xml, we can access the children of a node in two ways :

- `getchildren()` returns a list of the child elements of the node (note that lxml marks this method as deprecated).
- `list(book)` turns `book` into a list of its child elements, here the lines.

Both syntaxes return the same results, but `list(book)` or direct iteration is preferable in new code.

In [9]:
from pprint import pprint

children = book.getchildren()
pprint(children)

[<Element l at 0x7f82285bc408>,
 <Element {http://www.mayankjohri.org/ns/1.0}l at 0x7f8228334108>,
 <Element l at 0x7f8228334348>,
 <Element l at 0x7f82283343c8>,
 <Element l at 0x7f8228334408>,
 <Element l at 0x7f82283344c8>,
 <Element l at 0x7f8228334508>]


To get one element from the list of elements,

In [10]:
line_1 = children[0] 
pprint(line_1)

<Element l at 0x7f82285bc408>


Now that we have access to our children, we can have access to their text :

In [11]:
print(line_1.text)

The Rigveda is undoubtedly the oldest literary monument of the Indo-European languages.


Ok, we are now able to get some stuff done. Remember the namespace naming ? Sometimes it's useful to retrieve namespaces and their prefix :

In [12]:
line_2 = children[1]
print(line_2.nsmap)
print(line_2.prefix)
print(line_2.tag)

{'mj': 'http://www.mayankjohri.org/ns/1.0'}
mj
{http://www.mayankjohri.org/ns/1.0}l
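If you need the namespace and the local name separately, one option is lxml's `etree.QName` helper, which splits a tag written in Clark notation. A minimal sketch, reusing the namespace from the example above on a shortened element:

```python
from lxml import etree

line = etree.fromstring(
    '<mj:l n="2" xmlns:mj="http://www.mayankjohri.org/ns/1.0">text</mj:l>'
)

# QName parses the "{namespace}localname" form into its two parts.
qname = etree.QName(line.tag)
print(qname.namespace)
print(qname.localname)
```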


**What you've learned** :

- How to parse an xml file or a string representing xml through `etree.parse()` and `etree.fromstring()`
- How to configure the way xml is parsed with `etree.XMLParser()`
- How to read the attributes of a node with `get()`, `attrib` and `items()`
- Properties and methods of a node
- `XMLSyntaxError` handling
- Clark's notation for namespaces and tags.

### XPath and XSLT with lxml

#### XPath

XPath is a powerful tool for traversing an xml tree. XML is made of nodes such as tags, comments, texts. These nodes have attributes that can be used to identify them. For example, with the following xml :

> `<div><l n="1"><p>Text</p> followed</l><l n="2">by line two</l></div>`

the node p will be accessible by `/div/l[@n="1"]/p`. LXML has great support for complex XPath, which makes it the best friend of Humanists dealing with xml :

In [13]:
# We generate some xml and parse it

xml = """<div>
            <l n="1">
                <p>Text</p> 
                <p>new p</p>
                followed
                <test>
                    <p>p3</p>
                </test>
            </l>
            <l n="2">
                by line two
            </l>
            <p>test</p>
            <p><l n="3"> line 3</l></p>
        </div>"""
div = etree.fromstring(xml)
print(div)
# When doing an xpath, the results will be a list
print("-"*20)
ps = div.xpath("/div/l")
for p in ps:
    print(p)
print("-"*20)
# print(ps)
print([value.values()[0] for value in ps])
print(ps[0].text == "Text")

<Element div at 0x7f8228334a88>
--------------------
<Element l at 0x7f8228334b48>
<Element l at 0x7f8228334c48>
--------------------
['1', '2']
False


As you can see, the xpath returns a list. This behaviour is intended, since an xpath can retrieve more than one item :

In [14]:
print(div.xpath("//l"))

[<Element l at 0x7f8228334b48>, <Element l at 0x7f8228334c48>, <Element l at 0x7f8228334708>]


You see ? The xpath `//l` returns three elements in a regular Python list. Now, let's apply some xpath to the children and see what happens :

In [15]:
# We assign our first line to a variable
line_1 = div.xpath("//l")[0]
#print(dir(line_1))
print(line_1.attrib['n'])

# We look for p
print(line_1.xpath("p")) # This works
print(line_1.xpath("./p")) # This works too

print(line_1.xpath(".//p")) # This still works 

print(line_1.xpath("//p")) # entire doc


1
[<Element p at 0x7f8228334dc8>, <Element p at 0x7f8228334e08>]
[<Element p at 0x7f8228334dc8>, <Element p at 0x7f8228334e08>]
[<Element p at 0x7f8228334e08>, <Element p at 0x7f8228334748>, <Element p at 0x7f8228334d88>]
[<Element p at 0x7f8228334e08>, <Element p at 0x7f8228334748>, <Element p at 0x7f8228334d88>, <Element p at 0x7f8228334e88>, <Element p at 0x7f8228334f48>]


As you can see, you can run an xpath from any node in lxml. One important thing though : xpath `//tagname` *will return to the root* if you do not add a dot in front of it such as **`.`**`//tagname`. This is really important to remember : a leading `//` is an absolute path in XPath, so it always starts from the document root, whichever node you call `xpath()` on.
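The difference is easy to check on a tiny document. The snippet below is a minimal sketch with made-up content: one `p` sits inside the `l` node and one sits outside it.

```python
from lxml import etree

doc = etree.fromstring("<div><l><p>inside l</p></l><p>outside l</p></div>")
line = doc.xpath("/div/l")[0]

absolute = line.xpath("//p")   # restarts from the document root
relative = line.xpath(".//p")  # stays below the context node `line`

print([p.text for p in absolute])
print([p.text for p in relative])
```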

Another point to keep in mind : if you write your xpath incorrectly, lxml will raise an *XPathEvalError* :

In [16]:
from lxml.etree import XPathEvalError
try:
    line_1.xpath("wrong:xpath:never:works")
except XPathEvalError as e:
    print(e.__str__())

Invalid expression


### Xpath with namespaces and prefix

As you've seen, lxml uses Clark's naming convention for expressing namespaces. This is extremely important regarding xpath, because you will be able to retrieve a node using it under certain conditions :

In [17]:
# We create a valid xml object
xml = """<root>
<tag xmlns="http://localhost">Text</tag>
<tei:tag xmlns:tei="http://www.tei-c.org/ns/1.0">Other text</tei:tag>
<teiTwo:tag xmlns:teiTwo="http://www.tei-c.org/ns/2.0">Other text</teiTwo:tag>
</root>"""
root = etree.fromstring(xml)
# We register every namespace in a dictionary using prefixes as keys :
ns = {
    "local" : "http://localhost", # Even if this namespace had no prefix, we can register one for it
    "tei" : "http://www.tei-c.org/ns/1.0",
    "two": "http://www.tei-c.org/ns/2.0"
}

print([d.text for namespace in ns 
       for d in root.xpath("//{namespace}:tag".format(namespace=namespace), 
                           namespaces=ns) ])

['Text', 'Other text', 'Other text']


**What you have learned** :

- Each node and xml document has an `.xpath()` method which takes an XPath expression as its first parameter
- Method `xpath()` always returns a list, even for a single result
- Method `xpath()` will return to the root when you don't prefix your `//` with a dot.
- An incorrect XPath will raise an `XPathEvalError`
- Method `xpath()` accepts a `namespaces` argument : you should pass a dictionary where keys are prefixes and values are namespaces
- Unlike `findall()`, `xpath()` does not accept Clark's notation
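To make the last two points concrete, here is a small sketch that selects the same node twice: once with `findall()` and Clark notation, once with `xpath()` and a prefix mapping. The `local` prefix is an arbitrary choice for this example.

```python
from lxml import etree

root = etree.fromstring('<root><tag xmlns="http://localhost">Text</tag></root>')

# findall() accepts Clark notation directly...
clark = root.findall("{http://localhost}tag")

# ...while xpath() needs a prefix that is mapped through `namespaces`.
prefixed = root.xpath("//local:tag", namespaces={"local": "http://localhost"})

print([node.text for node in clark])
print([node.text for node in prefixed])
```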

### XSLT

XSLT stands for *Extensible Stylesheet Language Transformations*. It's an xml-based language made for transforming xml documents to xml or other formats such as LaTeX and HTML. XSLT is really powerful when dealing with similarly formatted data. It's far easier to transform 100 documents with the exact same structure via XSLT than in Python or any other language.

While Python is great at dealing with weird transformations of xml, the presence of XSLT in Python allows you to create production chains without leaving your favorite IDE.

To do some XSL, lxml needs two things : first, an xml document representing the xsl that will be parsed and entered into the function `etree.XSLT()`, and second, a document to transform.

In [18]:
# Here is an xml containing an xsl: for each text node of an xml file in the xpath /humanities/field,
#     this will return a node <name> with the text inside
xslt_root = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <fields><xsl:apply-templates /></fields>
    </xsl:template>
    <xsl:template match="/humanities/field">
        <name><xsl:value-of select="./text()" /></name>
    </xsl:template>
</xsl:stylesheet>""")
# We transform our document to an xsl 
xslt = etree.XSLT(xslt_root)

# We create some xml we need to change 
xml = """<humanities>
    <field>History</field>
    <field>Classics</field>
    <field>French</field>
    <field>German</field>
</humanities>"""
parsed_xml = etree.fromstring(xml)
# And now we process our xml :
transformed = xslt(parsed_xml)
print(transformed)

<?xml version="1.0"?>
<fields>
    <name>History</name>
    <name>Classics</name>
    <name>French</name>
    <name>German</name>
</fields>



Did you see what happened ? We used `xslt(xml)`. `etree.XSLT()` transforms an xsl document into a function, which then takes one parameter (in this case an xml document). But can you figure out what this returns ? Let's ask Python :

In [19]:
print(type(transformed))
print(type(parsed_xml))

<class 'lxml.etree._XSLTResultTree'>
<class 'lxml.etree._Element'>


The result is not the same type as the elements we usually have, even though it shares most of their methods and attributes :

In [20]:
print(transformed.xpath("//name"))

[<Element name at 0x7f822833e248>, <Element name at 0x7f822833e388>, <Element name at 0x7f822833e288>, <Element name at 0x7f822833e3c8>]


And has something more : you can change its type to string !

In [21]:
string_result = str(transformed)
print(string_result)

<?xml version="1.0"?>
<fields>
    <name>History</name>
    <name>Classics</name>
    <name>French</name>
    <name>German</name>
</fields>



XSLT is more than just feeding in xml. You can pass parameters as well. In this case, your parameters are accessible as named arguments of the generated function. If your XSL has an `xsl:param` called `name`, the function returned by `etree.XSLT` accepts a `name` argument :

In [22]:
# Here is an xml containing an xsl: for each text node of an xml file in the xpath /humanities/field,
#     this will return a node <name> with the text inside
xslt_root = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="n" />
    <xsl:template match="/humanities">
        <fields>
            <xsl:attribute name="n">
                <xsl:value-of select="$n"/>
            </xsl:attribute>
            <xsl:apply-templates select="field"/>
        </fields>
    </xsl:template>
    <xsl:template match="/humanities/field">
        <name><xsl:value-of select="./text()" /></name>
    </xsl:template>
</xsl:stylesheet>""")
# We transform our document to an xsl 
xslt = etree.XSLT(xslt_root)

# We create some xml we need to change 
xml = """<humanities>
    <category>Humanities</category>
    <field>History</field>
    <field>Classics</field>
    <field>French</field>
    <field>German</field>
</humanities>"""
parsed_xml = etree.fromstring(xml)
# And now we process our xml :
transformed = xslt(parsed_xml, n="'Humanities'") # Note that for a string, we encapsulate it within single quotes
print(transformed)

# Be aware that you can use xpath as a value for the argument, though it can be rather complex sometimes
transformed = xslt(parsed_xml, n=etree.XPath("//category/text()"))
print(transformed)

<?xml version="1.0"?>
<fields n="Humanities"><name>History</name><name>Classics</name><name>French</name><name>German</name></fields>

<?xml version="1.0"?>
<fields n="Humanities"><name>History</name><name>Classics</name><name>French</name><name>German</name></fields>
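Quoting string parameters by hand becomes awkward when the value itself contains quotes. lxml provides `etree.XSLT.strparam()` for that case. The sketch below uses a deliberately tiny stylesheet and a made-up label to show the idea.

```python
from lxml import etree

stylesheet = etree.fromstring("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:param name="n" />
    <xsl:template match="/">
        <fields n="{$n}" />
    </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(stylesheet)

document = etree.fromstring("<humanities />")

# A value mixing both quote styles cannot simply be wrapped in quotes by hand.
label = 'The "Humanities" department\'s fields'
result = transform(document, n=etree.XSLT.strparam(label))
print(result.getroot().get("n"))
```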



### Using ElementTree

In [23]:
from xml.etree import ElementTree

with open('code/data/books.xml', 'rt') as f:
    tree = ElementTree.parse(f)

print(tree)

<xml.etree.ElementTree.ElementTree object at 0x7f8228340710>


### Traversing the Parsed Tree

To visit all of the children in order, use iter() to create a generator that iterates over the ElementTree instance.

In [24]:
from xml.etree import ElementTree
from itertools import islice
    
with open('code/data/books.xml', 'r') as f:
    tree = ElementTree.parse(f)

# only getting 5 elements from the generator
for node in islice(tree.iter(), 5):
    print (node.tag, node.attrib)
    print("-----")

catalog {}
-----
book {'id': 'bk101'}
-----
author {}
-----
title {}
-----
genre {}
-----


In [25]:
### To print only the groups of names and feed URLs for the podcasts, 
# leaving out of all of the data in the header section by iterating 
# over only the outline nodes and print the text and xmlUrl attributes.

from xml.etree import ElementTree

with open('code/data/podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

print(len( list(tree.iter('outline'))))
for node in tree.iter('outline'):
    name = node.attrib.get('text')
    url = node.attrib.get('xmlUrl')
    if name and url:
        print ('\t%s :: %s' % (name, url))
    else:
        print (name)

18
Science and Tech
	APM: Future Tense :: http://www.publicradio.org/columns/futuretense/podcast.xml
	Engines Of Our Ingenuity Podcast :: http://www.npr.org/rss/podcast.php?id=510030
	Science & the City :: http://www.nyas.org/Podcasts/Atom.axd
Books and Fiction
	Podiobooker :: http://feeds.feedburner.com/podiobooks
	The Drabblecast :: http://web.me.com/normsherman/Site/Podcast/rss.xml
	tor.com / category / tordotstories :: http://www.tor.com/rss/category/TorDotStories
Computers and Programming
	MacBreak Weekly :: http://leo.am/podcasts/mbw
	FLOSS Weekly :: http://leo.am/podcasts/floss
	Core Intuition :: http://www.coreint.org/podcast.xml
Python
	PyCon Podcast :: http://advocacy.python.org/podcasts/pycon.rss
	A Little Bit of Python :: http://advocacy.python.org/podcasts/littlebit.rss
	Django Dose Everything Feed :: http://djangodose.com/everything/feed/
Miscelaneous
	dhellmann's CastSampler Feed :: http://www.castsampler.com/cast/feed/rss/dhellmann/


### Finding Nodes in a Document

Walking the entire tree like this searching for relevant nodes can be error prone. The example above had to look at each outline node to determine if it was a group (nodes with only a text attribute) or podcast (with both text and xmlUrl). To produce a simple list of the podcast feed URLs, without names or groups, for a podcast downloader application, the logic could be simplified using findall() to look for nodes with more descriptive search characteristics.

As a first pass at converting the above example, we can construct an XPath argument to look for all outline nodes.

In [26]:
for node in tree.findall('.//outline'):
    url = node.attrib.get('xmlUrl')
    if url:
        print( url)
    else:
        print(node.attrib.get("text"))

Science and Tech
http://www.publicradio.org/columns/futuretense/podcast.xml
http://www.npr.org/rss/podcast.php?id=510030
http://www.nyas.org/Podcasts/Atom.axd
Books and Fiction
http://feeds.feedburner.com/podiobooks
http://web.me.com/normsherman/Site/Podcast/rss.xml
http://www.tor.com/rss/category/TorDotStories
Computers and Programming
http://leo.am/podcasts/mbw
http://leo.am/podcasts/floss
http://www.coreint.org/podcast.xml
Python
http://advocacy.python.org/podcasts/pycon.rss
http://advocacy.python.org/podcasts/littlebit.rss
http://djangodose.com/everything/feed/
Miscelaneous
http://www.castsampler.com/cast/feed/rss/dhellmann/


In [27]:
print(dir(tree))
print(tree.getroot)

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_root', '_setroot', 'find', 'findall', 'findtext', 'getiterator', 'getroot', 'iter', 'iterfind', 'parse', 'write', 'write_c14n']
<bound method ElementTree.getroot of <xml.etree.ElementTree.ElementTree object at 0x7f82280e3cc0>>


Another version can take advantage of the fact that the outline nodes are only nested two levels deep. Changing the search path to .//outline/outline means the loop will process only the second level of outline nodes.

In [28]:
for node in tree.findall('.//outline/outline'):
    url = node.attrib.get('xmlUrl')
    print (url)

http://www.publicradio.org/columns/futuretense/podcast.xml
http://www.npr.org/rss/podcast.php?id=510030
http://www.nyas.org/Podcasts/Atom.axd
http://feeds.feedburner.com/podiobooks
http://web.me.com/normsherman/Site/Podcast/rss.xml
http://www.tor.com/rss/category/TorDotStories
http://leo.am/podcasts/mbw
http://leo.am/podcasts/floss
http://www.coreint.org/podcast.xml
http://advocacy.python.org/podcasts/pycon.rss
http://advocacy.python.org/podcasts/littlebit.rss
http://djangodose.com/everything/feed/
http://www.castsampler.com/cast/feed/rss/dhellmann/
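`findall()` also understands attribute predicates, so nodes can be selected by the presence of an attribute instead of by their depth. The sketch below runs on a made-up, heavily shortened OPML-like document rather than on `podcasts.opml`.

```python
from xml.etree import ElementTree as ET

# Hypothetical, heavily shortened OPML-like document.
opml = ET.fromstring("""
<opml>
  <body>
    <outline text="Python">
      <outline text="PyCon Podcast" xmlUrl="http://example.com/pycon.rss" />
      <outline text="A Little Bit of Python" xmlUrl="http://example.com/littlebit.rss" />
    </outline>
  </body>
</opml>
""")

# [@xmlUrl] keeps only the outline elements that carry an xmlUrl attribute.
urls = [node.get("xmlUrl") for node in opml.findall(".//outline[@xmlUrl]")]
print(urls)
```

This form keeps working if the groups are nested more deeply, which the `.//outline/outline` path does not guarantee.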


### Parsed Node Attributes

The items returned by findall() and iter() are Element objects, each representing a node in the XML parse tree. Each Element has attributes for accessing data pulled out of the XML. This can be illustrated with a somewhat more contrived example input file, data.xml:

In [29]:
from xml.etree import ElementTree

with open('code/data/data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('./with_attributes')
print (node.tag)

for name, value in sorted(node.attrib.items()):
    print ('  %-4s = "%s"' % (name, value))

with_attributes
  foo  = "bar"
  name = "value"
  testtest = "thisis atest"


In [30]:
for path in [ './child', './child_with_tail' ]:
    node = tree.find(path)
    print(node.tag)
    print ('  child node text:', node.text)
    print ('  and tail text  :', node.tail)

child
  child node text: This child contains text.
  and tail text  : 
  
child_with_tail
  child node text: This child has regular text.
  and tail text  : And "tail" text.
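The difference between `text` and `tail` is easiest to see with mixed content. A minimal sketch on a made-up paragraph:

```python
from xml.etree import ElementTree as ET

paragraph = ET.fromstring("<p>Hello <b>bold</b> world</p>")
bold = paragraph.find("b")

print(repr(paragraph.text))  # text before the first child
print(repr(bold.text))       # text inside <b>
print(repr(bold.tail))       # text after </b>, up to the next tag

# itertext() walks text and tail values in document order.
full_text = "".join(paragraph.itertext())
print(full_text)
```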
  


### Parsing Strings

To work with smaller bits of XML text, especially string literals as might be embedded in the source of a program, use XML() and the string containing the XML to be parsed as the only argument.

In [31]:
from xml.etree.ElementTree import XML

parsed = XML('''
<root>
  <group>
    <child id="a">This is child "a".</child>
    <child id="b">This is child "b".</child>
  </group>
  <group>
    <child id="c">This is child "c".</child>
  </group>
</root>
''')

print ('parsed =', parsed)

for elem in parsed:
    print (elem.tag)
    if elem.text is not None and elem.text.strip():
        print ('  text: "%s"' % elem.text)
    if elem.tail is not None and elem.tail.strip():
        print ('  tail: "%s"' % elem.tail)
    for name, value in sorted(elem.attrib.items()):
        print('  %-4s = "%s"' % (name, value))

parsed = <Element 'root' at 0x7f8228369e08>
group
group


In [32]:
from xml.etree.ElementTree import Element, tostring

top = Element('top')

children = [
    Element('child', num=str(i))
    for i in range(3)
]

top.extend(children)

print(top)

<Element 'top' at 0x7f82280e4318>


## XML support in Python

Python has rich support for XML, with several libraries for parsing XML documents. Let's discuss them in detail. The following sub-modules ship with Python's standard library:

- xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
- xml.dom: the DOM API definition
- xml.dom.minidom: a minimal DOM implementation
- xml.dom.pulldom: support for building partial DOM trees
- xml.sax: SAX2 base classes and convenience functions
- xml.parsers.expat: the Expat parser binding

### xml.etree.ElementTree

We can import the module as `ET` using the following command:

In [33]:
import xml.etree.ElementTree as ET

`ElementTree` can parse XML either from a file or from a string. We first define the paths of the files used in this section:

In [34]:
old_books = 'code/data/old_books.xml'
nasa_data = 'code/data/nasa.xml'

#### Opening xml file

Opening an xml file is actually quite simple : you open it and you parse it. Who would have guessed ?

In [35]:
tree = ET.parse(old_books)
root = tree.getroot()
print(tree)

<xml.etree.ElementTree.ElementTree object at 0x7f8228350cc0>


Alternatively, parse a string using the following code:

In [36]:
xml_book = """<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
root = ET.fromstring(xml_book)

As an `Element`, `root` also has a tag, which the following code prints:

In [37]:
print(root.tag)

books


We can use `len` to find the number of direct child nodes. In our example we have two `book` nodes:

In [38]:
print(len(root))

2


#### Reading root as binary text

In [39]:
from pprint import pprint
pprint(ET.tostring(root))

(b'<books>\n    <book title="&#7770;g-Veda Khil&#257;ni">\n        <editor>Jo'
 b'st Gippert</editor>\n        <publication>Frankfurt: TITUS</publication>\n'
 b'        <year>2008</year>\n        <web_page>http://titus.uni-frankfurt.d'
 b'e/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>\n    </book>    \n  '
 b'  <book title="&#7770;gveda-Sa&#7747;hit&#257;">\n        <editor>Jost Gi'
 b'ppert</editor>\n        <publication>Frankfurt: TITUS</publication>\n     '
 b'   <year>2000</year>\n        <web_page>http://titus.uni-frankfurt.de/tex'
 b'te/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>\n    </book>\n</books>')


#### Reading element as formatted text

In [40]:
dec_root = ET.tostring(root).decode()
print(dec_root)
print(type(dec_root))

<books>
    <book title="&#7770;g-Veda Khil&#257;ni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="&#7770;gveda-Sa&#7747;hit&#257;">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
<class 'str'>
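Instead of decoding the bytes afterwards, `tostring()` can return a `str` directly when called with `encoding="unicode"`. In that case non-ASCII characters are kept as-is rather than written as numeric character references. A short sketch on a single made-up element:

```python
import xml.etree.ElementTree as ET

book = ET.fromstring('<book title="Ṛgveda-Saṃhitā" />')

as_bytes = ET.tostring(book)                     # default encoding is us-ascii
as_text = ET.tostring(book, encoding="unicode")  # returns str instead of bytes

print(as_bytes)
print(as_text)
```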


#### All attributes available to an element

We can use our time-tested and trusted `dir` to list all the attributes of an `element`:

In [41]:
print(dir(root))

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'attrib', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']


We can use a `for` loop to traverse the direct child nodes.

In [42]:
for ele in root:
    print(ele)

<Element 'book' at 0x7f82280e4ea8>
<Element 'book' at 0x7f822835d5e8>


As shown above, the `for` loop yields element nodes. Let's get more information from them by enhancing the existing code:

In [43]:
for ele in root:
    print(ele.tag, ele.attrib)

book {'title': 'Ṛg-Veda Khilāni'}
book {'title': 'Ṛgveda-Saṃhitā'}


We can also access the nodes by index.

In [44]:
print(root[1])

<Element 'book' at 0x7f822835d5e8>


Attributes are stored in a dictionary-like object, so an individual attribute can be accessed by name:

In [45]:
print(root[1].attrib['title'])

Ṛgveda-Saṃhitā


In [46]:
print(root[0][1].text)

Frankfurt: TITUS
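Chained indexes such as `root[0][1]` depend on the order of the children. When the child is better identified by its tag, `findtext()` is an alternative that also accepts a default value. The sketch below uses a made-up document in which the second book has no `year`.

```python
import xml.etree.ElementTree as ET

books = ET.fromstring("""
<books>
    <book title="First"><editor>A. Editor</editor><year>2008</year></book>
    <book title="Second"><editor>B. Editor</editor></book>
</books>
""")

# findtext() returns the default when the child element is missing.
years = {
    book.get("title"): book.findtext("year", default="unknown")
    for book in books.findall("book")
}
print(years)
```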


#### Reading Large XML file using `iterparse`

In [47]:
for event, elem in ET.iterparse(old_books):
    print(event, elem)

end <Element 'editor' at 0x7f82280f26d8>
end <Element 'publication' at 0x7f82280f2728>
end <Element 'year' at 0x7f82280f2778>
end <Element 'web_page' at 0x7f82280f27c8>
end <Element 'book' at 0x7f82280f2688>
end <Element 'editor' at 0x7f82280f2868>
end <Element 'publication' at 0x7f82280f28b8>
end <Element 'year' at 0x7f82280f2908>
end <Element 'web_page' at 0x7f82280f2958>
end <Element 'book' at 0x7f82280f2818>
end <Element 'books' at 0x7f82280f2638>


In [48]:
file_name = 'code/data/nasa.xml'
x = 0
for event, elem in ET.iterparse(file_name):
    print("Event:", event, "Elem:", elem)
    if x > 10:
        break
    else:
        x += 1

Event: end Elem: <Element 'title' at 0x7f82280f2cc8>
Event: end Elem: <Element 'altname' at 0x7f82280f2d18>
Event: end Elem: <Element 'altname' at 0x7f82280f2d68>
Event: end Elem: <Element 'altname' at 0x7f82280f2db8>
Event: end Elem: <Element 'title' at 0x7f82280f2ef8>
Event: end Elem: <Element 'initial' at 0x7f82280f2f98>
Event: end Elem: <Element 'initial' at 0x7f82280f7048>
Event: end Elem: <Element 'lastName' at 0x7f82280f7098>
Event: end Elem: <Element 'author' at 0x7f82280f2f48>
Event: end Elem: <Element 'initial' at 0x7f82280f7138>
Event: end Elem: <Element 'lastName' at 0x7f82280f7188>
Event: end Elem: <Element 'author' at 0x7f82280f70e8>
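The cells above only print the events. For a genuinely large file, a common pattern is to process each element when its `end` event arrives and then call `clear()` on it, so that the finished subtree does not stay in memory. The sketch below is an illustration only: it uses a small in-memory document as a stand-in for a large file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for a large file on disk.
data = io.BytesIO(
    b"<books>"
    b"<book><year>2000</year></book>"
    b"<book><year>2008</year></book>"
    b"<book><year>2008</year></book>"
    b"</books>"
)

recent = 0
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "book":
        if elem.findtext("year") == "2008":
            recent += 1
        # Drop the children and text of the finished <book> to free memory.
        elem.clear()

print(recent)
```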


### Finding interesting elements

Most of the time, we are interested in only part of an xml document, or in only one attribute of each element.
In this section, we discuss techniques that help in these situations.

Let's assume our manager (say **Mr. Pauly**) has asked us to read the `old_books.xml` file and find the names of the editors of all the books. We can handle the request using any of the methods below.

#### Using iter

This is the recommended method: `iter()` returns an iterator, so it does not build an intermediate list.

In [49]:
for editor in root.iter('editor'):
    print(editor)
    print(editor.text)

<Element 'editor' at 0x7f82280e4868>
Jost Gippert
<Element 'editor' at 0x7f822835d598>
Jost Gippert


As you can see, we were able to select the `editor` tags directly.

#### Using findall

With a plain tag name, `findall` finds only elements that are direct children of the current element.

In [50]:
for book in root.findall('book'):
    print(book)
    print(book.tag)

<Element 'book' at 0x7f82280e4ea8>
book
<Element 'book' at 0x7f822835d5e8>
book


In [51]:
print(root.findall('editor'))

[]


As you can see, `editor` is not a direct child of the current element `root`, so we get an empty list. But we can use a relative xpath to get the editors:

In [52]:
print(root.findall('.//editor'))
for editor in root.findall('.//editor'):
    print(editor.tag, " : ", editor.text)

[<Element 'editor' at 0x7f82280e4868>, <Element 'editor' at 0x7f822835d598>]
editor  :  Jost Gippert
editor  :  Jost Gippert


#### Using find

It finds the first direct child with a particular tag.

In [53]:
print(root.find('book'))

<Element 'book' at 0x7f82280e4ea8>


In [54]:
print(root.find('editor'))

None


As you can see, `editor` is not a direct child of the current element `root`, so we got `None`. We can, however, use a relative XPath expression to get the editors, as shown in the previous example.
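
Since `find` returns `None` when nothing matches, it is worth checking the result before using it. A minimal sketch, assuming a made-up inline document (`sample_root` and the tag names `magazine` and `publisher` are illustrative only):

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second book has no editor
sample_root = ET.fromstring(
    "<books><book><editor>Editor A</editor></book><book/></books>"
)

# find() returns None when no direct child matches
missing = sample_root.find('magazine')
print(missing is None)

# A path can reach below the direct children
first_editor = sample_root.find('book/editor')
print(first_editor.text)

# findtext() returns the text of the first match, or the default if there is none
editor_name = sample_root.findtext('book/editor', default='unknown')
publisher = sample_root.findtext('book/publisher', default='unknown')
print(editor_name, publisher)
```

Using `findtext` with a default avoids an `AttributeError` from calling `.text` on `None`.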

> <center>! ! **NOTE** ! !</center>
> ---
> For large documents, prefer `iter` over `findall`: `findall` is not an iterator and builds the complete list of matches before returning, so it can consume more memory :)
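
When path syntax is needed but building a full list is not, `Element.iterfind` is one option: it accepts the same paths as `findall` and yields matches one at a time. A small sketch on a made-up document (all names below are illustrative):

```python
import xml.etree.ElementTree as ET

sample_root = ET.fromstring(
    "<books>"
    "<book><editor>Editor A</editor></book>"
    "<book><editor>Editor B</editor></book>"
    "</books>"
)

# findall() builds the complete list of matches up front
all_editors = sample_root.findall('.//editor')
print(type(all_editors).__name__, len(all_editors))

# iterfind() takes the same path but yields matches lazily
lazy_editors = sample_root.iterfind('.//editor')
first = next(lazy_editors)
print(first.text)
```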

### Accessing Element Attributes

In [55]:
ele = root.find('book')
ele.get('title')

'Ṛg-Veda Khilāni'
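
A short sketch of the other common ways to read attributes, using a made-up `book` element (the `isbn` attribute is deliberately absent):

```python
import xml.etree.ElementTree as ET

# Hypothetical element used only for this illustration
book = ET.fromstring('<book title="First" year="2008"/>')

title = book.get('title')             # value of an existing attribute
no_isbn = book.get('isbn')            # None, because the attribute is missing
isbn_or_default = book.get('isbn', 'n/a')   # fallback value instead of None
all_attributes = dict(book.attrib)    # every attribute as a plain dict

print(title, no_isbn, isbn_or_default, all_attributes)
```

Attribute values are always strings, so `year` has to be converted with `int()` if a number is needed.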

### Building XML documents

We can build an XML document using the `Element` and `SubElement` functions of `ElementTree`.

In [56]:
a = ET.Element('a')
b = ET.SubElement(a, 'b')
b.attrib["B"] = "TEST"
c = ET.SubElement(a, 'c')
d = ET.SubElement(a, 'd')
e = ET.SubElement(d, 'e')
f = ET.SubElement(e, 'f')

ET.dump(a)
print(ET.tostring(a).decode())

<a><b B="TEST" /><c /><d><e><f /></e></d></a>
<a><b B="TEST" /><c /><d><e><f /></e></d></a>
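
The same functions can build a document from data. A minimal sketch, assuming a made-up list of `records`; it also shows setting attributes through keyword arguments and element text through `.text`:

```python
import xml.etree.ElementTree as ET

# Made-up records used only for this illustration
records = [
    {"title": "First", "editor": "Editor A"},
    {"title": "Second", "editor": "Editor B"},
]

books = ET.Element('books')
for record in records:
    # Keyword arguments to SubElement become attributes
    book = ET.SubElement(books, 'book', title=record["title"])
    editor = ET.SubElement(book, 'editor')
    editor.text = record["editor"]   # element text is set via .text

# encoding='unicode' makes tostring() return a str instead of bytes
built = ET.tostring(books, encoding='unicode')
print(built)
```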


### Parsing XML with Namespaces

```xml
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>
```

In [57]:
xml_text = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>"""

In [58]:
root = ET.fromstring(xml_text)
for actor in root.findall('{http://people.example.com}actor'):
    name = actor.find('{http://people.example.com}name')
    print(name.text)
    for char in actor.findall('{http://characters.example.com}character'):
        print('   |->', char.text)

John Cleese
   |-> Lancelot
   |-> Archie Leach
Eric Idle
   |-> Sir Robin
   |-> Gunther
   |-> Commander Clement
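
Writing the full namespace URI in every path gets tedious. The search methods also accept a dictionary that maps prefixes to URIs. A sketch on a shortened copy of the document above; the prefixes `real` and `role` are arbitrary choices for this example and need not match those in the XML:

```python
import xml.etree.ElementTree as ET

ns_text = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""

# Prefixes are local to the search; only the URIs have to match the document
ns = {
    'real': 'http://people.example.com',
    'role': 'http://characters.example.com',
}

ns_root = ET.fromstring(ns_text)
cast = {}
for actor in ns_root.findall('real:actor', ns):
    name = actor.find('real:name', ns).text
    cast[name] = [c.text for c in actor.findall('role:character', ns)]
print(cast)
```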


### XPath support

| Syntax            | Meaning                                                                                                                                                                                                                                   |
|-------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| tag               | Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.                                                      |
| *                 | Selects all child elements. For example, */egg selects all grandchildren named egg.                                                                                                                                                       |
| .                 | Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.                                                                                                                      |
| //                | Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.                                                                                                      |
| ..                | Selects the parent element. Returns None if the path attempts to reach the ancestors of the start element (the element find was called on).                                                                                               |
| [@attrib]         | Selects all elements that have the given attribute.                                                                                                                                                                                       |
| [@attrib='value'] | Selects all elements for which the given attribute has the given value. The value cannot contain quotes.                                                                                                                                  |
| [tag]             | Selects all elements that have a child named tag. Only immediate children are supported.                                                                                                                                                  |
| [tag='text']      | Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text.                                                                                                               |
| [position]        | Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1). |

### Modifying an XML File

The `ElementTree.write()` method saves the updated document to the specified file.

In [59]:
root = ET.fromstring(xml_book)
# Wrap the root in an ElementTree so that the tree we write is the one we modify
tree = ET.ElementTree(root)
ele = root.find('book')

ele.attrib['name'] = "John Cleese"
updated_xml = 'code/data/updated_old_book.xml'
tree.write(updated_xml)

In [60]:
with open(updated_xml) as f:
    print(f.read())

<books>
    <book title="&#7770;g-Veda Khil&#257;ni" name="John Cleese">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="&#7770;gveda-Sa&#7747;hit&#257;">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>	
</books>
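
A self-contained sketch of the modify-and-save round trip. The inline `source` document and the temporary output path are assumptions made for this example, so that it does not depend on files from this chapter:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration
source = '<books><book title="First"><editor>Editor A</editor></book></books>'

# Modify and save through the same ElementTree object
sample_tree = ET.ElementTree(ET.fromstring(source))
sample_tree.getroot().find('book').set('name', 'John Cleese')

# Temporary location, chosen only for this example
out_path = os.path.join(tempfile.mkdtemp(), 'updated_books.xml')
sample_tree.write(out_path, encoding='utf-8', xml_declaration=True)

# Parse the file back to confirm that the change was saved
reloaded = ET.parse(out_path).getroot()
print(reloaded.find('book').get('name'))
```

Passing `encoding='utf-8'` writes non-ASCII characters as they are instead of as numeric character references such as `&#7770;`.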


### XML vulnerabilities

The XML processing modules are not secure against maliciously constructed data. An attacker can abuse XML features to carry out denial of service attacks, access local files, generate network connections to other machines, or circumvent firewalls.

The following table, taken from the Python 3.6 documentation, gives an overview of the known attacks and whether the various modules are vulnerable to them.

| kind | sax | etree | minidom | pulldom | xmlrpc |
|---------------------------|------------|------------|------------|------------|------------|
| billion laughs | Vulnerable | Vulnerable | Vulnerable | Vulnerable | Vulnerable |
| quadratic blowup | Vulnerable | Vulnerable | Vulnerable | Vulnerable | Vulnerable |
| external entity expansion | Vulnerable | Safe (1) | Safe (2) | Vulnerable | Safe (3) |
| DTD retrieval | Vulnerable | Safe | Safe | Vulnerable | Safe |
| decompression bomb | Safe | Safe | Safe | Safe | Vulnerable |

Notes, as given in the Python 3.6 documentation:

- (1) `xml.etree.ElementTree` doesn't expand external entities and raises a `ParseError` when an entity occurs.
- (2) `xml.dom.minidom` doesn't expand external entities and simply returns the unexpanded entity verbatim.
- (3) `xmlrpclib` doesn't expand external entities and omits them.

## Common Errors and causes

In [63]:
xml_book = """
<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
try:
    root = ET.fromstring(xml_book)
except Exception as e:
    print(e)

XML or text declaration not at start of entity: line 2, column 0


This error occurs because the string starts with a blank line, so the XML declaration is not at the very start of the document. To avoid it, remove any whitespace before the declaration, as shown below.

In [64]:
xml_book = """<?xml version="1.0"?>
<books>
    <book title="Ṛg-Veda Khilāni">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2008</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/rvkh/rvkh.htm</web_page>
    </book>    
    <book title="Ṛgveda-Saṃhitā">
        <editor>Jost Gippert</editor>
        <publication>Frankfurt: TITUS</publication>
        <year>2000</year>
        <web_page>http://titus.uni-frankfurt.de/texte/etcs/ind/aind/ved/rv/mt/rv.htm</web_page>
    </book>
</books>
"""
root = ET.fromstring(xml_book)
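
When the string comes from elsewhere and cannot be edited by hand, stripping the leading whitespace before parsing is one way around the error. A minimal sketch with a made-up `raw` string:

```python
import xml.etree.ElementTree as ET

# Hypothetical input with a blank line before the XML declaration
raw = """
<?xml version="1.0"?>
<books><book title="First"/></books>
"""

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError as err:
    parsed_raw = False
    print("failed:", err)

# Dropping the leading whitespace puts the declaration back at the very start
cleaned_root = ET.fromstring(raw.lstrip())
print(cleaned_root.tag)
```

Catching `ET.ParseError` rather than a bare `Exception` keeps unrelated errors visible.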

## Reference & Inspirations

- https://en.wikipedia.org/wiki/XML
- https://docs.python.org/3.6/library/xml.html
- https://en.wikipedia.org/wiki/Billion_laughs
- https://www.ibm.com/developerworks/library/x-hiperfparse/index.html
- https://github.com/mikekestemont/ghent1516/blob/master/Chapter%208%20-%20Parsing%20XML.ipynb 