##### Diving In

This chapter is unlike any other chapter in the book in that it focuses all its attention on an XML processing library in Python. Specifically, it uses the standard library's ElementTree API and the compatible API supported by lxml.


Due to the nature of the chapter, this notebook is essentially a set of examples that use the API.

`XML` isn't about code; it is about data. One common use of `XML` is "syndication feeds" that list the latest articles on a blog, forum or other frequently updated website.

Most popular blogging software can produce a feed and update it whenever new articles, discussion threads or blog posts are published.

Let's take a look at the XML data we'll be working with.

In [4]:
# Read the whole feed in one call; the XML declaration says it is UTF-8.
with open('examples/feed.xml', encoding='utf-8') as xml_file:
    print(xml_file.read())

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <category scheme='http://diveintomark.org' term='docbook'/>
    <category scheme='http://diveintomark.org' term='html'/>
    <summary 

##### Crash course in XML

XML is hierarchically structured data. A document has a single root element. Each element may belong to a namespace, and an XML document may specify a default namespace.

Elements can be nested to any depth. Elements can have zero or more attributes. The attributes of an element are _unordered_ name-value pairs.

Elements can contain text in addition to children. Elements that contain no text and no children are empty.
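As a rough illustration of these rules, here is a minimal sketch using the standard library parser. The `library` document is made up for this example and is not part of the feed; it only shows a root element, nesting, an attribute, text content and an empty element.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document: nested elements, one attribute, one empty element.
doc = """<library>
  <book id="b1">
    <title>Dive Into Python 3</title>
    <notes/>
  </book>
</library>"""

root = ET.fromstring(doc)
book = root[0]
title, notes = book[0], book[1]

print(root.tag)                 # the single root element
print(book.attrib)              # attributes as name-value pairs
print(title.text)               # text content of an element
print(notes.text, len(notes))   # empty element: no text, no children
```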

XML documents specify an "encoding". Exactly how the parser interprets the document header to determine the encoding, when it has not yet been told what that encoding is, is a specification detail I've not looked up.
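Whatever the mechanism, the effect is easy to observe. Roughly, the declaration is restricted to ASCII characters, so a parser can read it after guessing the encoding family from the first few bytes. The sketch below assumes a made-up one-element document encoded as Latin-1 and shows the standard library parser honouring the declared encoding when it is handed raw bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, encoded as Latin-1 bytes rather than UTF-8.
text = '<?xml version="1.0" encoding="iso-8859-1"?><word>caf\u00e9</word>'
raw = text.encode('iso-8859-1')

# Given bytes, the parser reads the declaration and decodes accordingly.
word = ET.fromstring(raw)
print(word.text)
```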


Let's look at our document, limited to only one entry in the blog.

```xml
<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <category scheme='http://diveintomark.org' term='docbook'/>
    <category scheme='http://diveintomark.org' term='html'/>
    <summary type='html'>Putting an entire chapter on one page sounds
      bloated, but consider this &amp;mdash; my longest chapter so far
      would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
      On dialup.</summary>
  </entry>
</feed>
```

The header of the XML document specifies the encoding `utf-8`. The root node of the document, `feed`, specifies a default `namespace` for all elements in the document. This namespace is the `Atom` syndication format namespace: `xmlns='http://www.w3.org/2005/Atom'`.

The `lang` attribute of the `feed` element lives under the `xml` namespace and specifies the language `en` of the document.

The `feed` element has child elements named `title`, `subtitle`, `id`, `updated`, `link` and `entry`. The `entry` element itself contains a list of subelements, in this case `author`, `title`, `link`, `id`, `updated`, `published`, `category` and `summary`.

Elements may occur multiple times at the same level, as in the case of `category` above. The exact meaning and interpretation of the elements named here can be understood by following the Atom [specification](https://tools.ietf.org/html/rfc4287).

Let's get parsing.

##### Parsing XML

Python can parse XML documents in several ways. It has the traditional `DOM` and `SAX` parsers, but we focus on the ElementTree API.

In [6]:
import xml.etree.ElementTree as etree
tree = etree.parse('examples/feed.xml')
root = tree.getroot()
root

<Element '{http://www.w3.org/2005/Atom}feed' at 0x10c742e00>

In [9]:
root.attrib

{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}

The root element is the `feed` element in the namespace `http://www.w3.org/2005/Atom`. It has a single attribute `lang` in the namespace `http://www.w3.org/XML/1998/namespace` with value `en`.

`ElementTree` represents the names of `XML` elements and attributes as `{namespace}localname`.
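Since these qualified names are plain strings, they can be taken apart with ordinary string methods. The helper below, `split_tag`, is a hypothetical convenience written for this notebook and is not part of the ElementTree API.

```python
import xml.etree.ElementTree as ET

def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    return None, tag  # name without a namespace

root = ET.fromstring("<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>")

print(split_tag(root.tag))
for name in root.attrib:
    print(split_tag(name))
```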


##### Elements are lists

In [11]:
print(root.tag)
print(len(root))

{http://www.w3.org/2005/Atom}feed
8


In [12]:
for child in root:
    print(child)

<Element '{http://www.w3.org/2005/Atom}title' at 0x10c74e950>
<Element '{http://www.w3.org/2005/Atom}subtitle' at 0x10c5e3e00>
<Element '{http://www.w3.org/2005/Atom}id' at 0x10c6b66d0>
<Element '{http://www.w3.org/2005/Atom}updated' at 0x10c6ba860>
<Element '{http://www.w3.org/2005/Atom}link' at 0x10c6ba4a0>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10c6bad60>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10c7d8680>
<Element '{http://www.w3.org/2005/Atom}entry' at 0x10c7d8b80>


The list of children only includes _direct_ children.
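To make the difference between direct children and all descendants concrete, here is a small sketch on a cut-down feed (an inline string, assumed for this example, rather than `examples/feed.xml`). Iterating an element yields only its children, while `iter()` walks the whole subtree.

```python
import xml.etree.ElementTree as ET

doc = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <author><name>Mark</name></author>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""

root = ET.fromstring(doc)
direct = [child.tag for child in root]            # direct children only
everything = [elem.tag for elem in root.iter()]   # root plus all descendants

print(len(root), len(direct), len(everything))
print(root[1][0].tag)   # list-style indexing works at every level
```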

##### Attributes are dictionaries

In [13]:
print(root.attrib)

{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}


In [19]:
root[4]

<Element '{http://www.w3.org/2005/Atom}link' at 0x10c6ba4a0>

In [15]:
root[4].tag

'{http://www.w3.org/2005/Atom}link'

In [20]:
root[4].attrib

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}
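Beyond the raw `attrib` dictionary, elements also offer dictionary-like helper methods. A minimal sketch on a standalone `link` element (the `hreflang` attribute is added purely for illustration and is not in the feed):

```python
import xml.etree.ElementTree as ET

link = ET.fromstring(
    "<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>"
)

print(link.attrib['href'])           # plain dictionary lookup
print(link.get('hreflang', 'n/a'))   # default for a missing attribute
missing_before = link.get('hreflang')

link.set('hreflang', 'en')           # add or overwrite an attribute
print(sorted(link.keys()))
```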

##### Searching for nodes within an XML document

So far we've worked with this XML document "from the top down", starting with the root element, getting its child elements, and so on throughout the document. But many uses of XML require you to find specific elements. ElementTree can do that too.

In [26]:
import xml.etree.ElementTree as etree

tree = etree.parse('examples/feed.xml')
root = tree.getroot()
root.findall('{http://www.w3.org/2005/Atom}entry')

[<Element '{http://www.w3.org/2005/Atom}entry' at 0x1108185e0>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x110805bd0>,
 <Element '{http://www.w3.org/2005/Atom}entry' at 0x1108249f0>]

In [27]:
root.findall('{http://www.w3.org/2005/Atom}feed')

[]

In [28]:
root.findall('{http://www.w3.org/2005/Atom}author')

[]

The `findall` method in this case only searches the direct children of the node. `feed` is not found because the current node (`feed`) does not have any child nodes of type `feed`.

Similarly, nodes of type `author` are grandchildren of `feed`, not children.

To expand the search to include all descendants, prefix the `findall` search criteria with `.//`. An absolute path beginning with `//` is only accepted on the `document` (the tree) as a whole. Called on an element, including the `root` element, it is rejected, so use the relative form `.//` there.
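The following sketch compares a direct-child search with a descendant search on an element, using a cut-down inline feed assumed for this example. It also uses the optional `namespaces` mapping of `findall`, so the queries can use a short prefix (`atom` is a name chosen here) instead of the full `{namespace}` form.

```python
import xml.etree.ElementTree as ET

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}  # prefix chosen for this sketch
doc = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link href='http://diveintomark.org/'/>
  <entry>
    <author><name>Mark</name></author>
    <link href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
  </entry>
</feed>"""

root = ET.fromstring(doc)

children = root.findall('atom:link', ATOM)        # direct children only
descendants = root.findall('.//atom:link', ATOM)  # any depth below root
authors = root.findall('atom:author', ATOM)       # a grandchild, so not found

print(len(children), len(descendants), len(authors))

# An absolute path is not accepted when searching from an element.
try:
    root.findall('//atom:link', ATOM)
    absolute_ok = True
except SyntaxError:
    absolute_ok = False
print(absolute_ok)
```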

In [33]:
tree.findall('.//{http://www.w3.org/2005/Atom}link')

[<Element '{http://www.w3.org/2005/Atom}link' at 0x110818450>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x1108189f0>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x110805f90>,
 <Element '{http://www.w3.org/2005/Atom}link' at 0x110824540>]

In [38]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

for e in tree.findall('.//{http://www.w3.org/2005/Atom}link'):
    pp.pprint(e.attrib)

{'href': 'http://diveintomark.org/', 'rel': 'alternate', 'type': 'text/html'}
{   'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
    'rel': 'alternate',
    'type': 'text/html'}
{   'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
    'rel': 'alternate',
    'type': 'text/html'}
{   'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
    'rel': 'alternate',
    'type': 'text/html'}


##### Going further with lxml

`lxml` is an open source third-party library that builds on the popular libxml2 parser. It provides an `ElementTree`-compatible API, then extends it with full `XPath 1.0` support and a few other niceties.

In [39]:
from lxml import etree

tree = etree.parse('examples/feed.xml')
root = tree.getroot()
root.findall('{http://www.w3.org/2005/Atom}entry')

[<Element {http://www.w3.org/2005/Atom}entry at 0x110e990c0>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x110e991c0>,
 <Element {http://www.w3.org/2005/Atom}entry at 0x110e99200>]

In [41]:
tree.findall('//{http://www.w3.org/2005/Atom}link')

[<Element {http://www.w3.org/2005/Atom}link at 0x110e99e00>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110e9e040>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110e9e100>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110e9e080>]

In [42]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

for e in tree.findall('//{http://www.w3.org/2005/Atom}link'):
    pp.pprint(e.attrib)

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'}


As one can see, the API is the same and the results match, apart from how the elements and attributes are printed. For large XML documents, `lxml` can be significantly faster than the built-in ElementTree library.

But `lxml` is more than just faster: its `findall` method supports more complicated expressions, and its `xpath` method supports full XPath 1.0.

In [43]:
from lxml import etree
tree = etree.parse('examples/feed.xml')
tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')

[<Element {http://www.w3.org/2005/Atom}link at 0x110ea01c0>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110ea0240>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110ea0280>,
 <Element {http://www.w3.org/2005/Atom}link at 0x110ea0040>]

In [45]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

for e in tree.findall('//{http://www.w3.org/2005/Atom}*[@href]'):
    pp.pprint(e.attrib)

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'}
{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'}


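The expression here, run at the document level, searches for all elements in the Atom namespace that have an `href` attribute. Note that the attribute itself is not in a namespace: a default namespace applies to elements, not to attributes.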

In [57]:
NS='{http://www.w3.org/2005/Atom}'
nodes = tree.findall(f'''{NS}*[@href='http://diveintomark.org/']''')

In [58]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

for e in nodes:
    pp.pprint(e.attrib)

{'rel': 'alternate', 'type': 'text/html', 'href': 'http://diveintomark.org/'}


Above we are searching for all nodes that have an `href` attribute with the value `http://diveintomark.org/`. Because the expression has no leading `//`, only direct children of the root are examined.
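For comparison, the standard library's ElementTree understands a limited subset of XPath that also includes these attribute predicates. The sketch below uses the built-in module on a cut-down inline feed (assumed for this example), with the wildcard `*` in place of the namespaced `{...}*` form.

```python
import xml.etree.ElementTree as ET

doc = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <link rel='alternate' href='http://diveintomark.org/'/>
  <entry>
    <link rel='alternate' href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
  </entry>
</feed>"""
root = ET.fromstring(doc)

# Any descendant that carries an href attribute.
with_href = root.findall('.//*[@href]')
# Direct children only, with a specific href value.
home = root.findall("*[@href='http://diveintomark.org/']")

print([e.get('href') for e in with_href])
print(len(home))
```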


Below, the XPath expression `//atom:category[@term='accessibility']/..` first finds elements of type `category` whose `term` attribute has the value `accessibility`, then selects their parent elements (`/..`).

In [73]:
NSMAP={'atom': 'http://www.w3.org/2005/Atom'}
nodes = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=NSMAP)

In [74]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

for entry in nodes:
    print(entry.xpath('./atom:title/text()', namespaces=NSMAP))

['Dive into history, 2009 edition']
['Accessibility is a harsh mistress']
['A gentle introduction to video encoding, part 1: container formats']
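The parent step (`..`) is also part of the XPath subset that the standard library supports, so a similar query can be sketched without `lxml`. The two-entry feed below is invented for this example. `findtext` returns the text of the first matching child.

```python
import xml.etree.ElementTree as ET

NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}
doc = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <title>First post</title>
    <category term='html'/>
  </entry>
  <entry>
    <title>Second post</title>
    <category term='accessibility'/>
  </entry>
</feed>"""
root = ET.fromstring(doc)

# Step down to the matching category elements, then back up to their parents.
parents = root.findall(".//atom:category[@term='accessibility']/..", NSMAP)
titles = [entry.findtext('atom:title', namespaces=NSMAP) for entry in parents]
print(titles)
```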


##### Generating XML

Enough with the parsing of XML; let's create some documents using the API.

In [83]:
import xml.etree.ElementTree as etree

new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',
                        attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})

print(str(etree.tostring(new_feed), encoding='utf-8'))

<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en" />
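The generated `ns0` prefix is valid XML, but not very readable. One possible way to get a default namespace from the standard library is `register_namespace` with an empty prefix, sketched below. Note that this registration is process-wide, so it affects every later serialization in the same session.

```python
import xml.etree.ElementTree as ET

ATOM = 'http://www.w3.org/2005/Atom'
XML_NS = 'http://www.w3.org/XML/1998/namespace'

# Map the Atom namespace to the empty prefix (global to the module).
ET.register_namespace('', ATOM)

feed = ET.Element(f'{{{ATOM}}}feed', attrib={f'{{{XML_NS}}}lang': 'en'})
title = ET.SubElement(feed, f'{{{ATOM}}}title', attrib={'type': 'html'})
title.text = 'blah de blah!!'

serialized = ET.tostring(feed, encoding='unicode')
print(serialized)

# Round trip: the parsed result keeps the same qualified names.
reparsed = ET.fromstring(serialized)
```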


In [97]:
from lxml import etree

NSMAP={None: 'http://www.w3.org/2005/Atom'}

new_feed = etree.Element('feed', nsmap=NSMAP)
print(etree.tounicode(new_feed))
new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')
print(etree.tounicode(new_feed))
title = etree.SubElement(new_feed, 'title', attrib={'type':'html'})
title.text="blah de blah!!"
print(etree.tounicode(new_feed))


<feed xmlns="http://www.w3.org/2005/Atom"/>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">blah de blah!!</title></feed>
