In [16]:
import lxml
from lxml import etree

## Creating Elements and SubElements

In [18]:
root = etree.Element("root")
root

<Element root at 0x32d71f8>

We access the tag with the tag attribute:

In [27]:
root.tag

'root'

Elements are organized in an XML tree structure. We can use the `append()` method to create a child of an element:

In [21]:
root.append(etree.Element('child1'))

However, this is so common that there is a shorter and more efficient way: the top-level `etree.SubElement(parent, 'child_name')` factory. It creates the child, appends it to the parent and returns it, so we can keep a reference to the new element in a variable:

In [22]:
child2 = etree.SubElement(root, 'child2')
child3 = etree.SubElement(root, 'child3')
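
As a small sketch of the difference between the two approaches (the tag names and the `kind` attribute are made up for illustration): `append()` needs an element that already exists, while `SubElement()` creates the child, attaches it and returns it in one call.

```python
from lxml import etree

root = etree.Element("root")

# Two steps: build the element first, then attach it.
child1 = etree.Element("child1")
root.append(child1)

# One step: SubElement creates the child, appends it to root and returns it.
# Extra keyword arguments become attributes ("kind" is a hypothetical example).
child2 = etree.SubElement(root, "child2", kind="example")

print([child.tag for child in root])   # ['child1', 'child2']
print(child2.getparent() is root)      # True
print(child2.get("kind"))              # example
```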

We can serialise the tree to verify its 'XMLness':

In [23]:
etree.tostring(root)

b'<root><child1/><child2/><child3/></root>'

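This helper function pretty-prints the XML for us. `pretty_print=True` is a keyword argument of `etree.tostring()` that adds line breaks and indentation to the output. The `**kwargs` are forwarded unchanged to `tostring()`, so the caller can pass further serialisation options (such as `xml_declaration` or `with_tail`) through the helper: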

In [24]:
def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end='')

In [29]:
prettyprint(root)

<root>
  <child1/>
  <child2/>
  <child3/>
</root>
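
To see what `pretty_print` and the forwarded `**kwargs` actually do, here is a minimal sketch that compares the compact and the indented serialisation, and then passes extra `tostring()` options through the helper. The one-child tree is just an example; the helper is the same `prettyprint` as above.

```python
from lxml import etree

def prettyprint(element, **kwargs):
    # Any extra keyword arguments are handed straight to etree.tostring().
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end='')

root = etree.Element("root")
etree.SubElement(root, "child1")

compact = etree.tostring(root)
pretty = etree.tostring(root, pretty_print=True)
print(compact)   # b'<root><child1/></root>'
print(pretty)    # b'<root>\n  <child1/>\n</root>\n'

# Options given to the helper end up in tostring(): here an XML declaration
# is added in front of the indented output.
prettyprint(root, xml_declaration=True, encoding="UTF-8")
```

Note that the helper calls `.decode()` on the result, so it assumes `tostring()` returns bytes. Passing `encoding='unicode'` would make `tostring()` return a `str` instead and break the helper.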


## Elements are lists

To make access to the subelements as straightforward as possible, the `Element` class mimics the behaviour of lists as closely as possible:

In [31]:
child = root[0]
child.tag

'child1'

In [33]:
len(root)

3

In [34]:
root.index(root[1]) # lxml.etree only

1

In [35]:
children = list(root)
for child in root:
    print(child.tag)

child1
child2
child3


In [36]:
root.insert(0, etree.Element("child0"))
start = root[:1]
end = root[-1:]

In [38]:
start[0].tag

'child0'

In [39]:
end[0].tag

'child3'

To **test if an element has children**, we can use `if len(element):`. If `len(element) == 0`, the test is false and the element has no children.
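
A minimal sketch of that test, wrapped in a hypothetical helper called `has_children` (the name is mine, not part of lxml). As far as I understand, the explicit `len()` check is preferred over a bare `if element:`, because the truth value of an element is ambiguous: an empty element still exists.

```python
from lxml import etree

def has_children(element):
    """Hypothetical helper: True if the element has at least one child."""
    return len(element) > 0

root = etree.Element("root")
print(has_children(root))   # False

etree.SubElement(root, "child")
print(has_children(root))   # True

# The new child exists (it is not None) but has no children of its own.
child = root[0]
print(child is not None, has_children(child))   # True False
```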

One difference from lists is that assigning an element to a different position *moves* the element instead of copying it, because an element in `lxml.etree` has exactly one parent.
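
The following sketch contrasts the two behaviours on a three-child tree (tag names chosen for illustration) and shows `copy.deepcopy()` as one way to get a real copy when that is what we want.

```python
import copy
from lxml import etree

# A plain list copies the reference: the last item now appears twice.
items = ["child0", "child1", "child2"]
items[0] = items[-1]
print(items)                           # ['child2', 'child1', 'child2']

root = etree.Element("root")
for name in ("child0", "child1", "child2"):
    etree.SubElement(root, name)

# In lxml the same assignment moves child2 to the front and replaces child0.
root[0] = root[-1]
moved_tags = [child.tag for child in root]
print(moved_tags)                      # ['child2', 'child1']

# To duplicate an element instead of moving it, copy it explicitly.
root.append(copy.deepcopy(root[0]))
copied_tags = [child.tag for child in root]
print(copied_tags)                     # ['child2', 'child1', 'child2']
```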

We can access an element's siblings with `.getprevious()` and `.getnext()`; both return `None` when there is no such sibling.
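
A short sketch of sibling navigation, using three made-up children `a`, `b` and `c`:

```python
from lxml import etree

root = etree.Element("root")
a = etree.SubElement(root, "a")
b = etree.SubElement(root, "b")
c = etree.SubElement(root, "c")

print(b.getprevious().tag)   # a
print(b.getnext().tag)       # c

# At either end of the child list there is no sibling, so we get None.
print(a.getprevious())       # None
print(c.getnext())           # None
```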

## Elements carry attributes as a dict

XML elements have attributes. These can be set directly in the `Element` factory:

In [43]:
root = etree.Element('root', interesting='totally')
etree.tostring(root)

b'<root interesting="totally"/>'

As attributes are unordered name-value pairs, a way of dealing with them is with the dictionary-like interface of Elements:

In [44]:
root.get('interesting')

'totally'

In [46]:
print(root.get('hello'))

None


In [48]:
root.set("hello", "Huhu")
root.get('hello')

'Huhu'

In [49]:
etree.tostring(root)

b'<root interesting="totally" hello="Huhu"/>'

In [50]:
sorted(root.keys())

['hello', 'interesting']

In [51]:
for name, value in root.items():
    print(f'{name} = {value}')

interesting = totally
hello = Huhu


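The `attrib` property of an element is a dictionary-like object that supports Python's dictionary indexing syntax. It is a live view of the element's attributes, so changes made through it show up on the element; use `dict(element.attrib)` to get an independent snapshot.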
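
A minimal sketch of working with `attrib` (the attribute names and values are examples only):

```python
from lxml import etree

root = etree.Element("root", interesting="totally")

attributes = root.attrib
print(attributes["interesting"])                 # totally
print(attributes.get("no-such-attribute"))       # None

# Writing through attrib changes the element itself.
attributes["hello"] = "Guten Tag"
print(root.get("hello"))                         # Guten Tag

# dict() gives an independent copy that no longer follows the element.
snapshot = dict(root.attrib)
root.set("hello", "Huhu")
print(snapshot["hello"], root.get("hello"))      # Guten Tag Huhu
```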

## Elements contain text

In [52]:
root = etree.Element('root')
root.text = 'Text'
root.text

'Text'

In [53]:
etree.tostring(root)

b'<root>Text</root>'

Sometimes text can surround an element, such as in the example:

`<html><body>Hello<br/>World</body></html>`

For these cases, we access the text **after** the element with the `tail` property:

In [54]:
html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'TEXT'

etree.tostring(html)

b'<html><body>TEXT</body></html>'

In [55]:
br = etree.SubElement(body, 'br')
etree.tostring(html)

b'<html><body>TEXT<br/></body></html>'

In [58]:
br.tail = 'TAIL'
etree.tostring(html)

b'<html><body>TEXT<br/>TAIL</body></html>'

In [61]:
etree.tostring(br)

b'<br/>TAIL'

In [62]:
etree.tostring(br, with_tail=False)

b'<br/>'

In [64]:
etree.tostring(html, method='text')

b'TEXTTAIL'
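
The result above is a bytes object. As a small sketch, two ways I know of to get the same text as a regular `str` are `encoding='unicode'` and joining the pieces yielded by `itertext()`; the tree is rebuilt here so the block runs on its own.

```python
from lxml import etree

html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'TEXT'
br = etree.SubElement(body, 'br')
br.tail = 'TAIL'

as_bytes = etree.tostring(html, method='text')
as_str = etree.tostring(html, method='text', encoding='unicode')
joined = ''.join(html.itertext())

print(as_bytes)   # b'TEXTTAIL'
print(as_str)     # TEXTTAIL
print(joined)     # TEXTTAIL
```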

## Using XPath to find text

Another way to extract the text from an XML document is the `xpath()` method, which returns the matching text nodes as a list of strings:

In [65]:
text = html.xpath('//text()')
text

['TEXT', 'TAIL']

The items in this list are "smart" strings: each has a `getparent()` method that returns the Element object the text belongs to. For tail text, that is the element the text follows, not the enclosing element:

In [67]:
text[0].getparent().tag

'body'

In [68]:
text[1].getparent().tag

'br'
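
Besides `getparent()`, these result strings also carry `is_text` and `is_tail` flags, which tell us whether a string came from an element's `text` or from its `tail`. A short sketch, rebuilding the same tree so it runs on its own:

```python
from lxml import etree

html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'TEXT'
br = etree.SubElement(body, 'br')
br.tail = 'TAIL'

texts = html.xpath('//text()')

for t in texts:
    # Each item behaves like a normal str but remembers where it came from.
    print(str(t), t.getparent().tag, t.is_text, t.is_tail)

# TEXT body True False
# TAIL br False True
```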