# XPath

In [1]:
import lxml.etree
import lxml.html
import requests

## Demonstrating xpath on HTML

Let's check http://www.ianhopkinson.org.uk

In [5]:
r = requests.get("http://www.ianhopkinson.org.uk")


In [None]:
from IPython.display import display, HTML
display(HTML(r.text))

In [None]:
root = lxml.html.fromstring(r.content)

## Specifying a complete path with / as separator

`title = root.xpath('/html/body/div/div/div[2]/h1')`

is the full path to my blog title. Notice how we request the second `div` at the third level of nested `div` elements using `div[2]`: XPath positions are one-based, not zero-based.

In [6]:
title = root.xpath('/html/body/div/div/div[2]/h1') 

In [7]:
title

[<Element h1 at 0x7f70ccb87d70>]

In [9]:
title[0].text.strip()

'SomeBeans'

In [12]:
print("My blog title is: '{}'".format(title[0].text.strip()))

My blog title is: 'SomeBeans'
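
The one-based indexing can be checked offline with a small, made-up page. This is only a sketch: `sample_html` is an invented stand-in for the real blog markup, not a copy of it.

```python
import lxml.html

# A small, invented page standing in for the real blog markup.
sample_html = """<html><body>
  <div>
    <div>
      <div><p>sidebar</p></div>
      <div><h1> Example title </h1></div>
    </div>
  </div>
</body></html>"""

sample_root = lxml.html.fromstring(sample_html)

# div[1] is the first div child and div[2] the second: XPath counts from one.
first = sample_root.xpath('/html/body/div/div/div[1]/p')
second = sample_root.xpath('/html/body/div/div/div[2]/h1')

print(first[0].text)           # sidebar
print(second[0].text.strip())  # Example title
```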


## Specifying a path with wildcards using //

This expression also finds the title but the preamble of /html/body/div/div is absorbed by the // wildcard match:

`title = root.xpath('//div[2]/h1')`

To obtain the text of the title in Python, rather than an element object, we would do:

`title_text = title[0].text.strip()` or maybe `title_text = title[0].text_content().strip()`

`text_content()` also picks up any text inside child elements, including the tail text that follows each child. I use `strip()` here to remove leading and trailing whitespace.

In [13]:
title = root.xpath('//div[2]/h1') 
title[0].text_content().strip()


'SomeBeans'
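
The difference between `.text` and `text_content()` only shows up when the element has children. A minimal sketch, using an invented heading with an `<em>` child:

```python
import lxml.html

# An invented heading with a child element and some tail text after it.
heading = lxml.html.fromstring('<h1>Some<em>Beans</em> blog</h1>')

# .text stops at the first child element.
print(repr(heading.text))            # 'Some'

# text_content() joins the element's text, its children's text and their tails.
print(repr(heading.text_content()))  # 'SomeBeans blog'
```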

## Selecting attribute values

We’ve seen that //element selects all of the elements of type “element”. We select attribute values like this:

`ids = root.xpath('//li/@id')`

which selects the `id` attribute of every list element (`li`) on my blog. The result is a list of attribute values (strings), not a list of elements.

In [14]:
ids = root.xpath('//li/@id')

In [16]:
print("There are {} of them, the first one is {}".format(len(ids), ids[0]))

There are 9 of them, the first one is menu-item-1033
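
Here is the same kind of query on an invented menu, as a sketch that runs without fetching the blog. It also shows `//li[@id]`, which returns the elements that have an `id` rather than the attribute values.

```python
import lxml.html

# An invented menu; the third item deliberately has no id attribute.
menu = lxml.html.fromstring(
    '<ul>'
    '<li id="menu-item-1"><a href="/a">A</a></li>'
    '<li id="menu-item-2"><a href="/b">B</a></li>'
    '<li><a href="/c">C</a></li>'
    '</ul>'
)

item_ids = menu.xpath('//li/@id')    # attribute values, as strings
links = menu.xpath('//li/a/@href')   # works for any attribute
with_id = menu.xpath('//li[@id]')    # elements that have an id attribute

print(item_ids)      # ['menu-item-1', 'menu-item-2']
print(links)         # ['/a', '/b', '/c']
print(len(with_id))  # 2
```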


## Specifying an element by attribute

We can select elements which have particular attribute values:

`tagcloud = root.xpath('//*[@class="tagcloud"]')`

This selects the tag cloud on my blog by matching any element whose class attribute is exactly **tagcloud**.

In [18]:
tagcloud = root.xpath('//*[@class="tagcloud"]') 
tagcloud

[<Element div at 0x7f70ccb66a70>]
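
One thing to watch: `@class="tagcloud"` compares the whole attribute string, so an element with several classes is missed. The sketch below uses invented markup to compare the exact match, a loose `contains()` match, and the common XPath 1.0 idiom for matching a single class token.

```python
import lxml.html

# Invented markup with three variations on the class attribute.
page = lxml.html.fromstring(
    '<div>'
    '<div class="tagcloud">exact</div>'
    '<div class="widget tagcloud">multi</div>'
    '<div class="tagcloud-footer">other</div>'
    '</div>'
)

# Exact string comparison: only class="tagcloud" matches.
exact = page.xpath('//*[@class="tagcloud"]')

# Substring match: also catches "tagcloud-footer", which may be unwanted.
loose = page.xpath('//*[contains(@class, "tagcloud")]')

# Token match: pad with spaces so only the whole word "tagcloud" matches.
token = page.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " tagcloud ")]'
)

print([e.text for e in exact])  # ['exact']
print([e.text for e in loose])  # ['exact', 'multi', 'other']
print([e.text for e in token])  # ['exact', 'multi']
```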

## Select an element containing some specified text

We can do something similar with the text content of an element:

`title = root.xpath("//h1[contains(., 'SomeBeans')]")`

In [22]:
title = root.xpath("//h1[contains(., 'SomeBeans')]")
title[0].text.strip()


'SomeBeans'
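
The `.` in `contains(., ...)` is the string value of the element, which includes text in its descendants. A sketch with invented markup, where the heading text sits inside a link, shows how this differs from testing `text()`:

```python
import lxml.html

# Invented markup: the first heading's text lives inside an <a> child.
doc = lxml.html.fromstring(
    '<div><h1><a href="/">SomeBeans</a></h1><h1>Another heading</h1></div>'
)

# "." covers descendant text, so the first h1 matches.
by_string_value = doc.xpath("//h1[contains(., 'SomeBeans')]")

# text() only sees the h1's own text nodes, and this h1 has none.
by_own_text = doc.xpath("//h1[contains(text(), 'SomeBeans')]")

print(len(by_string_value))               # 1
print(len(by_own_text))                   # 0
print(by_string_value[0].text_content())  # SomeBeans
```

In a case like this `.text` on the matched element is `None`, so `text_content()` is the safer way to read the heading.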

## Select via a parent or sibling relationship

Sometimes we want to select elements by their relationship to another element, for example:

`subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')`

this selects the h1 title of my blog (SomeBeans) then navigates to the parent with .. and selects the sibling h2 element (the subtitle “the makings of a small casserole”).


In [20]:
subtitle = root.xpath('//h1[contains(@class,"header_title")]/../h2')
subtitle[0].text.strip()


'…the makings of a small casserole'

Or we can use **following-sibling** to the same effect:

In [23]:
subtitle = root.xpath('//h1[contains(@class,"header_title")]/following-sibling::h2')
subtitle[0].text.strip()


'…the makings of a small casserole'
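
The two forms give the same answer on my blog, but they are not identical in general: `../h2` returns every `h2` child of the parent, while `following-sibling::h2` returns only those after the `h1`. A sketch with an invented header block:

```python
import lxml.html

# Invented header with an h2 on either side of the h1.
header = lxml.html.fromstring(
    '<div>'
    '<h2>Before</h2>'
    '<h1 class="header_title">Title</h1>'
    '<h2>After</h2>'
    '</div>'
)

# Up to the parent, then down to all of its h2 children.
via_parent = header.xpath('//h1[contains(@class,"header_title")]/../h2')

# Only the h2 siblings that come after the h1 in document order.
via_sibling = header.xpath(
    '//h1[contains(@class,"header_title")]/following-sibling::h2'
)

print([e.text for e in via_parent])   # ['Before', 'After']
print([e.text for e in via_sibling])  # ['After']
```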

# XML Namespaces


When dealing with XML, we need to worry about namespaces. In principle, the elements of an XML document are described in a schema which can be looked up and which has a universally unique name. In practice, the use of namespaces in XML documents can lead to much banging head against wall! This is largely because trivial examples of XML wrangling don’t use namespaces, except as a “special” example.

Here is a fragment of XML defining two namespaces:

`<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">`

`xmlns:foo` defines a namespace whose short form is “foo”. We select elements in this namespace by passing a `namespaces` parameter to the xpath query:

`records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})`

The “catch” here is we also define a default namespace xmlns = “http://www.bah.com”, which means that elements which don’t have a prefix cannot be selected unless we define the namespace in our xpath:

`records = root.xpath('//bah:Title', namespaces = {"bah": "http://www.bah.com"})`

Worse than that, we need to include a namespace prefix of our own choosing in the query, even though it doesn’t appear in the file!

## Demonstrating xpath on XML

In [25]:

xml_sample = """<?xml version="1.0" encoding="UTF-8"?>
<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">
<foo:Recordset setCount="2">
<foo:Record setEntry="0">
<foo:Title>First title</foo:Title>
</foo:Record>
<foo:Record setEntry="1">
<foo:Title>Second title</foo:Title>
</foo:Record>
<Record setEntry="2">
<Title>Third title</Title>
</Record>
<Record setEntry="3">
<Title>Fourth title</Title>
</Record>
</foo:Recordset>
</foo:Results>
""".encode("utf-8")

## Processing XML is pretty similar except for namespaces

In [27]:
namespace = "http://www.foo.com"
namespace_c = "{" + namespace + "}"
NSMAP = {"foo": namespace}
root = lxml.etree.fromstring(xml_sample)

### These are the elements defined by the XML string at the top of this program

In [32]:
record_count = root.xpath('//@setCount')[0]

print("Attributes are easy, this is the @setCount: {}".format(record_count))

Attributes are easy, this is the @setCount: 2


In [29]:
for element in root.iter():
    print(element.tag)

{http://www.foo.com}Results
{http://www.foo.com}Recordset
{http://www.foo.com}Record
{http://www.foo.com}Title
{http://www.foo.com}Record
{http://www.foo.com}Title
{http://www.bah.com}Record
{http://www.bah.com}Title
{http://www.bah.com}Record
{http://www.bah.com}Title
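
The `{uri}name` form printed above can be used directly with `iter()` and `findall()`, without a prefix mapping. A minimal sketch on a cut-down, invented version of the sample document; the `FOO` and `BAH` constants are names introduced here for illustration:

```python
import lxml.etree

# A cut-down, invented version of the sample document.
doc = lxml.etree.fromstring(
    '<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">'
    '<foo:Title>First title</foo:Title>'
    '<Title>Third title</Title>'
    '</foo:Results>'
)

FOO = "{http://www.foo.com}"
BAH = "{http://www.bah.com}"

# iter() and findall() accept tags written as {namespace-uri}localname.
foo_titles = [el.text for el in doc.iter(FOO + "Title")]
bah_titles = [el.text for el in doc.findall(".//" + BAH + "Title")]

# QName splits a qualified tag back into namespace and local name.
names = [
    (lxml.etree.QName(el).namespace, lxml.etree.QName(el).localname)
    for el in doc
]

print(foo_titles)  # ['First title']
print(bah_titles)  # ['Third title']
print(names)
```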


### We can select elements by defining a namespace in our queries

In [30]:
records = root.xpath('//foo:Title', namespaces = {"foo": "http://www.foo.com"})
for record in records:
    print(record.text)


First title
Second title


### Without defining the default namespace, we get nothing

In [31]:
records = root.xpath('//Title')    
for record in records:
    print(record.text)

### With the default namespace, we get something

In [33]:
records = root.xpath('//bah:Title', namespaces = {"bah": "http://www.bah.com"})    
for record in records:
    print("Element name: {}, element text '{}'".format(record.tag, record.text))

Element name: {http://www.bah.com}Title, element text 'Third title'
Element name: {http://www.bah.com}Title, element text 'Fourth title'
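
Rather than typing the namespace URIs by hand, the mapping can be built from the document itself. The sketch below assumes a cut-down, invented version of the sample. lxml exposes the declared prefixes as `nsmap`, with the default namespace under the key `None`, which XPath cannot use, so we substitute a prefix of our own ("bah" is an arbitrary choice). It also shows `local-name()`, which ignores namespaces altogether.

```python
import lxml.etree

# A cut-down, invented version of the sample document.
doc = lxml.etree.fromstring(b"""<?xml version="1.0" encoding="UTF-8"?>
<foo:Results xmlns:foo="http://www.foo.com" xmlns="http://www.bah.com">
  <foo:Title>First title</foo:Title>
  <Title>Third title</Title>
</foo:Results>
""")

# Declared prefixes; the default namespace appears under the key None.
print(doc.nsmap)

# XPath needs a real prefix, so rename the None entry to "bah" (our choice).
ns = {prefix or "bah": uri for prefix, uri in doc.nsmap.items()}

default_titles = doc.xpath('//bah:Title', namespaces=ns)
print([t.text for t in default_titles])  # ['Third title']

# local-name() matches on the bare element name, whatever the namespace.
all_titles = doc.xpath('//*[local-name() = "Title"]')
print([t.text for t in all_titles])      # ['First title', 'Third title']
```

Matching on `local-name()` is convenient for exploring an unfamiliar file, but it throws away the distinction the namespaces were there to make, so the prefixed form is usually the safer choice.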
