This is a copy of the [XPath tutorial from W3Schools](http://www.w3schools.com/xpath/xpath_syntax.asp)

### XPath Terminology

#### Nodes


XML documents are treated as trees of nodes. The topmost element of the tree is called the root element. For example, the `books.xml` document created later in this tutorial has `<bookstore>` as its root element.

There are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. In our class, we will not deal with namespace, processing-instruction, or comment nodes, but it is useful to be aware of them.

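Examples of nodes in the bookstore example document (`books.xml`): `<bookstore>` is the root element node, `<author>J K. Rowling</author>` is an element node, and `lang="en"` is an attribute node.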
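As a rough illustration of how these node kinds surface in Python, here is a minimal sketch using lxml. The one-book document is a trimmed version of the bookstore example, chosen only for brevity. Note that lxml does not expose attribute and text nodes as separate objects; it makes them available through `get()` and `.text` on the element.

```python
from lxml import etree

# Assumption: a one-book document trimmed from the bookstore example.
xml = """
<bookstore>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
""".strip()

root = etree.XML(xml)

# Element node: the root element of the tree.
root_tag = root.tag

# Element node: the first <author> element below the root.
author = root.find(".//author")

# Attribute node: read through get() on the owning element.
lang = root.find(".//title").get("lang")

# Text node: read through the .text attribute of the element.
author_text = author.text

print(root_tag, author.tag, lang, author_text)
```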



### Relationship of Nodes

#### Parent

Each element and attribute has one parent.

In the bookstore example document (`books.xml`), each book element is the parent of its title, author, year, and price elements.

#### Children

Element nodes may have zero, one or more children.

In the bookstore example document (`books.xml`), the title, author, year, and price elements are all children of a book element.

#### Siblings

Siblings are nodes that have the same parent.

In the bookstore example document (`books.xml`), the title, author, year, and price elements of a book are all siblings.

#### Ancestors

Ancestors are a node's parent, parent's parent, and so on.

In the bookstore example document (`books.xml`), the ancestors of a title element are its book element and the bookstore element.

#### Descendants

Descendants are a node's children, children's children, and so on.

In the bookstore example document (`books.xml`), the descendants of the bookstore element are the book, title, author, year, and price elements.
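The relationships described above can be explored from Python. The following is a minimal sketch using lxml's element API (`getparent`, `itersiblings`, `iterancestors`, `iterdescendants`), assuming a one-book document trimmed from the bookstore example.

```python
from lxml import etree

# Assumption: a one-book document trimmed from the bookstore example.
xml = """
<bookstore>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
""".strip()

root = etree.XML(xml)
book = root.find("book")
title = book.find("title")

# Parent: the element that directly contains this one.
parent_tag = title.getparent().tag

# Children: iterating over an element yields its child elements in document order.
children = [child.tag for child in book]

# Siblings: itersiblings() yields the following siblings by default.
siblings = [s.tag for s in title.itersiblings()]

# Ancestors: the parent, the parent's parent, and so on up to the root element.
ancestors = [a.tag for a in title.iterancestors()]

# Descendants: children, children's children, and so on.
descendants = [d.tag for d in root.iterdescendants()]

print(parent_tag, children, siblings, ancestors, descendants, sep="\n")
```

Because `title` is the first child of `book`, its following siblings are all of its siblings here; pass `preceding=True` to `itersiblings` to walk in the other direction.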



## XPath Syntax

XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.

### The XML Example Document
The examples below use the bookstore document (`books.xml`), which is created in the Python section of this tutorial.

### Selecting Nodes

The most useful path expressions are listed below:

| Expression | Description                                                                                           |
|------------|-------------------------------------------------------------------------------------------------------|
| nodename   | Selects all nodes with the name "nodename"                                                            |
| /          | Selects from the root node                                                                            |
| //         | Selects nodes in the document from the current node that match the selection no matter where they are |
| .          | Selects the current node                                                                              |
| ..         | Selects the parent of the current node                                                                |
| @          | Selects attributes                                                                                    |
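To make the context-relative expressions in the table (`nodename`, `.`, `..`, and `@`) more concrete, here is a minimal sketch that evaluates them with lxml's `xpath` method, using the first book element as the current node. The two-book document is a trimmed version of the bookstore example.

```python
from lxml import etree

# Assumption: a two-book document trimmed from the bookstore example.
xml = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

root = etree.XML(xml)
book = root.find("book")  # the first book is the current (context) node

# nodename: child elements of the current node with that name.
titles = [t.text for t in book.xpath("title")]

# . selects the current node itself.
current_tag = book.xpath(".")[0].tag

# .. selects the parent of the current node.
parent_tag = book.xpath("..")[0].tag

# @ selects attributes; lxml returns their values as strings.
category = book.xpath("@category")

print(titles, current_tag, parent_tag, category)
```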

In the table below we have listed some path expressions and the result of the expressions:

| Path Expression | Result |
|-----------------|------------------------------------------------------------------------------------------------------------------------------------|
| bookstore | Selects all nodes with the name "bookstore" |
| /bookstore | Selects the root element bookstore. Note: if the path starts with a slash ( / ), it always represents an absolute path to an element. |
| bookstore/book | Selects all book elements that are children of bookstore |
| //book | Selects all book elements no matter where they are in the document |
| bookstore//book | Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element |
| //@lang | Selects all attributes that are named lang |
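The path expressions in this table can be tried out with lxml. The sketch below assumes a two-book document trimmed from the bookstore example. One caveat worth knowing: `etree.XML` returns the root element itself, so a relative path such as `bookstore/book` looks for a `bookstore` child of `<bookstore>` and finds nothing; the sketch therefore uses the absolute forms.

```python
from lxml import etree

# Assumption: a two-book document trimmed from the bookstore example.
xml = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

root = etree.XML(xml)

# /bookstore: the root element, selected by an absolute path.
root_elems = root.xpath("/bookstore")

# /bookstore/book: book elements that are children of bookstore.
child_books = root.xpath("/bookstore/book")

# //book: book elements anywhere in the document.
all_books = root.xpath("//book")

# /bookstore//book: book elements anywhere below bookstore.
descendant_books = root.xpath("/bookstore//book")

# //@lang: every attribute named lang, returned by lxml as strings.
langs = root.xpath("//@lang")

# Relative path evaluated from the root element: no bookstore child exists.
relative = root.xpath("bookstore/book")

print(len(child_books), len(all_books), len(descendant_books), langs, relative)
```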

### Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets.

In the table below we have listed some path expressions with predicates and the result of the expressions:



| Path Expression                    	| Result                                                                                                                                 	|
|------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------	|
| /bookstore/book[1]                 	| Selects the first book element that is the child of the bookstore element.                                                             	|
| /bookstore/book[last()]            	| Selects the last book element that is the child of the bookstore element                                                               	|
| /bookstore/book[last()-1]          	| Selects the last but one book element that is the child of the bookstore element                                                       	|
| /bookstore/book[position()<3]      	| Selects the first two book elements that are children of the bookstore element                                                         	|
| //title[@lang]                     	| Selects all the title elements that have an attribute named lang                                                                       	|
| //title[@lang='en']                	| Selects all the title elements that have an attribute named lang with a value of 'en'                                                  	|
| /bookstore/book[price>35.00]       	| Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00                       	|
| /bookstore/book[price>35.00]/title 	| Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 	|
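The predicates in the table can be checked with a short lxml script. This is a minimal sketch, assuming a three-book document trimmed from the bookstore example; the helper `titles_for` is a hypothetical convenience function that appends `/title` to a book path and returns the title texts.

```python
from lxml import etree

# Assumption: a three-book document trimmed from the bookstore example.
xml = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

root = etree.XML(xml)


def titles_for(book_path):
    """Hypothetical helper: title texts of the books selected by book_path."""
    return [t.text for t in root.xpath(book_path + "/title")]


first = titles_for("/bookstore/book[1]")                # XPath positions start at 1
last = titles_for("/bookstore/book[last()]")
second_to_last = titles_for("/bookstore/book[last()-1]")
first_two = titles_for("/bookstore/book[position()<3]")
expensive = titles_for("/bookstore/book[price>35.00]")  # numeric comparison on <price>

# Attribute predicates: titles that carry lang="en".
english_titles = [t.text for t in root.xpath("//title[@lang='en']")]

print(first, last, second_to_last, first_two, expensive, english_titles, sep="\n")
```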

## XPath Examples in Python

Let's first create our example file:

In [None]:
%%file books.xml
<?xml version="1.0"?>

<bookstore>

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

We will now read the file in Python and create an XML document using the lxml package.

In [None]:
from lxml import etree

# Read the file and parse its contents into an element tree
with open("books.xml", "r") as f:
    filecontent = f.read()
doc = etree.XML(filecontent)

Now let's get all the book nodes

In [None]:
result = doc.findall(".//book")

In [None]:
len(result)

And all the author nodes
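The cell for this step is not shown here. A minimal sketch, assuming the same `findall` pattern as for the book nodes and a two-book document trimmed from `books.xml` so that the snippet runs on its own:

```python
from lxml import etree

# Assumption: a two-book document trimmed from books.xml.
xml = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
  </book>
</bookstore>
""".strip()

doc = etree.XML(xml)

# ".//author" matches author elements at any depth below the root element.
result = doc.findall(".//author")
print(len(result))
```

With the full `books.xml` file, the same call would also return every author of the multi-author book.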

We can use the `get` method to read the value of an attribute:

In [None]:
result = doc.findall(".//book")
categories = [r.get("category") for r in result]
categories

And we can use the `.text` attribute to get the text of a node:

In [None]:
result = doc.findall(".//book/title")
titles = [r.text for r in result]
titles

In [None]:
result = doc.findall(".//book/author")
authors = [r.text for r in result]
authors

Here is a more advanced example, where we use two nested list comprehensions to list the authors of each book:

In [None]:
books = doc.findall(".//book")
authors = [[author.text for author in book.findall(".//author")] for book in books]
authors

If we only care about a single element (e.g., only the first author), we can use the `.find` method instead of `.findall`:

In [None]:
result = doc.findall(".//book")
authors = [r.find(".//author").text for r in result]
authors
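One thing to keep in mind is that `.find` returns `None` when nothing matches, so calling `.text` on the result fails for a book without an author. The sketch below shows one way to guard against that with lxml's `findtext` and its `default` argument. The second, author-less book is hypothetical and is not part of `books.xml`.

```python
from lxml import etree

# Assumption: the second book (no <author>) is hypothetical, added only
# to show what happens when find() has nothing to return.
xml = """
<bookstore>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
  </book>
  <book category="WEB">
    <title lang="en">Untitled Draft</title>
  </book>
</bookstore>
""".strip()

doc = etree.XML(xml)
books = doc.findall(".//book")

# find() returns None for the book that has no author element.
missing = books[1].find(".//author")

# findtext() returns the text of the first match, or the default if there is none.
authors = [b.findtext(".//author", default="(unknown)") for b in books]
print(missing, authors)
```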

#### Issuing XPath queries directly in Python

In [None]:
result = doc.xpath("/bookstore/book/title")
titles = [r.text for r in result]
titles

In [None]:
result = doc.xpath("/bookstore/book[1]/title")
titles = [r.text for r in result]
titles

In [None]:
result = doc.xpath("/bookstore/book/price")
price = [r.text for r in result]
price

In [None]:
result = doc.xpath("/bookstore/book[price>35]/price")
price = [r.text for r in result]
price

In [None]:
result = doc.xpath("/bookstore/book[price>35]/title")
titles = [r.text for r in result]
titles
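The `xpath` method is not limited to returning elements. As a small sketch, assuming a two-book document trimmed from `books.xml`, the expressions below return text and attribute values as strings and the result of the XPath `count()` function as a float, which can save the list comprehension over `.text`.

```python
from lxml import etree

# Assumption: a two-book document trimmed from books.xml.
xml = """
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

doc = etree.XML(xml)

# text() selects the text nodes directly, so the result is a list of strings.
titles = doc.xpath("/bookstore/book/title/text()")

# Attribute paths also return strings.
categories = doc.xpath("/bookstore/book/@category")

# XPath functions that produce a number come back as a Python float.
book_count = doc.xpath("count(/bookstore/book)")

print(titles, categories, book_count)
```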