
This is a copy of the [XPath tutorial from W3Schools](http://www.w3schools.com/xpath/xpath_syntax.asp)

### XPath Terminology

#### Nodes


XML documents are treated as trees of nodes. The topmost element of the tree is called the root element. Look at the following XML document that has `<bookstore>` as the root element:

```xml
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
```

There are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. In this class we will not deal with namespace, processing-instruction, or comment nodes, but it is useful to be aware of them.

Examples of nodes in the XML document above:



```xml
<bookstore> (root element node)

<author>J K. Rowling</author> (element node)

lang="en" (attribute node)
```
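To make the node kinds concrete, here is a minimal sketch that inspects the same document from Python. It assumes the `lxml` package is available; the variable names are illustrative only.

```python
from lxml import etree

xml = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.XML(xml)

# The root element node
root_tag = root.tag

# Element nodes: XPath returns lxml Element objects
author_elements = root.xpath("//author")

# Attribute nodes: XPath returns the attribute values as strings
lang_values = root.xpath("//title/@lang")

# Text nodes: text() returns the text content as strings
author_text = root.xpath("//author/text()")

print(root_tag, author_elements[0].tag, lang_values, author_text)
```

Element nodes come back as objects that can be queried further, while attribute and text nodes come back as plain strings.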


### Relationship of Nodes

#### Parent

Each element and attribute has one parent.

In the following example, the book element is the parent of the title, author, year, and price elements:

```xml
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Children

Element nodes may have zero, one or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

```xml
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Siblings

Siblings are nodes that have the same parent.

In the following example, the title, author, year, and price elements are all siblings:

```xml
<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Ancestors

Ancestors are a node's parent, the parent's parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element:

```xml
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
```

#### Descendants

Descendants are a node's children, the children's children, and so on.

In the following example, the descendants of the bookstore element are the book, title, author, year, and price elements:



```xml
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
```
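The relationships above can also be explored programmatically. The following is a sketch, assuming the `lxml` package, that uses lxml's element navigation methods (`getparent`, `itersiblings`, `iterancestors`, `iterdescendants`) on the same document.

```python
from lxml import etree

root = etree.XML("""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>""")

book = root[0]    # first child of bookstore
title = book[0]   # first child of book

# Parent: the single node directly above title
parent = title.getparent().tag

# Children: iterating over an element yields its child elements
children = [child.tag for child in book]

# Siblings: by default itersiblings() yields the siblings that follow title
siblings = [s.tag for s in title.itersiblings()]

# Ancestors: nearest ancestor first
ancestors = [a.tag for a in title.iterancestors()]

# Descendants: all elements below the root, in document order
descendants = [d.tag for d in root.iterdescendants()]

print(parent, children, siblings, ancestors, descendants)
```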

## XPath Syntax

XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.

### The XML Example Document
We will use the following XML document in the examples below.

```xml
<bookstore>

    <book>
      <title lang="en">Harry Potter</title>
      <price>29.99</price>
    </book>

    <book>
      <title lang="fr">Learning XML</title>
      <price>39.95</price>
    </book>

</bookstore>
```

### Selecting Nodes

The most useful path expressions are listed below:

| Expression | Description                                                                                           |
|------------|-------------------------------------------------------------------------------------------------------|
| nodename   | Selects all nodes with the name "nodename"                                                            |
| /          | Selects from the root node                                                                            |
| //         | Selects nodes in the document from the current node that match the selection no matter where they are |
| .          | Selects the current node                                                                              |
| ..         | Selects the parent of the current node                                                                |
| @          | Selects attributes                                                                                    |

In the table below we have listed some path expressions and the result of the expressions:

| Path Expression | Result |
|-----------------|------------------------------------------------------------------------------------------------------------------------------------|
| bookstore | Selects all nodes with the name "bookstore" |
| /bookstore | Selects the root element bookstore. Note: if the path starts with a slash ( / ), it always represents an absolute path to an element. |
| bookstore/book | Selects all book elements that are children of bookstore |
| //book | Selects all book elements no matter where they are in the document |
| bookstore//book | Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element |
| //@lang | Selects all attributes that are named lang |
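As a quick check of the path expressions above, here is a sketch that evaluates several of them against the two-book example document. It assumes the `lxml` package; note that relative expressions are evaluated from the context node, which here is the `bookstore` root element.

```python
from lxml import etree

root = etree.XML("""<bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="fr">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>""")

# Absolute path: starts at the document root
absolute = root.xpath("/bookstore/book")

# Relative path: child elements named "book" of the context node (bookstore)
relative = root.xpath("book")

# // selects matching nodes anywhere in the document
anywhere = root.xpath("//title")

# @ selects attributes; lxml returns their values as strings
langs = root.xpath("//@lang")

# .. selects the parent of each matched node
parents = root.xpath("//title/..")

# . selects the context node itself
current = root.xpath(".")

print(len(absolute), len(relative), [t.text for t in anywhere], langs)
```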

### Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are always embedded in square brackets.

In the table below we have listed some path expressions with predicates and the result of the expressions:



| Path Expression                    	| Result                                                                                                                                 	|
|------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------	|
| /bookstore/book[1]                 	| Selects the first book element that is the child of the bookstore element.                                                             	|
| /bookstore/book[last()]            	| Selects the last book element that is the child of the bookstore element                                                               	|
| /bookstore/book[last()-1]          	| Selects the last but one book element that is the child of the bookstore element                                                       	|
| /bookstore/book[position()<3]      	| Selects the first two book elements that are children of the bookstore element                                                         	|
| //title[@lang]                     	| Selects all the title elements that have an attribute named lang                                                                       	|
| //title[@lang='en']                	| Selects all the title elements that have an attribute named lang with a value of 'en'                                                  	|
| //title[contains(@lang,'en')]                	| Selects all the title elements that have an attribute named lang with a value that contains the string 'en'                                                  	|
| `//title[re:match(text(), 'H.*P.*')]`                	| Selects all the title elements whose text matches the regular expression `H.*P.*` (the `re` prefix must be bound to the EXSLT regular-expressions namespace)                                               	|
| /bookstore/book[price>35.00]       	| Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00                       	|
| /bookstore/book[price>35.00]/title 	| Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00 	|
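The predicates in the table can be tried out against the two-book example document. This is a sketch assuming the `lxml` package; the helper `titles` is a hypothetical convenience function, and the `re` prefix is bound to the EXSLT regular-expressions namespace that lxml supports.

```python
from lxml import etree

root = etree.XML("""<bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <price>29.99</price>
    </book>
    <book>
      <title lang="fr">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>""")

# Namespace binding needed only for the re:match() predicate
ns = {"re": "http://exslt.org/regular-expressions"}

def titles(expr):
    """Evaluate an XPath expression and return the text of the matched nodes."""
    return [node.text for node in root.xpath(expr, namespaces=ns)]

first = titles("/bookstore/book[1]/title")
last = titles("/bookstore/book[last()]/title")
last_but_one = titles("/bookstore/book[last()-1]/title")
first_two = titles("/bookstore/book[position()<3]/title")
has_lang = titles("//title[@lang]")
english = titles("//title[@lang='en']")
expensive = titles("/bookstore/book[price>35.00]/title")
regex = titles("//title[re:match(text(), 'H.*P.*')]")

print(first, last, last_but_one, first_two, has_lang, english, expensive, regex)
```

Positions in XPath start at 1, not 0, so `book[1]` is the first book.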

## XPath Examples in Python

Let's first create our example file:

In [1]:
%%file books.xml
<?xml version="1.0"?>

<bookstore>

<book category="COOKING">
  <title lang="it">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

Overwriting books.xml


We will now read the file in Python and create an XML document using the lxml package.

In [2]:
from lxml import etree
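# Read the file (the with-block closes it automatically)
with open("books.xml", "r") as f:
    filecontent = f.read()
# Parse the string into an XML tree; doc is the root element
doc = etree.XML(filecontent)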
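As an alternative sketch, lxml can also parse directly from a file path with `etree.parse`, or from bytes with `etree.fromstring`. Bytes input is the safer choice when the XML declaration names an encoding, since lxml typically rejects Python strings that carry an encoding declaration. The file name `books_sample.xml` and the shortened document below are assumptions made for illustration.

```python
import os
import tempfile

from lxml import etree

# A shortened, hypothetical sample with an encoding declaration
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="WEB">
    <title lang="en">Learning XML</title>
  </book>
</bookstore>"""

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "books_sample.xml")
    with open(path, "wb") as f:
        f.write(xml_bytes)

    # etree.parse reads the file itself and returns an ElementTree
    tree = etree.parse(path)
    root = tree.getroot()

# etree.fromstring parses bytes and returns the root element directly
root_from_bytes = etree.fromstring(xml_bytes)

print(root.tag, tree.xpath("//book/@category"), root_from_bytes.xpath("//title/text()"))
```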

Now let's get all the book nodes:

In [3]:
result = doc.xpath("//book")

In [4]:
result

[<Element book at 0x7fbfca4ab308>,
 <Element book at 0x7fbfca4ab348>,
 <Element book at 0x7fbfca4ab0c8>,
 <Element book at 0x7fbfc9d47f08>]

In [5]:
len(result)

4

And all the author nodes:

In [6]:
result = doc.xpath("//author")
result

[<Element author at 0x7fbfc9d4b608>,
 <Element author at 0x7fbfc9d4b488>,
 <Element author at 0x7fbfc9d4b408>,
 <Element author at 0x7fbfc9d4b5c8>,
 <Element author at 0x7fbfc9d4b588>,
 <Element author at 0x7fbfc9d4b8c8>,
 <Element author at 0x7fbfc9d4b908>,
 <Element author at 0x7fbfc9d4b948>]

In [7]:
len(result)

8

We can use the `get` method to read an attribute of a node:

In [8]:
result = doc.xpath("//book")
for node in result:
    print(node.get("category"))

COOKING
CHILDREN
WEB
WEB


In [9]:
categories = [r.get("category") for r in result]
categories

['COOKING', 'CHILDREN', 'WEB', 'WEB']

In [10]:
result = doc.xpath("//book/@category")
result

['COOKING', 'CHILDREN', 'WEB', 'WEB']
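The two approaches differ when an attribute is missing. The sketch below, assuming the `lxml` package and a made-up second book without a `category` attribute, shows that `get` returns `None` (or a supplied default) for each node, while the `@category` XPath expression simply skips nodes that lack the attribute.

```python
from lxml import etree

# The second book is a hypothetical entry with no category attribute
doc = etree.XML("""<bookstore>
  <book category="COOKING"><title lang="it">Everyday Italian</title></book>
  <book><title>Untitled Draft</title></book>
</bookstore>""")

books = doc.xpath("//book")

# get() returns None when the attribute is absent
with_none = [b.get("category") for b in books]

# get() accepts a default value as its second argument
with_default = [b.get("category", "UNKNOWN") for b in books]

# The XPath attribute selection only returns attributes that exist
from_xpath = doc.xpath("//book/@category")

# All attributes of a node are available through .attrib
all_attrs = dict(books[0].attrib)

print(with_none, with_default, from_xpath, all_attrs)
```

Using `get` keeps the result aligned with the list of nodes, which matters when the values are later paired with other fields of the same book.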

And we can use the `.text` attribute to get the text of a node:

In [11]:
# Find all the "title" nodes that are immediate children
# of a "book" node
result = doc.xpath("//book/title")
# Collect the text of each such node in a list
titles = [r.text for r in result]
titles

# The list comprehension above is equivalent to the following loop:
# titles = []
# for r in result:
#     titles.append(r.text)



['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML']

In [12]:
# Find all the author nodes, that are children of a book node
result = doc.xpath("//book/author")
# Print the text of the author nodes
authors = [r.text for r in result]
authors

['Giada De Laurentiis',
 'J K. Rowling',
 'James McGovern',
 'Per Bothner',
 'Kurt Cagle',
 'James Linn',
 'Vaidyanathan Nagarajan',
 'Erik T. Ray']

In [13]:
result = doc.xpath("//book/price")
prices = [r.text for r in result]
prices

['30.00', '29.99', '49.99', '39.95']
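Note that the prices come back as strings. A small sketch, assuming the `lxml` package and a two-book excerpt of the example file, shows one way to convert them to numbers in Python, and how the XPath functions `sum()` and `count()` return numeric values directly.

```python
from lxml import etree

doc = etree.XML("""<bookstore>
  <book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title>Harry Potter</title><price>29.99</price></book>
</bookstore>""")

# text() returns the text nodes as strings
price_strings = doc.xpath("//book/price/text()")

# Convert to floats for arithmetic in Python
prices = [float(p) for p in price_strings]

# XPath numeric functions return Python floats
total = doc.xpath("sum(//book/price)")
count = doc.xpath("count(//book)")

print(price_strings, prices, total, count)
```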

In [14]:
# Find all the title nodes that are children of a book node
# and have a lang attribute equal to 'en'
result = doc.xpath("//book/title[@lang='en']")
titles = [r.text for r in result]
titles

['Harry Potter', 'XQuery Kick Start', 'Learning XML']

Here is a more advanced example, where we use two nested list comprehensions to list the authors of each book.
_Notice that we use the `.` marker in the inner query to indicate that we only look under the current node (`book`) and not in the whole document._

In [15]:
books = doc.xpath("//book")
authors = [[author.text for author in book.xpath(".//author")] for book in books]
authors

[['Giada De Laurentiis'],
 ['J K. Rowling'],
 ['James McGovern',
  'Per Bothner',
  'Kurt Cagle',
  'James Linn',
  'Vaidyanathan Nagarajan'],
 ['Erik T. Ray']]
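To see why the leading `.` matters, the sketch below (assuming the `lxml` package and a two-book excerpt of the example file) runs both forms of the query from the first `book` node. Without the dot, `//author` starts from the document root even though it is called on a single book.

```python
from lxml import etree

doc = etree.XML("""<bookstore>
  <book><title>Harry Potter</title><author>J K. Rowling</author></book>
  <book><title>Learning XML</title><author>Erik T. Ray</author></book>
</bookstore>""")

first_book = doc.xpath("//book")[0]

# Relative query: only authors under the current book node
relative = [a.text for a in first_book.xpath(".//author")]

# Absolute query: all authors in the whole document
absolute = [a.text for a in first_book.xpath("//author")]

print(relative)
print(absolute)
```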

#### Examples of more XPath queries

In [16]:
result = doc.xpath("/bookstore/book[1]/title")
titles = [r.text for r in result]
titles

['Everyday Italian']

In [17]:
result = doc.xpath("/bookstore/book/price")
price = [r.text for r in result]
price

['30.00', '29.99', '49.99', '39.95']

In [18]:
result = doc.xpath("/bookstore/book[price>35]/price")
price = [r.text for r in result]
price

['49.99', '39.95']

In [19]:
result = doc.xpath("/bookstore/book[price>35]/title")
titles = [r.text for r in result]
titles

['XQuery Kick Start', 'Learning XML']

In [20]:
result = doc.xpath("//title[contains(text(), 'XML')]")
titles = [r.text for r in result]
titles

['Learning XML']

In [21]:
result = doc.xpath("//book[contains(@category, 'C')]/title")
titles = [r.text for r in result]
titles

['Everyday Italian', 'Harry Potter']

In [22]:
result = doc.xpath("//title[re:match(text(), 'H.*P.*')]", namespaces={'re': "http://exslt.org/regular-expressions"})
titles = [r.text for r in result]
titles

['Harry Potter']
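The same EXSLT namespace also provides `re:test`, which returns a boolean and accepts optional flags such as `'i'` for case-insensitive matching. The following is a sketch under the assumption that lxml's EXSLT regular-expression support is available, using a title-only excerpt of the example file; `RE_NS` is an illustrative name.

```python
from lxml import etree

doc = etree.XML("""<bookstore>
  <book><title lang="it">Everyday Italian</title></book>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="en">XQuery Kick Start</title></book>
  <book><title lang="en">Learning XML</title></book>
</bookstore>""")

# Bind the "re" prefix to the EXSLT regular-expressions namespace
RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Case-insensitive: titles that end with "xml"
ends_with_xml = [
    t.text
    for t in doc.xpath("//title[re:test(text(), 'xml$', 'i')]", namespaces=RE_NS)
]

# Case-sensitive: titles that start with E or H
starts_with_e_or_h = [
    t.text
    for t in doc.xpath("//title[re:test(text(), '^[EH]')]", namespaces=RE_NS)
]

print(ends_with_xml)
print(starts_with_e_or_h)
```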