# Denison CS181/DA210 SW Lab #8 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

> **In the questions that follow, we are looking for XPath declarative solutions to the problems, not procedural solutions.  You will not get credit for procedural solutions.**

---

## Part A: XPath basics

As we've seen in class, XPath provides a powerful declarative alternative to XML procedural operations.

We'll summarize some basic XPath operations here, as well as some we haven't seen yet.

We'll use the `indicators` dataset in `ind0.xml`.

In [None]:
from lxml import etree
import os.path

datadir = "publicdata"

ind_path = os.path.join(datadir, "ind0.xml")
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(ind_path, parser)

ind_root = tree.getroot()

#### First child of the root

We can use the procedural operation `getchildren` and index into the resulting list (note that lxml marks `getchildren` as deprecated in favor of `list(ind_root)`):

In [None]:
first_child = ind_root.getchildren()[0]
print("First child:", first_child, "with tag:", first_child.tag) # how else?

Alternatively, we can use XPath to match all nodes with the given tag and then index into the resulting list:

In [None]:
first_child = ind_root.xpath("/indicators/country")[0]
print("First child:", first_child, "with tag:", first_child.tag)

Or, we could use the built-in function `position()` to specify that we want the first node that matches the path.  (Note that `position()` indexes from 1, and not 0.)

In [None]:
first_child = ind_root.xpath("/indicators/country[position() = 1]")[0] # only one element
print("First child:", first_child, "with tag:", first_child.tag)
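As a minimal sketch of how `position()` counts, the cell below parses a small made-up document (the country codes and names are illustrative assumptions, not the contents of `ind0.xml`) and compares `position() = 1`, the shorthand `[1]`, and `last()`.

```python
from lxml import etree

# Hypothetical miniature document; not the real ind0.xml
demo_root = etree.fromstring(
    b'<indicators>'
    b'<country code="AAA" name="First"/>'
    b'<country code="BBB" name="Second"/>'
    b'<country code="CCC" name="Third"/>'
    b'</indicators>'
)

# position() is 1-based, so this selects the first <country>
first = demo_root.xpath("/indicators/country[position() = 1]/@code")
# A bare number in a predicate is shorthand for position() = number
first_short = demo_root.xpath("/indicators/country[1]/@code")
# last() gives the position of the final node matched by the step
last = demo_root.xpath("/indicators/country[position() = last()]/@code")

print(first, first_short, last)
```

Note that the XPath result is still a Python list, so the Python-side index `[0]` in the cells above remains 0-based even though `position()` is 1-based.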

#### Value of attribute

In procedural XML operations, we need to use the `attrib` dictionary or `.get()` to get the value of an attribute:

In [None]:
country_names = []
for country_node in ind_root:
    country_names.append(country_node.get("name"))
print(country_names)

In XPath, we can take another "step" in our path for the given attribute (but have no need to specify a loop):

In [None]:
country_names = ind_root.xpath("/indicators/country/@name")
print(country_names)
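The two approaches differ when an attribute is missing. The sketch below uses a made-up document (an assumption for illustration, not `ind0.xml`) in which one `country` has no `name` attribute: `.get()` yields `None` for that node, while the `@name` step simply matches nothing there.

```python
from lxml import etree

# Hypothetical miniature document; the second <country> has no name attribute
demo_root = etree.fromstring(
    b'<indicators>'
    b'<country code="AAA" name="First"/>'
    b'<country code="BBB"/>'
    b'</indicators>'
)

# Procedural: .get() returns None when the attribute is missing
via_get = [node.get("name") for node in demo_root]

# Declarative: the @name step only matches attributes that exist
via_xpath = demo_root.xpath("/indicators/country/@name")

print(via_get)    # one entry per <country> element
print(via_xpath)  # one entry per existing name attribute
```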

#### Children (tags) of a node

Similarly, in procedural XML, we need to use a loop to get the tags of all children of a node.

In [None]:
# Goal: tags below timedata for France in 2007
code = "FRA"
year = "2007"
tags = []
for country_node in ind_root:
    if country_node.get("code") != code: continue
    for timedata_node in country_node:
        if timedata_node.get("year") != year: continue
        for ind_node in timedata_node:
            tags.append(ind_node.tag)
print(tags)

In XPath, this is significantly simpler, as we can filter on attributes:

In [None]:
code = "FRA"
year = "2007"
# Note: the attribute values must be quoted inside the XPath expression,
# so we put '' around each {} and fill in the values with a format string
path = "/indicators/country[@code='{}']/timedata[@year='{}']/*".format(
    code, year)
ind_nodes = ind_root.xpath(path)
tags = [node.tag for node in ind_nodes]
print(tags)
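As an alternative to building the path with a format string, lxml's `xpath()` method accepts keyword arguments that are bound to XPath variables (written `$name` in the expression), so no manual quoting is needed. The sketch below assumes a made-up document; the child tags `pop`, `gdp`, and `life` are placeholders.

```python
from lxml import etree

# Hypothetical miniature document; tags below <timedata> are placeholders
demo_root = etree.fromstring(
    b'<indicators>'
    b'<country code="FRA"><timedata year="2007"><pop/><gdp/></timedata></country>'
    b'<country code="USA"><timedata year="2007"><life/></timedata></country>'
    b'</indicators>'
)

# $code and $year are XPath variables; lxml fills them from keyword arguments
path = "/indicators/country[@code=$code]/timedata[@year=$year]/*"
ind_nodes = demo_root.xpath(path, code="FRA", year="2007")
tags = [node.tag for node in ind_nodes]
print(tags)
```

Because the values are passed separately from the expression, a value containing a quote character cannot break the path.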

#### Text of a node

Using procedural XML operations, we could find all nodes with a given tag and get their text, but this again requires a loop:

In [None]:
pop_list = []
for pop_node in ind_root.iter("pop"):
    pop_list.append(pop_node.text)
print(pop_list)

In XPath, we need only get the `text()` of each node in the path:

In [None]:
pop_list = ind_root.xpath("/indicators/country/timedata/pop/text()")
print(pop_list)

Additionally, we can use the `//` shortcut, which matches nodes at any depth, if we want all nodes with a given tag in the tree:

In [None]:
pop_list = ind_root.xpath("//pop/text()")
print(pop_list)
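One detail worth checking is what `//` is relative to. In the sketch below (a made-up document, assumed for illustration), a leading `//` searches the whole document even when called on an inner node, whereas `.//` searches only below the node it is called on.

```python
from lxml import etree

# Hypothetical miniature document; not the real ind0.xml
demo_root = etree.fromstring(
    b'<indicators>'
    b'<country code="AAA"><timedata year="2017"><pop>1.5</pop></timedata></country>'
    b'<country code="BBB"><timedata year="2017"><pop>2.5</pop></timedata></country>'
    b'</indicators>'
)
second_country = demo_root[1]

# A leading // starts from the document root, whatever node we call it on
whole_doc = second_country.xpath("//pop/text()")
# A leading .// starts from the context node (here, the second country)
below_node = second_country.xpath(".//pop/text()")

print(whole_doc)
print(below_node)
```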

#### Filtering on attributes and text

We can use XML procedural operations to get the countries whose 2017 population is less than 100 million (the comparison against `100` below treats the `pop` values as millions):

In [None]:
small_list = []
for country_node in ind_root:
    for timedata_node in country_node:
        if timedata_node.get("year") != "2017": continue
        pop_node = timedata_node.find("pop")
        if float(pop_node.text) < 100:
            small_list.append(country_node.get("name"))
            break
print(small_list)

With XPath, we can do this filtering within our path, with some extra steps at the end to backtrack up the tree to get the country's name:

In [None]:
small_list = ind_root.xpath("//timedata[@year='2017']/pop[text()<100]/../../@name")
print(small_list)
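Backtracking with `..` is one way to write this query; another is to put the whole condition in a predicate on `country`, so the path never has to climb back up. The sketch below compares the two on a made-up document (names and values are assumptions for illustration).

```python
from lxml import etree

# Hypothetical miniature document; pop values are made up
demo_root = etree.fromstring(
    b'<indicators>'
    b'<country name="Small"><timedata year="2017"><pop>50.0</pop></timedata></country>'
    b'<country name="Large"><timedata year="2017"><pop>300.0</pop></timedata></country>'
    b'<country name="Shrunk"><timedata year="2007"><pop>150.0</pop></timedata>'
    b'<timedata year="2017"><pop>90.0</pop></timedata></country>'
    b'</indicators>'
)

# Walk down to the matching <pop>, then back up two levels to the country
backtrack = demo_root.xpath(
    "//timedata[@year='2017']/pop[text()<100]/../../@name")

# Stay at <country> and test the condition inside a nested predicate
nested = demo_root.xpath(
    "//country[timedata[@year='2017']/pop < 100]/@name")

print(backtrack)
print(nested)
```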

---

## Part B: Familiarize yourself with the files

**Q1:** Begin by reading in and parsing the relevant datasets and familiarizing yourself with them.  In this file, we will work with:

* `countries.xml`
* `topnames.xml`

You should name the variables representing the root nodes `countries_root` and `topnames_root`, respectively.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert type(countries_root) is etree._Element
assert len(countries_root) == 231

assert type(topnames_root) is etree._Element
assert len(topnames_root) == 139

---

## Part C: `countries.xml`

**Q2:** Generate a list of all the country names in the `countries.xml` file, assigning to a variable `countries`.  Then, assign the number of countries to the variable `countrycount`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert(countrycount == 231)
assert('Uruguay' in countries)

**Q3:** Write a function `findPop(root,countryName)` that finds the population of the country named `countryName` in the dataset `countries.xml`. Use an XPath expression and a format string. Return your answer as an integer.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Testing cell
assert findPop(countries_root,'Cuba') == 10951334
assert findPop(countries_root,'Uruguay') == 3238952

**Q4:** Study the `countries` data carefully.  Then, use the `position()` function to create a node set containing the population of the second city listed for each country in positions 5-55 inclusive, skipping any country with fewer than two cities listed.  (Note that you can use `and` inside the filter in `[]` for a given node.)

For example, nothing is in the node set for Aruba (no cities listed) or Armenia (only Yerevan listed), but the population of Cordoba (a city in Argentina) is in the node set because Argentina has four cities listed.

Your answer should use a single XPath expression.  Store the results in a list `secondPops` of integers.
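As a small illustration of combining conditions with `and` inside a single `[]` filter, the sketch below uses made-up library data (entirely hypothetical and unrelated to `countries.xml`) to select a range of positions.

```python
from lxml import etree

# Hypothetical data, unrelated to countries.xml
demo_root = etree.fromstring(
    b'<library>'
    b'<shelf id="s1"><book>A</book></shelf>'
    b'<shelf id="s2"><book>B</book><book>C</book></shelf>'
    b'<shelf id="s3"><book>D</book></shelf>'
    b'<shelf id="s4"><book>E</book></shelf>'
    b'</library>'
)

# Both conditions must hold for a <shelf> to pass the filter
ids = demo_root.xpath(
    "/library/shelf[position() >= 2 and position() <= 3]/@id")
print(ids)
```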

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print the list
print(secondPops)

In [None]:
assert len(secondPops) == 6
assert secondPops[0] == 1208713
assert secondPops[5] == 1064255

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: How could you change the previous question's answer to instead return the names of the second cities, but only if the population is at least 1500000?

---

## Part D: `topnames.xml`

**Q5:** With reference to the `topnames` dataset, find all years where there was a count (either sex) that was strictly larger than 50,000.  Store the resulting list of years (as strings) in a variable `yearsList1`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print the list
print(yearsList1)

In [None]:
# Testing cell
assert yearsList1[0] == '1915'
assert len(yearsList1) == 78

**Q6:** With reference to the `topnames` dataset, find all years where the top female name had a count that was strictly larger than 50,000. Store the resulting list of years (as strings) in a variable `yearsList2`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Print the list
print(yearsList2)

In [None]:
# Testing cell
assert yearsList2[0] == '1915'
assert len(yearsList2) == 68

> You've reached the second (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: How could you instead find all years for which the combined male and female count was strictly larger than 50,000?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE