# Denison CS181/DA210 SW Lab #9 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import io
import sys
import importlib
import pandas as pd
from lxml import etree
import requests

if os.path.isdir(os.path.join("../../..", "modules")):
    module_dir = os.path.join("../../..", "modules")
else:
    module_dir = os.path.join("../..", "modules")

module_path = os.path.abspath(module_dir)
if module_path not in sys.path:
    sys.path.append(module_path)

import util
importlib.reload(util)

---

## Part A: HTML Structure

Although we will typically assume that HTML is properly formatted, there is no guarantee of this on real websites, which may have been created without (or before) such conventions.

For example, consider the following string: `"<html><head><title>test<body><h1>header title</h3>"`.  This is "bad" for several reasons:

* The `<html>`, `<head>`, `<title>`, and `<body>` tags are all missing a closing tag.
* The `<h1>` header tag is closed with a mismatched `</h3>` tag.

If we try to use an XML parser, this will fail:

In [None]:
# Broken HTML: <html>, <head>, <title>, <body> not closed, <h1> closed as <h3>
bad_html = "<html><head><title>test<body><h1>header title</h3>"

# Try and fail to parse as XML
xmlparser = etree.XMLParser()
try:
    tree = etree.parse(io.StringIO(bad_html), parser=xmlparser)
    util.print_xml(tree.getroot())
except etree.XMLSyntaxError:
    # Should end up here
    print("Failed to parse as XML")

However, we can instead use an HTML parser provided by `etree`, which can handle such messy HTML:

In [None]:
# Try again as HTML
htmlparser = etree.HTMLParser()
try:
    # This one should work
    tree = etree.parse(io.StringIO(bad_html), parser=htmlparser)
    util.print_xml(tree.getroot())
except etree.LxmlError:
    print("Failed to parse as HTML")
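
To see what the HTML parser did with the broken input, one option is to serialize the recovered tree back to text. The following is a minimal sketch rather than part of the lab; it assumes `etree.fromstring` and `etree.tostring` from `lxml`, and the exact repaired markup may differ between versions of the underlying parser.

```python
from lxml import etree

# Same broken HTML as above: unclosed tags and a mismatched </h3>
bad_html = "<html><head><title>test<body><h1>header title</h3>"

# HTMLParser recovers from malformed markup instead of raising an error
root = etree.fromstring(bad_html, parser=etree.HTMLParser())

# Serialize the recovered tree; every opened tag is closed in the output
repaired = etree.tostring(root, pretty_print=True, encoding="unicode")

print(root.tag)
print(repaired)
```

Comparing the printed markup with `bad_html` shows which closing tags the parser had to supply.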

Now, let's consider well-formed HTML.  Like any XML document, it must have a single root node.  For HTML documents, this should be `<html>`.  This node should have, at most, one `<head>` child and one `<body>` child, in that order.

The `<head>` node contains meta information about the HTML document.  A common part of this is the webpage title, using the `<title>` tag.

The `<body>` node contains the content of the webpage.  This can include text containers (e.g., `<div>`, `<p>`, and `<span>`), headers (`<h1>` through `<h6>`), links (`<a>`), lists (`<ul>` or `<ol>`), and tables (`<table>`).

Here is a simple example (which we can parse as either XML or HTML, as it is properly formed XML):

In [None]:
# A simple HTML string
simple_html = "<html><head><title>test</title></head><body><h1>header title</h1></body></html>"
tree = etree.parse(io.StringIO(simple_html), parser=xmlparser)

# Display the HTML
util.print_xml(tree.getroot())
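
The structure described above can also be checked procedurally. This is a small sketch, assuming only `lxml` and the same `simple_html` string; it reads the children of the root and uses `find` with paths relative to the root.

```python
import io
from lxml import etree

simple_html = "<html><head><title>test</title></head><body><h1>header title</h1></body></html>"
root = etree.parse(io.StringIO(simple_html), parser=etree.XMLParser()).getroot()

# The root is <html>; its children are <head> and <body>, in that order
child_tags = [child.tag for child in root]

# find() takes a path relative to the node it is called on
title_text = root.find("head/title").text
heading_text = root.find("body/h1").text

print(root.tag, child_tags)
print(title_text, "|", heading_text)
```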

---

## Part B: Web scraping

We can either work with locally saved HTML documents, or download them from the web.  We won't focus on this for now, so the code in the following cell doesn't need to make too much sense to you yet (see Chapters 18-21 for what we've skipped so far in this regard if you're curious).

#### Scraping via GET request

At a high level, we can use a _URL_ to access a document on the web, and form a _request_ to _get_ the content at that URL.  If the _response_ has _status_ `200`, then the request was successful.

In this case, we will download the HTML source of the page: [http://datasystems.denison.edu/basic.html](http://datasystems.denison.edu/basic.html).

In [None]:
# Download HTML from a web URL

location = "datasystems.denison.edu"
resource = "/basic.html"

url = util.buildURL(resource, location)
response = requests.get(url)
assert response.status_code == 200

# Display the retrieved HTML text
basic_html = response.text
print(basic_html)

As you can see, this webpage is slightly more complex than our previous simple example: it has a nested list, with the outer list being unordered (bullet points), and the inner list being ordered (numbered).

It also contains two heading levels, as well as bolded text and a link inside of a paragraph node.

#### Scraping via `curl`

Alternatively, we can use the `curl` command (a command-line tool, not part of Python itself) to download the webpage content to a local HTML file.  The following command will save the HTML source of [http://datasystems.denison.edu/basic.html](http://datasystems.denison.edu/basic.html) to your computer, in a file `basic.html` in the same folder as this notebook.

In [None]:
# Download the HTML to a file -- do not modify this!
!curl -s -o basic.html http://datasystems.denison.edu/basic.html

---

## Part C: Static Web Page Example: Table

First, we'll consider the `indicators2016` dataset represented as a table within a web page: [http://datasystems.denison.edu/ind2016.html](http://datasystems.denison.edu/ind2016.html).

In [None]:
# Download the HTML to a file -- do not modify this!
!curl -s -o ind2016.html http://datasystems.denison.edu/ind2016.html

**Q1:** First, we need to do some discovery.  Use `etree` to parse the root of the HTML tree from `ind2016.html` into the variable `ind2016_root`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display a snippet of the file (using a util module provided with the textbook)
util.print_xml(ind2016_root, depth=3, nchild=3)

**Q2:** Use XPath to find all `<table>` nodes in the `ind2016` HTML tree.  Store the resulting list in a variable `ind2016_table_nodes`.
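
As a reminder of the XPath syntax involved, here is a minimal sketch on a toy document. The string `toy_html` and the names below are hypothetical and unrelated to the lab file; the point is only that `xpath` returns a Python list of matching nodes.

```python
from lxml import etree

# Toy document (hypothetical, not the lab file)
toy_html = "<html><body><p>one</p><div><p>two</p></div></body></html>"
toy_root = etree.fromstring(toy_html, parser=etree.HTMLParser())

# "//p" matches every <p> node at any depth; the result is a Python list
p_nodes = toy_root.xpath("//p")

print(len(p_nodes))
print([node.tag for node in p_nodes])
```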

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the resulting list
ind2016_table_nodes

In [None]:
# Testing cell
assert type(ind2016_table_nodes) is list
assert len(ind2016_table_nodes) == 1
assert ind2016_table_nodes[0].tag == "table"

**Q3:** The previous question should have resulted in a list of only one node.  From this node, use XPath or XML procedural operations to retrieve the column names in the table.  Store this list in a variable `ind2016_columns`.
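
The two approaches mentioned here can be compared on a toy list. This sketch assumes a hypothetical `toy_html` string, not the lab file: it extracts the same text once with a relative XPath expression and once by looping over child nodes.

```python
from lxml import etree

# Toy document (hypothetical, not the lab file)
toy_html = "<html><body><ol><li>alpha</li><li>beta</li></ol><p>other</p></body></html>"
toy_root = etree.fromstring(toy_html, parser=etree.HTMLParser())
ol_node = toy_root.xpath("//ol")[0]

# XPath: "./li/text()" is relative to ol_node and returns the text as strings
texts_xpath = ol_node.xpath("./li/text()")

# Procedural: iterate over the children and read each .text attribute
texts_loop = [li.text for li in ol_node]

print(texts_xpath)
print(texts_loop)
```

Starting the expression with `.` keeps the search inside the node it is called on, so the `<p>` elsewhere in the document is never visited.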

In [None]:
ind2016_table_node = ind2016_table_nodes[0]
util.print_xml(ind2016_table_node, depth=3, nchild=3)

# YOUR CODE HERE
raise NotImplementedError()

# Display the resulting list
ind2016_columns

In [None]:
# Testing cell
assert type(ind2016_columns) is list
assert len(ind2016_columns) == 6
assert "code" in ind2016_columns
assert "life" in ind2016_columns

**Q4:** One way to process the data in the table is to read the text of all data cells, and then group them into a list of lists (LoL), assuming the same number of cells in each row.

Modify the following code to use XPath to retrieve the text of all data cells in the table, stored in a variable `ind2016_td_text`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

ind2016_LoL = []
colNum = 0
for text in ind2016_td_text:
    if colNum == 0:
        row = []

    # Add the text to the current row
    row.append(text)

    # If this was the last element in the row, add it to the LoL,
    # otherwise increment the column number
    if colNum == len(ind2016_columns)-1:
        ind2016_LoL.append(row)
        colNum = 0
    else:
        colNum += 1

# Print a subset of the resulting LoL
util.print_data(ind2016_LoL, nlines=20)

In [None]:
# Testing cell
assert type(ind2016_td_text) is list
assert len(ind2016_td_text) == 36
assert len(ind2016_LoL) == 6
assert ind2016_LoL[0][0] == "CAN"
assert ind2016_LoL[2][4] == "68.56"
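
The grouping loop above can also be written with list slicing. This is an alternative sketch, not the lab's method; `chunk_rows` and `flat_cells` are hypothetical names used only for illustration.

```python
def chunk_rows(cells, ncols):
    """Group a flat list of cell values into rows of ncols values each."""
    return [cells[i:i + ncols] for i in range(0, len(cells), ncols)]

# Hypothetical flat cell text: two rows of three columns
flat_cells = ["AAA", "1.0", "2.0", "BBB", "3.0", "4.0"]
rows = chunk_rows(flat_cells, 3)

print(rows)
```

One difference from the loop: if the number of cells is not a multiple of `ncols`, slicing keeps a shorter final row, whereas the loop drops the incomplete row.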

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: Alternatively, we could use XPath to create a dictionary of lists (DoL) representation of this data table.  How could you write an expression using `ind2016_table_node` to get the text for each cell in a given column?

---

## Part D: Static Web Page Example: Nested Lists

Next, we'll consider the `indicators0` dataset represented as a set of nested lists within a web page: [http://datasystems.denison.edu/ind0.html](http://datasystems.denison.edu/ind0.html).

In [None]:
# Download the HTML to a file -- do not modify this!
!curl -s -o ind0.html http://datasystems.denison.edu/ind0.html

**Q5:** Once again, discovery is our first step.  Use `etree` to parse the root of the HTML tree from `ind0.html` into the variable `ind0_root`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display a snippet of the file (using a util module provided with the textbook)
util.print_xml(ind0_root, depth=3, nchild=3)

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: Look through the HTML for `ind0.html` and/or play around with XPath.  How many `<ul>` nodes are there?  How can you identify the one containing the indicators nested-list data?

**Q6:** This webpage was created by a tool, so it has a lot going on (e.g., due to formatting) between the `<body>` node and the nested lists.  Use XPath to find the top-level HTML unordered-list element representing the indicators data, and store that node in the variable `ind0_list_node`.
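
When a page contains several nodes with the same tag, XPath predicates can narrow the match. The sketch below uses a hypothetical toy page, not `ind0.html`, whose real structure may call for a different predicate; it shows one predicate on an attribute and one on the shape of the subtree.

```python
from lxml import etree

# Toy page (hypothetical): a navigation list plus a nested data list
toy_html = (
    "<html><body>"
    "<ul class='nav'><li>Home</li><li>About</li></ul>"
    "<div><ul><li>X<ul><li>1</li></ul></li><li>Y<ul><li>2</li></ul></li></ul></div>"
    "</body></html>"
)
toy_root = etree.fromstring(toy_html, parser=etree.HTMLParser())

# Every <ul> in the document, nested ones included
all_ul = toy_root.xpath("//ul")

# Predicate on an attribute value
nav_ul = toy_root.xpath("//ul[@class='nav']")

# Predicates on structure: a <ul> whose items contain a <ul>,
# and which is not itself nested inside another <ul>
outer_ul = toy_root.xpath("//ul[li/ul][not(ancestor::ul)]")

print(len(all_ul), len(nav_ul), len(outer_ul))
```

Since `xpath` always returns a list, a single node is obtained by indexing the result, for example `outer_ul[0]`.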

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# You should get a single node with tag "ul"
util.print_xml(ind0_list_node, depth=4, nchild=3, nlines=18)

In [None]:
# Testing cell
assert type(ind0_list_node) is etree._Element
assert ind0_list_node.tag == "ul"
assert len(ind0_list_node) == 3

**Q7:** The subtree for `FRA` is fairly straightforward.  Use XPath or XML procedural operations to construct a row dictionary for `FRA` with columns (keys) `code`, `pop2007`, `gdp2007`, `pop2017`, and `gdp2017`, storing the numeric values as floats.  Store your dictionary in a variable `FRA_rowD`.
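
The general pattern of turning a nested list into a row dictionary can be sketched on a toy tree. The layout below (a code, then a year, then two values) and all names are hypothetical and may not match `ind0.html`; note that text read from the tree is a string, so numeric values need an explicit `float` conversion.

```python
from lxml import etree

# Toy nested list (hypothetical layout): code -> year -> [pop, gdp]
toy_html = (
    "<ul><li>AAA<ul>"
    "<li>2000<ul><li>1.5</li><li>20.25</li></ul></li>"
    "</ul></li></ul>"
)
toy_root = etree.fromstring(toy_html, parser=etree.HTMLParser())

# The first <li> directly under a <ul>, in document order, is the code item
code_li = toy_root.xpath("//ul/li")[0]
code = code_li.text.strip()

# Step down one level to the year item, then read its value items
year_li = code_li.xpath("./ul/li")[0]
year = year_li.text.strip()
values = year_li.xpath("./ul/li/text()")

# Build the row dictionary, converting the value strings to floats
rowD = {
    "code": code,
    "pop" + year: float(values[0]),
    "gdp" + year: float(values[1]),
}

print(rowD)
```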

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the resulting data row dictionary
print(FRA_rowD)

In [None]:
# Testing cell
assert type(FRA_rowD) is dict
assert len(FRA_rowD) == 5
assert FRA_rowD["code"] == "FRA"
assert FRA_rowD["pop2007"] == 64.02
assert FRA_rowD["gdp2017"] == 2586.29

> You've reached the third (and final) checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 3: The subtrees for `GBR` and `USA` are more complex, as their data for 2007 are inside of `<span>` nodes, but the 2017 data are not.  How would you need to change your approach to handle this case?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE