# Data in More Complex Forms

## 1. XML

E.g. Citation Analysis: Access bibliography for each article.
Q: How easy is it to pull out that type of data and use it programmatically when data is encoded as XML?


Question: XML vs JSON?

1. Platform-independent data transfer (Producer and consumer apps implemented in any way).
2. Easy to write code to read/write.
3. Document validation: Write a specification for a particular type of doc in XML and any specific examples of that doc produced can be validated against that spec.
4. Human readable: Good idea of what file contains just by looking at it.
5. Supports a wide variety of apps.

### XML Standard (Benefits)
1. -> Robust parsers in most languages
2. -> We can focus on our app.
3. It's free and isn't owned by a company.

* Can build databases to support different types of queries or piece together data from different sources.
* Can reliably be converted into other formats.
* Lets you separate form or appearance from content: XML -> Structure of content, vs Stylesheet -> Formatting

XML designed to work with tree structures as in documents. (Tree structures vs key-value pairs.)
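
To make the tree-versus-key-value contrast concrete, here is a minimal sketch. The sample title markup and the JSON record are invented for illustration and are not taken from the course data.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical document-style XML: text and child elements are interleaved.
xml_doc = "<title><p>Standards for <it>citation</it> data</p></title>"
p = ET.fromstring(xml_doc).find("p")

# The tree keeps the order of text and markup:
# p.text is the text before <it>, it.text is inside it, it.tail follows it.
it = p.find("it")
pieces = [p.text, it.text, it.tail]
print(pieces)

# Hypothetical JSON for record-style data: plain key-value lookups.
record = json.loads('{"title": "Standards for citation data", "year": 2012}')
print(record["title"])
```

Mixed text and markup like this is awkward to express as key-value pairs, which is one reason document-style data tends to stay in XML.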

### XML Syntax

Elements are the building blocks of an XML doc.

**XML element**
* Composed of open tag and close tag.
* Attributes
* Two types of data: more document-oriented vs more data-oriented

e.g. NYTimes Most Popular API (document-oriented) vs OpenStreetMap

OpenStreetMap: Human-created data superimposed on a map. NOT document-oriented, just data. Attributes heavily used.
* Mapping from geographic coordinates to street coordinates.
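
A minimal sketch of reading attribute-heavy XML with ElementTree. The node below only imitates the OpenStreetMap style; its id, coordinates and tags are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical OpenStreetMap-style node; ids, coordinates and tags are invented.
osm_xml = """<osm>
  <node id="1" lat="42.3601" lon="-71.0589">
    <tag k="amenity" v="cafe"/>
    <tag k="name" v="Example Cafe"/>
  </node>
</osm>"""

root = ET.fromstring(osm_xml)
node = root.find("node")

# Attributes live in the element's attrib dictionary; values are always strings.
lat = float(node.attrib["lat"])
lon = float(node.get("lon"))

# Each <tag> element is empty: all of its data sits in the k and v attributes.
tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
print(lat, lon, tags)
```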

### Parsing XML

Parsing XML into a document tree (vs SAX parsing)
- Read entire XML tree into memory.
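
The trade-off can be sketched as follows. Note that this uses ElementTree's event-based `iterparse` as a stand-in for streaming, not the `xml.sax` module itself, and the tiny input is invented.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input; in practice this would be a large file on disk.
xml_bytes = b"<authors><au><snm>A</snm></au><au><snm>B</snm></au></authors>"

# Tree parsing: the whole document is held in memory at once.
root = ET.parse(io.BytesIO(xml_bytes)).getroot()
tree_names = [au.find("snm").text for au in root.findall("au")]

# Streaming: iterparse yields each element as its closing tag is read,
# so processed elements can be discarded along the way.
stream_names = []
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "au":
        stream_names.append(elem.find("snm").text)
        elem.clear()  # release the children we no longer need

print(tree_names, stream_names)
```

Both approaches see the same data; the streaming version just never needs the full tree at once.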

In [None]:
# Part of Python standard library
import xml.etree.ElementTree as ET
import pprint
# Makes sense to work with document-oriented XML tree 
# when parsing XML into a document tree

# Mostly working with element objects.

# Parsing document into tree
tree = ET.parse('exampleResearchArticle.xml')
# From tree we're getting the root element.
root = tree.getroot()

# Then iterate over children of root element and 
# use tag attribute to print out tag name of each child element.
print('\nChildren of root:')
for child in root:
    print(child.tag)


In [None]:
# E.g. extract title from article.

# Use find method on root element and using Xpath expression to 
# express where you expect to find e.g. title element.
# Current element -> fm el -> bibl (child) -> title el itself. 
# Like how you specify a path in many file systems.

title = root.find('./fm/bibl/title')
# Many text elements in this doc are wrapped in paragraph tags.
# Iterate over all children of title,
title_text = ""
for p in title:
    # take the text attribute of each child
    # and concatenate it onto the title text.
    title_text += p.text
print('\nTitle:\n' + title_text)

print('\nAuthor email addresses:')
for a in root.findall('./fm/bibl/aug/au'):
    email = a.find('email')
    if email is not None:
        print(email.text)

Methods:
1. find
2. findall
3. (text)
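
A small self-contained sketch of the three in action. The fragment imitates the article layout used above, but the names and email address are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the article layout; the content is invented.
xml_doc = """<art>
  <fm><bibl><aug>
    <au><snm>Ada</snm><email>ada@example.com</email></au>
    <au><snm>Bob</snm></au>
  </aug></bibl></fm>
</art>"""

root = ET.fromstring(xml_doc)

# find returns the first match, or None when nothing matches.
first = root.find("./fm/bibl/aug/au")
missing = root.find("./fm/bibl/title")

# findall returns a list of every match (possibly empty).
authors = root.findall("./fm/bibl/aug/au")

# text holds the character data; check for None before reading it.
emails = []
for au in authors:
    email = au.find("email")
    emails.append(email.text if email is not None else None)

print(first.find("snm").text, missing, emails)
```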

### Exercise

In [6]:
#!/usr/bin/env python
# Your task here is to extract data from xml on authors of an article
# and add it to a list, one item for an author.
# See the provided data structure for the expected format.
# The tags for first name, surname and email should map directly
# to the dictionary keys
import xml.etree.ElementTree as ET

article_file = "data/exampleResearchArticle.xml"


def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()


def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None
        }
        
        # YOUR CODE HERE
        data['fnm'] = author.find('fnm').text
        data['snm'] = author.find('snm').text
        data['email'] = author.find('email').text            
        
        authors.append(data)

    return authors

get_authors(get_root(article_file))

[{'email': 'omer@extremegate.com', 'fnm': 'Omer', 'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com', 'fnm': 'Mike', 'snm': 'Carmont'},
 {'email': 'laver17@gmail.com', 'fnm': 'Lior', 'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net', 'fnm': 'Meir', 'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com', 'fnm': 'Hagay', 'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com', 'fnm': 'Gideon', 'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com', 'fnm': 'Barnaby', 'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'snm': 'Kots'}]

**Remember to use the text attribute** or you'll get objects, e.g. < Element 'snm' at 0x106681958 >

### Exercise: Handling Attributes
* XML tags often contain attribute name-value pairs (e.g. OpenStreetMap data, or the iid attribute in the insr tag, which links authors with institutions).
* Because authors may be affiliated with multiple institutions, the insr field needs to be a list of iid values.

In [26]:
#!/usr/bin/env python
# Your task here is to extract data from xml on authors of an article
# and add it to a list, one item for an author.
# See the provided data structure for the expected format.
# The tags for first name, surname and email should map directly
# to the dictionary keys, but you have to extract the attributes from the "insr" tag
# and add them to the list for the dictionary key "insr"
import xml.etree.ElementTree as ET

article_file = "data/exampleResearchArticle.xml"


def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()


def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }

        # YOUR CODE HERE

        data['fnm'] = author.find('fnm').text
        data['snm'] = author.find('snm').text
        data['email'] = author.find('email').text
        for insr in author.findall('insr'): # or author.iter 
            data['insr'].append(insr.attrib.get('iid'))        
        authors.append(data)

    return authors

def test():
    solution = [{'insr': ['I1'], 'fnm': 'Omer', 'snm': 'Mei-Dan', 'email': 'omer@extremegate.com'},
                {'insr': ['I2'], 'fnm': 'Mike', 'snm': 'Carmont', 'email': 'mcarmont@hotmail.com'},
                {'insr': ['I3', 'I4'], 'fnm': 'Lior', 'snm': 'Laver', 'email': 'laver17@gmail.com'},
                {'insr': ['I3'], 'fnm': 'Meir', 'snm': 'Nyska', 'email': 'nyska@internet-zahav.net'},
                {'insr': ['I8'], 'fnm': 'Hagay', 'snm': 'Kammar', 'email': 'kammarh@gmail.com'},
                {'insr': ['I3', 'I5'], 'fnm': 'Gideon', 'snm': 'Mann', 'email': 'gideon.mann.md@gmail.com'},
                {'insr': ['I6'], 'fnm': 'Barnaby', 'snm': 'Clarck', 'email': 'barns.nz@gmail.com'},
                {'insr': ['I7'], 'fnm': 'Eugene', 'snm': 'Kots', 'email': 'eukots@gmail.com'}]

    root = get_root(article_file)
    data = get_authors(root)

    assert data[0] == solution[0]
    assert data[1]["insr"] == solution[1]["insr"]

get_authors(get_root(article_file))

[{'email': 'omer@extremegate.com',
  'fnm': 'Omer',
  'insr': ['I1'],
  'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com',
  'fnm': 'Mike',
  'insr': ['I2'],
  'snm': 'Carmont'},
 {'email': 'laver17@gmail.com',
  'fnm': 'Lior',
  'insr': ['I3', 'I4'],
  'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net',
  'fnm': 'Meir',
  'insr': ['I3'],
  'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com',
  'fnm': 'Hagay',
  'insr': ['I8'],
  'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com',
  'fnm': 'Gideon',
  'insr': ['I3', 'I5'],
  'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com',
  'fnm': 'Barnaby',
  'insr': ['I6'],
  'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'insr': ['I7'], 'snm': 'Kots'}]

## 2. Scraping data from a website

e.g. US Bureau of Transportation Statistics. Say we want to find data on the no. of flights Virgin America (carrier code VX) has into and out of Boston Airport.

* Use Inspect Element
* In some cases, it may be easier to record attributes or data manually than to write a script. E.g. list of carrier values in current example.

Data Wrangling Procedure for this example:
1. Build values we need to use to make HTTP request: list of carrier values and list of airport values
2. Make HTTP requests to download all the data (for easier bug fixing and so we're working with a fixed set of data)
3. Then parse the data files.
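
Building the request values can be sketched without touching the network. The helper name `build_form_data`, the short sample lists and the `CARRIER-AIRPORT.html` file naming are assumptions for illustration; the field names are the ones the form on this site submits.

```python
from itertools import product

def build_form_data(carriers, airports):
    """Yield one form-data dict per (carrier, airport) pair."""
    for carrier, airport in product(carriers, airports):
        yield {"CarrierList": carrier,
               "AirportList": airport,
               "Submit": "Submit"}

# Hypothetical short lists; the real ones come from the page's <option> values.
carriers = ["VX", "FL"]
airports = ["BOS", "ATL"]

forms = list(build_form_data(carriers, airports))
for form in forms:
    # The download step would POST `form` and save the response under this name.
    fname = "{}-{}.html".format(form["CarrierList"], form["AirportList"])
    print(fname)
```

Saving every response to disk first means the parsing code can be rerun and debugged against a fixed set of files.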

### 2.1 Parse HTML using BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

def options(soup, id):
    option_values = []
    # Returns the first descendant element whose id matches, e.g. id = CarrierList.
    carrier_list = soup.find(id=id)
    for option in carrier_list.find_all('option'):
        option_values.append(option['value'])
    return option_values

def print_list(label, codes):
    print("\n%s" % label)
    for c in codes:
        print(c)

def main():
    # Open website HTML doc
    # Soup gives you the top-level element
    with open("virgin_and_logan_airport.html") as html:
        soup = BeautifulSoup(html, "lxml")
    
    
    codes = options(soup, 'CarrierList')
    print_list("Carriers", codes)
    
    codes = options(soup, 'AirportList')
    print_list("Airports", codes)

### 2.2 Building HTTP requests: Understand how requests are submitted to website 
* Check out HTML
* Ask browser how exactly it's making requests: Open Network tab and submit request again. -> Form Data

From Form Data we have eight fields:
* \_\_EVENTTARGET
* \_\_EVENTARGUMENT
* \_\_VIEWSTATE: *mess*
* \_\_VIEWSTATEGENERATOR: 8E3A4798
* \_\_EVENTVALIDATION: *mess*
* CarrierList:VX
* AirportList:BOS
* Submit:Submit

Conclude that there are **hidden form elements**.
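
A minimal sketch of pulling out every hidden input at once. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the class name `HiddenInputParser` and the sample form values are invented.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")

# Hypothetical form; the real values are long opaque strings.
html = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <select name="CarrierList"><option value="VX">VX</option></select>
</form>
"""

parser = HiddenInputParser()
parser.feed(html)
print(parser.hidden)
```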

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Please note that the function 'make_request' is provided for your reference only.
# You will not be able to actually use it from within the Udacity web UI.
# Your task is to process the HTML using BeautifulSoup, extract the hidden
# form field values for "__EVENTVALIDATION" and "__VIEWSTATE" and set the appropriate
# values in the data dictionary.
# All your changes should be in the 'extract_data' function
from bs4 import BeautifulSoup
import requests
import json

html_page = "page_source.html"


def extract_data(page):
    data = {"eventvalidation": "",
            "viewstate": ""}
    with open(page, "r") as html:
        # Write code here to find the necessary values
        soup = BeautifulSoup(html, 'lxml')
        data["eventvalidation"] = soup.find(id = "__EVENTVALIDATION")['value']
        data["viewstate"] = soup.find(id = "__VIEWSTATE")['value']

    return data


def make_request(data):
    eventvalidation = data["eventvalidation"]
    viewstate = data["viewstate"]

    r = requests.post("http://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
                    data={'AirportList': "BOS", # We'd actually loop through our lists
                          'CarrierList': "VX",
                          'Submit': 'Submit',
                          "__EVENTTARGET": "",
                          "__EVENTARGUMENT": "",
                          "__EVENTVALIDATION": eventvalidation,
                          "__VIEWSTATE": viewstate
                    })
    # Use a context manager so the file is flushed and closed.
    with open("virgin_and_logan_airport.html", "w") as f:
        f.write(r.text)


def test():
    data = extract_data(html_page)
    assert data["eventvalidation"] != ""
    assert data["eventvalidation"].startswith("/wEWjAkCoIj1ng0")
    assert data["viewstate"].startswith("/wEPDwUKLTI")

extract_data(html_page)

Why are those values here?
-> Used to validate requests.
* For why \_\_VIEWSTATE exists, see the ASP.NET documentation.

Found broken request: instead of data we have a syntax error.

Best practices for scraping:
1. Look at how a browser makes requests via dev tools. Can also look at wire traffic if necessary.
2. Emulate in code
3. If stuff blows up, look at http traffic.
4. Return to (1) until it works.

Go back to network panel and scroll around.
* Cookie data -> Use Session object.

In [None]:
# Solution (from video)
def make_request(data):
    s = requests.Session()
    
    r = s.get("http://www.transtats.bts.gov/Data_Elements.aspx?Data=2")
    soup = BeautifulSoup(r.text, "lxml")
    viewstate = soup.find(id="__VIEWSTATE")["value"]
    eventvalidation = soup.find(id="__EVENTVALIDATION")["value"]
    
    r = s.post("http://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
                    data={'AirportList': "BOS", # We'd actually loop through our lists
                          'CarrierList': "VX",
                          'Submit': 'Submit',
                          "__EVENTTARGET": "",
                          "__EVENTARGUMENT": "",
                          "__EVENTVALIDATION": eventvalidation,
                          "__VIEWSTATE": viewstate
                    })
    # Use a context manager so the file is flushed and closed.
    with open("virgin_and_logan_airport.html", "w") as f:
        f.write(r.text)


## Problem Set
### 1, 2. Carrier and Airport Lists

In [27]:
def extract_carriers(page):
    data = []

    with open(page, "r") as html:
        # do something here to find the necessary values
        soup = BeautifulSoup(html, "lxml")
        carrier_list = soup.find(id="CarrierList")
        for carrier in carrier_list.find_all('option'):
            if 'All' not in carrier['value']:
                data.append(carrier['value'])
    return data

def extract_airports(page):
    data = []
    with open(page, "r") as html:
        # do something here to find the necessary values
        soup = BeautifulSoup(html, "lxml")
        airport_list = soup.find(id="AirportList")
        for airport in airport_list.find_all("option"):
            value = airport["value"]
            if "All" not in value:
                data.append(value)
    return data


### 3. Processing All Data

In [28]:
def process_file(f):
    """
    This function extracts data from the file given as the function argument in
    a list of dictionaries. This is example of the data structure you should
    return:

    data = [{"courier": "FL",
             "airport": "ATL",
             "year": 2012,
             "month": 12,
             "flights": {"domestic": 100,
                         "international": 100}
            },
            {"courier": "..."}
    ]


    Note - year, month, and the flight data should be integers.
    You should skip the rows that contain the TOTAL data for a year.
    """
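    data = []
    # The first six characters of the file name are expected to look like "FL-ATL".
    courier, airport = f[:6].split("-")
    # Note: create a new dictionary for each entry in the output data list.
    # Reusing a single info dictionary would make every element in the list
    # a reference to the same dictionary.
    # datadir (the directory holding the downloaded pages) must be defined elsewhere.
    with open("{}/{}".format(datadir, f), "r") as html:

        soup = BeautifulSoup(html, "lxml")
        # Assumption: each data row is a <tr class="dataTDRight"> whose cells are
        # year, month, domestic, international, total. Check against the page source.
        rows = soup.find_all("tr", class_="dataTDRight")
        for row in rows:
            cols = [td.text.strip() for td in row.find_all("td")]
            if cols[1] == "TOTAL":
                continue  # skip the yearly totals
            data.append({"courier": courier,
                         "airport": airport,
                         "year": int(cols[0]),
                         "month": int(cols[1]),
                         "flights": {"domestic": int(cols[2].replace(",", "")),
                                     "international": int(cols[3].replace(",", ""))}})
    return data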

html_page = "data/html_page.html"
process_file(html_page)

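ValueError: not enough values to unpack (expected 2, got 1)

The error comes from f[:6].split("-"). The function appears to expect a file name that starts with a carrier code, a hyphen and an airport code (something like FL-ATL.html), but the first six characters of "data/html_page.html" are "data/h", which contain no hyphen, so the split returns a single value.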
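
The comment in process_file about creating a new dictionary for each entry can be illustrated with a short sketch; the variable names and month values here are made up.

```python
# Reusing one dictionary: every list entry points at the same object.
shared = {}
aliased = []
for month in (1, 2, 3):
    shared["month"] = month
    aliased.append(shared)

# Creating a fresh dictionary per row keeps the entries independent.
independent = []
for month in (1, 2, 3):
    independent.append({"month": month})

print([d["month"] for d in aliased])      # [3, 3, 3]
print([d["month"] for d in independent])  # [1, 2, 3]
```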

### 4,5. Patent Database

Figure out the error in the datafile. 

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
This and the following exercise are using US Patent database. The patent.data
file is a small excerpt of much larger datafiles that are available for
download from US Patent website. These files are pretty large ( >100 MB each).
The original file is ~600MB large, you might not be able to open it in a text
editor.

The data itself is in XML, however there is a problem with how it's formatted.
Please run this script and observe the error. Then find the line that is
causing the error. You can do that by just looking at the datafile in the web
UI, or programmatically. For quiz purposes it does not matter, but as an
exercise we suggest that you try to do it programmatically.

NOTE: You do not need to correct the error - for now, just find where the error
is occurring.
"""

import xml.etree.ElementTree as ET

PATENTS = 'patent.data'

def get_root(fname):

    tree = ET.parse(fname)
    return tree.getroot()


get_root(PATENTS)

Is it problematic that lines 657-8 and lines 1-2 are the same? There was no error for lines 1-2. The error message refers to line 657, so perhaps the repeated <?xml ... ?> declaration is the issue? (Udacity said my answer was correct.)

Udacity's full answer given in Part 6 of the quiz:

'So, the problem is that the gigantic file is actually not a valid XML, because it has several root elements, and XML declarations.
It is, a matter of fact, a collection of a lot of concatenated XML documents.
So, one solution would be to split the file into separate documents, so that you can process the resulting files as valid XML documents.'
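
One possible way to do the split, as a sketch only: cut the text at every XML declaration and parse each piece separately. The function name `split_documents` and the tiny stand-in data are assumptions, and this version reads everything into memory, which may not suit the full ~600MB file.

```python
import xml.etree.ElementTree as ET

def split_documents(text, marker="<?xml"):
    """Yield each XML document from a string of concatenated documents."""
    start = text.find(marker)
    while start != -1:
        end = text.find(marker, start + len(marker))
        yield text[start:end] if end != -1 else text[start:]
        start = end

# Hypothetical stand-in for patent.data: two documents back to back.
combined = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<patent><id>1</id></patent>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<patent><id>2</id></patent>\n'
)

docs = list(split_documents(combined))
# Each piece now has exactly one declaration and one root element.
ids = [ET.fromstring(doc.encode("utf-8")).findtext("id") for doc in docs]
print(ids)
```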