# Extracting Data from Complex Formats
After looking at some of the common data formats (*CSV, Excel, JSON*), we turn our attention to some more complex data formats, along with an introduction to **Screen Scraping**. The two complex formats covered are **XML** and **HTML**.
## Extracting Data from XML
XML data can be extracted with the *ElementTree* module in Python's standard library. In this example, Python reads the entire XML document into memory with **ET.parse**. Once we get the root element using our **get_root()** helper (which calls the tree's **getroot()** method), we can search the tree for tags using the **find** and **findall** methods. Both accept a path, so a search can start from any element in the tree. Note also the **attrib** attribute, a dictionary that holds an element's XML attribute values.

In [2]:
import xml.etree.ElementTree as ET

article_file = "exampleResearchArticle.xml"


def get_root(fname):
    tree = ET.parse(fname)
    return tree.getroot()


def get_authors(root):
    authors = []
    for author in root.findall('./fm/bibl/aug/au'):
        data = {
                "fnm": None,
                "snm": None,
                "email": None,
                "insr": []
        }

        # Copy the author's name and email from the child elements
        data["fnm"] = author.find('./fnm').text
        data["snm"] = author.find('./snm').text
        data["email"] = author.find('./email').text
        # Collect the institution ids referenced by the <insr> children
        for id_ in author.findall('./insr'):
            data["insr"].append(id_.attrib["iid"])
        authors.append(data)

    return authors

Below we show the author dictionaries that are created in the loop over the *findall* results.

In [3]:
import pprint
root = get_root(article_file)
data = get_authors(root)

pprint.pprint(data)


[{'email': 'omer@extremegate.com',
  'fnm': 'Omer',
  'insr': ['I1'],
  'snm': 'Mei-Dan'},
 {'email': 'mcarmont@hotmail.com',
  'fnm': 'Mike',
  'insr': ['I2'],
  'snm': 'Carmont'},
 {'email': 'laver17@gmail.com',
  'fnm': 'Lior',
  'insr': ['I3', 'I4'],
  'snm': 'Laver'},
 {'email': 'nyska@internet-zahav.net',
  'fnm': 'Meir',
  'insr': ['I3'],
  'snm': 'Nyska'},
 {'email': 'kammarh@gmail.com',
  'fnm': 'Hagay',
  'insr': ['I8'],
  'snm': 'Kammar'},
 {'email': 'gideon.mann.md@gmail.com',
  'fnm': 'Gideon',
  'insr': ['I3', 'I5'],
  'snm': 'Mann'},
 {'email': 'barns.nz@gmail.com',
  'fnm': 'Barnaby',
  'insr': ['I6'],
  'snm': 'Clarck'},
 {'email': 'eukots@gmail.com', 'fnm': 'Eugene', 'insr': ['I7'], 'snm': 'Kots'}]
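
The same **find**, **findall**, and **attrib** pattern can be tried without the article file. The sketch below is a minimal, self-contained variant that parses a small inline document. The sample XML, the `parse_authors` name, and the author values are made up for illustration; only the `fm/bibl/aug/au` layout follows the example above. It also swaps `find(...).text` for `findtext`, which returns a default instead of failing when a child element is missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document that mirrors the fm/bibl/aug/au layout.
SAMPLE_XML = """<art>
  <fm>
    <bibl>
      <aug>
        <au>
          <snm>Doe</snm>
          <fnm>Jane</fnm>
          <insr iid="I1"/>
          <insr iid="I2"/>
          <email>jane@example.com</email>
        </au>
        <au>
          <snm>Roe</snm>
          <fnm>Sam</fnm>
          <insr iid="I3"/>
        </au>
      </aug>
    </bibl>
  </fm>
</art>"""


def parse_authors(xml_text):
    """Return one dict per <au> element, tolerating missing children."""
    root = ET.fromstring(xml_text)  # parse from a string instead of a file
    authors = []
    for au in root.findall("./fm/bibl/aug/au"):
        authors.append({
            # findtext returns the default when the child is absent,
            # whereas find(...).text would raise AttributeError on None.
            "fnm": au.findtext("fnm", default=None),
            "snm": au.findtext("snm", default=None),
            "email": au.findtext("email", default=None),
            # attrib is a plain dict of the element's XML attributes.
            "insr": [insr.attrib["iid"] for insr in au.findall("insr")],
        })
    return authors


sample_authors = parse_authors(SAMPLE_XML)
```

Here the second author has no `<email>` element, so its `"email"` entry stays `None` rather than stopping the loop with an exception.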


## Extracting Data from HTML (Screen Scraping)
There are times when the data that we are interested in is not conveniently available in a structured format. When the data we want resides on a web page, we are forced to extract it from the *HTML* page. This type of data extraction is called **Screen Scraping**. During this process, we determine how the page's HTML performs its requests, and programmatically generate HTTP requests to obtain the data that we want. When trying to understand how the HTTP requests are made, you should follow these steps:
1. Look at how a browser makes requests
2. Emulate them in code
3. If things blow up, look at your HTTP traffic
4. Return to step 1 until it works

To see how a browser makes requests, you can use the **Inspect Element** feature in Google Chrome. To view the traffic, use the **Network** tab, which is available once the Inspect Element window is open.
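
As a rough illustration of the "emulate in code" step, the sketch below builds a form-style POST request with the standard library and prints what would be sent, so it can be compared against the request shown in the Network tab. The URL, the `Submit` field, and the `build_form_request` helper are assumptions made up for this example; a real page will have its own endpoint and fields. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, standing in for whatever the
# browser's Network tab shows for the page being scraped.
FORM_URL = "https://example.com/search"
form_fields = {
    "AirportList": "BOS",
    "Submit": "Submit",
}


def build_form_request(url, fields, user_agent="Mozilla/5.0"):
    """Build (but do not send) a POST request that mimics a form submit."""
    body = urlencode(fields).encode("ascii")  # key=value&key=value as bytes
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": user_agent,
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_form_request(FORM_URL, form_fields)

# Inspect the pieces that would go over the wire.
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))

# Sending it would look like:
#   from urllib.request import urlopen
#   html = urlopen(req).read()
```

If the server's response differs from what the browser gets, the usual suspects are missing form fields, cookies, or headers, which is where comparing against the captured traffic helps.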

In [4]:
from bs4 import BeautifulSoup
html_page = "options.html"


def extract_airports(page):
    data = []
    with open(page, "r") as html:
        # Name the parser explicitly so results do not depend on which
        # parser libraries happen to be installed
        soup = BeautifulSoup(html, "html.parser")
        entries = soup.find(id="AirportList")
        options = entries.find_all_next("option")
        # Skip placeholder and aggregate entries; keep real airport codes
        skip = ("selected", "AllMajors", "AllOthers", "All")
        for option in options:
            code = option["value"]
            if code not in skip:
                data.append(code)
    return data

In [5]:
data = extract_airports(html_page)
pprint.pprint(data)

['ATL',
 'BWI',
 'BOS',
 'CLT',
 'MDW',
 'ORD',
 'DFW',
 'DEN',
 'DTW',
 'FLL',
 'IAH',
 'LAS',
 'LAX',
 'ABR',
 'ABI']
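
For readers who want to try the option-filtering idea without `options.html` or third-party packages, here is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`. The sample HTML, the carrier and airport labels, and the `OptionCollector` class are invented for illustration; only the `AirportList` id and the aggregate values (`All`, `AllMajors`, `AllOthers`) come from the example above. Unlike `find_all_next`, this version only collects options that sit inside the named `<select>`.

```python
from html.parser import HTMLParser

# Hypothetical page fragment with two drop-down lists.
SAMPLE_HTML = """
<select id="CarrierList">
  <option value="All">All</option>
  <option value="AA">American</option>
</select>
<select id="AirportList">
  <option value="All">All</option>
  <option value="AllMajors">All Major Airports</option>
  <option value="ATL">Atlanta</option>
  <option value="BWI">Baltimore</option>
  <option value="AllOthers">All Other Airports</option>
  <option value="ABR">Aberdeen</option>
</select>
"""

# Values that describe groups of airports rather than a single airport.
AGGREGATES = {"All", "AllMajors", "AllOthers"}


class OptionCollector(HTMLParser):
    """Collect option values that appear inside one <select> element."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.inside = False  # True while we are within the target <select>
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.inside = attrs.get("id") == self.select_id
        elif tag == "option" and self.inside:
            value = attrs.get("value")
            if value and value not in AGGREGATES:
                self.values.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False


collector = OptionCollector("AirportList")
collector.feed(SAMPLE_HTML)
airports = collector.values
```

With this sample input, `airports` holds only the individual airport codes from the `AirportList` drop-down, and the `CarrierList` options are ignored.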
