# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
print(document_tree.__dict__)

{'_root': <Element 'mondial' at 0x10458ed90>}


In [5]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [6]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks the whole subtree of the country and replaces the deprecated getiterator()
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
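
The lookup methods used above differ in how deep they search, which is easiest to see on a small document. The sketch below uses a made-up XML string (not the Mondial data); the element names, the `car_code` attribute, and the `province` nesting are assumptions chosen for illustration. `find` and `findall` only look at direct children, while `iter` walks the whole subtree.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, for illustration only.
SAMPLE = """
<mondial>
  <country car_code="AA">
    <name>Aland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')  # first direct child named 'country'

# findall() only sees cities that are direct children of the country
direct_cities = [c.findtext('name') for c in country.findall('city')]

# iter() also reaches the city nested inside the province
all_cities = [c.findtext('name') for c in country.iter('city')]

print(country.get('car_code'))  # attributes are read with get() or .attrib
print(direct_cities)
print(all_cities)
```

If a count of cities looks too low, a switch from `findall` to `iter` is a reasonable first thing to try.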


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [7]:
import pandas as pd

In [8]:
document_tree = ET.parse( './data/mondial_database.xml' )

In [9]:
# collect the infant mortality rate of every country that reports one
namelist = []
ratelist = []
for element in document_tree.iterfind('country'):
    rate = element.findtext('infant_mortality')
    if rate is not None:
        namelist.append(element.findtext('name'))
        ratelist.append(float(rate))

# the 10 countries with the lowest rates
ratedf = pd.DataFrame({'country': namelist, 'infant_mortality': ratelist})
ratedf.sort_values(by = 'infant_mortality').head(10)

| | country | infant_mortality |
|---|---|---|
| 36 | Monaco | 1.81 |
| 90 | Japan | 2.13 |
| 109 | Bermuda | 2.48 |
| 34 | Norway | 2.48 |
| 98 | Singapore | 2.53 |
| 35 | Sweden | 2.6 |
| 8 | Czech Republic | 2.63 |
| 72 | Hong Kong | 2.73 |
| 73 | Macao | 3.13 |
| 39 | Iceland | 3.15 |


In [20]:
# for-clauses in a comprehension run left to right, so the loop over countries must come first;
# with the clauses reversed, a stale 'country' variable is reused and the same city element repeats
cities = [city for country in document_tree.findall('country') for city in country.findall('city')]
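
One possible way to approach exercise 2 is sketched below on a made-up XML string rather than the real file. It assumes that each city carries zero or more `population` subelements with a `year` attribute, and that the most recent year is the estimate to use; both are assumptions to check against the actual data. The helper name `latest_population` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, for illustration only.
SAMPLE = """
<mondial>
  <country car_code="AA">
    <name>Aland</name>
    <city>
      <name>Alpha</name>
      <population year="2001">1000</population>
      <population year="2011">1500</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="2011">4000</population>
      </city>
    </province>
  </country>
  <country car_code="BB">
    <name>Bland</name>
    <city><name>Gamma</name></city>
  </country>
</mondial>
"""

def latest_population(element):
    """Return the most recent population figure of element, or None if it has none."""
    estimates = element.findall('population')
    if not estimates:
        return None
    latest = max(estimates, key=lambda p: int(p.get('year', 0)))
    return int(latest.text)

root = ET.fromstring(SAMPLE)
rows = []
for country in root.iterfind('country'):
    # iter() also reaches cities nested below the country's direct children
    for city in country.iter('city'):
        population = latest_population(city)
        if population is not None:  # skip cities without any estimate
            rows.append((city.findtext('name'), country.findtext('name'), population))

# sort by population, largest first, and keep at most 10 rows
top_cities = sorted(rows, key=lambda row: row[2], reverse=True)[:10]
print(top_cities)
```

The `rows` list can also be passed to `pd.DataFrame` with explicit column names and sorted there.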

#### 4. Name and country of a) the longest river, b) the largest lake, and c) the airport at the highest elevation

Rivers, lakes, and airports are child elements of the root.

- Each has a country attribute holding one or more space-separated country codes.
- Each has a name subelement.
- Rivers have a length subelement, lakes an area subelement, and airports an elevation subelement.

In [55]:
lakes = document_tree.findall('lake')
rivers = document_tree.findall('river')
airports = document_tree.findall('airport')
# lakes without an area subelement get NaN, which sort_values places last
lake_info = [[lake.findtext('name'), lake.attrib['country'], float(lake.findtext('area', default = 'nan'))]
             for lake in lakes]
lake_df = (pd.DataFrame(lake_info, columns = ['lake_name', 'country', 'area'])
           .sort_values('area', ascending = False)
           .head(10))
lake_df

| | lake_name | country | area |
|---|---|---|---|
| 54 | Caspian Sea | R AZ KAZ IR TM | 386400.0 |
| 109 | Lake Superior | CDN USA | 82103.0 |
| 81 | Lake Victoria | EAT EAK EAU | 68870.0 |
| 106 | Lake Huron | CDN USA | 59600.0 |
| 108 | Lake Michigan | USA | 57800.0 |
| 47 | Dead Sea | IL JOR WEST | 41650.0 |
| 83 | Lake Tanganjika | ZRE Z BI EAT | 32893.0 |
| 98 | Great Bear Lake | CDN | 31792.0 |
| 43 | Ozero Baikal | R | 31492.0 |
| 89 | Lake Malawi | MW MOC EAT | 29600.0 |
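
The same lookup can be generalised to rivers and airports. The sketch below runs on a made-up XML string, not the real file, and follows the structure described above (country attribute, name subelement, one measurement subelement). The helper name `largest` is hypothetical, and entries whose measurement is missing or empty are skipped, which is an assumption about how gaps should be treated.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, for illustration only.
SAMPLE = """
<mondial>
  <river country="AA BB"><name>Long River</name><length>900</length></river>
  <river country="AA"><name>Short River</name><length>120</length></river>
  <river country="BB"><name>Unmeasured River</name></river>
  <lake country="AA"><name>Big Lake</name><area>50.5</area></lake>
  <airport country="BB"><name>High Field</name><elevation>2500</elevation></airport>
  <airport country="AA"><name>Low Field</name><elevation/></airport>
</mondial>
"""

def largest(root, tag, measure):
    """Return (name, country, value) for the tag element with the largest measure, or None."""
    best = None
    for element in root.findall(tag):
        text = element.findtext(measure)
        # findtext gives None for a missing subelement and '' for an empty one
        if text is None or not text.strip():
            continue
        value = float(text)
        if best is None or value > best[2]:
            best = (element.findtext('name'), element.get('country'), value)
    return best

root = ET.fromstring(SAMPLE)
longest_river = largest(root, 'river', 'length')
largest_lake = largest(root, 'lake', 'area')
highest_airport = largest(root, 'airport', 'elevation')

print(longest_river)
print(largest_lake)
print(highest_airport)
```

With the real data, `largest(document_tree.getroot(), 'river', 'length')` would be the corresponding call. The country value is a string of codes, so turning it into country names needs a further lookup.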
