# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
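
The difference between the traversal methods is easy to miss, so here is a minimal sketch on an inline document. The sample XML and its names (`Aland`, `Alpha`, `Beta`, `North`) are hypothetical and only loosely shaped like the Mondial data; the point is that `iterfind` with a bare tag name matches direct children only, while `iter` walks the whole subtree.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely shaped like the Mondial data.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <city><name>Alpha</name></city>
    <province>
      <name>North</name>
      <city><name>Beta</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')  # first direct child with this tag

# iterfind() with a bare tag name only matches direct children.
direct_cities = [c.find('name').text for c in country.iterfind('city')]

# iter() walks the whole subtree, so cities nested in provinces are included.
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

This is why a loop over cities inside a country needs a subtree traversal rather than `iterfind('city')` whenever cities can sit below an intermediate element.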

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [24]:
# print names of all countries
for child in document_tree.iterfind('country'):
    print child.find('name').text


Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [26]:
# print the name of each country followed by its cities
for element in document_tree.iterfind('country'):
    # iter() replaces getiterator(), which is deprecated since Python 2.7
    city_names = [city.find('name').text for city in element.iter('city')]
    print '* ' + element.find('name').text + ': ' + ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation
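
Item 3 needs one step that the examples above do not show: scaling each group's percentage by the country's latest population estimate before summing across countries. The sketch below illustrates that step on an inline sample. It assumes that each `country` carries `population` elements with a `year` attribute and `ethnicgroup` elements with a `percentage` attribute; those names should be checked against the actual file, and all values and names in the sample are made up.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample: element and attribute names are assumed, values are made up.
SAMPLE = """
<mondial>
  <country>
    <name>Aland</name>
    <population year="2001">900</population>
    <population year="2011">1000</population>
    <ethnicgroup percentage="60">GroupX</ethnicgroup>
    <ethnicgroup percentage="40">GroupY</ethnicgroup>
  </country>
  <country>
    <name>Bland</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">GroupX</ethnicgroup>
  </country>
</mondial>
"""

def ethnic_group_totals(root):
    """Sum (latest country population * group share) per ethnic group."""
    totals = defaultdict(float)
    for country in root.iter('country'):
        populations = country.findall('population')
        if not populations:
            continue  # no estimate to scale the percentages by
        # Pick the estimate with the greatest year rather than relying on order.
        latest = max(populations, key=lambda p: int(p.get('year', 0)))
        for group in country.findall('ethnicgroup'):
            share = float(group.get('percentage', 0)) / 100.0
            totals[group.text] += float(latest.text) * share
    return dict(totals)

totals = ethnic_group_totals(ET.fromstring(SAMPLE))
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)
```

Run against the real file, the same function would take `document.getroot()` as its argument, provided the assumed element and attribute names hold.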

In [25]:
document = ET.parse( './data/mondial_database.xml' )

In [52]:
# Number 1
# First attempt: filter on a hand-picked threshold (3.3). This is fragile, because the
# cutoff has to be guessed, and the results come back in document order, not sorted by rate.
for country in document.iterfind('country'):
    mortality = country.find('infant_mortality')
    if mortality is not None and float(mortality.text) < 3.3:
        print country.find('name').text + ": " + mortality.text

Czech Republic: 2.63
Norway: 2.48
Sweden: 2.6
Monaco: 1.81
Iceland: 3.15
Hong Kong: 2.73
Macao: 3.13
Japan: 2.13
Singapore: 2.53
Bermuda: 2.48


In [107]:
# Sorting the rates avoids guessing a threshold
import pandas as pd
names = []
rates = []
# iter() replaces getiterator(), which is deprecated since Python 2.7
for country in document.iter('country'):
    mortality = country.find('infant_mortality')
    if mortality is not None:
        names.append(country.find('name').text)
        rates.append(float(mortality.text))
df = pd.DataFrame({'country': names, 'mortality': rates})
df.set_index('country').mortality.sort_values().head(10)

country
Monaco            1.81
Japan             2.13
Bermuda           2.48
Norway            2.48
Singapore         2.53
Sweden            2.60
Czech Republic    2.63
Hong Kong         2.73
Macao             3.13
Iceland           3.15
Name: mortality, dtype: float64
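
The same ranking can be produced without pandas. The sketch below uses `heapq.nsmallest` from the standard library on a small inline sample. The country names and rates in the sample are hypothetical, and the helper `lowest_infant_mortality` is an illustrative name, not part of the notebook.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample with made-up rates; only the element names follow the notebook.
SAMPLE = """
<mondial>
  <country><name>Aland</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>Bland</name></country>
  <country><name>Cland</name><infant_mortality>2.1</infant_mortality></country>
  <country><name>Dland</name><infant_mortality>3.0</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return up to n (rate, country name) pairs, lowest rate first."""
    rates = []
    for country in root.iter('country'):
        rate = country.findtext('infant_mortality')
        if rate is not None:  # skip countries without a recorded rate
            rates.append((float(rate), country.findtext('name')))
    return heapq.nsmallest(n, rates)

lowest = lowest_infant_mortality(ET.fromstring(SAMPLE), n=2)
print(lowest)
```

Putting the rate first in each tuple lets the default tuple ordering do the sorting, with the country name acting as a tie-breaker.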

In [112]:
# Number 2
city = []
pop = []
# iter() replaces getiterator(), which is deprecated since Python 2.7
for element in document.iter('city'):
    populations = element.findall('population')
    if populations:
        city.append(element.find('name').text)
        # assumes the last <population> element is the most recent estimate
        pop.append(float(populations[-1].text))
df = pd.DataFrame({'city': city, 'population': pop})
df.set_index('city').population.sort_values(ascending=False).head(10)

city
Shanghai     22315474
Istanbul     13710512
Mumbai       12442373
Moskva       11979529
Beijing      11716620
São Paulo    11152344
Tianjin      11090314
Guangzhou    11071424
Delhi        11034555
Shenzhen     10358381
Name: population, dtype: float64
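
Taking the last `population` element works only if the estimates are stored in chronological order. A more defensive sketch selects the estimate by year instead. It assumes each `population` element has a `year` attribute, which should be verified against the data; the sample city and its figures are made up, and `latest_population` is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Hypothetical city: the `year` attribute is assumed, the values are made up.
SAMPLE = """
<city>
  <name>Alpha</name>
  <population year="2011">1200</population>
  <population year="1991">800</population>
</city>
"""

def latest_population(city):
    """Return the population with the greatest `year`, or None if absent."""
    estimates = city.findall('population')
    if not estimates:
        return None
    newest = max(estimates, key=lambda p: int(p.get('year', 0)))
    return int(newest.text)

city = ET.fromstring(SAMPLE)
by_year = latest_population(city)                             # selected by year
by_position = int(city.findall('population')[-1].text)        # selected by position
print(by_year)
print(by_position)
```

In this deliberately out-of-order sample the two selections disagree, which is the case the year-based lookup guards against.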