# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure  
+ complete and submit the exercise
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
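
As a small illustration of the traversal methods described in that reference, the sketch below parses an inline XML string and contrasts `find`, `findall`, `iter` and `get`. The sample data is made up for this example and only loosely shaped like the Mondial file; it is not taken from it.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data (an assumption, not an excerpt of the Mondial file)
SAMPLE = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# find() returns the first matching direct child (or None if there is none)
first_country = root.find('country')
first_name = first_country.find('name').text

# findall() returns every matching direct child
direct_cities = [c.find('name').text for c in first_country.findall('city')]

# iter() walks the whole subtree, so the city nested in <province> is included
all_cities = [c.find('name').text for c in first_country.iter('city')]

# attributes are read with get()
codes = [c.get('car_code') for c in root.iter('country')]

print(first_name)
print(direct_cities)
print(all_cities)
print(codes)
```

The difference between `findall` and `iter` matters whenever elements of interest can sit at more than one depth, as the nested city does in this sample.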

In [2]:
document_tree = ET.parse('data/mondial-database-less.xml')

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() walks the whole subtree, so cities nested at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial-database.xml', the examples above, and the reference at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation
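
One way to read "best/latest estimate" in item 3 is to pick the `<population>` entry with the most recent year rather than relying on document order. The sketch below assumes each `<population>` element carries a `year` attribute; both that assumption and the helper name `latest_population` are illustrative, and the sample country is made up.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; the 'year' attribute on <population> is an assumption
SAMPLE = """
<country>
  <name>Exampleland</name>
  <population year="2011">1200</population>
  <population year="1991">900</population>
  <population year="2001">1000</population>
  <ethnicgroup percentage="75">A</ethnicgroup>
  <ethnicgroup percentage="25">B</ethnicgroup>
</country>
"""


def latest_population(element):
    """Return the population with the highest 'year' attribute, or None."""
    entries = element.findall('population')
    if not entries:
        return None
    # entries without a year are treated as the oldest
    latest = max(entries, key=lambda p: int(p.get('year', 0)))
    return int(latest.text)


country = ET.fromstring(SAMPLE)
total = latest_population(country)

# scale the country total by each group's percentage share
group_sizes = {
    g.text: int(total * float(g.get('percentage')) / 100)
    for g in country.findall('ethnicgroup')
}

print(total)
print(group_sizes)
```

Selecting by year does not depend on the entries being listed chronologically, whereas taking the last element in document order assumes that they are.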

In [5]:
import pandas as pd
document = ET.parse('data/mondial-database.xml')

In [6]:
# Problem 1
mortality = {}

for country in document.iter('country'):
    rate = country.find('infant_mortality')

    # skip countries that report no infant mortality figure
    if rate is not None:
        mortality[country.find('name').text] = float(rate.text)

# ascending sort puts the lowest rates first
pd.Series(mortality).sort_values().head(10)

Monaco            1.81
Japan             2.13
Norway            2.48
Bermuda           2.48
Singapore         2.53
Sweden            2.60
Czech Republic    2.63
Hong Kong         2.73
Macao             3.13
Iceland           3.15
dtype: float64

In [7]:
# Problem 2
population = {}

for country in document.iter('country'):
    for city in country.iter('city'):
        pop = city.findall('population')

        # skip cities without population data; use the last listed figure,
        # which is assumed to be the most recent
        if pop:
            key = city.find('name').text + ', ' + country.find('name').text
            population[key] = int(pop[-1].text)

pd.Series(population).sort_values(ascending=False).head(10)

Shanghai, China      22315474
Istanbul, Turkey     13710512
Mumbai, India        12442373
Moskva, Russia       11979529
Beijing, China       11716620
São Paulo, Brazil    11152344
Tianjin, China       11090314
Guangzhou, China     11071424
Delhi, India         11034555
Shenzhen, China      10358381
dtype: int64

In [8]:
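# Problem 3
from collections import defaultdict

population = defaultdict(int)

for country in document.iter('country'):
    for group in country.iter('ethnicgroup'):
        # last listed country population, assumed to be the most recent estimate
        country_pop = int(country.findall('population')[-1].text)
        percentage = float(group.get('percentage'))

        # group size = country population scaled by the group's percentage share
        population[group.text] += int(country_pop * percentage / 100)

pd.Series(population).sort_values(ascending=False).head(10)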

Han Chinese    1245058800
Indo-Aryan      871815583
European        494872201
African         318325104
Dravidian       302713744
Mestizo         157734349
Bengali         146776916
Russian         131856989
Japanese        126534212
Malay           121993548
dtype: int64

In [9]:
# Define a helper that extracts one numeric feature per element
def extract(document, feature, tag, data_type=float):
    """Return name, country and `tag` for every `feature` element, largest first."""
    features = []

    for element in document.iter(feature):
        field = element.find(tag)

        # skip elements where the tag is missing or empty
        if field is not None and field.text is not None:
            # the 'country' attribute holds space-separated country codes
            features.append({'name': element.find('name').text,
                             'country': element.get('country'),
                             tag: data_type(field.text)})

    df = pd.DataFrame(features).sort_values(by=tag, ascending=False)
    return df[['name', 'country', tag]]

In [10]:
# Problem 4a
extract(document, 'river', 'length').head(1)

         name   country  length
174  Amazonas  CO BR PE  6448.0


In [11]:
# Problem 4b
extract(document, 'lake', 'area').head(1)

           name         country      area
54  Caspian Sea  R AZ KAZ IR TM  386400.0


In [12]:
# Problem 4c
extract(document, 'airport', 'elevation', int).head(1)

            name country  elevation
80  El Alto Intl     BOL       4063
