# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [27]:
from xml.etree import ElementTree as ET
import pandas as pd

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
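
As a minimal sketch of the lookup methods used in the cells below, the following snippet builds a tiny in-memory tree and contrasts `find`, `findall`, and `iter`. The sample XML is made up for illustration and only loosely shaped like the Mondial file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data, not taken from the Mondial file.
SAMPLE = """
<mondial>
  <country>
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <city><name>Elbasan</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')          # first matching direct child, or None

# findall() only looks at direct children of the element.
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree, so nested cities are found too.
all_cities = [c.find('name').text for c in country.iter('city')]

# find() returns None when nothing matches.
missing = country.find('no_such_tag')
```

Here `direct_cities` holds only the city directly under the country, while `all_cities` also picks up the one nested inside the province.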

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() walks the whole subtree, so cities nested at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation
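
Task 4 comes down to finding, for a given kind of element, the one whose numeric child value is largest. A minimal sketch of that idea follows, on made-up sample data. The helper name `largest_by` is hypothetical, and the tag names (`river`/`length`, `lake`/`area`, `airport`/`elevation`) and the `country` attribute are assumptions about the file layout that should be checked against the real data. The `country` attribute is assumed to hold country codes, which would still need to be mapped to names.

```python
from xml.etree import ElementTree as ET

# Made-up sample; tag and attribute names are assumptions.
SAMPLE = """
<mondial>
  <river country="A"><name>R1</name><length>120</length></river>
  <river country="B C"><name>R2</name><length>950.5</length></river>
  <river country="A"><name>R3</name></river>
  <airport country="D"><name>P1</name><elevation>4000</elevation></airport>
  <airport country="E"><name>P2</name><elevation/></airport>
</mondial>
"""

def largest_by(root, tag, measure):
    """Return (name, country attribute, value) for the `tag` element whose
    `measure` child holds the largest number, or None if none qualifies."""
    best = None
    for element in root.iter(tag):
        value = element.findtext(measure)
        if not value:                   # measurement missing or empty
            continue
        candidate = (float(value), element.findtext('name'), element.get('country'))
        if best is None or candidate[0] > best[0]:
            best = candidate
    if best is None:
        return None
    return best[1], best[2], best[0]

root = ET.fromstring(SAMPLE)
longest_river = largest_by(root, 'river', 'length')
highest_airport = largest_by(root, 'airport', 'elevation')
largest_lake = largest_by(root, 'lake', 'area')    # no lakes in the sample
```

Elements without a usable measurement are skipped rather than treated as zero, so a missing value cannot win or distort the comparison.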

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [40]:
# Task 1: 10 countries with the lowest infant mortality rates
getroot = document.getroot()

countries = []
mortality = []

for i in getroot:
    rate = i.find('infant_mortality')
    # compare with None explicitly: an element without children is falsy
    if rate is not None:
        countries.append(i.find('name').text)
        mortality.append(float(rate.text))

data = {'Country': countries, 'Mortality': mortality}
pd.DataFrame(data).sort_values('Mortality', ascending=True).head(10)

            Country  Mortality
36           Monaco       1.81
90            Japan       2.13
109         Bermuda       2.48
34           Norway       2.48
98        Singapore       2.53
35           Sweden       2.60
8    Czech Republic       2.63
72        Hong Kong       2.73
73            Macao       3.13
39          Iceland       3.15
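
The same ranking can also be sketched without pandas. The version below is an illustration on made-up sample values, not the notebook's method: it restricts the loop to `country` elements, looks up each `infant_mortality` child once, and uses `heapq.nsmallest` from the standard library. The helper name `lowest_infant_mortality` is hypothetical.

```python
import heapq
from xml.etree import ElementTree as ET

# Made-up sample values for illustration only.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.1</infant_mortality></country>
  <country><name>D</name><infant_mortality>9.0</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return up to n (rate, country name) pairs, lowest rate first."""
    rows = []
    for country in root.iter('country'):
        rate = country.find('infant_mortality')
        if rate is not None:            # skip countries without a value
            rows.append((float(rate.text), country.find('name').text))
    return heapq.nsmallest(n, rows)

result = lowest_infant_mortality(ET.fromstring(SAMPLE), n=2)
```

Country `B` has no recorded rate and is left out instead of being counted as zero.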


In [41]:
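getroot = document.getroot()

countries = []
population = []

# NOTE: find('population') returns the first <population> child of each
# top-level element, so this ranks countries (not cities) by that first figure
for i in getroot:
    first_population = i.find('population')
    if first_population is not None:
        countries.append(i.find('name').text)
        population.append(float(first_population.text))

data = {'Country': countries, 'Population': population}
pd.DataFrame(data).sort_values('Population', ascending=False).head(10)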


            Country   Population
55            China  543776080.0
67            India  238396327.0
120   United States  157813040.0
23           Russia  102798657.0
98            Japan   82199470.0
88        Indonesia   72592192.0
11          Germany   68230796.0
176          Brazil   53974725.0
53   United Kingdom   50616012.0
7            France   40502513.0
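
Task 2 asks for cities, while the table above lists countries. A minimal sketch of a city-level ranking follows, using made-up sample data. It assumes that each `city` element may carry several `population` children distinguished by a `year` attribute, which should be verified against the real file. The helper name `largest_cities` is hypothetical.

```python
from xml.etree import ElementTree as ET

# Made-up sample; the `year` attribute on <population> is an assumption
# about the data layout.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Alpha</name>
      <population year="1990">100</population>
      <population year="2010">300</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="2011">200</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

def largest_cities(root, n=10):
    """Return up to n (population, city, country) tuples, largest first."""
    rows = []
    for country in root.iter('country'):
        country_name = country.find('name').text
        for city in country.iter('city'):
            counts = city.findall('population')
            if not counts:              # city without any population figure
                continue
            # Pick the entry with the most recent year.
            latest = max(counts, key=lambda p: int(p.get('year', 0)))
            rows.append((int(latest.text), city.find('name').text, country_name))
    return sorted(rows, reverse=True)[:n]

top = largest_cities(ET.fromstring(SAMPLE), n=2)
```

Using `country.iter('city')` rather than `findall('city')` also picks up cities nested inside provinces.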


In [51]:
# NOTE: every <population> entry of a country is added in full to each of its
# ethnic groups, so the sums below are not weighted by group share
ethnicitygroup = []

for country in getroot.findall('country'):
    for population in country.findall('population'):
        for ethnicity in country.findall('ethnicgroup'):
            ethnicitygroup.append((int(population.text), ethnicity.text))

df3 = pd.DataFrame(ethnicitygroup, columns=['population', 'ethnicity'])
df3 = df3.groupby('ethnicity').sum().sort_values(by='population', ascending=False).head(10)

print df3

             population
ethnicity              
Han Chinese  8101335630
European     6893373256
Mongol       6467465051
Dravidian    6454113551
Indo-Aryan   6454113551
African      5712127620
Amerindian   3604638777
Asian        2715445790
Russian      2667830828
Ukrainian    2595397472
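
The totals printed above are larger than any plausible group size (the top entry exceeds 8 billion), because the loop adds every recorded population figure of a country in full to each of its groups. One possible correction, sketched here on made-up data, takes only the latest population figure per country and weights it by each group's share. The `year` attribute on `population` and the `percentage` attribute on `ethnicgroup` are assumptions about the file layout, and the helper name `ethnic_group_totals` is hypothetical.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Made-up sample; the `year` and `percentage` attributes are assumptions.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <population year="2000">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="75">Red</ethnicgroup>
    <ethnicgroup percentage="25">Blue</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2011">400</population>
    <ethnicgroup percentage="50">Blue</ethnicgroup>
  </country>
</mondial>
"""

def ethnic_group_totals(root):
    """Return (group, estimated population) pairs, largest first."""
    totals = defaultdict(float)
    for country in root.iter('country'):
        counts = country.findall('population')
        if not counts:
            continue
        # Use only the most recent population figure of the country.
        latest = max(counts, key=lambda p: int(p.get('year', 0)))
        population = float(latest.text)
        for group in country.findall('ethnicgroup'):
            share = float(group.get('percentage', 0)) / 100.0
            totals[group.text] += population * share
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = ethnic_group_totals(ET.fromstring(SAMPLE))
```

In the sample, `Red` gets 75% of 2000, and `Blue` gets 25% of 2000 plus 50% of 400.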
