# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() walks the whole subtree, so cities at any depth are included
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella
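
The two cells above differ in how far they descend into the tree. Looping over an element, or calling `find`/`findall`/`iterfind` with a bare tag name, looks only at direct children, while `iter` visits every descendant. The sketch below illustrates the difference on a made-up inline snippet rather than the Mondial file; the `sample` document, its names and its province nesting are assumptions for illustration only. It is written to run on both Python 2.7 and Python 3.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; not taken from the Mondial data.
sample = """
<mondial>
  <country>
    <name>Atlantis</name>
    <city><name>Poseidonia</name></city>
    <province>
      <name>North</name>
      <city><name>Coral Bay</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(sample)
country = root.find('country')

# findall with a bare tag matches direct children only
direct_cities = [c.findtext('name') for c in country.findall('city')]

# iter walks the whole subtree, so the city inside the province is included
all_cities = [c.findtext('name') for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

Which of the two is appropriate depends on whether nested elements should count; for a per-country city list, `iter` is usually the safer choice.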


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [5]:
document = ET.parse( './data/mondial_database.xml' )

In [6]:
# create empty lists to store country and infant mortality
getcountry = []
getmortality = []

# get the root element to loop through
getroot = document.getroot()

for country in getroot.iter('country'):
    name = country.findtext('name')
    mortality = country.findtext('infant_mortality')
    # skip countries that lack a name or an infant mortality value
    if name is not None and mortality is not None:
        getcountry.append(name)
        getmortality.append(float(mortality))

mortdata = {'country':getcountry, 'infant mortality': getmortality}

In [7]:
import pandas as pd

# Now sort values in the dataframe by lowest infant mortality rate
df1 = pd.DataFrame(mortdata).sort_values(by='infant mortality').head(10)
print "The 10 countries with the lowest infant mortality rate are the following: "
print df1

The 10 countries with the lowest infant mortality rate are the following: 
            country  infant mortality
36           Monaco              1.81
90            Japan              2.13
109         Bermuda              2.48
34           Norway              2.48
98        Singapore              2.53
35           Sweden              2.60
8    Czech Republic              2.63
72        Hong Kong              2.73
73            Macao              3.13
39          Iceland              3.15
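
The same ranking can be produced without pandas. The sketch below is one possible variant, not the notebook's method: it collects `(rate, name)` pairs and lets `heapq.nsmallest` from the standard library pick the lowest values. The inline `sample` document and its values are invented for illustration.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical data; country "B" deliberately has no infant_mortality element.
sample = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.1</infant_mortality></country>
  <country><name>D</name><infant_mortality>3.0</infant_mortality></country>
</mondial>
"""
root = ET.fromstring(sample)

rates = []
for country in root.iter('country'):
    # findtext returns None when the child element is missing
    name = country.findtext('name')
    rate = country.findtext('infant_mortality')
    if name is not None and rate is not None:
        rates.append((float(rate), name))

# tuples compare by their first item, so this ranks by rate
lowest = heapq.nsmallest(2, rates)
print(lowest)
```

On the full file, the `2` would become `10`; pandas remains the more convenient choice when the result should be displayed as a table.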


In [8]:
popcity = []
largestpop = []
# 'country/city' matches only cities that are direct children of a country
for city in getroot.findall('country/city'):
    # make sure the city has a name
    if city.findtext('name') is not None:
        for population in city.findall('population'):
            # keep only the figures recorded for 2011
            if population.get('year') == '2011':
                largestpop.append(int(population.text))
                popcity.append(city.findtext('name'))
data = {'city' : popcity, 'population' : largestpop}
df2 = pd.DataFrame(data).sort_values('population', ascending=False).head(10)
print "The 10 cities with the largest population are the following: "
print df2

The 10 cities with the largest population are the following: 
          city  population
8      Beograd     1639121
56  Montevideo     1318755
23       Sofia     1270284
39     Yerevan     1060138
42   Kathmandu     1003285
18      Zagreb      686568
52    Kingston      662426
14        Rīga      658640
15     Vilnius      535631
35      Dublin      525383
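
The cell above considers only cities that sit directly under a country and have a 2011 figure. A broader reading of the exercise would take every city, wherever it is nested, and use its most recent population entry. The sketch below shows that idea on an invented inline document; the `sample` data, and the assumption that each `population` element carries a `year` attribute, are for illustration and should be checked against the real file.

```python
from xml.etree import ElementTree as ET

# Hypothetical data: one city with two census years, one nested in a
# province, and one with no population recorded at all.
sample = """
<mondial>
  <country>
    <name>Atlantis</name>
    <city>
      <name>Poseidonia</name>
      <population year="2001">900</population>
      <population year="2011">1200</population>
    </city>
    <province>
      <city>
        <name>Coral Bay</name>
        <population year="1995">5000</population>
      </city>
    </province>
    <city><name>Ghost Town</name></city>
  </country>
</mondial>
"""
root = ET.fromstring(sample)

latest = []
for city in root.iter('city'):  # iter reaches nested cities too
    pops = city.findall('population')
    if not pops:
        continue  # no population recorded for this city
    # pick the entry with the highest year rather than one fixed year
    newest = max(pops, key=lambda p: int(p.get('year', 0)))
    latest.append((city.findtext('name'), int(newest.text)))

ranked = sorted(latest, key=lambda pair: pair[1], reverse=True)
print(ranked)
```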


In [9]:
ethnicitygroup = []
for country in getroot.findall('country'):
    # every <population> entry of the country (all recorded years) is paired
    # with each of its ethnic groups, so the sums below span all of those years
    for population in reversed(country.findall('population')):
        for ethnicity in country.findall('ethnicgroup'):
            ethnicitygroup.append((int(population.text), float(ethnicity.attrib['percentage']), ethnicity.text))
df3 = pd.DataFrame(ethnicitygroup, columns=['population', 'percentage', 'ethnicity'])
df3['popethnic'] = (df3.population * df3.percentage)/100
df3 = df3.groupby('ethnicity').sum().sort_values(by='popethnic', ascending=False).head(10)
print "The 10 ethnic groups with the largest overall population are the following: "
print df3

The 10 ethnic groups with the largest overall population are the following: 
             population  percentage     popethnic
ethnicity                                        
Han Chinese  8101335630      732.00  7.412722e+09
Indo-Aryan   6454113551      864.00  4.646962e+09
European     6893373256    10371.80  3.287846e+09
Dravidian    6454113551      300.00  1.613528e+09
African      5712127620    13912.75  1.600036e+09
Russian      2667830828     2094.40  1.213847e+09
Japanese     1150856873      994.00  1.143952e+09
Bengali       817479427     1176.00  8.011298e+08
Mestizo      1421167299     7369.60  7.791067e+08
Malay        2168394594     3444.10  7.751078e+08
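
Exercise 3 asks for the best/latest estimate per country. One way to read that, sketched here on invented data, is to use only the population entry with the highest `year` for each country, multiply it by each group's percentage, and sum per group. The `sample` document and its group names are hypothetical, and no output for the real Mondial file is implied.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical data; not taken from the Mondial file.
sample = """
<mondial>
  <country>
    <name>Atlantis</name>
    <population year="2000">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="75">Mer</ethnicgroup>
    <ethnicgroup percentage="25">Land</ethnicgroup>
  </country>
  <country>
    <name>Lemuria</name>
    <population year="2005">400</population>
    <ethnicgroup percentage="50">Mer</ethnicgroup>
  </country>
</mondial>
"""
root = ET.fromstring(sample)

totals = defaultdict(float)
for country in root.findall('country'):
    pops = country.findall('population')
    if not pops:
        continue  # nothing to scale the percentages by
    # use a single figure per country: the most recent one
    latest = max(pops, key=lambda p: int(p.get('year', 0)))
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage', 0)) / 100
        totals[group.text] += int(latest.text) * share

ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Because each country contributes exactly one population figure, a group's total cannot exceed the combined latest population of the countries it appears in.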
