# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [116]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [117]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [118]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [119]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() replaces getiterator(), which was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print(', '.join(city_names))

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
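
The example above relies on `iter('city')` rather than `findall('city')`. A minimal sketch of the difference, using a made-up miniature document (the `SAMPLE_XML` string and its names are hypothetical, not taken from the Mondial data): `findall` with a bare tag name matches direct children only, while `iter` walks the whole subtree.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, not part of the Mondial data
SAMPLE_XML = """
<mondial>
  <country>
    <name>Exampleland</name>
    <province>
      <name>North</name>
      <city><name>Alpha</name></city>
    </province>
    <city><name>Beta</name></city>
  </country>
</mondial>
"""

sample_root = ET.fromstring(SAMPLE_XML)
sample_country = sample_root.find('country')

# findall() with a bare tag name matches direct children only
direct_cities = [c.find('name').text for c in sample_country.findall('city')]

# iter() walks the whole subtree in document order
all_cities = [c.find('name').text for c in sample_country.iter('city')]

print(direct_cities)  # ['Beta']
print(all_cities)     # ['Alpha', 'Beta']
```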


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [120]:
import pandas as pd
import numpy as np
document = ET.parse( './data/mondial_database.xml' )
root = document.getroot()

### Exercise 1: Find 10 countries with the lowest infant mortality rates

In [121]:
# Initialize DataFrame
dafr = pd.DataFrame(columns = ["country","infant_mortality"])
for country in document.findall( 'country' ):
    # 'find' returns the first 'name' tag, which is the country name
    country_name = country.find('name').text
    # Use NaN when a country has no 'infant_mortality' tag, so that the
    # previous country's value is not silently reused
    node = country.find('infant_mortality')
    infant_mortality = float(node.text) if node is not None else np.nan
    dafr.loc[len(dafr)] = [country_name, infant_mortality]

# Sort resulting dataframe (NaN rows sort last)
dafr.sort_values(by='infant_mortality').head(10)

Unnamed: 0,country,infant_mortality
38,Monaco,1.81
98,Japan,2.13
36,Norway,2.48
117,Bermuda,2.48
106,Singapore,2.53
37,Sweden,2.6
10,Czech Republic,2.63
78,Hong Kong,2.73
79,Macao,3.13
44,Iceland,3.15
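
Not every country is guaranteed to carry an `infant_mortality` tag, so the lookup needs a rule for missing values. The sketch below shows one possible rule (skip such countries) using only the standard library. The `SAMPLE_XML` data and the helper name `lowest_infant_mortality` are hypothetical and stand in for the real file.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; country B has no infant_mortality tag
SAMPLE_XML = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>2.1</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n=10):
    """Return up to n (country, rate) pairs, lowest rate first."""
    rates = []
    for country in root.findall('country'):
        text = country.findtext('infant_mortality')
        if text is None:
            continue  # skip countries without a recorded rate
        rates.append((country.findtext('name'), float(text)))
    return sorted(rates, key=lambda pair: pair[1])[:n]

lowest = lowest_infant_mortality(ET.fromstring(SAMPLE_XML), n=2)
print(lowest)  # [('C', 2.1), ('A', 4.5)]
```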


### Exercise 2: 10 cities with the largest population

In [122]:
# Initialize DataFrame
df = pd.DataFrame(columns = ["city","population"])
for country in document.findall('country'):
    # iter() searches the whole subtree, so nested cities are included
    for city in country.iter('city'):
        city_name = city.find('name').text
        # Take the last 'population' entry, assumed to be the latest estimate
        pop_list = city.findall('population')
        if len(pop_list) >= 1:
            city_pop = int(pop_list[-1].text)
        else:
            city_pop = 0

        df.loc[len(df)] = [city_name, city_pop]

df.sort_values(by= 'population', ascending= False).head(10)



Unnamed: 0,city,population
1341,Shanghai,22315474.0
771,Istanbul,13710512.0
1527,Mumbai,12442373.0
479,Moskva,11979529.0
1340,Beijing,11716620.0
2810,São Paulo,11152344.0
1342,Tianjin,11090314.0
1064,Guangzhou,11071424.0
1582,Delhi,11034555.0
1067,Shenzhen,10358381.0
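
The cell above treats the last `population` element as the latest estimate, which only holds if the entries are stored in chronological order. A hedged alternative is to pick the entry with the highest `year` attribute. This sketch assumes each `population` element carries a `year` attribute; the sample city and the helper `latest_population` are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical city whose population entries are not in chronological order
SAMPLE_XML = """
<city>
  <name>Sampleville</name>
  <population year="2011">120</population>
  <population year="1991">80</population>
  <population year="2001">100</population>
</city>
"""

def latest_population(element):
    """Return the population with the highest 'year' attribute, or None."""
    entries = element.findall('population')
    if not entries:
        return None
    # Entries without a 'year' attribute are treated as year 0 (an assumption)
    latest = max(entries, key=lambda p: int(p.get('year', 0)))
    return int(latest.text)

sample_city = ET.fromstring(SAMPLE_XML)
print(latest_population(sample_city))  # 120
```

On this sample, taking the last element by position would give 100 instead of 120.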


### Exercise 3: 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [123]:
# Initialize DataFrame
frame = pd.DataFrame(columns = ["country","population","ethnicgroup","percentage","egcalc"])

for country in document.findall('country'):
    country_name = country.find('name').text
    # Take the last 'population' entry, assumed to be the latest estimate
    pop_list = country.findall('population')
    if len(pop_list) >= 1:
        country_pop = int(pop_list[-1].text)
    else:
        country_pop = 0

    for group in country.findall('ethnicgroup'):
        eg = group.text
        # Convert the percentage attribute to a fraction of the population
        perc = float(group.attrib['percentage']) / 100
        egcalc = country_pop * perc
        frame.loc[len(frame)] = [country_name, country_pop, eg, perc, egcalc]

# Only 'egcalc' is meaningful after summing; 'population' and 'percentage'
# are plain column sums over the countries where the group appears
fr2 = frame.groupby('ethnicgroup').sum().sort_values(by = 'egcalc', ascending = False)
fr2.head(10)

ethnicgroup,population,percentage,egcalc
Han Chinese,1360720000.0,0.915,1245059000.0
Indo-Aryan,1210855000.0,0.72,871815600.0
European,1157296000.0,9.7082,494872200.0
African,975352700.0,18.6855,318325100.0
Dravidian,1210855000.0,0.25,302713700.0
Mestizo,279744000.0,8.707,157734400.0
Bengali,149772400.0,0.98,146776900.0
Russian,322438400.0,2.241,131857000.0
Japanese,127298000.0,0.994,126534200.0
Malay,377500300.0,2.423,121993600.0
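
Only the `egcalc` column carries the quantity the exercise asks for. As a sketch of the same aggregation without the extra columns, the totals can be accumulated directly in a dictionary. The sample data below is hypothetical, and the last `population` entry is again assumed to be the latest.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data: group X appears in two countries
SAMPLE_XML = """
<mondial>
  <country>
    <name>A</name>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">X</ethnicgroup>
    <ethnicgroup percentage="40">Y</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">X</ethnicgroup>
  </country>
</mondial>
"""

totals = defaultdict(float)
for country in ET.fromstring(SAMPLE_XML).findall('country'):
    populations = country.findall('population')
    if not populations:
        continue  # no population figure, nothing to apportion
    country_pop = int(populations[-1].text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100
        totals[group.text] += country_pop * share

# Largest groups first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked[:10])
```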


### Exercise 4: name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [124]:
# Part c): airport at highest elevation
air_elev = 0
country = ''
name = ''

for airport in document.findall('airport'):
    curr_name = airport.find('name').text
    countrycode = airport.get('country')
    for node in airport:
        if node.tag == 'elevation':
            # Treat an empty 'elevation' tag as 0
            air_test = float(node.text) if node.text is not None else 0
            if air_test > air_elev:
                air_elev = air_test
                country = countrycode
                name = curr_name
print("The highest airport is " + name +
      " located in " + country + " at an elevation of " + str(air_elev))


The highest airport is El Alto Intl located in BOL at an elevation of 4063.0
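
Parts a) and b) could follow the same pattern as the airport search. The sketch below assumes that `river` elements hold a `length` child, that `lake` elements hold an `area` child, and that both carry a `country` attribute with one or more country codes. These tag names are assumptions about the Mondial schema, and the sample data and the helper `largest_feature` are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; tag and attribute names are assumed
SAMPLE_XML = """
<mondial>
  <river country="AA BB"><name>Long River</name><length>900</length></river>
  <river country="AA"><name>Short River</name><length>120</length></river>
  <river country="BB"><name>Unmeasured River</name></river>
  <lake country="BB"><name>Big Lake</name><area>5000</area></lake>
  <lake country="AA"><name>Small Lake</name><area>40</area></lake>
</mondial>
"""

def largest_feature(root, tag, measure):
    """Return (name, country codes, value) for the element with the largest measure."""
    best = None
    for element in root.findall(tag):
        text = element.findtext(measure)
        if not text:
            continue  # measure missing or empty
        value = float(text)
        if best is None or value > best[2]:
            best = (element.findtext('name'), element.get('country'), value)
    return best

sample_root = ET.fromstring(SAMPLE_XML)
print(largest_feature(sample_root, 'river', 'length'))  # ('Long River', 'AA BB', 900.0)
print(largest_feature(sample_root, 'lake', 'area'))     # ('Big Lake', 'BB', 5000.0)
```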
