# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
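
As a quick orientation before working with the real file, the sketch below shows the main traversal calls on a tiny inline document. The sample XML is hypothetical and only shaped like the Mondial data; `SAMPLE` and the variable names are illustrative.

```python
from xml.etree import ElementTree as ET

# Hypothetical two-country sample, shaped like the Mondial data used here.
SAMPLE = """
<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <city><name>Tirana</name><population>192000</population></city>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
    <city><name>Andorra la Vella</name><population>15600</population></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Iterating over an element yields its direct children only.
child_tags = [child.tag for child in root]

# find() returns the first matching direct child, or None if there is none.
first_country_name = root.find('country').find('name').text

# findall() accepts a limited XPath, so 'country/city' reaches grandchildren.
city_names = [city.find('name').text for city in root.findall('country/city')]

# iter() walks the whole subtree in document order, at any depth.
all_names = [name.text for name in root.iter('name')]
```

Note that `find` and `findall` with a bare tag name only look at direct children, while `iter` descends through the whole subtree.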

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [4]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    # iter() finds city elements at any depth below the country
    city_names = [city.find('name').text for city in element.iter('city')]
    print ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


In [5]:
# map city name -> population for cities that are direct children of a country;
# the values stay strings because ElementTree returns element text as text
pop_by_city = dict()
for element in document_tree.iterfind('country/city'):
    pop_by_city[element.find('name').text] = element.find('population').text

pop_by_city

{'Andorra la Vella': '15600',
 'Beograd': '1407073',
 u'Durr\xebs': '60000',
 'Elbasan': '53000',
 u'Kor\xe7\xeb': '52000',
 'Kumanovo': '105484',
 u'Ni\u0161': '250518',
 'Novi Sad': '299294',
 'Podgorica': '136473',
 'Prishtine': '148090',
 u'Shkod\xebr': '62000',
 'Skopje': '506926',
 'Tirana': '192000',
 u'Vlor\xeb': '56000'}
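
The populations in the dictionary above are strings, so comparing or sorting them directly would be lexicographic rather than numeric. A minimal sketch of converting them first, using a few values copied from the output above:

```python
# A few entries copied from the output above; XML text always arrives as a string.
pop_by_city = {
    'Tirana': '192000',
    'Skopje': '506926',
    'Beograd': '1407073',
    'Podgorica': '136473',
}

# Comparing the raw strings is lexicographic: '506926' beats '1407073'.
largest_as_text = max(pop_by_city.values())

# Convert once, then sort by the numeric value, largest first.
pop_as_int = dict((name, int(pop)) for name, pop in pop_by_city.items())
ranked = sorted(pop_as_int.items(), key=lambda item: item[1], reverse=True)
```

Here `ranked[0]` is Beograd, whereas the string comparison would have picked Skopje.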

****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [6]:
document = ET.parse( './data/mondial_database.xml' )

In [7]:
##1 10 countries with the lowest infant mortality rates

In [8]:
# map country name -> infant mortality rate, skipping countries without a value
inf_mort_rate = dict()
for element in document.iterfind('country'):
    rate = element.find('infant_mortality')
    if rate is not None:
        inf_mort_rate[element.find('name').text] = float(rate.text)

# sort (rate, name) pairs ascending and print the ten lowest
new_list = sorted([(v,k) for k,v in inf_mort_rate.items()] )
for i in new_list[:10]:
    print i[1],": ",i[0]

Monaco :  1.81
Japan :  2.13
Bermuda :  2.48
Norway :  2.48
Singapore :  2.53
Sweden :  2.6
Czech Republic :  2.63
Hong Kong :  2.73
Macao :  3.13
Iceland :  3.15
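
As an alternative to sorting the whole dictionary, the standard library's `heapq.nsmallest` can pick the lowest entries directly. This is only a sketch on a few rates copied from the output above; the tie-breaking key (rate first, then name) is a choice made here so that ties such as Bermuda and Norway come out in a predictable order.

```python
import heapq

# A few rates copied from the output above.
inf_mort_rate = {
    'Monaco': 1.81,
    'Japan': 2.13,
    'Bermuda': 2.48,
    'Norway': 2.48,
    'Iceland': 3.15,
}

# Order by rate first and by name second, so equal rates sort alphabetically.
lowest = heapq.nsmallest(
    3, inf_mort_rate.items(), key=lambda item: (item[1], item[0]))
```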


In [9]:
##2 10 cities with the largest population

In [28]:
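# map city name -> most recent population figure
# (assumes the last <population> child is the latest estimate;
#  cities that share a name overwrite each other)
pop_by_city = dict()
for element in document.iter('city'):
    populations = element.findall('population')
    if len(populations) > 0:
        pop_by_city[element.find('name').text] = int(populations[-1].text)

# sort (population, name) pairs descending and print the ten largest
new_pop = sorted([(v,k) for k,v in pop_by_city.items()] ,reverse=True)
for i in new_pop[:10]:
    print i[1],": ",i[0]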

Shanghai :  22315474
Istanbul :  13710512
Mumbai :  12442373
Moskva :  11979529
Beijing :  11716620
São Paulo :  11152344
Tianjin :  11090314
Guangzhou :  11071424
Delhi :  11034555
Shenzhen :  10358381
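
The solution above takes the last `<population>` child as the latest estimate, which relies on the entries being in chronological order. A sketch of selecting by year instead, assuming each `<population>` element carries a `year` attribute (an assumption about the data; the sample element and the helper name `latest_population` are hypothetical):

```python
from xml.etree import ElementTree as ET

# Hypothetical city element; the 'year' attribute is an assumption about the data.
CITY = """
<city>
  <name>Example City</name>
  <population year="2011">300</population>
  <population year="1991">100</population>
  <population year="2001">200</population>
</city>
"""


def latest_population(city):
    """Return the population with the highest 'year', or None if there is none."""
    entries = [(int(p.get('year', 0)), int(p.text))
               for p in city.findall('population')]
    if not entries:
        return None
    # Tuples compare by their first item, so max() picks the latest year.
    return max(entries)[1]


city = ET.fromstring(CITY)
by_year = latest_population(city)
by_position = int(city.findall('population')[-1].text)
```

On this deliberately unordered sample the two approaches disagree: `by_year` is the 2011 figure, while `by_position` is whatever happens to be listed last.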


In [11]:
##3 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [33]:
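# sum, over all countries, each ethnic group's share of the country's population
ethnic_grp = dict()
for element in document.iterfind('country'):
    sub_grps = element.findall('ethnicgroup')
    try:
        # assumes the last <population> child is the latest estimate
        tot_pop = float(element.findall('population')[-1].text)
    except (IndexError, TypeError, ValueError):
        tot_pop = 0

    for grp in sub_grps:
        name = grp.text
        # 'percentage' is the group's share of the country's population
        grp_pop = float(grp.attrib.get('percentage'))/100*tot_pop
        ethnic_grp[name] = ethnic_grp.get(name,0) + grp_pop

# sort (population, name) pairs descending and print the ten largest
new_list = sorted([(v,k) for k,v in ethnic_grp.items()] ,reverse=True)
for i in new_list[:10]:
    print i[1],": ",i[0]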

Han Chinese :  1245058800.0
Indo-Aryan :  871815583.44
European :  494872219.72
African :  318325120.369
Dravidian :  302713744.25
Mestizo :  157734354.937
Bengali :  146776916.72
Russian :  131856996.077
Japanese :  126534212.0
Malay :  121993550.374
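
The aggregation step can be checked by hand on a small case. The sketch below repeats the same percentage-times-population sum on a hypothetical two-country document, using `collections.defaultdict` in place of `dict.get`; the sample data and names are illustrative only.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample: Group X appears in both countries.
SAMPLE = """
<mondial>
  <country>
    <name>A</name>
    <population year="2010">1000</population>
    <ethnicgroup percentage="60">Group X</ethnicgroup>
    <ethnicgroup percentage="40">Group Y</ethnicgroup>
  </country>
  <country>
    <name>B</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">Group X</ethnicgroup>
  </country>
</mondial>
"""

totals = defaultdict(float)
for country in ET.fromstring(SAMPLE).iterfind('country'):
    populations = country.findall('population')
    if not populations:
        continue  # no population figure, so no estimate is possible
    total = float(populations[-1].text)
    for group in country.findall('ethnicgroup'):
        # share of the country's population, accumulated across countries
        totals[group.text] += float(group.get('percentage')) / 100 * total

ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Group X should total 60% of 1000 plus 50% of 500, which is 850, and Group Y should total 400.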


In [None]:
##4 name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [20]:
# track the longest river seen so far; rivers without a usable length count as 0
longest_river = None
longest_river_cntry = None
river_len_longest = 0
for element in document.iterfind('river'):
    river_name = element.find('name').text
    try:
        river_len = float(element.find('length').text)
    except (AttributeError, TypeError, ValueError):
        river_len = 0
    if river_len > river_len_longest:
        # the 'country' attribute holds space-separated country codes, not names
        longest_river_cntry = element.attrib.get('country')
        longest_river = river_name
        river_len_longest = river_len
print 'Longest river is:', longest_river,'In country: ', longest_river_cntry, 'Len:', river_len_longest

Longest river is: Amazonas In country:  CO BR PE Len: 6448.0
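
The `country` attribute printed above contains codes (`CO BR PE`) rather than country names. One way to translate them is to build a lookup from the `<country>` elements first. The sketch below assumes each country element exposes its code in a `car_code` attribute; that attribute name is an assumption about the data file, and the inline sample is hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; 'car_code' is assumed to hold the code used in 'country'.
SAMPLE = """
<mondial>
  <country car_code="BR"><name>Brazil</name></country>
  <country car_code="CO"><name>Colombia</name></country>
  <country car_code="PE"><name>Peru</name></country>
  <river country="CO BR PE"><name>Amazonas</name><length>6448</length></river>
</mondial>
"""

root = ET.fromstring(SAMPLE)

# Build the code -> name lookup once.
code_to_name = dict(
    (country.get('car_code'), country.find('name').text)
    for country in root.iterfind('country'))

# Split the space-separated codes; fall back to the raw code if it is unknown.
river = root.find('river')
countries = [code_to_name.get(code, code)
             for code in river.get('country', '').split()]
```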


In [21]:
# track the largest lake seen so far; lakes without a usable area count as 0
largest_lake = None
largest_lake_cntry = None
largest_lake_area = 0
for element in document.iterfind('lake'):
    lake_name = element.find('name').text
    try:
        lake_area = float(element.find('area').text)
    except (AttributeError, TypeError, ValueError):
        lake_area = 0
    if lake_area > largest_lake_area:
        # the 'country' attribute holds space-separated country codes, not names
        largest_lake_cntry = element.attrib.get('country')
        largest_lake = lake_name
        largest_lake_area = lake_area
print 'Largest lake is:', largest_lake,'In country: ', largest_lake_cntry, 'Area:', largest_lake_area

Largest lake is: Caspian Sea In country:  R AZ KAZ IR TM Area: 386400.0


In [23]:
# track the highest airport seen so far; airports without a usable elevation count as 0
he_airport = None
he_ap_cntry = None
he_ap_elev = 0
for element in document.iterfind('airport'):
    ap_name = element.find('name').text
    try:
        ap_ele = float(element.find('elevation').text)
    except (AttributeError, TypeError, ValueError):
        ap_ele = 0
    if ap_ele > he_ap_elev:
        he_airport = ap_name
        # the 'country' attribute holds a country code, not a name
        he_ap_cntry = element.attrib.get('country')
        he_ap_elev = ap_ele
print 'Airport with highest elevation:', he_airport, 'In country:', he_ap_cntry, 'Elevation:', he_ap_elev

Airport with highest elevation: El Alto Intl In country: BOL Elevation: 4063.0
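
The three searches above share one pattern: scan elements of a given tag and keep the one whose numeric child is largest. A sketch of that pattern as a single helper follows. The name `largest_by` and the inline sample are hypothetical, and unlike the cells above it skips elements with a missing or empty value instead of treating them as 0.

```python
from xml.etree import ElementTree as ET


def largest_by(root, tag, measure):
    """Return (value, name, country codes) for the element of type `tag`
    whose numeric child `measure` is largest, or None if none qualifies."""
    best = None
    for element in root.iterfind(tag):
        node = element.find(measure)
        try:
            value = float(node.text)
        except (AttributeError, TypeError, ValueError):
            continue  # child missing, empty, or not a number
        if best is None or value > best[0]:
            best = (value, element.find('name').text, element.get('country'))
    return best


# Hypothetical sample with one missing and one empty measurement.
SAMPLE = """
<mondial>
  <river country="A"><name>Short</name><length>10</length></river>
  <river country="A B"><name>Long</name><length>250.5</length></river>
  <river country="B"><name>Unknown</name></river>
  <airport country="A"><name>Empty</name><elevation/></airport>
  <airport country="B"><name>High</name><elevation>4063</elevation></airport>
</mondial>
"""

root = ET.fromstring(SAMPLE)
longest = largest_by(root, 'river', 'length')
highest = largest_by(root, 'airport', 'elevation')
```

The same call with `'lake'` and `'area'` would cover the remaining case; on this sample it returns `None` because there are no lake elements.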
