# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
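
As a minimal sketch of the traversal calls used in this notebook, the snippet below contrasts `findall()` (direct children only) with `iter()` (the whole subtree). The document string and every name in it are hypothetical and are not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, used only to illustrate traversal.
SAMPLE = """
<world>
  <country>
    <name>Exampleland</name>
    <city><name>Port A</name></city>
    <province>
      <name>North</name>
      <city><name>Fort B</name></city>
    </province>
  </country>
</world>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')  # first direct child with this tag

# findall() with a bare tag name looks at direct children only
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree, so the city nested in the province is included
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

The difference matters for data in which some cities sit directly under a country while others are nested inside a province.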

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [4]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [5]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # iter() replaces getiterator(), which is deprecated since Python 2.7
    city_names = [city.find('name').text for city in element.iter('city')]
    print '* ' + element.find('name').text + ': ' + ', '.join(city_names)

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and the ElementTree documentation at https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [7]:
document = ET.parse( './data/mondial_database.xml' )

In [8]:
document

<xml.etree.ElementTree.ElementTree at 0x10a32d3d0>

In [245]:
# 10 countries with the lowest infant mortality rates
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('./data/mondial_database.xml')

root = tree.getroot()
infant_mort_dict = {}

for child in root:
    im = child.find('infant_mortality')
    if im is not None:
        country = child.find('name').text
        # NOTE: im.text is a string, so the Series below has dtype object and
        # sort_values() orders it lexicographically ('10.16' sorts before '2.0');
        # store float(im.text) instead to get a numeric ranking.
        infant_mort_dict[country] = im.text

im_series = pd.Series(infant_mort_dict).sort_values(ascending=True)
print('10 countries with the lowest infant mortality rates')
print(im_series[:10])

10 countries with the lowest infant mortality rates
Monaco                   1.81
Romania                 10.16
Fiji                     10.2
Brunei                  10.48
Grenada                  10.5
Mauritius               10.59
Panama                   10.7
Seychelles              10.77
United Arab Emirates    10.92
Barbados                10.93
dtype: object
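
The `dtype: object` in the listing above indicates that the rates were sorted as strings rather than as numbers, which is why 10.16 directly follows 1.81. The sketch below shows the difference on a hypothetical four-country sample (the names and values are made up for illustration). It uses only the standard library, and the equivalent change in the notebook cell would be to store `float(im.text)`.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample values, chosen only to expose the ordering problem.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>10.16</infant_mortality></country>
  <country><name>B</name><infant_mortality>2.5</infant_mortality></country>
  <country><name>C</name><infant_mortality>1.81</infant_mortality></country>
  <country><name>D</name></country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
rates = {}
for country in root.iterfind('country'):
    im = country.find('infant_mortality')
    if im is not None:  # country D has no figure and is skipped
        rates[country.find('name').text] = im.text  # still a string here

# Sorting the raw text compares character by character: '10.16' < '2.5'
by_text = sorted(rates, key=lambda name: rates[name])

# Converting to float first gives the intended numeric order
by_number = sorted(rates, key=lambda name: float(rates[name]))

print(by_text)
print(by_number)
```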


In [231]:
# 10 cities with the largest population
# NOTE: find('population') returns only the first <population> child of a city,
# so this keeps just the cities whose first listed figure is from 2011.
largest_pop_dict = {}
for element in root.iterfind('country'):
    # iter() (the non-deprecated form of getiterator()) walks the whole country
    # subtree, so cities nested inside provinces are already included here
    for city in element.iter('city'):
        population = city.find('population')
        if population is not None and population.get('year') == '2011':
            # keyed by city name alone, so same-named cities overwrite each other
            largest_pop_dict[city.find('name').text] = int(population.text)

largest_pop_series = pd.Series(largest_pop_dict).sort_values(ascending=False)
print('10 cities with the largest population')
print(largest_pop_series[:10])

10 cities with the largest population
Portsmouth       238137
Plymouth         234982
Wolverhampton    210319
Erfurt           200868
Bolton           194189
Potsdam          156021
Blackburn        117963
Rotherham        109691
Worthing         109120
Maidstone        107627
dtype: int64
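
The ranking above contains only cities whose first `population` element is dated 2011, which is why no large metropolis appears in it. One possible alternative, sketched below, keeps the most recent figure for every city and keys the result by country and city name. The sample document is hypothetical, and the assumption that each city carries several `<population year="...">` children is inferred from the `year` attribute used in the cell above.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; the element layout is an assumption about the file.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Alpha</name>
      <population year="1991">500</population>
      <population year="2011">900</population>
    </city>
    <province>
      <name>P</name>
      <city>
        <name>Beta</name>
        <population year="2001">1200</population>
      </city>
      <city><name>Gamma</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
latest = {}
for country in root.iterfind('country'):
    country_name = country.find('name').text
    for city in country.iter('city'):
        figures = city.findall('population')
        if not figures:
            continue  # city without any population figure
        # pick the figure with the highest year rather than the first one
        newest = max(figures, key=lambda p: int(p.get('year')))
        latest[(country_name, city.find('name').text)] = int(newest.text)

top = sorted(latest.items(), key=lambda kv: kv[1], reverse=True)[:10]
for (country_name, city_name), population in top:
    print('%s (%s): %d' % (city_name, country_name, population))
```

Keying by `(country, city)` avoids one city silently overwriting another that has the same name.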


In [248]:
# same filter as above, restricted to cities inside provinces, printing each match
largest_pop_dict = {}
for element in root.iterfind('country'):
    for province in element.iter('province'):
        province_name = province.find('name').text
        for city in province.iter('city'):
            population = city.find('population')
            if population is not None:
                population_year = int(population.get('year'))
                if population_year == 2011:
                    print(province_name + ':' + city.find('name').text + ':' + str(population_year) + ':' + population.text)
                    largest_pop_dict[city.find('name').text] = int(population.text)

largest_pop_series = pd.Series(largest_pop_dict).sort_values(ascending=False)
print('10 cities with the largest population')
print(largest_pop_series[:10])

Stereas Elladas:Lamia:2011:75315
Brandenburg:Potsdam:2011:156021
Brandenburg:Cottbus:2011:99984
Thüringen:Erfurt:2011:200868
Thüringen:Gera:2011:96067
Thüringen:Jena:2011:105739
North West:Stockport:2011:105878
North West:Bolton:2011:194189
North West:Salford:2011:103886
North West:Blackburn:2011:117963
North West:Preston:2011:97886
Yorkshire and the Humber:Rotherham:2011:109691
West Midlands:Wolverhampton:2011:210319
South East:Portsmouth:2011:238137
South East:Maidstone:2011:107627
South East:Worthing:2011:109120
South West:Plymouth:2011:234982
10 cities with the largest population
Portsmouth       238137
Plymouth         234982
Wolverhampton    210319
Erfurt           200868
Bolton           194189
Potsdam          156021
Blackburn        117963
Rotherham        109691
Worthing         109120
Maidstone        107627
dtype: int64
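
Item 3 of the exercise list (ethnic groups by overall population) is not worked in the cells above. A rough sketch of one approach follows: take each country's most recent population figure, multiply it by each group's share, and sum per group across countries. The element and attribute names used here (`ethnicgroup`, `percentage`, `population` with a `year` attribute as direct children of `country`) are assumptions about the Mondial schema and should be checked against the actual file. The sample values are hypothetical.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample; tag and attribute names are assumed, not verified.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <population year="2001">1000</population>
    <population year="2011">2000</population>
    <ethnicgroup percentage="75">Red</ethnicgroup>
    <ethnicgroup percentage="25">Blue</ethnicgroup>
  </country>
  <country>
    <name>Y</name>
    <population year="2010">400</population>
    <ethnicgroup percentage="50">Blue</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
totals = defaultdict(float)
for country in root.iterfind('country'):
    # findall() sees direct children only, so city figures are not mixed in
    figures = country.findall('population')
    if not figures:
        continue
    latest = max(figures, key=lambda p: int(p.get('year')))
    population = int(latest.text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage')) / 100.0
        totals[group.text] += share * population

ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
for name, people in ranking:
    print('%s: %.0f' % (name, people))
```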
