# XML example and exercise
****
+ study examples of accessing nodes in an XML tree structure
+ work on the exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [1]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
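
As a minimal sketch of the traversal methods mentioned above, the snippet below contrasts `findall()` (direct children only) with `iter()` (the whole subtree). The inline XML and its names (`Exampleland`, `Inner City`, `Outer City`) are hypothetical placeholders, not taken from the Mondial data.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; only the tag names mirror this notebook.
SAMPLE = """
<mondial>
  <country>
    <name>Exampleland</name>
    <province>
      <name>Example Province</name>
      <city><name>Inner City</name></city>
    </province>
    <city><name>Outer City</name></city>
  </country>
</mondial>
"""

root = ET.fromstring(SAMPLE)
country = root.find('country')  # first direct child with this tag

# findall() matches direct children only, so the city nested in <province> is skipped
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter() walks the whole subtree in document order, so nested cities are included
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

This difference is why the city examples in this notebook use a subtree iterator: cities may sit under intermediate elements rather than directly under `country`.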

In [2]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [3]:
# print names of all countries
for child in document_tree.getroot():
    print child.find('name').text

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [9]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print '* ' + element.find('name').text + ':',
    capitals_string = ''
    for subelement in element.getiterator('city'):
        capitals_string += subelement.find('name').text + ', '
    print capitals_string[:-2]

* Albania: Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece: Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia: Skopje, Kumanovo
* Serbia: Beograd, Novi Sad, Niš
* Montenegro: Podgorica
* Kosovo: Prishtine
* Andorra: Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find:

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [10]:
document = ET.parse( './data/mondial_database.xml' )

## 10 countries with the lowest infant mortality rates

In [61]:
import pandas as pd

rates = {}
for country in document.findall('country'):
    # iterate over the element directly; getchildren() is deprecated
    for node in country:
        if node.tag == 'name':
            name = node.text
        if node.tag == 'infant_mortality':
            rates[name] = float(node.text)

# build a two-column table and show the 10 lowest rates
df = pd.DataFrame(list(rates.items()), columns = ['Country','Infant Mortality Rate'])
df.sort_values('Infant Mortality Rate').head(10)

| | Country | Infant Mortality Rate |
|---|---|---|
| 34 | Monaco | 1.81 |
| 210 | Japan | 2.13 |
| 71 | Norway | 2.48 |
| 64 | Bermuda | 2.48 |
| 76 | Singapore | 2.53 |
| 106 | Sweden | 2.6 |
| 55 | Czech Republic | 2.63 |
| 143 | Hong Kong | 2.73 |
| 52 | Macao | 3.13 |
| 189 | Iceland | 3.15 |
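
The same ranking can be sketched without pandas. The version below is an illustration under stated assumptions, not the notebook's method: the inline XML and the helper name `lowest_infant_mortality` are hypothetical, and countries without an `infant_mortality` element are skipped explicitly.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; country "B" has no infant_mortality element.
SAMPLE = """
<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name></country>
  <country><name>C</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>D</name><infant_mortality>12.0</infant_mortality></country>
</mondial>
"""

def lowest_infant_mortality(root, n):
    """Return the n (name, rate) pairs with the lowest rates, ascending."""
    rates = []
    for country in root.findall('country'):
        # findtext() returns None when the child element is missing
        rate = country.findtext('infant_mortality')
        if rate is None:
            continue
        rates.append((float(rate), country.findtext('name')))
    # tuples sort by rate first, then by name for ties
    return [(name, rate) for rate, name in sorted(rates)[:n]]

result = lowest_infant_mortality(ET.fromstring(SAMPLE), 2)
print(result)
```

Looking up each child with `findtext()` avoids relying on the order in which `name` and `infant_mortality` appear inside a `country` element.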


## 10 cities with the largest population

In [79]:
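populations = {}
for country in document.iterfind('country'):
    # iter() replaces the deprecated getiterator()
    for city in country.iter('city'):
        # note: cities are keyed by name only, so same-named cities overwrite each other
        name = city.find('name').text
        latest_year = None
        for node in city.iter('population'):
            # compare years as integers rather than strings
            year = int(node.attrib['year'])
            if latest_year is None or year >= latest_year:
                latest_year = year
                populations[name] = int(node.text)

# show the 10 most populous cities
df = pd.DataFrame(list(populations.items()), columns = ['City','Population'])
df.sort_values('Population',ascending = False).head(10)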

| | City | Population |
|---|---|---|
| 2777 | Shanghai | 22315474 |
| 1620 | Istanbul | 13710512 |
| 1857 | Mumbai | 12442373 |
| 1064 | Moskva | 11979529 |
| 2209 | Beijing | 11716620 |
| 2681 | São Paulo | 11152344 |
| 538 | Tianjin | 11090314 |
| 840 | Guangzhou | 11071424 |
| 2735 | Delhi | 11034555 |
| 598 | Shenzhen | 10358381 |
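
One possible refinement, sketched here on hypothetical inline data, is to key each city by `(country, city)` so that same-named cities in different countries do not overwrite each other. The helper name `latest_city_populations` and the sample names are assumptions made for illustration, and the sketch assumes `population` elements are direct children of `city`.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: two countries share a city name; one city has no population.
SAMPLE = """
<mondial>
  <country>
    <name>X</name>
    <city>
      <name>Springfield</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
  </country>
  <country>
    <name>Y</name>
    <city>
      <name>Springfield</name>
      <population year="2010">900</population>
    </city>
    <city><name>Nopop</name></city>
  </country>
</mondial>
"""

def latest_city_populations(root):
    """Map (country name, city name) to the most recent population figure."""
    latest = {}
    for country in root.iter('country'):
        country_name = country.findtext('name')
        for city in country.iter('city'):
            counts = city.findall('population')
            if not counts:
                continue  # skip cities without any population figure
            # pick the census entry with the numerically largest year
            newest = max(counts, key=lambda node: int(node.attrib['year']))
            latest[(country_name, city.findtext('name'))] = int(newest.text)
    return latest

result = latest_city_populations(ET.fromstring(SAMPLE))
print(result)
```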


## 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)

In [228]:
countries = {}

for country in document.iterfind('country'):
    country_name = country.find('name').text
    latest_year = None
    for child in country.findall('population'):
        # keep the population figure from the most recent year
        year = int(child.attrib['year'])
        if latest_year is None or year >= latest_year:
            latest_year = year
            countries[country_name] = {'population': int(child.text)}
    if country_name not in countries:
        # no population figure, so group sizes cannot be estimated
        continue

    # map each ethnic group to its percentage share of the country's population
    countries[country_name]['groups'] = {child.text: float(child.attrib['percentage'])
                                         for child in country.findall('ethnicgroup')}

# sum the estimated group sizes over all countries
group_totals = {}
for country, info in countries.items():
    for ethnicity, percentage in info['groups'].items():
        estimate = int(percentage * info['population'] / 100)
        group_totals[ethnicity] = group_totals.get(ethnicity, 0) + estimate

df = pd.DataFrame(list(group_totals.items()), columns = ['Ethnicity','Population'])
df.sort_values('Population', ascending = False).head(10)

| | Ethnicity | Population |
|---|---|---|
| 98 | Han Chinese | 1245058800 |
| 108 | Indo-Aryan | 871815583 |
| 15 | European | 494872201 |
| 128 | African | 318325104 |
| 193 | Dravidian | 302713744 |
| 138 | Mestizo | 157734349 |
| 14 | Bengali | 146776916 |
| 181 | Russian | 131856989 |
| 272 | Japanese | 126534212 |
| 175 | Malay | 121993548 |
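
The arithmetic behind these totals is a weighted sum: each group's estimate in a country is `population * percentage / 100`, and the estimates are added across countries. The sketch below works through that sum on hypothetical numbers; the country and group names and the helper `group_totals` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: latest population and ethnic-group percentages per country.
countries = {
    'X': {'population': 1000, 'groups': {'Alpha': 60.0, 'Beta': 40.0}},
    'Y': {'population': 500, 'groups': {'Alpha': 10.0}},
    'Z': {'population': 200, 'groups': {}},  # no ethnic-group data
}

def group_totals(countries):
    """Sum population * percentage / 100 for each group over all countries."""
    totals = defaultdict(float)
    for info in countries.values():
        for group, percentage in info['groups'].items():
            totals[group] += info['population'] * percentage / 100.0
    return dict(totals)

totals = group_totals(countries)
# rank groups from largest to smallest estimated population
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Here `Alpha` totals 1000 × 0.60 + 500 × 0.10 = 650 and `Beta` totals 400. The sketch keeps floating-point values throughout, whereas the notebook cell truncates each per-country estimate to an integer before summing, so the two approaches can differ slightly on real data.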
