# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [4]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html

In [5]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [183]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [185]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() walks every <city> below the country, at any depth
    # (getiterator() was removed in Python 3.9; iter() is its replacement)
    city_names = [subelement.find('name').text for subelement in element.iter('city')]
    print(', '.join(city_names))
    

* Albania:
Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
Skopje, Kumanovo
* Serbia:
Beograd, Novi Sad, Niš
* Montenegro:
Podgorica
* Kosovo:
Prishtine
* Andorra:
Andorra la Vella
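
The two cells above rely on the difference between `find`/`findall` (direct children only) and `iter` (descendants at any depth). The sketch below illustrates that difference on a small inline document. `SAMPLE_XML` and its contents are hypothetical, only loosely modelled on the Mondial layout, and are not taken from the data files.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document (an assumption, not the real Mondial data).
SAMPLE_XML = """<mondial>
  <country car_code="AL">
    <name>Albania</name>
    <province>
      <name>Tirana</name>
      <city><name>Tirana</name></city>
    </province>
    <city><name>Elbasan</name></city>
  </country>
  <country car_code="AND">
    <name>Andorra</name>
  </country>
</mondial>"""

sample_root = ET.fromstring(SAMPLE_XML)
albania = sample_root.find('country')   # first <country> that is a direct child

# findall() only sees direct children; iter() descends through <province> too.
direct_cities = [c.findtext('name') for c in albania.findall('city')]
all_cities = [c.findtext('name') for c in albania.iter('city')]
print(direct_cities)
print(all_cities)

# findtext() returns None (or the supplied default) when the child is missing,
# whereas find(...).text would raise AttributeError on the None result.
andorra = sample_root.findall('country')[1]
missing = andorra.findtext('infant_mortality')
fallback = andorra.findtext('infant_mortality', default='n/a')
print(missing, fallback)
```

This is why the exercise code below uses `findtext` and then filters out `None` before converting values to numbers.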


****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

Question 1. 10 countries with the lowest infant mortality rates

In [244]:
document = ET.parse( './data/mondial_database.xml' )

root = document.getroot()

# save each country's name and infant mortality rate in parallel lists
# (findtext returns None when a country has no <infant_mortality> element)
mortality_list = []
country_list = []
for country in root.findall('country'):
    mortality_list.append(country.findtext('infant_mortality'))
    country_list.append(country.findtext('name'))

# generate "clean" lists that exclude countries without an infant mortality value
mortality_list_clean = []
country_list_clean = []
for name, rate in zip(country_list, mortality_list):
    if rate is not None:
        mortality_list_clean.append(float(rate))
        country_list_clean.append(name)

# sorted_indexes holds positions in the clean lists ordered by ascending rate,
# so sorted_indexes[0] is the index of the smallest rate
sorted_indexes = sorted(range(len(mortality_list_clean)),
                        key=lambda k: mortality_list_clean[k])

# print answer
print("10 countries with the lowest infant mortality rates:\n")
for i in sorted_indexes[:10]:
    print(country_list_clean[i], mortality_list_clean[i])


10 countries with the lowest infant mortality rates:

Monaco 1.81
Japan 2.13
Norway 2.48
Bermuda 2.48
Singapore 2.53
Sweden 2.6
Czech Republic 2.63
Hong Kong 2.73
Macao 3.13
Iceland 3.15
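
For comparison, the same "filter out missing values, then take the smallest n" logic can be sketched more compactly with `(rate, name)` tuples and `heapq.nsmallest`. This is an alternative sketch, not the method used above. `lowest_infant_mortality` and the inline `SAMPLE_XML` are hypothetical names and data introduced only for illustration.

```python
import heapq
from xml.etree import ElementTree as ET

# Hypothetical sample data (an assumption, not the real Mondial file).
SAMPLE_XML = """<mondial>
  <country><name>A-land</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B-land</name></country>
  <country><name>C-land</name><infant_mortality>1.8</infant_mortality></country>
  <country><name>D-land</name><infant_mortality>12.0</infant_mortality></country>
</mondial>"""

def lowest_infant_mortality(root, n=10):
    """Return the n (rate, name) pairs with the lowest infant mortality."""
    rates = []
    for country in root.findall('country'):
        rate = country.findtext('infant_mortality')
        if rate is not None:              # skip countries without a value
            rates.append((float(rate), country.findtext('name')))
    # tuples compare by rate first, so nsmallest orders by mortality
    return heapq.nsmallest(n, rates)

sample_root = ET.fromstring(SAMPLE_XML)
lowest_two = lowest_infant_mortality(sample_root, n=2)
for rate, name in lowest_two:
    print(name, rate)
```

Keeping the rate and the name in one tuple avoids having to maintain two parallel lists and an index list. Note that ties are then broken by name rather than by document order.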


Question 2. 10 cities with the largest population


In [243]:
# save city names and populations in parallel lists
# (findtext returns the text of the first matching <population> child, or None)
city_list = []
pop_list = []
for city in root.iter('city'):
    city_list.append(city.findtext('name'))
    pop_list.append(city.findtext('population'))

# remove records whose value is None and convert the remaining values to int
def remove_none(value_list, item_list):
    value_list_clean = []
    item_list_clean = []
    for value, item in zip(value_list, item_list):
        if value is not None:
            value_list_clean.append(int(value))
            item_list_clean.append(item)
    return value_list_clean, item_list_clean

pop_list, city_list = remove_none(pop_list, city_list)

# sort by population; I could have used df.sort_values instead, similar to
# what is done in the following questions
pairs_sortby_pop = sorted(zip(pop_list, city_list), key=lambda pair: pair[0])
city_sortby_pop = [city for (pop, city) in pairs_sortby_pop]
pop_sortby_pop = [pop for (pop, city) in pairs_sortby_pop]

# print answer for 10 cities with the largest population, the largest comes first
print("10 cities with the largest population:\n")
for i in range(len(city_sortby_pop)-1, len(city_sortby_pop)-11, -1):
    print(city_sortby_pop[i],", ",pop_sortby_pop[i])


10 cities with the largest population:

Seoul ,  10229262
Mumbai ,  9925891
São Paulo ,  9412894
Jakarta ,  8259266
Shanghai ,  8205598
Ciudad de México ,  8092449
Moskva ,  8010954
Tokyo ,  7843000
Beijing ,  7362426
Delhi ,  7206704
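
Because `findtext('population')` returns only the first `<population>` child, the ranking above depends on which estimate happens to be listed first for each city. If a city carries several estimates, each tagged with a `year` attribute (an assumption about the data layout that should be checked against the file), the latest one can be picked explicitly. The helper `latest_population` and the sample data below are hypothetical.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: each city may have several dated population estimates.
SAMPLE_XML = """<mondial>
  <city><name>Oldtown</name>
    <population year="1990">900</population>
    <population year="2010">500</population>
  </city>
  <city><name>Newtown</name>
    <population year="2000">300</population>
    <population year="2011">700</population>
  </city>
  <city><name>Ghosttown</name></city>
</mondial>"""

def latest_population(element):
    """Return the most recent <population> value of element, or None."""
    entries = element.findall('population')
    if not entries:
        return None
    # assumes a numeric 'year' attribute; entries without one sort first
    latest = max(entries, key=lambda p: int(p.get('year', 0)))
    return int(latest.text)

sample_root = ET.fromstring(SAMPLE_XML)
cities = []
for city in sample_root.iter('city'):
    pop = latest_population(city)
    if pop is not None:
        cities.append((city.findtext('name'), pop))

# largest population first
ranked = sorted(cities, key=lambda pair: pair[1], reverse=True)
for name, pop in ranked:
    print(name, ", ", pop)
```

In this sample the first listed estimates would rank Oldtown ahead of Newtown, while the latest estimates reverse the order.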


Question 3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)


In [240]:
import pandas as pd

# find ethnic groups and save their name, country and population in separate lists
# convert population to int so it can be summed and sorted by value
# note: findtext returns only the first <ethnicgroup> and the first <population>
# child of each country, so each country's population is credited to one group

ethic_list=[]
country_list=[]
pop_list=[]
for country in root.findall('country'):
    name=country.findtext('name')
    ethic=country.findtext('ethnicgroup')
    pop=country.findtext('population')
    ethic_list.append(ethic)
    country_list.append(name)
    pop_list.append(int(pop))

# save data of interest in a dataframe and find the max 10 by df.sort_values
# (DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order)
df = pd.DataFrame({'country': country_list,
                   'ethic_group': ethic_list,
                   'population': pop_list})
# sum only the numeric population column within each group
df = df.groupby('ethic_group')[['population']].sum()
df_sort = df.sort_values(by='population', ascending=False)

#print answer for 10 ethnic groups with the largest overall populations
print("10 ethnic groups with the largest overall populations:")
print(df_sort[:10])

10 ethnic groups with the largest overall populations:
             population
ethic_group            
Han Chinese   543776080
European      284338691
Dravidian     238396327
Russian       102798657
German         96828641
Japanese       82199470
Javanese       72592192
African        51889000
Mestizo        50797340
English        50616012
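
The totals above credit each country's whole population to the first ethnic group listed for it. A closer reading of "sum of best/latest estimates over all countries" would weight every group by its share of the latest population figure. The sketch below assumes each `<ethnicgroup>` carries a `percentage` attribute and each `<population>` a `year` attribute. Both are assumptions about the file layout to verify before use, and the sample data is hypothetical.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data (attribute names are assumptions, see lead-in).
SAMPLE_XML = """<mondial>
  <country><name>X-land</name>
    <population year="1990">1000</population>
    <population year="2010">2000</population>
    <ethnicgroup percentage="50">Alpha</ethnicgroup>
    <ethnicgroup percentage="50">Beta</ethnicgroup>
  </country>
  <country><name>Y-land</name>
    <population year="2011">1000</population>
    <ethnicgroup percentage="10">Alpha</ethnicgroup>
    <ethnicgroup percentage="90">Gamma</ethnicgroup>
  </country>
</mondial>"""

sample_root = ET.fromstring(SAMPLE_XML)
totals = defaultdict(float)

for country in sample_root.findall('country'):
    estimates = country.findall('population')
    if not estimates:
        continue                      # no population figure to distribute
    latest = max(estimates, key=lambda p: int(p.get('year', 0)))
    population = int(latest.text)
    for group in country.findall('ethnicgroup'):
        # each group's head count is its percentage of the latest population
        share = float(group.get('percentage', 0))
        totals[group.text] += population * share / 100

# largest overall population first
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked[:10]:
    print(name, int(total))
```

With this weighting the figures for the real data would differ from the table above, so the two results should not be compared directly.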


Question 4a. Name and country of longest river.

In [268]:
# find rivers and save their name, country and length in separate lists

river_list = []
country_list = []
length_list = []

for river in root.iter('river'):
    length = river.findtext('length')
    if length is None:
        continue    # skip rivers without a recorded length
    river_list.append(river.findtext('name'))
    country_list.append(river.get('country'))
    length_list.append(float(length))

# DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
df = pd.DataFrame({'country': country_list,
                   'river': river_list,
                   'length': length_list})
df_sort = df.sort_values(by='length', ascending=False)
#print answer for longest river
print("The longest river:\n")
print(df_sort[:1])

The longest river:

      country     river  length
174  CO BR PE  Amazonas  6448.0
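
The `country` column above holds space-separated country codes rather than names. If each `<country>` element exposes its code in a `car_code` attribute (an assumption about the file layout to verify), the codes can be resolved to names with a lookup table. The helper `country_names` and the inline sample below are hypothetical. Only the Amazonas row mirrors the output above.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample; 'car_code' as the code attribute is an assumption.
SAMPLE_XML = """<mondial>
  <country car_code="CO"><name>Colombia</name></country>
  <country car_code="BR"><name>Brazil</name></country>
  <country car_code="PE"><name>Peru</name></country>
  <river country="CO BR PE"><name>Amazonas</name><length>6448</length></river>
  <river country="BR"><name>Shortwater</name><length>120</length></river>
  <river country="PE"><name>Unmeasured</name></river>
</mondial>"""

sample_root = ET.fromstring(SAMPLE_XML)

# map each country code to its name
code_to_name = {c.get('car_code'): c.findtext('name')
                for c in sample_root.findall('country')}

def country_names(codes):
    """Translate a space-separated code string; unknown codes pass through."""
    return [code_to_name.get(code, code) for code in codes.split()]

# keep only rivers with a non-empty length, then take the longest
measured = [r for r in sample_root.iter('river') if r.findtext('length')]
longest = max(measured, key=lambda r: float(r.findtext('length')))
longest_name = longest.findtext('name')
longest_countries = country_names(longest.get('country', ''))
print(longest_name, ':', ', '.join(longest_countries))
```

The same lookup could be applied to the lake and airport results below, which also report country codes.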


Question 4b. Name and country of largest lake 

In [266]:
def find_largest(nametext,country,item):
    name_list = []
    country_list = []
    item_list = []

    # some fields (e.g. airport elevation) are missing or blank; skip those records
    for element in root.iter(nametext):
        value = element.findtext(item)
        if value is None or value == "":
            continue
        name_list.append(element.findtext('name'))
        country_list.append(element.get(country))
        item_list.append(float(value))

    # DataFrame.from_items was removed in pandas 1.0; a dict keeps the column order
    df = pd.DataFrame({country: country_list,
                       nametext: name_list,
                       item: item_list})
    # sort descending so the largest value comes first
    df_sort = df.sort_values(by=item, ascending=False)
    return df_sort

nametext='lake'
country='country'
area='area'
#print answer for largest lake
df_sort=find_largest(nametext,country,area)
print("The largest lake:\n")
print(df_sort[:1])

The largest lake:

           country         lake      area
54  R AZ KAZ IR TM  Caspian Sea  386400.0


Question 4c. Name and country of airport at highest elevation


In [267]:
nametext='airport'
country='country'
height='elevation'
df_sort=find_largest(nametext,country,height)
#print answer for airport at highest elevation

print("Name and country of airport at highest elevation:\n")
print(df_sort[:1])

Name and country of airport at highest elevation:

   country       airport  elevation
80     BOL  El Alto Intl     4063.0
