# XML example and exercise
****
+ study examples of accessing nodes in XML tree structure  
+ work on exercise to be completed and submitted
****
+ reference: https://docs.python.org/2.7/library/xml.etree.elementtree.html
+ data source: http://www.dbis.informatik.uni-goettingen.de/Mondial
****

In [2]:
from xml.etree import ElementTree as ET

## XML example

+ for details about tree traversal and iterators, see https://docs.python.org/2.7/library/xml.etree.elementtree.html
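
As a rough illustration of the traversal methods used in this notebook, the sketch below contrasts `findall` (direct children only when given a bare tag name) with `iter` (the whole subtree). The `sample_xml` string is a made-up miniature document that only loosely follows the Mondial layout; it is an assumption for illustration, not an excerpt from the data file.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document, loosely modelled on the Mondial layout.
sample_xml = """
<mondial>
  <country>
    <name>Albania</name>
    <city><name>Tirana</name></city>
    <province>
      <name>Shkoder</name>
      <city><name>Shkoder</name></city>
    </province>
  </country>
</mondial>
"""

root = ET.fromstring(sample_xml)
country = root.find('country')

# findall with a bare tag name only looks at direct children of the element.
direct_cities = [c.find('name').text for c in country.findall('city')]

# iter walks the whole subtree, so cities nested inside provinces are included.
all_cities = [c.find('name').text for c in country.iter('city')]

print(direct_cities)
print(all_cities)
```

In this sample, `findall('city')` sees only the city that sits directly under the country, while `iter('city')` also reaches the one inside the province.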

In [3]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [9]:
# print names of all countries
for child in document_tree.getroot():
    print(child.find('name').text)

Albania
Greece
Macedonia
Serbia
Montenegro
Kosovo
Andorra


In [18]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    print('* ' + element.find('name').text + ':')
    # iter() walks the whole subtree; getiterator() was removed in Python 3.9
    city_names = [city.find('name').text for city in element.iter('city')]
    print('\t' + ', '.join(city_names))

* Albania:
	Tirana, Shkodër, Durrës, Vlorë, Elbasan, Korçë
* Greece:
	Komotini, Kavala, Athina, Peiraias, Peristeri, Acharnes, Patra, Kozani, Kerkyra, Ioannina, Thessaloniki, Iraklio, Chania, Ermoupoli, Rhodes, Tripoli, Lamia, Chalkida, Larissa, Volos, Mytilini, Karyes
* Macedonia:
	Skopje, Kumanovo
* Serbia:
	Beograd, Novi Sad, Niš
* Montenegro:
	Podgorica
* Kosovo:
	Prishtine
* Andorra:
	Andorra la Vella


****
## XML exercise

Using the data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

In [74]:
import pandas as pd
document = ET.parse( './data/mondial_database.xml' )

In [117]:
#Problem 1: 10 Countries with lowest infant mortality rates

#Init list for Country and Mortality Pairs
infMortList = []

for child in document.iterfind('country'):
    mortality = child.find('infant_mortality')
    if mortality is not None:
        #Appending [country, mortality] pairs to the larger list
        infMortList.append([child.find('name').text, float(mortality.text)])

#Changing everything over to a DataFrame
infMort = pd.DataFrame(infMortList,columns=['Country','Infant_Mortality'])

#Sorting and printing top ten countries with lowest Infant Mortality Rate
print(infMort.sort_values(by='Infant_Mortality', ascending=True).head(10))        


            Country  Infant_Mortality
36           Monaco              1.81
90            Japan              2.13
109         Bermuda              2.48
34           Norway              2.48
98        Singapore              2.53
35           Sweden              2.60
8    Czech Republic              2.63
72        Hong Kong              2.73
73            Macao              3.13
39          Iceland              3.15


In [195]:
#Problem 2: 10 cities with largest population

#Init City/Pop List
cityPops = []

#Finding all nested subelements named city (including the ones nested in provinces)
for city in document.findall('.//city'):
    for pop in city.findall("population"):
        #Selecting the 2011 population figures (seems to be the latest year in this dataset);
        #cities without a 2011 figure are left out of the ranking
        if pop.get('year') == '2011':
            cityPops.append([city.find('name').text, int(pop.text)])

#Creating the Pandas DataFrame from the lists
cityPopsDF = pd.DataFrame(cityPops,columns=['City','Population(2011)'])

#Sorting and printing top ten Cities with the highest Population
print(cityPopsDF.sort_values(by='Population(2011)', ascending=False).head(10))    

          City  Population(2011)
529     Mumbai          12442373
554      Delhi          11034555
523  Bangalore           8443675
418     London           8250205
487     Tehran           8154051
505      Dhaka           7423137
558  Hyderabad           6731790
518  Ahmadabad           5577940
627     Luanda           5000000
542    Chennai           4646732
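
The ranking above only considers cities that have a 2011 figure. A possible alternative, sketched below, is to take each city's most recent dated figure instead. The `sample_xml` data, the city names, and the helper `latest_population` are all hypothetical and exist only for this illustration; the sketch assumes every usable `<population>` element carries a `year` attribute.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample data; names and figures are invented for illustration.
sample_xml = """
<mondial>
  <country>
    <name>Exampleland</name>
    <city>
      <name>Alpha</name>
      <population year="2001">100</population>
      <population year="2011">150</population>
    </city>
    <province>
      <city>
        <name>Beta</name>
        <population year="2009">300</population>
      </city>
    </province>
    <city>
      <name>Gamma</name>
    </city>
  </country>
</mondial>
"""

def latest_population(element):
    """Return (year, population) for the newest dated <population> child, or None."""
    dated = [
        (int(pop.get('year')), int(pop.text))
        for pop in element.findall('population')
        if pop.get('year') is not None and pop.text
    ]
    # Tuples compare by year first, so max() picks the most recent entry.
    return max(dated) if dated else None

root = ET.fromstring(sample_xml)
rows = []
for city in root.iter('city'):
    latest = latest_population(city)
    if latest is not None:  # cities without any dated figure are skipped
        year, population = latest
        rows.append((city.find('name').text, year, population))

# Largest population first.
rows.sort(key=lambda row: row[2], reverse=True)
print(rows)
```

Keeping the year next to each figure makes it visible that the ranking mixes measurements from different years, which is the trade-off of this approach.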


In [255]:
#Problem 3: 10 ethnic groups with highest population (best, most recent estimates)

#Create the ethnic group dictionary
ethGrpsPop = {}

for cnt in document.findall('country'):
    #Take the last <population> element, assuming entries are listed oldest to newest
    currPop = int(cnt.findall('population')[-1].text)
    #Get all of the ethnic group elements
    for ethGrp in cnt.findall('ethnicgroup'):
        #Update or add to dictionary with ethnic group population
        ethGrpsPop[ethGrp.text] = ethGrpsPop.get(ethGrp.text,0) + round(currPop * float(ethGrp.attrib['percentage'])/100)

ethGrpsPopDF = pd.DataFrame.from_dict(ethGrpsPop, orient='index')
print(ethGrpsPopDF.sort_values(by=0, ascending = False).head(10))


                      0
Han Chinese  1245058800
Indo-Aryan    871815583
European      494872221
African       318325121
Dravidian     302713744
Mestizo       157734355
Bengali       146776917
Russian       131856994
Japanese      126534212
Malay         121993550
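
The calculation above relies on the newest population figure being the last one listed. A minimal sketch of a variant that selects the figure by its `year` attribute and accumulates group totals with a `defaultdict` is shown below. The countries, groups, and numbers in `sample_xml` are invented, and the sketch assumes each `<ethnicgroup>` carries a `percentage` attribute.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Hypothetical sample data; the first country lists its figures out of order on purpose.
sample_xml = """
<mondial>
  <country>
    <name>Exampleland</name>
    <population year="2011">1000</population>
    <population year="2001">800</population>
    <ethnicgroup percentage="60">Alpha</ethnicgroup>
    <ethnicgroup percentage="40">Beta</ethnicgroup>
  </country>
  <country>
    <name>Samplestan</name>
    <population year="2010">500</population>
    <ethnicgroup percentage="50">Alpha</ethnicgroup>
  </country>
</mondial>
"""

root = ET.fromstring(sample_xml)
group_totals = defaultdict(float)

for country in root.findall('country'):
    populations = [p for p in country.findall('population') if p.get('year')]
    if not populations:
        continue  # no dated figure, so nothing to scale the percentages by
    # Pick the figure with the largest year, regardless of document order.
    latest = max(populations, key=lambda p: int(p.get('year')))
    country_pop = int(latest.text)
    for group in country.findall('ethnicgroup'):
        share = float(group.get('percentage', 0)) / 100
        group_totals[group.text] += country_pop * share

# Largest group first.
top_groups = sorted(group_totals.items(), key=lambda item: item[1], reverse=True)
print(top_groups)
```

Totals are kept as floats here and could be rounded once at the end, rather than once per country.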


In [286]:
#Problem 4: Name and Country of a) longest river, b) largest lake and c) airport at highest elevation

#Problem 4a:
longest = 0
for rvr in document.findall('river'):
    if rvr.find('length') is not None and float(rvr.find('length').text) > longest:
        rvrName = rvr.find('name').text
        rvrCnt = rvr.attrib['country']
        rvrLen = rvr.find('length').text
        longest = float(rvrLen)

print('The longest river is the ' + rvrName, end= '. ')
print('It is located in ' + rvrCnt, end= ' ')
print('and it is %s units long' % rvrLen)

#Problem 4b:
largest = 0
for lk in document.findall('lake'):
    if lk.find('area') is not None and float(lk.find('area').text) > largest:
        lkName = lk.find('name').text
        lkCnt = lk.attrib['country']
        lkAr = lk.find('area').text
        largest = float(lkAr)

print('The largest lake is the ' + lkName, end= '. ')
print('It is located in ' + lkCnt, end= ' ')
print('and it is %s units large' % lkAr)

#Problem 4c:
highest = 0
for air in document.findall('airport'):
    if air.find('elevation') is not None and air.find('elevation').text is not None and float(air.find('elevation').text) > highest:
        airName = air.find('name').text
        airCnt = air.attrib['country']
        airHigh = air.find('elevation').text
        highest = float(airHigh)

print('The highest airport is the ' + airName, end= '. ')
print('It is located in ' + airCnt, end= ' ')
print('and it is %s units high' % airHigh)

The longest river is the Amazonas. It is located in CO BR PE and it is 6448 units long
The largest lake is the Caspian Sea. It is located in R AZ KAZ IR TM and it is 386400 units large
The highest airport is the El Alto Intl. It is located in BOL and it is 4063 units high
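
The three loops above share one pattern, and the results report country codes rather than names. The sketch below folds the pattern into a hypothetical helper, `largest_by`, and maps codes to names. It assumes each `<country>` element exposes its code in a `car_code` attribute, which should be checked against the actual file. The sample data is a small stand-in: the Amazonas entry mirrors the output above, the other two rivers are invented.

```python
from xml.etree import ElementTree as ET

# Small stand-in document; "Shortriver" and "Unmeasured" are invented entries.
sample_xml = """
<mondial>
  <country car_code="CO"><name>Colombia</name></country>
  <country car_code="BR"><name>Brazil</name></country>
  <country car_code="PE"><name>Peru</name></country>
  <river country="CO BR PE">
    <name>Amazonas</name>
    <length>6448</length>
  </river>
  <river country="BR">
    <name>Shortriver</name>
    <length>120</length>
  </river>
  <river country="PE">
    <name>Unmeasured</name>
  </river>
</mondial>
"""

def largest_by(root, tag, measure):
    """Return the <tag> element with the largest numeric <measure> child, or None."""
    candidates = []
    for element in root.findall(tag):
        value = element.findtext(measure)
        if value and value.strip():  # skips missing children and empty text
            candidates.append((float(value), element))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]

root = ET.fromstring(sample_xml)

# Assumed attribute name: car_code holds the code used in the country="..." lists.
code_to_name = {c.get('car_code'): c.findtext('name') for c in root.findall('country')}

river = largest_by(root, 'river', 'length')
# Fall back to the raw code when it is not in the lookup table.
river_countries = [code_to_name.get(code, code) for code in river.get('country', '').split()]
print(river.findtext('name'), river_countries)
```

The same helper could be called with `('lake', 'area')` or `('airport', 'elevation')`, provided those elements follow the same layout.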
