
# Reading and Writing XML Documents in Python

<br>

Exercises, Web Services and XML, 08.03.2017

Asst. Prof. Dr. Светлана Кордумова Трајанова

<br>

As we discussed in the lectures, XML, or Extensible Markup Language, is a markup language that is commonly used to structure, store, and transfer data between systems. While not as common as it used to be, it is still used in services like RSS and SOAP, as well as for structuring files like Microsoft Office documents.

With Python being a popular language for the web and data analysis, it's likely you'll need to read or write XML data at some point, in which case you're in luck.

Throughout this article we'll primarily take a look at the **ElementTree** module for reading, writing, and modifying XML data. We'll also compare it with the older **minidom** module in the first few sections so you can get a good comparison of the two.

## The XML Modules

The minidom, or Minimal DOM Implementation, is a simplified implementation of the Document Object Model (DOM). The DOM is an application programming interface that treats XML as a tree structure, where each node in the tree is an object. Thus, the use of this module requires that we are familiar with its functionality.

The ElementTree module provides a more "Pythonic" interface to handling XML and is a good option for those not familiar with the DOM. It is also likely a better candidate to be used by more novice programmers due to its simple interface, which you'll see throughout this article.

In this tutorial, the ElementTree module will be used in all examples, whereas minidom will also be demonstrated, but only for counting and reading XML documents.

## XML File Example

In the examples below, we will be using the following XML text, which you should save as "items.xml":



In [1]:
xmlcontents = """
<data>  
     <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>"""

print(xmlcontents)


<data>  
     <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>


As you can see, it's a fairly simple XML example, only containing a few nested objects and one attribute. However, it should be enough to demonstrate all of the XML operations in this tutorial.
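The cell above only prints the text; it does not create the file that the later examples read. A minimal sketch of one way to save it from Python, assuming the working directory is writable (the variable names `xml_text` and `root_tag` are chosen for this sketch):

```python
import xml.etree.ElementTree as ET

# Same XML text as in the example; xml_text is a name chosen for this sketch.
xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>"""

# Write the text to items.xml in the current working directory.
with open("items.xml", "w", encoding="utf-8") as f:
    f.write(xml_text)

# Parse the file back as a quick check that it is well-formed.
root_tag = ET.parse("items.xml").getroot().tag
print(root_tag)
```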


## Reading XML Documents

### Using minidom

In order to parse an XML document using minidom, we must first import it from the xml.dom module. 

In [2]:
from xml.dom import minidom

This module uses the parse function to create a DOM object from our XML file. The parse function has the following syntax:

`xml.dom.minidom.parse(filename_or_file[, parser[, bufsize]])`

Here `filename_or_file` can be a string containing the file path or a file-like object. The function returns a `Document` object that represents the whole XML document. We can then use its `getElementsByTagName()` method to find all elements with a specific tag.

Since each node can be treated as an object, we can access the attributes and text of an element using the properties of the object. In the example below, we have accessed the attributes and text of a specific node, and of all nodes together.

In [3]:
# parse an xml file by name
mydoc = minidom.parse('items.xml')

items = mydoc.getElementsByTagName('item')

# one specific item attribute
print('Item #2 attribute:')  
print(items[1].attributes['name'].value)

# all item attributes
print('\nAll attributes:')  
for elem in items:  
    print(elem.attributes['name'].value)

# one specific item's data
print('\nItem #2 data:')  
print(items[1].firstChild.data)  
print(items[1].childNodes[0].data)

# all items data
print('\nAll item data:')  
for elem in items:  
    print(elem.firstChild.data)


### Using ElementTree

`ElementTree` presents us with a very simple way to process XML files. As always, in order to use it we must first import the module. In our code we use the `import` command with the `as` keyword, which allows us to use a simplified name (`ET` in this case) for the module in the code.

Following the import, we create a tree structure with the `parse` function, and we obtain its root element. Once we have access to the root node we can easily traverse the tree, because every other node can be reached from the root.

As in the previous code example, we obtain the attributes and text of each node through the object that represents it.

The code is as follows:

In [None]:
import xml.etree.ElementTree as ET  

tree = ET.parse('items.xml')  
root = tree.getroot()

# one specific item attribute
print('Item #2 attribute:')  
print(root[0][1].attrib)

# all item attributes
print('\nAll attributes:')  
for elem in root:  
    for subelem in elem:
        print(subelem.attrib)

# one specific item's data
print('\nItem #2 data:')  
print(root[0][1].text)

# all items data
print('\nAll item data:')  
for elem in root:  
    for subelem in elem:
        print(subelem.text)

As you can see, this is very similar to the `minidom` example. One of the main differences is that the `attrib` object is simply a dictionary object, which makes it a bit more compatible with other Python code. We also don't need to use `value` to access the item's attribute value like we did before.

You may have noticed how accessing objects and attributes with `ElementTree` is a bit more Pythonic, as we mentioned before. This is because elements behave like lists of their children and attributes are stored in plain dictionaries, unlike with `minidom`, where they are represented by custom `xml.dom.minidom.Attr` objects and DOM Text nodes.
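To make the difference concrete, here is a small sketch that parses the same one-item document with both modules and prints the types each one hands back. It uses `parseString` and `fromstring` on an inline string (a shortened version of the example, chosen for this sketch) so it does not depend on "items.xml".

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Shortened version of the example document, used only in this sketch.
xml_text = '<data><items><item name="item1">item1abc</item></items></data>'

# minidom: attributes and text are wrapped in DOM node objects.
dom_item = minidom.parseString(xml_text).getElementsByTagName("item")[0]
dom_attr = dom_item.attributes["name"]
print(type(dom_attr).__name__, dom_attr.value)
print(type(dom_item.firstChild).__name__, dom_item.firstChild.data)

# ElementTree: attributes are a plain dict and text is a plain str.
et_item = ET.fromstring(xml_text).find("./items/item")
print(type(et_item.attrib).__name__, et_item.attrib)
print(type(et_item.text).__name__, et_item.text)
```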


## Counting the Elements of an XML Document

### Using minidom

As in the previous case, `minidom` must be imported from the `xml.dom` module. It provides the method `getElementsByTagName`, which we'll use to find the `item` tags. Once obtained, we use the built-in `len()` function to obtain the number of matching elements.

In [None]:
from xml.dom import minidom

# parse an xml file by name
mydoc = minidom.parse('items.xml')

items = mydoc.getElementsByTagName('item')

# total amount of items
print(len(items)) 

Keep in mind that `getElementsByTagName` searches the entire subtree below the node it is called on (here, the whole document), so `len(items)` counts every `item` element at any depth, not just direct children. It counts only elements with the given tag name; to count elements of every kind in a larger tree, you'd need to traverse all of them.
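The sketch below contrasts a few ways of counting with `minidom` on the example document, parsed from an inline string so that it runs without "items.xml". It uses the `"*"` wildcard, which `getElementsByTagName` treats as "any tag"; the variable names are chosen for this sketch.

```python
from xml.dom import minidom

xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>"""

doc = minidom.parseString(xml_text)

# Every <item> element anywhere in the document.
item_count = len(doc.getElementsByTagName("item"))

# Every element of any tag: data, items, and the two items.
all_count = len(doc.getElementsByTagName("*"))

# childNodes also contains the whitespace Text nodes between the tags,
# so filter by node type to count only the element children of <items>.
items_node = doc.getElementsByTagName("items")[0]
element_children = [n for n in items_node.childNodes
                    if n.nodeType == n.ELEMENT_NODE]

print(item_count, all_count, len(items_node.childNodes), len(element_children))
```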

### Using ElementTree

Similarly, the ElementTree module allows us to count the elements directly under a node. Here `root[0]` is the `items` element, so `len(root[0])` is the number of its direct children.

Example code:

In [None]:
import xml.etree.ElementTree as ET  
tree = ET.parse('items.xml')  
root = tree.getroot()

# total amount of items
print(len(root[0]))
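For comparison, a short sketch of three different counts with ElementTree on the same document (parsed from an inline string here; the variable names are chosen for this sketch): direct children of one node, all elements with a given tag at any depth, and all elements in the tree.

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>"""

root = ET.fromstring(xml_text)

# Direct children of the first child of the root (the <items> element).
direct_children = len(root[0])

# All <item> elements at any depth, found with a path expression.
all_items = len(root.findall(".//item"))

# Every element in the tree, including the root itself.
total_elements = sum(1 for _ in root.iter())

print(direct_children, all_items, total_elements)
```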

## Writing XML Documents

### Using ElementTree

`ElementTree` is also great for writing data to XML files. The code below shows how to create an XML file with the same structure as the file we used in the previous examples.

The steps are:

1. Create an element, which will act as our root element. In our case the tag for this element is "data".
2. Once we have our root element, create sub-elements by using the `SubElement` function. This function has the syntax:

   `SubElement(parent, tag, attrib={}, **extra)`

   Here `parent` is the parent node to connect to, `tag` is the name of the new element, `attrib` is a dictionary containing the element attributes, and `extra` are additional attributes given as keyword arguments. The function returns the new element, which can in turn be used as the parent of other sub-elements, as we do in the following lines by passing `items` to `SubElement`.
3. Set the attributes. Although we can add our attributes with the `SubElement` function, we can also use the `set()` method, as we do in the following code. The element text is set through the `text` property of the `Element` object.
4. In the last 3 lines of the code below we create a string out of the XML tree, and we write that data to a file we open.

Example code:

In [None]:
import xml.etree.ElementTree as ET

# create the file structure
data = ET.Element('data')  
items = ET.SubElement(data, 'items')  
item1 = ET.SubElement(items, 'item')  
item2 = ET.SubElement(items, 'item')  
item1.set('name','item1')  
item2.set('name','item2')  
item1.text = 'item1abc'  
item2.text = 'item2abc'

# create a new XML file with the results
mydata = ET.tostring(data, encoding="unicode", method="xml")
with open("items2.xml", "w") as myfile:  # the with block closes the file for us
    myfile.write(mydata)
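As mentioned, attributes can also be passed straight to `SubElement` instead of calling `set()` afterwards. The sketch below builds the same tree that way, once with the `attrib` dictionary and once with a keyword argument, and only serializes it to a string (no file is written).

```python
import xml.etree.ElementTree as ET

data = ET.Element("data")
items = ET.SubElement(data, "items")

# Attributes passed as a dictionary (the attrib parameter).
item1 = ET.SubElement(items, "item", {"name": "item1"})
item1.text = "item1abc"

# Attributes passed as keyword arguments (the **extra parameter).
item2 = ET.SubElement(items, "item", name="item2")
item2.text = "item2abc"

xml_string = ET.tostring(data, encoding="unicode")
print(xml_string)
```

Both forms produce the same kind of element; the keyword form is shorter but only works for attribute names that are valid Python identifiers.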

## Finding XML Elements

### Using ElementTree

The ElementTree module offers the `findall()` method, which helps us find specific items in the tree. It returns all sub-elements that match the specified condition. In addition, there is the `find()` method, which returns only the first sub-element that matches the specified criteria. The syntax for both of these methods is as follows:

`findall(match, namespaces=None)` <br><br>
`find(match, namespaces=None)  `

For both of these methods the `match` parameter can be an XML tag name or a path. `findall()` returns a list of elements (empty if nothing matches), and `find()` returns a single object of type `Element`, or `None` if nothing matches.

In addition, there is another helper function that returns the text of the first node that matches the given criterion:

`findtext(match, default=None, namespaces=None)  `

Here is some example code to show you exactly how these functions operate:

In [None]:
import xml.etree.ElementTree as ET  
tree = ET.parse('items.xml')  
root = tree.getroot()

# find the first 'item' object
for elem in root:  
    print(elem.find('item').get('name'))
    
print('\nfind all: \n')

# find all "item" objects and print their "name" attribute
for elem in root:  
    for subelem in elem.findall('item'):

        # if we don't need to know the name of the attribute(s), get the dict
        print(subelem.attrib)      

        # if we know the name of the attribute, access it directly
        print(subelem.get('name'))
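The example above does not use `findtext()` or show what happens when nothing matches. A small sketch of both cases, plus a path with an attribute predicate, on the example document parsed from an inline string (the tag `nosuchtag` and the variable names are made up for this sketch):

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>"""

root = ET.fromstring(xml_text)

# Text of the first element matching the path.
first_text = root.findtext("./items/item")

# find() returns None when nothing matches, so check before using the result.
missing = root.find("./items/nosuchtag")

# findtext() returns the given default instead of None when nothing matches.
fallback = root.findtext("./items/nosuchtag", default="n/a")

# A path can also filter on an attribute value.
second = root.find(".//item[@name='item2']")

print(first_text, missing, fallback, second.text)
```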

## Modifying XML Elements

### Using ElementTree

The `ElementTree` module presents several tools for modifying existing XML documents. The example below shows how to change the text of a node, how to modify the value of an attribute, and how to add an extra attribute to an element.

A node's text can be changed by assigning the new value to the `text` field of the node object. An attribute's value can be changed by using the `set(name, value)` method. The `set` method doesn't have to work on an existing attribute; it can also be used to define a new attribute.

The code below shows how to perform these operations:

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')  
root = tree.getroot()

# changing a field text
for elem in root.iter('item'):  
    elem.text = 'new text'

# modifying an attribute
for elem in root.iter('item'):  
    elem.set('name', 'newitem')

# adding an attribute
for elem in root.iter('item'):  
    elem.set('name2', 'newitem2')

tree.write('newitems.xml')  

After running the code, the resulting XML file "newitems.xml" will have an XML tree with the following data:

`<data>  
    <items>
        <item name="newitem" name2="newitem2">new text</item>
        <item name="newitem" name2="newitem2">new text</item>
    </items>
</data>  `

As we can see when comparing with the original XML file, the `name` attribute of each item element has changed to "newitem", the text to "new text", and the attribute "name2" has been added to both nodes.

You may also notice that the file still contains newlines and indentation. `tree.write` does not add this formatting itself; the whitespace comes from the parsed "items.xml" file, where ElementTree kept it as part of the elements' text and tail.
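A short sketch of where the whitespace in the written file lives: for a parsed document, ElementTree stores it in the `text` and `tail` fields of the elements, whereas a tree built from scratch has none. The document is parsed from an inline string here and the variable names are chosen for this sketch.

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
    </items>
</data>"""

root = ET.fromstring(xml_text)

# The newline and indentation before <items> is stored as the root's text,
# and the whitespace after </item> is stored as that element's tail.
print(repr(root.text))
print(repr(root[0][0].tail))

# Serializing a parsed tree reproduces that whitespace.
round_trip = ET.tostring(root, encoding="unicode")

# A tree built in code has no such whitespace, so it is written on one line.
built = ET.Element("data")
ET.SubElement(built, "items")
built_string = ET.tostring(built, encoding="unicode")
print(built_string)
```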

## Creating XML Sub-Elements

### Using ElementTree

The `ElementTree` module has more than one way to add a new element. The first way we'll look at is by using the `makeelement()` function, which has the node name and a dictionary with its attributes as parameters.

The second way is through the `SubElement()` function, which takes in the parent element, the tag of the new element, and a dictionary of attributes as inputs.

In our example below we show both methods. In the first case the node has no attributes, so we created an empty dictionary (`attrib = {}`). In the second case, we use a populated dictionary to create the attributes.

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')
root = tree.getroot()

# adding an element to the root node
attrib = {}
element = root.makeelement('seconditems', attrib)
root.append(element)

# adding an element to the seconditems node
attrib = {'name2': 'secondname2'}
ET.SubElement(root[1], 'seconditem', attrib)
root[1][0].text = 'seconditemabc'

# create a new XML file with the new element
tree.write('newitems2.xml')

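After running this code the resulting XML file will have the following content (the new elements are shown indented here for readability; `tree.write` writes them without added whitespace):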

`<data>  
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
    </items>
    <seconditems>
         <seconditem name2="secondname2">seconditemabc</seconditem>
    </seconditems>
</data>  `

As we can see when comparing with the original file, the "seconditems" element and its sub-element "seconditem" have been added. In addition, the "seconditem" node has "name2" as an attribute, and its text is "seconditemabc", as expected.

## Deleting XML Elements

### Using ElementTree

As you'd probably expect, the `ElementTree` module has the necessary functionality to delete a node's attributes and sub-elements.

#### Deleting an attribute

The code below shows how to remove a node's attribute by using the `pop()` method of the element's `attrib` dictionary. The first argument is the name of the attribute to remove; the second argument, `None`, is the value returned if the attribute does not exist, which prevents a `KeyError`.

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')  
root = tree.getroot()

# removing an attribute
root[0][0].attrib.pop('name', None)

# create a new XML file with the results
tree.write('newitems3.xml')  

The result will be the following XML file:

`<data>  
    <items>
        <item>item1abc</item>
        <item name="item2">item2abc</item>
    </items>
</data>  `

As we can see in the XML code above, the first item has no attribute "name".
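Since `attrib` is an ordinary dictionary, the usual dictionary behaviour applies. The sketch below, on a single inline element, shows what `pop()` returns and how it differs from `del` when the attribute is already gone (the variable names are chosen for this sketch):

```python
import xml.etree.ElementTree as ET

item = ET.fromstring('<item name="item1">item1abc</item>')

# pop() removes the attribute and returns its value.
removed = item.attrib.pop("name", None)

# Popping again is harmless: the default (None) is returned instead.
removed_again = item.attrib.pop("name", None)

# del, by contrast, raises KeyError for a missing attribute.
try:
    del item.attrib["name"]
    del_failed = False
except KeyError:
    del_failed = True

result = ET.tostring(item, encoding="unicode")
print(removed, removed_again, del_failed, result)
```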

#### Deleting one sub-element

One specific sub-element can be deleted using the `remove()` method. It is called on the parent element and takes the child node that we want to remove as its argument.

The following example shows us how to use it:

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')  
root = tree.getroot()

# removing one sub-element
root[0].remove(root[0][0])

# create a new XML file with the results
tree.write('newitems4.xml')  

The result will be the following XML file:

`<data>  
    <items>
        <item name="item2">item2abc</item>
    </items>
</data>  `

As we can see from the XML code above, there is now only one "item" node. The first one has been removed from the original tree.
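When several children have to be removed based on a condition, removing them while looping over the parent directly can skip elements. One common approach, sketched here on an inline document with three items (made up for this sketch), is to loop over a copy of the child list:

```python
import xml.etree.ElementTree as ET

xml_text = """<data>
    <items>
        <item name="item1">item1abc</item>
        <item name="item2">item2abc</item>
        <item name="item3">item3abc</item>
    </items>
</data>"""

root = ET.fromstring(xml_text)
items = root.find("items")

# list(items) is a snapshot of the children, so removing from the
# parent inside the loop does not disturb the iteration.
for child in list(items):
    if child.get("name") != "item2":
        items.remove(child)

remaining = [child.get("name") for child in items]
print(remaining)
```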

#### Deleting all sub-elements

The `ElementTree` module presents us with the `clear()` method, which can be used to remove all sub-elements of a given element. Note that it also removes the element's own attributes and text.

The example below shows us how to use `clear()`:

In [None]:
import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')  
root = tree.getroot()

# removing all sub-elements of an element
root[0].clear()

# create a new XML file with the results
tree.write('newitems5.xml')  

The result will be the following XML file:

`<data>  
    <items />
</data>  `

As we can see in the XML code above, all sub-elements of the "items" element have been removed from the tree.
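The example's "items" element has no attributes or text of its own, so this side effect of `clear()` is not visible there. A small sketch on an inline element that does have them (the attribute `kind` and the variable names are made up for this sketch), together with one way to drop only the children:

```python
import xml.etree.ElementTree as ET

# clear() resets the element completely: children, attributes and text.
items = ET.fromstring('<items kind="demo">text<item name="item1">item1abc</item></items>')
items.clear()
cleared = ET.tostring(items, encoding="unicode")
print(cleared)

# To keep the attributes, remove the children one by one instead.
keep = ET.fromstring('<items kind="demo"><item /><item /></items>')
for child in list(keep):
    keep.remove(child)
kept = ET.tostring(keep, encoding="unicode")
print(kept)
```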

## Wrapping Up

Python offers several options to handle XML files. The ElementTree module is much easier to work with and it's recommended over the minidom module.

## New example and exercise to be completed and submitted


In the next example we will also use the pandas library. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

data source: http://www.dbis.informatik.uni-goettingen.de/Mondial


In [None]:
from xml.etree import ElementTree as ET

In [None]:
document_tree = ET.parse( './data/mondial_database_less.xml' )

In [None]:
import pandas as pd

In [None]:
# print names of all countries
for child in document_tree.getroot():
    print (child.find('name').text)

In [None]:
# print names of all countries and their cities
for element in document_tree.iterfind('country'):
    # collect the names of every city nested anywhere under this country
    city_names = [city.find('name').text for city in element.iter('city')]
    print('* ' + element.find('name').text + ': ' + ', '.join(city_names))

In [None]:
document_root = document_tree.getroot()

In [None]:
#the first order elements in root
for child in document_root:
    print (child.tag)

In [None]:
# print the infant_mortality for each country
for element in document_tree.iterfind('country'):
    # join handles the separators, so there is no trailing comma to strip
    infant_mortal = ', '.join(sub.text for sub in element.iter('infant_mortality'))
    print('* ' + element.find('name').text + ': ' + infant_mortal)

In [None]:
document_root[1].attrib

In [None]:
for child in document_root[0]: #seeing the children under the main elements
    print(child.tag)

In [None]:
for child in document_root[0]: #seeing the children under the main elements
    print(child.attrib)

****
## XML exercise

Using data in 'data/mondial_database.xml', the examples above, and referring to https://docs.python.org/2.7/library/xml.etree.elementtree.html, find

1. 10 countries with the lowest infant mortality rates
2. 10 cities with the largest population
3. 10 ethnic groups with the largest overall populations (sum of best/latest estimates over all countries)
4. name and country of a) longest river, b) largest lake and c) airport at highest elevation

# Answer 1

Finding the 10 lowest infant mortality rates and the countries associated with them

In [None]:
from xml.etree import ElementTree as ET
import pandas as pd
import numpy as np

In [None]:
document = ET.parse( './data/mondial_database.xml' )

In [None]:
root = document.getroot()

In [None]:
# find all the infant mortality rates and sort them
inf_list = []
for country in document.iterfind('country'):
    # countries without an infant_mortality element contribute nothing,
    # so missing data cannot show up among the lowest values
    for infmor in country.iter('infant_mortality'):
        inf_list.append(float(infmor.text))
inf_list.sort()
short_inf_list = inf_list[:10]

In [None]:
a = root.findall("./country[infant_mortality='1.81']/name")  # finding the right syntax to look up the name of the country
a[0].text                                                     # corresponding to a mortality rate

In [None]:
root.findall("./country[infant_mortality='1.81']/name")[0].text

In [None]:
d = "./country[infant_mortality='1.81']/name"
root.findall(d)[0].text

In [None]:
#Answer 1!
for entry in short_inf_list:
    # note: this lookup relies on str(entry) matching the text in the file exactly
    d = "./country[infant_mortality='" + str(entry) + "']/name"
    print(root.findall(d)[0].text + ': ' + str(entry))
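Looking the country up again by its rate works only when the number converts back to exactly the same text, and it cannot tell apart two countries with the same rate. A possible alternative, sketched here on a made-up miniature of the country layout (the names and values are hypothetical, not taken from the Mondial data), is to keep each rate together with its country name from the start:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the country layout, made up for this sketch.
sample = """<mondial>
  <country><name>A</name><infant_mortality>4.5</infant_mortality></country>
  <country><name>B</name><infant_mortality>13</infant_mortality></country>
  <country><name>C</name></country>
  <country><name>D</name><infant_mortality>2.1</infant_mortality></country>
</mondial>"""

sample_root = ET.fromstring(sample)

rates = []
for country in sample_root.iterfind("country"):
    rate = country.findtext("infant_mortality")
    if rate is not None:  # skip countries without data
        rates.append((float(rate), country.findtext("name")))

# Tuples sort by their first field, so this orders by rate.
lowest = sorted(rates)[:2]
for rate, name in lowest:
    print(name + ": " + str(rate))
```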
    

# Answer 2

Finding the top 10 cities by population

In [None]:
#Some exploration of the data
for child in document_root[1]: #seeing the children under the main elements
    print(child.tag)

In [None]:
for child in document_root[0]: #seeing the children under the main elements
    print(child.attrib)

In [None]:
document = ET.parse('./data/mondial_database.xml')  # re-parsed so this answer is independent of the previous one

rows = []  # one (city name, population) pair per city

# loop through each country element to find its cities and their populations
for country in document.iterfind('country'):
    for city in country.iter('city'):  # all cities within the country, at any depth
        cityname = city.find('name').text
        latest_year = -1
        citypopulation = None  # stays None when the city has no population element
        for node in city.iterfind('population'):  # one element per 'year' attribute
            year = int(node.attrib.get('year', 0))
            if year >= latest_year:  # keep the population of the latest year
                latest_year = year
                citypopulation = int(node.text)
        rows.append((cityname, citypopulation))

# data frame holding each city name and its population
df = pd.DataFrame(rows, columns=['CityName', 'Population'])
df['Population'] = df['Population'].astype(float)

# sort data frame to find the 10 cities with the largest population
df.sort_values(by='Population', ascending=False).head(10)

## Task: List the 10 cities with the smallest population.

In [4]:
import xml.etree.ElementTree as ET


cities = [
    {"name": "Hum", "population": 30},
    {"name": "Buford", "population": 2},
    {"name": "Lost Springs", "population": 4},
    {"name": "Durbuy", "population": 11},
    {"name": "Valvasone", "population": 13},
    {"name": "Melnik", "population": 250},
    {"name": "Bonanza", "population": 10},
    {"name": "Villa Epecuén", "population": 1},
    {"name": "Lanarca", "population": 5},
    {"name": "Monowi", "population": 1},
]


root = ET.Element("Cities")

for city in cities:
    city_element = ET.SubElement(root, "City")
    name = ET.SubElement(city_element, "Name")
    name.text = city["name"]
    population = ET.SubElement(city_element, "Population")
    population.text = str(city["population"])

tree = ET.ElementTree(root)
tree.write("cities.xml", encoding="utf-8", xml_declaration=True)
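The cell above writes a hand-entered list to "cities.xml" but does not order it. As a sketch of the remaining step, the code below reads cities in the same `Cities`/`City`/`Name`/`Population` layout from an inline string (three of the entries above, so that it runs without the file) and sorts them by population; the variable names are chosen for this sketch.

```python
import xml.etree.ElementTree as ET

# Same layout as cities.xml, shortened to three entries for this sketch.
sample = """<Cities>
  <City><Name>Hum</Name><Population>30</Population></City>
  <City><Name>Buford</Name><Population>2</Population></City>
  <City><Name>Melnik</Name><Population>250</Population></City>
</Cities>"""

cities_root = ET.fromstring(sample)

# Pair each city name with its population converted to an integer.
pairs = [(city.findtext("Name"), int(city.findtext("Population")))
         for city in cities_root.iterfind("City")]

# Sort ascending by population and keep the smallest entries.
smallest = sorted(pairs, key=lambda pair: pair[1])[:2]
for name, population in smallest:
    print(name, population)
```

For the full task the slice would be `[:10]`, applied to the cities parsed from the Mondial file rather than to hand-entered values.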
