# ElementTree Python Package

This is a short tutorial on using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.

XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading from and writing to files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
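
As a minimal sketch of how the two classes relate, the snippet below parses a small inline XML string (made up for this illustration, not the tutorial's country_data.xml), wraps the root Element in an ElementTree, and writes the document to an in-memory buffer instead of a file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration
xml_text = "<data><country name='Liechtenstein'><rank>1</rank></country></data>"

# ET.fromstring returns the root Element directly
root = ET.fromstring(xml_text)

# ET.ElementTree wraps a root Element in a document-level object
tree = ET.ElementTree(root)

# Document-level work (here, writing) goes through the ElementTree
buffer = io.BytesIO()
tree.write(buffer)

# Element-level work goes through Element objects
same_root = tree.getroot()
print(type(tree).__name__, type(root).__name__)
print(same_root is root)
print(buffer.getvalue().decode())
```

ET.parse, used later in this tutorial, does the parsing and the wrapping in one step and returns the ElementTree.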

<span style="color:red">*WARNING*</span> The xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the XML vulnerabilities section of the Python documentation.

## ElementTree vs lxml vs BeautifulSoup

Ultimately it's a matter of preference. However, BeautifulSoup is by far the slowest and least robust option. ElementTree is specifically meant for XML, lxml is meant for XML and HTML, and BeautifulSoup is meant for HTML but allows you to parse XML as well. If you have massive XML documents, use ElementTree or lxml. If speed is not a factor, use whatever you're familiar with.

You can go into a [StackOverflow rabbit hole](https://www.google.com/search?q=best+xml+parser+python+site:stackoverflow.com&rlz=1C1GCEU_enUS846US846&sxsrf=ALeKk03CZzALP592i7maG6QdTcC-CJHzrg:1611856624975&sa=X&ved=2ahUKEwjntr3smb_uAhUwHjQIHUmiAj8QrQIoBHoECAIQBQ&biw=1920&bih=969) about which to use if you dare. 

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET

In [2]:
# Import data by reading from file 
tree = ET.parse('country_data.xml')
print(tree, '\n')

# Convert tree to get the root
root = tree.getroot()
print(root, '\n')

<xml.etree.ElementTree.ElementTree object at 0x00000155E1E9A148> 

<Element 'data' at 0x00000155E20A0A98> 



As an Element, root has a tag and a dictionary of attributes:

In [3]:
root.tag

'data'

In [4]:
root.attrib

{}

It also has child nodes over which we can iterate:

In [5]:
for child in root:
    print(child.tag, child.attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


Children are nested, and we can access specific child nodes by index:

In [6]:
root[0][1].text

'2008'
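
To make the index access concrete without the data file, here is a small sketch on a hypothetical inline document that mirrors the first two children of each country. It assumes nothing beyond the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure used in this tutorial
xml_text = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>
"""
root = ET.fromstring(xml_text)

# An Element behaves like a list of its direct children
print(len(root))            # number of <country> children
first_country = root[0]     # first <country>
year = first_country[1]     # second child of that country, i.e. <year>
print(year.tag, year.text)

# Negative indices and slices also work
print(root[-1].get("name"))
print([child.tag for child in first_country[:2]])
```

Index access depends on the order of the children, so it is fragile if the document layout changes; looking children up by tag is usually the safer choice.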

# Diving Deeper into the Elements

Element.iter() iterates over this element and all elements below it, optionally filtered by tag<br>
Element.findall() finds all elements with a given tag that are direct children of the current element<br>
Element.find() finds the first direct child with a particular tag<br>
Element.text accesses the element’s text content<br>
Element.get() accesses the element’s attributes
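
The differences between these methods can be sketched on a small example. The document below is hypothetical and deliberately nests the neighbor elements one level deeper than in country_data.xml, so that iter() and findall() give different results.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; <neighbor> is nested one level deeper on purpose
xml_text = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbors>
            <neighbor>Austria</neighbor>
            <neighbor>Switzerland</neighbor>
        </neighbors>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <neighbors>
            <neighbor>Costa Rica</neighbor>
        </neighbors>
    </country>
</data>
"""
root = ET.fromstring(xml_text)

# iter() walks the whole subtree, so it reaches the nested <neighbor> elements
all_neighbors = [n.text for n in root.iter("neighbor")]

# findall() only looks at direct children, so it finds none from the root
direct_neighbors = root.findall("neighbor")

# find() returns the first matching direct child (or None)
first_country = root.find("country")

# .text reads element content, .get() reads an attribute
first_rank = first_country.find("rank").text
first_name = first_country.get("name")

print(all_neighbors)
print(direct_neighbors)
print(first_name, first_rank)
```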

In [7]:
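# Get the attribute dictionary of each neighbor element
# (empty here because the neighbor elements in this file carry no attributes)
print([neighbor.attrib for neighbor in root.iter('neighbor')], '\n') 

# Get the attribute names
print([neighbor.keys() for neighbor in root.iter('neighbor')], '\n') 

# Get the attributes as (name, value) pairs
print([neighbor.items() for neighbor in root.iter('neighbor')], '\n') 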

[{}, {}, {}] 

[[], [], []] 

[[], [], []] 



In [8]:
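# Iterate over the sub-elements of the first country element
print([country for country in root.find('country')], '\n')

# Find all country elements that are direct children of the root
print([country for country in root.findall('country')], '\n')

# findtext returns the text of the first country element as a string,
# so this iterates over its characters (only whitespace here)
print([country for country in root.findtext('country')])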

[<Element 'rank' at 0x00000155E20A0B38>, <Element 'year' at 0x00000155E20A0B88>, <Element 'gdppc' at 0x00000155E20A0BD8>, <Element 'neighbor' at 0x00000155E20A0C28>, <Element 'direction' at 0x00000155E20A0C78>] 

[<Element 'country' at 0x00000155E20A0AE8>, <Element 'country' at 0x00000155E20A0CC8>, <Element 'country' at 0x00000155E20A0EA8>] 

['\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


In [9]:
# Note that this fails: rank is not a direct child of root, so findtext returns None
print([country for country in root.findtext('rank')])

TypeError: 'NoneType' object is not iterable

In [10]:
# To make it work, search below the direct children
# './/rank' matches rank elements on all levels beneath the current element;
# findtext returns the text of the first match
print([country for country in root.findtext('.//rank')])

['1']
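
A short sketch of what findtext() actually returns may help here. It uses a hypothetical one-country document; the second positional argument of findtext() is the default returned when nothing matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration
xml_text = "<data><country name='Liechtenstein'><rank>1</rank></country></data>"
root = ET.fromstring(xml_text)

# <rank> is not a direct child of <data>, so the plain tag finds nothing
missing = root.findtext("rank")

# The second argument replaces None when nothing matches
fallback = root.findtext("rank", "n/a")

# './/rank' searches all levels below the current element
found = root.findtext(".//rank")

# The result is a plain string; iterating over a string yields its characters
chars = list("68")

print(missing, fallback, found, chars)
```

Because the result is a single string rather than a list of elements, wrapping it in a list comprehension splits it into characters, which is why the cell above prints `['1']`.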


In [11]:
# Find all the country elements in the XML file
for country in root.findall('country'):
    
    # For each country, get the text of its rank sub-element
    rank = country.find('rank').text
    
    # Get the name attribute of the country
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68


# Create a Dataframe from an XML

In [12]:
# Loop through the direct children of the root to see which nodes the later cells iterate over
for node in root:
    print(node)

<Element 'country' at 0x00000155E20A0AE8>
<Element 'country' at 0x00000155E20A0CC8>
<Element 'country' at 0x00000155E20A0EA8>


In [13]:
# Create an empty rows list & neighbors list
country_rows = []
country_neighbors = []

# Loop through each node in the root
for node in root: 
    
    # Get the country name
    country = node.get('name')
    
    # Get the subelements
    rank = node.find("rank").text
    year = node.find("year").text
    gdppc = node.find("gdppc").text
    
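    # Get the neighbor name and direction
    # (in this file both are plain sub-elements, not attributes of neighbor)
    neighbor_name = node.find("neighbor").text
    neighbor_dir = node.find("direction").text
    
    # Create a list of dictionaries that we can convert to a dataframe
    country_rows.append({'Country'           : country,
                         'Year'              : year, 
                         'Rank'              : rank, 
                         'GDPPC'             : gdppc,
                         'Neighbor Name'     : neighbor_name, 
                         'Neighbor Direction': neighbor_dir
                        })

country_rows

# # Create the dataframe
# countries_df = pd.DataFrame(country_rows)

# countries_df

[{'Country': 'Liechtenstein',
  'Year': '2008',
  'Rank': '1',
  'GDPPC': '141100',
  'Neighbor Name': 'Austria',
  'Neighbor Direction': 'E'},
 {'Country': 'Singapore',
  'Year': '2011',
  'Rank': '4',
  'GDPPC': '59900',
  'Neighbor Name': 'Malaysia',
  'Neighbor Direction': 'N'},
 {'Country': 'Panama',
  'Year': '2011',
  'Rank': '68',
  'GDPPC': '13600',
  'Neighbor Name': 'Costa Rica',
  'Neighbor Direction': 'W'}]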

Note that this is specific to the XML file we have: the tag names are hard-coded. To make it reusable, we need to generalize it so that it works for any XML file with a similar two-level layout. This is our homework this week: create a function that can be reused for multiple .xml files with different structures.

# Homework

In [14]:
def xml_to_df(filename):
    
    # Import data by reading from file 
    tree = ET.parse(filename)

    # Convert tree to get the root
    root = tree.getroot()

    # Initialize an empty list for appending
    rows = []

    # Loop through each node in the root
    for node in root:

        # Copy the parent node's attributes into a new dictionary (the row),
        # so the parsed tree itself is not modified
        parent = dict(node.attrib)

        # For each child node in the parent node
        for child in node:

            # Create a dictionary with the tag and text elements
            parent.update({child.tag : child.text})

        # Append to the rows list
        rows.append(parent)
    
    # Create the dataframe
    df = pd.DataFrame(rows)
    
    return df
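
The row-building step can be tried without any files on disk. The sketch below is a stand-alone variant under stated assumptions: `element_rows` is a hypothetical helper name, the inline XML is made up for illustration, and the pandas step is left out so that only the standard library is needed.

```python
import xml.etree.ElementTree as ET

def element_rows(root):
    """Build one flat dict per direct child of root (hypothetical helper)."""
    rows = []
    for node in root:
        # Copy the attributes so the parsed tree is left unchanged
        row = dict(node.attrib)
        for child in node:
            # A repeated tag overwrites the earlier value in this sketch
            row[child.tag] = child.text
        rows.append(row)
    return rows

# Hypothetical sample: the second movie has no <episodes> element
xml_text = """
<collection>
    <movie title="Trigun"><format>DVD</format><episodes>4</episodes></movie>
    <movie title="Ishtar"><format>VHS</format></movie>
</collection>
"""
root = ET.fromstring(xml_text)
rows = element_rows(root)
print(rows)
print(root[0].attrib)
```

Rows built this way can have different keys; when such a list is passed to pd.DataFrame, the missing entries show up as NaN.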

In [4]:
import pandas as pd
pd.set_option('display.width', 150)

country_df = xml_to_df('country_data.xml')
print(f'Country Dataframe \n {country_df} \n \n')

movie_df = xml_to_df('movie_data.xml')
print(f'Movie Dataframe \n {movie_df} \n \n')

student_df = xml_to_df('student_data.xml')
print(f'Student Dataframe \n {student_df} \n \n')

Country Dataframe 
             name rank  year   gdppc    neighbor direction
0  Liechtenstein    1  2008  141100     Austria         E
1      Singapore    4  2011   59900    Malaysia         N
2         Panama   68  2011   13600  Costa Rica         W 
 

Movie Dataframe 
           title                    type format  year rating stars                description episodes
0  Enemy Behind           War, Thriller    DVD  2003     PG    10  Talk about a US-Japan war      NaN
1  Transformers  Anime, Science Fiction    DVD  1989      R     8      A schientific fiction      NaN
2        Trigun           Anime, Action    DVD   NaN     PG    10         Vash the Stampede!        4
3        Ishtar                  Comedy    VHS   NaN     PG     2           Viewable boredom      NaN 
 

Student Dataframe 
            name age grade quiz midterm final
0  Steven Smith  17    12   98      78    82
1   James Brown  16    12   76      81    78
2    Mary Olsen  17    12   91      89    96 
 



In [15]:
# Import data by reading from file 
tree = ET.parse('movie_data.xml')

# Convert tree to get the root
root = tree.getroot()

# Initialize an empty list for appending
rows = []

# Loop through each node in the root
for node in root:

    # Copy the parent node's attributes into a new dictionary (the row),
    # so re-running this cell does not modify the parsed tree
    parent = dict(node.attrib)

    # For each child node in the parent node
    for child in node:

        # Create a dictionary with the tag and text elements
        parent.update({child.tag : child.text})

    # Append to the rows list
    rows.append(parent)
print(rows)
# # Create the dataframe
# df = pd.DataFrame(rows)

# df

[{'title': 'Enemy Behind', 'type': 'War, Thriller', 'format': 'DVD', 'year': '2003', 'rating': 'PG', 'stars': '10', 'description': 'Talk about a US-Japan war'}, {'title': 'Transformers', 'type': 'Anime, Science Fiction', 'format': 'DVD', 'year': '1989', 'rating': 'R', 'stars': '8', 'description': 'A schientific fiction'}, {'title': 'Trigun', 'type': 'Anime, Action', 'format': 'DVD', 'episodes': '4', 'rating': 'PG', 'stars': '10', 'description': 'Vash the Stampede!'}, {'title': 'Ishtar', 'type': 'Comedy', 'format': 'VHS', 'rating': 'PG', 'stars': '2', 'description': 'Viewable boredom'}]
