# **Guided LAB - 342.1.3 - Parsing XML document using Python**

---



## **Objective:**

 In this Python XML Parser lab, you will learn how to parse XML using Python.
 By the end of this lab, learners will be able to:

 * Read an XML document using Python's `ElementTree` module.
 * Use the `parse()` function.
 * Filter elements from an XML document.

## **Instructions:**

The **xml.etree.ElementTree** module provides a simple and effective way to parse and create XML data.

We can read the attributes and their values from the root tag and from its child elements. Knowing these attributes makes it easier to navigate the XML tree.

- The `parse("file.xml")` function reads and parses an XML file and returns an `ElementTree` object.
- The `getroot()` method returns the root element of the parsed document.
- We can also filter the results out of the XML tree by using the `findall()` method of an element.

In [2]:
import xml.etree.ElementTree as ET

# Parse XML file
xml_tree = ET.parse('movie.xml')

# Get root element
xml_root = xml_tree.getroot()

# Print root element tag name
print(xml_root.tag)

# Find the direct children of the root named "movie"
# (none here: the movie elements are nested deeper in the tree)
child = xml_root.findall("movie")
print(child)



collection
[]
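
The empty list above is expected: a bare tag name in `findall()` matches direct children only, and the `movie` elements sit below `genre` and `decade`. A minimal sketch of the difference, assuming a small inline document (`SAMPLE_XML` is a made-up miniature, not the lab's `movie.xml`):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the lab's structure (not the real movie.xml).
SAMPLE_XML = """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie title="A"/>
            <movie title="B"/>
        </decade>
    </genre>
</collection>"""

# fromstring() parses XML text and returns the root element directly.
sample_root = ET.fromstring(SAMPLE_XML)

# A bare tag name matches direct children only, so nothing is found.
direct_movies = sample_root.findall("movie")

# The direct children of <collection> are <genre> elements.
direct_genres = sample_root.findall("genre")

# ".//movie" searches all descendants, at any depth.
all_movies = sample_root.findall(".//movie")

print(len(direct_movies), len(direct_genres), len(all_movies))
```

The `.//` prefix is the usual way to search the whole subtree when you do not want to spell out every intermediate level.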


- `root.tag` returns the tag name of the XML element held in the `root` variable, as shown in the code example below.

- `root.attrib` returns the attributes of an XML element as a dictionary, as shown in the code example below.

**For loop**
- You can easily iterate over subelements (commonly called "children") in the root by using a simple "for" loop, as shown in the below code example.

In [3]:
print('print name of the Child Elements:')
for child in xml_root:
    print(child.tag, child.attrib)

print name of the Child Elements:
genre {'category': 'Action'}
genre {'category': 'Thriller'}


Now you know that the children of the root collection are all genre elements. To designate the genre, the XML uses the attribute category. According to the output above, the collection contains Action and Thriller movies.
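
To collect the `category` values instead of printing whole attribute dictionaries, `Element.get()` can be used. The sketch below assumes a tiny inline document (a made-up sample, not the lab's file), and the third `genre` without a `category` is added only to show the default value:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the last <genre> deliberately lacks a category.
sample_root = ET.fromstring(
    """<collection>
    <genre category="Action"/>
    <genre category="Thriller"/>
    <genre/>
</collection>"""
)

categories = []
for child in sample_root:
    # .get() returns the default instead of raising KeyError
    # when the attribute is missing.
    categories.append(child.get("category", "unknown"))

print(categories)
```

Using `.get()` rather than `child.attrib["category"]` keeps the loop from failing on elements that lack the attribute.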

Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is **`root.iter()`**. You can put this function into a `for` loop and it will iterate over the entire tree.

In [4]:
[elem.tag for elem in xml_root.iter()]

['collection',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'genre',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'decade',
 'movie',
 'format',
 'year',
 'rating',
 'description',
 'movie',
 'format',
 'year',
 'rating',
 'description']

This gives a general notion for how many elements you have, but it does not show the attributes or levels in the tree.
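
One way to also see the levels and attributes is a small recursive walk. This is only a sketch: `outline` is a hypothetical helper name, and the inline XML is a made-up sample rather than the lab's `movie.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration.
sample_root = ET.fromstring(
    """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie title="A"><year>1981</year></movie>
        </decade>
    </genre>
</collection>"""
)


def outline(element, depth=0):
    """Return one indented line per element: tag plus attributes."""
    lines = ["    " * depth + f"{element.tag} {element.attrib}"]
    # Iterating an element yields its direct children.
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


for line in outline(sample_root):
    print(line)
```

Each level of nesting adds four spaces of indentation, so the printed outline mirrors the shape of the tree.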

There is a helpful way to see the whole document. The `ElementTree` module (remember, aliased as `ET`) provides a `tostring()` function. If you pass the root element to `ET.tostring()`, it returns the whole document in serialized form.

With a byte encoding such as `utf8`, `ET.tostring()` returns a `bytes` object, so you specify the encoding and then decode the result to get a string you can print. For XML documents, use `utf8`, which is the most common encoding for XML.
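
The encode-then-decode step can be seen in isolation in the sketch below, which uses a tiny inline document (a made-up sample, not the lab's file). Passing `encoding="unicode"` is an alternative that returns a `str` directly, with no decoding needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line sample document.
sample_root = ET.fromstring(
    '<collection><genre category="Action"/></collection>'
)

# A byte encoding produces a bytes object that must be decoded to print.
as_bytes = ET.tostring(sample_root, encoding="utf-8")
decoded = as_bytes.decode("utf-8")

# encoding="unicode" asks for a str instead of bytes.
as_text = ET.tostring(sample_root, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```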

In [5]:
print(ET.tostring(xml_root, encoding='utf8').decode('utf8'))

<?xml version='1.0' encoding='utf8'?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
                <year>1984</year>
                <rating>PG</rating>
                <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
                <format multiple="False">Blu-ray</format>
                <year>1985</year>

You can expand the use of the `iter()` function to help with finding particular elements of interest. `root.iter('tag')` yields every element under the root, at any depth, whose tag matches the name given. Here, you will list the attributes of every movie element in the tree:

In [6]:
for movie in xml_root.iter('movie'):
    print(movie.attrib)


{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
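
Note that the `favorite` values in the output above are not spelled consistently (`'False'` and `'FALSE'`). A possible way to handle that when filtering is to compare case-insensitively. The sketch below assumes a made-up inline sample, and `is_favorite` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same inconsistent spellings.
sample_root = ET.fromstring(
    """<collection>
    <movie favorite="True" title="A"/>
    <movie favorite="False" title="B"/>
    <movie favorite="FALSE" title="C"/>
</collection>"""
)


def is_favorite(movie):
    """Interpret the favorite attribute case-insensitively."""
    return movie.get("favorite", "").strip().lower() == "true"


favorite_titles = [
    m.get("title") for m in sample_root.iter("movie") if is_favorite(m)
]
other_titles = [
    m.get("title") for m in sample_root.iter("movie") if not is_favorite(m)
]

print(favorite_titles, other_titles)
```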


## **XPath Expressions**

Many elements have no attributes and hold only text content. Using the attribute `.text`, you can print out this content.

Now, print out all the descriptions of the movies.

In [7]:
for description in xml_root.iter('description'):
    print(description.text)


                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                
None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.
NA.
WhAtEvER I Want!!!?!
fhfdh
Funny movie about a funny guy
psychopathic Bateman
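
The first description above keeps the line breaks and indentation from the XML file, because `.text` returns the content exactly as written. A minimal sketch of one way to tidy it, assuming a made-up inline sample (`clean_text` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: multi-line text, plain text, and an empty element.
sample_root = ET.fromstring(
    """<collection>
    <movie><description>
        'Archaeologist and adventurer
        Indiana Jones.'
    </description></movie>
    <movie><description>None provided.</description></movie>
    <movie><description/></movie>
</collection>"""
)


def clean_text(element):
    """Collapse runs of whitespace in an element's text."""
    # .text is None when the element has no text content.
    raw = element.text or ""
    # split() with no arguments splits on any run of whitespace.
    return " ".join(raw.split())


descriptions = [clean_text(d) for d in sample_root.iter("description")]
for text in descriptions:
    print(repr(text))
```

The `or ""` guard matters because calling string methods on `None` would raise an `AttributeError` for empty elements.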


Printing out the XML is helpful, but XPath is a query language used to search through an XML document quickly and easily. XPath stands for XML Path Language and uses, as the name suggests, a "path like" syntax to identify and navigate nodes in an XML document.

Understanding XPath is critically important to searching and populating XML documents. Given a plain tag name, ElementTree's `.findall()` method matches only the immediate children of the referenced element. You can use XPath expressions to specify more useful searches. Note that ElementTree supports only a limited subset of XPath.

Here, you will search the tree for movies that came out in 1992:

In [8]:
for movie in xml_root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
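
The same kind of search can be written without spelling out every level, and predicates can test attributes as well as child text. The sketch below assumes a made-up inline sample (not the lab's `movie.xml`):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two genres.
sample_root = ET.fromstring(
    """<collection>
    <genre category="Action">
        <decade years="1990s">
            <movie title="A"><year>1992</year></movie>
            <movie title="B"><year>1995</year></movie>
        </decade>
    </genre>
    <genre category="Thriller">
        <decade years="1990s">
            <movie title="C"><year>1992</year></movie>
        </decade>
    </genre>
</collection>"""
)

# Child-text predicate: movies with a <year> child whose text is 1992.
from_1992 = sample_root.findall(".//movie[year='1992']")

# Attribute predicate on an earlier step, then continue down the path.
thrillers = sample_root.findall("./genre[@category='Thriller']/decade/movie")

print([m.get("title") for m in from_1992])
print([m.get("title") for m in thrillers])
```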


The function `.findall()` always starts its search at the element it is called on. This type of function is extremely powerful for a "find and replace". You can even search on attributes!
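
As an illustration of the "find and replace" idea, the sketch below finds `format` elements whose `multiple` attribute is `False` and rewrites it to `No`, a spelling mismatch that also appears in the printed document earlier. The inline XML is a made-up sample, and the choice of `No` as the target value is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with inconsistent values for "multiple".
sample_root = ET.fromstring(
    """<collection>
    <movie title="A"><format multiple="No">DVD</format></movie>
    <movie title="B"><format multiple="False">Blu-ray</format></movie>
    <movie title="C"><format multiple="Yes">DVD,Online</format></movie>
</collection>"""
)

# findall() returns a list, so it is safe to modify elements in the loop.
for fmt in sample_root.findall("./movie/format[@multiple='False']"):
    # set() overwrites the attribute value in the in-memory tree.
    fmt.set("multiple", "No")

values = [f.get("multiple") for f in sample_root.iter("format")]
print(values)
```

The change affects only the tree in memory; the file on disk is untouched unless the tree is written back out.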

Now, print out only the movies that are available in multiple formats (an attribute).

In [23]:
for movie in xml_root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
    print(movie.attrib)

{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'ALIEN'}


Brainstorm why the XPath above ends by stepping up to the parent. The path itself matches `format` elements, so without that final step the loop would print the `multiple` attribute (the "Yes" values) instead of the movie attributes. Think about how the "for" loop is defined. Could you rewrite this loop to print out only the movie titles? Try it below:

Tip: use `..` inside of XPath to select the parent element of the current element.

In [40]:
# Extra practice: select movies whose favorite attribute is 'True'
for movie in xml_root.findall("./genre/decade/movie[@favorite='True']"):
    print(movie.attrib)


{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
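
Returning to the parent tip above, one possible way to print only the titles is sketched below. It assumes a made-up inline sample (not the lab's `movie.xml`) and shows the path with and without the final `..` step:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one multi-format movie.
sample_root = ET.fromstring(
    """<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie title="A"><format multiple="Yes">DVD,Online</format></movie>
            <movie title="B"><format multiple="No">DVD</format></movie>
        </decade>
    </genre>
</collection>"""
)

# Without "..", the matches are the <format> elements themselves.
formats = sample_root.findall(
    "./genre/decade/movie/format[@multiple='Yes']"
)

# With "/..", each match is replaced by its parent <movie> element.
movies = sample_root.findall(
    "./genre/decade/movie/format[@multiple='Yes']/.."
)

titles = [m.get("title") for m in movies]
print(formats[0].attrib, titles)
```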
