# The XML Data Format

**XML**, which stands for eXtensible Markup Language, is another way to represent hierarchical data. The basic building block of XML is the **tag**, denoted by angle brackets `<>`.

For example, a data set of movies might be represented using XML as follows:

```
<movies>
  <movie id="1" title="The Godfather">
    <director id="50" name="Coppola, Francis Ford">
    </director>
    <releasedate>1972-03-24</releasedate>
    <character id="100" name="Vito Corleone">
      <actor id="200" name="Brando, Marlon">
      </actor>
    </character>
    <character id="101" name="Michael Corleone">
      <actor id="201" name="Pacino, Al">
      </actor>
    </character>
    ...
  </movie>
  <movie id="2" title="The Godfather: Part II">
    <director id="50" name="Coppola, Francis Ford">
    </director>
    <releasedate>1974-10-20</releasedate>
    <character id="101" name="Michael Corleone">
      <actor id="201" name="Pacino, Al">
      </actor>
    </character>
    <character id="100" name="Vito Corleone">
      <actor id="250" name="De Niro, Robert">
      </actor>
    </character>
    ...
  </movie>
  ...
</movies>
```

Note the following features of XML:

- Every tag `<a>` has a corresponding closing tag `</a>`. You can always recognize a closing tag by the forward slash `/`.
- Additional tags and/or strings can be nested between the opening and closing tags. In the example above, `<actor>` is nested between `<character>` and `</character>`, and `<character>` is nested between `<movie>` and `</movie>`. The nesting is used to represent hierarchy.
- Indentation makes the code more readable by making the nesting structure easier to see, but it is optional.
- Attributes can be associated with each tag, like `id=` and `name=` with the `<character>` tag and `id=` and `title=` with the `<movie>` tag.
- Children are represented by nested tags.
- Repeated fields are represented by repeated tags.

Each tag represents a variable in the data set. Unlike JSON, which uses lists to represent repeated fields, XML represents repeated fields by simply repeating tags where necessary. In the example above, there are multiple instances of `<movie>` within `<movies>` and multiple instances of `<character>` within `<movie>`, so `movie` and `character` are both repeated fields. (In fact, `director` is also a repeated field, but it is impossible to tell from the code above, since the movies shown above only have one director.)
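To make attributes, nested strings, and repeated tags concrete, here is a minimal sketch that parses a trimmed-down copy of the movie example. It uses `xml.etree.ElementTree` from the Python standard library rather than Beautiful Soup, only so that it runs without extra installs; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the movie example above (the elided parts are left out).
xml_text = """
<movies>
  <movie id="1" title="The Godfather">
    <releasedate>1972-03-24</releasedate>
    <character id="100" name="Vito Corleone">
      <actor id="200" name="Brando, Marlon"></actor>
    </character>
    <character id="101" name="Michael Corleone">
      <actor id="201" name="Pacino, Al"></actor>
    </character>
  </movie>
</movies>
"""

root = ET.fromstring(xml_text)             # root is the <movies> element

movie = root.find("movie")                 # first <movie> child of <movies>
title = movie.get("title")                 # attributes are looked up by name
release = movie.find("releasedate").text   # the string nested between the tags

# A repeated field is just a tag that occurs more than once under its parent.
characters = [c.get("name") for c in movie.findall("character")]

print(title, release, characters)
```

Note that `findall("character")` returns a list even though nothing in the XML marks `<character>` as a list; the repetition of the tag is the only signal.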



We will process XML files using a Python library called [Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/).

First, we read in the XML data from a URL (https://dlsun.github.io/pods/data/tvshows.xml) using the requests library. You should also open this link in a browser window to see how the data is stored; look for the `<show>`, `<cast>`, and `<episode>` tags, among others.

In [1]:
import requests
response = requests.get("https://dlsun.github.io/pods/data/tvshows.xml")
response

<Response [200]>

Currently `response` is a `Response` object; its `.text` attribute holds the XML as one very long string, which we can parse with Beautiful Soup. We read the XML data into a `BeautifulSoup` object, which represents the data as a tree.

- You can navigate this tree using `.parent` (`.parents`) and `.children` (`.descendants`).
- You can search for a tag using `.find_all()` or `.find()` (which returns the first tag found).


In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "xml")

![](https://github.com/dlsun/pods/blob/master/11-Hierarchical-Data/hierarchical_data.png?raw=1)
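The navigation and search methods listed above can be tried on a small tree before applying them to the full data set. The sketch below uses hypothetical toy data: the tag names mirror those in the TV shows file, but the contents and the variable names are made up for illustration. It assumes the same setup as the cell above (Beautiful Soup with the `"xml"` parser).

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical toy data, not taken from the real file.
toy_xml = """
<shows>
  <show>
    <name>Example Show</name>
    <season>
      <episode><name>Pilot</name></episode>
      <episode><name>Second Episode</name></episode>
    </season>
  </show>
</shows>
"""

toy_soup = BeautifulSoup(toy_xml, "xml")

first_episode = toy_soup.find("episode")     # first <episode> tag in the tree
all_episodes = toy_soup.find_all("episode")  # every <episode> tag

# .parent moves one level up the tree: <episode> -> <season>
parent_name = first_episode.parent.name

# .parents keeps walking upward through every ancestor
ancestor_names = [p.name for p in first_episode.parents]

# .children yields direct children only; these can include whitespace
# strings between tags, so keep just the Tag objects
season = toy_soup.find("season")
child_tags = [c.name for c in season.children if isinstance(c, Tag)]

print(parent_name, ancestor_names, child_tags)
```

The `isinstance(c, Tag)` filter is a design choice for this sketch: it skips the text nodes that indentation can introduce between tags.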

Suppose we want to create a DataFrame with one row for each show and a column for the number of episodes. The following code

- Searches through `soup` to find all instances of the `<show>` tag: `soup.find_all("show")`
- Loops through all instances of the `<show>` tag to find, for each `show`,
  - the show name (converted to a string): `show.find("name").string`
  - the number of episodes, by finding all instances of the `<episode>` tag for the `show` and taking the length of this list: `len(show.find_all("episode"))`

In [5]:
# initialize empty lists
show_names = []
show_episodes = []

# find all <show> tags, then iterate through them
for show in soup.find_all("show"):

  # find the name of the show, and append to the list
  show_names.append(show.find("name").string)

  # find the number of episodes of the show, and append to the list
  show_episodes.append(len(show.find_all("episode")))



After the for loop runs, the `show_names` and `show_episodes` lists are filled in; we just have to convert them to a pandas DataFrame.

In [6]:
import pandas as pd
pd.DataFrame({
    "name": show_names,
    "episodes": show_episodes
})


| | name | episodes |
|---|---|---|
| 0 | Girls | 63 |
| 1 | The Golden Girls | 181 |
| 2 | Good Girls | 26 |
| 3 | The Powerpuff Girls | 119 |
| 4 | Florida Girls | 10 |
| 5 | Chicken Girls | 76 |
| 6 | Derry Girls | 12 |
| 7 | The Powerpuff Girls | 82 |
| 8 | Bomb Girls | 19 |
| 9 | Gilmore Girls | 153 |


Suppose we want to find out which TV show the episode "Those Are Strings, Pinocchio" comes from. We can search through `soup` to find all instances of the `<episode>` tag, find the name of each episode, and, if it matches the one we want, return the name of the show. Note that `<episode>` is nested within `<season>`, which is nested within `<show>`. So `episode.parent` returns the season and `episode.parent.parent` returns the show; we want to find the `<name>` at the show level in the hierarchy.

In [7]:
# iterate over all instances of the <episode> tag
for episode in soup.find_all("episode"):

  # find the <name> of the current episode
  episode_name = episode.find("name").string

  # check if the episode name is what we're looking for
  if episode_name == "Those Are Strings, Pinocchio":

    # navigate from the episode level up two levels to the show level
    # and find the <name> at the show level (and convert to string)
    print(episode.parent.parent.find("name").string)

Gilmore Girls
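Counting `.parent` steps works only as long as the nesting depth is known. As an alternative sketch (not the approach used above), Beautiful Soup's `find_parent()` searches upward for the nearest ancestor with a given tag name. The toy data and the helper name `show_for_episode` below are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical toy data with the same <show>/<season>/<episode> nesting.
toy_xml = """
<shows>
  <show>
    <name>Show A</name>
    <season>
      <episode><name>Pilot</name></episode>
    </season>
  </show>
  <show>
    <name>Show B</name>
    <season>
      <episode><name>Finale</name></episode>
    </season>
  </show>
</shows>
"""

toy_soup = BeautifulSoup(toy_xml, "xml")


def show_for_episode(soup, episode_title):
    """Return the name of the first show with an episode of this title, or None."""
    for episode in soup.find_all("episode"):
        if episode.find("name").string == episode_title:
            # climb to the nearest enclosing <show>, however deep we are
            show = episode.find_parent("show")
            # recursive=False looks only at direct children, so this is
            # the show's own <name>, never an episode's <name>
            return show.find("name", recursive=False).string
    return None


print(show_for_episode(toy_soup, "Finale"))
```

Returning `None` when nothing matches is an assumption of this sketch; the loop above simply prints nothing in that case.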
