# XML

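XML is another format for representing nested data. An XML document consists of **tags**, which are written inside angle brackets `< >`. Every opening tag (such as `<employee>`) has a matching closing tag (`</employee>`), and an empty tag can be written in self-closing form (such as `<sets/>`). Nested fields are represented by tags placed between the opening and closing of another tag.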

```
<employee name="Willy L.">
  <child name="Biff">
  </child>
  <child name="Happy">
  </child>
</employee>
```

In the above example, `employee` and `child` are tag names, while `name` is an attribute. Each tag can contain any number of attributes.
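
To make the distinction between tag names and attributes concrete, here is a minimal sketch that parses the example above. It uses the standard library's `xml.etree.ElementTree` rather than the package used later in this notebook, simply because it needs no installation; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The example document from above, as a Python string.
doc = """
<employee name="Willy L.">
  <child name="Biff">
  </child>
  <child name="Happy">
  </child>
</employee>
"""

root = ET.fromstring(doc.strip())

# The tag name and the attributes are stored separately.
print(root.tag)     # employee
print(root.attrib)  # {'name': 'Willy L.'}

# Iterating over an element yields its nested (child) tags.
children = [child.attrib["name"] for child in root]
print(children)     # ['Biff', 'Happy']
```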

We will work with XML data from [setlist.fm](http://setlist.fm), a website that contains user-contributed setlists for live music concerts around the world. They provide [a REST API](https://api.setlist.fm/docs/index.html) for querying their data.

Let's get setlists from concerts in San Luis Obispo. Take a look at [the documentation here](https://api.setlist.fm/docs/rest.0.1.search.setlists.html), and see if you can construct the URL to fetch the most recent concerts in San Luis Obispo. Then, use the `requests` package to fetch the data into a `Response` object called `req`. (Before you write the request, visit the URL in your browser to make sure you understand the API.)

In [1]:
import requests

# YOUR CODE HERE
req = requests.get("https://api.setlist.fm/rest/0.1/search/setlists?cityName=San%20Luis%20Obispo")
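
The `%20` in the URL above is a percent-encoded space. As a small sketch of how such a query string can be assembled from a dictionary instead of typed by hand, the standard library's `urllib.parse` can do the encoding; the names `base_url`, `params`, and `url` are illustrative.

```python
from urllib.parse import urlencode, quote

base_url = "https://api.setlist.fm/rest/0.1/search/setlists"
params = {"cityName": "San Luis Obispo"}

# quote_via=quote encodes spaces as %20 (the default would use "+").
query = urlencode(params, quote_via=quote)
url = f"{base_url}?{query}"
print(url)
# https://api.setlist.fm/rest/0.1/search/setlists?cityName=San%20Luis%20Obispo
```

`requests.get()` also accepts a `params=` dictionary and performs this encoding itself, so building the string manually is optional.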

The XML document is now saved as a string in `req.text`. To make the data more accessible, we should represent the XML document using a nested data structure. To do this, we will use a Python package called BeautifulSoup. BeautifulSoup parses a string of XML into a tree-like data structure that makes it easy to find the tags we need. We have to specify a parser; since this is XML, we use the `"xml"` parser. (Later, we'll use `"html.parser"` for HTML.)

In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(req.text, "xml")

The `BeautifulSoup` object has a method `.find_all()`, which returns a list of all tags with a given name or attribute.

For example, if we wanted to find all `<setlist>` tags, we could do:

In [3]:
soup.find_all('setlist')

[<setlist eventDate="07-05-2017" id="53e6437d" lastUpdated="2017-05-05T07:17:53.000+0000" tour="Spring 2017" versionId="5b5d6fb0"><artist disambiguation="" mbid="b7765a33-ed19-4d99-bc93-1e54efdbe255" name="From Indian Lakes" sortName="From Indian Lakes" tmid="1484853"><url>http://www.setlist.fm/setlists/from-indian-lakes-5bda27b4.html</url></artist><venue id="13d55115" name="SLO Guild Hall"><city id="cu:aa19a1c6-06c7-11e6-b736-22000bb3106b" name="San Luis Obispo" state="California" stateCode="CA"><coords lat="35.0" long="120.0"/><country code="US" name="United States"/></city><url>http://www.setlist.fm/venue/slo-guild-hall-san-luis-obispo-ca-usa-13d55115.html</url></venue><sets/><url>http://www.setlist.fm/setlist/from-indian-lakes/2017/slo-guild-hall-san-luis-obispo-ca-53e6437d.html</url></setlist>,
 <setlist eventDate="07-05-2017" id="4be6437e" lastUpdated="2017-05-05T07:17:46.000+0000" tour="Light We Made" versionId="535d6fb1"><artist disambiguation="" mbid="0d20c42d-133c-429d-8f76-3

Or, if we wanted to find all `<venue>` tags where `name="The Fremont Theatre"`, we could do:

In [4]:
len(soup.find_all('venue', {"name": "The Fremont Theatre"}))

10
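
Both forms of `.find_all()` can be tried on a small hand-written document where the answer is easy to check by eye. This is only a sketch: the sample XML and the name `soup_small` are made up for illustration, and it uses Python's built-in `"html.parser"` so that it runs without `lxml` installed (the `"xml"` parser depends on `lxml`). That parser lowercases tag and attribute names, so the sample sticks to lowercase.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not real API output.
sample = """
<setlists>
  <setlist id="a1"><venue name="The Fremont Theatre"/></setlist>
  <setlist id="b2"><venue name="SLO Guild Hall"/></setlist>
  <setlist id="c3"><venue name="The Fremont Theatre"/></setlist>
</setlists>
"""
soup_small = BeautifulSoup(sample, "html.parser")

# Search by tag name only.
all_venues = soup_small.find_all("venue")

# Search by tag name and by the value of an attribute on that tag.
fremont = soup_small.find_all("venue", {"name": "The Fremont Theatre"})

print(len(all_venues), len(fremont))  # 3 2
```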

Remember, `soup` is a nested data structure, so we can also query within any tag that we find. For example, once we have a `<setlist>` tag, we can call `.find_all()` on it to search only within that tag.

(Note: `.find()` returns only the first matching tag. Use it only when you are sure there is exactly one match or when you only want the first one.)

In [5]:
setlist = soup.find("setlist")
setlist.find_all("artist")

[<artist disambiguation="" mbid="b7765a33-ed19-4d99-bc93-1e54efdbe255" name="From Indian Lakes" sortName="From Indian Lakes" tmid="1484853"><url>http://www.setlist.fm/setlists/from-indian-lakes-5bda27b4.html</url></artist>]

We can also search in the opposite direction: `.find_parents()` returns the tags that enclose a given tag.

In [6]:
artist = soup.find("artist")
artist.find_parents("setlist")

[<setlist eventDate="07-05-2017" id="53e6437d" lastUpdated="2017-05-05T07:17:53.000+0000" tour="Spring 2017" versionId="5b5d6fb0"><artist disambiguation="" mbid="b7765a33-ed19-4d99-bc93-1e54efdbe255" name="From Indian Lakes" sortName="From Indian Lakes" tmid="1484853"><url>http://www.setlist.fm/setlists/from-indian-lakes-5bda27b4.html</url></artist><venue id="13d55115" name="SLO Guild Hall"><city id="cu:aa19a1c6-06c7-11e6-b736-22000bb3106b" name="San Luis Obispo" state="California" stateCode="CA"><coords lat="35.0" long="120.0"/><country code="US" name="United States"/></city><url>http://www.setlist.fm/venue/slo-guild-hall-san-luis-obispo-ca-usa-13d55115.html</url></venue><sets/><url>http://www.setlist.fm/setlist/from-indian-lakes/2017/slo-guild-hall-san-luis-obispo-ca-53e6437d.html</url></setlist>]
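
The difference between searching downward from a tag and upward from a tag is easier to see on a tiny document. The following is a sketch under assumptions: the sample XML and variable names are invented, and `"html.parser"` is used only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not real API output.
sample = """
<setlists>
  <setlist id="a1"><artist name="Artist A"/><venue name="Venue X"/></setlist>
  <setlist id="b2"><artist name="Artist B"/><venue name="Venue Y"/></setlist>
</setlists>
"""
soup_small = BeautifulSoup(sample, "html.parser")

# .find() returns only the first <setlist>; searching inside it
# ignores artists that belong to other setlists.
first = soup_small.find("setlist")
names_inside = [a["name"] for a in first.find_all("artist")]
print(names_inside)  # ['Artist A']

# .find_parents() walks upward from a tag to the tags that contain it.
artist_b = soup_small.find("artist", {"name": "Artist B"})
parent_ids = [p["id"] for p in artist_b.find_parents("setlist")]
print(parent_ids)    # ['b2']
```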

## Exercises

**Question 1.** Flatten the data at the setlist level to obtain a data frame with one setlist per row. Note that there is always exactly one "artist" and "venue" per setlist, so these should be columns in your data frame. However, for repeated fields like "song", you will need to aggregate them into a single value.

In [11]:
setlists = soup.find_all("setlist")

In [17]:
import pandas as pd
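# Build one list per column; each setlist contributes one entry to each list.
setlist_data = {
    "artist": [],
    "venue": [],
    "songs": []
}
for setlist in setlists:
    # Each setlist has exactly one <artist> and one <venue>, so .find() is safe.
    setlist_data["artist"].append(setlist.find("artist")["name"])
    setlist_data["venue"].append(setlist.find("venue")["name"])
    # Songs repeat, so aggregate them into a single value: the song count.
    setlist_data["songs"].append(len(setlist.find_all("song")))

setlist_data = pd.DataFrame(setlist_data)
setlist_data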

| | artist | songs | venue |
|---|---|---|---|
| 0 | From Indian Lakes | 0 | SLO Guild Hall |
| 1 | Balance and Composure | 0 | SLO Guild Hall |
| 2 | Cage the Elephant | 22 | The Fremont Theatre |
| 3 | Tijuana Panthers | 0 | The Fremont Theatre |
| 4 | Runner | 0 | The Fremont Theatre |
| 5 | Joyce Manor | 0 | The Fremont Theatre |
| 6 | Laura Stevenson | 0 | Sweet Springs Saloon |
| 7 | Dev | 1 | SLO Brewing Company |
| 8 | matt pond PA | 0 | Living Room |
| 9 | JoJo | 0 | The Fremont Theatre |
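
Counting is one way to aggregate a repeated field into a single value. Another option, sketched here, is to join the song names into one string. The sample document is hypothetical: it assumes that each `<song>` tag carries a `name` attribute and sits inside `<sets>` and `<set>` tags, and it uses `"html.parser"` only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical setlist; the <sets>/<set>/<song name="..."> layout is an assumption.
sample = """
<setlist>
  <artist name="Artist A"/>
  <sets>
    <set><song name="Song 1"/><song name="Song 2"/></set>
    <set encore="1"><song name="Song 3"/></set>
  </sets>
</setlist>
"""
setlist_small = BeautifulSoup(sample, "html.parser").find("setlist")

# .find_all() searches at any depth, so songs in every <set> are found.
song_names = [song["name"] for song in setlist_small.find_all("song")]

# Collapse the repeated field into a single value for one data frame cell.
joined = "; ".join(song_names)
print(joined)  # Song 1; Song 2; Song 3
```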


**Question 2.** Flatten the data at the song level to obtain a data frame with one song per row. You will want to include information at higher levels, such as the artist and the venue.

In [8]:
# YOUR CODE HERE.
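
One possible approach to song-level flattening, shown as a sketch on invented data rather than as the intended solution: loop over setlists, then over the songs inside each setlist, and copy the setlist-level fields onto every song row. The sample XML, the `name` attribute on `<song>`, and the variable names are assumptions, and `"html.parser"` is used only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not real API output.
sample = """
<setlists>
  <setlist id="a1">
    <artist name="Artist A"/><venue name="Venue X"/>
    <sets><set><song name="Song 1"/><song name="Song 2"/></set></sets>
  </setlist>
  <setlist id="b2">
    <artist name="Artist B"/><venue name="Venue Y"/>
    <sets></sets>
  </setlist>
</setlists>
"""
soup_small = BeautifulSoup(sample, "html.parser")

rows = []
for setlist in soup_small.find_all("setlist"):
    # Higher-level fields are looked up once per setlist...
    artist = setlist.find("artist")["name"]
    venue = setlist.find("venue")["name"]
    # ...and repeated on each song row.
    for song in setlist.find_all("song"):
        rows.append({"artist": artist, "venue": venue, "song": song["name"]})

print(len(rows))  # 2
```

Passing `rows` to `pd.DataFrame()` would turn the list of dictionaries into a data frame with one song per row. Note that a setlist with no songs, like the second one here, contributes no rows under this approach.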

In [19]:
soup.find_all("setlist", {"artist": "Rihanna"})

[]
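
This query returns an empty list no matter who performed, because the dictionary passed to `.find_all()` is matched against attributes of the `<setlist>` tag itself, and the artist is stored in a nested `<artist>` tag instead. A sketch of one way to filter on a child tag follows; the sample XML and names are invented, and `"html.parser"` is used only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data, not real API output.
sample = """
<setlists>
  <setlist id="a1"><artist name="Artist A"/></setlist>
  <setlist id="b2"><artist name="Artist B"/></setlist>
</setlists>
"""
soup_small = BeautifulSoup(sample, "html.parser")

# No <setlist> has an "artist" attribute, so nothing matches.
by_attr = soup_small.find_all("setlist", {"artist": "Artist B"})
print(by_attr)  # []

# Instead, keep the setlists that contain a matching <artist> child.
by_child = [
    s for s in soup_small.find_all("setlist")
    if s.find("artist", {"name": "Artist B"}) is not None
]
ids = [s["id"] for s in by_child]
print(ids)      # ['b2']
```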