# Chapter 11. Hierarchical Data

A lot of data in the real world is naturally hierarchical. For example, consider a data set of concert programs by the New York Philharmonic, one of the world's leading orchestras. Each program consists of one or more works of music and is performed at one or more concerts. Furthermore, each work of music may feature any number of soloists.

How would we represent this information in a single `DataFrame`? If each row represents a single program, then we need one column for each concert that the program appeared in. This is wasteful because some programs may have only appeared in one concert. We still need to keep around $M$ "concert" columns, where $M$ is the maximum number of concerts that any program appeared in.

|concert1    | concert2   | ... | concertM | work1 | work2 | ... | workN |
|------------|------------|-----|----------|----------------|-------|-----|-------|
| 2016-12-11 | `NaN`      | ... | `NaN`    | Violin Concerto No. 2 | Symphony No. 5  | ... | `NaN` |
| 2016-12-13 | 2016-12-14 | ... | 2016-12-17 | Messiah | `NaN` | ... | `NaN` |
| ... | ... | ... | ... | ... | ... | ... | ... |

Similarly, we need one column for each work in the program. The number of "work" columns has to be equal to the maximum number of works on any program, even though most programs may have had far fewer works. 

Hopefully, it is clear that a single `DataFrame` is an inefficient way to represent hierarchical data---and we haven't even tried to include information about the soloists who performed in each work. This chapter is about efficient ways to represent hierarchical data, like the New York Philharmonic data set described above.

# Chapter 11.1 The JSON Data Format

The JavaScript Object Notation, or **JSON**, data format is a popular way to represent hierarchical data. Despite its name, its application extends far beyond JavaScript, the language for which it was originally designed.

Let's take a look at the first 1000 characters of a JSON file. (_Warning:_ Never try to print the entire contents of a JSON file in a Jupyter notebook; this will freeze the notebook if the file is large!)

In [63]:
!head -c 1000 /data301/data/nyphil/complete.json

{
  "programs": [
    {
      "id": "00646b9f-fec7-4ffb-9fb1-faae410bd9dc-0.1",
      "programID": "3853",
      "orchestra": "New York Philharmonic",
      "season": "1842-43",
      "concerts": [
        {
          "eventType": "Subscription Season",
          "Location": "Manhattan, NY",
          "Venue": "Apollo Rooms",
          "Date": "1842-12-07T05:00:00Z",
          "Time": "8:00PM"
        }
      ],
      "works": [
        {
          "ID": "52446*",
          "composerName": "Beethoven,  Ludwig  van",
          "workTitle": "SYMPHONY NO. 5 IN C MINOR, OP.67",
          "conductorName": "Hill, Ureli Corelli",
          "soloists": []
        },
        {
          "ID": "8834*4",
          "composerName": "Weber,  Carl  Maria Von",
          "workTitle": "OBERON",
          "movement": "\"Ozean, du Ungeheuer\" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II",
          "conductorName": "Timm, Henry C.",
          "soloists": [
            {
              "sol

Hopefully, this notation is familiar. It is just the notation for a Python dictionary! Although there are a few cosmetic differences between Python dicts and JSON, they are the same for the most part, and we will use the terms "dict" and "JSON" interchangeably. 

The `json` library in Python allows you to read a JSON file directly into a Python dict.

In [64]:
import json

with open("/data301/data/nyphil/complete.json") as f:
    nyphil = json.load(f)

Let's take a look at this Python dict that we just created, again being careful not to print out the entire dict. Let's just take a look at the first two programs in the data set. This should hopefully be enough to give you a sense of how the data is structured.

In [65]:
nyphil["programs"][:2]

[{'id': '00646b9f-fec7-4ffb-9fb1-faae410bd9dc-0.1',
  'programID': '3853',
  'orchestra': 'New York Philharmonic',
  'season': '1842-43',
  'concerts': [{'eventType': 'Subscription Season',
    'Location': 'Manhattan, NY',
    'Venue': 'Apollo Rooms',
    'Date': '1842-12-07T05:00:00Z',
    'Time': '8:00PM'}],
  'works': [{'ID': '52446*',
    'composerName': 'Beethoven,  Ludwig  van',
    'workTitle': 'SYMPHONY NO. 5 IN C MINOR, OP.67',
    'conductorName': 'Hill, Ureli Corelli',
    'soloists': []},
   {'ID': '8834*4',
    'composerName': 'Weber,  Carl  Maria Von',
    'workTitle': 'OBERON',
    'movement': '"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II',
    'conductorName': 'Timm, Henry C.',
    'soloists': [{'soloistName': 'Otto, Antoinette',
      'soloistInstrument': 'Soprano',
      'soloistRoles': 'S'}]},
   {'ID': '3642*',
    'composerName': 'Hummel,  Johann',
    'workTitle': 'QUINTET, PIANO, D MINOR, OP. 74',
    'soloists': [{'soloistNa

The top-level variables in each "program" are:

- concerts
- id
- orchestra
- programID
- season
- works

Most of these variables are fairly standard; the only interesting ones are "concerts" and "works", which are both lists. A variable that is a list is called a **repeated field**. A repeated field might itself consist of several variables (for example, each "work" has a composer, a conductor, and soloists), thus creating a hierarchy of variables. Repeated fields are what makes a data set hierarchical.

# Flattening Hierarchical Data

How many distinct works by Ludwig van Beethoven has the New York Philharmonic performed? Answering this question from the Python dict is irritating, as it involves writing multiple nested "for" loops to traverse the JSON data. Shown below is the code to do this, although we will see an easier way shortly.

In [66]:
# Spaghetti Code (Don't do this --- see below for an easier way.)
beethoven = set()
for program in nyphil["programs"]:
    for work in program["works"]:
        if "composerName" in work and work["composerName"] == "Beethoven,  Ludwig  van":
            beethoven.add(work["workTitle"])
            
len(beethoven)

144

The only data that we really need to answer the above question is a `DataFrame` of works that the New York Philharmonic has performed. To obtain such a `DataFrame`, we need to **flatten** the JSON data at the level of "work" to produce a `DataFrame` with one row per work. The `json_normalize()` in `pandas.io.json` is a function that allows us to flatten JSON data at any desired level. The first argument to `json_normalize()` is the JSON data (i.e., a Python dict), and the second argument specifies the level at which to flatten.

In [67]:
import pandas as pd
from pandas.io.json import json_normalize
pd.options.display.max_rows = 10

works = json_normalize(nyphil["programs"], "works")
works

Unnamed: 0,ID,composerName,conductorName,interval,movement,soloists,workTitle
0,52446*,"Beethoven, Ludwig van","Hill, Ureli Corelli",,,[],"SYMPHONY NO. 5 IN C MINOR, OP.67"
1,8834*4,"Weber, Carl Maria Von","Timm, Henry C.",,"""Ozean, du Ungeheuer"" (Ocean, thou mighty mons...","[{'soloistName': 'Otto, Antoinette', 'soloistI...",OBERON
2,3642*,"Hummel, Johann",,,,"[{'soloistName': 'Scharfenberg, William', 'sol...","QUINTET, PIANO, D MINOR, OP. 74"
3,0*,,,Intermission,,[],
4,8834*3,"Weber, Carl Maria Von","Etienne, Denis G.",,Overture,[],OBERON
...,...,...,...,...,...,...,...
83314,52446*,"Beethoven, Ludwig van","Gilbert, Alan",,,[],"SYMPHONY NO. 5 IN C MINOR, OP.67"
83315,53976*,"Handel, George Frideric","Manze, Andrew",,,"[{'soloistName': 'Harvey, Joelle [Joélle]', 's...",MESSIAH
83316,0*,,,Intermission,,[],
83317,53976*,"Handel, George Frideric","Manze, Andrew",,,"[{'soloistName': 'Harvey, Joelle [Joélle]', 's...",MESSIAH


Note that this flattening operation resulted in some loss of information. We no longer have information about the program that each work appeared in. We can partly alleviate this problem by specifying "metadata" from parent levels to append. For example, "season" and "orchestra" are properties of "program", which is the parent of "work". If we want to include these variables with each work, then we pass them to the `meta=` argument of `json_normalize()`.

In [68]:
json_normalize(nyphil["programs"], "works", meta=["season", "orchestra"])

Unnamed: 0,ID,composerName,conductorName,interval,movement,soloists,workTitle,season,orchestra
0,52446*,"Beethoven, Ludwig van","Hill, Ureli Corelli",,,[],"SYMPHONY NO. 5 IN C MINOR, OP.67",1842-43,New York Philharmonic
1,8834*4,"Weber, Carl Maria Von","Timm, Henry C.",,"""Ozean, du Ungeheuer"" (Ocean, thou mighty mons...","[{'soloistName': 'Otto, Antoinette', 'soloistI...",OBERON,1842-43,New York Philharmonic
2,3642*,"Hummel, Johann",,,,"[{'soloistName': 'Scharfenberg, William', 'sol...","QUINTET, PIANO, D MINOR, OP. 74",1842-43,New York Philharmonic
3,0*,,,Intermission,,[],,1842-43,New York Philharmonic
4,8834*3,"Weber, Carl Maria Von","Etienne, Denis G.",,Overture,[],OBERON,1842-43,New York Philharmonic
...,...,...,...,...,...,...,...,...,...
83314,52446*,"Beethoven, Ludwig van","Gilbert, Alan",,,[],"SYMPHONY NO. 5 IN C MINOR, OP.67",2017-18,New York Philharmonic
83315,53976*,"Handel, George Frideric","Manze, Andrew",,,"[{'soloistName': 'Harvey, Joelle [Joélle]', 's...",MESSIAH,2017-18,New York Philharmonic
83316,0*,,,Intermission,,[],,2017-18,New York Philharmonic
83317,53976*,"Handel, George Frideric","Manze, Andrew",,,"[{'soloistName': 'Harvey, Joelle [Joélle]', 's...",MESSIAH,2017-18,New York Philharmonic


However, there is still some loss of information. For example, there is no way to tell from this flattened `DataFrame` which works appeared together on the same program. (In the case of this particular data set, there is a "programID" that could be used to preserve information about the program, but not all data sets will have such an ID.)

Note also that repeated fields that are nested within "work", such as "soloist", remain unflattened. They simply remain as a list of JSON objects embedded within the `DataFrame`. They are not particularly accessible to analysis.

But now that we have a `DataFrame` with one row per work, we can determine the number of unique Beethoven works that the Philharmonic has performed by subsetting the `DataFrame` and grouping by the title of the work.

In [69]:
beethoven = works[works.composerName == "Beethoven,  Ludwig  van"]
len(beethoven.groupby("workTitle")["ID"].count())

144

What if we wanted to know how many works Benny Goodman has performed with the New York Philharmonic? We could flatten the data at the level of the "soloist". Since "soloists" is nested within "works", we specify a path (i.e., `["works", "soloists"]`) as the flattening level.

In [70]:
soloists = json_normalize(nyphil["programs"], ["works", "soloists"])
soloists

Unnamed: 0,soloistInstrument,soloistName,soloistRoles
0,Soprano,"Otto, Antoinette",S
1,Piano,"Scharfenberg, William",A
2,Violin,"Hill, Ureli Corelli",A
3,Viola,"Derwort, G. H.",A
4,Cello,"Boucher, Alfred",A
...,...,...,...
56926,Soprano,"Harvey, Joelle [Joélle]",S
56927,Mezzo-Soprano,"Johnson Cano, Jennifer",S
56928,Tenor,"Bliss, Ben",S
56929,Baritone,"Duncan, Tyler",S


Now we can use this flattened `DataFrame` to easily answer the question.

In [71]:
(soloists["soloistName"] == "Goodman, Benny").sum()

25

If we wanted to know how many works by Mozart that Goodman performed, we need to additionally store the "composerName" from the "works" level. We do this by specifying the path to "composerName" (i.e., `["works", "soloists"]`) in the `meta=` argument. But there is a catch. There are some works where the "composerName" field is missing. `json_normalize()` will fail if it cannot find the "composerName" key for even a single work. So we have to manually go through the JSON object and manually add "composerName" to the object, setting its value to `None`, if it does not exist.

In [72]:
for program in nyphil["programs"]:
    for work in program["works"]:
        if "composerName" not in work:
            work["composerName"] = None

In [73]:
soloists = json_normalize(
    nyphil["programs"],
    ["works", "soloists"], 
    meta=[["works", "composerName"], "season"]
)
soloists

Unnamed: 0,soloistInstrument,soloistName,soloistRoles,works.composerName,season
0,Soprano,"Otto, Antoinette",S,"Weber, Carl Maria Von",1842-43
1,Piano,"Scharfenberg, William",A,"Hummel, Johann",1842-43
2,Violin,"Hill, Ureli Corelli",A,"Hummel, Johann",1842-43
3,Viola,"Derwort, G. H.",A,"Hummel, Johann",1842-43
4,Cello,"Boucher, Alfred",A,"Hummel, Johann",1842-43
...,...,...,...,...,...
56926,Soprano,"Harvey, Joelle [Joélle]",S,"Handel, George Frideric",2017-18
56927,Mezzo-Soprano,"Johnson Cano, Jennifer",S,"Handel, George Frideric",2017-18
56928,Tenor,"Bliss, Ben",S,"Handel, George Frideric",2017-18
56929,Baritone,"Duncan, Tyler",S,"Handel, George Frideric",2017-18


In [74]:
soloists[soloists["soloistName"] == "Goodman, Benny"]["works.composerName"].value_counts()

Mozart,  Wolfgang  Amadeus      3
Weber,  Carl  Maria Von         3
Gershwin,  George               2
Sauter,  Eddie                  2
Baxter,  Phil                   1
                               ..
Copland,  Aaron                 1
Unspecified,                    1
Anthem,                         1
Handy,  William  Christopher    1
Prima,  Louis                   1
Name: works.composerName, Length: 19, dtype: int64

# RESTful Web Services

One way that organizations expose their data to the public is through RESTful web services. In a typical RESTful service, the user specifies the kind of data they want in the URL, and the server returns the desired data. JSON is a common format for returning data.

For example, the [Star Wars API](http://swapi.co) is a RESTful web service that returns data about the Star Wars universe, including characters, spaceships, and planets. To look up information about characters named "Skywalker", we would issue an HTTP request to the URL http://swapi.co/api/people/?search=skywalker. Notice that this returns data in JSON format.

To issue the HTTP request within Python (so that we can further process the JSON), we can use the `requests` library in Python.

In [75]:
import requests
resp = requests.get("http://swapi.co/api/people/?search=skywalker")
resp

<Response [200]>

The response object contains the JSON and other metadata. To extract the JSON in the form of a Python dict, we call `.json()` on the response object.

In [76]:
skywalker = resp.json()
skywalker

{'count': 3,
 'next': None,
 'previous': None,
 'results': [{'name': 'Luke Skywalker',
   'height': '172',
   'mass': '77',
   'hair_color': 'blond',
   'skin_color': 'fair',
   'eye_color': 'blue',
   'birth_year': '19BBY',
   'gender': 'male',
   'homeworld': 'https://swapi.co/api/planets/1/',
   'films': ['https://swapi.co/api/films/2/',
    'https://swapi.co/api/films/6/',
    'https://swapi.co/api/films/3/',
    'https://swapi.co/api/films/1/',
    'https://swapi.co/api/films/7/'],
   'species': ['https://swapi.co/api/species/1/'],
   'vehicles': ['https://swapi.co/api/vehicles/14/',
    'https://swapi.co/api/vehicles/30/'],
   'starships': ['https://swapi.co/api/starships/12/',
    'https://swapi.co/api/starships/22/'],
   'created': '2014-12-09T13:50:51.644000Z',
   'edited': '2014-12-20T21:17:56.891000Z',
   'url': 'https://swapi.co/api/people/1/'},
  {'name': 'Anakin Skywalker',
   'height': '188',
   'mass': '84',
   'hair_color': 'blond',
   'skin_color': 'fair',
   'eye_col

In [77]:
from pandas.io.json import json_normalize

Now we can process this data just like we did with the JSON data that we read in from a file.

In [78]:
json_normalize(skywalker, "results")

Unnamed: 0,birth_year,created,edited,eye_color,films,gender,hair_color,height,homeworld,mass,name,skin_color,species,starships,url,vehicles
0,19BBY,2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,blue,"[https://swapi.co/api/films/2/, https://swapi....",male,blond,172,https://swapi.co/api/planets/1/,77,Luke Skywalker,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/12/, https://s...",https://swapi.co/api/people/1/,"[https://swapi.co/api/vehicles/14/, https://sw..."
1,41.9BBY,2014-12-10T16:20:44.310000Z,2014-12-20T21:17:50.327000Z,blue,"[https://swapi.co/api/films/5/, https://swapi....",male,blond,188,https://swapi.co/api/planets/1/,84,Anakin Skywalker,fair,[https://swapi.co/api/species/1/],"[https://swapi.co/api/starships/59/, https://s...",https://swapi.co/api/people/11/,"[https://swapi.co/api/vehicles/44/, https://sw..."
2,72BBY,2014-12-19T17:57:41.191000Z,2014-12-20T21:17:50.401000Z,brown,"[https://swapi.co/api/films/5/, https://swapi....",female,black,163,https://swapi.co/api/planets/1/,unknown,Shmi Skywalker,fair,[https://swapi.co/api/species/1/],[],https://swapi.co/api/people/43/,[]


# Ethical Enlightenment: Staggering Requests

Suppose you want information about the starships associated with the Skywalkers you found above. If we flatten the JSON object at the "starships" level, then we get a list of URLs that we can query to get information about each starship.

In [79]:
starship_urls = json_normalize(skywalker, ["results", "starships"])
starship_urls

Unnamed: 0,0
0,https://swapi.co/api/starships/12/
1,https://swapi.co/api/starships/22/
2,https://swapi.co/api/starships/59/
3,https://swapi.co/api/starships/65/
4,https://swapi.co/api/starships/39/


It is straightforward enough to write a loop that queries each of these URLs and saves the corresponding JSON object. However, a script can easily issue hundreds, even thousands, of queries per second, and we want to avoid spamming the server. (In fact, if a website detects many requests coming from the same IP address, it may think it is being attacked and block the IP address.)

To respect the host, who is providing this information for free, we stagger the queries by inserting a delay. This can be done using `time.sleep()`, which will suspend execution of the script for the given number of seconds. We will add a half second delay (so that we make no more than 2 queries per second) between requests.

In [80]:
import time

starships = []
for starship_url in starship_urls[0]:
    
    # get the JSON for the starship from the REST API
    resp = requests.get(starship_url)
    starships.append(resp.json())
    
    # add a 0.5 second delay between each query
    time.sleep(0.5)
    
starships

[{'name': 'X-wing',
  'model': 'T-65 X-wing',
  'manufacturer': 'Incom Corporation',
  'cost_in_credits': '149999',
  'length': '12.5',
  'max_atmosphering_speed': '1050',
  'crew': '1',
  'passengers': '0',
  'cargo_capacity': '110',
  'consumables': '1 week',
  'hyperdrive_rating': '1.0',
  'MGLT': '100',
  'starship_class': 'Starfighter',
  'pilots': ['https://swapi.co/api/people/1/',
   'https://swapi.co/api/people/9/',
   'https://swapi.co/api/people/18/',
   'https://swapi.co/api/people/19/'],
  'films': ['https://swapi.co/api/films/2/',
   'https://swapi.co/api/films/3/',
   'https://swapi.co/api/films/1/'],
  'created': '2014-12-12T11:19:05.340000Z',
  'edited': '2014-12-22T17:35:44.491233Z',
  'url': 'https://swapi.co/api/starships/12/'},
 {'name': 'Imperial shuttle',
  'model': 'Lambda-class T-4a shuttle',
  'manufacturer': 'Sienar Fleet Systems',
  'cost_in_credits': '240000',
  'length': '20',
  'max_atmosphering_speed': '850',
  'crew': '6',
  'passengers': '20',
  'cargo_

# Exercises

Exercises 1-3 deal with the New York Philharmonic data set from above.

**Exercise 1.** Answer the Benny Goodman question above ("How many works has Benny Goodman performed with the New York Philharmonic?") by writing nested for loops that traverse the structure of the JSON object. Check that your answer agrees with the one we obtained above by first flattening the JSON object to a `DataFrame`.

In [81]:
count = 0
for program in nyphil["programs"]:
    for work in program["works"]:
        for soloist in work["soloists"]:
            if soloist["soloistName"] == "Goodman, Benny":
                count += 1
count

25

**Exercise 2.** What is the most frequent start time for New York Philharmonic concerts?

In [169]:
json_normalize(nyphil["programs"], "concerts")["Time"].value_counts()

8:30PM    4584
8:00PM    4443
3:00PM    2133
7:30PM    2075
2:30PM    1618
          ... 
9:45PM       1
6:03PM       1
8:10PM       1
8:36PM       1
2:30AM       1
Name: Time, Length: 70, dtype: int64

**Exercise 3.** How many total concerts did the New York Philharmonic perform in the 2014-15 season?

In [112]:
(json_normalize(nyphil["programs"], "concerts", meta="season")["season"]
     .str.contains("2014").value_counts())[1]

217

To answer Exercises 4-6, you will need to issue HTTP requests to the Open States API, which contains information about state legislatures. You will need to include an API key with every request. You can [register for an API key here](https://openstates.org/api/register/). Once you have an API key, enter your API key below. If your API key works, then the code below should produce a `DataFrame` of all of the committees in the California State Assembly (the lower chamber).

In [131]:
# This is just a sample request to test that your API key is working.
apikey = "10243f7b-037f-4bd2-bf57-37b4387e6f4e"
resp = requests.get(
    "https://openstates.org/api/v1/committees/?state=ca&chamber=lower&apikey=%s" % apikey
)

pd.DataFrame(resp.json())

To answer the questions below, you will need to issue your own HTTP requests to the API. To understand how to construct URLs, you will need to refer to the [documentation for this API](http://docs.openstates.org/en/latest/api/).

**Exercise 4.** Legislators typically have offices in both the Capitol building and in their districts. Among the active legislators in the California Assembly (lower chamber), which legislators have the most offices (and how many do they have)?

In [133]:
resp = requests.get(
    "https://openstates.org/api/v1/legislators/?state=ca&chamber=lower&apikey=%s" % apikey
)

pd.DataFrame(resp.json())

In [165]:
json_normalize(resp.json(), "offices", meta="full_name")["full_name"].value_counts()

Cecilia M. Aguiar-Curry    4
Jim Wood                   4
Frank Bigelow              4
Marc Levine                4
Adam C. Gray               3
                          ..
Bill Quirk                 2
Ed Chau                    2
Christy Smith              2
Cottie Petrie-Norris       2
Kansen Chu                 2
Name: full_name, Length: 80, dtype: int64

**Exercise 5.** Get all of the _constitutional amendments_ in the California State Senate (upper house) from the current legislative session. How many amendments have there been?

(_Hint:_ "Constitutional amendment" is a type of bill.)

In [None]:
# ENTER YOUR CODE HERE.

**Exercise 6.** Look up the votes on the constitutional amendments you found in Exercise 5. Calculate the number of "yes" and "no" votes for each legislator on these amendments. Which legislator had the most total votes on constitutional amendments in the current session? Which legislator had the most total negative votes?

In [None]:
# ENTER YOUR CODE HERE.