# Hierarchical Data and JSON

A lot of data in the real world is naturally hierarchical. As an example, consider a data set where each observation is a TV show. Many of the variables in this data set are what we would expect, such as the runtime of each show and what network it was broadcast on. But there are also variables that are unorthodox, such as the season and the cast. A TV show can have multiple seasons and multiple cast members, as illustrated in the figure below.

![](https://github.com/dlsun/pods/blob/master/11-Hierarchical-Data/hierarchical_data.png?raw=1)

Can we represent all of this information in a single `DataFrame`? If each row represents a single show, then it is straightforward to have columns containing the runtimes, the premiere dates, and so on. But it is not obvious how to incorporate the season information into this `DataFrame` in a way that still makes this information accessible for analysis. There are multiple challenges:

- A show has multiple seasons. We will need multiple columns, one for each season.
- The number of seasons varies from show to show. For example, "Girls" was on for 6 seasons, while "The Golden Girls" was on for 7. We will need to have at least 7 columns in our `DataFrame` to be able to store all the information for "The Golden Girls", even though we only need 6 columns for "Girls" (and perhaps even fewer for other shows).
- Each season has multiple variables associated with it, such as the premiere date and the end date. We will need a separate column for each of these variables.

The resulting `DataFrame` might look something like this.

|name    | runtime |  premiered  | season1premiere | season1end | ... | season7premiere | season7end |
|--------|---------|-------------|-----------------|------------|-----|-----------------|------------|
| Girls | 30       | 2012-04-15  | 2012-04-15      | 2012-06-17 | ... | `NaN`           | `NaN`      |
| The Golden Girls | 30 | 1985-09-14 | 1985-09-14 | 1986-05-10 | ... | 1991-09-21 | 1992-05-09      |
| ... | ... | ... | ... | ... | ... | ... | ... |

Furthermore, each season contains a different number of episodes. If we also want to store information about each episode, it is impractical to do so in a `DataFrame`.

The problem is that this data is naturally hierarchical. A TV show can have multiple cast members and multiple seasons; furthermore, each season can have multiple episodes. Hierarchical data requires a different storage format, which we'll explore now.
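To make the nesting concrete, here is a minimal sketch of how one show could be written as nested Python dictionaries and lists. The values are taken from examples that appear in this lesson, the episode lists are deliberately truncated, and the variable name `girls_sketch` is only an illustration.

```python
# One show as nested dictionaries and lists.
# Episode lists are truncated for brevity; this is a sketch, not the full data.
girls_sketch = {
    "name": "Girls",
    "runtime": 30,
    "seasons": [
        {
            "number": 1,
            "premiereDate": "2012-04-15",
            "endDate": "2012-06-17",
            "episodes": [
                {"number": 1, "name": "Pilot"},
            ],
        },
        {
            "number": 2,
            "premiereDate": "2013-01-13",
            "endDate": "2013-03-17",
            "episodes": [],  # omitted in this sketch
        },
    ],
}

# Each level can hold a different number of items, so no padding
# columns such as season7premiere are needed.
num_seasons = len(girls_sketch["seasons"])
first_episode = girls_sketch["seasons"][0]["episodes"][0]["name"]
print(num_seasons, first_episode)
```

Because every show carries its own list of seasons, a show with 7 seasons and a show with 6 seasons can sit side by side without any missing values.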

# The JSON Data Format

The JavaScript Object Notation, or **JSON**, data format is a popular way to represent hierarchical data. Despite its name, its application extends far beyond JavaScript, the language for which it was originally designed.

Let's take a look at the first 1000 characters of a JSON file. (_Warning:_ Never try to print the entire contents of a JSON file in a Jupyter notebook; this will freeze the notebook if the file is large!)

In [1]:
# Fetch data from a URL
import requests
response = requests.get("https://dlsun.github.io/pods/data/tvshows.json")

print(response.text[:1000])

[{"id": 139, "url": "http://www.tvmaze.com/shows/139/girls", "name": "Girls", "type": "Scripted", "language": "English", "genres": ["Drama", "Romance"], "status": "Ended", "runtime": 30, "premiered": "2012-04-15", "officialSite": "http://www.hbo.com/girls", "schedule": {"time": "22:00", "days": ["Sunday"]}, "rating": {"average": 6.9}, "weight": 75, "network": {"id": 8, "name": "HBO", "country": {"name": "United States", "code": "US", "timezone": "America/New_York"}}, "webChannel": null, "externals": {"tvrage": 30124, "thetvdb": 220411, "imdb": "tt1723816"}, "image": {"medium": "http://static.tvmaze.com/uploads/images/medium_portrait/31/78286.jpg", "original": "http://static.tvmaze.com/uploads/images/original_untouched/31/78286.jpg"}, "summary": "<p>This Emmy winning series is a comic look at the assorted humiliations and rare triumphs of a group of girls in their 20s.</p>", "updated": 1577601053, "cast": [{"person": {"id": 27410, "url": "http://www.tvmaze.com/people/27410/lena-dunham",

This syntax should seem familiar if you are a regular user of Python. Except for a few cosmetic differences (for example, `null` instead of `None`), this is essentially the syntax of Python dictionaries and lists. The `json` library in Python translates a JSON string into the corresponding Python objects; here, the result is a list of dicts, one per show.
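The correspondence between JSON and Python types can be seen on a small string. The record below is a hypothetical, shortened version of the data shown above, used only to illustrate how `json.loads()` maps each JSON type.

```python
import json

# A short JSON string modeled on the fields shown above (hypothetical record).
text = '{"name": "Girls", "webChannel": null, "voice": false, "genres": ["Drama", "Romance"]}'

record = json.loads(text)

# JSON object -> dict, array -> list, null -> None, false -> False
print(type(record))
print(record["webChannel"])
print(record["voice"])
print(record["genres"])

# A JSON array at the top level becomes a Python list, not a dict.
records = json.loads("[" + text + "]")
print(type(records), len(records))
```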

In [4]:
import json
data_shows = json.loads(response.text)
data_shows

[{'id': 139,
  'url': 'http://www.tvmaze.com/shows/139/girls',
  'name': 'Girls',
  'type': 'Scripted',
  'language': 'English',
  'genres': ['Drama', 'Romance'],
  'status': 'Ended',
  'runtime': 30,
  'premiered': '2012-04-15',
  'officialSite': 'http://www.hbo.com/girls',
  'schedule': {'time': '22:00', 'days': ['Sunday']},
  'rating': {'average': 6.9},
  'weight': 75,
  'network': {'id': 8,
   'name': 'HBO',
   'country': {'name': 'United States',
    'code': 'US',
    'timezone': 'America/New_York'}},
  'webChannel': None,
  'externals': {'tvrage': 30124, 'thetvdb': 220411, 'imdb': 'tt1723816'},
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/31/78286.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/31/78286.jpg'},
  'summary': '<p>This Emmy winning series is a comic look at the assorted humiliations and rare triumphs of a group of girls in their 20s.</p>',
  'updated': 1577601053,
  'cast': [{'person': {'id': 27410,
 

If the JSON is read in from a URL using the `requests` library, then the JSON object can also be accessed directly from the response.

In [5]:
# This code is equivalent to the above code.
data_shows = response.json()
data_shows

[{'id': 139,
  'url': 'http://www.tvmaze.com/shows/139/girls',
  'name': 'Girls',
  'type': 'Scripted',
  'language': 'English',
  'genres': ['Drama', 'Romance'],
  'status': 'Ended',
  'runtime': 30,
  'premiered': '2012-04-15',
  'officialSite': 'http://www.hbo.com/girls',
  'schedule': {'time': '22:00', 'days': ['Sunday']},
  'rating': {'average': 6.9},
  'weight': 75,
  'network': {'id': 8,
   'name': 'HBO',
   'country': {'name': 'United States',
    'code': 'US',
    'timezone': 'America/New_York'}},
  'webChannel': None,
  'externals': {'tvrage': 30124, 'thetvdb': 220411, 'imdb': 'tt1723816'},
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/31/78286.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/31/78286.jpg'},
  'summary': '<p>This Emmy winning series is a comic look at the assorted humiliations and rare triumphs of a group of girls in their 20s.</p>',
  'updated': 1577601053,
  'cast': [{'person': {'id': 27410,
 

Now let's investigate the JSON data that we just loaded, again being careful not to print out all of the data. Let's start by looking at the top-level variables associated with each TV show.

In [6]:
show = data_shows[0] # data for the show "Girls"
show.keys()

dict_keys(['id', 'url', 'name', 'type', 'language', 'genres', 'status', 'runtime', 'premiered', 'officialSite', 'schedule', 'rating', 'weight', 'network', 'webChannel', 'externals', 'image', 'summary', 'updated', 'cast', 'seasons'])

We see variables like **runtime** and **premiered** which contain a single value for each show.

In [7]:
show["runtime"]

30

In [8]:
show["premiered"]

'2012-04-15'

But we also see "variables" like **schedule** and **network** which contain dictionaries.

In [9]:
show["schedule"]

{'time': '22:00', 'days': ['Sunday']}

In [10]:
show["network"]

{'id': 8,
 'name': 'HBO',
 'country': {'name': 'United States',
  'code': 'US',
  'timezone': 'America/New_York'}}

And we also see "variables" like **cast** and **seasons**, which contain multiple values.

In [11]:
show["cast"]

[{'person': {'id': 27410,
   'url': 'http://www.tvmaze.com/people/27410/lena-dunham',
   'name': 'Lena Dunham',
   'country': {'name': 'United States',
    'code': 'US',
    'timezone': 'America/New_York'},
   'birthday': '1986-05-13',
   'deathday': None,
   'gender': 'Female',
   'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/3/7597.jpg',
    'original': 'http://static.tvmaze.com/uploads/images/original_untouched/3/7597.jpg'}},
  'character': {'id': 36886,
   'url': 'http://www.tvmaze.com/characters/36886/girls-hannah-horvath',
   'name': 'Hannah Horvath',
   'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/0/1954.jpg',
    'original': 'http://static.tvmaze.com/uploads/images/original_untouched/0/1954.jpg'}},
  'self': False,
  'voice': False},
 {'person': {'id': 11102,
   'url': 'http://www.tvmaze.com/people/11102/allison-williams',
   'name': 'Allison Williams',
   'country': {'name': 'United States',
    'code': 'US',
    '

A "variable" (like **cast**) with multiple values is called a _repeated field_. A repeated field might itself contain a repeated field (e.g., each show has multiple seasons, and each season in turn has multiple episodes), creating a hierarchy of variables. Repeated fields are represented as lists or arrays in JSON.

Let's take a closer look at how each cast member is represented, by zooming in on the first cast member.

In [12]:
show["cast"][0]

{'person': {'id': 27410,
  'url': 'http://www.tvmaze.com/people/27410/lena-dunham',
  'name': 'Lena Dunham',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'},
  'birthday': '1986-05-13',
  'deathday': None,
  'gender': 'Female',
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/3/7597.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/3/7597.jpg'}},
 'character': {'id': 36886,
  'url': 'http://www.tvmaze.com/characters/36886/girls-hannah-horvath',
  'name': 'Hannah Horvath',
  'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/0/1954.jpg',
   'original': 'http://static.tvmaze.com/uploads/images/original_untouched/0/1954.jpg'}},
 'self': False,
 'voice': False}

It appears that each cast member is itself a dictionary with four keys: **person** (i.e., the actor), **character**, **self**, and **voice**. The first two attributes are themselves dictionaries containing further information about the actor and the character, while the last two attributes are booleans.
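Reaching a value deep in the hierarchy is a matter of chaining keys and list indices. The sketch below uses a trimmed-down copy of the structure shown above (most keys are omitted, and the name `show_sketch` is just for illustration).

```python
# A trimmed-down copy of one show; most keys are omitted.
show_sketch = {
    "name": "Girls",
    "cast": [
        {
            "person": {"name": "Lena Dunham", "birthday": "1986-05-13"},
            "character": {"name": "Hannah Horvath"},
            "self": False,
            "voice": False,
        },
    ],
}

# Index into the list first, then follow the dictionary keys.
first = show_sketch["cast"][0]
actor_name = first["person"]["name"]
character_name = first["character"]["name"]
print(actor_name, "plays", character_name)

# dict.get() returns None instead of raising KeyError when a key is absent.
missing = first["person"].get("deathday")
print(missing)
```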

If we wanted to get the complete list of actors who appeared in these shows, excluding voice actors, we could traverse the levels using nested loops:

In [13]:
actors = []
for show in data_shows:
    for cast in show["cast"]:
        # exclude voice actors
        if not cast["voice"]:
            actors.append(cast["person"]["name"])

actors

['Lena Dunham',
 'Allison Williams',
 'Jemima Kirke',
 'Zosia Mamet',
 'Adam Driver',
 'Alex Karpovsky',
 'Andrew Rannells',
 'Ebon Moss-Bachrach',
 'Bea Arthur',
 'Betty White',
 'Rue McClanahan',
 'Estelle Getty',
 'Christina Hendricks',
 'Manny Montana',
 'Reno Wilson',
 'Matthew Lillard',
 'Retta',
 'Mae Whitman',
 'Lidya Jewett',
 'Izzy Stannard',
 'Laura Chinn',
 'Melanie Field',
 'Laci Mosley',
 'Patty Guggenheim',
 'Annie LeBlanc',
 'Brooke Butler',
 'Hayden Summerall',
 'Dylan Conrique',
 'Riley Lewis',
 'Carson Lueders',
 'Mads Lewis',
 'Indiana Massara',
 'Greg Marks',
 'Caden Conrique',
 'Aliyah Moulden',
 'Rush Holland Butler',
 'Brec Bassinger',
 'Jeremiah Perkins',
 'Ariel Martin',
 'Jenna Davis',
 'Talin Silva',
 'Kelsey Leon',
 'Erin Reese DeJarnette',
 'Matt Sato',
 'Grant Knoche',
 'Aidette Cancino',
 'Luke Patrick Dodge',
 'Kaylyn Slevin',
 'Lily Chee',
 'Isabel Marcus',
 'Jay Ulloa',
 'Paul Toweh',
 'Sean Cavaliere',
 'Marlhy Murphy',
 'Hayley LeBlanc',
 'Kathy Kie

However, it is often easier to work with hierarchical data by first flattening it to a `DataFrame`.

# Flattening Hierarchical Data

Although hierarchical data cannot be efficiently represented using a `DataFrame`, most questions do not require working with the full data. In these cases, it is helpful to first "flatten" the JSON data into a `DataFrame`.

For example, suppose we want to know the average runtime of shows. To answer this question, it suffices to work with a `DataFrame` with one row per show. We can use the `json_normalize()` function in `pandas` to flatten the data into a `DataFrame` of this form.

In [14]:
import pandas as pd

df_shows = pd.json_normalize(data_shows)
df_shows

Unnamed: 0,id,url,name,type,language,genres,status,runtime,premiered,officialSite,...,externals.thetvdb,externals.imdb,image.medium,image.original,network,webChannel.id,webChannel.name,webChannel.country.name,webChannel.country.code,webChannel.country.timezone
0,139,http://www.tvmaze.com/shows/139/girls,Girls,Scripted,English,"[Drama, Romance]",Ended,30,2012-04-15,http://www.hbo.com/girls,...,220411,tt1723816,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
1,722,http://www.tvmaze.com/shows/722/the-golden-girls,The Golden Girls,Scripted,English,"[Drama, Comedy]",Ended,30,1985-09-14,,...,71292,tt0088526,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
2,23542,http://www.tvmaze.com/shows/23542/good-girls,Good Girls,Scripted,English,"[Drama, Comedy, Crime]",Running,60,2018-02-26,https://www.nbc.com/good-girls?nbc=1,...,328577,tt6474378,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
3,6771,http://www.tvmaze.com/shows/6771/the-powerpuff...,The Powerpuff Girls,Animation,English,"[Comedy, Action, Science-Fiction]",Running,15,2016-04-04,https://www.cartoonnetwork.com/video/powerpuff...,...,307473,tt4718304,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
4,42726,http://www.tvmaze.com/shows/42726/florida-girls,Florida Girls,Scripted,English,[Comedy],Running,30,2019-07-10,https://poptv.com/floridagirls,...,363682,tt8548870,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
5,32087,http://www.tvmaze.com/shows/32087/chicken-girls,Chicken Girls,Scripted,English,"[Drama, Children, Music]",Running,16,2017-09-05,https://www.youtube.com/playlist?list=PLVewHiZ...,...,339854,,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,274.0,Brat,United States,US,America/New_York
6,33320,http://www.tvmaze.com/shows/33320/derry-girls,Derry Girls,Scripted,English,[Comedy],Running,30,2018-01-04,http://www.channel4.com/programmes/derry-girls,...,338903,tt7120662,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
7,1955,http://www.tvmaze.com/shows/1955/the-powerpuff...,The Powerpuff Girls,Animation,English,"[Action, Children, Crime]",Ended,30,1998-11-18,,...,76200,tt0175058,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
8,1073,http://www.tvmaze.com/shows/1073/bomb-girls,Bomb Girls,Scripted,English,"[Drama, Romance, War]",Ended,60,2012-01-04,,...,254378,tt1955311,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,
9,525,http://www.tvmaze.com/shows/525/gilmore-girls,Gilmore Girls,Scripted,English,"[Drama, Comedy, Romance]",Ended,60,2000-10-05,,...,76568,tt0238784,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,


Using this `DataFrame`, we can compute the mean runtime of these shows as usual.

In [15]:
df_shows["runtime"].mean()

36.1

Let us take a closer look at the columns of this `DataFrame`.

In [16]:
df_shows.keys()
# Since df_shows is a DataFrame, this is equivalent to: df_shows.columns

Index(['id', 'url', 'name', 'type', 'language', 'genres', 'status', 'runtime',
       'premiered', 'officialSite', 'weight', 'webChannel', 'summary',
       'updated', 'cast', 'seasons', 'schedule.time', 'schedule.days',
       'rating.average', 'network.id', 'network.name', 'network.country.name',
       'network.country.code', 'network.country.timezone', 'externals.tvrage',
       'externals.thetvdb', 'externals.imdb', 'image.medium', 'image.original',
       'network', 'webChannel.id', 'webChannel.name',
       'webChannel.country.name', 'webChannel.country.code',
       'webChannel.country.timezone'],
      dtype='object')

Notice that variables that were themselves dictionaries, such as **schedule** and **network**, have been expanded into multiple columns, with names like **schedule.time**, **schedule.days**, etc.
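The naming rule is easier to see on a single small record. The sketch below assumes a hypothetical one-show list with only a few of the keys from the real data; nested dictionaries are expanded level by level, while lists are left untouched.

```python
import pandas as pd

# One hypothetical record with a subset of the keys shown above.
records = [
    {
        "name": "Girls",
        "runtime": 30,
        "schedule": {"time": "22:00", "days": ["Sunday"]},
        "network": {"name": "HBO", "country": {"code": "US"}},
    },
]

df_sketch = pd.json_normalize(records)

# Nested keys are joined with "." to form the column names.
print(df_sketch.columns.tolist())

# Lists are not expanded: the cell still holds the raw Python list.
print(df_sketch.loc[0, "schedule.days"])
```

If dots in column names are inconvenient, `json_normalize()` also accepts a `sep=` argument to choose a different separator.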

Repeated fields, like **genres**, **cast**, and **seasons**, are also columns in this `DataFrame`. These columns just contain a dump of the raw JSON. The information in these columns is not readily accessible.

In [17]:
df_shows["seasons"]

0    [{'id': 650, 'url': 'http://www.tvmaze.com/sea...
1    [{'id': 2923, 'url': 'http://www.tvmaze.com/se...
2    [{'id': 58294, 'url': 'http://www.tvmaze.com/s...
3    [{'id': 20323, 'url': 'http://www.tvmaze.com/s...
4    [{'id': 97695, 'url': 'http://www.tvmaze.com/s...
5    [{'id': 75322, 'url': 'http://www.tvmaze.com/s...
6    [{'id': 77973, 'url': 'http://www.tvmaze.com/s...
7    [{'id': 7073, 'url': 'http://www.tvmaze.com/se...
8    [{'id': 4998, 'url': 'http://www.tvmaze.com/se...
9    [{'id': 2080, 'url': 'http://www.tvmaze.com/se...
Name: seasons, dtype: object

What if we wanted to identify the show with the most episodes? It is difficult to calculate this from the `DataFrame` above, since the episodes are buried within the **seasons** column. It would be preferable to have a `DataFrame` where each row represents a season of a show.

The `json_normalize()` function also accepts an additional argument specifying the variable that we want to be the rows of the `DataFrame`. So if we wanted a `DataFrame` where each row represents a season, we would pass in the name of that variable in the JSON data (i.e., **seasons**) to `json_normalize()`.

In [18]:
df_seasons = pd.json_normalize(data_shows, "seasons")
df_seasons

Unnamed: 0,id,url,number,name,episodeOrder,premiereDate,endDate,webChannel,summary,episodes,...,image.medium,image.original,image,network,webChannel.id,webChannel.name,webChannel.country,webChannel.country.name,webChannel.country.code,webChannel.country.timezone
0,650,http://www.tvmaze.com/seasons/650/girls-season-1,1,,10.0,2012-04-15,2012-06-17,,,"[{'id': 10820, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,
1,651,http://www.tvmaze.com/seasons/651/girls-season-2,2,,10.0,2013-01-13,2013-03-17,,,"[{'id': 10830, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,
2,652,http://www.tvmaze.com/seasons/652/girls-season-3,3,,12.0,2014-01-12,2014-03-23,,,"[{'id': 10840, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,
3,653,http://www.tvmaze.com/seasons/653/girls-season-4,4,,10.0,2015-01-11,2015-03-22,,,"[{'id': 40963, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,
4,21282,http://www.tvmaze.com/seasons/21282/girls-seas...,5,,10.0,2016-02-21,2016-04-17,,,"[{'id': 409239, 'url': 'http://www.tvmaze.com/...",...,,,,,,,,,,
5,48205,http://www.tvmaze.com/seasons/48205/girls-seas...,6,,10.0,2017-02-12,2017-04-16,,,"[{'id': 1127016, 'url': 'http://www.tvmaze.com...",...,,,,,,,,,,
6,2923,http://www.tvmaze.com/seasons/2923/the-golden-...,1,,25.0,1985-09-14,1986-05-10,,,"[{'id': 63861, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,
7,2924,http://www.tvmaze.com/seasons/2924/the-golden-...,2,,26.0,1986-09-27,1987-05-16,,,"[{'id': 63886, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,
8,2925,http://www.tvmaze.com/seasons/2925/the-golden-...,3,,25.0,1987-09-19,1988-05-07,,,"[{'id': 63912, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,
9,2926,http://www.tvmaze.com/seasons/2926/the-golden-...,4,,26.0,1988-10-08,1989-05-13,,,"[{'id': 63937, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,


There is just one problem. We now have a `DataFrame` of seasons, without any indication of which TV show they came from. This is because when we used `json_normalize()`, it automatically discarded variables from all levels above the one that we flattened to. (Since we flattened to the "season" level, we lost all variables associated with the "show".) If there are any variables from the higher levels that we want to keep, then they have to be specified explicitly in the `meta=` argument. Since we want the name of the TV show, which is stored in the "name" key of each show, we specify `meta="name"`.

(We also specify a prefix in the `meta_prefix=` argument because the seasons already have a column called **name**; without a prefix, `pandas` raises an error about the conflicting column name. The prefix is not necessary if the column names do not clash.)

In [19]:
df_seasons = pd.json_normalize(data_shows, "seasons",
                               meta="name", meta_prefix="show.")
df_seasons

Unnamed: 0,id,url,number,name,episodeOrder,premiereDate,endDate,webChannel,summary,episodes,...,image.original,image,network,webChannel.id,webChannel.name,webChannel.country,webChannel.country.name,webChannel.country.code,webChannel.country.timezone,show.name
0,650,http://www.tvmaze.com/seasons/650/girls-season-1,1,,10.0,2012-04-15,2012-06-17,,,"[{'id': 10820, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,,Girls
1,651,http://www.tvmaze.com/seasons/651/girls-season-2,2,,10.0,2013-01-13,2013-03-17,,,"[{'id': 10830, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,,Girls
2,652,http://www.tvmaze.com/seasons/652/girls-season-3,3,,12.0,2014-01-12,2014-03-23,,,"[{'id': 10840, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,,Girls
3,653,http://www.tvmaze.com/seasons/653/girls-season-4,4,,10.0,2015-01-11,2015-03-22,,,"[{'id': 40963, 'url': 'http://www.tvmaze.com/e...",...,http://static.tvmaze.com/uploads/images/origin...,,,,,,,,,Girls
4,21282,http://www.tvmaze.com/seasons/21282/girls-seas...,5,,10.0,2016-02-21,2016-04-17,,,"[{'id': 409239, 'url': 'http://www.tvmaze.com/...",...,,,,,,,,,,Girls
5,48205,http://www.tvmaze.com/seasons/48205/girls-seas...,6,,10.0,2017-02-12,2017-04-16,,,"[{'id': 1127016, 'url': 'http://www.tvmaze.com...",...,,,,,,,,,,Girls
6,2923,http://www.tvmaze.com/seasons/2923/the-golden-...,1,,25.0,1985-09-14,1986-05-10,,,"[{'id': 63861, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,The Golden Girls
7,2924,http://www.tvmaze.com/seasons/2924/the-golden-...,2,,26.0,1986-09-27,1987-05-16,,,"[{'id': 63886, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,The Golden Girls
8,2925,http://www.tvmaze.com/seasons/2925/the-golden-...,3,,25.0,1987-09-19,1988-05-07,,,"[{'id': 63912, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,The Golden Girls
9,2926,http://www.tvmaze.com/seasons/2926/the-golden-...,4,,26.0,1988-10-08,1989-05-13,,,"[{'id': 63937, 'url': 'http://www.tvmaze.com/e...",...,,,,,,,,,,The Golden Girls


From here, it is straightforward to calculate the total number of episodes for each show. First, we determine the number of episodes in each season by calculating the length of each episodes list, storing the result in a new column called **num_episodes**. Then, we calculate the sum of **num_episodes** for each **show.name**.

In [20]:
df_seasons["num_episodes"] = df_seasons["episodes"].apply(len)
df_seasons.groupby("show.name")["num_episodes"].sum()

show.name
Bomb Girls              19
Chicken Girls           76
Derry Girls             12
Florida Girls           10
Gilmore Girls          153
Girls                   63
Good Girls              26
The Golden Girls       181
The Powerpuff Girls    201
Name: num_episodes, dtype: int64
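One caveat: the output above has nine rows although the data set contains ten shows, because two different shows are both named "The Powerpuff Girls" (IDs 6771 and 1955) and grouping by name merges them. A possible fix, sketched below on a made-up miniature data set, is to carry the show's **id** along as well by passing a list to `meta=`. The episode counts here are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd

# Two distinct shows that share a name (episode counts are hypothetical).
shows = [
    {"id": 6771, "name": "The Powerpuff Girls",
     "seasons": [{"number": 1, "episodes": [{"number": 1}, {"number": 2}]}]},
    {"id": 1955, "name": "The Powerpuff Girls",
     "seasons": [{"number": 1, "episodes": [{"number": 1}, {"number": 2}, {"number": 3}]}]},
]

# Keep both the id and the name of the parent show.
df = pd.json_normalize(shows, "seasons",
                       meta=["id", "name"], meta_prefix="show.")
df["num_episodes"] = df["episodes"].apply(len)

# Grouping by name alone merges the two shows into one row ...
by_name = df.groupby("show.name")["num_episodes"].sum()
print(by_name)

# ... while grouping by id and name keeps them apart.
by_id = df.groupby(["show.id", "show.name"])["num_episodes"].sum()
print(by_id)
```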

Alternatively, we could have answered this question by flattening the JSON data to the episode level. Since "episodes" are nested underneath "seasons", we have to specify the path to the "episodes" variable in the JSON data.

In [21]:
df_episodes = pd.json_normalize(data_shows, ["seasons", "episodes"], meta="name", meta_prefix="show.")
df_episodes

Unnamed: 0,id,url,name,season,number,airdate,airtime,airstamp,runtime,summary,image.medium,image.original,image,show.name
0,10820,http://www.tvmaze.com/episodes/10820/girls-1x0...,Pilot,1,1.0,2012-04-15,22:30,2012-04-16T02:30:00+00:00,30,<p>In the premiere of this comedy about twenty...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,Girls
1,10821,http://www.tvmaze.com/episodes/10821/girls-1x0...,Vagina Panic,1,2.0,2012-04-22,22:30,2012-04-23T02:30:00+00:00,30,<p>An appointment at a women's clinic doesn't ...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,Girls
2,10822,http://www.tvmaze.com/episodes/10822/girls-1x0...,All Adventurous Women Do,1,3.0,2012-04-29,22:30,2012-04-30T02:30:00+00:00,30,<p>Hannah contacts her college boyfriend to fi...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,Girls
3,10823,http://www.tvmaze.com/episodes/10823/girls-1x0...,Hannah's Diary,1,4.0,2012-05-06,22:30,2012-05-07T02:30:00+00:00,30,<p>Adam's risqué text message sends Hannah ove...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,Girls
4,10824,http://www.tvmaze.com/episodes/10824/girls-1x0...,Hard Being Easy,1,5.0,2012-05-13,22:30,2012-05-14T02:30:00+00:00,30,<p>Hannah tries a different tack with her boss...,http://static.tvmaze.com/uploads/images/medium...,http://static.tvmaze.com/uploads/images/origin...,,Girls
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
736,47635,http://www.tvmaze.com/episodes/47635/gilmore-g...,Hay Bale Maze,7,18.0,2007-04-17,21:00,2007-04-18T01:00:00+00:00,60,<p>Brief encounter: Stars Hollow's centerpiece...,,,,Gilmore Girls
737,47636,http://www.tvmaze.com/episodes/47636/gilmore-g...,It's Just Like Riding a Bike,7,19.0,2007-04-24,21:00,2007-04-25T01:00:00+00:00,60,<p>Lorelai needs help shopping for a new car. ...,,,,Gilmore Girls
738,47637,http://www.tvmaze.com/episodes/47637/gilmore-g...,Lorelai? Lorelai?,7,20.0,2007-05-01,21:00,2007-05-02T01:00:00+00:00,60,"<p>On karaoke night, a tipsy Lorelai take the ...",,,,Gilmore Girls
739,47638,http://www.tvmaze.com/episodes/47638/gilmore-g...,Unto the Breach,7,21.0,2007-05-08,21:00,2007-05-09T01:00:00+00:00,60,<p>Pomp and circumstances. Rory's graduation b...,,,,Gilmore Girls


Since the observational unit is already an episode, we simply count the number of times each show appears in this `DataFrame`.

In [22]:
df_episodes["show.name"].value_counts()

show.name
The Powerpuff Girls    201
The Golden Girls       181
Gilmore Girls          153
Chicken Girls           76
Girls                   63
Good Girls              26
Bomb Girls              19
Derry Girls             12
Florida Girls           10
Name: count, dtype: int64
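As a sanity check, the two approaches (summing episodes per season versus counting rows at the episode level) should always agree. The sketch below verifies this on a small made-up data set; the show and episode names are placeholders.

```python
import pandas as pd

# Hypothetical miniature data set with the same nesting as the real one.
shows = [
    {"name": "Show A", "seasons": [
        {"number": 1, "episodes": [{"name": "A1"}, {"name": "A2"}]},
        {"number": 2, "episodes": [{"name": "A3"}]},
    ]},
    {"name": "Show B", "seasons": [
        {"number": 1, "episodes": [{"name": "B1"}]},
    ]},
]

# Approach 1: one row per season, then sum the episode counts per show.
df_s = pd.json_normalize(shows, "seasons", meta="name", meta_prefix="show.")
df_s["num_episodes"] = df_s["episodes"].apply(len)
by_season = df_s.groupby("show.name")["num_episodes"].sum()

# Approach 2: one row per episode, then count the rows per show.
df_e = pd.json_normalize(shows, ["seasons", "episodes"],
                         meta="name", meta_prefix="show.")
by_episode = df_e["show.name"].value_counts()

print(by_season.to_dict())
print(by_episode.to_dict())
```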

# RESTful Web Services

One way that organizations expose their data to the public is through RESTful web services. In a typical RESTful service, the user specifies the desired data in an HTTP request, and the server responds with the requested data. JSON is a common format for the response data.

For example, the JSON data that we have been analyzing in this lesson was retrieved from the [TVMaze API](http://www.tvmaze.com/api). Most APIs come with accompanying documentation explaining how to construct HTTP requests to fetch data. For example, to query the TVMaze API for TV shows related to the term "office", we would issue a request to the following URL: http://api.tvmaze.com/search/shows?q=office. Try visiting this URL in a browser; you should see a long string of JSON instead of a rendered webpage!

We can import this JSON into our Python session using the `requests` library, as above.

In [23]:
import requests
response = requests.get("http://api.tvmaze.com/search/shows?q=office")
data_office = response.json()

# Print the first 1000 characters.
str(data_office)[:1000]

"[{'score': 0.70251113, 'show': {'id': 526, 'url': 'https://www.tvmaze.com/shows/526/the-office', 'name': 'The Office', 'type': 'Scripted', 'language': 'English', 'genres': ['Comedy'], 'status': 'Ended', 'runtime': 30, 'averageRuntime': 30, 'premiered': '2005-03-24', 'ended': '2013-05-16', 'officialSite': 'http://www.nbc.com/the-office', 'schedule': {'time': '21:00', 'days': ['Thursday']}, 'rating': {'average': 8.4}, 'weight': 98, 'network': {'id': 1, 'name': 'NBC', 'country': {'name': 'United States', 'code': 'US', 'timezone': 'America/New_York'}, 'officialSite': 'https://www.nbc.com/'}, 'webChannel': None, 'dvdCountry': None, 'externals': {'tvrage': 6061, 'thetvdb': 73244, 'imdb': 'tt0386676'}, 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/271/678637.jpg', 'original': 'https://static.tvmaze.com/uploads/images/original_untouched/271/678637.jpg'}, 'summary': '<p>Steve Carell stars in <b>The Office</b>, a fresh and funny mockumentary-style glimpse into th

This JSON can then be processed using the techniques discussed above.

Although many RESTful APIs work similarly, there is no universal standard, so you will usually have to carefully read the documentation for the API that you want to use.
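Whatever the API, query strings such as `?q=office` are safer to build programmatically than by hand, since search terms may contain spaces or other characters that need escaping. The sketch below uses only the standard library; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.tvmaze.com/search/shows"

def build_search_url(term):
    """Return the search URL for a term, with the query string escaped."""
    return f"{BASE_URL}?{urlencode({'q': term})}"

url_office = build_search_url("office")
url_spaces = build_search_url("golden girls")
print(url_office)
print(url_spaces)

# With the requests library, the same effect is obtained by passing
# the parameters as a dict: requests.get(BASE_URL, params={"q": "office"})
```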

A directory of public REST APIs to experiment with is available at [publicapi.dev](https://publicapi.dev/).

# Ethics Tidbit: Staggering Requests

Suppose we want information about the individual episodes of each show we found above.

In [24]:
df_office = pd.json_normalize(data_office)
df_office

Unnamed: 0,score,show.id,show.url,show.name,show.type,show.language,show.genres,show.status,show.runtime,show.averageRuntime,...,show._links.previousepisode.href,show.network,show.webChannel.id,show.webChannel.name,show.webChannel.country,show.webChannel.officialSite,show.webChannel.country.name,show.webChannel.country.code,show.webChannel.country.timezone,show.image
0,0.702511,526,https://www.tvmaze.com/shows/526/the-office,The Office,Scripted,English,[Comedy],Ended,30.0,30.0,...,https://api.tvmaze.com/episodes/711203,,,,,,,,,
1,0.70017,1292,https://www.tvmaze.com/shows/1292/the-office,The Office,Scripted,English,[Comedy],Ended,30.0,30.0,...,https://api.tvmaze.com/episodes/110286,,,,,,,,,
2,0.664103,57704,https://www.tvmaze.com/shows/57704/the-office,The Office,Scripted,Hindi,[Comedy],Ended,,25.0,...,https://api.tvmaze.com/episodes/2173965,,164.0,Disney+ Hotstar,,,,,,
3,0.657895,25637,https://www.tvmaze.com/shows/25637/radiant-office,Radiant Office,Scripted,Korean,"[Drama, Comedy, Romance]",Ended,65.0,65.0,...,https://api.tvmaze.com/episodes/1099572,,,,,,,,,
4,0.645963,44432,https://www.tvmaze.com/shows/44432/office-watch,Office Watch,Scripted,Korean,"[Drama, Comedy, Romance]",Ended,5.0,5.0,...,https://api.tvmaze.com/episodes/1734133,,122.0,V LIVE,,https://www.vlive.tv/home,"Korea, Republic of",KR,Asia/Seoul,
5,0.633565,57168,https://www.tvmaze.com/shows/57168/the-office,The Office,Scripted,English,[Comedy],Ended,30.0,30.0,...,https://api.tvmaze.com/episodes/2156555,,,,,,,,,
6,0.60736,28775,https://www.tvmaze.com/shows/28775/box-office,Box Office,Variety,English,[],Ended,30.0,30.0,...,https://api.tvmaze.com/episodes/1199441,,,,,,,,,
7,0.578757,44011,https://www.tvmaze.com/shows/44011/african-off...,African Office Worker,Animation,Japanese,"[Comedy, Anime]",Ended,,30.0,...,https://api.tvmaze.com/episodes/1729642,,139.0,Docomo Anime Store,,,Japan,JP,Asia/Tokyo,
8,0.570307,59025,https://www.tvmaze.com/shows/59025/death-office,Death Office,Scripted,Japanese,"[Drama, Anime, Fantasy]",Ended,30.0,30.0,...,https://api.tvmaze.com/episodes/2306191,,342.0,Paravi,,,Japan,JP,Asia/Tokyo,
9,0.565267,69044,https://www.tvmaze.com/shows/69044/the-office-...,The Office Australia,Scripted,English,[Comedy],In Development,,,...,,,3.0,Prime Video,,https://www.primevideo.com,,,,


From [the documentation](http://www.tvmaze.com/api#show-episode-list), we see that the episodes of a show can be retrieved using the ID in the **show.id** column, by constructing a URL of the form `http://api.tvmaze.com/shows/[ID]/episodes`.

It is straightforward enough to write a loop that replaces `[ID]` in this URL with the actual ID of each show. However, a script can easily issue hundreds, even thousands, of queries per second, and we want to avoid spamming the server. In fact, most RESTful services have [rate limiting policies](http://www.tvmaze.com/api#rate-limiting), which means that they automatically block a user who sends too many requests within a given window of time. Many RESTful services also require that an API key be supplied with every request, allowing the website to block the API keys of abusers.

Out of respect for the host, who is often providing this service for free, we stagger our requests by inserting a time delay in our code. This can be done using `time.sleep()`, which suspends execution of the script for the given number of seconds. We will add a half-second delay between requests, so that we make no more than 2 queries per second.

In [26]:
import time

episodes = []
for show_id in df_office["show.id"]:

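    # get the episodes for the show from the REST API
    response = requests.get(f"http://api.tvmaze.com/shows/{show_id}/episodes")
    # stop with an error if the request failed, rather than storing an error message
    response.raise_for_status()
    episodes.extend(response.json())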

    # add a 0.5 second delay between each query
    time.sleep(0.5)

# Now we have a list of episodes in JSON format.
# We can convert this to a DataFrame of episodes using json_normalize.
pd.json_normalize(episodes)

Unnamed: 0,id,url,name,season,number,type,airdate,airtime,airstamp,runtime,summary,rating.average,image.medium,image.original,_links.self.href,_links.show.href,image
0,47640,https://www.tvmaze.com/episodes/47640/the-offi...,Pilot,1,1,regular,2005-03-24,21:30,2005-03-25T02:30:00+00:00,30,<p>A documentary crew arrives at Dundler Miffl...,7.6,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/47640,https://api.tvmaze.com/shows/526,
1,47641,https://www.tvmaze.com/episodes/47641/the-offi...,Diversity Day,1,2,regular,2005-03-29,21:30,2005-03-30T02:30:00+00:00,30,<p>Corporate sends in a consultant after Micha...,8.1,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/47641,https://api.tvmaze.com/shows/526,
2,47642,https://www.tvmaze.com/episodes/47642/the-offi...,Health Care,1,3,regular,2005-04-05,21:30,2005-04-06T01:30:00+00:00,30,<p>Dwight ends up in charge of picking a new h...,8.0,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/47642,https://api.tvmaze.com/shows/526,
3,47643,https://www.tvmaze.com/episodes/47643/the-offi...,The Alliance,1,4,regular,2005-04-12,21:00,2005-04-13T01:00:00+00:00,30,"<p>With Dwight worried about downsizing, Jim a...",8.1,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/47643,https://api.tvmaze.com/shows/526,
4,47644,https://www.tvmaze.com/episodes/47644/the-offi...,Basketball,1,5,regular,2005-04-19,21:00,2005-04-20T01:00:00+00:00,30,<p>Michael challenges the warehouse staff to a...,8.5,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/47644,https://api.tvmaze.com/shows/526,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,2217715,https://www.tvmaze.com/episodes/2217715/death-...,Episode 6,1,6,regular,2019-11-21,23:45,2019-11-21T14:45:00+00:00,30,,,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/2217715,https://api.tvmaze.com/shows/59025,
307,2217714,https://www.tvmaze.com/episodes/2217714/death-...,Episode 7,1,7,regular,2019-11-28,23:45,2019-11-28T14:45:00+00:00,30,,,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/2217714,https://api.tvmaze.com/shows/59025,
308,2217716,https://www.tvmaze.com/episodes/2217716/death-...,Episode 8,1,8,regular,2019-12-05,23:45,2019-12-05T14:45:00+00:00,30,,,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/2217716,https://api.tvmaze.com/shows/59025,
309,2217717,https://www.tvmaze.com/episodes/2217717/death-...,Episode 9,1,9,regular,2019-12-12,23:45,2019-12-12T14:45:00+00:00,30,,,https://static.tvmaze.com/uploads/images/mediu...,https://static.tvmaze.com/uploads/images/origi...,https://api.tvmaze.com/episodes/2217717,https://api.tvmaze.com/shows/59025,
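
The loop above spaces out its requests, but it does not say what to do if the server refuses a request anyway. The following is a minimal sketch of one way to handle that case, not part of the original analysis. It assumes that the server signals rate limiting with the standard HTTP status code 429 ("Too Many Requests"). The helper name `get_json_with_retry`, its parameters, and the 10-second wait are all hypothetical choices made for illustration.

```python
import time


def get_json_with_retry(url, max_tries=3, wait=10.0, get=None, sleep=time.sleep):
    """Fetch `url` and return its parsed JSON, retrying when rate limited.

    Hypothetical helper: `get` and `sleep` are parameters only so that the
    function can be tried out without touching the network.
    """
    if get is None:
        # Imported here so that requests is only needed for real queries.
        import requests
        get = requests.get

    for attempt in range(1, max_tries + 1):
        response = get(url)
        if response.status_code != 429:
            # Raise an exception for any other 4xx or 5xx status code.
            response.raise_for_status()
            return response.json()
        # Assumption: 429 means "slow down", so pause before trying again.
        if attempt < max_tries:
            sleep(wait)

    raise RuntimeError(f"still rate limited after {max_tries} tries: {url}")
```

Inside the loop, the two lines that call `requests.get()` and `response.json()` could then be replaced by `episodes.extend(get_json_with_retry(url))`, keeping the `time.sleep(0.5)` call so that requests remain staggered in the normal case.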
