# In-Class Activity: Getting Data from APIs

Today we will learn how to:
* Use requests to access an API
* Parse the data we get from an API (in JSON)
* Retrieve historical web data from the Internet Archive
* Play around with the TVMaze API

A very big thank you to Brian Keegan (my advisor!) for the materials in his [Web Data Scraping course](https://github.com/CU-ITSS/Web-Data-Scraping-S2023). Check that out if you want to dig deeper ;) Another thank you to Jason Zeitz and Anas Buhayh at the University of Colorado, Boulder, who developed some of the TVMaze API content.

But first... a warm-up!

### Refresher: Lists and Dictionaries in Python
Often we get data in the form of lists. Lists are an **ordered** data structure that can contain integers, strings, or other objects (like lists or dictionaries). Here's an example:

In [3]:
# Make classrooms as lists with student names as strings
classroom0 = ['Alice','Bob','Carol','Dave']
classroom1 = ['Eve','Frank','Grace','Harold']
classroom2 = ['Isabel','Jack','Katy','Lloyd']
classroom3 = ['Maria','Nate','Olivia','Philip']
classroom4 = ['Quinn','Rachel','Steve','Terry','Ursula']
classroom5 = ['Violet','Walter','Xavier','Yves','Zoe']

# Make schools that contain classrooms
school0 = [classroom0,classroom1]
school1 = [classroom2,classroom3]
school2 = [classroom4,classroom5]

# Make a school district that contains schools
school_district = [school0,school1,school2]

In [6]:
school_district[0][0]

['Alice', 'Bob', 'Carol', 'Dave']

In [None]:
# Task 1: How would you access classroom0 **FROM** school_district, using index notation?

# YOUR CODE HERE

In [7]:
# Task 2: How would you access the 0th student in classroom3 (Maria) **FROM** school_district, using index notation?

school_district[1][1][0]

'Maria'
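One way to reason about nested indexing is to peel off one layer at a time. Here is a minimal sketch using a small, made-up nested list (the `bookcase` name and its contents are illustrative and not part of the activity data):

```python
# A small, made-up nested list: a bookcase holds shelves, each shelf holds titles
bookcase = [
    ['Dune', 'Emma'],
    ['Ulysses', 'Beloved', 'Hamlet'],
]

# Peel off one layer at a time
second_shelf = bookcase[1]       # the inner list at position 1
last_title = second_shelf[-1]    # negative indexes count from the end

# Chaining the indexes gives the same result in a single expression
same_title = bookcase[1][-1]

print(second_shelf)
print(last_title)
print(same_title == last_title)
```

Saving each layer in its own variable is handy while you are still figuring out a structure; once you know the path, the chained form is shorter.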

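We also often get data in the form of dictionaries. Dictionaries are a data structure containing key-value pairs, kind of like a phonebook. Unlike lists, you look values up by **key** rather than by position (since Python 3.7, dictionaries do remember the order in which keys were inserted, but you still look things up by key).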

Here's a dictionary with information about the states in the Pacific Northwest:

In [9]:
pacific_northwest = {
    'Washington' : {
        'Abbreviation': 'WA',
        'Area': 71362,
        'Capital': 'Olympia',
        'Established': '1889-11-11',
        'Largest city': 'Seattle',
        'Population': 7887965,
        'Representatives': 10
    },
    'Idaho': {
        'Abbreviation': 'ID',
        'Area': 83569,
        'Capital': 'Boise',
        'Established': '1890-07-03',
        'Largest city': 'Boise',
        'Population': 1839106,
        'Representatives': 2
    },
    'Oregon': {
        'Abbreviation': 'OR',
        'Area': 98381,
        'Capital': 'Salem',
        'Established': '1859-02-14',
        'Largest city': 'Portland',
        'Population': 4246155,
        'Representatives': 6
    }
}

In [10]:
# Task 3: How would you get a list of the keys in this dictionary?

pacific_northwest.keys()

dict_keys(['Washington', 'Idaho', 'Oregon'])

In [11]:
# Task 4: How would you access all of the values?

pacific_northwest.values()

dict_values([{'Abbreviation': 'WA', 'Area': 71362, 'Capital': 'Olympia', 'Established': '1889-11-11', 'Largest city': 'Seattle', 'Population': 7887965, 'Representatives': 10}, {'Abbreviation': 'ID', 'Area': 83569, 'Capital': 'Boise', 'Established': '1890-07-03', 'Largest city': 'Boise', 'Population': 1839106, 'Representatives': 2}, {'Abbreviation': 'OR', 'Area': 98381, 'Capital': 'Salem', 'Established': '1859-02-14', 'Largest city': 'Portland', 'Population': 4246155, 'Representatives': 6}])

In [12]:
# Task 5: How would you access *all* of the information about Washington?

pacific_northwest["Washington"]

{'Abbreviation': 'WA',
 'Area': 71362,
 'Capital': 'Olympia',
 'Established': '1889-11-11',
 'Largest city': 'Seattle',
 'Population': 7887965,
 'Representatives': 10}

In [14]:
# Task 6: How would you access the population of Oregon?

pacific_northwest["Oregon"]["Population"]

4246155

Nested data structures do not need to be the same data type. Here's the same information above, but as a list of dictionaries:

In [16]:
pacific_northwest_list = [
    {'Name': 'Washington',
     'Abbreviation': 'WA',
     'Area': 71362,
     'Capital': 'Olympia',
     'Established': '1889-11-11',
     'Largest city': 'Seattle',
     'Population': 7887965,
     'Representatives': 10
    },
    {'Name':'Idaho',
     'Abbreviation': 'ID',
     'Area': 83569,
     'Capital': 'Boise',
     'Established': '1890-07-03',
     'Largest city': 'Boise',
     'Population': 1839106,
     'Representatives': 2
    },
    {'Name': 'Oregon',
     'Abbreviation': 'OR',
     'Area': 98381,
     'Capital': 'Salem',
     'Established': '1859-02-14',
     'Largest city': 'Portland',
     'Population': 4246155,
     'Representatives': 6
    }
]

In [18]:
# Task 7: How would you access the capital of Idaho?

pacific_northwest_list[1]["Capital"]

'Boise'

In [19]:
# Task 8: How would you print out all of the state names and populations?

for state in pacific_northwest_list:
    name = state["Name"]
    pop = state["Population"]
    
    print(name + " : " + str(pop))

Washington : 7887965
Idaho : 1839106
Oregon : 4246155
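The two layouts above (a dictionary of dictionaries and a list of dictionaries) hold the same information, and it can be useful to convert between them. The sketch below assumes a trimmed-down copy of the list with only two fields; the variable names `states`, `population_by_state`, and `largest` are illustrative:

```python
# A trimmed-down copy of the list of dictionaries (only two fields kept)
states = [
    {'Name': 'Washington', 'Population': 7887965},
    {'Name': 'Idaho', 'Population': 1839106},
    {'Name': 'Oregon', 'Population': 4246155},
]

# Re-key the list by state name so we can look a state up directly
population_by_state = {state['Name']: state['Population'] for state in states}

# Find the dictionary with the largest population
largest = max(states, key=lambda state: state['Population'])

print(population_by_state['Idaho'])
print(largest['Name'])
```

A list is convenient when you want to loop over every record; a dictionary is convenient when you already know which record you want.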


Ok, now we are ready to access APIs, which often return data in the JSON format. [JavaScript Object Notation (JSON)](https://www.json.org/) is probably the most popular data interchange format and is especially ubiquitous when retrieving data from the application programming interfaces (APIs) of popular platforms like Twitter, Reddit, Wikipedia, etc.

JSON is attractive for programmers using JavaScript and Python because it can represent a mix of different data types.

What you need to know is that JSON is very similar to the form of a Python dictionary, and it can contain other data structures (for example, lists).
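To make the similarity concrete, here is a small sketch that turns a piece of JSON text into Python objects with the standard `json` module. The JSON string is made up for illustration and does not come from any real API:

```python
import json

# JSON arrives as plain text; this made-up string mimics what an API might send
raw_text = '{"name": "Example Show", "seasons": 3, "genres": ["Comedy", "Drama"], "ended": false, "network": null}'

# json.loads() parses the text into Python dictionaries, lists, strings, numbers...
record = json.loads(raw_text)

print(type(record))                          # the outer object becomes a dict
print(record['genres'][0])                   # the JSON array becomes a list
print(record['ended'], record['network'])    # false/null become False/None

# json.dumps() goes the other way, from Python objects back to JSON text
print(json.dumps(record, indent=2))
```

Note the small spelling differences: JSON writes `false` and `null`, while Python writes `False` and `None`.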

_**As you start to work with APIs, remember to always put on your data detective hats and figure out what structure you are in and how to extract information from it!**_

### Getting historical web pages from the Wayback Machine API

Now we are ready to start using APIs! For our first example, we'll use the Wayback Machine, a service from the [Internet Archive](https://archive.org/), which is a database of historical webpages and media content.

For fun, let's look at a few:
* [CNN in August 2000](https://web.archive.org/web/20000815052826/http://www.cnn.com/)
* [Apple in April 1997](https://web.archive.org/web/19970404064444/http://www.apple.com:80/)
* [Whitman College in 2002](https://web.archive.org/web/20020124214454/http://www.whitman.edu/)


---
**Activity:**

Visit the Wayback Machine at [https://web.archive.org](https://web.archive.org/) and check out a historical version of a page that interests you. Share it with your partner.

---

Pretty fun, huh? Even better, we can access much of this information via an API! Here's some [info from the Wayback Machine about how their API works](https://archive.org/help/wayback_api.php). Let's try it out!

In [20]:
# First import the packages we need

# Lets us talk to other servers on the web
import requests

# APIs spit out data in JSON
import json

# Use BeautifulSoup to parse some HTML
from bs4 import BeautifulSoup

# Safely quoting strings for URLs
from urllib.parse import unquote, quote

# Handling dates and times
from datetime import datetime

# DataFrames!
import pandas as pd
import numpy as np

# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb


The simplest API request we can make asks for the most recent snapshot of a webpage archived by the Wayback Machine. For example:

In [21]:
# API calls come in the form of URLs
wayback_url = 'http://archive.org/wayback/available?url=whitman.edu'

# We can use requests.get() to get the contents of that URL
wayback_response = requests.get(wayback_url)

# Finally, we parse the JSON in the response into a Python dictionary using .json()
wayback_response.json()


{'url': 'whitman.edu',
 'archived_snapshots': {'closest': {'status': '200',
   'available': True,
   'url': 'http://web.archive.org/web/20230307120534/https://www.whitman.edu/',
   'timestamp': '20230307120534'}}}

What do you notice about the response above? What information does it include? How is this information structured?

---
__Fun note: APIs are just URLs!__ You don't need to write any code to check them out. Try pasting this URL, http://archive.org/wayback/available?url=whitman.edu, into your web browser. What happens?
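Since an API call is just a URL, it can help to take one apart and see its pieces. The sketch below uses only Python's standard `urllib.parse` module and makes no network request; the variable names are illustrative:

```python
from urllib.parse import urlsplit, parse_qs, urlencode

api_url = 'http://archive.org/wayback/available?url=whitman.edu'

# Split the URL into its parts
parts = urlsplit(api_url)
print(parts.netloc)            # the server we are talking to
print(parts.path)              # the specific API endpoint on that server
print(parse_qs(parts.query))   # the query: key=value pairs after the '?'

# Going the other way: build the same URL from a dictionary of parameters
rebuilt = 'http://archive.org/wayback/available?' + urlencode({'url': 'whitman.edu'})
print(rebuilt == api_url)
```

Everything after the `?` is the query, which is where we tell the API what we are asking for.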

In [26]:
# Task: Extract the URL from this request (this is the location of the page)
# Save it as a variable called recent_whitman_url

# YOUR CODE HERE
recent_whitman_json = wayback_response.json()

recent_whitman_url = recent_whitman_json['archived_snapshots']['closest']['url']

In [30]:
# Save the JSON
wayback_response_json = wayback_response.json()

# Within the JSON (which is structured like a dictionary), we want to access the URL
# It is actually nested within TWO dictionaries!
recent_whitman_url = wayback_response_json['archived_snapshots']['closest']['url']

In [21]:
# Show how to build it up bit by bit
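One way to build the lookup up bit by bit is to check the keys at each level before going deeper. The sketch below works on a copy of the response shown earlier, pasted in as a plain dictionary so it runs without a network connection (the name `sample_response` is illustrative):

```python
# A copy of the response shown earlier, pasted in as a plain dictionary
sample_response = {
    'url': 'whitman.edu',
    'archived_snapshots': {
        'closest': {
            'status': '200',
            'available': True,
            'url': 'http://web.archive.org/web/20230307120534/https://www.whitman.edu/',
            'timestamp': '20230307120534',
        }
    },
}

# Level 1: what keys does the outer dictionary have?
print(sample_response.keys())

# Level 2: step into 'archived_snapshots' and look again
snapshots = sample_response['archived_snapshots']
print(snapshots.keys())

# Level 3: step into 'closest' and look again
closest = snapshots['closest']
print(closest.keys())

# Finally, pull out the value we want
snapshot_url = closest['url']
print(snapshot_url)
```

Once the path is clear, the three steps collapse into `sample_response['archived_snapshots']['closest']['url']`.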

In [34]:
# Task: Use requests.get() to grab the HTML from this latest page

# YOUR CODE HERE

# Use requests.get() to get the HTML and save it as a variable
whitman_HTML = requests.get(recent_whitman_url)

# Turn that variable into soup using BeautifulSoup()
# (naming the parser explicitly avoids a warning)
soup = BeautifulSoup(whitman_HTML.text, 'html.parser')

# Use find_all to get all of the links
links_text = [link.text for link in soup.find_all('a')]

# Loop through them and print out the text
for l in links_text:
    print(l)







Skip to main content
Apply
Alumni
Diversity
Library
MyWhitman
Families
Make A Gift
Bias Reporting
Bookstore
CARE Team
Career Services
Communications
Employment Opportunities
Giving
Grievance Policy
Newsroom
Nondiscrimination Policy
Right to Know
Sexual Misconduct & Title IX
Social Media
The Center for Writing and Speaking (COWS)
Website Privacy Policy
Welty Student Health Center
A to Z Index
Map
Events Calendar
Penrose Library
myWhitman
A to Z
Map
Events
Library
myWhitman

In [27]:
# And now we can use requests.get() to grab the HTML
recent_whitman_response = requests.get(recent_whitman_url)

# Turn it into soup -- remember to use .text first!
# (naming the parser explicitly avoids a warning)
recent_whitman_soup = BeautifulSoup(recent_whitman_response.text, 'html.parser')

# And find all the links
links_text = [link.text for link in recent_whitman_soup.find_all('a')]

# And print them out
for l in links_text:
    print(l)



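Link text scraped from a real page usually carries extra whitespace, and some links have no visible text at all. A minimal sketch of tidying that up with plain string methods follows; the `raw_links` list is made up to resemble scraped link text and is not the actual page content:

```python
# A few made-up strings shaped like scraped link text
raw_links = [
    'Skip to main content',
    '\n',
    '\n        Apply\n',
    '\n\t\t\tA to Z Index\n\n\t\t',
    '\n        Social Media \n',
]

cleaned = []
for text in raw_links:
    tidy = ' '.join(text.split())   # collapse runs of spaces, tabs, and newlines
    if tidy:                        # skip links that have no visible text
        cleaned.append(tidy)

print(cleaned)
```

Splitting on whitespace and re-joining with a single space is a simple trick that handles newlines and tabs in one go.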
Ok, this is cool ... but it's much more fun to get HISTORICAL data. With the Wayback Machine API, we can also search for content around a given timestamp.

In [35]:
# Notice how we now have '&timestamp=20080201' in the URL
# What do you think this means?
wb_url = 'http://archive.org/wayback/available?url=whitman.edu&timestamp=20080201'

# Use requests.get() to get the response
wb_response = requests.get(wb_url)

# Render it as JSON
wb_response_json = wb_response.json()

# And examine
wb_response_json

# What do you notice?
# When was this page scraped by the Wayback Machine?

{'url': 'whitman.edu',
 'archived_snapshots': {'closest': {'status': '200',
   'available': True,
   'url': 'http://web.archive.org/web/20080517083534/http://whitman.edu/',
   'timestamp': '20080517083534'}},
 'timestamp': '20080201'}
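The timestamps in the response are strings of digits in year-month-day-hour-minute-second order. One way to answer "when was this page scraped?" is to parse them with `datetime.strptime()`. This sketch copies the two timestamp strings from the response above; the variable names are illustrative:

```python
from datetime import datetime

# The date we asked for and the snapshot the API gave us back
requested = datetime.strptime('20080201', '%Y%m%d')
archived = datetime.strptime('20080517083534', '%Y%m%d%H%M%S')

# Print the snapshot time in a more readable form
print(archived.strftime('%B %d, %Y at %H:%M:%S'))

# How far is the closest snapshot from the date we requested?
gap = archived - requested
print(gap.days, 'days after the requested date')
```

The API returns the *closest* snapshot it has, which may be some distance from the date you asked for.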

In [40]:
# Task: Make an API request to find out when Facebook's privacy policy (http://www.facebook.com/policy.php)
# was archived in the Wayback Machine closest to January 1, 2008.

# First construct the API url
# base URL + query, which include Facebook URL + timestamp
url = 'http://archive.org/wayback/available?url=facebook.com/policy.php&timestamp=20080101'

# Then use requests.get() to get the response and turn it into JSON
response_JSON = requests.get(url).json()

# And extract the timestamp
# What day was it archived?
response_JSON['archived_snapshots']['closest']['timestamp']




'20080213201320'

In [33]:
# First construct the API url
url = 'http://archive.org/wayback/available?url=facebook.com/policy.php&timestamp=20080101'

# Then use requests.get() to get the response
response = requests.get(url)

# Turn it into JSON
response_json = response.json()

# And extract the timestamp
response_json['archived_snapshots']['closest']['timestamp']

'20080213201320'
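Not every page has been archived. When the Wayback Machine has no snapshot, the `archived_snapshots` dictionary comes back empty, and indexing straight into `['closest']` raises a `KeyError`. Here is a hedged sketch of a more forgiving lookup; `closest_timestamp` is a hypothetical helper name, and both sample dictionaries are hand-written stand-ins for API responses:

```python
def closest_timestamp(response_json):
    """Return the closest snapshot's timestamp, or None if nothing was archived."""
    # .get() returns a default instead of raising KeyError when a key is missing
    closest = response_json.get('archived_snapshots', {}).get('closest')
    if closest is None:
        return None
    return closest.get('timestamp')


# Hand-written stand-ins for two API responses
found = {
    'url': 'facebook.com/policy.php',
    'archived_snapshots': {'closest': {'status': '200',
                                       'available': True,
                                       'timestamp': '20080213201320'}},
}
missing = {'url': 'example.invalid', 'archived_snapshots': {}}

print(closest_timestamp(found))
print(closest_timestamp(missing))
```

This matters most when you loop over many URLs or dates, where a single missing snapshot would otherwise stop the whole loop.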

What might you do with this? For example, you could examine how Facebook's (or any company's) privacy policies or terms of service changed over time. This would be your starting point -- then you could compile the text and do some natural language processing to analyze it!

A simple way to analyze how the privacy policies and terms of service have changed over time would be to see how the number of words has changed. Brian Keegan has an example of how to do this in his web scraping course -- I encourage you to [check it out!](https://github.com/CU-ITSS/Web-Data-Scraping-S2023/blob/main/Class%2004%20-%20Internet%20Archive%20and%20Wikipedia%20APIs/Class%2004%20-%20Scraping%20Internet%20Archive%20and%20Wikipedia.ipynb)

---

__Activity:__ Brainstorm with a partner for a few minutes about how you might use the Wayback Machine API to do a data science project.

### Using the TVMaze API

Ok, let's try out a different API, this time from [TVMaze](https://www.tvmaze.com/api). This is an API that has information about tons and tons of TV shows.

First let's make a basic request for information about a show.

In [41]:
# This is the basic URL -- we are going to build our query requests from this
base_url = "https://api.tvmaze.com"

# We can get information about specific shows by appending /shows/ and then an ID number to the URL
# The code below requests the info for show 321
showInfo = requests.get(base_url + "/shows/321").json()
print(showInfo)

# What is the show name?

# What information is included?

# How is it structured?

{'id': 321, 'url': 'https://www.tvmaze.com/shows/321/arrested-development', 'name': 'Arrested Development', 'type': 'Scripted', 'language': 'English', 'genres': ['Comedy', 'Family'], 'status': 'Ended', 'runtime': None, 'averageRuntime': 30, 'premiered': '2003-11-02', 'ended': '2019-03-15', 'officialSite': 'https://www.netflix.com/title/70140358', 'schedule': {'time': '', 'days': []}, 'rating': {'average': 8.3}, 'weight': 99, 'network': None, 'webChannel': {'id': 1, 'name': 'Netflix', 'country': None, 'officialSite': 'https://www.netflix.com/'}, 'dvdCountry': None, 'externals': {'tvrage': 2649, 'thetvdb': 72173, 'imdb': 'tt0367279'}, 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/338/846049.jpg', 'original': 'https://static.tvmaze.com/uploads/images/original_untouched/338/846049.jpg'}, 'summary': '<p>After being passed over as partner at The Bluth Co., widower Michael resolves to quit the family business and move away to spend more quality time with his 13

In [44]:
# How would you print out the summary?

# YOUR CODE HERE
print(showInfo['summary'])

# How would you print out the average rating?

print(showInfo['rating']['average'])

<p>After being passed over as partner at The Bluth Co., widower Michael resolves to quit the family business and move away to spend more quality time with his 13-year-old son, George Michael. But when his father George Bluth Sr. is arrested for shifty accounting practices and the family assets are frozen, Michael is forced to stay in Orange County to help his wildly eccentric family pick up the pieces.</p>
8.3


In [37]:
# summary
print(showInfo['summary'])

<p>After being passed over as partner at The Bluth Co., widower Michael resolves to quit the family business and move away to spend more quality time with his 13-year-old son, George Michael. But when his father George Bluth Sr. is arrested for shifty accounting practices and the family assets are frozen, Michael is forced to stay in Orange County to help his wildly eccentric family pick up the pieces.</p>


In [39]:
# average rating
# note that we have a nested dictionary here!
print(showInfo['rating']['average'])

8.3
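Notice that the summary comes back wrapped in HTML tags like `<p>`. A quick, rough way to get plain text is a regular expression plus `html.unescape()`, sketched below on a made-up summary string. This is a shortcut for simple snippets, not a general HTML parser; BeautifulSoup's `get_text()` is the sturdier option for real pages:

```python
import re
from html import unescape

# A made-up stand-in for a show summary returned by the API
summary = '<p><b>Example Show</b> follows a family &amp; their friends.</p>'

# Remove anything that looks like a tag, then decode entities such as &amp;
without_tags = re.sub(r'<[^>]+>', '', summary)
plain_summary = unescape(without_tags)

print(plain_summary)
```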


Ok, this is cool! But how do we know what shows are in the TVMaze database and what their IDs are?

For this, we can do a [show search](https://www.tvmaze.com/api#show-search).

In [45]:
# Search by a string:
showSearch = '/search/shows?q='
queryString = 'bachelor'

searchResults=requests.get(base_url + showSearch + queryString).json()
print(searchResults)

[{'score': 0.7007226, 'show': {'id': 914, 'url': 'https://www.tvmaze.com/shows/914/the-bachelor', 'name': 'The Bachelor', 'type': 'Reality', 'language': 'English', 'genres': ['Romance'], 'status': 'Running', 'runtime': 120, 'averageRuntime': 119, 'premiered': '2002-03-25', 'ended': None, 'officialSite': 'https://abc.com/shows/the-bachelor', 'schedule': {'time': '20:00', 'days': ['Monday']}, 'rating': {'average': 3.2}, 'weight': 98, 'network': {'id': 3, 'name': 'ABC', 'country': {'name': 'United States', 'code': 'US', 'timezone': 'America/New_York'}, 'officialSite': 'https://abc.com/'}, 'webChannel': None, 'dvdCountry': None, 'externals': {'tvrage': 5593, 'thetvdb': 70869, 'imdb': 'tt0313038'}, 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/442/1107419.jpg', 'original': 'https://static.tvmaze.com/uploads/images/original_untouched/442/1107419.jpg'}, 'summary': '<p><b>The Bachelor</b> is an American dating and relationship reality television series, revolvin
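The search string above is a single word. If you want to search for something with spaces or punctuation, one option is to encode it first with `quote()` from `urllib.parse` (already imported at the top of this notebook). The sketch below only builds the URL and does not send a request; the snake_case variable names are illustrative:

```python
from urllib.parse import quote

base_url = 'https://api.tvmaze.com'
show_search = '/search/shows?q='

# A query containing a space, which is not allowed as-is inside a URL
query_string = 'arrested development'

# quote() replaces unsafe characters with percent-encoded versions
search_url = base_url + show_search + quote(query_string)
print(search_url)
```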

In [42]:
# YOUR TASK
# What is the format of the results?
# How many shows are in my results?
# Print out all the names of the shows in the results
# Print out all the IDs of the show results

# YOUR CODE HERE

In [46]:
# How many shows are in the results?
len(searchResults)

10

In [47]:
# Print out all the names and IDs
for item in searchResults:
    name = item['show']['name']
    showID = item['show']['id']
    print("Name: " + name + ", ID: " + str(showID))

Name: The Bachelor, ID: 914
Name: Bachelor Pad, ID: 25345
Name: Bachelor Father, ID: 17434
Name: The Bachelor, ID: 35529
Name: Bachelor in Paradise, ID: 2401
Name: The Bachelor Live, ID: 10779
Name: The Bachelor Canada, ID: 9755
Name: The Bachelor Australia, ID: 3745
Name: Bachelor in Paradise, ID: 35580
Name: De Bachelor, ID: 33253


In [None]:
# You could now use the show IDs to get the show info!

In [50]:
# YOUR TASK: Search for a different show
# Based on the ID (or IDs) you get back, make an API request for the info TVMaze has about that show

# YOUR CODE HERE

# Search for a show and figure out its ID

queryString = 'marvel'

searchResults=requests.get(base_url + showSearch + queryString).json()
print(searchResults)

# Request info on that show based on the ID
showInfo=requests.get(base_url +"/shows/43519").json()

print(showInfo)

[{'score': 0.70088035, 'show': {'id': 43519, 'url': 'https://www.tvmaze.com/shows/43519/ms-marvel', 'name': 'Ms. Marvel', 'type': 'Scripted', 'language': 'English', 'genres': ['Comedy', 'Action', 'Science-Fiction'], 'status': 'Ended', 'runtime': None, 'averageRuntime': 46, 'premiered': '2022-06-08', 'ended': '2022-07-13', 'officialSite': 'https://disneyplus.com/series/ms-marvel/45BsikoMcOOo', 'schedule': {'time': '', 'days': ['Wednesday']}, 'rating': {'average': 6}, 'weight': 98, 'network': None, 'webChannel': {'id': 287, 'name': 'Disney+', 'country': None, 'officialSite': 'https://www.disneyplus.com/'}, 'dvdCountry': None, 'externals': {'tvrage': None, 'thetvdb': 368612, 'imdb': 'tt10857164'}, 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/405/1013952.jpg', 'original': 'https://static.tvmaze.com/uploads/images/original_untouched/405/1013952.jpg'}, 'summary': "<p><b>Ms. Marvel </b>introduces viewers to Kamala, a 16-year old Pakistani American from Jersey 

{'id': 43519, 'url': 'https://www.tvmaze.com/shows/43519/ms-marvel', 'name': 'Ms. Marvel', 'type': 'Scripted', 'language': 'English', 'genres': ['Comedy', 'Action', 'Science-Fiction'], 'status': 'Ended', 'runtime': None, 'averageRuntime': 46, 'premiered': '2022-06-08', 'ended': '2022-07-13', 'officialSite': 'https://disneyplus.com/series/ms-marvel/45BsikoMcOOo', 'schedule': {'time': '', 'days': ['Wednesday']}, 'rating': {'average': 6}, 'weight': 98, 'network': None, 'webChannel': {'id': 287, 'name': 'Disney+', 'country': None, 'officialSite': 'https://www.disneyplus.com/'}, 'dvdCountry': None, 'externals': {'tvrage': None, 'thetvdb': 368612, 'imdb': 'tt10857164'}, 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/405/1013952.jpg', 'original': 'https://static.tvmaze.com/uploads/images/original_untouched/405/1013952.jpg'}, 'summary': "<p><b>Ms. Marvel </b>introduces viewers to Kamala, a 16-year old Pakistani American from Jersey City.\xa0 An aspiring artist, a

These are just a few of the kinds of requests you can make using the TVMaze API.

What other things can you do?

**Activity:** Spend a few minutes looking at the API's documentation. Try out a different kind of query. What did you find?

### Suggestions for working with APIs

1. Spend some time figuring out how they work...read the docs!
2. The docs often have example queries. Use these to your advantage!
3. Make some sample requests. Ask yourself: How is the data structured? Do I have any nested data structures?
4. Often, you need to request preliminary information (like the show IDs above) in order to get the information you really want (the show facts above).
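For the "how is the data structured?" question, a small helper that lists keys and types can save a lot of squinting at printed output. The function below is a hypothetical sketch (`describe_structure` is not part of any library), shown on a hand-trimmed copy of the show record from earlier:

```python
def describe_structure(data, indent=0):
    """Return lines describing the keys and value types in nested JSON-like data."""
    pad = '  ' * indent
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            lines.append(f'{pad}{key!r}: {type(value).__name__}')
            # Step into nested dictionaries and lists
            if isinstance(value, (dict, list)):
                lines.extend(describe_structure(value, indent + 1))
    elif isinstance(data, list):
        lines.append(f'{pad}list of {len(data)} item(s)')
        # Describe only the first item, assuming the rest look similar
        if data and isinstance(data[0], (dict, list)):
            lines.extend(describe_structure(data[0], indent + 1))
    return lines


# A hand-trimmed copy of the show record from earlier
sample_show = {
    'id': 321,
    'name': 'Arrested Development',
    'genres': ['Comedy', 'Family'],
    'rating': {'average': 8.3},
}

structure_lines = describe_structure(sample_show)
for line in structure_lines:
    print(line)
```

The helper assumes every item in a list has the same shape as the first one, which is common for API results but not guaranteed.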

Let's explore! Pick an API to explore. You can use any that you want, but here are some you might consider:
* [Pokemon API](https://pokeapi.co/)
* [Dog API](https://dog.ceo/dog-api/) (pictures of dogs)
* [SpaceX API](https://github.com/r-spacex/SpaceX-API/)
* [COVID19 API](https://covid19api.com/)
* [NASA APIs](https://api.nasa.gov/)
* [EPA's Air Quality Index API](https://aqs.epa.gov/aqsweb/documents/data_api.html)
* [Superhero API](https://superheroapi.com/?ref=apilist.fun)
* [Open Movie Database](https://www.omdbapi.com/)
* [New York Times](https://developer.nytimes.com/?ref=apilist.fun)
* [Spoonacular Food API](https://spoonacular.com/food-api)
* [Open Library Books API](https://openlibrary.org/developers/api)


Then, answer the following questions:

1. Find the API's documentation. Spend some time reading about it -- what information does it have? How are queries structured? What kinds of different queries can you make?
2. Try to make some interesting queries. What can you do?
3. Think about how you might use this API to do a data science project. What questions might you be able to answer?
4. How would you go about storing the data that you are getting back from the API?
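For the storage question, two common starting points are a JSON file (keeps nesting intact) and a CSV file (easy to open in a spreadsheet or pandas). The sketch below is one possible approach, not the only one; `save_records` and the file names are hypothetical, and the records are copied from the search output earlier. It writes into a temporary folder so it leaves nothing behind:

```python
import csv
import json
import tempfile
from pathlib import Path

# A few name/ID pairs copied from the search results earlier
records = [
    {'name': 'The Bachelor', 'id': 914},
    {'name': 'Bachelor Pad', 'id': 25345},
    {'name': 'Bachelor Father', 'id': 17434},
]


def save_records(records, folder):
    """Write the records to shows.json and shows.csv inside the given folder."""
    folder = Path(folder)
    json_path = folder / 'shows.json'
    csv_path = folder / 'shows.csv'

    # JSON keeps the data exactly as Python sees it, nesting included
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=2)

    # CSV flattens each record into one row with a header line
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'id'])
        writer.writeheader()
        writer.writerows(records)

    return json_path, csv_path


# Write to a temporary folder, then read the files back to check them
with tempfile.TemporaryDirectory() as folder:
    json_path, csv_path = save_records(records, folder)
    reloaded = json.loads(json_path.read_text(encoding='utf-8'))
    csv_text = csv_path.read_text(encoding='utf-8')

print(reloaded == records)
print(csv_text)
```

In a real project you would pass a folder of your own instead of a temporary one, so the files stick around after the code finishes.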

_Note: Some APIs require you to first request an API key. The key identifies who is making each request, which lets the provider limit how many requests you send so you don't overload them._
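When an API does need a key, it is usually passed along as one more query parameter. The sketch below uses NASA's Astronomy Picture of the Day endpoint as an example and only builds the URL, without sending a request. Reading the key from an environment variable named `NASA_API_KEY` is an assumption made for this sketch (it keeps the key out of your notebook); `DEMO_KEY` is NASA's public key for light experimentation:

```python
import os
from urllib.parse import urlencode

# Read the key from an environment variable if one is set (the variable name is
# an assumption for this sketch); otherwise fall back to NASA's public demo key
api_key = os.environ.get('NASA_API_KEY', 'DEMO_KEY')

# The key travels as an ordinary query parameter
query = urlencode({'api_key': api_key})
apod_url = 'https://api.nasa.gov/planetary/apod?' + query

# Confirm the parameter is in place without printing the key itself
print(apod_url.startswith('https://api.nasa.gov/planetary/apod?api_key='))
```

Other APIs may expect the key under a different parameter name or in a request header, so check the documentation for the one you pick.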