# Workbook: Introduction to APIs and Accessing loc.gov

This workbook provides prompts for working with the loc.gov API. 
For reference, also see: 

* Laura Wrubel's [documentation of the API](https://libraryofcongress.github.io/data-exploration/)
* The Library's new [explanation of the API](https://www.loc.gov/apis/json-and-yaml/)
* The Library's [Data Exploration notebooks series](https://github.com/LibraryOfCongress/data-exploration)

## Setup
More advanced work may require additional modules, but the
basic actions of interacting with an API can be accomplished
using the `requests` module.
Related work may benefit from Python's `json` and `csv` libraries,
as well as the `os`, `os.path`, and `glob` libraries,
which support working with local files.
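
As a rough sketch of how those standard-library modules can fit together, the following writes a small JSON file, finds it again with `glob`, and summarizes it in a CSV. The helper name `summarize_json_files`, the throwaway folder, and the field names (`id`, `title`) are made up for illustration and are not part of the loc.gov API.

```python
import csv
import glob
import json
import os
import tempfile


def summarize_json_files(folder, csv_path, fields):
    """Write one CSV row per JSON file found in folder; return the row count."""
    paths = sorted(glob.glob(os.path.join(folder, '*.json')))
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        for path in paths:
            with open(path, encoding='utf-8') as f:
                record = json.load(f)
            # Fields missing from a record become empty cells
            writer.writerow({field: record.get(field, '') for field in fields})
    return len(paths)


# Hypothetical demo data in a temporary folder
demo_folder = tempfile.mkdtemp()
with open(os.path.join(demo_folder, 'example.json'), 'w', encoding='utf-8') as f:
    json.dump({'id': 'example', 'title': 'A made-up record'}, f)

demo_csv = os.path.join(demo_folder, 'summary.csv')
row_count = summarize_json_files(demo_folder, demo_csv, ['id', 'title'])
print(row_count)
```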

In [1]:
# write your import statements
import requests

## Making an API call

Use `requests` to retrieve some data from an API endpoint. In this case, we can use the Library of Congress search function, which is a REST API that responds to HTTP requests.

The documentation for requests can be found here: http://docs.python-requests.org/en/master/

There are multiple loc.gov endpoints, each giving access to a different aspect of the collections.
Consult the documentation here: https://www.loc.gov/apis/json-and-yaml/requests/.

The endpoint for the search query is http://www.loc.gov/search/

The response can be modified by supplying various parameters:
* To request the json format, use the `fo=json` parameter. 
* To provide a search query, use the `q` parameter. For example, how would you search for images of kittens?
* To specify images, use the `fa=online-format:image` parameter.

In [2]:
searchendpoint = 'http://www.loc.gov/search/'

parameters = {
    'fo' : 'json',
    'q'  : 'kittens',
    'fa' : 'online-format:image'
}

In [3]:
r = requests.get(searchendpoint, params=parameters)

In [4]:
r.url

'https://www.loc.gov/search/?fa=online-format%3Aimage&fo=json&q=kittens'
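
To see where that query string comes from, here is a minimal sketch that builds a similar URL with the standard library's `urllib.parse.urlencode`, without sending a request. The helper name `build_search_url` and the `demo_` variables are made up for this example. The parameter order in the result follows the dictionary order, which may differ from the order reported in `r.url`.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, params):
    """Append URL-encoded query parameters to an endpoint."""
    return endpoint + '?' + urlencode(params)


demo_parameters = {
    'fo': 'json',
    'q': 'kittens',
    'fa': 'online-format:image',
}
demo_url = build_search_url('https://www.loc.gov/search/', demo_parameters)
print(demo_url)
```

Note that the colon in `online-format:image` is percent-encoded as `%3A`, just as in the URL shown above.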

You can explore the response headers with `r.headers`, which behaves like a dictionary. You can list its keys using the `keys()` method:

In [5]:
for key in r.headers.keys():
    print(key)

Date
Content-Type
Content-Length
Connection
X-Robots-Tag
X-Frame-Options
Access-Control-Allow-Origin
Referrer-Policy
Strict-Transport-Security
X-Content-Type-Options
ETag
Expires
Content-Security-Policy
X-Grace
X-Nearside-Cache
X-Nearside-Cache-Hits
Cache-Control
CF-Cache-Status
Age
Accept-Ranges
Vary
Server
CF-RAY
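
Header names such as `Content-Type` can help confirm what kind of body came back before you try to parse it. A minimal sketch, assuming the server labels JSON responses with a media type that contains `json`; the helper name `looks_like_json` and the sample header values are invented for illustration, and a plain dictionary stands in for `r.headers`.

```python
def looks_like_json(headers):
    """Return True if the Content-Type header mentions JSON."""
    # Plain dicts are case-sensitive, so normalize the header names first.
    # (The headers object returned by requests already ignores case.)
    normalized = {name.lower(): value for name, value in headers.items()}
    content_type = normalized.get('content-type', '')
    return 'json' in content_type.lower()


sample_headers = {'Content-Type': 'application/json', 'Server': 'example'}
print(looks_like_json(sample_headers))
```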


The response body is a JSON object. You can use the built-in `json()` method of the `requests` response to parse it:

In [6]:
r.json().keys()

dict_keys(['breadcrumbs', 'expert_resources', 'facet_trail', 'facet_views', 'facets', 'form_facets', 'options', 'pagination', 'results', 'search', 'timestamp', 'views'])

### What is the pagination info?

In [7]:
r.json()['pagination'].keys()

dict_keys(['current', 'first', 'from', 'last', 'next', 'of', 'page_list', 'perpage', 'perpage_options', 'previous', 'results', 'to', 'total'])

In [8]:
r.json()['pagination']['total']

6802

In theory, you could use this value to define a `range()` and loop over the pages to request the complete list of kitten image results.
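
One way to sketch such a loop without guessing page numbers is to follow the `next` key listed in the pagination info. This assumes `next` holds the URL of the following page and is empty or missing on the last page, which is worth verifying against a live response. `fetch_page` is a hypothetical callable that returns the parsed JSON for a URL, and `max_pages` is a safety cap added here so a trial run stays small.

```python
def collect_results(first_url, fetch_page, max_pages=5):
    """Follow pagination 'next' links, gathering results from each page."""
    collected = []
    url = first_url
    pages_seen = 0
    while url and pages_seen < max_pages:
        page = fetch_page(url)
        collected.extend(page.get('results', []))
        # Assumption: 'next' is a URL, or None/absent on the last page
        url = page.get('pagination', {}).get('next')
        pages_seen += 1
    return collected


# Stand-in data so the sketch runs without network access
fake_pages = {
    'page1': {'results': ['a', 'b'], 'pagination': {'next': 'page2'}},
    'page2': {'results': ['c'], 'pagination': {'next': None}},
}
all_results = collect_results('page1', fake_pages.get)
print(len(all_results))
```

With a live API, `fetch_page` could be a small function that calls `requests.get(url)` and returns the parsed JSON; pausing between requests keeps the load on the server low.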

### Where are the data about the items? 

Hint: what are the `results`?

In [9]:
len(r.json()['results'])

25

This makes sense: the pagination is set to 25 per page, so `results` most likely holds the items, in other words the things we want to look at.

In [10]:
r.json()['results'][0].keys()

dict_keys(['access_restricted', 'aka', 'campaigns', 'contributor', 'date', 'dates', 'description', 'digitized', 'extract_timestamp', 'group', 'hassegments', 'id', 'image_url', 'index', 'item', 'language', 'location', 'location_country', 'mime_type', 'number', 'number_former_id', 'number_lccn', 'number_source_modified', 'online_format', 'original_format', 'other_title', 'partof', 'related', 'reproductions', 'resources', 'shelf_id', 'site', 'subject', 'timestamp', 'title', 'type', 'unrestricted', 'url'])

### How can we find the URL to the images?

Start by isolating the item information:

In [11]:
kittenList = r.json()['results']

print(len(kittenList))

25


To see what keys are in the result listings, you can use the `keys()` method on a single result. Here, use the index `0` to look at the first result:

In [12]:
for key in kittenList[0].keys():
    print(key)

access_restricted
aka
campaigns
contributor
date
dates
description
digitized
extract_timestamp
group
hassegments
id
image_url
index
item
language
location
location_country
mime_type
number
number_former_id
number_lccn
number_source_modified
online_format
original_format
other_title
partof
related
reproductions
resources
shelf_id
site
subject
timestamp
title
type
unrestricted
url
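
Of the keys listed above, `image_url` looks like the most direct pointer to an image file. The sketch below pulls it out defensively, under the assumption (not confirmed in this workbook) that the value may be a list of URLs, a single string, or missing. The helper name `first_image_url` and the sample records are made up.

```python
def first_image_url(result):
    """Return the first image URL in a result, or None if there is none."""
    value = result.get('image_url')
    if isinstance(value, str):
        return value or None
    if isinstance(value, (list, tuple)) and value:
        return value[0]
    return None


# Invented records that mimic the three assumed shapes
sample_results = [
    {'title': 'one', 'image_url': ['https://example.org/a.jpg', 'https://example.org/b.jpg']},
    {'title': 'two', 'image_url': []},
    {'title': 'three'},
]
image_urls = [first_image_url(result) for result in sample_results]
print(image_urls)
```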


What is the `url` value?

In [13]:
for kitten in kittenList: 
    print(kitten['url'], kitten['url'].strip().split('/')[-2])

https://www.loc.gov/item/2016892679/ 2016892679
https://www.loc.gov/item/2017650796/ 2017650796
https://www.loc.gov/item/2013646722/ 2013646722
https://www.loc.gov/item/jukebox-668708/ jukebox-668708
https://www.loc.gov/item/2016780779/ 2016780779
https://www.loc.gov/item/2016770792/ 2016770792
https://www.loc.gov/item/2022653071/ 2022653071
https://www.loc.gov/item/2016796464/ 2016796464
https://www.loc.gov/item/2016816441/ 2016816441
https://www.loc.gov/item/2016817090/ 2016817090
https://www.loc.gov/item/20002503/ 20002503
https://www.loc.gov/item/2002697127/ 2002697127
https://www.loc.gov/item/2005681032/ 2005681032
https://www.loc.gov/item/2022652300/ 2022652300
https://www.loc.gov/item/2002697126/ 2002697126
https://www.loc.gov/item/2014717546/ 2014717546
https://www.loc.gov/item/2008660988/ 2008660988
https://www.loc.gov/item/jukebox-61618/ jukebox-61618
https://www.loc.gov/item/89708607/ 89708607
https://www.loc.gov/item/2002706499/ 2002706499
https://www.loc.gov/item/202265388
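
The `split('/')[-2]` expression relies on every URL ending with a slash. A slightly more forgiving sketch, using the standard library's `urlparse`, takes the last non-empty path segment instead. The helper name `item_id_from_url` is made up for this example.

```python
from urllib.parse import urlparse


def item_id_from_url(url):
    """Return the last non-empty path segment of an item URL."""
    path = urlparse(url.strip()).path
    segments = [segment for segment in path.split('/') if segment]
    return segments[-1] if segments else None


# Works with or without a trailing slash
print(item_id_from_url('https://www.loc.gov/item/2016892679/'))
print(item_id_from_url('https://www.loc.gov/item/jukebox-668708'))
```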

### Can you save the item metadata to a local file?

This will require the `json` module:

In [14]:
import json

In [15]:
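for kitten in kittenList:
    # Request the full item record as JSON
    r = requests.get(kitten['url'], params={'fo': 'json'})
    print(r.url)
    # The item ID is the last path segment before the trailing slash
    item_id = kitten['url'].strip().split('/')[-2]
    # Save only the 'item' portion of the response
    with open(item_id + '.json', 'w', encoding='utf-8') as f:
        json.dump(r.json()['item'], f, indent=2)
    print('wrote', item_id)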

https://www.loc.gov/item/2016892679/?fo=json
wrote 2016892679
https://www.loc.gov/item/2017650796/?fo=json
wrote 2017650796
https://www.loc.gov/item/2013646722/?fo=json
wrote 2013646722
https://www.loc.gov/item/jukebox-668708/?fo=json
wrote jukebox-668708
https://www.loc.gov/item/2016780779/?fo=json
wrote 2016780779
https://www.loc.gov/item/2016770792/?fo=json
wrote 2016770792
https://www.loc.gov/item/2022653071/?fo=json
wrote 2022653071
https://www.loc.gov/item/2016796464/?fo=json
wrote 2016796464
https://www.loc.gov/item/2016816441/?fo=json
wrote 2016816441
https://www.loc.gov/item/2016817090/?fo=json
wrote 2016817090
https://www.loc.gov/item/20002503/?fo=json
wrote 20002503
https://www.loc.gov/item/2002697127/?fo=json
wrote 2002697127
https://www.loc.gov/item/2005681032/?fo=json
wrote 2005681032
https://www.loc.gov/item/2022652300/?fo=json
wrote 2022652300
https://www.loc.gov/item/2002697126/?fo=json
wrote 2002697126
https://www.loc.gov/item/2014717546/?fo=json
wrote 2014717546
http