# Initial takeaways from using the LC JSON API

I was trying out [Shawn Averkamp's code](https://github.com/saverkamp/loc-talk-2017/blob/master/womens-suffrage-collections-data/scripts/getLocMods.py) for collecting MODS records for items in LC digital collections and was having trouble getting results for a collection different from the one she worked with (National American Woman Suffrage Association). What was happening? 

To get a MODS record for each item in a collection, you need the item's identifier (e.g. 18012004) to construct the URL for the MODS record. It appears that not all collections have an "item_id" field holding that identifier, so let's explore.

To look at an example, here's some Python code to **request item IDs from a specific collection**. It retrieves two fields that seem to hold id info.

In [2]:
import requests

In [52]:
def get_item_ids(coll_name):
    call = requests.get("https://www.loc.gov/collections/{0}?fo=json".format(coll_name))
    data = call.json()
    results = data['results']
    for result in results:
        print(result.get("item_id"), result["id"])

First, let's see what is in those fields in the National American Woman Suffrage Association collection. 

In [53]:
collection = "national-american-woman-suffrage-association"
get_item_ids(collection)

18012004/ http://www.loc.gov/item/18012004/
33001926/ http://www.loc.gov/item/33001926/
23017479/ http://www.loc.gov/item/23017479/
15024465/ http://www.loc.gov/item/15024465/
08004839/ http://www.loc.gov/item/08004839/
08034439/ http://www.loc.gov/item/08034439/
ca26000179/ http://www.loc.gov/item/ca26000179/
28018616/ http://www.loc.gov/item/28018616/
28018623/ http://www.loc.gov/item/28018623/
28018620/ http://www.loc.gov/item/28018620/
tmp83029911/ http://www.loc.gov/item/tmp83029911/
37017721/ http://www.loc.gov/item/37017721/
24029000/ http://www.loc.gov/item/24029000/
15007485/ http://www.loc.gov/item/15007485/
29012783/ http://www.loc.gov/item/29012783/
33003563/ http://www.loc.gov/item/33003563/
15008748/ http://www.loc.gov/item/15008748/
27007548/ http://www.loc.gov/item/27007548/
86182880/ http://www.loc.gov/item/86182880/
09002749/ http://www.loc.gov/item/09002749/
09002748/ http://www.loc.gov/item/09002748/
09002744/ http://www.loc.gov/item/09002744/
04037023/ http://www.l

Now let's compare with another collection, Baseball Cards:

In [6]:
collection = "baseball-cards"
get_item_ids(collection)

None http://www.loc.gov/collections/baseball-cards/about-this-collection/
None http://www.loc.gov/item/2007678540/
None http://www.loc.gov/item/2007678541/
None http://www.loc.gov/item/2007678542/
None http://www.loc.gov/item/2007678545/
None http://www.loc.gov/item/2007678537/
None http://www.loc.gov/item/2007678538/
None http://www.loc.gov/item/2007677698/
None http://www.loc.gov/item/2007680699/
None http://www.loc.gov/item/2007677699/
None http://www.loc.gov/item/2007680750/
None http://www.loc.gov/item/2007680759/
None http://www.loc.gov/item/2007680760/
None http://www.loc.gov/item/2007680761/
None http://www.loc.gov/item/2007680762/
None http://www.loc.gov/item/2007680763/
None http://www.loc.gov/item/2007680764/
None http://www.loc.gov/item/2007680765/
None http://www.loc.gov/item/2007680766/
None http://www.loc.gov/item/2007680767/
None http://www.loc.gov/item/2007680768/
None http://www.loc.gov/item/2007680751/
None http://www.loc.gov/item/2007680769/
None http://www.loc.gov/

Looking at the results from those two collections, we can see that:
* The Woman Suffrage collection **has an "item_id" field** with the item number (followed by a trailing slash) and an "id" field with the URL for the item.
* The Baseball Cards collection **does not have an "item_id" field**, just the "id" field.

### Getting a list of collection names
What if we wanted to check which identifier fields each collection uses? To check each collection, we first have to get a list of all of the collections. 

In [44]:
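def get_collection_names(url, coll_list=None):
    # start a fresh list on each top-level call (a mutable default argument would be shared between calls)
    if coll_list is None:
        coll_list = []
    call = requests.get(url)
    data = call.json()
    results = data["results"]
    for result in results:
        coll = result.get("items")
        coll_list.append(coll)
    if data["pagination"]["next"] is not None: #make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        # recurse into the next page, passing along the list built so far
        get_collection_names(next_url, coll_list)

    return coll_list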

url = "https://www.loc.gov/collections/?fo=json"
list_of_collections = get_collection_names(url)
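
As an aside, the same page-by-page walk can be written as a loop instead of recursion. The sketch below is only an illustration: `fetch_json` and `collect_collection_urls` are hypothetical names, the default fetcher uses the standard library rather than `requests`, and the two "pages" in the demonstration are made-up stand-ins shaped like the `results` and `pagination` fields used above.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher (hypothetical helper): GET the URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def collect_collection_urls(start_url, fetch=fetch_json):
    """Follow pagination["next"] in a loop and gather each result's "items" URL."""
    coll_urls = []
    url = start_url
    while url:  # stops when "next" is None or missing
        data = fetch(url)
        for result in data.get("results", []):
            coll_urls.append(result.get("items"))
        url = data.get("pagination", {}).get("next")
    return coll_urls


# Offline demonstration with two made-up pages shaped like the API response.
fake_pages = {
    "page1": {
        "results": [{"items": "https://www.loc.gov/collections/aaron-copland/"}],
        "pagination": {"next": "page2"},
    },
    "page2": {
        "results": [{"items": "http://memory.loc.gov/ammem/aap/aaphome.html"}],
        "pagination": {"next": None},
    },
}
print(collect_collection_urls("page1", fetch=fake_pages.get))
```

Passing the fetcher in as an argument makes it possible to try the paging logic without touching the network; against the real API you would call it with the same start URL as above.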

How many collection URLs did we end up with?

In [33]:
len(list_of_collections)

307

Let's take a look at the list of collection URLs we ended up with:

In [36]:
list_of_collections

['https://www.loc.gov/collections/aaron-copland/',
 'https://www.loc.gov/collections/abdul-hamid-ii/',
 'https://www.loc.gov/collections/abraham-lincoln-papers/',
 'https://www.loc.gov/collections/afghanistan-web-archive/',
 'http://memory.loc.gov/ammem/aap/aaphome.html',
 'https://www.loc.gov/collections/african-american-photographs-1900-paris-exposition/',
 'https://www.loc.gov/collections/african-american-band-music/',
 'http://memory.loc.gov/ammem/afcphhtml/afcphhome.html',
 'https://www.loc.gov/collections/alan-lomax-manuscripts/',
 'https://www.loc.gov/collections/alan-lomax-in-michigan/',
 'https://www.loc.gov/collections/albert-schatz/',
 'https://www.loc.gov/collections/alexander-graham-bell-papers/',
 'https://www.loc.gov/collections/alexander-hamilton-papers/',
 'https://www.loc.gov/collections/alexander-hamilton-stephens-papers/',
 'https://www.loc.gov/collections/alfred-whital-stern-lincolniana/',
 'https://www.loc.gov/collections/amazing-grace/',
 'https://www.loc.gov/col

Wait, not all the collection URLs look the same! That's because (I think) not all of the collections are available via the loc.gov JSON API. We need to limit our selection to collections that are actually available via the API. Collections whose URLs begin with the following hosts are not available via the API:
* memory.loc.gov
* international.loc.gov
* lcweb2.loc.gov
* chroniclingamerica.loc.gov (Chronicling America has its own API, which is the best place to go for its data.)

Also, collections that lack "collections" in their path are not queryable, for example:
* www.loc.gov/vets
* www.loc.gov/jukebox


So let's only look at collections that have URLs that work for further API queries. 


In [74]:
def get_usable_collections(list_of_collections):
    usable_collections = []
    for collection in list_of_collections:
        if "www.loc.gov/collections/" in collection:
            usable_collections.append(collection)
    return usable_collections
            
usable_collections = get_usable_collections(list_of_collections)
usable_collections

['https://www.loc.gov/collections/aaron-copland/',
 'https://www.loc.gov/collections/abdul-hamid-ii/',
 'https://www.loc.gov/collections/abraham-lincoln-papers/',
 'https://www.loc.gov/collections/afghanistan-web-archive/',
 'https://www.loc.gov/collections/african-american-photographs-1900-paris-exposition/',
 'https://www.loc.gov/collections/african-american-band-music/',
 'https://www.loc.gov/collections/alan-lomax-manuscripts/',
 'https://www.loc.gov/collections/alan-lomax-in-michigan/',
 'https://www.loc.gov/collections/albert-schatz/',
 'https://www.loc.gov/collections/alexander-graham-bell-papers/',
 'https://www.loc.gov/collections/alexander-hamilton-papers/',
 'https://www.loc.gov/collections/alexander-hamilton-stephens-papers/',
 'https://www.loc.gov/collections/alfred-whital-stern-lincolniana/',
 'https://www.loc.gov/collections/amazing-grace/',
 'https://www.loc.gov/collections/america-at-work-and-leisure-1894-to-1915/',
 'https://www.loc.gov/collections/nineteenth-century-

How many collections are now in the list?

In [79]:
len(usable_collections)

290

Not sure if we'll need this, but here's a new function that gets just the "slug" (the hyphenated name of the collection in the URL) for those collections that can be queried via the API.

In [71]:
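def get_collection_slugs(url, coll_list=None):
    # start a fresh list on each top-level call (a mutable default argument would be shared between calls)
    if coll_list is None:
        coll_list = []
    call = requests.get(url)
    data = call.json()
    results = data["results"]
    for result in results:
        coll = result.get("items")
        # skip records without an "items" URL and collections outside www.loc.gov/collections/
        if coll and "www.loc.gov/collections/" in coll:
            slug = coll.split("/")[-2] # grab the collection slug at the end of the URL, before the trailing slash
            coll_list.append(slug)
    if data["pagination"]["next"] is not None: #make sure we haven't hit the end of the pages
        next_url = data["pagination"]["next"]
        # recurse into the next page, passing along the list built so far
        get_collection_slugs(next_url, coll_list)

    return coll_list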

url = "https://www.loc.gov/collections/?fo=json"
collection_slugs = get_collection_slugs(url)

collection_slugs

['aaron-copland',
 'abdul-hamid-ii',
 'abraham-lincoln-papers',
 'afghanistan-web-archive',
 'african-american-photographs-1900-paris-exposition',
 'african-american-band-music',
 'alan-lomax-manuscripts',
 'alan-lomax-in-michigan',
 'albert-schatz',
 'alexander-graham-bell-papers',
 'alexander-hamilton-papers',
 'alexander-hamilton-stephens-papers',
 'alfred-whital-stern-lincolniana',
 'amazing-grace',
 'america-at-work-and-leisure-1894-to-1915',
 'nineteenth-century-song-sheets',
 'dance-instruction-manuals-from-1490-to-1920',
 'american-choral-music',
 'american-colony-in-jerusalem',
 'american-english-dialect-recordings-from-the-center-for-applied-linguistics',
 'world-war-i-and-1920-election-recordings',
 'federal-writers-project',
 'travels-in-america-1750-to-1920',
 'american-revolutionary-war-maps',
 'andrew-jackson-papers',
 'anna-maria-brodeau-thornton-papers',
 'ansel-adams-manzanar',
 'architecture-design-and-engineering-drawings',
 'archive-of-hispanic-literature-on-tape',

## Back to exploring fields with ids

Let's get back to my original goal of understanding whether "id" or "item_id" is reliable for making item queries for specific collections. I'll check a record from each collection to see which ID fields are in use in that collection. 

In [76]:
def check_id_fields(collections):
    collection_ids = []
    for collection in collections: 
        url = collection + "?fo=json"
        call = requests.get(url)
        data = call.json()
        # the first result is always(?) the collection-level record, so look at the second record. 
        item = data["results"][1]
        # get the slug for saving for analysis
        slug = collection.split("/")[-2]
        # create a dictionary for each collection and its id fields
        entry = {"collection": slug, "item_id": item.get("item_id"), "id": item.get("id")}
        collection_ids.append(entry)
   
    return collection_ids

id_fields_per_coll = check_id_fields(usable_collections)

In [77]:
id_fields_per_coll

[{'collection': 'aaron-copland',
  'id': 'http://www.loc.gov/item/copland.sket0019/',
  'item_id': 'copland.sket0019/'},
 {'collection': 'abdul-hamid-ii',
  'id': 'http://www.loc.gov/item/2003673365/',
  'item_id': None},
 {'collection': 'abraham-lincoln-papers',
  'id': 'http://www.loc.gov/collections/abraham-lincoln-papers/about-this-collection/related-resources/',
  'item_id': None},
 {'collection': 'afghanistan-web-archive',
  'id': 'http://www.loc.gov/item/lcwaN0003131/',
  'item_id': None},
 {'collection': 'african-american-photographs-1900-paris-exposition',
  'id': 'http://www.loc.gov/item/98504044/',
  'item_id': None},
 {'collection': 'african-american-band-music',
  'id': 'http://www.loc.gov/item/ihas.100010758/',
  'item_id': None},
 {'collection': 'alan-lomax-manuscripts',
  'id': 'http://www.loc.gov/item/afc2004004.ms010101/',
  'item_id': 'afc2004004.ms010101/'},
 {'collection': 'alan-lomax-in-michigan',
  'id': 'http://www.loc.gov/item/afc1939007_afs02237b/',
  'item_id
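
Note that the second record is not always an item: for abraham-lincoln-papers the output above shows an about-this-collection page instead. One possible way around that, sketched here with a hypothetical helper `first_item_record` and made-up sample records, is to pick the first result whose "id" contains "/item/" rather than assuming a fixed position.

```python
def first_item_record(results):
    """Return the first result whose "id" looks like an item URL, or None if there is none."""
    for result in results:
        record_url = result.get("id") or ""
        if "/item/" in record_url:
            return result
    return None


# Made-up records for illustration only; the item number is not a real identifier.
sample_results = [
    {"id": "http://www.loc.gov/collections/example-collection/about-this-collection/"},
    {"id": "http://www.loc.gov/collections/example-collection/about-this-collection/related-resources/"},
    {"id": "http://www.loc.gov/item/0000000000/"},
]
print(first_item_record(sample_results)["id"])
```

In `check_id_fields`, something like this could replace `data["results"][1]`, with the caveat that a collection whose first page holds no item records would return `None` and need separate handling.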

Let's use pandas to take a quick look at how many times each field is used. 

In [61]:
import pandas as pd

In [78]:
collections_df = pd.DataFrame(id_fields_per_coll)
collections_df.count()

collection    290
id            290
item_id        76
dtype: int64

OK, in this sample (one record per collection) every record has an "id" field, which is usually the URL of the item, while only 76 of the 290 have an "item_id". So going forward, we should use the "id" field and not rely on "item_id".

If we want to construct URLs that go to MODS records for a collection, then we need to grab the ID off the end of the URL in the "id" field.
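
A minimal sketch of that step follows. The helper names are hypothetical, and the MODS URL pattern (an LCCN permalink with `/mods` appended) is an assumption: it likely only applies to identifiers that are LCCNs, not to ones like `copland.sket0019`.

```python
def item_id_from_url(item_url):
    """Return the last path segment of an item URL, without the trailing slash."""
    return item_url.rstrip("/").split("/")[-1]


# Assumed pattern for MODS records; not confirmed for every collection.
MODS_URL_TEMPLATE = "https://lccn.loc.gov/{0}/mods"


def mods_url_for(item_url):
    """Build a candidate MODS URL from the value of a result's "id" field."""
    return MODS_URL_TEMPLATE.format(item_id_from_url(item_url))


print(item_id_from_url("http://www.loc.gov/item/18012004/"))
print(mods_url_for("http://www.loc.gov/item/18012004/"))
```

Stripping the trailing slash first also covers the "item_id" values seen earlier, which end in a slash (e.g. `18012004/`).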

## Takeaways: 
* If someone wants to use the API to get MODS records for items from a collection, they're going to need the identifiers of the items in a collection. Ideally:
  * All of the collections would be available via the API. I understand that's a work in progress. So, in the meantime,
  * Indicate whether a collection is available via the API (a separate field?) or don't include it at all.
  * Add item_id to all of the available collections
  * Have a field with the slug for collection name (maybe)

