## Smithsonian OpenAccess Collection Data API

Let's use `requests` to retrieve some data from an API endpoint. In this case, we can use the Smithsonian's [Open Access API](https://edan.si.edu/openaccess/apidocs/#api-_), a REST API that responds to HTTP requests. See the documentation at [https://edan.si.edu/openaccess/apidocs/#api-_footer](https://edan.si.edu/openaccess/apidocs/#api-_footer).

The documentation for `requests` can be found here: https://requests.readthedocs.io/

The endpoint of the "content" API, which provides information about
individual items, is `https://api.si.edu/openaccess/api/v1.0/content/:id`.

To use the Smithsonian APIs, you will need an API key from the data.gov
API key generator. Register with [https://api.data.gov/signup/](https://api.data.gov/signup/) to get a key.

In [18]:
import requests

In [25]:
statsEndpoint = 'https://api.si.edu/openaccess/api/v1.0/stats'

In [3]:
API_Key = 'S26CqhCprwb819ULBJQG62Le5ySrxuCV5L3Ktiov'
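
A key typed directly into a notebook is easy to share by accident. As a minimal sketch of one alternative, the key could be read from an environment variable instead. The variable name `SI_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this example; they are not part of the Smithsonian or data.gov tooling.

```python
import os


def load_api_key(env_var="SI_API_KEY"):
    """Return the API key stored in the given environment variable.

    SI_API_KEY is a hypothetical variable name; use whatever name you
    exported the key under in your own shell.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return api_key
```

With the variable set, `API_Key = load_api_key()` could stand in for the literal assignment.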

The content API fetches metadata about objects in the Smithsonian's
collections using the ID or URL of the object.

For example, to get information about an album in
the Folkways Records Collection, we will use the ID `edanmdm:siris_arc_231998`.

To pass in the parameters, we can use a dictionary! Let's try the `params` argument of `requests.get`.

In [31]:
key = {
    'api_key': API_Key
}
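
To see what a parameter dictionary turns into, the sketch below builds the query string by hand with the standard library's `urlencode`. This only approximates the encoding step that `requests` performs when it receives `params`; no request is sent. The value `YOUR_API_KEY` is a placeholder, and `stats_endpoint` and `example_params` are names used only in this example.

```python
from urllib.parse import urlencode

stats_endpoint = "https://api.si.edu/openaccess/api/v1.0/stats"
example_params = {"api_key": "YOUR_API_KEY"}  # placeholder, not a real key

# urlencode turns the dictionary into "name=value" pairs joined by "&"
query_string = urlencode(example_params)
full_url = stats_endpoint + "?" + query_string
print(full_url)
```

Everything after the `?` comes from the dictionary, so adding another entry to it adds another `name=value` pair to the URL.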

First, let's try a basic call to the stats API, to see if things are working:

In [32]:
r = requests.get(statsEndpoint, params = key)

print('You requested:',r.url)
print('HTTP server response code:',r.status_code)
print('HTTP response headers',r.headers)

# notice that the headers attribute is a dictionary-like object, too?
# We could ask what sort of content it's returning:

print('\nYour request has this content type:\n',r.headers['content-type'])

You requested: https://api.si.edu/openaccess/api/v1.0/stats?api_key=S26CqhCprwb819ULBJQG62Le5ySrxuCV5L3Ktiov
HTTP server response code: 200
HTTP response headers {'Date': 'Sat, 15 Jan 2022 00:46:49 GMT', 'Content-Type': 'application/json;charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-RateLimit-Limit': '1000', 'X-RateLimit-Remaining': '989', 'Access-Control-Allow-Origin': 'https://edan.si.edu', 'Age': '2', 'Via': 'https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'Strict-Transport-Security': 'max-age=31536000; preload', 'Content-Encoding': 'gzip'}

Your request has this content type:
 application/json;charset=utf-8
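
The headers printed above also include `X-RateLimit-Limit` and `X-RateLimit-Remaining`. A small sketch of reading them as numbers follows; the helper name `rate_limit_status` is made up for this example, and the sample values are copied from the output above rather than fetched live.

```python
def rate_limit_status(headers):
    """Return (limit, remaining) as integers, or None if the headers are absent."""
    # Plain dicts are case-sensitive, so normalize the header names first.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return int(lowered["x-ratelimit-limit"]), int(lowered["x-ratelimit-remaining"])
    except KeyError:
        return None


# Values copied from the response headers printed above
sample_headers = {"X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "989"}
print(rate_limit_status(sample_headers))
```

Passing `r.headers` instead of `sample_headers` should work the same way, since it also provides `.items()`.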


So the request has returned JSON! Access the body of the response as a string using the `.text` attribute.

In [33]:
r.text[:500]

'{\n  "status": 200,\n  "responseCode": 1,\n  "response": {\n    "time": "2022-01",\n    "units": [\n      {\n        "total_objects": 4230673,\n        "metrics": {\n          "CC0_records": 4230673,\n          "CC0_records_with_CC0_media": 3065372\n        },\n        "unit": "NMNHBOTANY"\n      },\n      {\n        "total_objects": 1960541,\n        "metrics": {\n          "CC0_records": 1960541,\n          "CC0_records_with_CC0_media": 39645\n        },\n        "unit": "NMNHINV"\n      },\n      {\n        "total_'

In [32]:
type(r.text)

str
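
Because `r.text` is a plain string, it has to be parsed before the values inside it can be used. The sketch below parses a trimmed copy of the stats output shown above (only the first two units, with the `metrics` blocks left out) and lists the units by size. `sample_text` is a stand-in for `r.text`.

```python
import json

# First two units from the stats response shown above, trimmed for brevity
sample_text = '''
{
  "status": 200,
  "response": {
    "time": "2022-01",
    "units": [
      {"total_objects": 4230673, "unit": "NMNHBOTANY"},
      {"total_objects": 1960541, "unit": "NMNHINV"}
    ]
  }
}
'''

stats = json.loads(sample_text)  # str -> dict
units = stats["response"]["units"]

# List the units from the largest collection to the smallest
for unit in sorted(units, key=lambda u: u["total_objects"], reverse=True):
    print(unit["unit"], unit["total_objects"])
```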

#### API Call question

We want to make a request to the Smithsonian API. Can you fill in the following & explain the missing elements? 

```
https://api.si.edu/openaccess/api/v1.0/content/:_____
```

What other items might you use after the `?`...

## Object information

Now, let's try using the "content" API to get information about individual objects:

In [38]:
contentEndpoint = 'https://api.si.edu/openaccess/api/v1.0/content/'

#object_id = 'edanmdm:nmah_852778' # Alexander Graham Bell's 1885 Mary Had a Little Lamb from Volta Labs
object_id = 'edanmdm:siris_arc_231998' # Smithsonian Folkways Music of Hungary

parameters = {
    'api_key' : API_Key
}

In [45]:
requestURL = contentEndpoint + object_id

r = requests.get(requestURL, params = parameters)

print('You requested:',r.url)
print('HTTP server response code:',r.status_code)
print('HTTP response headers',r.headers)

# notice that the headers attribute is a dictionary-like object, too?
# We could ask what sort of content it's returning:

print('\nYour request has this content type:\n',r.headers['content-type'])

You requested: https://api.si.edu/openaccess/api/v1.0/content/edanmdm:siris_arc_231998?api_key=S26CqhCprwb819ULBJQG62Le5ySrxuCV5L3Ktiov
HTTP server response code: 200
HTTP response headers {'Date': 'Sat, 15 Jan 2022 00:56:33 GMT', 'Content-Type': 'application/json;charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'X-RateLimit-Limit': '1000', 'X-RateLimit-Remaining': '992', 'Access-Control-Allow-Origin': 'https://edan.si.edu', 'Age': '0', 'Via': 'https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'Strict-Transport-Security': 'max-age=31536000; preload', 'Content-Encoding': 'gzip'}

Your request has this content type:
 application/json;charset=utf-8


Take a look at the response information:

In [46]:
r.text[:500]

'{\n  "status": 200,\n  "responseCode": 1,\n  "response": {\n    "id": "edanmdm-siris_arc_231998",\n    "title": "Folk music of Hungary [sound recording] / recorded in Hungary under the supervision of Bela Bartok",\n    "unitCode": "CFCHFOLKLIFE",\n    "type": "edanmdm",\n    "url": "edanmdm:siris_arc_231998",\n    "content": {\n      "descriptiveNonRepeating": {\n        "record_ID": "siris_arc_231998",\n        "unit_code": "CFCHFOLKLIFE",\n        "title_sort": "FOLK MUSIC OF HUNGARY SOUND RECORDING RECORD'

Use the built-in `.json()` decoder in requests to parse the response into a Python dictionary:

In [64]:
object_json = r.json()

for element in object_json['response']:
    print(element)

id
title
unitCode
type
url
content
hash
docSignature
timestamp
lastTimeUpdated
version


In [67]:
# `item` rather than `object`, which would shadow a Python built-in
item = object_json['response']

for k, v in item.items():
    print(k,':',v)

id : edanmdm-siris_arc_231998
title : Folk music of Hungary [sound recording] / recorded in Hungary under the supervision of Bela Bartok
unitCode : CFCHFOLKLIFE
type : edanmdm
url : edanmdm:siris_arc_231998
content : {'descriptiveNonRepeating': {'record_ID': 'siris_arc_231998', 'unit_code': 'CFCHFOLKLIFE', 'title_sort': 'FOLK MUSIC OF HUNGARY SOUND RECORDING RECORDED IN HUNGARY UNDER THE SUPERVISION OF BELA BARTOK', 'title': {'label': 'Title', 'content': 'Folk music of Hungary [sound recording] / recorded in Hungary under the supervision of Bela Bartok'}, 'metadata_usage': {'access': 'CC0'}, 'data_source': 'Ralph Rinzler Folklife Archives and Collections'}, 'indexedStructured': {'date': ['1950s'], 'object_type': ['Archival materials', 'Phonograph records'], 'culture': ['Hungarians'], 'name': [{'type': 'personal_main', 'content': 'Bartók, Béla'}, {'type': 'personal_main', 'content': 'Cowell, Henry'}, {'type': 'personal_main', 'content': 'Bartok, Peter'}, {'type': 'personal_main', 'conte
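
The `content` value is itself a nested dictionary, so individual fields can be reached by chaining keys. The sketch below uses a trimmed excerpt of the output above (the printed value is cut off, so only the names visible there are included) to pull out the list of names.

```python
# Trimmed excerpt of the 'content' value printed above
content = {
    "indexedStructured": {
        "date": ["1950s"],
        "culture": ["Hungarians"],
        "name": [
            {"type": "personal_main", "content": "Bartók, Béla"},
            {"type": "personal_main", "content": "Cowell, Henry"},
            {"type": "personal_main", "content": "Bartok, Peter"},
        ],
    }
}

# .get() with a default avoids a KeyError when a record lacks the field
names = [entry["content"] for entry in content["indexedStructured"].get("name", [])]
print(names)
```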

#### Resources

* [Real Python working with JSON data](https://realpython.com/python-json/)
* [Python json module documentation](https://docs.python.org/3/library/json.html)

### Parsing the Data from the API using json module

Now that we can get the response, let's parse it and save it to a file. To do this, use the `json` module.

In [68]:
import json

In [69]:
data = json.loads(r.text)

# what are the keys?
for element in data:
    print(element)

status
responseCode
response
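
Going the other way, `json.dumps` turns a dictionary back into a string, and its `indent` argument makes nested data easier to read. A short sketch, using a cut-down stand-in for `data` with only the top-level fields listed above:

```python
import json

# Top-level fields from the response above; the nested "response" is trimmed
sample = {"status": 200, "responseCode": 1, "response": {"id": "edanmdm-siris_arc_231998"}}

pretty = json.dumps(sample, indent=2)
print(pretty)
```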


In [71]:
for key, val in data['response'].items():
    print(key,':',val)

id : edanmdm-siris_arc_231998
title : Folk music of Hungary [sound recording] / recorded in Hungary under the supervision of Bela Bartok
unitCode : CFCHFOLKLIFE
type : edanmdm
url : edanmdm:siris_arc_231998
content : {'descriptiveNonRepeating': {'record_ID': 'siris_arc_231998', 'unit_code': 'CFCHFOLKLIFE', 'title_sort': 'FOLK MUSIC OF HUNGARY SOUND RECORDING RECORDED IN HUNGARY UNDER THE SUPERVISION OF BELA BARTOK', 'title': {'label': 'Title', 'content': 'Folk music of Hungary [sound recording] / recorded in Hungary under the supervision of Bela Bartok'}, 'metadata_usage': {'access': 'CC0'}, 'data_source': 'Ralph Rinzler Folklife Archives and Collections'}, 'indexedStructured': {'date': ['1950s'], 'object_type': ['Archival materials', 'Phonograph records'], 'culture': ['Hungarians'], 'name': [{'type': 'personal_main', 'content': 'Bartók, Béla'}, {'type': 'personal_main', 'content': 'Cowell, Henry'}, {'type': 'personal_main', 'content': 'Bartok, Peter'}, {'type': 'personal_main', 'conte

In [72]:
print(len(data['response']))

11


In [73]:
object_id = data['response']['id']

print(object_id)

edanmdm-siris_arc_231998

Compare to the online display.
See https://collections.si.edu/search/detail/edanmdm:siris_arc_231998
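
Note that the `id` field in the response uses a hyphen (`edanmdm-siris_arc_231998`), while the `url` field and the web address above use a colon. A minimal sketch of building the web address from a record, assuming other records follow the same pattern as this one; `DETAIL_BASE` and `detail_page_url` are names invented for this example.

```python
DETAIL_BASE = "https://collections.si.edu/search/detail/"


def detail_page_url(record):
    """Build the collections.si.edu page address from a record's 'url' field."""
    return DETAIL_BASE + record["url"]


# Fields copied from the response above
record = {"id": "edanmdm-siris_arc_231998", "url": "edanmdm:siris_arc_231998"}
print(detail_page_url(record))
```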

Is it possible to extract each result into its own file? 

In [74]:
# block testing an extraction of each result into a separate file

data = json.loads(r.text)

# grab the object's record (a dictionary) from the response
objectInfo = data['response']
print(len(objectInfo))

11


In [13]:
## this is from Python 105a, TODO update
## NOTE: kittensList is the list of results built in the Python 105a notebook;
## it is not defined in this notebook
fname = 'kitten-result-'
ext = '.json'  # renamed from `format`, which shadows a Python built-in
n = 0

for item in kittensList:
    n = n + 1
    filename = fname + str(n) + ext
    with open(filename, 'w') as f:
        json.dump(item, f)  # serialize one result per file
        print('wrote',filename)
print('wrote',n,'files!')

wrote kitten-result-1.json
wrote kitten-result-2.json
wrote kitten-result-3.json
wrote kitten-result-4.json
wrote kitten-result-5.json
wrote kitten-result-6.json
wrote kitten-result-7.json
wrote kitten-result-8.json
wrote kitten-result-9.json
wrote kitten-result-10.json
wrote kitten-result-11.json
wrote kitten-result-12.json
wrote kitten-result-13.json
wrote kitten-result-14.json
wrote kitten-result-15.json
wrote kitten-result-16.json
wrote kitten-result-17.json
wrote kitten-result-18.json
wrote kitten-result-19.json
wrote kitten-result-20.json
wrote kitten-result-21.json
wrote kitten-result-22.json
wrote kitten-result-23.json
wrote kitten-result-24.json
wrote kitten-result-25.json
wrote 25 files!
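
The loop above writes one file per item in a list. In this notebook the response describes a single object, so a minimal sketch of the same idea might write one file named after the record's `id`. The helper name `save_record` and the choice of file name are assumptions made for this example.

```python
import json
from pathlib import Path


def save_record(record, directory="."):
    """Write one record to <directory>/<id>.json and return the path."""
    path = Path(directory) / (record["id"] + ".json")
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as "ó" readable in the file
        json.dump(record, f, ensure_ascii=False, indent=2)
    return path
```

Calling `save_record(objectInfo)` would then write `edanmdm-siris_arc_231998.json` to the current directory.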


How could we extract the image URLs?

In [78]:
for key in objectInfo['content']:
    print(key)

descriptiveNonRepeating
indexedStructured
freetext


In [85]:
for info in objectInfo['content']['indexedStructured']:
    print(info)

date
object_type
culture
name
topic
language
place
online_media_type


In [86]:
# there doesn't seem to be an image URL list under indexedStructured ...
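
Rather than checking each key by hand, one option is to walk the whole nested structure and report every string that looks like a web address. This is a generic sketch, not a description of where the API keeps media links: the helper `find_urls` is invented for this example, and the `example` record below is hypothetical, with made-up field names and a made-up link.

```python
def find_urls(node, path=""):
    """Yield (path, value) for every string in a nested structure that starts with http."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_urls(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_urls(value, f"{path}/{index}")
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        yield path, node


# Hypothetical record for illustration only; the field names and link are made up
example = {
    "title": "Example item",
    "media": [{"kind": "image", "link": "https://example.org/image.jpg"}],
}
print(list(find_urls(example)))
```

Running `list(find_urls(objectInfo))` on the real record would show whether any links are present and under which keys.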
