# Extract (Harvest) Metadata and Content

This notebook demonstrates the first step in the generalized workflow
for developing your reference digital collections.
We will walk through each step in class, but you will need to adapt these demonstrations
to work with your own collections, so part of this work is self-guided.

## ETL Workflow

The overarching process here follows a generalized "Extract - Transform - Load" (ETL) workflow. This is an abstract model for pulling data from one system, cleaning and reshaping it, and outputting it to another system.
While the term usually comes up in database work and data engineering, the goals are the same for our digital collections:
**extract** the metadata from the Library of Congress, change (**transform**) it into structures that make sense to our collection systems (CollectionBuilder and Omeka),
then ingest (**load**) that data and associated content into those systems.
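
To make the three stages concrete, here is a minimal sketch of an ETL pipeline in Python. The function names, the sample record, and the target column names (`dcterms:title`, `dcterms:identifier`) are illustrative assumptions rather than part of the course code. The real extract step, shown later in this notebook, requests JSON from the Library of Congress.

```python
import csv
import io


def extract():
    """Stand-in for a real API request: returns raw records as dictionaries."""
    # hypothetical record, not real Library of Congress data
    return [{'title': 'A sample garden photograph.\n', 'link': '/resource/example.00001/'}]


def transform(records):
    """Reshape raw records into the columns the target system expects."""
    return [
        {
            'dcterms:title': record['title'].strip(),
            'dcterms:identifier': record['link'],
        }
        for record in records
    ]


def load(rows, destination):
    """Write the transformed rows as CSV to any file-like object."""
    writer = csv.DictWriter(destination, fieldnames=['dcterms:title', 'dcterms:identifier'])
    writer.writeheader()
    writer.writerows(rows)


output = io.StringIO()
load(transform(extract()), output)
print(output.getvalue())
```

Keeping each stage in its own function means the transform step can be tested on sample data without making any network requests.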

## Learning objectives

After completing the course assignment, you should be able to:

* Explain, both conceptually and practically, how collection metadata is made available by a REST API.
* Explain the concept of metadata extraction and transformation.
* Create a structure for documenting metadata practices in a collection or repository (a Metadata Application Profile) and implement that structure for transformations.
* Use programming to work with data supplied by an API in JSON format, and to manage and transform useful parts of that data into CSV format.
* Create ingest-ready collection metadata that conforms to Dublin Core and other digital collection metadata standards, which can be used to load content into another site (in this case, an Omeka S site).

## Introduction

The main steps outlined in this notebook are as follows:

* **Extract the metadata.** This may be done in whatever way works for you. As illustrated here, there are two main steps that involve requesting JSON data from the Library of Congress: 
  1. Get the collection list: using the `requests` library, make a request to the Library of Congress API to get the list of items in a "Free to Use" collection. Write this list to a local file (here called `collection_items_list.csv`, in the `data` directory).


## Get collection list

In [1]:
import csv
import json
import requests

### Build the URL

In [2]:
endpoint = 'https://www.loc.gov/free-to-use'
parameters = {
    'fo' : 'json'
}

In [3]:
collection = 'gardens'

In [4]:
collection_list_response = requests.get(endpoint + '/' + collection, params=parameters)

In [5]:
collection_list_response.url

'https://www.loc.gov/free-to-use/gardens?fo=json'
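
The `params` argument is what turns the dictionary into the `?fo=json` query string. As a rough illustration of that step only, the same URL can be assembled with the standard library. This is a sketch of the idea, not a description of how `requests` works internally.

```python
from urllib.parse import urlencode

endpoint = 'https://www.loc.gov/free-to-use'
collection = 'gardens'
parameters = {'fo': 'json'}

# join the path pieces, then append the encoded query string
url = endpoint + '/' + collection + '?' + urlencode(parameters)
print(url)
```

Changing the value of `collection` is all it takes to point the same request at a different set.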

### Examine the response

Look at the JSON response and find the data you want: the collection set list.

In [6]:
collection_json = collection_list_response.json()

Take a moment to look around in the JSON response. Where would you look for the data about the items in the collection of free to use library images? 

_Hint: At this point we're not really looking for the information about the images, but the pointers to them (such as headings, links, etc)._ 

In [7]:
# .keys() is a helpful method for seeing which data elements are present
collection_json.keys()

dict_keys(['breadcrumbs', 'content', 'content_is_post', 'description', 'disable_max_line_length', 'expert_resources', 'manifest', 'next', 'next_sibling', 'options', 'pages', 'portal', 'previous', 'previous_sibling', 'site_type', 'timestamp', 'title', 'type'])

Looking further into the dictionary, it seems that you can get a list of the items in the set by looking into `content`, then `set`, then the `items` element:

In [8]:
for k in collection_json['content']['set']['items']:
    print(k)

{'alt': 'Blooming cactus plant and several smaller flowering plants nearby.', 'image': '/static/portals/free-to-use/public-domain/gardens/gardens-1.jpg', 'link': '/resource/highsm.16140/', 'title': 'Flowers, including a blooming cactus, at the Desert Botanical Garden, Phoenix, Arizona. Photo by Carol M. Highsmith, around 2000.'}
{'alt': "Illustration of the fountain's center jet and tiered, color-lit basins. Chicago skyline in background.\n", 'image': '/static/portals/free-to-use/public-domain/gardens/gardens-2.jpg', 'link': '/resource/cph.3g05158/', 'title': "Buckingham Fountain on Chicago's lake front. Poster by John Buczak, 1939.\n"}
{'alt': 'Overview of a walled flower garden with coastal landscape in background.', 'image': '/static/portals/free-to-use/public-domain/gardens/gardens-3.jpg', 'link': '/resource/ppmsca.16216/', 'title': 'Gray Gardens, East Hampton, New York. Lantern slide by Frances B. Johnston and Mattie E. Hewitt, around 1914.'}
{'alt': 'Roses in several colors grow 

How many items are there in the set?

In [9]:
len(collection_json['content']['set']['items'])

50
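
Chained lookups like `collection_json['content']['set']['items']` raise a `KeyError` if any level is missing, which could happen if a page is laid out differently. The following is a minimal sketch of a more defensive lookup. `get_set_items` is a hypothetical helper name, and the small `sample` dictionary only imitates the nesting of the response shown above.

```python
def get_set_items(page_json):
    """Return the list at content -> set -> items, or an empty list if absent."""
    content = page_json.get('content') or {}
    item_set = content.get('set') or {}
    return item_set.get('items') or []


# a tiny stand-in with the same nesting as the real response
sample = {'content': {'set': {'items': [{'title': 'First'}, {'title': 'Second'}]}}}

print(len(get_set_items(sample)))
print(get_set_items({'title': 'no content key'}))
```

An empty list lets a later `for` loop simply do nothing instead of stopping the notebook with an error.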

Now that you can find the list of items in the collection, note that each of these "items" has 4 elements: `alt`, `image`, `link`, and `title`.

In [10]:
collection_json['content']['set']['items'][0].keys()

dict_keys(['alt', 'image', 'link', 'title'])
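
The check above looks only at the first item. Before fixing the CSV column names, it may be worth confirming that every item carries the same keys. A small sketch, using made-up sample items in place of `collection_json['content']['set']['items']`:

```python
# stand-in data; in the notebook this would be collection_json['content']['set']['items']
items = [
    {'alt': 'a', 'image': 'b', 'link': 'c', 'title': 'd'},
    {'alt': 'e', 'image': 'f', 'link': 'g', 'title': 'h'},
]

# collect every key that appears in any item
all_keys = set()
for item in items:
    all_keys.update(item.keys())

headers = sorted(all_keys)
print(headers)

# find any items that are missing one of those keys
incomplete = [item for item in items if set(item.keys()) != all_keys]
print(len(incomplete), 'items with missing keys')
```

This matters because `csv.DictWriter` raises a `ValueError` by default when a row contains a key that is not in `fieldnames`, and it fills missing keys with an empty string.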

In a more fully automated environment, you might write a function that returns and saves the collection list, then reuse it in other code. For this task, it is enough to save the information once: extract these elements and save them locally to a CSV file.

In [11]:
# create a path for a CSV file, in this case to write to the collection-project directory
collection_set_list = 'collection_set_list_gardens.csv'
headers = ['alt', 'image', 'link', 'title']

with open(collection_set_list, 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    for item in collection_json['content']['set']['items']:
        
        # strip trailing whitespace (including newlines) from the title field
        item['title'] = item['title'].rstrip()
        writer.writerow(item)
    print('wrote', collection_set_list)

wrote collection_set_list_gardens.csv
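
The text before this cell mentions wrapping the step in a reusable function. As a sketch of what that might look like (the name `save_collection_list` and its parameters are hypothetical, not part of the course code):

```python
import csv


def save_collection_list(items, path, fieldnames=('alt', 'image', 'link', 'title')):
    """Write a list of item dictionaries to a CSV file and return the row count."""
    count = 0
    with open(path, 'w', encoding='utf-8', newline='') as f:
        # extrasaction='ignore' skips any keys that are not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames=list(fieldnames), extrasaction='ignore')
        writer.writeheader()
        for item in items:
            # strip surrounding whitespace from text values; build a copy so the input is unchanged
            cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}
            writer.writerow(cleaned)
            count += 1
    return count
```

It could then be called as `save_collection_list(collection_json['content']['set']['items'], 'collection_set_list_gardens.csv')`. Unlike the cell above, this version also strips the `alt` values, some of which end in a newline.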


Now you have a reusable collection list that you can work from.
This is useful for setting up loops later, when you want to
perform batch operations for each item in the collection.
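
For example, a later batch step might read the list back and build a full URL for each item. The sketch below uses two `link` values copied from the output above, held in memory and abbreviated to a single column rather than read from the saved file. It assumes that the relative links resolve against `https://www.loc.gov`, which is worth verifying before relying on it.

```python
import csv
import io
from urllib.parse import urljoin

# abbreviated stand-in for collection_set_list_gardens.csv (link column only)
csv_text = 'link\n/resource/highsm.16140/\n/resource/cph.3g05158/\n'

# assumption: item links are relative to the main Library of Congress site
base_url = 'https://www.loc.gov'

item_urls = []
for row in csv.DictReader(io.StringIO(csv_text)):
    item_urls.append(urljoin(base_url, row['link']))

for url in item_urls:
    print(url)
```

With the real file, `io.StringIO(csv_text)` would be replaced by `open(collection_set_list, encoding='utf-8', newline='')`.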

The next step in this workflow development assignment is to harvest metadata for each of the individual items. Continue in the next notebook!