# Extracting and Transforming Metadata

This notebook provides information about
how to design your metadata application profile (MAP),
which is the topic of Assignment 3.

## Learning objectives

After completing the assignment associated with this notebook, you should be able to:

* Explain, both conceptually and practically, how collection metadata is made available by a REST API.
* Explain the concept of metadata extraction and transformation.
* Create a structure for documenting metadata practices in a collection or repository (a Metadata Application Profile) and implement that structure for transformations.
* Use programming to work with data supplied by an API in JSON format, and to manage and transform the useful parts of that data into CSV format.
* Create ingest-ready collection metadata that conforms to Dublin Core and other digital collection metadata standards, which can be used to load content into another site (in this case, an Omeka S site).

## Introduction

The main steps outlined in this notebook are as follows:

* **Extract the metadata.** This may be done in whatever way works for you. As illustrated here, there are two main steps, both of which involve requesting JSON data from the Library of Congress:
  1. Get the collection list: using the `requests` library, make a request to the Library of Congress API to get the list of items in the "Free to Use" libraries collection. Write this list to a local file (here called `collection_items_list.csv`, in the `data` directory).
  1. Get the item metadata: using the list from the previous step as a source, query each item in the collection to get its details. Save the JSON responses locally so that we can extract information from them in the next steps. (In this example, you will have around 60 files, with a maximum of 62 as of September 2022. The number may vary when you run this code yourself, since the website may have different response rates.)
* **Transform the metadata.** As illustrated here, there are three substeps: develop the conceptual model for your transformation (expressed in a Metadata Application Profile, with the MAP implemented as a crosswalk), test the implementation on a small subset, then run your transformation on the entire set.
  1. Draft a metadata crosswalk. This is an exploratory activity: take some time to examine one or two sample responses from the previous step, identify the attributes that you want to extract, and see how to extract them from the JSON so that you can write a test transformation in the next step. The goal is to identify the information that you want to import into your Omeka site collection; essentially, we are going to recreate the collection. This step is largely conceptual and, although it is sketched out in this notebook, it does not use Python like the other steps here. That said, the next step depends on it.
  1. Develop your transformation script with a small subset of the metadata (in this case, one record).
  1. Transform the JSON data you've gathered into a CSV file according to the metadata crosswalk you've developed. The goal in this step is to create a CSV that we can use to import items into your Omeka site (using the CSV Import module). Note that the code outlined here suggests how all of these data elements may be extracted and transformed, but it does not necessarily output all of the elements that you will need to complete your assignment. In other words, there is still work to do to complete this code, but you are welcome to adopt or reuse the code here.
* **Load the metadata** into your target system, in this case Omeka, which we are using as a display platform. This step is not described in this notebook, because it consists of ingesting the CSV developed here into your Omeka site. Without the above steps, however, you wouldn't be able to directly display these items.
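
The last two Transform substeps above (test on one record, then run over the whole set) might be sketched roughly as follows. This is only an illustration, not the assignment solution: the function names, the two-column layout, and the `dc:title` / `dc:date` column headers are placeholders chosen for this sketch, and the field names `title` and `date` are assumed to sit at the top level of each saved JSON record.

```python
import csv
import json
from pathlib import Path

# Placeholder column headers for this sketch; a real crosswalk needs at least 10 fields.
FIELDNAMES = ["dc:title", "dc:date"]


def transform_record(record):
    """Map one item JSON record (a dict) to a flat dict, i.e. one CSV row."""
    return {
        "dc:title": record.get("title", ""),
        "dc:date": record.get("date", ""),
    }


def transform_directory(json_dir, csv_path):
    """Transform every *.json file in json_dir and write the rows to csv_path.

    Returns the number of records written.
    """
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            rows.append(transform_record(json.load(f)))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Testing `transform_record` on a single loaded record first makes it easier to spot missing or oddly shaped fields before calling `transform_directory` on the full set of saved responses.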

This notebook illustrates the "Draft a metadata crosswalk" substep of the **Transform** step above.

In [None]:
# refer to the previous Jupyter notebooks to extract the data
# and save it locally

# Draft and Design a Metadata Crosswalk Plan

The following does not outline the entire design, but it should give you an idea
of how to proceed. The assignment requires you to identify at least 10 fields
that you want to import into your new site.
Most of these will be Dublin Core terms, but you must also choose at least one field
from another scheme. I would suggest MODS, which is more of a bibliographic schema and allows for more granularity than Dublin Core; you can also import it into Omeka S (as you have already done in Assignment 2b).

In the drafting process, you need to look closely at the metadata that you downloaded
in JSON files for each item in the extract process.
In looking through these files, you will not find the exact terms in all cases,
but you should find clear parallels between the data schemes.
Your goal should be to crosswalk as much as possible from the items
into your new collection presentation site.
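
One way to look closely at a downloaded record is to list every path that leads to a value in it. The helper below is a minimal sketch for that purpose; `list_paths` is a name invented here, and the sample record contains made-up values that only mimic the shape of the fields discussed in this notebook.

```python
def list_paths(value, prefix="item"):
    """Return the paths to every leaf value in a nested dict/list structure."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            paths.extend(list_paths(child, f"{prefix}[{key!r}]"))
    elif isinstance(value, list):
        # Collapse list positions to [] so a repeated field is reported once.
        for child in value:
            for path in list_paths(child, f"{prefix}[]"):
                if path not in paths:
                    paths.append(path)
    else:
        paths.append(prefix)
    return paths


# Hypothetical record: the values are invented for illustration only.
sample = {
    "title": "Example title",
    "notes": ["first note", "second note"],
    "item": {"call_number": "EXAMPLE 123"},
}

for path in list_paths(sample):
    print(path)
```

Running this on a record loaded with `json.load` gives a compact inventory of candidate source fields to compare against Dublin Core and MODS terms.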

You can find MODS information for most (if not all) items in any of these sets. For example, looking at resource `highsm.20336`, note that the last field in the item metadata is a URL to an `item` page: https://www.loc.gov/item/2012630017/. That item page links to MODS and Dublin Core records.

Here's a draft table to start the process:


| source field name | source field path/dict name | target | target namespace | notes |
| --- | --- | --- | --- | --- |
| title | item['title'] | dc:title | DCTerms | Title provided by the original metadata; could also be mapped to MODS:titleInfo:title or other fields in other namespaces |
| date | item['date'] | dc:date | DCTerms | This is a 4-digit year, corresponds to date of creation in most cases |
| LC call number | item['item']['call_number'] | dc:identifier | DCTerms | Alphanumeric string. A Library of Congress number, should record for source/provenance reasons. |
| LC control number | item['item']['control_number'] | dc:identifier @type=lccn | DC Element with attribute | Corresponds to the Library of Congress Control Number (can be checked at http://lccn.loc.gov/) |
| creator | item['creator'] | dc:creator | DCTerms | Should be a name. May be repeated. If possible, are various roles needed? Such as 'photographer', 'author', etc. |
| description | item['description'] / item['summary'] | mods:physicalDescription | MODS | In the source data, this seems most like physical description, although it might correspond to dc:format or dc:type. Content in the record may come from a controlled vocabulary, such as LC Genre & Form Thesaurus. |
| format, physical | item['type'] | mods:physicalDescription:form | MODS | Description of the original physical format of this item (photograph, book, poster). _Note:_ this may not be present or in the same place for the different types of objects in the collection |
| format | item['format'] | dc:format | DCTerms | The basic type of the digital surrogate (e.g., 'image' or 'text') |
| notes (may be multiple) | item['notes'] (array) | dcterms:abstract | DCTerms | This appears to be closest to a "summary" or description of the content of the items. |
| subject_heading | | mods:subject | MODS | |
| source_collection | | | | |
| rights | | | | |
| place | | | | |
| image (link to the full image) | | | | |
| languages | | | | |
| mime_type | | | DCTerms | |
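
Once a few rows of the table are filled in, they can be written down as data and applied to a single record. The following is a minimal sketch under several assumptions: `CROSSWALK`, `get_path`, `as_text`, and `apply_crosswalk` are names invented for this example, the `|` separator for repeated values is an arbitrary choice that would have to match your import settings, and the sample record uses made-up values. The column headers simply reuse the target labels from the table and may need adjusting to what your Omeka S import expects.

```python
SEPARATOR = "|"  # assumed separator for repeated values; not prescribed by the assignment

# Each entry: (path of keys into the JSON record, target column header).
CROSSWALK = [
    (("title",), "dc:title"),
    (("date",), "dc:date"),
    (("item", "call_number"), "dc:identifier"),
    (("creator",), "dc:creator"),
    (("notes",), "dcterms:abstract"),
]


def get_path(record, path):
    """Follow a sequence of keys into nested dicts; return None if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def as_text(value):
    """Flatten a value to a string, joining repeated values with SEPARATOR."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        return SEPARATOR.join(str(v) for v in value)
    return str(value)


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Build one flat row (target header -> text) from one JSON record."""
    return {target: as_text(get_path(record, path)) for path, target in crosswalk}


# Hypothetical record with invented values, shaped like the paths in the table.
sample_record = {
    "title": "Example title",
    "date": "1980",
    "item": {"call_number": "EXAMPLE 123"},
    "notes": ["first note", "second note"],
}

row = apply_crosswalk(sample_record)
print(row)
```

Note that this sketch maps only one source field to `dc:identifier`. Because each target is a dictionary key, two source fields that share a target (such as the call number and the control number) would need either distinct column headers or an explicit merge step.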
