# Transforming Metadata 1: Designing a Metadata Application Profile

This notebook explains how to design your metadata application profile (MAP),
which is the topic of Assignment 3. It illustrates one element
of the "transform" step in the larger extract, transform, load (ETL) process.
Keep in mind that transforming or restructuring metadata
is a basic step that you will encounter in different forms in many workflows
that involve moving resources from one place to another. This step often goes
by other names, including data wrangling, data cleaning, data munging,
or data normalization. Whatever you call it, this is a fundamental
process in many digital curation activities. One of the "value adds" that you
can bring as a digital curator (whatever your title) is in providing guidance,
expertise, and structure for planning and documenting data transformations.

## Learning objectives

After completing the assignment associated with this notebook, you should: 

- Be able to explain the concept of metadata extraction and transformation.
- Be able to design and plan a structure for documenting metadata practices in a collection or repository, also known as a Metadata Application Profile (MAP).
- Understand the role of the MAP in planning and creating ingest-ready (loadable) collection metadata that conforms to Dublin Core and other digital collection metadata standards and can be loaded into a collection platform.

## Transform the Metadata: Drafting Your MAP

* **Transform the metadata.** As illustrated here, there are three substeps: develop the conceptual model for your transformation (expressed in a Metadata Application Profile and implemented in a crosswalk), test the implementation on a small subset, then run your transformation on the entire set.
  1. Draft a metadata crosswalk. This is an exploratory activity: take some time to examine one or two sample responses from the previous step. The goal is to identify the attributes that you want to import into your Omeka site collection (essentially, we are going to recreate the collection), to see how to extract them from the JSON, and to prepare for the test transformation in the next step. This work is largely conceptual and, although it is sketched out in this notebook, it does not rely on Python the way the other steps do. That said, the next step depends on this one.

This notebook illustrates the "Draft a metadata crosswalk" substep of the **Transform** step in the above process.

In [None]:
# refer to the previous Jupyter notebooks to extract the data
# and save it locally

# Draft and Design a Metadata Crosswalk Plan

The following does not outline the entire design, but it should give you an idea
of how to proceed. The assignment requires you to identify at least 10 fields
that you want to import into your new site.
Most of these will be DublinCore terms, but you must also choose at least one field
from another scheme. I suggest MODS, which is more of a bibliographic schema and allows for more granularity than DublinCore. You can also import it into Omeka S (as you have already done in Asst 2b).

## Identify and Choose the Fields to Crosswalk

In the drafting process, you need to look closely at the metadata that you downloaded
in JSON files for each item in the extract process.

To do this, use your python skills to investigate one example.
It's possible there will be differences between the examples, but even so,
this is a good way to start. 
As a reminder, you will find most of the information you're looking
for in the `item` element of the item JSON files:

In [1]:
import json
from os.path import join

In [None]:
# read in a sample file
with open(join('..','collection-site-materials','item-metadata','item_metadata-cph.3b41963.json'), encoding='utf-8') as file:
    metadata = json.load(file)

# check if it's there
print(json.dumps(metadata, indent=2)[:100])

{
  "_version_": 1731714874606616576,
  "access_restricted": false,
  "aka": [
    "https://www.loc.


In [3]:
# now take a look at the "item" key

for key, value in metadata['item'].items():
    print(key, ':\t', value)

call_number :	 SSF - Libraries--Georgia--Cordele <item> [P&P]
control_number :	 91787443
created :	 2016-04-21T09:17:00Z
created_published :	 [ca. 1916]
created_published_date :	 [ca. 1916]
date :	 [ca. 1916]
digital_id :	 ['cph 3b41963 //hdl.loc.gov/loc.pnp/cph.3b41963']
display_offsite :	 True
format :	 ['still image']
formats :	 [{'link': 'https://www.loc.gov/pictures/related/?fi=format&q=Photographic%20prints--1910-1920.&co=cph', 'title': 'Photographic prints--1910-1920.'}]
genre :	 ['Photographic prints--1910-1920']
id :	 91787443
link :	 https://www.loc.gov/pictures/item/91787443/
location :	 ['Georgia--Cordele']
marc :	 https://www.loc.gov/pictures/item/91787443/marc/
medium :	 ['1 photographic print.']
medium_brief :	 1 photographic print.
mediums :	 ['1 photographic print.']
modified :	 2016-04-21T09:17:00Z
notes :	 ['At bottom right of photo: "Cordele Book Co."', 'Wittemann Collection.']
number_former_id :	 ['https://www.loc.gov/item/91787443', 'https://www.loc.gov/item/11583
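
Because fields can differ from one item to the next, it may help to tally how often each `item` key appears across all of your downloaded files before deciding which elements to mark as required. The following is a minimal sketch, assuming every file in the folder is a JSON object shaped like the sample above; the function name `count_item_fields` is hypothetical and not part of the course materials.

```python
import json
from collections import Counter
from pathlib import Path


def count_item_fields(folder):
    """Count how many JSON files in `folder` contain each key under `item`."""
    counts = Counter()
    total = 0
    for path in sorted(Path(folder).glob('*.json')):
        with open(path, encoding='utf-8') as file:
            record = json.load(file)
        # Assumption: a record without an `item` element contributes no fields.
        counts.update(record.get('item', {}).keys())
        total += 1
    return counts, total


# Possible usage, assuming the folder layout used elsewhere in this notebook:
# folder = Path('..') / 'collection-site-materials' / 'item-metadata'
# counts, total = count_item_fields(folder)
# for field, n in counts.most_common():
#     print(f'{field}\t{n}/{total}')
```

A field that appears in every file is a reasonable candidate for a required element; one that appears only occasionally probably should stay optional.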

In [None]:
# for reusability, you may want to write this to a file

metadata_fields_file = join('..','collection-site-materials','metadata_fields.txt')

# write a header row, then one tab-separated row per attribute
with open(metadata_fields_file, 'w', encoding='utf-8') as f:
    f.write('attribute\tvalue\n')
    for key, value in metadata['item'].items():
        f.write(f'{key}\t{value}\n')

The above will create a tab-delimited file (also known as TSV, similar to a CSV).
You can view it in VSCode as a plain text file, or
you can open it in a spreadsheet application like Excel or Sheets.
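
One caveat: if a value ever contains a tab or a line break, rows built by hand will not line up in a spreadsheet. Python's standard `csv` module handles the quoting for you. The sketch below writes to an in-memory buffer so that it runs on its own; `sample_item` is a small stand-in for `metadata['item']`, and `write_fields_tsv` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for metadata['item'], using two values from the sample record above.
sample_item = {
    'date': '[ca. 1916]',
    'notes': ['At bottom right of photo: "Cordele Book Co."', 'Wittemann Collection.'],
}


def write_fields_tsv(item, handle):
    """Write attribute/value pairs as tab-separated rows, quoting where needed."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['attribute', 'value'])
    for key, value in item.items():
        writer.writerow([key, str(value)])


buffer = io.StringIO()
write_fields_tsv(sample_item, buffer)

# Read the rows back to confirm that each one still has exactly two columns.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter='\t'))
for row in rows:
    print(len(row), row[0])
```

To write to a real file instead, pass a handle opened with `open(path, 'w', encoding='utf-8', newline='')`.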

Use that export to start your MAP list. Exploring the data a bit will
help you understand the data and develop a transformation plan. For example,
the cells below demonstrate how I looked into the date fields and decided what
information was best to keep and how to map it.
From looking at the previous list exported to the TXT file, I knew that fields with `date` and `created` in their field names were likely to have related information:

In [5]:
for attribute in metadata.keys():
    if 'created' in attribute:
        print(attribute)

created
created_published
created_published_date
source_created


In [6]:
for attribute in metadata.keys():
    if 'date' in attribute:
        print(attribute)

created_published_date
date
dates
sort_date


In [7]:
created = metadata['created']
date = metadata['date']
created_published_date = metadata['created_published_date']
source_created = metadata['source_created']
dates = metadata['dates']

print(created)
print(date)
print(created_published_date)
print(source_created)
print(dates)

2016-04-21T09:17:00Z
1916-01-01
[ca. 1916]
1991-08-22T00:00:00Z
[{'1916': 'https://www.loc.gov/search/?dates=1916/1916&fo=json'}]


It's clear that the 1991 and 2016 dates refer to some collection management action;
that is something to look into another time. The 1916 dates are of most interest,
so in this case `created_published_date` is the most useful field.
It also maps cleanly to DublinCore's [created](http://purl.org/dc/terms/created) term.
It's possible that not all of the items have this field, so I would also keep the `date` field.

Thus, you can start building up your data structure:

In [8]:
item_data = {
    'date': metadata['date'],
    'created': metadata['created_published_date']
}

print(item_data)

{'date': '1916-01-01', 'created': '[ca. 1916]'}
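
Since some items may lack one of these fields, indexing with `metadata['created_published_date']` would raise a `KeyError` on those records. One way to guard against that, sketched here with a hypothetical helper named `build_item_data`, is to use `dict.get` and only keep the values that are present:

```python
def build_item_data(metadata):
    """Collect the date fields, skipping any that a record does not have."""
    item_data = {}
    date = metadata.get('date')
    created = metadata.get('created_published_date')
    if date is not None:
        item_data['date'] = date
    if created is not None:
        item_data['created'] = created
    return item_data


# Stand-in records: the first mirrors the sample item, the second lacks one field.
sample = {'date': '1916-01-01', 'created_published_date': '[ca. 1916]'}
print(build_item_data(sample))
# {'date': '1916-01-01', 'created': '[ca. 1916]'}
print(build_item_data({'date': '1916-01-01'}))
# {'date': '1916-01-01'}
```

Whether to skip a missing field, fill in a placeholder, or flag the record for review is a decision worth recording in your MAP.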


## Documenting your decisions

For each field that you select to crosswalk into your new collection,
you will need to document the corresponding metadata element in your MAP.

For each element, you will make two MAP entries: 

1. List the term in the MAP table (see example below, or look at the samples linked in the assignment).
2. Create a row in the DCTAP profile.

For the most part, the two entries contain the same information. The first is intended for human readers,
while the second is intended for "machine" readability.

Taking as an example the date fields noted above, a sample MAP entry of the first type might look something like this:

| Element Name | date |
| --- | ------ |
| Label in My New Collection | Date of Creation |
| Mapping for My New Collection | dcterms:date |
| Description | This is a date extracted from the LOC's original metadata, which indicates when the original resource was created |
| Required? | No. Optional, but strongly encouraged unless no date was provided in the original item |
| Repeatable? | No |
| Entry Rules | Use ISO-8601 date formatting; the format should be YYYY, YYYY-MM, or YYYY-MM-DD |
| Data Type | literal (a plain string value with ISO-8601 formatting) |
| Example Entry | 1916 |
| Source (LOC) Attribute Name | date |
| DC Mapping | dcterms:date |
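
An entry rule like the one above can be checked mechanically. The following is a rough sketch, not a full ISO-8601 validator: it only tests the three shapes named in the rule and does not confirm that the day exists in the given month. The names `ISO_DATE` and `follows_entry_rule` are hypothetical.

```python
import re

# Matches YYYY, YYYY-MM, or YYYY-MM-DD with a plausible month and day.
ISO_DATE = re.compile(r'\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?')


def follows_entry_rule(value):
    """Return True if `value` has one of the shapes the entry rule allows."""
    return ISO_DATE.fullmatch(value) is not None


for candidate in ['1916', '1916-01', '1916-01-01', '[ca. 1916]', '1916-13']:
    print(candidate, follows_entry_rule(candidate))
```

Note that the source value `[ca. 1916]` fails the check, which is a useful reminder that the transformation script will need to clean such values before loading them.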


A DCTAP row for this might look as follows:

```csv
shapeID,shapeLabel,propertyID,propertyLabel,mandatory,repeatable,valueNodeType,valueDataType,valueConstraint,valueConstraintType,valueShape,note
       ,          ,dcterms:title,Title,TRUE,FALSE,literal,xsd:string,,,,
       ,           ,dcterms:date,Date,FALSE,FALSE,literal,xsd:date,,,,
```
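
If you would rather generate the DCTAP file than type it by hand, the standard `csv` module can write it. This is a minimal sketch that reuses the column names and the two rows shown above; the variable names are hypothetical, and columns you leave out of a row are written as empty cells.

```python
import csv
import io

DCTAP_COLUMNS = [
    'shapeID', 'shapeLabel', 'propertyID', 'propertyLabel', 'mandatory',
    'repeatable', 'valueNodeType', 'valueDataType', 'valueConstraint',
    'valueConstraintType', 'valueShape', 'note',
]

# One dict per MAP element; only the columns with values need to be listed.
dctap_rows = [
    {'propertyID': 'dcterms:title', 'propertyLabel': 'Title', 'mandatory': 'TRUE',
     'repeatable': 'FALSE', 'valueNodeType': 'literal', 'valueDataType': 'xsd:string'},
    {'propertyID': 'dcterms:date', 'propertyLabel': 'Date', 'mandatory': 'FALSE',
     'repeatable': 'FALSE', 'valueNodeType': 'literal', 'valueDataType': 'xsd:date'},
]

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=DCTAP_COLUMNS, restval='',
                        lineterminator='\n')
writer.writeheader()
writer.writerows(dctap_rows)

dctap_csv = output.getvalue()
print(dctap_csv)
```

Keeping the rows as a list of dictionaries also makes it easy to add an element later without counting commas.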

## Developing your crosswalking implementation script

This is actually in preparation for Assignment 4 (your transformation script),
but in practice and concept these two steps are linked.

Simultaneously, start to make notes about how you will map the original fields
to the destinations for your new collection sites.
A good way to do this is to start a spreadsheet file, which you could base on the previously exported TSV file.
Or, you can [use a template like this one designed in Google Sheets for the course](https://docs.google.com/spreadsheets/d/1m2nq-PInOIN1GTRKRVGtKtc5qI0DTojw6waMe5hVzMc/edit?usp=sharing).

In looking through these files, you will not find exact matches for every data element in the JSON,
but you should find clear parallels between the source data, DublinCore, and/or MODS.
Your goal should be to crosswalk as much as possible from the items
into your new collection presentation sites.

Here's a draft table to start the process:


| source field name | source field path/dict name | target | target namespace | notes |
| --- | --- | --- | --- | --- |
| title | item['title'] | dc:title | DCTerms | Title provided by the original metadata, could also be mapped to MODS:titleInfo:title or other fields in other namespaces |
| date | item['date'] | dc:date | DCTerms | This is a 4-digit year, corresponds to date of creation in most cases |
| LC control number | item['item']['control_number'] | dc:identifier @type=lccn | DC Element with attribute | Corresponds to the Library of Congress Control Number |
| creator | item['creator'] | dc:creator | DCTerms | Should be a name. May be repeated. If possible, are various roles needed? Such as 'photographer', 'author', etc. |
| description | item['description'] \/ item['summary'] | mods:physicalDescription | MODS | In the source data, this seems most like physical description, although it might correspond to dc:format or dc:type. Content in the record may come from a controlled vocabulary, such as LC Genre & Form Thesaurus. |
| format, physical | item['type'] | mods:physicalDescription:form | MODS | Description of the original physical format of this item \(photograph, book, poster\). _Note:_ this may not be present or in the same place for the different types of objects in the collection |
| format | item['format'] | dc:format | DCTerms | The basic type of the digital surrogate \(e.g., 'image' or 'text'\) |
| notes (may be multiple) | item['notes'] (array) | dcterms:abstract | DCTerms | This appears to be closest to a "summary" or description of the content of the items. |
| subject_heading | | mods:subject | MODS | |
| source_collection | | | | |
| rights | | | | |
| place | | | | |
| languages | | | | |
| mime_type | | | DCTerms | |

**Note:** this table does not include information about the digital assets,
it's only about the descriptive metadata.
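
Looking ahead to the transformation script, a draft table like this one can be expressed as a small data structure and applied to a record. The sketch below is only one possible shape for that. The paths in `CROSSWALK` are assumptions about where each value sits in the JSON, the targets are copied from the draft table, the helper names are hypothetical, and the title in `sample_record` is made up for illustration.

```python
# Each entry pairs a path into the record with a target term from the draft table.
CROSSWALK = [
    (('title',), 'dc:title'),
    (('date',), 'dc:date'),
    (('item', 'control_number'), 'dc:identifier'),
    (('item', 'notes'), 'dcterms:abstract'),
]


def lookup(record, path):
    """Follow `path` through nested dicts; return None if any key is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_crosswalk(record, crosswalk=CROSSWALK):
    """Build a dict of target terms from the source fields that are present."""
    result = {}
    for path, target in crosswalk:
        value = lookup(record, path)
        if value is not None:
            result[target] = value
    return result


# Illustrative record; the title is invented, other values echo the sample item.
sample_record = {
    'title': 'Example title',
    'date': '1916-01-01',
    'item': {'control_number': '91787443', 'notes': ['Wittemann Collection.']},
}
print(apply_crosswalk(sample_record))
```

Keeping the mapping in one list means that a change to the MAP only requires editing `CROSSWALK`, not the code that walks the records.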