# Curation Script - Harvard College Observatory Announcement Cards

**About**: This script contains the curation and deposit workflow for adding data from the Harvard College Observatory Announcement Cards to Harvard Dataverse. 

**Notes**: 
- Uses `metadata.ipynb` to build a metadata spreadsheet for each dataset and data file.
- Be sure that `curate.py` is in the directory set in `g_module_path`, since that path is added to `sys.path` before `curate` is imported.

#### 0.0 Set global variables and import libraries

In [1]:
#set globals
# set curation source path
g_module_path = '/Users/katherinemika/Desktop/curation/historic_datasets/hco'

# path to output file
g_dataverse_inventory_file = '/Users/katherinemika/Desktop/curation/historic_datasets/hco/hco_batch_metadata.csv'
# series names
g_series_names = []

# dataset inventories (keyed on series name)
g_series_inventories = {}

# dataset metadata (keyed on series name)
g_dataset_metadata = {}

# dataverse installation
g_dataverse_installation_url = 'https://demo.dataverse.org'

# dataverse API key
g_dataverse_api_key = 'e487101b-6f0e-47dc-8109-7a72a9a9b0ed'

# dataverse collection name
g_dataverse_collection = 'hco_card'

# dataverse inventory dataframe
g_dataverse_inventory_df = None

# dataset author
g_dataset_author = 'Mika, Katherine'

# dataset author affiliation
g_dataset_author_affiliation = 'Harvard Library'

# dataset contact information
g_dataset_contact = 'Mika, Katherine'
g_dataset_contact_email = 'katherine_mika@harvard.edu'

# full path to location of datafiles (e.g., ../data/trade_statistics)
g_datafiles_path = '/Users/katherinemika/Desktop/curation/historic_datasets/hco/data_files'
# dataverse dataset information (keyed on series name)
g_dataverse_dataset_info = {}

# datafile metadata (dataframe of datafile metadata, keyed on series name)
g_datafile_metadata = {}

# datafile description template
g_datafile_description_template_txt = 'File contains OCR text with data from: '
g_datafile_description_template_csv = 'File contains csv table with data from: '
g_datafile_description_template_xml = 'File contains xml tree with OCR bounding box data from: '
g_datafile_description_template_jpg = 'File contains jpg image of: '


# dataset batches (array of batches of series to create/upload)
g_dataset_batches = []
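
`g_dataset_batches` is declared here but is not filled in by the cells that follow. A minimal sketch of one way to split the series names into fixed-size batches is shown below; the helper name `make_batches` and the batch size are assumptions for illustration, not part of `curate.py`.

```python
def make_batches(names, batch_size):
    """Split a list of series names into consecutive batches (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


# Example with the series naming pattern used in this notebook
example_names = [f"Announcement Card number: {n}" for n in range(1, 14)]
example_batches = make_batches(example_names, 5)
print([len(batch) for batch in example_batches])  # [5, 5, 3]
```

Each batch could then be created and uploaded separately, so that a failure partway through affects only one batch.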

In [2]:
#import libraries
import sys
if g_module_path not in sys.path:
    sys.path.append(g_module_path)

import curate
import requests
import numpy as np
import pandas as pd
import pprint
import rich

from easyDataverse import Dataverse

#### 0.1 Build local functions

In [3]:
# add files to datasets
def add_files_to_dataset(dataset, file_inventory, datafiles_path):
    for index, row in file_inventory.iterrows():
        dataset.add_file(
            local_path = datafiles_path + "/" + row['filename'],
            description = row['description'],
            categories = row['tags']
        )
    return dataset


## Curate Inventory
### 1.0. Prepare inventory data for curation
#### 1.1 Read `dataverse_inventory`

- Review example_spreadsheet.csv for proper table formatting
- Create `DataFrame` for later use
- Note: the `upload_datafiles()` function expects all files to be in a single, flat directory (i.e. not grouped by file type)
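
Because the files are expected in one flat directory, it can help to confirm that every `filename` in the inventory exists there before uploading. This sketch uses only the standard library; `find_missing_files` is a hypothetical helper, and the temporary directory stands in for `g_datafiles_path`.

```python
import tempfile
from pathlib import Path


def find_missing_files(filenames, directory):
    """Return the filenames that are not regular files directly inside directory."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Demonstration against a throwaway directory (stand-in for g_datafiles_path)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "HCOAnnouncement0001_0001.innodata.txt").write_text("example")
    missing = find_missing_files(
        ["HCOAnnouncement0001_0001.innodata.txt",
         "HCOAnnouncement0001_0001.innodata.xml"],
        tmp,
    )

print(missing)  # ['HCOAnnouncement0001_0001.innodata.xml']
```

In the notebook this would be called with `g_dataverse_inventory_df['filename']` and `g_datafiles_path` once the inventory has been read.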

In [4]:
# read the dataverse inventory file
import chardet

with open(g_dataverse_inventory_file, 'rb') as f: 
    result = chardet.detect(f.read())
    encoding = result['encoding']
    print(f"detected encoding: {encoding}")
    
g_dataverse_inventory_df = pd.read_csv(g_dataverse_inventory_file,
                                       index_col=None,
                                       dtype={'card_date_month': str, 'card_date_day': str},
                                       encoding=encoding, 
                                       low_memory=False)

detected encoding: UTF-8-SIG
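
The detected `UTF-8-SIG` encoding means the file begins with a byte-order mark (BOM). As a small standard-library illustration (the sample bytes are made up), decoding with `utf-8-sig` strips the BOM, while plain `utf-8` leaves it attached to the first column name:

```python
import codecs

# Made-up sample: a BOM followed by a CSV header line
raw = codecs.BOM_UTF8 + "filename,file_type\n".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM kept as '\ufeff' at the start
without_bom = raw.decode("utf-8-sig")  # BOM removed

print(repr(with_bom.split(",")[0]))
print(repr(without_bom.split(",")[0]))
```

This is why passing the detected encoding to `pd.read_csv` matters: without it, the first column header may not match the expected name.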


In [5]:
g_dataverse_inventory_df

Unnamed: 0,filename,file_type,card_number,card_date_year,card_date_month,card_date_day,contributor,observation,all_observations,series_name,url,volume_title,published,subjects,topic_class,permalink
0,HCOAnnouncement0001_0001.innodata.xml,xml,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
1,HCOAnnouncement0001_0001.innodata.jpg,jpg,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
2,HCOAnnouncement0001_0001.innodata.txt,txt,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
3,HCOAnnouncement0001_0001_a.innodata.csv,csv,1,1926,3,12,Harlow Shapley,BLATHWAYT’S COMET,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
4,HCOAnnouncement0001_0001_b.innodata.csv,csv,1,1926,3,12,Harlow Shapley,BLATHWAYT’S COMET,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
5,HCOAnnouncement0001_0002.innodata.xml,xml,2,1926,3,18,Harlow Shapley,,ENSOR'S COMET; BLATHWAYT'S COMET,Announcement Card number: 2,http://tamkin2.eps.harvard.edu/IAUCs/HAC0002.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
6,HCOAnnouncement0001_0002.innodata.jpg,jpg,2,1926,3,18,Harlow Shapley,,ENSOR'S COMET; BLATHWAYT'S COMET,Announcement Card number: 2,http://tamkin2.eps.harvard.edu/IAUCs/HAC0002.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
7,HCOAnnouncement0001_0002.innodata.txt,txt,2,1926,3,18,Harlow Shapley,,ENSOR'S COMET; BLATHWAYT'S COMET,Announcement Card number: 2,http://tamkin2.eps.harvard.edu/IAUCs/HAC0002.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
8,HCOAnnouncement0001_0002_a.innodata.csv,csv,2,1926,3,18,Harlow Shapley,ENSOR'S COMET,ENSOR'S COMET; BLATHWAYT'S COMET,Announcement Card number: 2,http://tamkin2.eps.harvard.edu/IAUCs/HAC0002.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
9,HCOAnnouncement0001_0002_b.innodata.csv,csv,2,1926,3,18,Harlow Shapley,ENSOR'S COMET,ENSOR'S COMET; BLATHWAYT'S COMET,Announcement Card number: 2,http://tamkin2.eps.harvard.edu/IAUCs/HAC0002.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html


#### 1.2 Create dataset inventories
- Get the list of series names
- Create a `dict` of inventories keyed on series name

In [6]:
# get list of series in the full inventory
g_series_names = list(g_dataverse_inventory_df.series_name.unique())

# create series inventories
for name in g_series_names:
    # get series inventory
    g_series_inventories[name] = g_dataverse_inventory_df.loc[g_dataverse_inventory_df['series_name'] == name]

pprint.pprint(g_series_names)

['Announcement Card number: 1',
 'Announcement Card number: 2',
 'Announcement Card number: 3',
 'Announcement Card number: 4',
 'Announcement Card number: 5',
 'Announcement Card number: 6',
 'Announcement Card number: 7',
 'Announcement Card number: 8',
 'Announcement Card number: 9',
 'Announcement Card number: 10',
 'Announcement Card number: 11',
 'Announcement Card number: 12',
 'Announcement Card number: 13']
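
An equivalent way to build the per-series inventories is `DataFrame.groupby`, which splits the frame in one pass instead of filtering once per series. A minimal sketch on a made-up three-row frame (the real inventory has more columns):

```python
import pandas as pd

# Made-up miniature inventory with the same key column as the real one
inventory_df = pd.DataFrame({
    "filename": ["a.xml", "a.txt", "b.xml"],
    "series_name": ["Announcement Card number: 1",
                    "Announcement Card number: 1",
                    "Announcement Card number: 2"],
})

# sort=False keeps the series in order of first appearance, like unique()
series_inventories = {
    name: group for name, group in inventory_df.groupby("series_name", sort=False)
}

print(list(series_inventories))
print(len(series_inventories["Announcement Card number: 1"]))
```

The explicit loop in the cell above gives the same result; `groupby` mainly helps when the inventory grows to many series.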


#### 1.3 Create dataset metadata
- Create `dict` of dataset metadata extracted from each series 

In [7]:
# for each series name, create dataset metadata
for series_name in g_series_names:
    # get series inventory
    series_inventory = g_series_inventories[series_name]
    md = curate.create_dataset_metadata(g_dataset_author, g_dataset_author_affiliation, 
                                        g_dataset_contact, g_dataset_contact_email,
                                        series_name, series_inventory)
    g_dataset_metadata[series_name] = md

pprint.pprint(g_dataset_metadata)

{'Announcement Card number: 1': {'astroFacility': ['Harvard Bureau of '
                                                   'Astronomical Telegrams'],
                                 'astroObject': ["BLATHWAYT'S COMET"],
                                 'astroType': 'Observation',
                                 'author': [{'authorAffiliation': 'Harvard '
                                                                  'Library',
                                             'authorName': 'Mika, Katherine'}],
                                 'contact': [{'datasetContactAffiliation': 'Harvard '
                                                                           'Library',
                                              'datasetContactEmail': 'katherine_mika@harvard.edu',
                                              'datasetContactName': 'Mika, '
                                                                    'Katherine'}],
                                 'creation_date': '19

#### 1.4 Create datafile metadata
- Create `dict` of `DataFrames` containing metadata about individual files

In [8]:
for series_name in g_series_names:
    # get dataset metadata for the series
    series_metadata = g_dataset_metadata[series_name]
    # get the series inventory
    series_inventory_df = g_series_inventories[series_name]
    # create datafile metadata
    g_datafile_metadata[series_name] = curate.create_datafile_metadata(series_inventory_df,
                                                                       g_datafile_description_template_csv,
                                                                       g_datafile_description_template_txt,
                                                                       g_datafile_description_template_xml)

In [9]:
g_datafile_metadata['Announcement Card number: 1']

Unnamed: 0,filename,file_type,description,mimetype,tags
0,HCOAnnouncement0001_0001.innodata.xml,xml,File contains xml tree with OCR bounding box d...,application/xml,[Data]
1,HCOAnnouncement0001_0001.innodata.jpg,jpg,File contains OCR text with data from: Announ...,image/jpeg,[Data]
2,HCOAnnouncement0001_0001.innodata.txt,txt,File contains OCR text with data from: Announ...,text/plain,[Data]
3,HCOAnnouncement0001_0001_a.innodata.csv,csv,File contains OCR text with data from: Announ...,text/csv,"[Data, BLATHWAYT’S COMET]"
4,HCOAnnouncement0001_0001_b.innodata.csv,csv,File contains OCR text with data from: Announ...,text/csv,"[Data, BLATHWAYT’S COMET]"
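
In the output above, the jpg and csv rows carry the OCR-text description, so the templates may not be reaching the file types they were written for. The internals of `curate.create_datafile_metadata` are not shown here; the following is only a hypothetical sketch of a per-type lookup that keeps each template and MIME type tied to its `file_type`. The function name `describe_datafile` is an assumption.

```python
# Templates copied from the globals defined in section 0.0
DESCRIPTION_TEMPLATES = {
    "txt": "File contains OCR text with data from: ",
    "csv": "File contains csv table with data from: ",
    "xml": "File contains xml tree with OCR bounding box data from: ",
    "jpg": "File contains jpg image of: ",
}

# MIME types as they appear in the datafile metadata table
MIMETYPES = {
    "txt": "text/plain",
    "csv": "text/csv",
    "xml": "application/xml",
    "jpg": "image/jpeg",
}


def describe_datafile(file_type, series_name):
    """Return (description, mimetype) for one file; raises KeyError on unknown types."""
    return DESCRIPTION_TEMPLATES[file_type] + series_name, MIMETYPES[file_type]


print(describe_datafile("jpg", "Announcement Card number: 1"))
```

Keying both lookups on `file_type` avoids depending on the order in which templates are passed as positional arguments.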


### 2.0 Initialize easyDataverse
- Connect to your server


In [10]:
dataverse = Dataverse(
    server_url = g_dataverse_installation_url,
    api_token = g_dataverse_api_key
)
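
The API key in section 0.0 is written directly into the notebook. One common alternative is to read it from an environment variable so that it is not saved with the file. This is a sketch only: the variable name `DATAVERSE_API_KEY` and the helper `get_api_key` are assumptions, not something easyDataverse requires.

```python
import os


def get_api_key(env_var="DATAVERSE_API_KEY"):
    """Read the API key from the environment; fail loudly if it is not set."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before connecting")
    return key


# Demonstration with a placeholder value (not a real key)
os.environ["DATAVERSE_API_KEY"] = "placeholder-key"
print(get_api_key())
```

The returned value would then be passed as `api_token` in place of `g_dataverse_api_key`.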

In [11]:
#poking around
dataset = dataverse.create_dataset()

In [12]:
dataverse.list_metadatablocks(detailed=False)

geospatial
socialscience
astrophysics
biomedical
journal
customMRA
customGSD
customARCS
customPSRI
customPSI
customCHIA
customDigaai
citation
customSAEF
computationalworkflow


In [13]:
dataset.astrophysics.info()

In [14]:
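# Singular fields
# NOTE: this assignment raises the ValidationError shown below. Other astrophysics
# fields in this notebook are set through snake_case names (e.g. astro_object),
# so the attribute is probably astro_type; dataset.astrophysics.info() lists the names.
dataset.astrophysics.astroType = ["Observation"]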

print(dataset)

ValidationError: 1 validation error for Astrophysics
astroType
  Object has no attribute 'astroType' [type=no_such_attribute, input_value=['Observation'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.7/v/no_such_attribute

In [15]:
# single fields, with validation (still having trouble with astroType)
from pydantic import ValidationError

try:
    dataset.astrophysics.astroType = ["Observation"]
except ValidationError as e:
    rich.print(e)
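
The failing name `astroType` is the camelCase field name from the metadata block, whereas the assignments that appear later in this notebook use snake_case attributes such as `astro_object`. Assuming easyDataverse follows that convention for every field (not confirmed here; `dataset.astrophysics.info()` is the authority), a small helper can derive the likely attribute name:

```python
import re


def camel_to_snake(name):
    """Convert a camelCase field name to snake_case (assumed naming convention)."""
    # Insert an underscore before each capital that follows a lowercase letter or digit
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


for field in ["astroType", "astroObject", "astroFacility"]:
    print(field, "->", camel_to_snake(field))
```

If the convention holds, the assignment to try would be `dataset.astrophysics.astro_type`.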

### 3.0 Create datasets
- Create datasets from metadata inventory
- Add datafiles to datasets

#### 3.1. Create datasets

In [16]:
#for each series, create a dataset and save its information

for series_name in g_series_names:
    #initiate a dataset for each series
    series_dataset = dataverse.create_dataset()
    #get the series metadata
    series_metadata = g_dataset_metadata[series_name]
    #create the dataset
    g_dataverse_dataset_info[series_name] = curate.create_dataset(series_dataset, series_metadata)

In [17]:
rich.print(g_dataverse_dataset_info['Announcement Card number: 1'].dataverse_dict())

#### 3.2 Add files to datasets 

In [18]:
for dataset_name, dataset in g_dataverse_dataset_info.items():
    file_inventory = g_datafile_metadata.get(dataset_name)
    add_files_to_dataset(dataset, file_inventory, g_datafiles_path)
    title = dataset.citation.title
    rich.print(f"Added {len(dataset.files)} files to {title}")

### 4.0 Upload datasets to the repository
- Upload datasets to create drafts
- Unlock datasets that are locked
- Publish final datasets!

In [19]:
#upload datasets
collection_pids = []
for dataset_name, dataset in g_dataverse_dataset_info.items():
    pid = dataset.upload(dataverse_name = g_dataverse_collection, n_parallel=4)
    collection_pids.append(pid)

Dataset with pid 'doi:10.70122/FK2/3YFXJF' created.
Dataset with pid 'doi:10.70122/FK2/K1KQBH' created.

ClientResponseError: 500, message='Internal Server Error', url=URL('https://demo.dataverse.org/api/datasets/:persistentId/replaceFiles?persistentId=doi:10.70122/FK2/K1KQBH')
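
The upload loop above stops at the first server error, leaving the remaining datasets untouched. Whether a retry would get past this particular 500 response is not known; as a general pattern, though, a call can be wrapped so that transient failures are retried with increasing delays. The helper `retry_call` below is hypothetical and uses only the standard library.

```python
import time


def retry_call(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in that fails twice, then succeeds
calls = {"count": 0}

def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated server error")
    return "doi:EXAMPLE"

delays = []
result = retry_call(flaky_upload, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Note that the dataset in the failed run had already been created before the file step failed, so retrying `dataset.upload` as a whole could produce a duplicate draft; checking the collection first is advisable.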

In [15]:
#unlock locked datasets when needed
from pyDataverse.api import NativeApi

In [16]:
g_api = NativeApi(g_dataverse_installation_url, g_dataverse_api_key)

# print results
print('{}'.format(g_api))

Native API: https://demo.dataverse.org/api/v1


In [18]:
curate.unlock_datasets(g_api, g_dataverse_collection)

{'doi:10.70122/FK2/XCMDK2': {'status': True,
  'message': 'publish_dataset::Success - unlocked dataset: 200:doi:10.70122/FK2/XCMDK2'}}

## Testing section

### Metadata

In [28]:
# get the first series
first_series = g_series_names[0]
first_series_metadata = g_dataset_metadata[first_series]
first_series_inventory_df = g_series_inventories[first_series]

In [29]:
first_series_inventory_df

Unnamed: 0,filename,file_type,card_number,card_date_year,card_date_month,card_date_day,contributor,observation,all_observations,series_name,url,volume_title,published,subjects,topic_class,permalink
0,HCOAnnouncement0001_0001.innodata.xml,xml,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
1,HCOAnnouncement0001_0001.innodata.jpg,jpg,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
2,HCOAnnouncement0001_0001.innodata.txt,txt,1,1926,3,12,Harlow Shapley,,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
3,HCOAnnouncement0001_0001_a.innodata.csv,csv,1,1926,3,12,Harlow Shapley,BLATHWAYT’S COMET,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html
4,HCOAnnouncement0001_0001_b.innodata.csv,csv,1,1926,3,12,Harlow Shapley,BLATHWAYT’S COMET,BLATHWAYT'S COMET,Announcement Card number: 1,http://tamkin2.eps.harvard.edu/IAUCs/HAC0001.jpg,Harvard College Observatory Announcement Cards,Harvard College Observatory,Astronomy--Observations; Astronomy--Research; ...,History of Science; History of Astronomy,http://tamkin2.eps.harvard.edu/services/HACs.html


In [30]:
g_datafile_metadata[first_series] = curate.create_datafile_metadata(first_series_inventory_df,
                                                                    g_datafile_description_template_csv,
                                                                    g_datafile_description_template_txt,
                                                                    g_datafile_description_template_xml)

In [31]:
g_datafile_metadata[first_series]

Unnamed: 0,filename,file_type,description,mimetype,tags
0,HCOAnnouncement0001_0001.innodata.xml,xml,File contains xml tree with OCR bounding box d...,application/xml,[Data]
1,HCOAnnouncement0001_0001.innodata.jpg,jpg,File contains OCR text with data from: Announ...,image/jpeg,[Data]
2,HCOAnnouncement0001_0001.innodata.txt,txt,File contains OCR text with data from: Announ...,text/plain,[Data]
3,HCOAnnouncement0001_0001_a.innodata.csv,csv,File contains OCR text with data from: Announ...,text/csv,"[Data, BLATHWAYT’S COMET]"
4,HCOAnnouncement0001_0001_b.innodata.csv,csv,File contains OCR text with data from: Announ...,text/csv,"[Data, BLATHWAYT’S COMET]"


### Create datasets 

In [32]:
dataset = dataverse.create_dataset()
print(dataset)

metadatablocks: {}



In [18]:
dataset.astrophysics.info()

In [19]:
dataset.astrophysics.astro_object = ['blathwayts comet']

In [20]:
dataset.citation.info()

In [41]:
#create dataset function
def create_dataset(api, dataset_metadata):
    """
    Create a dataverse dataset using easyDataverse.
    Note that metadata fields are hardcoded to reflect dataset's requirements. 

    Parameters
    ----------
    api : easyDataverse initialized dataverse 
    dataset_metadata : dict
        Dictionary of dataset metadata values

    Return
    ------
    easyDataverse dataset model populated with the metadata values
    (not yet uploaded)

    Raises
    ------
    ValueError
        If api or dataset_metadata is missing

    """
    # validate parameters
    if api is None or not dataset_metadata:
        raise ValueError('api and dataset_metadata are both required')

    # create the easyDataverse dataset model from the connection passed in
    # (previously this used the global `dataverse` and ignored `api`)
    ds = api.create_dataset()
    # populate the dataset model with metadata values
    ds.citation.title = dataset_metadata.get('title')

    for authors in dataset_metadata.get('author'):
        ds.citation.add_author(name = authors['authorName'],
                              affiliation = authors['authorAffiliation'])

    for desc in dataset_metadata.get('description'):
        ds.citation.add_ds_description(value=desc['dsDescriptionValue'])
    
    for contact in dataset_metadata.get('contact'):
        ds.citation.add_dataset_contact(name = contact['datasetContactName'],
                                        email = contact['datasetContactEmail'])

    ds.citation.subject = dataset_metadata.get('subject')

    for keyword in dataset_metadata.get('keywords'):
        ds.citation.add_keyword(value = keyword['keywordValue'],
                                vocabulary = keyword['keywordVocabulary'],
                                vocabulary_uri = keyword['keywordVocabularyURI'])

    for topic in dataset_metadata.get('topic_classification'):
        ds.citation.add_topic_classification(value = topic['topicClassValue'])

    ds.citation.data_sources = dataset_metadata.get('data_source')
    ds.citation.distribution_date = dataset_metadata.get('creation_date')
    ds.astrophysics.astro_object = dataset_metadata.get('astroObject')
    ds.astrophysics.astro_facility = dataset_metadata.get('astroFacility')


    #dict = rich.print(ds.dataverse_dict())
    return ds
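
`create_dataset` reads a fixed set of keys and iterates over several of them, so a missing key surfaces as a `TypeError` partway through. A hypothetical pre-check (the helper name is an assumption; the key list is taken from the function above) can report the gaps up front:

```python
# Keys read by create_dataset above
REQUIRED_KEYS = [
    "title", "author", "description", "contact", "subject", "keywords",
    "topic_classification", "data_source", "creation_date",
    "astroObject", "astroFacility",
]


def missing_metadata_keys(dataset_metadata, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in dataset_metadata."""
    return [key for key in required if not dataset_metadata.get(key)]


# Made-up partial metadata for illustration
example_metadata = {
    "title": "Announcement Card number: 1",
    "astroObject": ["BLATHWAYT'S COMET"],
}
print(missing_metadata_keys(example_metadata))
```

An empty list means every key the function reads is present and non-empty.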


In [33]:
dataverse = Dataverse(
    server_url = g_dataverse_installation_url,
    api_token = g_dataverse_api_key
)

In [66]:
dataset = create_dataset(dataverse, first_series_metadata)

In [67]:
rich.print(dataset.dataverse_dict())

In [68]:
dataset.add_file(
    local_path = g_datafiles_path + "/HCOAnnouncement0001_0001.innodata.txt", # Path to the file on your system
    description = g_datafile_description_template_txt + "Announcement Card number: 1",
    categories = ["Data"]
)

In [69]:
dataset.add_file(
    local_path = g_datafiles_path + "/HCOAnnouncement0001_0001.innodata.xml", # Path to the file on your system
    description = g_datafile_description_template_xml + "Announcement Card number: 1",
    categories = ["Data"]
)

In [70]:
dataset.add_file(
    local_path = g_datafiles_path + "/HCOAnnouncement0001_0001_a.innodata.csv", # Path to the file on your system
    description = g_datafile_description_template_csv + "Announcement Card number: 1",
    categories = ["Data", "Blathwayt's Comet"]
)

In [71]:
dataset.add_file(
    local_path = g_datafiles_path + "/HCOAnnouncement0001_0001_b.innodata.csv", # Path to the file on your system
    description = g_datafile_description_template_csv + "Announcement Card number: 1",
    categories = ["Data", "Blathwayt's Comet"]
)

In [72]:
dataset.add_file(
    local_path = g_datafiles_path + "/HCOAnnouncement0001_0001.innodata.jpg", # Path to the file on your system
    description = g_datafile_description_template_jpg + "Announcement Card number: 1",
    categories = ["Data"]
)

In [73]:
#upload everything
pid = dataset.upload(dataverse_name= g_dataverse_collection, n_parallel=4)

Dataset with pid 'doi:10.70122/FK2/XCMDK2' created.

In [None]:
dataset.add_directory(
    dirpath= g_datafiles_path + "/data_files/test",
    ignores=[
        r"^\..*",        # Ignore hidden files and dirs (raw string avoids an invalid escape)
    ]
)
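
The pattern `^\..*` matches any name that begins with a dot. How `add_directory` applies its `ignores` patterns (against file names or full paths, with `match` or `search`) is not shown here, so this sketch only checks the pattern itself against a few made-up names using `re.match`:

```python
import re

hidden = re.compile(r"^\..*")  # raw string so the backslash reaches the regex engine

names = [".DS_Store", ".ipynb_checkpoints", "HCOAnnouncement0001_0001.innodata.txt"]
ignored = [name for name in names if hidden.match(name)]

print(ignored)  # ['.DS_Store', '.ipynb_checkpoints']
```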

In [163]:
import importlib
importlib.reload(curate)

<module 'curate' from '/Users/katherinemika/Desktop/curation/historic_datasets/hco/curate.py'>

In [24]:
# single fields, with validation (still having trouble with astroType)
from pydantic import ValidationError

try:
    dataset.astrophysics.astroType = "Observation"
except ValidationError as e:
    rich.print(e)