# DataCite Dataset Metadata Analysis Notebook
*Created by Isaac Wink, University of Kentucky and available for reuse under a CC-BY license.*


### Before You Begin

In order to be able to edit this notebook and run the code, you will first need to make your own copy. To do so, click **File**, then select **Save a copy in Drive**. You will be able to edit and run code in your copied version of this file.

#### Functionality
This notebook is intended to aid librarians and other stakeholders in understanding where researchers from their institutions are sharing data and the completeness of accompanying metadata. It does so through an API call to DataCite, one of the major minters of DOIs for datasets.

Interested parties can also see an overview of datasets and other research outputs from their institution by searching for the institution on [DataCite Commons](https://commons.datacite.org/). However, DataCite Commons only lists datasets whose metadata include the institution's matching ROR ID, so datasets are frequently missed. By contrast, this notebook searches for the text of the institution name in the ```creators.affiliation.name``` fields and can therefore identify datasets that a search by ROR ID alone would miss.
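
The affiliation-name search can be sketched as a small standalone helper. This is a minimal illustration rather than the notebook's exact code: `build_query_url` is a hypothetical name, and it uses the standard library's `urllib.parse.quote` for the percent-encoding. The endpoint and query parameters mirror the ones used in the first code cell.

```python
from urllib.parse import quote

API_BASE = "https://api.datacite.org/dois"


def build_query_url(institution_name: str) -> str:
    """Build a DataCite query URL that searches creators.affiliation.name.

    Hypothetical helper, for illustration only.
    """
    # Collapse repeated whitespace, then wrap the name in double quotes,
    # as the notebook does.
    phrase = '"' + " ".join(institution_name.split()) + '"'
    # quote() percent-encodes the double quotes (%22) and spaces (%20);
    # the trailing * is appended unencoded, matching the notebook's URL.
    encoded = quote(phrase, safe="") + "*"
    return (
        f"{API_BASE}?query=creators.affiliation.name:{encoded}"
        "&affiliation=true&resource-type-id=Dataset"
    )


print(build_query_url("University of Kentucky"))
```

Pasting the printed URL into a browser is a quick way to check whether a given spelling of the institution name returns any records.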

In [None]:
#@title Enter Institution Name
#@markdown Enter the name of the institution whose datasets you would like to access. (Note that this field is case sensitive. "University of Kentucky" will return different results than "University Of Kentucky".)
institution_name = "University of Kentucky" #@param [] {allow-input: true}

# Convert institution name to fit the URL:
converted_institution_name = "%22" + "%20".join(institution_name.split()) + "%22*"

### Library Imports ###
import os
import shutil
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from google.colab import files

# Create directories for outputs:
if not os.path.exists("DOI_Metadata"):
    os.makedirs("DOI_Metadata/data")
    os.makedirs("DOI_Metadata/figures")

### Custom Functions ###
def has_metadata(record, key):
    # Check that a given piece of metadata is present and non-empty
    # (a missing key, None, or "" all count as absent)
    try:
        return record[key] not in (None, "")
    except (KeyError, TypeError):
        return False

def has_complete_creator(creators):
    #Return True if at least one creator has a name and identifier
    complete = False
    for creator in creators:
        if len(creator["nameIdentifiers"]) != 0:
            if has_metadata(creator, "name") and has_metadata(creator["nameIdentifiers"][0], "nameIdentifier"):
                complete = True
                break
    return complete

def has_complete_affiliation(creators):
    #Return True if at least one creator's affiliation includes a name and identifier
    complete = False
    for creator in creators:
        for affiliate in creator["affiliation"]:
            if has_metadata(affiliate, "name") and has_metadata(affiliate, "affiliationIdentifier"):
                complete = True
                break
    return complete

def has_complete_funder(funding_ref):
    #Return True if at least one funder is listed with a name, funder identifier, and award number
    complete = False
    for funder in funding_ref:
        if has_metadata(funder, "funderName") and has_metadata(funder, "funderIdentifier") and has_metadata(funder, "awardNumber"):
            complete = True
            break
    return complete


# Build the API URL:
URL = "https://api.datacite.org/dois?query=creators.affiliation.name:" + converted_institution_name + "&affiliation=true&resource-type-id=Dataset"
print(f"Accessing metadata for datasets using {URL}.\n\nIf that URL leads to a blank page, you will need to adjust your institution name.\n")

# Pulling data from the first page that the API call returns:
start_response = requests.get(URL)
start_json_data = start_response.json()


# Pull relevant metadata from each DOI and create separate dataframes for dataset metadata and metadata completeness:
done = False
metadata_dict = {"title" : [], "localResearcher" : [], "other_localResearcher": [], "affiliation" : [], "resourceType" : [], "resourceTypeGeneral" : [], "doi" : [], "repository" : [], "publicationYear" : [], "rights" : [], "viewCount" : [], "downloadCount" : [], "referenceCount" : []}
completeness_dict = {"doi" : [], "publicationYear": [], "completeCreator" : [], "completeAffiliation" : [], "completeFunder" : [], "relatedIDs" : [], "repository" : []}
page_count = 1
response = start_response
json_data = start_json_data

while not done:
    for record in json_data["data"]:
        record_created = False
        addl_local_researcher = []
        for creator in record["attributes"]["creators"]:
            for affiliation in creator["affiliation"]:
                #For records with multiple creators from the target institution, only create one row, but list other creators from the same institution
                if not record_created:
                    if affiliation["name"] == institution_name:
                        try:
                            metadata_dict["title"].append(record["attributes"]["titles"][0]["title"])
                        except:
                            metadata_dict["title"].append(None)
                        try:
                            metadata_dict["localResearcher"].append(creator["name"])
                        except:
                            metadata_dict["localResearcher"].append(None)
                        try:
                            metadata_dict["affiliation"].append(affiliation["name"])
                        except:
                            metadata_dict["affiliation"].append(None)
                        try:
                            metadata_dict["resourceType"].append(record["attributes"]["types"]["resourceType"])
                        except:
                            metadata_dict["resourceType"].append(None)
                        try:
                            metadata_dict["resourceTypeGeneral"].append(record["attributes"]["types"]["resourceTypeGeneral"])
                        except:
                            metadata_dict["resourceTypeGeneral"].append(None)
                        try:
                            metadata_dict["doi"].append(record["id"])
                            completeness_dict["doi"].append(record["id"])
                        except:
                            metadata_dict["doi"].append(None)
                            completeness_dict["doi"].append(None)
                        try:
                            metadata_dict["repository"].append(record["attributes"]["publisher"])
                            completeness_dict["repository"].append(record["attributes"]["publisher"])
                        except:
                            metadata_dict["repository"].append(None)
                            completeness_dict["repository"].append(None)
                        try:
                            metadata_dict["publicationYear"].append(record["attributes"]["publicationYear"])
                            completeness_dict["publicationYear"].append(record["attributes"]["publicationYear"])
                        except:
                            metadata_dict["publicationYear"].append(None)
                            completeness_dict["publicationYear"].append(None)
                        try:
                            metadata_dict["rights"].append(record["attributes"]["rightsList"][0]["rights"])
                        except:
                            metadata_dict["rights"].append(None)
                        try:
                            metadata_dict["viewCount"].append(record["attributes"]["viewCount"])
                        except:
                            metadata_dict["viewCount"].append(None)
                        try:
                            metadata_dict["downloadCount"].append(record["attributes"]["downloadCount"])
                        except:
                            metadata_dict["downloadCount"].append(None)
                        try:
                            metadata_dict["referenceCount"].append(record["attributes"]["referenceCount"])
                        except:
                            metadata_dict["referenceCount"].append(None)
                        # Metadata completeness-specific attributes:
                        if has_complete_creator(record["attributes"]["creators"]):
                            completeness_dict["completeCreator"].append(True)
                        else:
                            completeness_dict["completeCreator"].append(False)
                        if has_complete_affiliation(record["attributes"]["creators"]):
                            completeness_dict["completeAffiliation"].append(True)
                        else:
                            completeness_dict["completeAffiliation"].append(False)
                        if has_complete_funder(record["attributes"]["fundingReferences"]):
                            completeness_dict["completeFunder"].append(True)
                        else:
                            completeness_dict["completeFunder"].append(False)
                        try:
                            completeness_dict["relatedIDs"].append(record["attributes"]["relatedIdentifiers"])
                        except:
                            completeness_dict["relatedIDs"].append(None)

                        record_created = True
                else:
                    if affiliation["name"] == institution_name:
                        try:
                            addl_local_researcher.append(creator["name"])
                        except:
                            pass
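        # Only add the list when a row was created for this record, so every column stays the same length:
        if record_created:
            metadata_dict["other_localResearcher"].append(addl_local_researcher)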

    print(f'Done with page {page_count}')
    try:
        next_key = json_data["links"]["next"]
        response = requests.get(next_key)
        json_data = response.json()
        page_count += 1
    except:
        print("Done accessing DataCite.")
        done = True
    if page_count >= 50:
        print(f"Forced stoppage after {page_count} pages. Done accessing DataCite.")
        break

metadata_df = pd.DataFrame(metadata_dict)
completeness_df = pd.DataFrame(completeness_dict)

print("\nMatching datasets have been found in the following repositories:\n")
for repository in metadata_df["repository"].unique():
    print(repository)
print("\nIf any of these look like duplicates, use the following boxes to replace them with a common name.")
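
The paging loop above follows the `links.next` URL in each response until none remains or a page cap is reached. Here is a minimal sketch of that pattern in isolation, assuming each page is a dict shaped like the DataCite response (`data` plus an optional `links.next`). `iter_pages` and the injected `fetch` callable are hypothetical names, and the fake pages exist only so the example runs without network access.

```python
from typing import Callable, Dict, Iterator


def iter_pages(first_url: str,
               fetch: Callable[[str], Dict],
               max_pages: int = 50) -> Iterator[Dict]:
    """Yield successive pages, following links.next for at most max_pages pages."""
    url = first_url
    for _ in range(max_pages):
        page = fetch(url)
        yield page
        # The last page has no "next" link, so stop there.
        url = page.get("links", {}).get("next")
        if not url:
            break


# Stand-in for requests.get(url).json(), so the sketch runs offline.
FAKE_PAGES = {
    "page1": {"data": [{"id": "10.1/a"}], "links": {"next": "page2"}},
    "page2": {"data": [{"id": "10.1/b"}], "links": {}},
}

dois = [record["id"]
        for page in iter_pages("page1", FAKE_PAGES.__getitem__)
        for record in page["data"]]
print(dois)
```

With `requests`, the `fetch` argument could be `lambda url: requests.get(url).json()`. Passing it in as a parameter keeps the stopping logic testable on its own.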

In [None]:
#@title Data Cleaning
#@markdown <h3>Adjust Repository Names</h3>
#@markdown Referencing the list of repositories above, if your dataset contains the same repository listed under multiple names, enter them in the box below in the following format: {"Name to Keep" : ["Name to Replace 1", "Name to Replace 2"]}
names_to_replace = {"ICPSR": ["ICPSR - Interuniversity Consortium for Political and Social Research"], "UKnowledge" : ["University of Kentucky", "University of Kentucky Libraries"]} #@param {type:"raw"}

#@markdown <h3>Set Year Range</h3>
#@markdown Limit the dataset to only include entries made within a certain time range (inclusive).
bottom_limit = 2014 # @param {type:"integer"}
top_limit = 2023 # @param {type:"integer"}

#@markdown <h3>Drop Infrequently Used Repositories</h3>
#@markdown To produce cleaner visualizations, you may wish to remove infrequently-used repositories.
#@markdown Enter a threshold value in the box below. Any repositories that contain fewer datasets from your chosen institution than that threshold will be dropped before generating visualizations:
threshold_value = 10 # @param {type:"integer"}

# Replace repository names:
metadata_df_cleaned = metadata_df.copy()
completeness_df_cleaned = completeness_df.copy()

for key in names_to_replace.keys():
  metadata_df_cleaned["repository"] = metadata_df_cleaned["repository"].replace(names_to_replace[key], key)
  completeness_df_cleaned["repository"] = completeness_df_cleaned["repository"].replace(names_to_replace[key], key)

# Limit to selected year range:
keep = list(range(bottom_limit, top_limit + 1))
metadata_df_cleaned = metadata_df_cleaned[metadata_df_cleaned["publicationYear"].isin(keep)]
completeness_df_cleaned = completeness_df_cleaned[completeness_df_cleaned["publicationYear"].isin(keep)]

# Drop infrequently-used repositories:
repo_counts = metadata_df_cleaned["repository"].value_counts()
keep = repo_counts[repo_counts >= threshold_value].index
metadata_df_cleaned = metadata_df_cleaned[metadata_df_cleaned["repository"].isin(keep)]
completeness_df_cleaned = completeness_df_cleaned[completeness_df_cleaned["repository"].isin(keep)]
print(f"Removing entries in repositories used fewer than {threshold_value} times and outside the {bottom_limit}-{top_limit} range reduced the dataset from {metadata_df.shape[0]} rows to {metadata_df_cleaned.shape[0]} rows.")
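
The repository-name and threshold steps above can also be written without pandas, which may make the `names_to_replace` format easier to follow. This is a minimal sketch that assumes a plain list of repository names and leaves out the year filter. `invert_replacements` and `clean_repositories` are hypothetical helpers, and the sample list is made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def invert_replacements(names_to_replace: Dict[str, List[str]]) -> Dict[str, str]:
    """Turn {"Keep": ["Old 1", "Old 2"]} into {"Old 1": "Keep", "Old 2": "Keep"}."""
    return {old: keep
            for keep, olds in names_to_replace.items()
            for old in olds}


def clean_repositories(repositories: List[str],
                       names_to_replace: Dict[str, List[str]],
                       threshold: int) -> List[str]:
    """Normalize repository names, then drop names used fewer than threshold times."""
    lookup = invert_replacements(names_to_replace)
    normalized = [lookup.get(name, name) for name in repositories]
    counts = Counter(normalized)
    return [name for name in normalized if counts[name] >= threshold]


sample = [
    "Dryad", "Dryad", "ICPSR",
    "ICPSR - Interuniversity Consortium for Political and Social Research",
    "Zenodo",
]
replacements = {
    "ICPSR": ["ICPSR - Interuniversity Consortium for Political and Social Research"],
}
print(clean_repositories(sample, replacements, threshold=2))
```

Because names are normalized before counting, the two ICPSR spellings are pooled and together reach the threshold, while the single Zenodo entry is dropped.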


In [None]:
#@title Repository Usage Visualizations
#@markdown Run this cell to produce some charts based on your cleaned dataset.

# Create subplots with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot the raw number of datasets in each repository by year:
repo_count = metadata_df_cleaned.groupby(["publicationYear", "repository"]).size().unstack()
repo_count.plot(kind="line", marker="o", ax=axes[0])
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Dataset Publications")
axes[0].grid(axis="y")
axes[0].set_title(f"Number of {institution_name}-Affiliated Dataset Deposits \nby Repository and Year")
axes[0].legend(loc="best")

# Plot the number of researchers depositing datasets by repository and by year:
researcher_count = metadata_df_cleaned.groupby(["publicationYear", "repository"])['localResearcher'].nunique().unstack()
researcher_count.plot(kind="line", marker="o", ax=axes[1])
axes[1].set_xlabel("Year")
axes[1].set_ylabel("Count of Researchers")
axes[1].grid(axis="y")
axes[1].set_title(f"Number of {institution_name}-Affiliated Researchers \nDepositing Datasets by Repository and Year")
axes[1].legend(loc="best")

plt.tight_layout()
plt.savefig("DOI_Metadata/figures/RepositoryUsageStats.png")
plt.show()
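
To make clear what the two charts count, here is a minimal standard-library sketch of the same two aggregations on a handful of made-up rows. The rows and names are invented for illustration. In the notebook itself, these counts come from `groupby(...).size()` and `groupby(...)["localResearcher"].nunique()`.

```python
from collections import Counter, defaultdict

# Made-up rows: (publicationYear, repository, localResearcher)
rows = [
    (2021, "Dryad", "Doe, Jane"),
    (2021, "Dryad", "Doe, Jane"),
    (2021, "Dryad", "Roe, Richard"),
    (2022, "Zenodo", "Doe, Jane"),
]

# Left-hand chart: number of dataset records per (year, repository).
dataset_counts = Counter((year, repo) for year, repo, _ in rows)

# Right-hand chart: number of distinct researchers per (year, repository).
researchers = defaultdict(set)
for year, repo, name in rows:
    researchers[(year, repo)].add(name)
researcher_counts = {key: len(names) for key, names in researchers.items()}

print(dataset_counts[(2021, "Dryad")], researcher_counts[(2021, "Dryad")])
```

Because `localResearcher` holds only the first matching creator on each record, the researcher chart counts first-listed local researchers. Additional local creators are recorded in `other_localResearcher` but are not counted.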

In [None]:
#@title Metadata Completeness Visualizations
#@markdown Check for metadata completeness based on [OSTP recommendations](https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf).\
#@markdown Credit goes to Alicia Mohr's [RADS Metadata Analysis](https://ajhmohr.github.io/rads_metadata/#DataCite_Affiliation_data) for translation of DOI metadata to OSTP metadata categories.
#@markdown - DOI (`doi`)
#@markdown - Creators:
#@markdown    - Resource Author (`creators`)
#@markdown    - Resource Author Identifier (`nameIdentifier`)
#@markdown - Publication year (`publicationYear`)
#@markdown - Affiliation:
#@markdown     - Resource Author Affiliation (`affiliation`)
#@markdown     - ROR (`affiliationIdentifier` under `affiliation`)
#@markdown - Related Identifiers (`relatedIdentifiers`)
#@markdown - Funding References:
#@markdown    - Project Funder (`funderName` in `fundingReferences`)
#@markdown    - Funder Identifier (`funderIdentifier` in `fundingReferences`)
#@markdown    - Funder Project Identifier (`awardNumber` in `fundingReferences`)\
#@markdown
#@markdown Note that these visualizations just check for the <b>presence</b> of these pieces of metadata, not their values:

# New dataframe grouping datasets by repository:
repos = completeness_df_cleaned["repository"].unique()
repo_completeness = {"repository": [], "Creator": [], "Affiliation": [], "Funder": []}
for repo in repos:
    repo_completeness["repository"].append(repo)
    repo_slice = completeness_df_cleaned[completeness_df_cleaned["repository"] == repo]

    creator_proportion = repo_slice[repo_slice["completeCreator"] == True].shape[0] / repo_slice.shape[0]
    repo_completeness["Creator"].append(creator_proportion)

    affiliation_proportion = repo_slice[repo_slice["completeAffiliation"] == True].shape[0] / repo_slice.shape[0]
    repo_completeness["Affiliation"].append(affiliation_proportion)

    funder_proportion = repo_slice[repo_slice["completeFunder"] == True].shape[0] / repo_slice.shape[0]
    repo_completeness["Funder"].append(funder_proportion)

repo_completeness_df = pd.DataFrame(repo_completeness)

# Set index to repository and sort
repo_completeness_df.set_index("repository", inplace=True)
repo_completeness_df.sort_values(by="Creator", ascending=False, inplace=True)

# Make plot:
ax = repo_completeness_df.plot(kind="bar", figsize=(10, 6))
plt.xlabel("Repository")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
plt.ylabel("Proportion of Datasets with Complete Metadata")
plt.title(f"Proportion of {institution_name} Datasets with Complete Metadata by Repository")
#plt.legend(loc='best')
ax.legend(title="Completeness of:", loc="center left", bbox_to_anchor=(1, 0.5))

plt.savefig("DOI_Metadata/figures/MetadataCompleteness.png")
plt.show()
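
To make the presence checks concrete, here is a minimal sketch that applies the same three rules to a single hand-written record. The record is invented for illustration (the names and identifiers are placeholders, not real metadata), and `present` and `completeness_flags` are hypothetical helpers rather than part of the notebook. One small difference from the notebook's functions: any name identifier on a creator counts here, not only the first.

```python
from typing import Any, Dict


def present(mapping: Dict[str, Any], key: str) -> bool:
    """True if the key exists and its value is neither None nor an empty string."""
    return mapping.get(key) not in (None, "")


def completeness_flags(attributes: Dict[str, Any]) -> Dict[str, bool]:
    """Apply the creator, affiliation, and funder presence checks to one record."""
    creators = attributes.get("creators", [])
    funders = attributes.get("fundingReferences", [])
    return {
        # At least one creator with a name and a name identifier.
        "completeCreator": any(
            present(c, "name")
            and any(present(i, "nameIdentifier")
                    for i in c.get("nameIdentifiers", []))
            for c in creators
        ),
        # At least one affiliation with both a name and an identifier.
        "completeAffiliation": any(
            present(a, "name") and present(a, "affiliationIdentifier")
            for c in creators
            for a in c.get("affiliation", [])
        ),
        # At least one funder with a name, identifier, and award number.
        "completeFunder": any(
            present(f, "funderName")
            and present(f, "funderIdentifier")
            and present(f, "awardNumber")
            for f in funders
        ),
    }


sample = {
    "creators": [{
        "name": "Doe, Jane",
        "nameIdentifiers": [{"nameIdentifier": "https://orcid.org/0000-0000-0000-0000"}],
        "affiliation": [{"name": "Example University"}],  # no affiliationIdentifier
    }],
    "fundingReferences": [{"funderName": "Example Funder", "awardNumber": "ABC-123"}],
}
flags = completeness_flags(sample)
print(flags)
```

In this sample the creator check passes, while the affiliation and funder checks fail because each is missing its identifier. A dataset only counts as complete in a category when every piece is present.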

In [None]:
#@title Bonus: Dryad Dataset Usage Statistics
#@markdown The Dryad repository updates dataset DOIs with metadata about the number of times that datasets are viewed, downloaded, and cited. If your results included datasets in Dryad, run this cell for visualizations of these usage statistics.

dryad_metrics_df = metadata_df[metadata_df["repository"] == "Dryad"]
dryad_downloads = dryad_metrics_df["downloadCount"].sort_values(ascending=False)
dryad_views = dryad_metrics_df["viewCount"].sort_values(ascending=False)
dryad_references = dryad_metrics_df["referenceCount"].sort_values(ascending=False)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5), sharex=True)

fig.suptitle(f"Metrics for {institution_name} Datasets in Dryad", fontsize=16)

dryad_views.plot(kind="bar", ax=ax1)
ax1.set_xticks([])
ax1.set_ylabel("View Count")
ax1.set_title(f"Views")

dryad_downloads.plot(kind="bar", ax=ax2)
ax2.set_xticks([])
ax2.set_ylabel("Download Count")
ax2.set_title(f"Downloads")

dryad_references.plot(kind="bar", ax=ax3)
ax3.set_xticks([])
ax3.set_ylabel("Reference Count")
ax3.set_yticks([0, 1, 2])
ax3.set_title(f"References")

plt.savefig("DOI_Metadata/figures/DryadUsageStats.png")
plt.show()

In [None]:
#@title Download Dataset and Visualizations
#@markdown Run this cell to download a zip file containing your raw and cleaned datasets along with the figures produced above.

metadata_df.to_csv("DOI_Metadata/data/DOI_Metadata_raw.csv", index=False)
metadata_df_cleaned.to_csv("DOI_Metadata/data/DOI_Metadata_cleaned.csv", index=False)
completeness_df_cleaned.to_csv("DOI_Metadata/data/Metadata_Completeness_cleaned.csv", index=False)
shutil.make_archive("DOI_Metadata", "zip", "DOI_Metadata")
files.download("DOI_Metadata.zip")
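
`files.download` is only available inside Google Colab. For readers running the code elsewhere, the sketch below shows one way the archive step could be separated from the download step. `archive_outputs` is a hypothetical helper, and the example writes to a temporary directory so nothing real is overwritten.

```python
import shutil
import tempfile
from pathlib import Path


def archive_outputs(output_dir: str, archive_base: str) -> str:
    """Zip everything under output_dir and return the archive's path."""
    return shutil.make_archive(archive_base, "zip", output_dir)


# Demonstrate on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "DOI_Metadata"
    (out / "data").mkdir(parents=True)
    (out / "data" / "example.csv").write_text("doi\n")
    zip_path = archive_outputs(str(out), str(Path(tmp) / "DOI_Metadata"))
    zip_name = Path(zip_path).name
    zip_exists = Path(zip_path).exists()

# Detect whether the Colab-only download helper is available.
try:
    from google.colab import files  # noqa: F401
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

print(zip_name, zip_exists, IN_COLAB)
```

Inside Colab, `files.download(zip_path)` would then trigger the browser download. Elsewhere the zip file is simply left on disk at the returned path.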

# Additional Resources

RADS Metadata Analysis: https://ajhmohr.github.io/rads_metadata/

DataCite API Documentation: https://support.datacite.org/docs/api

DataCite Python API: https://datacite.readthedocs.io/en/latest/