# Datasets with license plus terms of use metadata

## Purpose and goals of this notebook

This notebook explores the "Terms" metadata of datasets in Dataverse installations to learn more about the effects of the "multiple-license" feature making its way into the Dataverse software (https://github.com/IQSS/dataverse/pull/7920). When that update is applied to a Dataverse installation, it will change how depositors enter "Terms" metadata (metadata about how the data should or must be used). If depositors choose one of the licenses that installation administrators can configure and display in a dropdown menu, the software hides the fields in the "Terms of Use" panel, such as the Confidentiality Declaration and Special Permissions fields. This design decision was made because those fields may conflict with a predefined license. If the installation allows it, depositors can instead choose "Custom Terms" from the new license dropdown. The fields in the "Terms of Use" panel then appear, and any text entered in those fields is considered "Custom Terms".

For installations that already have published datasets with CC0 waivers (or, for some forked installations, CCBY licenses) plus something entered in any of the Terms of Use fields (such as Confidentiality Declaration and Special Permissions), applying the software update will cause those datasets' Terms to be considered "Custom Terms."

For more information about the update, see the Multiple License Consensus Proposal at https://docs.google.com/document/d/10htygglMdlABYWqtcZpqd8sHOwIe6sLL_UJtTv8NEKw and the following GitHub issues and pull request:

- https://github.com/IQSS/dataverse/issues/7742
- https://github.com/IQSS/dataverse/issues/7440
- https://github.com/IQSS/dataverse/pull/7920

This notebook answers the following questions to get a sense of the scale of the change that the update will bring to existing Dataverse installations:

- Which installations have these types of datasets, that is, datasets whose Terms metadata the updated Dataverse software would consider "Custom Terms" because they have CC0 waivers (or, for some forked installations, CCBY licenses) plus any of the eight "Terms of Use" fields filled?
- And how many of these datasets does each installation have?

We'll also see the number of datasets published by each installation and how many of those datasets have CC0 waivers (or CCBY licenses).

## Methods

The data in terms_metadata.tab was created by:
- Downloading all zip files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ.
- Using a Python script to extract the JSON metadata files in each zip file into a single directory. The JSON metadata files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ were downloaded between August 4 and August 7, 2021. How those JSON files were downloaded from each installation is documented in the dataset's metadata.
- Parsing the terms metadata of every JSON metadata file in that directory into a single CSV file using the [parse_terms_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_terms_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Parsing other basic metadata, namely the value of each JSON file's "publisher" key, which indicates the installation that published the metadata in that JSON file, into a single CSV file using the [parse_basic_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_basic_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Joining both CSV files into a single CSV file that contains the persistent URLs, dataset version IDs, installation names (publishers), and Terms metadata for every published dataset version in 56 of the 73 known Dataverse installations.
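
The final join step can be pictured with a small sketch. The two dataframes below are made-up stand-ins for the CSV files produced by the two parsing scripts, and joining on `persistentUrl` plus `datasetVersionId` is an assumption for illustration, not necessarily the exact procedure used to build terms_metadata.tab.

```python
import pandas as pd

# Hypothetical stand-in for the output of parse_terms_metadata.py
terms = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.1234/AAA', 'https://doi.org/10.1234/BBB'],
    'datasetVersionId': [1, 2],
    'license': ['CC0', 'NONE'],
})

# Hypothetical stand-in for the output of parse_basic_metadata.py
basic = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.1234/AAA', 'https://doi.org/10.1234/BBB'],
    'datasetVersionId': [1, 2],
    'publisher': ['Installation A', 'Installation B'],
})

# Join on the two columns that together identify one dataset version
# (assumed join keys)
joined = basic.merge(terms, on=['persistentUrl', 'datasetVersionId'], how='inner')

print(joined)
```

Each row of the result carries the publisher and the Terms metadata for one dataset version, which is the shape the rest of this notebook works with.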

In [1]:
# Import Python packages
from functools import reduce
import numpy as np
import pandas as pd

In [2]:
# Import data as a dataframe
allVersions = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False)

In [3]:
allVersions.head()

,persistentUrl,datasetVersionId,publisher,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer,termsOfAccess,dataAccessPlace,originalArchive,availabilityStatus,contactForAccess,sizeOfCollection,studyCompletion
0,https://doi.org/10.15454/GHAUFR,256773,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
1,https://doi.org/10.15454/TLXRVW,52619,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
2,https://doi.org/10.15454/CX70I3,170336,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
3,https://doi.org/10.21410/7E4/4WG94W,1200,data.sciencespo,NONE,,,,,,,,,,,,,,,
4,https://doi.org/10.7910/DVN/5PRYPC,198147,Harvard Dataverse,CC0,CC0 Waiver,,,,,,,,,,,,,,


In [4]:
print('Number of total dataset versions: %s' % (len(allVersions)))
print('Number of unique datasets: %s' % (len(allVersions.persistentUrl.unique())))

Number of total dataset versions: 382596
Number of unique datasets: 155719


**About versioning and Terms metadata**

While some datasets contain versions where each version has its own terms metadata (such as a dataset where each version represents a wave of the same longitudinal research study and each wave has its own license), let's assume that for a large majority of datasets, the metadata of the latest versions, including the Terms metadata, have replaced the metadata of all previous versions.

For example, if a dataset has two published versions, the first version has no CC0 waiver, and the second version has a CC0 waiver, the CC0 waiver is what applies to the dataset.

Since versioning in the Dataverse software is designed more as a way to record improvements to a dataset over time than as a way to publish disparate but connected datasets (e.g. waves in a longitudinal study), it seems safe to assume that this is how most depositors use versioning. Additionally, the Dataverse software favors the latest published version of each dataset, tending to expose (e.g. through indexing and metadata distribution) the metadata in the latest published version more than the metadata of previously published versions.

So let's get the metadata for only the latest version of each dataset.

In [5]:
# Get metadata for only the latest version of each dataset, treating the row
# with the largest datasetVersionId in each dataset as the latest version.
# idxmax returns index labels, so select the rows with .loc rather than .iloc
latestversion = (allVersions
                 .loc[allVersions
                    .groupby('persistentUrl')['datasetVersionId']
                    .idxmax()]
                 .sort_values(by=['publisher'], ascending=True)
                 .reset_index(drop=True))
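
To see what this selection does, here is a minimal sketch on a made-up dataframe. It assumes, as the cell above does, that a larger `datasetVersionId` means a later version; the DOIs and values are hypothetical.

```python
import pandas as pd

# Toy data: dataset AAA has two versions, dataset BBB has one
toy = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.1234/AAA',
                      'https://doi.org/10.1234/AAA',
                      'https://doi.org/10.1234/BBB'],
    'datasetVersionId': [10, 25, 7],
    'license': ['NONE', 'CC0', 'NONE'],
})

# For each dataset, find the index label of the row with the largest
# datasetVersionId, then keep only those rows
latest_labels = toy.groupby('persistentUrl')['datasetVersionId'].idxmax()
toy_latest = toy.loc[latest_labels].reset_index(drop=True)

print(toy_latest)
```

Dataset AAA keeps only its second version, so the CC0 value from that version is what gets counted, matching the example in the text above.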

In [6]:
# Replace any blank values with NaN, making it easier to count and sort later
latestversion = latestversion.replace(r'^\s*$', np.nan, regex=True)

# Dartmouth's repository exports JSON files with "Root" in the publisher key. Replace "Root" with "Dartmouth"
latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')
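
As a minimal sketch of what the blank-to-NaN replacement does, the made-up series below runs the same regular expression on a few values. Note that the string `NONE` is left untouched, which is why later cells filter on it explicitly.

```python
import numpy as np
import pandas as pd

# Toy column: an empty string, a whitespace-only string and two real values
toy_field = pd.Series(['', '   ', 'CC0 Waiver', 'NONE'])

# Same pattern as above: match strings that are empty or only whitespace
cleaned = toy_field.replace(r'^\s*$', np.nan, regex=True)

# Count how many values are now treated as missing
print(cleaned.isna().sum())
```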


In [7]:
print('Number of unique datasets: %s' % (len(latestversion)))
print('Number of installations: %s' % (len(latestversion.publisher.unique())))

Number of unique datasets: 155719
Number of installations: 56


We'll be exploring the Terms metadata of 155,719 datasets in 56 Dataverse installations. Let's get a count of the datasets in each installation.

In [8]:
countlatestversionByInstallation = latestversion.value_counts(subset=['publisher']).to_frame('count of datasets')
countlatestversionByInstallation.head()

publisher,count of datasets
Portail Data INRAE,72010
Harvard Dataverse,44057
openforestdata.pl,10116
UNC Dataverse,4537
Scholars Portal Dataverse,4096


Now we want to know: how many datasets in each repository have a license (their license field is neither null nor NONE) plus one or more values in the Terms of Use fields?

In [9]:
# Retain only publisher, dataset PID, license and Terms of Use columns
latestversion_termsofuse = latestversion[[
    'publisher', 'persistentUrl', 'license', 'termsOfUse', 'confidentialityDeclaration', 'specialPermissions',
    'restrictions', 'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer'
]]

In [10]:
# List the unique licenses found in all repositories' license fields
latestversion_termsofuse.license.unique()

array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)

In [11]:
# Create dataframe containing only datasets with a CC0 or CC BY license in their license fields
latestversionCC = (latestversion_termsofuse
    .query('license == "CC0" or license == "CCBY" or license == "CC BY"')
    .reset_index(drop = True, inplace = False)
)

In [12]:
latestversionCC.head()

,publisher,persistentUrl,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer
0,ADA Dataverse,http://dx.doi.org/10.4225/87/B5AJXD,CC0,CC0 Waiver,,,,,,,
1,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/UKDZ2U,CC0,CC0 Waiver,,,,,,,
2,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/QXXWGP,CC0,CC0 Waiver,,,,,,,
3,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/TCJR5Z,CC0,CC0 Waiver,,,,,,,
4,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/VPO9LD,CC0,CC0 Waiver,,,,,,,


In [13]:
print('There are %s datasets with something in their license fields.' % (len(latestversionCC)))

There are 40976 datasets with something in their license fields.


How many of those are in each Dataverse installation?

In [14]:
countlatestversionCCByInstallation = latestversionCC.value_counts(subset=['publisher']).to_frame('count of datasets with license')

countlatestversionCCByInstallation.head()

publisher,count of datasets with license
Harvard Dataverse,29323
RIN Dataverse,3936
Scholars Portal Dataverse,2329
DataverseNO,1060
DataverseNL,764


Among the datasets with a license, how many also have values in one or more Terms of Use fields? How many are in each of the 56 Dataverse installations?

In [15]:
# Create dataframe with only datasets with a license and one or more Terms of Use fields filled in
latestversionCCPlusCustomTerms = latestversionCC.query(
    'confidentialityDeclaration.notnull() or\
    specialPermissions.notnull() or\
    restrictions.notnull() or\
    citationRequirements.notnull() or\
    depositorRequirements.notnull() or\
    conditions.notnull() or\
    disclaimer.notnull()'
)

print('Number of datasets with a license plus "custom Terms of Use" metadata: %s' % (len(latestversionCCPlusCustomTerms)))

Number of datasets with a license plus "custom Terms of Use" metadata: 621
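
An equivalent way to express "at least one of these fields is filled" is a row-wise `any` over `notna()`. The sketch below is an alternative phrasing of the query above, not the notebook's own method, and it uses a made-up dataframe with only two of the Terms of Use columns.

```python
import numpy as np
import pandas as pd

# Toy data: only dataset B has something in a Terms of Use field
toy = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:C'],
    'license': ['CC0', 'CC0', 'CC BY'],
    'restrictions': [np.nan, 'No commercial use', np.nan],
    'disclaimer': [np.nan, np.nan, np.nan],
})

# Columns to check (a shortened, illustrative list)
tou_fields = ['restrictions', 'disclaimer']

# True for rows where at least one of the listed fields is not missing
has_custom_terms = toy[tou_fields].notna().any(axis=1)
toy_custom = toy[has_custom_terms]

print(len(toy_custom))
```

Keeping the field names in a list makes it easy to see which columns are checked. The `termsOfUse` column is left out of the check in the query above; in the sample rows shown earlier it holds the license text itself (for example "CC0 Waiver").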


In [16]:
# Create a dataframe showing the count of these datasets in each installation
latestversionCCPlusCustomTermsByInstallation = (latestversionCCPlusCustomTerms
    .value_counts(subset=['publisher'])
    .to_frame('count of datasets with license and ToU'))

latestversionCCPlusCustomTermsByInstallation.head()

publisher,count of datasets with license and ToU
Harvard Dataverse,394
UNC Dataverse,131
Scholars Portal Dataverse,51
Portail Data INRAE,6
Peking University Open Research Data Platform,6


Finally, let's join all of the dataframes we've created so we can see the counts of datasets for each query for each installation.

In [17]:
dataframes = [countlatestversionByInstallation, countlatestversionCCByInstallation, latestversionCCPlusCustomTermsByInstallation]
allCountsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Format dataframe to replace NA values with 0, cast values as integers and sort by greatest number of datasets with a license plus ToU
allCountsByInstallation = (allCountsByInstallation
                           .fillna(0)
                           .astype('int32')
                           .sort_values([
                                'count of datasets with license and ToU',
                                'count of datasets',
                                'count of datasets with license'], ascending=False)
                           )
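
The outer join matters because an installation with no licensed datasets is absent from the later count tables. A minimal sketch with made-up counts shows the missing cell that the join produces and why it is filled with 0 before casting to integers; the installation names and numbers are hypothetical.

```python
from functools import reduce
import pandas as pd

# Toy counts: Installation B has datasets but none with a license
all_counts = pd.DataFrame(
    {'count of datasets': [100, 20]},
    index=pd.Index(['Installation A', 'Installation B'], name='publisher'))
licensed_counts = pd.DataFrame(
    {'count of datasets with license': [40]},
    index=pd.Index(['Installation A'], name='publisher'))

# Outer join keeps every installation that appears in any of the tables
combined = reduce(lambda left, right: left.join(right, how='outer'),
                  [all_counts, licensed_counts])

# Installation B has NaN in the license column at this point;
# fill it with 0 so the column can be cast to integers
combined = combined.fillna(0).astype('int32')

print(combined)
```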

In [30]:
allCountsByInstallation.head()

publisher,count of datasets,count of datasets with license,count of datasets with license and ToU
Harvard Dataverse,44057,29323,394
UNC Dataverse,4537,644,131
Scholars Portal Dataverse,4096,2329,51
Portail Data INRAE,72010,55,6
Peking University Open Research Data Platform,320,314,6


In [29]:
# Export the allCountsByInstallation dataframe to a CSV file
fileName = 'allCountsByInstallation.csv'
allCountsByInstallation.to_csv(fileName, index=True)