# Datasets with license plus terms of use metadata

## Goals of this notebook

This notebook explores the "Terms" metadata of datasets in Dataverse installations, producing a snapshot of the state of the "Terms" metadata (metadata about how the data should or must be used) in 56 known installations as of August 2021. This snapshot may be used to learn more about the effects of the "multiple-license" update as it makes its way to the Dataverse software (https://github.com/IQSS/dataverse/pull/7920) and is applied to Dataverse installations.

The "multiple-license" update will change how depositors enter "Terms" metadata. A goal of this update is to encourage the use and machine-readable application of standard licenses to datasets in Dataverse installations. This should make it easier for other people and systems to determine how the data can and can't be used.

Additionally, with this update, if a dataset's Terms metadata includes values in any of the fields in the software's "Terms of Use" panel, such as Confidentiality Declaration and Special Permissions, the software will consider those terms as "Custom Terms," even when a CC0 waiver or, for some forked Dataverse installations, a CCBY license was also applied.

For more information about the update, see the Multiple License Consensus Proposal ([Google Doc](https://docs.google.com/document/d/10htygglMdlABYWqtcZpqd8sHOwIe6sLL_UJtTv8NEKw)) and the following GitHub issues and pull request:

- https://github.com/IQSS/dataverse/issues/7742
- https://github.com/IQSS/dataverse/issues/7440
- https://github.com/IQSS/dataverse/pull/7920

So the two questions this notebook seeks to answer are:
- How much of the dataset metadata published by each Dataverse repository includes any information about how the data can be used? This snapshot might be the first of efforts to track how Terms metadata changes over time and particularly after each repository applies this and similar updates, as a way to measure the success of these changes and justify them.
- When installations apply these changes, which ones have published datasets whose Terms metadata will be considered "Custom Terms" because they have CC0 waivers (or, for some forked installations, CCBY licenses) plus any of the eight "Terms of Use" fields filled? And how many of these datasets has each installation published? The community might use these numbers to get a sense of the scale of the change for each Dataverse installation, and the numbers might encourage each installation to look into the effects of this change.

## Methods

The Terms metadata published by 56 Dataverse installations (of the known 73) is recorded in the terms_metadata.tab file. That tabular file was created by:
- Downloading all zip files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ. That dataset contains the JSON metadata files collected between August 4 and August 7, 2021 using a Python script. The methods for getting this metadata are described in the dataset's metadata.
- Using another Python script to extract the JSON metadata files in each zip file into a directory.
- Parsing the terms metadata of every JSON metadata file in that directory into a single CSV file using the [parse_terms_metadata.py](https://github.com/jggautier/dataverse-scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_terms_metadata.py) script in the GitHub repository at https://github.com/jggautier/dataverse-scripts.
- Parsing the value of each JSON file's "publisher" key, which indicates the installation that published the metadata in each JSON file, into a single CSV file using the [parse_basic_metadata.py](https://github.com/jggautier/dataverse-scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_basic_metadata.py) script in the GitHub repository at https://github.com/jggautier/dataverse-scripts.
- Joining both CSV files into a single CSV file that contains the persistent URLs, dataset version IDs, installation names ("publishers"), and Terms metadata for every published dataset version, and converting that CSV file into a .tab file (so that it's easier to make this notebook accessible using the Dataverse software's [binder integration](https://guides.dataverse.org/en/5.7/admin/integrations.html?highlight=integrations#binder)).
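The last step in the list above (joining the two CSV files and writing a tab-separated file) could look roughly like the sketch below. It is not the script that was actually used: the toy rows, the made-up DOIs, the "Demo Dataverse" name, and the choice of join keys are all assumptions for illustration.

```python
import io
import pandas as pd

# Toy stand-ins for the two parsed CSV files (hypothetical rows and DOIs)
terms_csv = io.StringIO(
    "datasetVersionId,persistentUrl,license,termsOfUse\n"
    "1,https://doi.org/10.1234/AAA,CC0,CC0 Waiver\n"
    "2,https://doi.org/10.1234/BBB,NONE,\n"
)
basic_csv = io.StringIO(
    "datasetVersionId,persistentUrl,publisher\n"
    "1,https://doi.org/10.1234/AAA,Demo Dataverse\n"
    "2,https://doi.org/10.1234/BBB,Demo Dataverse\n"
)

terms = pd.read_csv(terms_csv, na_filter=False)
basic = pd.read_csv(basic_csv, na_filter=False)

# Join on both columns, since version IDs alone may repeat across installations
joined = terms.merge(basic, on=['datasetVersionId', 'persistentUrl'], how='inner')

# Produce tab-separated text; a real run would pass a file path instead
tab_text = joined.to_csv(sep='\t', index=False)
print(tab_text)
```

Joining on the persistent URL as well as the version ID is a cautious choice here, because each installation assigns its own version IDs.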

This notebook uses the pandas and numpy Python packages to filter, reshape, and provide simple analysis of the data in the terms_metadata.tab file in order to help answer the two questions.

## Exploration

In [1]:
# Import Python packages
from functools import reduce
import numpy as np
import pandas as pd

### Importing and preparing the data

In [2]:
# Import data as a dataframe
allVersions = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False, parse_dates=['datasetPublicationDate', 'versionCreateTime'])

In [4]:
allVersions.head(5)

Unnamed: 0,datasetVersionId,persistentUrl,persistent_id,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,...,availabilityStatus,contactForAccess,sizeOfCollection,studyCompletion,datasetPublicationDate,versionCreateTime,versionState,majorVersionNumber,minorVersionNumber,publisher
0,256773,https://doi.org/10.15454/GHAUFR,doi:10.15454/GHAUFR,NONE,CC BY 2.0,,,,,,...,,,,,2019-02-15,2019-09-18 22:00:00+00:00,RELEASED,3,0,Portail Data INRAE
1,52619,https://doi.org/10.15454/TLXRVW,doi:10.15454/TLXRVW,NONE,CC BY 2.0,,,,,,...,,,,,2019-09-11,2018-09-12 03:19:24+00:00,RELEASED,1,0,Portail Data INRAE
2,170336,https://doi.org/10.15454/CX70I3,doi:10.15454/CX70I3,NONE,CC BY 2.0,,,,,,...,,,,,2019-01-21,2019-01-21 10:54:41+00:00,RELEASED,2,0,Portail Data INRAE
3,1200,https://doi.org/10.21410/7E4/4WG94W,doi:10.21410/7E4/4WG94W,NONE,,,,,,,...,,,,,2020-05-05,2020-05-13 16:06:28+00:00,RELEASED,2,1,data.sciencespo
4,198147,https://doi.org/10.7910/DVN/5PRYPC,doi:10.7910/DVN/5PRYPC,CC0,CC0 Waiver,,,,,,...,,,,,2020-05-27,2020-06-17 23:49:50+00:00,RELEASED,4,0,Harvard Dataverse


In [4]:
print('Number of dataset versions: %s' % (len(allVersions)))
print('Number of unique datasets: %s' % (len(allVersions.persistentUrl.unique())))

Number of dataset versions: 382596
Number of unique datasets: 155719


**About dataset versioning and Terms metadata**

While we know from experience that some datasets published in Dataverse repositories contain versions where each version has its own terms metadata (such as a dataset where each version represents a wave of the same longitudinal research study and data re-users should consider each dataset version's license), let's assume that for a large majority of datasets, the Terms metadata of the latest versions have replaced the metadata of all previous versions.

For example, if a dataset has two published versions, the first version has no CC0 waiver, and the second version has a CC0 waiver, the CC0 waiver is what should apply to the dataset (and one could argue that previous versions should have been deaccessioned).

Since versioning in the Dataverse software is designed more as a way to record improvements to a dataset over time, as opposed to a way to publish disparate but connected datasets (such as waves in a longitudinal study), we'll assume that this is how most people use dataset versioning when publishing datasets in Dataverse installations. Additionally, the Dataverse software favors the latest published version of each dataset, tending to expose the metadata in the latest published version more than it exposes the metadata of previously published versions (such as through indexing and search result displays, metadata distribution, and how PIDs direct users to the latest published version).

So let's get and consider the Terms metadata for only the latest version of each dataset.

In [5]:
# Get only metadata for the latest versions of each dataset
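# This treats the highest datasetVersionId within each dataset as its latest version.
# idxmax returns index labels, so select rows with .loc rather than .iloc
latestversion = (allVersions
                 .loc[allVersions
                    .groupby('persistentUrl')['datasetVersionId']
                    .idxmax()]
                 .sort_values(by=['publisher'], inplace=False, ascending=True)
                 .reset_index(drop=True, inplace=False))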
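To make the "latest version per dataset" selection easier to follow, here is a minimal sketch on a toy dataframe. The `doi:A` and `doi:B` identifiers and the version IDs are invented, and the sketch assumes, as the cell above does, that the highest `datasetVersionId` within a dataset marks its latest version.

```python
import pandas as pd

# Toy data: two datasets with several versions each (made-up values)
toy = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:A', 'doi:B', 'doi:B', 'doi:B'],
    'datasetVersionId': [10, 12, 7, 9, 8],
    'license': ['NONE', 'CC0', 'CC0', 'CC0', 'NONE'],
})

# For each dataset, find the index label of the row with the largest version ID
latest_labels = toy.groupby('persistentUrl')['datasetVersionId'].idxmax()

# .loc selects by label, so this also works when the index is not 0..n-1
latest = toy.loc[latest_labels].reset_index(drop=True)
print(latest)
```

Each dataset appears exactly once in `latest`, which is why the count of rows should match the count of unique persistent URLs.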

Let's prepare the rest of the data so it's easier to query.

In [6]:
# Replace any blank values with NaN, making it easier to count and sort later
latestversion = latestversion.replace(r'^\s*$', np.nan, regex=True)

# Dartmouth's repository exports JSON files with "Root" in the "publisher" key. Replace "Root" with "Dartmouth"
latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')

There shouldn't be too many unique values entered in the license fields, so let's see what's there.

In [7]:
# Let's see what's been entered in the license fields
latestversion.license.unique()

array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)

From earlier experiences working with this metadata, I know that both the string "NONE" and null values (NaN) have been used to indicate that there's nothing in the license field. Let's replace all 'NONE' strings with NaN to make querying easier.

In [8]:
latestversion = latestversion.replace('NONE', np.nan)
latestversion.head(5)

Unnamed: 0,datasetVersionId,persistentUrl,persistent_id,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,...,availabilityStatus,contactForAccess,sizeOfCollection,studyCompletion,datasetPublicationDate,versionCreateTime,versionState,majorVersionNumber,minorVersionNumber,publisher
0,59,https://doi.org/10.25825/FK2/8YKSQV,,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,...,,,,,2019-11-26,2019-11-14 18:21:51+00:00,RELEASED,1,0,ACSS Dataverse
1,139,https://doi.org/10.25825/FK2/VXVPVP,,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,...,,,,,2019-10-17,2021-01-06 16:02:23+00:00,RELEASED,2,1,ACSS Dataverse
2,115,https://doi.org/10.25825/FK2/VNAJ1I,,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,...,,,,,2019-11-26,2019-11-18 21:23:56+00:00,RELEASED,1,0,ACSS Dataverse
3,30,https://doi.org/10.25825/FK2/VEO88P,,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,...,,,,,2019-11-26,2019-11-13 15:13:57+00:00,RELEASED,1,0,ACSS Dataverse
4,26,https://doi.org/10.25825/FK2/VEANNS,,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,...,,,,,2019-11-26,2019-11-12 20:34:20+00:00,RELEASED,1,0,ACSS Dataverse
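The two clean-up steps (turning blank strings and the literal "NONE" into NaN) can be checked on a small example. This is only a sketch with invented values, meant to show what each `replace` call does.

```python
import numpy as np
import pandas as pd

# Toy data with an empty string, a whitespace-only string, and "NONE"
toy = pd.DataFrame({
    'license': ['CC0', 'NONE', ''],
    'restrictions': ['   ', 'Contact the author', ''],
})

# Empty or whitespace-only strings become NaN
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)

# The literal string "NONE" is treated as "no license"
cleaned = cleaned.replace('NONE', np.nan)

# Count the missing values per column
print(cleaned.isna().sum())
```

Note that `replace('NONE', np.nan)` applies to every column, so a field whose entire value is the string "NONE" would also be blanked.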


Lastly, let's drop the columns we no longer need, such as datasetVersionId, and reorder the remaining columns so that the columns with basic info about each dataset are together and are followed by columns with the Terms metadata.

In [9]:
# Keep only the columns needed for the analysis (dropping datasetVersionId and others) and reorder them
latestversion = latestversion[[
    'publisher', 'persistentUrl', 'majorVersionNumber', 'minorVersionNumber', 'versionCreateTime', 'license', 'termsOfUse', 'confidentialityDeclaration', 'specialPermissions',
    'restrictions', 'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer'
]]

In [10]:
print('Number of datasets: %s' % (len(latestversion)))
print('Number of installations: %s' % (len(latestversion.publisher.unique())))

Number of datasets: 155719
Number of installations: 56


### Question 1: How many datasets published by each Dataverse installation include any information about how the data can be used?

We'll be exploring the Terms metadata of 155,719 datasets in 56 Dataverse installations.

First let's get the counts of published datasets in each installation. Then we can compare those counts to the counts of published datasets that have any kind of Terms metadata.

In [7]:
countDatasetsByInstallation = latestversion.value_counts(subset=['publisher']).to_frame('count of datasets')
countDatasetsByInstallation.head(5)

Unnamed: 0_level_0,count of datasets
publisher,Unnamed: 1_level_1
Portail Data INRAE,72010
Harvard Dataverse,44057
openforestdata.pl,10116
UNC Dataverse,4537
Scholars Portal Dataverse,4096


How many of those datasets have any Terms metadata?

In [8]:
# Let's take a look at a few rows from the latestversion dataframe to see how we might query it
latestversion.head(5)

Unnamed: 0,publisher,persistentUrl,majorVersionNumber,minorVersionNumber,versionCreateTime,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer
0,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,1,0,2019-11-14 18:21:51+00:00,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
1,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,2,1,2021-01-06 16:02:23+00:00,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
2,ACSS Dataverse,https://doi.org/10.25825/FK2/VNAJ1I,1,0,2019-11-18 21:23:56+00:00,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
3,ACSS Dataverse,https://doi.org/10.25825/FK2/VEO88P,1,0,2019-11-13 15:13:57+00:00,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
4,ACSS Dataverse,https://doi.org/10.25825/FK2/VEANNS,1,0,2019-11-12 20:34:20+00:00,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,


In [12]:
# Create dataframe containing only datasets with something in their license fields or any Terms of Use fields
datasetsWithAnyTerms = (latestversion
                    .query('license.notnull() or termsOfUse.notnull() or confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull()')
                   .reset_index(drop = True, inplace = False)
                   )
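The long `query` string above can also be expressed with a list of column names and a boolean mask. The sketch below is an alternative formulation on toy data, not the notebook's method; it uses only three of the nine Terms columns to keep the example short.

```python
import numpy as np
import pandas as pd

# A subset of the Terms columns, for illustration only
terms_columns = ['license', 'termsOfUse', 'restrictions']

toy = pd.DataFrame({
    'publisher': ['A', 'A', 'B'],
    'license': ['CC0', np.nan, np.nan],
    'termsOfUse': ['CC0 Waiver', np.nan, np.nan],
    'restrictions': [np.nan, np.nan, 'Ask first'],
})

# True for rows where at least one Terms column is filled
has_any_terms = toy[terms_columns].notna().any(axis=1)

with_terms = toy[has_any_terms].reset_index(drop=True)
without_terms = toy[~has_any_terms].reset_index(drop=True)
print(len(with_terms), len(without_terms))
```

Keeping the column names in one list means the "any Terms" and "no Terms" filters are guaranteed to be exact complements of each other.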

How many of those datasets are in each installation?

In [13]:
countDatasetsWithTermsByInstallation = datasetsWithAnyTerms.value_counts(subset=['publisher']).to_frame('count of datasets with any Terms')
countDatasetsWithTermsByInstallation.head(5)

Unnamed: 0_level_0,count of datasets with any Terms
publisher,Unnamed: 1_level_1
Portail Data INRAE,72005
Harvard Dataverse,34447
openforestdata.pl,10042
UNC Dataverse,4284
RIN Dataverse,4016


In [14]:
# Combine the two dataframes so we can compare the count of datasets with any Terms to the count of all datasets
dataframes = [
    countDatasetsByInstallation,
    countDatasetsWithTermsByInstallation,
    ]

countDatasetsWithAndWithoutTermsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Format the dataframe to replace NA values with 0 and cast the counts as integers
countDatasetsWithAndWithoutTermsByInstallation = (countDatasetsWithAndWithoutTermsByInstallation
    .fillna(0)
    .astype('int32')
    )

# Add a column showing the share of datasets with any Terms, as a fraction from 0 to 1 (multiply by 100 for a percentage)
countDatasetsWithAndWithoutTermsByInstallation['percent of datasets with any Terms'] = countDatasetsWithAndWithoutTermsByInstallation['count of datasets with any Terms'] / countDatasetsWithAndWithoutTermsByInstallation['count of datasets']

countDatasetsWithAndWithoutTermsByInstallation.head(5)

Unnamed: 0_level_0,count of datasets,count of datasets with any Terms,percent of datasets with any Terms
publisher,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACSS Dataverse,119,119,1.0
ADA Dataverse,1583,1583,1.0
ASU Library Research Data Repository,26,26,1.0
AUSSDA,1147,1147,1.0
Abacus Data Network,2191,2190,0.999544
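As a cross-check on the joined counts, the same per-installation totals and shares can be computed in one pass with a grouped aggregation. This is a sketch on toy data; the `has_terms` flag and the output column names are invented for the example.

```python
import pandas as pd

# Toy data: one row per dataset, flagged by whether it has any Terms metadata
toy = pd.DataFrame({
    'publisher': ['A', 'A', 'A', 'B', 'C'],
    'has_terms': [True, True, False, True, False],
})

# Count all datasets and datasets with Terms for each installation
counts = toy.groupby('publisher').agg(
    total=('has_terms', 'size'),
    with_terms=('has_terms', 'sum'),
)

# Share of datasets with Terms, as a fraction from 0 to 1
counts['share_with_terms'] = counts['with_terms'] / counts['total']
print(counts)
```

Because every installation appears in the grouped result, installations with no matching datasets get a count of 0 directly, with no need to fill NaN values after a join.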


I think it might be helpful for some installations to see which datasets have no Terms metadata, so let's query for that and export the results to a CSV file.

In [24]:
# Create dataframe containing only datasets with nothing in their license fields or any Terms of Use fields
datasetsWithNoTerms = (latestversion
                        .query('(license == "NONE" or license.isnull()) and (termsOfUse.isnull() and confidentialityDeclaration.isnull() and specialPermissions.isnull() and restrictions.isnull() and citationRequirements.isnull() and depositorRequirements.isnull() and conditions.isnull() and disclaimer.isnull())')
                        .reset_index(drop = True, inplace = False)
                        )

print('Number of datasets with no Terms metadata: %s' % (len(datasetsWithNoTerms)))

# Export dataframe to a CSV file
datasetsWithNoTerms.to_csv('datasetsWithNoTerms.csv', index=False)

Number of datasets with no Terms metadata: 12810


### Question 2: When installations apply the changes in the "multiple license" software update, how many datasets will have "Custom Terms"? And how many of these datasets has each installation published?

Again, we could use these numbers to get a sense of the scale of change for each installation, and the numbers might encourage each installation to investigate the effects of the change.

In [25]:
# Create a new dataframe with the datasets that have a value in the licenses field (that is, their license field is not null or "NONE") plus one or more values in the Terms of Use fields
datasetsWithCustomTerms = (latestversion
                            .query('(license.notnull()) and (confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull())')
                            .reset_index(drop = True, inplace = False)
                            )

print('Number of datasets with a license plus values in one or more Terms of Use fields: %s' % (len(datasetsWithCustomTerms)))

Number of datasets with a license plus values in one or more Terms of Use fields: 621
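The "Custom Terms" condition in the query above combines two tests: the license field is filled, and at least one of seven other Terms of Use fields is filled (`termsOfUse` itself is not among the fields checked). The sketch below restates that logic on three invented rows; it is an illustration, not a replacement for the query.

```python
import pandas as pd

# The seven fields checked by the query (termsOfUse is deliberately left out)
other_terms_columns = [
    'confidentialityDeclaration', 'specialPermissions', 'restrictions',
    'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer',
]

rows = [
    {'license': 'CC0', 'termsOfUse': 'CC0 Waiver'},                         # license only
    {'license': 'CC0', 'termsOfUse': 'CC0 Waiver', 'disclaimer': 'As is'},  # license plus one field
    {'termsOfUse': 'Custom text', 'restrictions': 'Ask first'},             # no license
]

# Missing keys become NaN; reindex adds the columns no row mentions
toy = pd.DataFrame(rows).reindex(columns=['license', 'termsOfUse'] + other_terms_columns)

# A license is present AND at least one of the other fields is filled
is_custom = toy['license'].notna() & toy[other_terms_columns].notna().any(axis=1)
print(toy[is_custom])
```

Only the second row qualifies: the first has a license but no extra fields, and the third has extra fields but no license.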


In [17]:
datasetsWithCustomTerms.head(5)

Unnamed: 0,publisher,persistentUrl,majorVersionNumber,minorVersionNumber,versionCreateTime,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer
0,CROSSDA,https://doi.org/10.23669/JVNVNR,1,1,2021-07-19 13:23:04+00:00,CC0,CC0 Waiver,,,,"The citation for this study is:\nVlašiček, D.,...",,,
1,DR-NTU (Data),https://doi.org/10.21979/N9/DHYM9H,1,0,2020-01-13 03:23:43+00:00,CC0,CC0 Waiver,,,"To download these audio files, researchers mus...","To cite this dataset:\nStyles, Suzy J; Bin Mus...",,,
2,DR-NTU (Data),https://doi.org/10.21979/N9/OF5ZDK,1,0,2020-01-13 03:56:25+00:00,CC0,CC0 Waiver,,,Researchers must give their name and current r...,"To cite this dataset: \nStyles, Suzy; Travers ...",,,
3,DaRUS,https://doi.org/10.18419/darus-1851,1,0,2021-05-14 18:02:49+00:00,CC BY,CC BY Waiver,,,,"See ""Related Publication"".",,"The code and software included under ""software...",
4,Dataverse,https://doi.org/10.48510/FK2/DUIKUT,1,0,2021-05-27 13:41:48+00:00,CC0,CC0 Waiver,,,,,,,If you use the published dataset or parts of ...


How many of these datasets are in each Dataverse installation?

In [20]:
countDatasetsWithCustomTermsByInstallation = datasetsWithCustomTerms.value_counts(subset=['publisher']).to_frame('count of datasets with "Custom Terms"')
countDatasetsWithCustomTermsByInstallation.head(5)

Unnamed: 0_level_0,"count of datasets with ""Custom Terms"""
publisher,Unnamed: 1_level_1
Harvard Dataverse,394
UNC Dataverse,131
Scholars Portal Dataverse,51
Portail Data INRAE,6
Peking University Open Research Data Platform,6


Next, let's join this dataframe to the dataframe showing all datasets in each installation.

In [21]:
dataframes = [
    countDatasetsByInstallation,
    countDatasetsWithCustomTermsByInstallation]

countDatasetsWithAndWithoutCustomTermsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Format the dataframe to replace NA values with 0, cast values as integers, and sort by the greatest number of datasets with a license plus Terms of Use values
countDatasetsWithAndWithoutCustomTermsByInstallation = (countDatasetsWithAndWithoutCustomTermsByInstallation
                           .fillna(0)
                           .astype('int32')
                           .sort_values([
                                'count of datasets with "Custom Terms"',
                                'count of datasets'], ascending=False)
                           )

# Add a column showing the share of datasets with "Custom Terms", as a fraction from 0 to 1
countDatasetsWithAndWithoutCustomTermsByInstallation['percent of datasets with "Custom Terms"'] = countDatasetsWithAndWithoutCustomTermsByInstallation['count of datasets with "Custom Terms"'] / countDatasetsWithAndWithoutCustomTermsByInstallation['count of datasets']

In [22]:
countDatasetsWithAndWithoutCustomTermsByInstallation.head(5)

Unnamed: 0_level_0,count of datasets,"count of datasets with ""Custom Terms""","percent of datasets with ""Custom Terms"""
publisher,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Harvard Dataverse,44057,394,0.008943
UNC Dataverse,4537,131,0.028874
Scholars Portal Dataverse,4096,51,0.012451
Portail Data INRAE,72010,6,8.3e-05
Peking University Open Research Data Platform,320,6,0.01875


Finally, it might also be helpful for some installations to see which datasets will be considered to have "Custom Terms", so let's export that dataframe to a CSV file.

In [26]:
# Export dataframe to a CSV file
datasetsWithCustomTerms.to_csv('datasetsWithCustomTerms.csv', index=False)