# Datasets with license plus terms of use metadata

## Goals of this notebook

This notebook explores the "Terms" metadata of datasets in Dataverse installations in order to learn more about the effects of the "multiple-license" feature making its way to the Dataverse software (https://github.com/IQSS/dataverse/pull/7920). When applied to Dataverse installations, this update will change how depositors enter "Terms" metadata (metadata about how the data should or must be used). A goal of this update is to encourage the application of standard licenses to datasets in Dataverse repositories, which should make it easier for other people and discovery systems to determine how the data can and can't be used.

For more information about the update, see the Multiple License Consensus Proposal ([Google Doc](https://docs.google.com/document/d/10htygglMdlABYWqtcZpqd8sHOwIe6sLL_UJtTv8NEKw)) and the following GitHub issues and pull request:

- https://github.com/IQSS/dataverse/issues/7742
- https://github.com/IQSS/dataverse/issues/7440
- https://github.com/IQSS/dataverse/pull/7920

This update might therefore increase the share of dataset metadata published by Dataverse repositories that includes information about how the data can be used, and increase the use of standardized Terms (such as Creative Commons licenses).

Additionally, with this update any dataset metadata with values in any of the fields in the software's "Terms of Use" panel, such as Confidentiality Declaration and Special Permissions, will be considered "Custom Terms," even when a CC0 waiver (or for some forked installations, a CCBY license) was applied.

So the two questions we seek to answer are:
- How much of the dataset metadata published by each Dataverse repository includes any information about how the data can be used? We can track how these numbers change over time and particularly after each repository applies this and similar updates, as a way to measure the success of these changes and justify them.
- When installations apply these changes, which ones have published datasets whose Terms metadata will be considered "Custom Terms" because they have CC0 waivers (or, for some forked installations, CCBY licenses) plus any of the eight "Terms of Use" fields filled? And how many of these datasets has each installation published? We could use these numbers to get a sense of the scale of the change for each installation, and the numbers might encourage each installation to look into the effects of this change.

## Methods

The Terms metadata published by 56 Dataverse installations (of the known 73) is recorded in the terms_metadata.tab file. That tabular file was created by:
- Downloading all zip files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ. That dataset contains the JSON metadata files collected between August 4 and August 7, 2021 using a Python script. The methods for getting this metadata are described in the dataset.
- Using another Python script to extract the JSON metadata files in each zip file into a single directory.
- Parsing the terms metadata of every JSON metadata file in that directory into a single CSV file using the [parse_terms_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_terms_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Parsing the value of each JSON file's "publisher" key, which indicates the installation that published the metadata in each JSON file, into a single CSV file using the [parse_basic_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_basic_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Joining both CSV files into a single CSV file that contains the persistent URLs, dataset version IDs, installation names (publishers), and Terms metadata for every published dataset version in the 56 installations, and converting that CSV file into a .tab file (so that it's easier to make this notebook accessible using the Dataverse software's [binder integration](https://guides.dataverse.org/en/5.7/admin/integrations.html?highlight=integrations#binder)).
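
The last step in the list above (joining the two CSV files and writing a .tab file) could look roughly like the sketch below. This is not the script that was actually used: the two small dataframes are hypothetical stand-ins for the CSV files, the DOIs and installation names are made up, and the join keys are assumed from the column names in terms_metadata.tab.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files produced by the parsing scripts;
# the column names are assumed from the header of terms_metadata.tab.
terms = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.1234/A', 'https://doi.org/10.1234/B'],
    'datasetVersionId': [1, 2],
    'license': ['CC0', 'NONE'],
})
basic = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.1234/A', 'https://doi.org/10.1234/B'],
    'datasetVersionId': [1, 2],
    'publisher': ['Installation A', 'Installation B'],
})

# Join on the dataset PID and version ID so each version gets its publisher
joined = basic.merge(terms, on=['persistentUrl', 'datasetVersionId'], how='inner')

# Write tab-separated text; passing a file path instead of a buffer works the same way
buffer = io.StringIO()
joined.to_csv(buffer, sep='\t', index=False)
tab_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; with real data you would pass a path such as 'terms_metadata.tab' to `to_csv`.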

This notebook uses the pandas and numpy Python packages to filter, reshape and visualize the data in the terms_metadata.tab file in order to help answer the two questions in the Goals section of this notebook.

## Exploration

In [1]:
# Import Python packages
from functools import reduce
import numpy as np
import pandas as pd

### Importing and preparing the data

In [2]:
# Import data as a dataframe
allVersions = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False)

In [3]:
allVersions.head()

Unnamed: 0,persistentUrl,datasetVersionId,publisher,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer,termsOfAccess,dataAccessPlace,originalArchive,availabilityStatus,contactForAccess,sizeOfCollection,studyCompletion
0,https://doi.org/10.15454/GHAUFR,256773,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
1,https://doi.org/10.15454/TLXRVW,52619,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
2,https://doi.org/10.15454/CX70I3,170336,Portail Data INRAE,NONE,CC BY 2.0,,,,,,,,,,,,,,
3,https://doi.org/10.21410/7E4/4WG94W,1200,data.sciencespo,NONE,,,,,,,,,,,,,,,
4,https://doi.org/10.7910/DVN/5PRYPC,198147,Harvard Dataverse,CC0,CC0 Waiver,,,,,,,,,,,,,,


In [4]:
print('Number of total dataset versions: %s' % (len(allVersions)))
print('Number of unique datasets: %s' % (len(allVersions.persistentUrl.unique())))

Number of total dataset versions: 382596
Number of unique datasets: 155719


**About dataset versioning and Terms metadata**

While we know from experience that some datasets published in Dataverse repositories contain versions where each version has its own terms metadata (such as a dataset where each version represents a wave of the same longitudinal research study and data re-users should consider each dataset version's license), let's assume that for a large majority of datasets, the Terms metadata of the latest versions have replaced the metadata of all previous versions.

For example, if a dataset has two published versions, the first version has no CC0 waiver, and the second version has a CC0 waiver, the CC0 waiver is what should apply to the dataset (and one could argue that previous versions should have been deaccessioned).

Since versioning in the Dataverse software is designed more as a way to record improvements to a dataset over time, as opposed to a way to publish disparate but connected datasets (such as waves in a longitudinal study), we'll assume that this is how most depositors use dataset versioning. Additionally, the Dataverse software favors the latest published version of each dataset, tending to expose the metadata in the latest published version more than it exposes the metadata of previously published versions (such as through indexing and search result displays, metadata distribution, and how PIDs direct users to the latest dataset version).

So let's get and consider the Terms metadata for only the latest version of each dataset.

In [5]:
# Get only metadata for the latest version of each dataset
# (this treats the highest datasetVersionId of each dataset as its latest version)
latestversion = (allVersions
                 .loc[allVersions
                    .groupby('persistentUrl')['datasetVersionId']
                    .idxmax()]
                 .sort_values(by=['publisher'], ascending=True)
                 .reset_index(drop=True))
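
To see what the selection of latest versions does, here's a minimal sketch on toy data. The PIDs, version IDs and licenses are made up, and like the cell above it assumes that the highest datasetVersionId of a dataset belongs to its latest version.

```python
import pandas as pd

# Toy data (hypothetical PIDs and version IDs): two versions of dataset A, one of B
versions = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:A'],
    'datasetVersionId': [10, 7, 25],
    'license': ['NONE', 'CC0', 'CC0'],
})

# idxmax returns, for each dataset, the row label of its highest version ID
latest_labels = versions.groupby('persistentUrl')['datasetVersionId'].idxmax()

# .loc selects by label, so this stays correct even if the index is not 0..n-1
latest = versions.loc[latest_labels].reset_index(drop=True)
```

In this toy example only the second version of dataset A is kept, so its CC0 license replaces the earlier NONE value.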

In [6]:
# Replace any blank values with NaN, making it easier to count and sort later
latestversion = latestversion.replace(r'^\s*$', np.nan, regex=True)

# Dartmouth's repository exports JSON files with "Root" in the "publisher" key. Replace "Root" with "Dartmouth"
latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')
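
As a quick illustration of the blank-to-NaN replacement, here's a small sketch on made-up values. Note that the string "NONE" is not blank, so it stays as-is, which is why later queries need to handle "NONE" and NaN separately.

```python
import numpy as np
import pandas as pd

# Toy values: an empty string and a whitespace-only string stand in for blank fields
toy = pd.DataFrame({
    'license': ['CC0', 'NONE', ''],
    'termsOfUse': ['CC0 Waiver', '   ', 'Custom text'],
})

# Same pattern as in the cell above: empty or whitespace-only strings become NaN
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)

# Count the missing values per column
blank_counts = cleaned.isna().sum()
```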


In [19]:
print('Number of datasets: %s' % (len(latestversion)))
print('Number of installations: %s' % (len(latestversion.publisher.unique())))

Number of datasets: 155719
Number of installations: 56


### Question 1: How many datasets published by each Dataverse installation include any information about how the data can be used?

We'll be exploring the Terms metadata of 155,719 datasets in 56 Dataverse installations. Let's get a count of the datasets in each installation to compare to the count of the datasets with metadata that includes any Terms (including standard licenses in any Terms field).

In [8]:
countlatestversionByInstallation = latestversion.value_counts(subset=['publisher']).to_frame('count of datasets')
countlatestversionByInstallation.head()

Unnamed: 0_level_0,count of datasets
publisher,Unnamed: 1_level_1
Portail Data INRAE,72010
Harvard Dataverse,44057
openforestdata.pl,10116
UNC Dataverse,4537
Scholars Portal Dataverse,4096


How many of those datasets have metadata with any Terms information?

In [24]:
# Retain only columns containing the publisher and dataset PID and any values in the license and Terms of Use fields
latestversion = latestversion[[
    'publisher', 'persistentUrl', 'license', 'termsOfUse', 'confidentialityDeclaration', 'specialPermissions',
    'restrictions', 'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer'
]]

In [25]:
# List unique licenses in all repositories' license fields
latestversion.license.unique()

array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)

In [28]:
# Create dataframe containing only datasets with something in their license fields or any Terms of Use fields
latestversionAnyTerms = (latestversion
                    .query('license == "CC0" or license == "CCBY" or license == "CC BY" or termsOfUse.notnull() or confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull()')
                   .reset_index(drop = True, inplace = False)
                   )
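
The same filter can also be written with boolean masks instead of one long query string, which may be easier to read and extend. Here's a minimal sketch on toy data; the rows are made up and only two of the eight Terms of Use columns are included to keep it short.

```python
import numpy as np
import pandas as pd

# Toy rows with a subset of the notebook's columns (values are made up)
toy = pd.DataFrame({
    'publisher': ['A', 'A', 'B', 'B'],
    'license': ['CC0', 'NONE', np.nan, 'NONE'],
    'termsOfUse': ['CC0 Waiver', np.nan, np.nan, 'Custom text'],
    'restrictions': [np.nan, np.nan, np.nan, np.nan],
})

# With the real data this list would hold all eight Terms of Use columns
terms_columns = ['termsOfUse', 'restrictions']
standard_licenses = ['CC0', 'CCBY', 'CC BY']

# True where the license field holds one of the standard license values
has_license = toy['license'].isin(standard_licenses)
# True where at least one Terms of Use field is filled in
has_terms_text = toy[terms_columns].notna().any(axis=1)

any_terms = toy[has_license | has_terms_text].reset_index(drop=True)
```

Here the first row qualifies because of its license and the last row because of its custom termsOfUse text, while the two rows with nothing filled in are dropped.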

In [31]:
print('Number of datasets with any Terms metadata: %s' % (len(latestversionAnyTerms)))
fileName = 'latestversionAnyTerms.csv'
latestversionAnyTerms.to_csv(fileName, index=False)

Number of datasets with any Terms metadata: 142909


How many of those datasets are in each installation?

In [23]:
countDatasetsWithTermsByInstallation = latestversionAnyTerms.value_counts(subset=['publisher']).to_frame('count of datasets with Terms')
countDatasetsWithTermsByInstallation.head(60)

Unnamed: 0_level_0,count of datasets with Terms
publisher,Unnamed: 1_level_1
Harvard Dataverse,32593
UNC Dataverse,4130
RIN Dataverse,3950
Scholars Portal Dataverse,2943
Abacus Data Network,2137
ADA Dataverse,1545
DataverseNO,1064
DataverseNL,1015
DR-NTU (Data),749
UAL Dataverse,374


It'll be helpful to know which datasets have no Terms metadata, so let's query for that and save the results as a CSV file.

In [None]:
# Create dataframe containing only datasets with nothing in their license fields or any Terms of Use fields
latestversionNoTerms = (latestversion
                         .query(
                            '(license == "NONE" or license.isnull()) and\
                            termsOfUse.isnull() and\
                            confidentialityDeclaration.isnull() and\
                            specialPermissions.isnull() and\
                            restrictions.isnull() and\
                            citationRequirements.isnull() and\
                            depositorRequirements.isnull() and\
                            conditions.isnull() and\
                            disclaimer.isnull()'
                        )
                         .reset_index(drop = True, inplace = False)
                         )

# Save the results as a CSV file
fileName = 'latestversionNoTerms.csv'
latestversionNoTerms.to_csv(fileName, index=False)
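
A useful sanity check is that the datasets with no Terms and the datasets with any Terms together account for every dataset. With the counts reported above, we'd expect 155,719 − 142,909 = 12,810 datasets with no Terms metadata. Here's a minimal sketch of that check on toy data, where the rows are made up and only two of the eight Terms of Use columns are included.

```python
import numpy as np
import pandas as pd

# Toy rows with a subset of the notebook's columns (values are made up)
toy = pd.DataFrame({
    'license': ['CC0', 'NONE', np.nan, 'NONE'],
    'termsOfUse': ['CC0 Waiver', np.nan, np.nan, 'Custom text'],
    'disclaimer': [np.nan, np.nan, np.nan, np.nan],
})
terms_columns = ['termsOfUse', 'disclaimer']

# No license means the field is missing or holds the placeholder "NONE"
no_license = toy['license'].isna() | (toy['license'] == 'NONE')
# No Terms text means every Terms of Use field is empty
no_terms_text = toy[terms_columns].isna().all(axis=1)

no_terms = toy[no_license & no_terms_text]
any_terms = toy[~(no_license & no_terms_text)]

# The two subsets should partition the data
assert len(no_terms) + len(any_terms) == len(toy)
```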

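### Question 2: How many datasets in each installation have a license plus values in one or more Terms of Use fields?

Now we want to know: How many datasets in each repository have a value in the more machine-readable license field (their license field is not null or NONE) plus one or more values in the Terms of Use fields?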

In [14]:
# Create dataframe containing only datasets with something in their license fields
latestversionCC = (latestversion
    .query('license == "CC0" or license == "CCBY" or license == "CC BY"')
    .reset_index(drop = True, inplace = False)
)

In [15]:
latestversionCC.head()

Unnamed: 0,publisher,persistentUrl,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer
0,ADA Dataverse,http://dx.doi.org/10.4225/87/B5AJXD,CC0,CC0 Waiver,,,,,,,
1,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/UKDZ2U,CC0,CC0 Waiver,,,,,,,
2,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/QXXWGP,CC0,CC0 Waiver,,,,,,,
3,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/TCJR5Z,CC0,CC0 Waiver,,,,,,,
4,ASU Library Research Data Repository,https://doi.org/10.48349/ASU/VPO9LD,CC0,CC0 Waiver,,,,,,,


In [16]:
print('There are %s datasets with something in their license fields.' % (len(latestversionCC)))

There are 40976 datasets with something in their license fields.


How many of those are in each Dataverse installation?

In [18]:
countlatestversionCCByInstallation = latestversionCC.value_counts(subset=['publisher']).to_frame('count of datasets with license field value')

countlatestversionCCByInstallation.head()

Unnamed: 0_level_0,count of datasets with license field value
publisher,Unnamed: 1_level_1
Harvard Dataverse,29323
RIN Dataverse,3936
Scholars Portal Dataverse,2329
DataverseNO,1060
DataverseNL,764


Among the datasets with a value in their license fields, how many also have values in one or more Terms of Use fields? How many are in each of the 56 Dataverse installations?

In [19]:
# Create dataframe with only datasets with a license and one or more Terms of Use fields filled in
latestversionCCPlusCustomTerms = latestversionCC.query(
    'confidentialityDeclaration.notnull() or\
    specialPermissions.notnull() or\
    restrictions.notnull() or\
    citationRequirements.notnull() or\
    depositorRequirements.notnull() or\
    conditions.notnull() or\
    disclaimer.notnull()'
)

print('Number of datasets with a value in the license field plus values in one or more "Terms of Use" fields: %s' % (len(latestversionCCPlusCustomTerms)))

Number of datasets with a value in the license field plus values in one or more "Terms of Use" fields: 621


In [21]:
# Create a dataframe showing the count of these datasets in each Dataverse installation
latestversionCCPlusCustomTermsByInstallation = (latestversionCCPlusCustomTerms
    .value_counts(subset=['publisher'])
    .to_frame('count of datasets with values in license and ToU field(s)'))

latestversionCCPlusCustomTermsByInstallation.head()

Unnamed: 0_level_0,count of datasets with values in license and ToU field(s)
publisher,Unnamed: 1_level_1
Harvard Dataverse,394
UNC Dataverse,131
Scholars Portal Dataverse,51
Portail Data INRAE,6
Peking University Open Research Data Platform,6


Finally, let's join all of the dataframes we've created so we can see the counts of datasets for each query for each installation.

In [23]:
dataframes = [
    countlatestversionByInstallation,
    countlatestversionCCByInstallation,
    latestversionCCPlusCustomTermsByInstallation]

allCountsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Format allCountsByInstallation dataframe to replace NA values with 0, cast values as integers and sort by greatest number of datasets with a license plus ToU
allCountsByInstallation = (allCountsByInstallation
                           .fillna(0)
                           .astype('int32')
                           .sort_values([
                                'count of datasets with values in license and ToU field(s)',
                                'count of datasets',
                                'count of datasets with license field value'], ascending=False)
                           )
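
To make the join step easier to follow, here's a minimal sketch on toy counts. The installation names "A" and "B", the counts and the shortened column names are made up. An outer join keeps installations that are missing from some of the dataframes, and `fillna(0)` turns those gaps into zero counts.

```python
from functools import reduce
import pandas as pd

# Toy count dataframes indexed by publisher; installation B has no licensed datasets
total = pd.DataFrame({'count of datasets': [10, 5]},
                     index=pd.Index(['A', 'B'], name='publisher'))
licensed = pd.DataFrame({'with license': [4]},
                        index=pd.Index(['A'], name='publisher'))
both = pd.DataFrame({'with license and ToU': [1]},
                    index=pd.Index(['A'], name='publisher'))

# Join the dataframes pairwise on their shared publisher index
combined = reduce(lambda left, right: left.join(right, how='outer'),
                  [total, licensed, both])

# Missing counts become 0, and the float columns created by NaN go back to integers
combined = combined.fillna(0).astype('int32')
```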

In [24]:
allCountsByInstallation.head()

Unnamed: 0_level_0,count of datasets,count of datasets with license field value,count of datasets with values in license and ToU field(s)
publisher,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Harvard Dataverse,44057,29323,394
UNC Dataverse,4537,644,131
Scholars Portal Dataverse,4096,2329,51
Portail Data INRAE,72010,55,6
Peking University Open Research Data Platform,320,314,6


In [29]:
# Export the allCountsByInstallation dataframe to a CSV file
fileName = 'allCountsByInstallation.csv'
allCountsByInstallation.to_csv(fileName, index=True)