# Datasets with license plus terms of use metadata

## Goals of this notebook

This notebook explores the "Terms" metadata of datasets in Dataverse installations in order to learn more about the effects of the "multiple-license" feature making its way to the Dataverse software (https://github.com/IQSS/dataverse/pull/7920). When applied to Dataverse installations, this update will change how depositors enter "Terms" metadata (metadata about how the data should or must be used). A goal of this update is to encourage the application of standard licenses to datasets in Dataverse repositories, which should make it easier for other people and discovery systems to determine how the data can and can't be used.

For more information about the update, see the Multiple License Consensus Proposal ([Google Doc](https://docs.google.com/document/d/10htygglMdlABYWqtcZpqd8sHOwIe6sLL_UJtTv8NEKw)) and the following GitHub issues and pull request:

- https://github.com/IQSS/dataverse/issues/7742
- https://github.com/IQSS/dataverse/issues/7440
- https://github.com/IQSS/dataverse/pull/7920

This update might therefore increase the share of dataset metadata published by Dataverse repositories that includes information about how the data can be used, and increase the use of standardized Terms (such as Creative Commons licenses).

Additionally, with this update any dataset metadata with values in any of the fields in the software's "Terms of Use" panel, such as Confidentiality Declaration and Special Permissions, will be considered "Custom Terms," even when a CC0 waiver (or for some forked installations, a CCBY license) was applied.

So the two questions we seek to answer are:
- How much of the dataset metadata published by each Dataverse repository includes any information about how the data can be used? We can track how these numbers change over time and particularly after each repository applies this and similar updates, as a way to measure the success of these changes and justify them.
- When installations apply these changes, which ones have published datasets whose Terms metadata will be considered "Custom Terms" because they have CC0 waivers (or, for some forked installations, CCBY licenses) plus any of the eight "Terms of Use" fields filled? And how many of these datasets has each installation published? We could use these numbers to get a sense of the scale of the change for each installation, and the numbers might encourage each installation to look into the effects of this change.

## Methods

The Terms metadata published by 56 Dataverse installations (of the known 73) is recorded in the terms_metadata.tab file. That tabular file was created by:
- Downloading all zip files in the dataset at https://doi.org/10.7910/DVN/DCDKZQ. That dataset contains the JSON metadata files collected between August 4 and August 7, 2021 using a Python script. The methods for getting this metadata are described in the dataset.
- Using another Python script to extract the JSON metadata files in each zip file into a single directory.
- Parsing the terms metadata of every JSON metadata file in that directory into a single CSV file using the [parse_terms_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_terms_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Parsing the value of each JSON file's "publisher" key, which indicates the installation that published the metadata in each JSON file, into a single CSV file using the [parse_basic_metadata.py](https://github.com/jggautier/dataverse_scripts/blob/main/get-dataverse-metadata/parse_metadata_fields/parse_basic_metadata.py) script in the https://github.com/jggautier/dataverse_scripts repository.
- Joining both CSV files into a single CSV file that contains the persistent URLs, dataset version IDs, installation names (publishers), and Terms metadata for every published dataset version in the 56 installations, and converting that CSV file into a .tab file (so that it's easier to make this notebook accessible using the Dataverse software's [binder integration](https://guides.dataverse.org/en/5.7/admin/integrations.html?highlight=integrations#binder)).
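
The final join-and-convert step could look roughly like the sketch below. It is not the script that was actually used: the two small dataframes are hypothetical stand-ins for the CSV files produced by the parsing scripts, their columns are trimmed to a handful, and the result is written to an in-memory buffer instead of the real terms_metadata.tab file.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files produced by the parsing scripts
terms = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.7910/DVN/5PRYPC'],
    'datasetVersionId': [198147],
    'license': ['CC0'],
    'termsOfUse': ['CC0 Waiver'],
})
basic = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.7910/DVN/5PRYPC'],
    'datasetVersionId': [198147],
    'publisher': ['Harvard Dataverse'],
})

# Join on the two columns that together identify a dataset version
joined = terms.merge(basic, on=['persistentUrl', 'datasetVersionId'], how='inner')

# Write tab-separated values (the .tab format) to an in-memory buffer
buffer = io.StringIO()
joined.to_csv(buffer, sep='\t', index=False)
tabText = buffer.getvalue()
```

Joining on both the persistent URL and the dataset version ID keeps each version's Terms metadata paired with the right publisher, even when a dataset has several versions.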

This notebook uses the pandas, numpy and plotly Python packages to filter, reshape and visualize the data in the terms_metadata.tab file in order to help answer the two questions in the Goals section of this notebook.

## Exploration

In [106]:
# Import Python packages
from datetime import datetime,date
from functools import reduce
import numpy as np
import pandas as pd
import panel as pn
import plotly.express as px

pn.extension()

### Importing and preparing the data

In [72]:
# Import data as a dataframe
allVersions = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False)

In [73]:
allVersions.head()

Unnamed: 0,datasetVersionId,persistentUrl,persistent_id,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,...,availabilityStatus,contactForAccess,sizeOfCollection,studyCompletion,datasetPublicationDate,versionCreateTime,versionState,majorVersionNumber,minorVersionNumber,publisher
0,256773,https://doi.org/10.15454/GHAUFR,doi:10.15454/GHAUFR,NONE,CC BY 2.0,,,,,,...,,,,,2019-02-15,2019-09-18T22:00:00Z,RELEASED,3,0,Portail Data INRAE
1,52619,https://doi.org/10.15454/TLXRVW,doi:10.15454/TLXRVW,NONE,CC BY 2.0,,,,,,...,,,,,2019-09-11,2018-09-12T03:19:24Z,RELEASED,1,0,Portail Data INRAE
2,170336,https://doi.org/10.15454/CX70I3,doi:10.15454/CX70I3,NONE,CC BY 2.0,,,,,,...,,,,,2019-01-21,2019-01-21T10:54:41Z,RELEASED,2,0,Portail Data INRAE
3,1200,https://doi.org/10.21410/7E4/4WG94W,doi:10.21410/7E4/4WG94W,NONE,,,,,,,...,,,,,2020-05-05,2020-05-13T16:06:28Z,RELEASED,2,1,data.sciencespo
4,198147,https://doi.org/10.7910/DVN/5PRYPC,doi:10.7910/DVN/5PRYPC,CC0,CC0 Waiver,,,,,,...,,,,,2020-05-27,2020-06-17T23:49:50Z,RELEASED,4,0,Harvard Dataverse


In [74]:
print('Number of total dataset versions: %s' % (len(allVersions)))
print('Number of unique datasets: %s' % (len(allVersions.persistentUrl.unique())))

Number of total dataset versions: 382596
Number of unique datasets: 155719


**About dataset versioning and Terms metadata**

While we know from experience that some datasets published in Dataverse repositories contain versions where each version has its own terms metadata (such as a dataset where each version represents a wave of the same longitudinal research study and data re-users should consider each dataset version's license), let's assume that for a large majority of datasets, the Terms metadata of the latest versions have replaced the metadata of all previous versions.

For example, if a dataset has two published versions, the first version has no CC0 waiver, and the second version has a CC0 waiver, the CC0 waiver is what should apply to the dataset (and one could argue that previous versions should have been deaccessioned).

Since versioning in the Dataverse software is designed more as a way to record improvements to a dataset over time, as opposed to a way to publish disparate but connected datasets (such as waves in a longitudinal study), we'll assume that this is how most depositors use dataset versioning. Additionally, the Dataverse software favors the latest published version of each dataset, tending to expose the metadata in the latest published version more than it exposes the metadata of previously published versions (such as through indexing and search result displays, metadata distribution, and how PIDs direct users to the latest dataset version).

So let's get and consider the Terms metadata for only the latest version of each dataset.

In [132]:
# Get only metadata for the latest version of each dataset.
# idxmax returns index labels, so select the rows with .loc rather than .iloc
latestversion = (allVersions
                 .loc[allVersions
                    .groupby('persistentUrl')['datasetVersionId']
                    .idxmax()]
                 .sort_values(by=['publisher'], ascending=True)
                 .reset_index(drop=True))
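
To make the selection step easier to follow, here is a minimal sketch on toy data (the dataframe and its values are made up for illustration). Like the cell above, it relies on the assumption that a higher datasetVersionId means a later version.

```python
import pandas as pd

# Toy data: dataset A has two versions, dataset B has one
toyVersions = pd.DataFrame({
    'persistentUrl': ['A', 'A', 'B'],
    'datasetVersionId': [10, 25, 7],
    'license': ['NONE', 'CC0', 'CC0'],
})

# For each dataset, get the index label of the row with the largest version ID
latestLabels = toyVersions.groupby('persistentUrl')['datasetVersionId'].idxmax()

# Keep only those rows, one per dataset
toyLatest = toyVersions.loc[latestLabels].reset_index(drop=True)
```

Dataset A keeps only its second version (the one with the CC0 waiver), which matches the example in the text above.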

In [133]:
# Replace any blank values with NaN, making it easier to count and sort later
latestversion = latestversion.replace(r'^\s*$', np.nan, regex=True)

# Dartmouth's repository exports JSON files with "Root" in the "publisher" key. Replace "Root" with "Dartmouth"
latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')

# Add column with just the date (no time) of the versionCreateTime for each dataset
latestversion['versionCreateDate'] = pd.to_datetime(latestversion['versionCreateTime']).dt.date

# Make sure values in the two date columns are interpreted as dates
dateColumns = ['versionCreateTime', 'versionCreateDate']
latestversion[dateColumns] = latestversion[dateColumns].apply(pd.to_datetime)

In [134]:
print('Number of datasets: %s' % (len(latestversion)))
print('Number of installations: %s' % (len(latestversion.publisher.unique())))

Number of datasets: 155719
Number of installations: 56


### Question 1: How many datasets published by each Dataverse installation include any information about how the data can be used?

We'll be exploring the Terms metadata of 155,719 datasets in 56 Dataverse installations. Let's count the datasets in each installation so that we can compare those counts to the counts of datasets whose metadata includes any Terms (including standard licenses in any Terms field).

In [135]:
countDatasetsByInstallation = latestversion.value_counts(subset=['publisher']).to_frame('count of datasets')
countDatasetsByInstallation.head()

publisher,count of datasets
Portail Data INRAE,72010
Harvard Dataverse,44057
openforestdata.pl,10116
UNC Dataverse,4537
Scholars Portal Dataverse,4096


How many of those datasets have metadata with any Terms information?

In [136]:
# Retain only columns containing the publisher name, dataset PID, the date columns when the latest versions were published, and any values in the license and Terms of Use fields
latestversion = latestversion[[
    'publisher', 'persistentUrl', 'versionCreateTime', 'versionCreateDate', 'license', 'termsOfUse', 'confidentialityDeclaration', 'specialPermissions',
    'restrictions', 'citationRequirements', 'depositorRequirements', 'conditions', 'disclaimer'
]]

In [137]:
latestversion.head()

Unnamed: 0,publisher,persistentUrl,versionCreateTime,versionCreateDate,license,termsOfUse,confidentialityDeclaration,specialPermissions,restrictions,citationRequirements,depositorRequirements,conditions,disclaimer
0,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,2019-11-14 18:21:51+00:00,2019-11-14,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
1,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,2021-01-06 16:02:23+00:00,2021-01-06,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
2,ACSS Dataverse,https://doi.org/10.25825/FK2/VNAJ1I,2019-11-18 21:23:56+00:00,2019-11-18,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
3,ACSS Dataverse,https://doi.org/10.25825/FK2/VEO88P,2019-11-13 15:13:57+00:00,2019-11-13,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,
4,ACSS Dataverse,https://doi.org/10.25825/FK2/VEANNS,2019-11-12 20:34:20+00:00,2019-11-12,NONE,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,


In [139]:
# List unique licenses in all repositories' license fields
latestversion.license.unique()

array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)

In [140]:
# Create dataframe containing only datasets with something in their license fields or any Terms of Use fields
latestversionAnyTerms = (latestversion
                    .query('license == "CC0" or license == "CCBY" or license == "CC BY" or termsOfUse.notnull() or confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull()')
                   .reset_index(drop = True, inplace = False)
                   )
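
The same kind of filter can also be written with boolean masks instead of one long query string, which some readers may find easier to check. This is only a sketch on made-up toy data with two of the Terms of Use columns, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy data using a few of the column names from this notebook
toy = pd.DataFrame({
    'license': ['NONE', 'CC0', 'NONE', np.nan],
    'termsOfUse': [np.nan, 'CC0 Waiver', 'Custom text', np.nan],
    'restrictions': [np.nan, np.nan, np.nan, np.nan],
})

termsColumns = ['termsOfUse', 'restrictions']

# True where the license field holds one of the standard license values
hasLicense = toy['license'].isin(['CC0', 'CCBY', 'CC BY'])

# True where at least one of the Terms of Use fields is not null
hasAnyTermsField = toy[termsColumns].notna().any(axis=1)

# Keep rows that satisfy either condition
toyAnyTerms = toy[hasLicense | hasAnyTermsField].reset_index(drop=True)
```

Listing the columns once in termsColumns avoids repeating each field name in the condition.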

How many of those datasets are in each installation?

In [141]:
countDatasetsWithTermsByInstallation = latestversionAnyTerms.value_counts(subset=['publisher']).to_frame('count of datasets with any Terms')
countDatasetsWithTermsByInstallation.head()

publisher,count of datasets with any Terms
Portail Data INRAE,72005
Harvard Dataverse,34447
openforestdata.pl,10042
UNC Dataverse,4284
RIN Dataverse,4016


I think it might be helpful for some installations to see which datasets have no Terms metadata, so let's query for that and export the results to a CSV file.

In [142]:
# Create dataframe containing only datasets with nothing in their license fields or any Terms of Use fields
latestversionNoTerms = (latestversion
                        .query('(license == "NONE" or license.isnull()) and (termsOfUse.isnull() and confidentialityDeclaration.isnull() and specialPermissions.isnull() and restrictions.isnull() and citationRequirements.isnull() and depositorRequirements.isnull() and conditions.isnull() and disclaimer.isnull())')
                        .reset_index(drop = True, inplace = False)
                        )

print('Number of datasets with no Terms metadata: %s' % (len(latestversionNoTerms)))

# Export dataframe to a CSV file
latestversionNoTerms.to_csv('latestversionNoTerms.csv', index=False)

Number of datasets with no Terms metadata: 12810
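
As a quick worked calculation, the two totals reported in this notebook (155,719 datasets overall and 12,810 with no Terms metadata) give the overall share for Question 1. The variable names below are introduced only for this sketch.

```python
# Figures reported earlier in this notebook
totalDatasets = 155719
datasetsWithNoTerms = 12810

shareWithNoTerms = datasetsWithNoTerms / totalDatasets
shareWithAnyTerms = 1 - shareWithNoTerms

print('Share of datasets with no Terms metadata: %.1f%%' % (shareWithNoTerms * 100))
print('Share of datasets with any Terms metadata: %.1f%%' % (shareWithAnyTerms * 100))
```

So roughly 8.2% of the datasets have no Terms metadata, and roughly 91.8% have at least some.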


### Question 2: When installations apply the changes in the "multiple license" software update, which ones have published datasets whose Terms metadata will be considered "Custom Terms"? And how many of these datasets has each installation published?

Again, we could use these numbers to get a sense of the scale of the change for each installation, and the numbers might encourage each installation to investigate the effects of this change.

In [143]:
# Create a new dataframe with the datasets that have a value in the licenses field (that is, their license field is not null or "NONE") plus one or more values in the Terms of Use fields
datasetsWithCustomTerms = (latestversion
                            .query('(license == "CC0" or license == "CCBY" or license == "CC BY") and (confidentialityDeclaration.notnull() or specialPermissions.notnull() or restrictions.notnull() or citationRequirements.notnull() or depositorRequirements.notnull() or conditions.notnull() or disclaimer.notnull())')
                            .reset_index(drop = True, inplace = False)
                            )

print('Number of datasets with a license plus values in one or more Terms of Use fields: %s' % (len(datasetsWithCustomTerms)))

Number of datasets with a license plus values in one or more Terms of Use fields: 621


In [146]:
datasetsWithCustomTerms.head()
datasetsWithCustomTerms.to_csv('datasetsWithCustomTerms.csv', index=False)

In [148]:
# Get count of these datasets by versionCreateTime
dailyCountOfDatasetsWithCustomTerms = (

    # Retain only these columns
    datasetsWithCustomTerms[[
        'versionCreateDate', 'persistentUrl']]

        # Group versionCreateDate column to get count of persistentUrls each day
        .value_counts(subset=['versionCreateDate'])
        .to_frame('count of datasets published')

        .sort_values(by=['versionCreateDate'], inplace=False, ascending=True)
        .reset_index(drop=False, inplace=False)
)

dailyCountOfDatasetsWithCustomTerms.head(20)


Unnamed: 0,versionCreateDate,count of datasets published
0,2013-03-15,1
1,2013-03-25,1
2,2013-03-26,1
3,2013-04-24,1
4,2013-06-12,1
5,2014-08-20,1
6,2014-09-03,1
7,2014-10-10,1
8,2015-06-11,1
9,2015-08-31,1


In [150]:
fig = px.line(dailyCountOfDatasetsWithCustomTerms, x="versionCreateDate", y="count of datasets published", title='Datasets published with "Custom Terms"')
fig.show()

How many of those are in each Dataverse installation?

In [66]:
countDatasetsWithCustomTermsByInstallation = datasetsWithCustomTerms.value_counts(subset=['publisher']).to_frame('count of datasets with "Custom Terms"')
countDatasetsWithCustomTermsByInstallation.head()

publisher,"count of datasets with ""Custom Terms"""
Harvard Dataverse,394
UNC Dataverse,131
Scholars Portal Dataverse,51
Portail Data INRAE,6
Peking University Open Research Data Platform,6


Finally, let's join all of the dataframes we've created so we can see the counts of datasets for each query for each installation.

In [70]:
dataframes = [
    countDatasetsByInstallation,
    countDatasetsWithTermsByInstallation,
    countDatasetsWithCustomTermsByInstallation]

allCountsByInstallation = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Format allCountsByInstallation dataframe to replace NA values with 0, cast values as integers and sort in descending order by number of datasets with any Terms
allCountsByInstallation = (allCountsByInstallation
                           .fillna(0)
                           .astype('int32')
                           .sort_values([
                                'count of datasets with any Terms',
                                'count of datasets',
                                'count of datasets with "Custom Terms"'], ascending=False)
                           )
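
The outer join matters because an installation with no "Custom Terms" datasets never appears in that count table. A minimal sketch on made-up count tables (installations A and B are hypothetical) shows what the join and the fillna step do:

```python
from functools import reduce
import pandas as pd

# Toy count tables indexed by publisher; installation B has no "Custom Terms" datasets
total = pd.DataFrame(
    {'count of datasets': [10, 5]},
    index=pd.Index(['A', 'B'], name='publisher'))
custom = pd.DataFrame(
    {'count of datasets with "Custom Terms"': [3]},
    index=pd.Index(['A'], name='publisher'))

# An outer join keeps every publisher found in either table;
# B gets NaN in the "Custom Terms" column because it is missing from that table
joinedCounts = reduce(lambda left, right: left.join(right, how='outer'), [total, custom])

# Replace the NaN with 0 and cast the counts back to integers
cleanCounts = joinedCounts.fillna(0).astype('int32')
```

Without the fillna step, the missing count for B would stay NaN and the column would be stored as floats.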

In [71]:
allCountsByInstallation.head(60)

publisher,count of datasets,count of datasets with any Terms,"count of datasets with ""Custom Terms"""
Portail Data INRAE,72010,72005,6
Harvard Dataverse,44057,34447,394
openforestdata.pl,10116,10042,2
UNC Dataverse,4537,4284,131
RIN Dataverse,4017,4016,0
Scholars Portal Dataverse,4096,3753,51
Abacus Data Network,2191,2190,0
ADA Dataverse,1583,1583,0
AUSSDA,1147,1147,0
DataverseNL,1865,1138,4


In [29]:
# Export the allCountsByInstallation dataframe to a CSV file
fileName = 'allCountsByInstallation.csv'
allCountsByInstallation.to_csv(fileName, index=True)