<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Plan" data-toc-modified-id="Plan-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Plan</a></span></li><li><span><a href="#Code" data-toc-modified-id="Code-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Code</a></span><ul class="toc-item"><li><span><a href="#Import-modules-and-load-functions" data-toc-modified-id="Import-modules-and-load-functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import modules and load functions</a></span></li><li><span><a href="#Get-dataverse-info" data-toc-modified-id="Get-dataverse-info-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Get dataverse info</a></span></li><li><span><a href="#Get-dataset-info" data-toc-modified-id="Get-dataset-info-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Get dataset info</a></span></li></ul></li></ul></div>

## Plan

For chosen dataverse:
    - Show dataverse info:
        - Show number of subdataverses, if any
        - Show whether or not the (sub)dataverse has a description or tagline
        - List metadatablocks of each (sub)dataverse
        - List the facets set for each (sub)dataverse
        - Verify that contact email address is valid for each (sub)dataverse
        (see https://medium.com/@arjunsinghy96/verify-emails-over-socks-proxy-using-python-5589cb75c405
        and https://github.com/Gardener-gg/email-verifier)
    - Show dataset info
        - Show number of datasets
        - Show date of first published dataset
        - Show date of most recently published or updated dataset
        - Show average age of datasets
        - Show average number of dataset versions
        - Metadata (of latest published version of each dataset):
            - Show average number of characters in the dataset descriptions
                - List datasets with fewer than a certain number of characters in their descriptions
            - List datasets with CC0 or Terms of Use metadata
                - Versus number of datasets with no CC0 or TOU metadata
                - List datasets with no CC0 or TOU metadata
            - Show number of files that have no description metadata
                - If a certain percentage of datasets have 1 or more files with no descriptions, list those datasets
            - Related publication metadata
                - Show number of datasets with related publication metadata
                    - List datasets with no related publication metadata
                - Show number of datasets with no PID in related publication metadata
                    - List datasets with no PID in related publication metadata
            - Show datasets that have no metadata for any non-citation metadatablocks enabled in the dataverse
        - Data
            - List count of each unique file format
            - Show number of datasets with no files
                - List datasets that have no files
            - Show number of datasets with 1 or more uningested tabular files
                - List datasets that contain 1 or more uningested tabular files
            - Show number of datasets with 1 or more restricted files
        - Contact emails
            - Get number of datasets that have a contact email address different from the dataverse contact email address
            - Get unique list of contact email addresses and check if they're valid
            - Show any datasets that have no valid email addresses
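
The email checks in the plan could start with a syntax-only filter before any network-based verification. The sketch below is a minimal illustration, not the method used by the tools linked above: `looks_like_email`, `unique_contact_emails` and the regular expression are hypothetical names and choices, and passing the check says nothing about whether a mailbox actually exists.

```python
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 check):
# exactly one "@", no whitespace, and at least one dot in the domain part.
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def looks_like_email(address):
    """Return True if `address` is a string shaped like an email address."""
    return isinstance(address, str) and EMAIL_PATTERN.match(address.strip()) is not None


def unique_contact_emails(addresses):
    """Return the unique addresses, compared case-insensitively, in sorted order."""
    return sorted({a.strip().lower() for a in addresses if isinstance(a, str)})
```

Addresses that fail this filter can be reported right away; the ones that pass would still need a deliverability check.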

## Code

### Import modules and load functions

In [1]:
from datetime import datetime, timezone
import numpy as np
import pandas as pd
import requests
import time


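def improved_get(_dict, path, default=None):
    # Walk a dot-separated path of keys through nested dictionaries, returning
    # default if a key is missing or a value along the path is not a dictionary
    for key in path.split('.'):
        try:
            _dict = _dict[key]
        except (KeyError, TypeError):
            return default
    return _dict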


def list_to_string(items):
    # Alphabetize list in case-insensitive way
    items = sorted(items, key=lambda s: s.casefold())
    # Change list to comma-separated string
    delimiter = ","
    string = delimiter.join(items)
    return string


def string_to_datetime(string):
    newDatetime = datetime.strptime(string, '%Y-%m-%dT%H:%M:%S%z')
    return newDatetime

# current_time = str.strftime('%Y.%m.%d_%H.%M.%S')
# current_time = datetime.strftime('%Y-%m-%dT%H:%M:%S%z')
# currentTime = datetime.now()
currentTime = datetime.now(timezone.utc)
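
As a quick illustration of how the timestamp parsing and the timezone-aware `currentTime` fit together, here is a small self-contained sketch. The `age_in_days` helper and the fixed reference time are hypothetical, and the input is assumed to look like the `releaseTime` values used later (for example `2020-10-21T18:23:49Z`); on Python 3.7 and later, `%z` accepts the trailing `Z` as UTC.

```python
from datetime import datetime, timezone


def age_in_days(release_time, now):
    """Return the number of whole days between an ISO-style timestamp and `now`."""
    released = datetime.strptime(release_time, '%Y-%m-%dT%H:%M:%S%z')
    return (now - released).days


# Fixed reference time so the result is reproducible
reference_time = datetime(2020, 10, 28, 12, 0, 0, tzinfo=timezone.utc)
example_age = age_in_days('2020-10-21T18:23:49Z', reference_time)
print(example_age)
```

Both datetimes must be timezone-aware; subtracting a naive `datetime.now()` from a parsed timestamp raises a `TypeError`, which is why `currentTime` is created with `timezone.utc`.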


### Get dataverse info

In [19]:
# Get dataverse server and alias from user - return error if there's no alias or if alias is the Root dataverse
# server = 'https://demo.dataverse.org'
# dataverseAlias = 'sefsef'
server = 'https://dataverse.harvard.edu'
dataverseAlias = 'IOJ'

repositoryMetadataBlocksApi = '%s/api/v1/metadatablocks' % (server)
response = requests.get(repositoryMetadataBlocksApi)
repositoryMetadataBlocks = response.json()

repositoryMetadataBlockNames = []
for repositoryMetadataBlock in repositoryMetadataBlocks['data']:
    repositoryMetadataBlockNames.append(repositoryMetadataBlock['name'])

In [20]:
# Get info from that dataverse: whether or not the dataverse has a description and/or tagline, metadatablocks enabled, facets enabled, validate contact email
dataverseInfoApi = '%s/api/dataverses/%s' % (server, dataverseAlias)
response = requests.get(dataverseInfoApi)
dataverseMetadata = response.json()

In [21]:
if dataverseMetadata['status'] == 'ERROR':
    print('No dataverse found. Is the dataverse published on Harvard Dataverse?')
elif dataverseMetadata['status'] == 'OK':
    dataverseDescription = improved_get(dataverseMetadata, 'data.description')
    if dataverseDescription is not None:
        print('Dataverse has a description')
    else:
        print('Dataverse has no description')

    tagline = improved_get(dataverseMetadata, 'data.theme.tagline')
    if tagline is not None:
        print('Dataverse has a tagline')
    else:
        print('Dataverse has no tagline')

#     contactEmails = []
#     for contact in dataverseMetadata['data']['dataverseContacts']:
#         contactEmails.append(contact['contactEmail'])
#     print(contactEmails)

    dataverseFacetsApi = '%s/api/dataverses/%s/facets' % (server, dataverseAlias)
    response = requests.get(dataverseFacetsApi)
    dataverseFacets = response.json()
    facets = []
    for facet in dataverseFacets['data']:
        facets.append(facet)
    print('Number of search facets used: %s' % (len(facets)))    

#     # See if dataverse inherits its metadatablocks from its parent dataverse
#     metadatablocksInheritedApi = '%s/api/dataverses/%s/metadatablocks/isRoot' % (server, dataverseAlias)
#     response = requests.get(metadatablocksInheritedApi)
#     metadatablocksInherited = response.json()
#     print(metadatablocksInherited)
    
    # Get list of metadatablocks enabled in the dataverse
    dataverseMetadatablocksList = []
    dataverseMetadatablocksApi = '%s/api/dataverses/%s/metadatablocks' % (server, dataverseAlias)
    response = requests.get(dataverseMetadatablocksApi)
    dataverseMetadatablocks = response.json()
    for dataverseMetadatablock in dataverseMetadatablocks['data']:
        dataverseMetadatablock = dataverseMetadatablock['name']
        dataverseMetadatablocksList.append(dataverseMetadatablock)
    print('Number of metadatablocks enabled (in addition to Citation): %s' % (len(dataverseMetadatablocksList) - 1))


Dataverse has a description
Dataverse has a tagline
Number of search facets used: 13
Number of metadatablocks enabled (in addition to Citation): 0


### Get dataset info

In [22]:
# Get PIDs of all published datasets and their files in that dataverse
getContentsApi = '%s/api/dataverses/%s/contents' % (server, dataverseAlias)
response = requests.get(getContentsApi)
dataverseContents = response.json()


In [23]:
datasetPIDs = []
for item in dataverseContents['data']:
    if item['type'] == 'dataset':
        datasetPID = item['persistentUrl'].replace('https://doi.org/', 'doi:')
        datasetPIDs.append(datasetPID)
print('Number of datasets: %s' % (len(datasetPIDs)))


Number of datasets: 48
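
The plan also calls for the number of subdataverses, which can be read from the same contents response. The sketch below assumes that sub-dataverses appear in `data` with `type` equal to `'dataverse'`; the `count_content_types` helper and the sample response are hypothetical and trimmed to the keys used here.

```python
from collections import Counter


def count_content_types(contents):
    """Count the items in a dataverse 'contents' response by their 'type' value."""
    return Counter(item.get('type', 'unknown') for item in contents.get('data', []))


# Hypothetical response, trimmed to the keys used here
sample_contents = {
    'status': 'OK',
    'data': [
        {'type': 'dataset', 'persistentUrl': 'https://doi.org/10.7910/DVN/WEYEUM'},
        {'type': 'dataset', 'persistentUrl': 'https://doi.org/10.7910/DVN/1GGBBK'},
        {'type': 'dataverse', 'title': 'Example subdataverse'},
    ],
}

type_counts = count_content_types(sample_contents)
print('Number of datasets: %s' % type_counts['dataset'])
print('Number of subdataverses: %s' % type_counts['dataverse'])
```

Note that the contents listing covers only direct children, so datasets inside sub-dataverses would need a separate request per sub-dataverse.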


Create a dataframe for dataset info: date of publication, the release date of the latest version, number of versions

_Getting this info can be slow. For example, retrieving the metadata of about 375 datasets can take around 45 minutes._
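
One possible way to shorten the wait is to issue the per-dataset requests from a small thread pool instead of one at a time. This is only a sketch under assumptions: `fetch_all` and `fake_fetch` are hypothetical names, the DOIs are placeholders, and the real function passed in would wrap the `requests.get(...).json()` call. Whether the server tolerates concurrent requests is not established here, so the worker count is kept low.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(pids, fetch_one, max_workers=4):
    """Call fetch_one(pid) for every PID using a small thread pool.

    Returns a dict mapping each PID to its result, in the input order.
    """
    pids = list(pids)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch_one, pids))
    return dict(zip(pids, results))


def fake_fetch(pid):
    # Stand-in for a real request; returns a response-shaped dictionary
    return {'status': 'OK', 'data': [], 'pid': pid}


# Placeholder DOIs for illustration only
example_pids = ['doi:10.5072/FK2/AAAAAA', 'doi:10.5072/FK2/BBBBBB']
responses = fetch_all(example_pids, fake_fetch)
```

The per-dataset processing can then loop over `responses.items()` exactly as the sequential loop does over each response.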

In [24]:
# Content types of tabular files that Dataverse can convert to .tab files.
# A file that still has one of these content types has not been ingested.
uningestedFileTypes = ['application/x-rlang-transport', 'application/x-stata-13', 'application/x-spss-por',
                      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'text/csv', 'text/tsv',
                      'application/x-spss-sav', 'text/comma-separated-values', 'application/x-stata',
                      'application/x-stata-14']

rowList = []
datasetCount = 0
for datasetPID in datasetPIDs:
    getAllVersionsApi = '%s/api/datasets/:persistentId/versions?persistentId=%s' % (server, datasetPID)
    response = requests.get(getAllVersionsApi)
    datasetVersions = response.json()
    
    # Get only datasets with metadata (exclude responses with no values in 'data' key, e.g. deaccessioned datasets)
    if datasetVersions['status'] == 'OK' and len(datasetVersions['data']) > 0:

        # Get index location of first dataset version
        firstVersion = len(datasetVersions['data']) - 1

        publicationDate = string_to_datetime(datasetVersions['data'][firstVersion]['releaseTime'])
        latestReleaseDate = string_to_datetime(datasetVersions['data'][0]['releaseTime'])
        
        # Get age of dataset from today's date
        delta = currentTime - publicationDate
        ageOfDataset = delta.days
        
        # Get number of days since last update
        delta = currentTime - latestReleaseDate
        ageOfLastUpdate = delta.days
        
        # Get length of description text
        descriptionLength = 0
        for field in datasetVersions['data'][0]['metadataBlocks']['citation']['fields']:
            if field['typeName'] == 'dsDescription':
                for i in field['value']:
                    descriptionLength = descriptionLength + len(i['dsDescriptionValue']['value'])

        # See whether CC0 or Terms of Use metadata exists
        license = datasetVersions['data'][0].get('license')
        termsOfUse = 'termsOfUse' in datasetVersions['data'][0]
        termsExist = license == 'CC0' or termsOfUse

        # Get info about related publication metadata
        relPubCount = 0
        relPubPIDCount = 0
        for field in datasetVersions['data'][0]['metadataBlocks']['citation']['fields']:
            if field['typeName'] == 'publication':
                for value in field['value']:
                    relPubCount += 1
                    # Count a PID only when both the ID type and the ID number are present
                    if 'publicationIDType' in value and 'publicationIDNumber' in value:
                        relPubPIDCount += 1
        
        # Show metadatablocks whose fields are used by the dataset
        usedMetadataBlocks = []
        datasetMetadataBlocks = datasetVersions['data'][0]['metadataBlocks']
        for repositoryMetadataBlockName in repositoryMetadataBlockNames:
            blockFields = datasetMetadataBlocks.get(repositoryMetadataBlockName, {}).get('fields', [])
            if len(blockFields) > 0:
                usedMetadataBlocks.append(repositoryMetadataBlockName)
        # Store as a comma-separated string (empty string if no blocks are used)
        usedMetadataBlocks = list_to_string(usedMetadataBlocks)
        
        # Get number of files
        numberOfFiles = len(datasetVersions['data'][0]['files'])

        # Get file info
        noFileDescriptionCount = 0
        contentType = []
        ingestedTabFilesCount = 0
        uningestedTabFilesCount = 0
        restrictedFilesCount = 0
        fileTags = []
        for file in datasetVersions['data'][0]['files']:
            if 'description' not in file:
                noFileDescriptionCount += 1
            contentType.append(file['dataFile']['contentType'])
            if file['restricted']:
                restrictedFilesCount += 1
            if file['dataFile']['contentType'] in uningestedFileTypes:
                uningestedTabFilesCount += 1
            if file['dataFile']['contentType'] == 'text/tab-separated-values':
                ingestedTabFilesCount += 1
            # Files without tags have no 'categories' key
            fileTags.extend(file.get('categories', []))

        fileTagsExist = len(fileTags) > 0

        if len(contentType) == 0:
            uniqueContentTypes = 'NA'
        else:
            uniqueContentTypes = list_to_string(list(set(contentType)))

        # Create dictionary
        newRow = {'datasetPID': datasetPID,
                  'datasetPIDUrl' : datasetPID.replace('doi:', 'https://doi.org/'),
                  'numberOfVersions': len(datasetVersions['data']),
                  'publicationDate': string_to_datetime(datasetVersions['data'][firstVersion]['releaseTime']),
                  'latestReleaseDate': string_to_datetime(datasetVersions['data'][0]['releaseTime']),
                  'ageOfDataset(Days)': ageOfDataset,
                  'ageOfLastUpdate(Days)': ageOfLastUpdate,
                  'descriptionLenth': descriptionLength,
                  'license': license,
                  'termsExist': termsExist,
                  'relPubCount': relPubCount,
                  'relPubPIDCount': relPubPIDCount,
                  'usedMetadataBlocks': usedMetadataBlocks,
                  'numberOfFiles': numberOfFiles,
                  'noFileDescriptionCount': noFileDescriptionCount,
                  'fileTagsExist': fileTagsExist,
                  'uniqueContentTypes': uniqueContentTypes,
                  'ingestedTabFilesCount': ingestedTabFilesCount,
                  'uningestedTabFilesCount': uningestedTabFilesCount,
                  'restrictedFilesCount': restrictedFilesCount
                 }
        rowList.append(dict(newRow))
        datasetCount += 1
        print('%s of %s (%s)' % (datasetCount, len(datasetPIDs), datasetPID))
        
if len(datasetPIDs) != datasetCount:
    print('The metadata of %s dataset(s) could not be retrieved' % (len(datasetPIDs) - datasetCount))


1 of 48 (doi:10.7910/DVN/WEYEUM)
2 of 48 (doi:10.7910/DVN/1GGBBK)
3 of 48 (doi:10.7910/DVN/1NTFKO)
4 of 48 (doi:10.7910/DVN/U6ZBWT)
5 of 48 (doi:10.7910/DVN/GAAKFB)
6 of 48 (doi:10.7910/DVN/J0PGNY)
7 of 48 (doi:10.7910/DVN/XJP8WL)
8 of 48 (doi:10.7910/DVN/1IHXRB)
9 of 48 (doi:10.7910/DVN/SRXJLN)
10 of 48 (doi:10.7910/DVN/EI1JMG)
11 of 48 (doi:10.7910/DVN/J7O3EM)
12 of 48 (doi:10.7910/DVN/VJTPJK)
13 of 48 (doi:10.7910/DVN/SYJZNG)
14 of 48 (doi:10.7910/DVN/PDM9UH)
15 of 48 (doi:10.7910/DVN/8EHQWP)
16 of 48 (doi:10.7910/DVN/J1WQHI)
17 of 48 (doi:10.7910/DVN/YCSFSC)
18 of 48 (doi:10.7910/DVN/KYBEM4)
19 of 48 (doi:10.7910/DVN/ZGM9HF)
20 of 48 (doi:10.7910/DVN/WOKBPF)
21 of 48 (doi:10.7910/DVN/S3K4RH)
22 of 48 (doi:10.7910/DVN/FD8O8W)
23 of 48 (doi:10.7910/DVN/WOED4N)
24 of 48 (doi:10.7910/DVN/3LEWIE)
25 of 48 (doi:10.7910/DVN/GBATE9)
26 of 48 (doi:10.7910/DVN/7BLIA3)
27 of 48 (doi:10.7910/DVN/5PEDGW)
28 of 48 (doi:10.7910/DVN/U3ZQYY)
29 of 48 (doi:10.7910/DVN/26HDEO)
30 of 48 (doi:10.7910/D

In [25]:
df = pd.DataFrame(rowList)               
df

Unnamed: 0,datasetPID,datasetPIDUrl,numberOfVersions,publicationDate,latestReleaseDate,ageOfDataset(Days),ageOfLastUpdate(Days),descriptionLenth,license,termsExist,relPubCount,relPubPIDCount,usedMetadataBlocks,numberOfFiles,noFileDescriptionCount,fileTagsExist,uniqueContentTypes,ingestedTabFilesCount,uningestedTabFilesCount,restrictedFilesCount
0,doi:10.7910/DVN/WEYEUM,https://doi.org/10.7910/DVN/WEYEUM,1,2020-10-21 18:23:49+00:00,2020-10-21 18:23:49+00:00,6,6,1292,CC0,True,0,0,citation,2,2,False,"application/x-stata-syntax,text/tab-separated-...",1,0,0
1,doi:10.7910/DVN/1GGBBK,https://doi.org/10.7910/DVN/1GGBBK,1,2020-10-19 17:53:58+00:00,2020-10-19 17:53:58+00:00,8,8,1072,CC0,True,0,0,citation,18,18,False,"application/dbf,application/pdf,application/sh...",8,0,0
2,doi:10.7910/DVN/1NTFKO,https://doi.org/10.7910/DVN/1NTFKO,2,2020-10-16 19:37:42+00:00,2020-10-16 20:57:31+00:00,11,11,1685,CC0,True,0,0,citation,6,6,False,"application/pdf,text/tab-separated-values,type...",4,0,0
3,doi:10.7910/DVN/U6ZBWT,https://doi.org/10.7910/DVN/U6ZBWT,1,2020-08-05 14:44:22+00:00,2020-08-05 14:44:22+00:00,83,83,41,CC0,True,0,0,citation,7,5,False,"application/x-stata-14,application/x-stata-syn...",2,2,0
4,doi:10.7910/DVN/GAAKFB,https://doi.org/10.7910/DVN/GAAKFB,1,2020-07-31 14:02:44+00:00,2020-07-31 14:02:44+00:00,89,89,100,CC0,True,1,0,citation,8,8,False,"application/x-stata-syntax,text/tab-separated-...",5,0,0
5,doi:10.7910/DVN/J0PGNY,https://doi.org/10.7910/DVN/J0PGNY,1,2020-07-20 13:20:12+00:00,2020-07-20 13:20:12+00:00,100,100,117,CC0,True,1,0,citation,11,11,False,"application/pdf,text/csv,text/plain,text/tab-s...",3,1,0
6,doi:10.7910/DVN/XJP8WL,https://doi.org/10.7910/DVN/XJP8WL,1,2020-06-25 19:23:02+00:00,2020-06-25 19:23:02+00:00,124,124,186,CC0,True,0,0,citation,4,4,False,"application/x-stata-syntax,text/tab-separated-...",2,0,0
7,doi:10.7910/DVN/1IHXRB,https://doi.org/10.7910/DVN/1IHXRB,1,2020-06-16 19:04:53+00:00,2020-06-16 19:04:53+00:00,133,133,272,CC0,True,1,0,citation,6,6,False,application/x-stata-syntax,0,0,0
8,doi:10.7910/DVN/SRXJLN,https://doi.org/10.7910/DVN/SRXJLN,1,2020-06-12 08:41:56+00:00,2020-06-12 08:41:56+00:00,138,138,151,CC0,True,0,0,citation,4,4,False,"application/x-stata-syntax,text/tab-separated-...",3,0,0
9,doi:10.7910/DVN/EI1JMG,https://doi.org/10.7910/DVN/EI1JMG,1,2020-06-05 14:19:18+00:00,2020-06-05 14:19:18+00:00,145,145,120,CC0,True,0,0,citation,20,20,False,"application/pdf,application/vnd.ms-excel,appli...",12,0,0


In [26]:
# Save dataframe to CSV
csvFileName = '%s_datasets_2.csv' % (dataverseAlias)
df.to_csv(csvFileName)

In [10]:
# Show date of first published dataset
publicationDates = df['publicationDate']
firstPublicationDate = publicationDates.min()

# Show date of most recently published or updated dataset
latestReleaseDates = df['latestReleaseDate']
lastReleaseDate = latestReleaseDates.max()

# Show age of most recently published or updated dataset
ageOfLastUpdates = df['ageOfLastUpdate(Days)']
ageOfLastUpdate = ageOfLastUpdates.min()

# Show average age of datasets
ageOfDatasets = df['ageOfDataset(Days)']
averageAge = ageOfDatasets.mean()

# Show average number of dataset versions
numberOfVersions = df['numberOfVersions']
averageNumberOfVersions = numberOfVersions.mean()

# Create list of datasets with fewer than a certain number of characters in their descriptions
lowDescriptionCount = df[df['descriptionLenth'] < 20]
lowDescriptionCount = lowDescriptionCount['datasetPID']

In [None]:
# List count of each unique file format
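
The cell above is left unfinished. One way it might be filled in, assuming the `uniqueContentTypes` column holds comma-separated content types with `'NA'` for datasets without files (as built earlier), is sketched below. The `count_formats` helper and the sample values are hypothetical. Because the column stores each dataset's unique types, the result counts datasets per format, not files per format.

```python
from collections import Counter


def count_formats(content_type_strings):
    """Count how many datasets contain each content type.

    Each input string is a comma-separated list of one dataset's unique
    content types; 'NA' marks a dataset with no files and is skipped.
    """
    counts = Counter()
    for entry in content_type_strings:
        if entry == 'NA':
            continue
        counts.update(part for part in entry.split(',') if part)
    return counts


# Hypothetical values standing in for df['uniqueContentTypes']
sample = ['application/pdf,text/csv', 'application/pdf', 'NA']
format_counts = count_formats(sample)
for content_type, count in format_counts.most_common():
    print('%s: %s' % (content_type, count))
```

With the notebook's dataframe, the call would be `count_formats(df['uniqueContentTypes'])`. A per-file count would instead require keeping the full `contentType` list for each dataset.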