<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Plan" data-toc-modified-id="Plan-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Plan</a></span></li><li><span><a href="#Code" data-toc-modified-id="Code-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Code</a></span><ul class="toc-item"><li><span><a href="#Import-modules-and-load-functions" data-toc-modified-id="Import-modules-and-load-functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import modules and load functions</a></span></li><li><span><a href="#Get-dataverse-info" data-toc-modified-id="Get-dataverse-info-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Get dataverse info</a></span></li><li><span><a href="#Get-aliases-of-any-sub-dataverses-in-the-given-dataverse" data-toc-modified-id="Get-aliases-of-any-sub-dataverses-in-the-given-dataverse-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Get aliases of any sub-dataverses in the given dataverse</a></span></li><li><span><a href="#Get-dataset-info" data-toc-modified-id="Get-dataset-info-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Get dataset info</a></span></li></ul></li></ul></div>

## Plan

For chosen dataverse:
    - Show dataverse info:
        - Show number of subdataverses, if any
        - Show whether or not the (sub)dataverse has a description or tagline
        - List metadatablocks of each (sub)dataverse
        - List the facets set for each (sub)dataverse
        - Verify that contact email address is valid for each (sub)dataverse
        (see https://medium.com/@arjunsinghy96/verify-emails-over-socks-proxy-using-python-5589cb75c405
        and https://github.com/Gardener-gg/email-verifier)
    - Show dataset info
        - Show number of datasets
        - Show date of first published dataset
        - Show date of most recently published or updated dataset
        - Show average age of datasets
        - Show average number of dataset versions
        - Metadata (of latest published version of each dataset):
            - Show average number of characters in the dataset descriptions
                - List datasets with fewer than a certain number of characters in their descriptions
            - List datasets with CC0 or Terms of Use metadata
                - Versus number of datasets with no CC0 or TOU metadata
                - List datasets with no CC0 or TOU metadata
            - Show number of files that have no description metadata
                - If a certain percentage of datasets have 1 or more files with no descriptions, list those datasets
            - Related publication metadata
                - Show number of datasets with related publication metadata
                    - List datasets with no related publication metadata
                - Show number of datasets with no PID in related publication metadata
                    - List datasets with no PID in related publication metadata
            - Show datasets that have no metadata for any non-citation metadatablocks enabled in the dataverse
        - Data
            - List count of each unique file format
            - Show number of datasets with no files
                - List datasets that have no files
            - Show number of datasets with 1 or more uningested tabular files
                - List datasets that contain 1 or more uningested tabular files
            - Show number of datasets with 1 or more restricted files
        - Contact emails
            - Get number of datasets whose contact email address differs from the dataverse contact email address
            - Get unique list of contact email addresses and check if they're valid
            - Show any datasets that have no valid email addresses
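
The contact email steps in the plan could start with a purely syntactic check before any attempt at mailbox verification. The sketch below is only an illustration: the function names, the deliberately loose regular expression, and the sample addresses are all assumptions, and a syntax check says nothing about whether an address can actually receive mail.

```python
import re

# Loose pattern (an assumption, not a full RFC 5322 validator):
# something@something.something with no spaces or extra "@" signs
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def looks_like_email(address):
    """Return True if the address passes the loose syntax check."""
    return bool(EMAIL_PATTERN.match(address.strip()))


def unique_emails(addresses):
    """Deduplicate addresses case-insensitively and return them sorted."""
    return sorted({a.strip().lower() for a in addresses if a.strip()})


# Hypothetical contact addresses, not taken from any dataverse
sample_addresses = ['Curator@Example.edu', ' curator@example.edu', 'not-an-email']

unique = unique_emails(sample_addresses)
valid = [address for address in unique if looks_like_email(address)]
print('Unique addresses: %s' % (unique))
print('Syntactically valid: %s' % (valid))
```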

## Code

### Import modules and load functions

In [287]:
from datetime import datetime, timezone
from functools import reduce
import numpy as np
import pandas as pd
import requests
import sys
import time


def improved_get(_dict, path, default=None):
    for key in path.split('.'):
        try:
            _dict = _dict[key]
        except KeyError:
            return default
    return _dict


def list_to_string(items):
    # Alphabetize list in case-insensitive way
    # (parameter is named 'items' so that it does not shadow the built-in list)
    items = sorted(items, key=lambda s: s.casefold())

    # Change list to comma-separated string
    delimiter = ","
    string = delimiter.join(items)
    return string


def string_to_list(string):
    # str.split already returns a list
    li = string.split(",")
    return li


def string_to_datetime(string):
    newDatetime = datetime.strptime(string, '%Y-%m-%dT%H:%M:%S%z')
    return newDatetime


currentTime = datetime.now(timezone.utc)
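
As a quick illustration of how two of these helpers behave, the sketch below re-declares condensed copies of them and applies them to a made-up nested dictionary. The sample dictionary and the list of block names are hypothetical and are not taken from a real Dataverse response.

```python
def improved_get(_dict, path, default=None):
    # Walk a dotted path such as 'data.theme.tagline' through nested dictionaries
    for key in path.split('.'):
        try:
            _dict = _dict[key]
        except KeyError:
            return default
    return _dict


def list_to_string(items):
    # Case-insensitive sort, then join into a comma-separated string
    return ','.join(sorted(items, key=lambda s: s.casefold()))


# Hypothetical nested data for illustration only
sample = {'data': {'theme': {'tagline': 'Example tagline'}}}

tagline = improved_get(sample, 'data.theme.tagline')
missing = improved_get(sample, 'data.description', default='')
blocks = list_to_string(['geospatial', 'Citation', 'astrophysics'])

print(tagline)
print(repr(missing))
print(blocks)
```

Returning a default for a missing path avoids wrapping every nested lookup in its own `try`/`except`.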


### Get dataverse info

In [288]:
# Get dataverse server and alias from user - return error if there's no alias or if alias is the Root dataverse
server = 'https://demo.dataverse.org'
mainDataverseAlias = 'sefsef'
# server = 'https://dataverse.harvard.edu'
# mainDataverseAlias = 'mit'

repositoryMetadataBlocksApi = '%s/api/v1/metadatablocks' % (server)
response = requests.get(repositoryMetadataBlocksApi)
repositoryMetadataBlocks = response.json()

repositoryMetadataBlockNames = []
for repositoryMetadataBlock in repositoryMetadataBlocks['data']:
    repositoryMetadataBlockNames.append(repositoryMetadataBlock['name'])

In [289]:
# Get info from that dataverse: whether or not the dataverse has a description and/or tagline, metadatablocks enabled, facets enabled, validate contact email
dataverseInfoApi = '%s/api/dataverses/%s' % (server, mainDataverseAlias)
response = requests.get(dataverseInfoApi)
dataverseMetadata = response.json()

In [290]:
if dataverseMetadata['status'] == 'ERROR':
    print('No dataverse found. Is the dataverse published on %s?' % (server))
elif dataverseMetadata['status'] == 'OK':
    if 'description' in dataverseMetadata['data']:
        dataverseMetadataExists = True
    else:
        dataverseMetadataExists = False
    print('Dataverse description exists: %s' % (dataverseMetadataExists))

    if 'theme' in dataverseMetadata['data'] and 'tagline' in dataverseMetadata['data']['theme']:
        taglineExists = True
    else:
        taglineExists = False
    print('Dataverse tagline exists: %s' % (taglineExists))

#     contactEmails = []
#     for contact in dataverseMetadata['data']['dataverseContacts']:
#         contactEmails.append(contact['contactEmail'])
#     print(contactEmails)

    dataverseFacetsApi = '%s/api/dataverses/%s/facets' % (server, mainDataverseAlias)
    response = requests.get(dataverseFacetsApi)
    dataverseFacets = response.json()
    facets = []
    for facet in dataverseFacets['data']:
        facets.append(facet)
    print('Number of search facets used: %s' % (len(facets)))    

#     # See if dataverse inherits its metadatablocks from its parent dataverse
#     metadatablocksInheritedApi = '%s/api/dataverses/%s/metadatablocks/isRoot' % (server, dataverseAlias)
#     response = requests.get(metadatablocksInheritedApi)
#     metadatablocksInherited = response.json()
#     print(metadatablocksInherited)
    
    # Get list of metadatablocks enabled in the dataverse
    dataverseMetadatablocksList = []
    dataverseMetadatablocksApi = '%s/api/dataverses/%s/metadatablocks' % (server, mainDataverseAlias)
    response = requests.get(dataverseMetadatablocksApi)
    dataverseMetadatablocks = response.json()
    for dataverseMetadatablock in dataverseMetadatablocks['data']:
        dataverseMetadatablock = dataverseMetadatablock['name']
        dataverseMetadatablocksList.append(dataverseMetadatablock)
    print('Number of metadatablocks enabled (in addition to Citation): %s' % (len(dataverseMetadatablocksList) - 1))
    print('\t%s' % (dataverseMetadatablocksList))


Dataverse description exists: False
Dataverse tagline exists: False
Number of search facets used: 6
Number of metadatablocks enabled (in addition to Citation): 5
	['citation', 'biomedical', 'journal', 'astrophysics', 'socialscience', 'geospatial']
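
The description and tagline checks can also be expressed as a small function that works on an already-parsed response, which makes them easy to try without a network call. This is a minimal sketch: `summarize_dataverse` and the sample responses are hypothetical, and the key names simply mirror the ones used in the cell above.

```python
def summarize_dataverse(metadata):
    """Return description/tagline flags for a parsed dataverse response.

    Returns None when the response does not report status 'OK'.
    """
    if metadata.get('status') != 'OK':
        return None
    data = metadata.get('data', {})
    return {
        'descriptionExists': 'description' in data,
        'taglineExists': 'tagline' in data.get('theme', {}),
    }


# Hypothetical responses, trimmed to the keys the function reads
sample_response = {
    'status': 'OK',
    'data': {'alias': 'example', 'theme': {'tagline': 'Hello'}},
}
error_response = {'status': 'ERROR', 'message': 'Not found'}

summary = summarize_dataverse(sample_response)
error_summary = summarize_dataverse(error_response)
print(summary)
print(error_summary)
```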


### Get aliases of any sub-dataverses in the given dataverse

In [291]:
mainDataverseInfoApi = '%s/api/dataverses/%s' % (server, mainDataverseAlias)
response = requests.get(mainDataverseInfoApi)
data = response.json()
mainDataverseID = data['data']['id']

dataverseIDs = [mainDataverseID]
for dataverseID in dataverseIDs:

    sys.stdout.write('.')
    sys.stdout.flush()

    url = '%s/api/dataverses/%s/contents' % (server, dataverseID)

    response = requests.get(url)
    data = response.json()

    for i in data['data']:
        if i['type'] == 'dataverse':
            # Appending while iterating is intentional: each newly found
            # subdataverse is visited later in this same loop
            dataverseIDs.append(i['id'])

print('\n\nFound 1 dataverse and %s subdataverses' % (len(dataverseIDs) - 1))

...

Found 1 dataverse and 2 subdataverses
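
The traversal above relies on a list that grows while it is being iterated. The same breadth-first walk can be written with an explicit queue, sketched below under some assumptions: `collect_dataverse_ids` and `get_contents` are hypothetical names, and `fake_tree` is an in-memory stand-in for the `contents` API responses so the example runs offline.

```python
from collections import deque


def collect_dataverse_ids(root_id, get_contents):
    """Breadth-first walk over a dataverse and its subdataverses.

    get_contents(dataverse_id) must return a list of items shaped like
    the 'data' list of the contents endpoint (dicts with 'type' and 'id').
    """
    found = [root_id]
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for item in get_contents(current):
            if item['type'] == 'dataverse' and item['id'] not in found:
                found.append(item['id'])
                queue.append(item['id'])
    return found


# Hypothetical contents keyed by dataverse ID
fake_tree = {
    1: [{'type': 'dataverse', 'id': 2}, {'type': 'dataset', 'id': 10}],
    2: [{'type': 'dataverse', 'id': 3}],
    3: [],
}

ids = collect_dataverse_ids(1, lambda dataverse_id: fake_tree.get(dataverse_id, []))
print('Found 1 dataverse and %s subdataverses' % (len(ids) - 1))
```

Passing the fetch step in as a function keeps the traversal logic separate from the HTTP call, so it can be checked against fake data first.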


### Get dataset info

In [292]:
# Get PIDs of all published datasets in each of the dataverses
datasetPIDs = []
rowList = []
for dataverseID in dataverseIDs:
    getDataverseInfoApi = '%s/api/dataverses/%s' % (server, dataverseID)
    response = requests.get(getDataverseInfoApi)
    dataverseInfo = response.json()
    dataverseName = dataverseInfo['data']['name']
    dataverseAlias = dataverseInfo['data']['alias']

    getDataverseContentsApi = '%s/api/dataverses/%s/contents' % (server, dataverseID)
    response = requests.get(getDataverseContentsApi)
    dataverseContents = response.json()
    for item in dataverseContents['data']:
        if item['type'] == 'dataset':
            datasetPID = item['persistentUrl'].replace('https://doi.org/', 'doi:')
            datasetPIDs.append(datasetPID)
            
            newRow = {'datasetPID': datasetPID,
                  'dataverseName': dataverseName,
                  'dataverseUrl': '%s/dataverse/%s' % (server, dataverseAlias)
                 }
            rowList.append(dict(newRow))
            
            sys.stdout.write('.')
            sys.stdout.flush()
            
print('\nNumber of datasets: %s' % (len(datasetPIDs)))

datasetDataverseInfoDF = pd.DataFrame(rowList)


...
Number of datasets: 3


In [293]:
# print(datasetPIDs)

Create a dataframe for dataset info: date of publication, the release date of the latest version, number of versions

_Getting this info can be slow. For example, getting the info of ~375 datasets might take 45 min_
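
One possible way to shorten the wait is to issue the per-dataset requests from a small thread pool. The sketch below only shows the pattern: `fetch_all` and `fake_fetch` are hypothetical, the PIDs are made up, and whether a given Dataverse installation tolerates parallel requests is an assumption that should be checked (and the worker count kept low) before using this against a real server.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(pids, fetch_one, max_workers=4):
    """Call fetch_one(pid) for every PID in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_one, pids))


def fake_fetch(pid):
    # Stand-in for a real call such as requests.get(url).json()
    return {'status': 'OK', 'pid': pid}


# Hypothetical PIDs for illustration only
example_pids = ['doi:10.5072/FK2/AAAAAA', 'doi:10.5072/FK2/BBBBBB']

results = fetch_all(example_pids, fake_fetch)
for result in results:
    print(result['pid'], result['status'])
```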

In [294]:
# Create list of file types that Dataverse can convert to .tab files during ingest
uningestedFileTypes = ['application/x-rlang-transport', 'application/x-stata-13', 'application/x-spss-por',
                      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'text/csv', 'text/tsv',
                      'application/x-spss-sav', 'text/comma-separated-values', 'application/x-stata',
                      'application/x-stata-14']

rowList = []
datasetCount = 0
for datasetPID in datasetPIDs:
    getAllVersionsApi = '%s/api/datasets/:persistentId/versions?persistentId=%s' % (server, datasetPID)
    response = requests.get(getAllVersionsApi)
    datasetVersions = response.json()
    
    # Get only datasets with metadata (exclude responses with no values in 'data' key, e.g. deaccessioned datasets)
    if datasetVersions['status'] == 'OK' and len(datasetVersions['data']) > 0:
        
        # Get metadata of latest version
        latestDatasetVersion = datasetVersions['data'][0]
        
        # Get index location of first dataset version
        firstVersion = len(datasetVersions['data']) - 1

        publicationDate = string_to_datetime(datasetVersions['data'][firstVersion]['releaseTime'])
        latestReleaseDate = string_to_datetime(latestDatasetVersion['releaseTime'])
        
        # Get age of dataset from today's date
        delta = currentTime - publicationDate
        ageOfDataset = delta.days
        
        # Get number of days since last update
        delta = currentTime - latestReleaseDate
        ageOfLastUpdate = delta.days
        if ageOfLastUpdate < 0:
            ageOfLastUpdate = 0
        
        # Get length of description text
        descriptionLength = 0
        
        for field in latestDatasetVersion['metadataBlocks']['citation']['fields']:
            if field['typeName'] == 'dsDescription':
                # "N/A" is the value assigned when no description was given (pre Dataverse 4)
                if len(field['value']) == 1 and field['value'][0]['dsDescriptionValue']['value'] == 'N/A':
                    descriptionLength = 0
                else:
                    for i in field['value']:
                        descriptionLength = descriptionLength + len(i['dsDescriptionValue']['value'])

        # See whether CC0 or Terms of Use metadata exists
        license = latestDatasetVersion.get('license', 'None')

        if 'termsOfUse' in latestDatasetVersion:
            termsOfUse = True
        else:
            termsOfUse = False
            
        if 'termsOfAccess' in latestDatasetVersion:
            termsOfAccess = True
        else:
            termsOfAccess = False

        if license != 'CC0' and termsOfUse == False:
            termsExist = False
        else:
            termsExist = True

        # Get info about related publication metadata
        relPubCount = 0
        relPubPIDCount = 0
        for field in latestDatasetVersion['metadataBlocks']['citation']['fields']:
            if field['typeName'] == 'publication':
                for value in field['value']:
                    relPubCount += 1
                    # Count a PID only when both the ID type and the ID number are present
                    if 'publicationIDType' in value and 'publicationIDNumber' in value:
                        relPubPIDCount += 1
        
        # Show metadatablocks whose fields are used by the dataset
        usedMetadataBlocks = []
        for repositoryMetadataBlockName in repositoryMetadataBlockNames:
            try:
                fieldCount = len(latestDatasetVersion['metadataBlocks'][repositoryMetadataBlockName]['fields'])
                if fieldCount > 0:
                    usedMetadataBlocks.append(repositoryMetadataBlockName)
            except KeyError:
                # The dataset has no metadata in this metadatablock
                pass
        if len(usedMetadataBlocks) == 0:
            usedMetadataBlocks = ''
        else:
            usedMetadataBlocks = list_to_string(usedMetadataBlocks)
        
        # Get number of files
        numberOfFiles = len(latestDatasetVersion['files'])

        # Get file info
        noFileDescriptionCount = 0
        contentType = []
        ingestedTabFilesCount = 0
        uningestedTabFilesCount = 0
        restrictedFilesCount = 0
        fileTags = []
        for file in latestDatasetVersion['files']:            
            if 'description' not in file:
                noFileDescriptionCount += 1
            contentType.append(file['dataFile']['contentType'])
            if file['restricted']:
                restrictedFilesCount += 1
            if file['dataFile']['contentType'] in uningestedFileTypes:
                uningestedTabFilesCount += 1
            if file['dataFile']['contentType'] == 'text/tab-separated-values':
                ingestedTabFilesCount += 1
            # Collect file tags (the 'categories' key is absent when a file has no tags)
            for tags in file.get('categories', []):
                fileTags.append(tags)

        tabularDataFileCount = uningestedTabFilesCount + ingestedTabFilesCount

        if len(fileTags) == 0:
            fileTagsExist = False
        else:
            fileTagsExist = True

        if len(contentType) == 0:
            uniqueContentTypes = 'NA'
        else:
            uniqueContentTypes = list_to_string(list(set(contentType)))

        # Create dictionary
        newRow = {'datasetPID': datasetPID,
                  'datasetPIDUrl' : datasetPID.replace('doi:', 'https://doi.org/'),
                  'numberOfVersions': len(datasetVersions['data']),
                  'numberOfMajorVersions': latestDatasetVersion['versionNumber'],
                  'publicationDate': publicationDate,
                  'latestReleaseDate': latestReleaseDate,
                  'ageOfDataset(Days)': ageOfDataset,
                  'ageOfLastUpdate(Days)': ageOfLastUpdate,
                  'descriptionLenth': descriptionLength,
                  'termsExist': termsExist,
                  'license': license,
                  'termsOfUseExists': termsOfUse,
                  'termsOfAccessExists': termsOfAccess,
                  'relPubCount': relPubCount,
                  'relPubPIDCount': relPubPIDCount,
                  'usedMetadataBlocks': usedMetadataBlocks,
                  'numberOfFiles': numberOfFiles,
                  'noFileDescriptionCount': noFileDescriptionCount,
                  'fileTagsExist': fileTagsExist,
                  'uniqueContentTypes': uniqueContentTypes,
                  'tabularDataFileCount': tabularDataFileCount,
                  'ingestedTabFilesCount': ingestedTabFilesCount,
                  'uningestedTabFilesCount': uningestedTabFilesCount,
                  'restrictedFilesCount': restrictedFilesCount
                 }
        rowList.append(dict(newRow))
        datasetCount += 1
        print('%s of %s (%s)' % (datasetCount, len(datasetPIDs), datasetPID), end='\r', flush=True)
        
if len(datasetPIDs) != datasetCount:
    print('The metadata of %s dataset(s) could not be retrieved' % (len(datasetPIDs) - datasetCount))


3 of 3 (doi:10.70122/FK2/ZYUGHH)
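
Two of the per-dataset measurements in the loop above can be pulled out into small functions and tried on sample data. This is a sketch under assumptions: the function names are hypothetical, and `sample_version` is a heavily trimmed, made-up stand-in for one dataset version that keeps only the keys the functions read.

```python
def description_length(version):
    """Total number of characters across a version's description entries."""
    total = 0
    for field in version['metadataBlocks']['citation']['fields']:
        if field['typeName'] == 'dsDescription':
            values = [entry['dsDescriptionValue']['value'] for entry in field['value']]
            # A single 'N/A' entry is treated as "no description"
            if values != ['N/A']:
                total += sum(len(value) for value in values)
    return total


def count_related_publications(version):
    """Return (number of related publications, number that carry a PID)."""
    pub_count = 0
    pid_count = 0
    for field in version['metadataBlocks']['citation']['fields']:
        if field['typeName'] == 'publication':
            for value in field['value']:
                pub_count += 1
                # Both keys must be present for the entry to count as having a PID
                if 'publicationIDType' in value and 'publicationIDNumber' in value:
                    pid_count += 1
    return pub_count, pid_count


# Hypothetical, trimmed dataset version
sample_version = {'metadataBlocks': {'citation': {'fields': [
    {'typeName': 'dsDescription',
     'value': [{'dsDescriptionValue': {'value': 'Survey data'}}]},
    {'typeName': 'publication', 'value': [
        {'publicationCitation': {'value': 'Doe 2020'}},
        {'publicationCitation': {'value': 'Roe 2019'},
         'publicationIDType': {'value': 'doi'},
         'publicationIDNumber': {'value': '10.1234/example'}},
    ]},
]}}}

length = description_length(sample_version)
counts = count_related_publications(sample_version)
print(length, counts)
```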

In [295]:
datasetInfoDF = pd.DataFrame(rowList)


In [296]:
dataframes = [datasetDataverseInfoDF, datasetInfoDF]

# For each dataframe, set the indexes (or the common columns across the dataframes to join on)
for dataframe in dataframes:
    dataframe.set_index(['datasetPID'], inplace=True)

# Merge both dataframes and save to the 'report' variable
report = reduce(lambda left, right: left.join(right, how='outer'), dataframes)

# Reset index
report.reset_index(drop=False, inplace=True)


In [297]:
# report

| | datasetPID | dataverseName | dataverseUrl | datasetPIDUrl | numberOfVersions | numberOfMajorVersions | publicationDate | latestReleaseDate | ageOfDataset(Days) | ageOfLastUpdate(Days) | ... | relPubPIDCount | usedMetadataBlocks | numberOfFiles | noFileDescriptionCount | fileTagsExist | uniqueContentTypes | tabularDataFileCount | ingestedTabFilesCount | uningestedTabFilesCount | restrictedFilesCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doi:10.70122/FK2/HZTO03 | Julian Gautier (SU) Dataverse | https://demo.dataverse.org/dataverse/sefsef | https://doi.org/10.70122/FK2/HZTO03 | 3 | 1 | 2020-08-04 19:48:40+00:00 | 2020-10-26 03:44:39+00:00 | 86 | 3 | ... | 0 | citation,geospatial | 0 | 0 | False | | 0 | 0 | 0 | 0 |
| 1 | doi:10.70122/FK2/CMFTOD | Julian Gautier (SU) Dataverse | https://demo.dataverse.org/dataverse/sefsef | https://doi.org/10.70122/FK2/CMFTOD | 1 | 1 | 2020-10-14 20:07:47+00:00 | 2020-10-14 20:07:47+00:00 | 15 | 15 | ... | 0 | citation | 2 | 2 | False | image/jpeg | 0 | 0 | 0 | 0 |
| 2 | doi:10.70122/FK2/ZYUGHH | Julian Gautier (SU) Dataverse | https://demo.dataverse.org/dataverse/sefsef | https://doi.org/10.70122/FK2/ZYUGHH | 16 | 5 | 2020-09-17 16:08:53+00:00 | 2020-10-29 03:16:38+00:00 | 42 | 0 | ... | 1 | astrophysics,biomedical,citation,geospatial,so... | 3 | 2 | True | image/jpeg,image/png,text/tab-separated-values | 1 | 1 | 0 | 0 |


In [298]:
# Export report to CSV
file = '%s_%s.csv' % (mainDataverseAlias, currentTime)
report.to_csv(file, index=False)
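
Formatting a `datetime` directly into the file name produces spaces, colons and a `+00:00` offset, which some filesystems reject. A compact timestamp avoids that. In the sketch below, `report_filename` and the example time are hypothetical, and the trailing `Z` assumes the datetime passed in is in UTC, as `currentTime` is in this notebook.

```python
from datetime import datetime, timezone


def report_filename(alias, moment):
    """Build a CSV file name with a compact, filesystem-safe UTC timestamp."""
    timestamp = moment.strftime('%Y%m%dT%H%M%SZ')
    return '%s_%s.csv' % (alias, timestamp)


# Hypothetical moment for illustration only
example_time = datetime(2020, 10, 29, 3, 16, 38, tzinfo=timezone.utc)

filename = report_filename('sefsef', example_time)
print(filename)
```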


In [329]:
# report = pd.read_csv('nds_datasets.csv', na_filter = False)

In [330]:
# # Get list of metadatablocks used by all datasets
# allUsedMetadataBlocks = []
# for i in report['usedMetadataBlocks']:
#     allUsedMetadataBlocks.extend(list(i.split(",")))

# # Deduplicate, alphabetize and change list to string
# allUsedMetadataBlocks = list_to_string(list(set(allUsedMetadataBlocks)))

In [331]:
# for i in report['uniqueContentTypes']:
#     print('%s: %s' % (i, type(i)))

text/plain; charset=US-ASCII: <class 'str'>
application/msword,application/pdf,text/plain; charset=US-ASCII,text/tab-separated-values,text/x-spss-syntax; charset=US-ASCII: <class 'str'>
application/dbf,application/octet-stream,application/pdf,application/vnd.ms-excel,text/tab-separated-values: <class 'str'>
NA: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
application/pdf,text/plain; charset=US-ASCII: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
application/dbf,application/pdf,text/plain: <class 'str'>
application/msword,application/pdf,text/plain; charset=US-ASCII,text/tab-separated-values,text/x-spss-syntax; charset=US-ASCII: <class 'str'>
application/msword,application/pdf,text/plain; charset=US-ASCII,text/tab-separated-values,text/x-spss-syntax; charset=US-ASCII: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
text/plain; charset=US-ASCII: <class 'str'>
applica

In [332]:
# # Get list of uniqueContentTypes used by all datasets
# allContentTypes = []
# for i in report['uniqueContentTypes']:
#     if i != 'NA':
#         allContentTypes.extend(list(i.split(",")))

# # Deduplicate, alphabetize and change list to string
# # allContentTypes = list_to_string(list(set(allContentTypes)))
# len(set(allContentTypes))

17

In [333]:
# # Create summary
# summaryDict = {
#     'Summary': {
#         '0': 'Has description',
#         '1': 'Has tagline',
#         '2': 'Number of search facets',
#         '3': 'Metadatablocks enabled',
#         '4': 'Dataset count',
#         '5': 'Versions (avg # of major and minor versions)',
#         '6': 'Major versions (average #)',
#         '7': 'Description length (avg # of characters)',
#         '8': 'CC0 datasets (% of total datasets)',
#         '9': 'Age of datasets (average)',
#         '10': 'No terms (% of datasets with no terms metadata)',
#         '11': 'Related pub metadata (% of datasets with rel pub metadata)',
#         '12': 'Related pub PIDs (% of datasets with rel pub PIDs)',
#         '13': 'Metadatablocks used (list)',
#         '14': 'No files (# of datasets with no files)',
#         '15': 'File descriptions (% of datasets with 1 or more file descriptions)',
#         '16': 'File tags (% of datasets with 1 or more file tags)',
#         '17': 'Unique file types (count)',
#         '18': 'Tabular data (% of datasets with tabular data) ',
#         '19': 'Tabular data ingest successes (% of datasets with tabular data that has been ingested)',
#         '20': 'Tabular data ingest failures (% of datasets with tabular data that has not been ingested)',
#         '21': 'Public files (% of unrestricted files)'
#     },
#     mainDataverseAlias: {
#         '0': dataverseMetadataExists,
#         '1': taglineExists,
#         '2': len(facets),
#         '3': len(dataverseMetadatablocksList) - 1,
#         '4': datasetCount,
#         '5': report['numberOfVersions'].mean(),
#         '6': report['numberOfMajorVersions'].mean(),
#         '7': report['descriptionLenth'].mean(),
#         '8': ((report['license']=='CC0').sum()/datasetCount)*100,
#         '9': report['ageOfDataset(Days)'].mean(),
#         '10': ((~report['termsExist']).values.sum())/datasetCount*100,
#         '11': len(report[(report['relPubCount']!=0)])/datasetCount*100,
#         '12': len(report[(report['relPubPIDCount']!=0)])/datasetCount*100,
#         '13': allUsedMetadataBlocks,
#         '14': len(report[(report['numberOfFiles']==0)]),
#         '15': len(report[(report['noFileDescriptionCount']!=0)])/datasetCount*100,
#         '16': ((report['fileTagsExist']).values.sum())/datasetCount*100,
#         '17': len(set(allContentTypes)),
#         '18': len(report[(report['tabularDataFileCount']!=0)])/datasetCount*100,
#         '19': len(report[(report['ingestedTabFilesCount']!=0)])/tabularDataFileCount*100,
#         '20': len(report[(report['uningestedTabFilesCount']!=0)])/tabularDataFileCount*100,
#         '21': len(report[(report['restrictedFilesCount']==0)])
#     }
# }

# # # Show average age of datasets
# # ageOfDatasets = report['ageOfDataset(Days)'].mean()
# # averageDatasetAge = ageOfDatasets.mean()

# # # Show average number of dataset versions
# # numberOfVersions = report['numberOfVersions'].mean()
# # averageNumberOfVersions = numberOfVersions.mean()

# # # Create list of datasets with fewer than a certain number of characters in their descriptions
# # lowDescriptionCount = report[report['descriptionLenth'] < 20]
# # lowDescriptionCount = lowDescriptionCount['datasetPID']


KeyError: 'tabularDataFileCount'

In [319]:
# datasetInfoDF = pd.DataFrame(summaryDict)

In [320]:
# datasetInfoDF

| | Summary | sefsef |
|---|---|---|
| 0 | Has description | False |
| 1 | Has tagline | False |
| 2 | Number of search facets | 6 |
| 3 | Metadatablocks enabled | 5 |
| 4 | Dataset count | 3 |
| 5 | Versions (avg # of major and minor versions) | 6.66667 |
| 6 | Major versions (average #) | 2.33333 |
| 7 | Description length (avg # of characters) | 4.66667 |
| 8 | CC0 datasets (% of total datasets) | 100 |
| 9 | Age of datasets (average) | 47.6667 |


In [12]:
# List count of each unique file format
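
One way this last step might be approached is with `collections.Counter` over the comma-separated `uniqueContentTypes` values. This is a sketch, not the notebook's own implementation: `count_content_types` and `sample_column` are hypothetical. Because that column is deduplicated per dataset, the result counts how many datasets contain each format, not how many files.

```python
from collections import Counter


def count_content_types(unique_content_types):
    """Count how many datasets contain each content type.

    Each input value is a comma-separated string of content types,
    or 'NA' (or an empty string) for a dataset with no files.
    """
    counts = Counter()
    for value in unique_content_types:
        if value and value != 'NA':
            counts.update(value.split(','))
    return counts


# Hypothetical column values shaped like report['uniqueContentTypes']
sample_column = [
    'image/jpeg',
    'NA',
    'image/jpeg,image/png,text/tab-separated-values',
]

format_counts = count_content_types(sample_column)
for content_type, count in format_counts.most_common():
    print('%s: %s' % (content_type, count))
```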