In [1]:
import csv
from functools import reduce
from google_trans_new import google_translator
import pandas as pd


## Get data

Get data about whether or not each dataset as one or more restricted files

In [2]:
restrictedFilesCountDF = pd.read_csv('restricted_files_count.tab', sep='\t', na_filter = False)

# Get only metadata for the latest versions of each dataset
restrictedFilesCountLatestversionDF = (restrictedFilesCountDF
    .iloc[restrictedFilesCountDF.groupby('persistentUrl')['datasetVersionId'].agg(pd.Series.idxmax)]
    .reset_index(drop=True, inplace=False)
    .drop(columns=['datasetVersionId'])
    )


In [3]:
restrictedFilesCountLatestversionDF.head(5)

Unnamed: 0,persistentUrl,restricted_files
0,http://dx.doi.org/10.26193/00HBWG,NA (not recorded)
1,http://dx.doi.org/10.26193/01P0AI,NA (not recorded)
2,http://dx.doi.org/10.26193/04F7C1,NA (not recorded)
3,http://dx.doi.org/10.26193/07R31R,NA (not recorded)
4,http://dx.doi.org/10.26193/0AF6TZ,NA (not recorded)


In [4]:
termsMetadataDF = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False)

# Get only metadata for the latest versions of each dataset
termsMetadataLatestversionDF = (termsMetadataDF
    .iloc[termsMetadataDF.groupby('persistentUrl')['datasetVersionId']
    .agg(pd.Series.idxmax)]
    .sort_values(by=['publisher'], inplace=False, ascending=True)
    .drop(columns=['datasetVersionId'])
    .reset_index(drop=True, inplace=False)
    )


In [5]:
termsMetadataLatestversionDF.head(5)

Unnamed: 0,publisher,persistentUrl,majorVersionNumber,minorVersionNumber,license,termsOfAccess,termsOfUse,availabilityStatus,citationRequirements,conditions,confidentialityDeclaration,contactForAccess,dataaccessPlace,depositorRequirements,disclaimer,originalArchive,restrictions,sizeOfCollection,specialPermissions,studyCompletion
0,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
1,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
2,ACSS Dataverse,https://doi.org/10.25825/FK2/9QFRW2,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
3,ACSS Dataverse,https://doi.org/10.25825/FK2/A3JWCN,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
4,ACSS Dataverse,https://doi.org/10.25825/FK2/AGZJI8,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,


Join restrictedFilesCountLatestversionDF to termsMetadataLatestversionDF

In [6]:
termsAndRestrictedFilesDF = pd.merge(termsMetadataLatestversionDF, restrictedFilesCountLatestversionDF)
termsAndRestrictedFilesDF = termsAndRestrictedFilesDF[['publisher','persistentUrl', 'termsOfAccess', 'restricted_files']]


In [7]:
termsAndRestrictedFilesDF.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
0,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,,0
1,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,,0
2,ACSS Dataverse,https://doi.org/10.25825/FK2/9QFRW2,,0
3,ACSS Dataverse,https://doi.org/10.25825/FK2/A3JWCN,,0
4,ACSS Dataverse,https://doi.org/10.25825/FK2/AGZJI8,,0


Verify data in joined table by comparing count of datasets in restrictedFilesCountLatestversionDf to count of datasets termsMetadataLatestversionDF.

In [8]:
print('Number of datasets in termsMetadataLatestversionDF: %s' % (len(pd.unique(termsMetadataLatestversionDF['persistentUrl']))))
print('Number of datasets in termsAndRestrictedFilesDF: %s' % (len(pd.unique(termsAndRestrictedFilesDF['persistentUrl']))))


Number of datasets in termsMetadataLatestversionDF: 133253
Number of datasets in termsAndRestrictedFilesDF: 133253


Great, the same number of datasets exist in one of the original tables and the resulting joined table. This means the join was successful.

Now we'll count the number of datasets whose latest version:
- Has Terms of Access metadata
  - And has no files
  - And has files
    - But we don't know if the files are restricted because the datasets are in repositories whose JSON exports don't include if files are restricted or not
    - Where one or more files are restricted
    - Where no files are restricted
- That have restricted files but no Terms of Access metadata

In [18]:
print('Number of datasets: %s' % (len(pd.unique(termsAndRestrictedFilesDF['persistentUrl']))))

# Has Terms of Access metadata
df1 = (termsAndRestrictedFilesDF.query('termsOfAccess != ""'))
print('Number of datasets with Terms of Access metadata: %s' % (len(pd.unique(df1['persistentUrl']))))

# Has ToA metadata but no files
df2 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess != "" and\
        restricted_files == "NA (no files)"'))
print('\tNumber of datasets with ToA but no files: %s' % (len(pd.unique(df2['persistentUrl']))))

# Has ToA metadata and one or more files
df3 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess != "" and\
        restricted_files != "NA (no files)"'))
print('\tNumber of datasets with ToA and one or more files: %s' % (len(pd.unique(df3['persistentUrl']))))

# Has ToA metadata and files but we don't know if the files are restricted
df3 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess != "" and\
        restricted_files == "NA (not recorded)"'))
print('\t\tNumber of datasets with ToA and files but we don\'t know if the files are restricted: %s' % (len(pd.unique(df3['persistentUrl']))))

# Has ToA metadata and files and one or more files is restricted
df4 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess != ""\
        and restricted_files.str.contains("NA") == False\
        and restricted_files != "0"'))
print('\t\tNumber of datasets with ToA and one or more restricted\
 files: %s' % (len(pd.unique(df4['persistentUrl']))))

# Has ToA metadata and files and none are restricted
df5 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess != ""\
        and restricted_files == "0"'))
print('\t\tNumber of datasets with ToA and no restricted files: %s' % (len(pd.unique(df5['persistentUrl']))))

# Does not have ToA metadata but has one or more restricted files
df6 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess == ""\
        and restricted_files.str.contains("NA") == False\
        and restricted_files != "0"'))
print('Number of datasets with no ToA but with restricted file(s): %s' % (len(pd.unique(df3['persistentUrl']))))


Number of datasets: 133253
Number of datasets with Terms of Access metadata: 6798
	Number of datasets with ToA but no files: 68
	Number of datasets with ToA and one or more files: 6730
		Number of datasets with ToA and files but we don't know if the files are restricted: 1531
		Number of datasets with ToA and one or more restricted files: 2354
		Number of datasets with ToA and no restricted files: 2845
Number of datasets with no ToA but with restricted file(s): 1531


So of the datasets that have Terms of Access metadata, 6798, about a third, 2845, have no restricted files.

Let's take a look at the ToA for a few of these datasets. The data is stored in the "df5" dataframe.

In [19]:
df5.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
1675,AUSSDA,https://doi.org/10.11587/QYUSBJ,"Für jede Datei, die Sie herunterladen erklären...",0
1677,AUSSDA,https://doi.org/10.11587/R1PDUC,"Für jede Datei, die Sie herunterladen erklären...",0
1680,AUSSDA,https://doi.org/10.11587/R2W6EQ,"Für jede Datei, die Sie herunterladen erklären...",0
1682,AUSSDA,https://doi.org/10.11587/R7CPFY,"<p>For each file you download, you declare:<br...",0
1689,AUSSDA,https://doi.org/10.11587/R4RCPI,"<p>For each file you download, you declare:<br...",0


The first five datasets from AUSSDA each have one public PDF file that is documentation. The Terms of Access metadata actually applies to data files that are not included in the dataset.

Let's see how many of these 2845 datasets are in the AUSSDA repository.

In [20]:
datasetsByPublisher = df5.value_counts(subset=['publisher']).to_frame('dataset_count')
datasetsByPublisher


Unnamed: 0_level_0,dataset_count
publisher,Unnamed: 1_level_1
RIN Dataverse,2126
AUSSDA,296
Harvard Dataverse,295
Root,40
Scholars Portal Dataverse,19
UAL Dataverse,12
DataverseNL,9
DataverseNO,6
UNC Dataverse,6
QDR Main Collection,6


Most of the datasets are actually in one repository, RIN repository. This is encouraging as it might make it easier to get a good picture of how ToA are being used with regard to restricted files.

In [30]:
df7 = (df1.query(
        'publisher == "AUSSDA"\
        and termsOfAccess != ""\
        and restricted_files != "0"\
        and restricted_files.str.contains("NA") == False'))
print('Number of AUSSDA datasets with ToA metadata and one or more files: %s' % (len(df7)))


Number of AUSSDA datasets with ToA metadata and one or more files: 431


So 431 datasets in the AUSSDA repository have ToA metadata and one or more restricted files.

Let's take a look at these datasets.



In [31]:
df7.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
1676,AUSSDA,https://doi.org/10.11587/R0XDMW,"Für jede Datei, die Sie herunterladen erklären...",2
1678,AUSSDA,https://doi.org/10.11587/RF9W7K,"Für jede Datei, die Sie herunterladen erklären...",2
1679,AUSSDA,https://doi.org/10.11587/R1UAS0,"Für jede Datei, die Sie herunterladen erklären...",1
1681,AUSSDA,https://doi.org/10.11587/QZG1XH,"Für jede Datei, die Sie herunterladen erklären...",4
1683,AUSSDA,https://doi.org/10.11587/R5DPPK,"Für jede Datei, die Sie herunterladen erklären...",2


It looks like some of these datasets have restricted files that actually contain data, but we can't tell 