In [12]:
import csv
from functools import reduce
from google_trans_new import google_translator
import numpy as np
import pandas as pd


## Get and verify the data

Get data about whether each dataset has one or more restricted files

In [2]:
restrictedFilesCountDF = pd.read_csv('file_restrictions.tab', sep='\t', na_filter = False)


In [3]:
# Get only metadata for the latest versions of each dataset
restrictedFilesCountLatestversionDF = (restrictedFilesCountDF
    # idxmax returns index labels, so select rows with .loc rather than .iloc
    .loc[restrictedFilesCountDF.groupby('persistentUrl')['datasetVersionId'].idxmax()]
    .reset_index(drop=True)
    .drop(columns=['datasetVersionId'])
    )
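
A minimal sketch of how the groupby/idxmax pattern picks the latest version of each dataset. The URLs and version IDs below are made-up toy values, not rows from file_restrictions.tab. idxmax returns index labels, so label-based selection with .loc is the safe choice; position-based selection with .iloc gives the same rows only while the index is the default 0..n-1.

```python
import pandas as pd

# Toy data: hypothetical URLs and version IDs, not taken from the real files
toy = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:A', 'doi:B', 'doi:B', 'doi:B'],
    'datasetVersionId': [10, 12, 7, 9, 8],
    'restricted_files': ['False', 'True', 'False', 'False', 'True'],
})

# idxmax returns the index label of the highest datasetVersionId per dataset
latestLabels = toy.groupby('persistentUrl')['datasetVersionId'].idxmax()

# .loc selects by label, so this also works when the index is not 0..n-1
latest = toy.loc[latestLabels].reset_index(drop=True)
print(latest)
```

Each dataset ends up with exactly one row, the one with the highest datasetVersionId.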

In [4]:
restrictedFilesCountLatestversionDF.head(5)

Unnamed: 0,persistentUrl,restricted_files,file_access_request
0,http://dx.doi.org/10.26193/00HBWG,NA (not recorded),NA (not recorded)
1,http://dx.doi.org/10.26193/01P0AI,NA (not recorded),NA (not recorded)
2,http://dx.doi.org/10.26193/04F7C1,NA (not recorded),NA (not recorded)
3,http://dx.doi.org/10.26193/07R31R,NA (not recorded),NA (not recorded)
4,http://dx.doi.org/10.26193/0AF6TZ,NA (not recorded),NA (not recorded)


In [22]:
# Read the terms metadata and replace any blank values with NaN
termsMetadataDF = (pd
    .read_csv('terms_metadata.tab', sep='\t', na_filter = False)
    .replace(r'^\s*$', np.nan, regex=True)
)

In [23]:
# Get only metadata for the latest versions of each dataset
termsMetadataLatestversionDF = (termsMetadataDF
    # idxmax returns index labels, so select rows with .loc rather than .iloc
    .loc[termsMetadataDF.groupby('persistentUrl')['datasetVersionId'].idxmax()]
    .sort_values(by=['publisher'], ascending=True)
    .drop(columns=['datasetVersionId'])
    .reset_index(drop=True)
    )

In [24]:
termsMetadataLatestversionDF.head(5)

Unnamed: 0,publisher,persistentUrl,majorVersionNumber,minorVersionNumber,license,termsOfAccess,termsOfUse,availabilityStatus,citationRequirements,conditions,confidentialityDeclaration,contactForAccess,dataaccessPlace,depositorRequirements,disclaimer,originalArchive,restrictions,sizeOfCollection,specialPermissions,studyCompletion
0,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
1,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
2,ACSS Dataverse,https://doi.org/10.25825/FK2/9QFRW2,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
3,ACSS Dataverse,https://doi.org/10.25825/FK2/A3JWCN,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,
4,ACSS Dataverse,https://doi.org/10.25825/FK2/AGZJI8,1,0,NONE,,<b>Acceptance of Terms</b><br></br>\nThe follo...,,,,,,,,,,,,,


Join restrictedFilesCountLatestversionDF to termsMetadataLatestversionDF

In [25]:
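# Inner join on persistentUrl, the only column the two tables have in common
termsAndRestrictedFilesDF = pd.merge(termsMetadataLatestversionDF, restrictedFilesCountLatestversionDF, on='persistentUrl', how='inner')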

# Retain only needed columns
termsAndRestrictedFilesDF = termsAndRestrictedFilesDF[['publisher','persistentUrl', 'termsOfAccess', 'restricted_files']]
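
One way to see what a join keeps and drops is to run it as an outer join with pandas' `indicator` and `validate` options. This is only a sketch on toy data with hypothetical URLs. It assumes each table has one row per dataset, which is what the latest-version step is meant to guarantee.

```python
import pandas as pd

# Toy stand-ins for the two latest-version tables (hypothetical values)
terms = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:C'],
    'termsOfAccess': ['Some terms', None, None],
})
restrictions = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:D'],
    'restricted_files': ['True', 'False', 'NA (no files)'],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column saying where each row came from
checked = pd.merge(terms, restrictions, on='persistentUrl', how='outer',
                   validate='one_to_one', indicator=True)
mergeCounts = checked['_merge'].value_counts()
print(mergeCounts)
```

Rows marked left_only or right_only are datasets that an inner join would silently drop.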


In [26]:
termsAndRestrictedFilesDF.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
0,ACSS Dataverse,https://doi.org/10.25825/FK2/VXVPVP,,False
1,ACSS Dataverse,https://doi.org/10.25825/FK2/8YKSQV,,False
2,ACSS Dataverse,https://doi.org/10.25825/FK2/9QFRW2,,False
3,ACSS Dataverse,https://doi.org/10.25825/FK2/A3JWCN,,False
4,ACSS Dataverse,https://doi.org/10.25825/FK2/AGZJI8,,False


The restricted_files column isn't really a boolean data type, since it holds more than just True and False values. There are also two "NA" strings, "NA (no files)" and "NA (not recorded)", which record when a dataset has no files or when the metadata doesn't record whether files are restricted. For the querying I plan to do, I didn't want to add more columns to record each of those cases. Let's confirm the datatypes for all columns:

In [42]:
termsAndRestrictedFilesDF.dtypes


publisher           object
persistentUrl       object
termsOfAccess       object
restricted_files    object
dtype: object

The datatypes for all of the columns are "object", which here means strings, so I'll have to search for True and False values as strings instead of as boolean values.
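
A small sketch of why the comparison has to use strings. The tab-separated text below is toy data with hypothetical URLs. When a column mixes True/False with other text, read_csv keeps every value as a string, so the query has to match "True" rather than the boolean True.

```python
import io
import pandas as pd

# Toy tab-separated data that mixes True/False with an "NA" label
tsv = (
    'persistentUrl\trestricted_files\n'
    'doi:A\tTrue\n'
    'doi:B\tFalse\n'
    'doi:C\tNA (no files)\n'
)
toy = pd.read_csv(io.StringIO(tsv), sep='\t', na_filter=False)

# The column is not boolean because one value is neither True nor False
print(toy['restricted_files'].dtype)

# Matching against the string "True" finds the restricted dataset
stringMatches = (toy['restricted_files'] == 'True').sum()
print(stringMatches)
```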

Last, we'll verify that the data was successfully joined by making sure that the number of datasets in termsMetadataLatestversionDF is the same as the number of datasets in the joined table, termsAndRestrictedFilesDF.

In [43]:
print('Number of datasets in termsMetadataLatestversionDF: %s' % (len(pd.unique(termsMetadataLatestversionDF['persistentUrl']))))
print('Number of datasets in termsAndRestrictedFilesDF: %s' % (len(pd.unique(termsAndRestrictedFilesDF['persistentUrl']))))


Number of datasets in termsMetadataLatestversionDF: 133253
Number of datasets in termsAndRestrictedFilesDF: 133253


Great, the same number of datasets exists in one of the original tables and in the resulting joined table. This means the join didn't drop any of the datasets in termsMetadataLatestversionDF.
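
Comparing unique counts shows that no dataset was lost, but it can't show whether the join duplicated rows. Here is a sketch of two extra checks on toy data. `check_join` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

def check_join(left, joined, key='persistentUrl'):
    """Return simple sanity checks for an inner join (hypothetical helper)."""
    return {
        # No dataset from the left table was dropped by the join
        'none_dropped': set(left[key]) == set(joined[key]),
        # No dataset appears on more than one row of the joined table
        'no_duplicates': joined[key].is_unique,
    }

# Toy tables: the right table has doi:B twice, so the join duplicates it
left = pd.DataFrame({'persistentUrl': ['doi:A', 'doi:B', 'doi:C']})
right = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:B', 'doi:C'],
    'restricted_files': ['True', 'False', 'False', 'NA (no files)'],
})
joined = pd.merge(left, right, on='persistentUrl', how='inner')
result = check_join(left, joined)
print(result)
```

In this toy case the unique counts match (three datasets on each side), yet the joined table has four rows, which is the situation the second check is meant to catch.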


## Query the data to answer questions

Now we'll count the number of datasets whose latest version:
- Has Terms of Access metadata
  - And has no files
  - And has files
    - But we don't know if the files are restricted because the datasets are in repositories whose JSON exports don't include if files are restricted or not
    - Where one or more files are restricted
    - Where no files are restricted
- That have restricted files but no Terms of Access metadata

In [44]:
print('Number of datasets: %s' % (len(pd.unique(termsAndRestrictedFilesDF['persistentUrl']))))

# Has Terms of Access metadata
df1 = termsAndRestrictedFilesDF.query('termsOfAccess.notnull()')
print('1. Number of datasets with Terms of Access metadata: %s' % (len(pd.unique(df1['persistentUrl']))))

# Has ToA metadata but no files
df2 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.notnull() and\
        restricted_files == "NA (no files)"'))
print('\t2. Number of datasets with ToA but no files: %s' % (len(pd.unique(df2['persistentUrl']))))

# Has ToA metadata and one or more files
df3 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.notnull() and\
        restricted_files != "NA (no files)"'))
print('\t3. Number of datasets with ToA and one or more files: %s' % (len(pd.unique(df3['persistentUrl']))))

# Has ToA metadata and files but we don't know if the files are restricted
df4 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.notnull() and\
        restricted_files == "NA (not recorded)"'))
print('\t\t4. Number of datasets with ToA and files but we don\'t know if the files are restricted: %s' % (len(pd.unique(df4['persistentUrl']))))

# Has ToA metadata and files and one or more files is restricted
df5 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.notnull() and\
        restricted_files == "True"'))
print('\t\t5. Number of datasets with ToA and one or more restricted\
 files: %s' % (len(pd.unique(df5['persistentUrl']))))

# Has ToA metadata and files and none are restricted
df6 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.notnull() and\
        restricted_files == "False"'))
print('\t\t6. Number of datasets with ToA and no restricted files: %s' % (len(pd.unique(df6['persistentUrl']))))

# Does not have ToA metadata but has one or more restricted files
df7 = (
    termsAndRestrictedFilesDF.query(
        'termsOfAccess.isnull() and\
        restricted_files == "True"'))
print('7. Number of datasets with no ToA but with restricted file(s): %s' % (len(pd.unique(df7['persistentUrl']))))


Number of datasets: 133253
1. Number of datasets with Terms of Access metadata: 6798
	2. Number of datasets with ToA but no files: 68
	3. Number of datasets with ToA and one or more files: 6730
		4. Number of datasets with ToA and files but we don't know if the files are restricted: 1531
		5. Number of datasets with ToA and one or more restricted files: 2354
		6. Number of datasets with ToA and no restricted files: 2845
7. Number of datasets with no ToA but with restricted file(s): 4448
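
As a cross-check on the separate queries, the same breakdown can be sketched as a single cross-tabulation. The dataframe below is toy data with hypothetical URLs, and the `toa` labels are made up for this example. Note that crosstab counts rows, so it only matches the counts above if there is one row per dataset.

```python
import numpy as np
import pandas as pd

# Toy data shaped like termsAndRestrictedFilesDF (hypothetical values)
toy = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:C', 'doi:D', 'doi:E'],
    'termsOfAccess': ['Terms', 'Terms', np.nan, 'Terms', np.nan],
    'restricted_files': ['True', 'False', 'True',
                         'NA (no files)', 'NA (not recorded)'],
})

# Label each dataset by whether it has Terms of Access metadata
toa = (toy['termsOfAccess'].notnull()
       .map({True: 'has ToA', False: 'no ToA'})
       .rename('toa'))

# One table of counts: rows are ToA presence, columns are restricted_files values
counts = pd.crosstab(toa, toy['restricted_files'])
print(counts)
```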


So of the 6798 datasets that have Terms of Access metadata, 2845, or about 42%, have files but none are restricted.
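
A quick arithmetic check on the counts printed above, using only the numbers from that output. The variable names are just labels for this sketch. The subcategories should add up to their parent counts, and the share of ToA datasets with no restricted files follows from a simple division.

```python
# Counts copied from the printed output above
withToA = 6798
noFiles = 68
withFiles = 6730
notRecorded = 1531
restricted = 2354
unrestricted = 2845

# The subcategories should partition their parent categories
assert noFiles + withFiles == withToA
assert notRecorded + restricted + unrestricted == withFiles

# Share of datasets with ToA metadata whose files are all unrestricted
shareUnrestricted = unrestricted / withToA
print(f'{shareUnrestricted:.1%}')
```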

Let's take a look at the ToA for a few of these datasets. The data is stored in the "df6" dataframe.

In [40]:
df6.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
1675,AUSSDA,https://doi.org/10.11587/QYUSBJ,"Für jede Datei, die Sie herunterladen erklären...",False
1677,AUSSDA,https://doi.org/10.11587/R1PDUC,"Für jede Datei, die Sie herunterladen erklären...",False
1680,AUSSDA,https://doi.org/10.11587/R2W6EQ,"Für jede Datei, die Sie herunterladen erklären...",False
1682,AUSSDA,https://doi.org/10.11587/R7CPFY,"<p>For each file you download, you declare:<br...",False
1689,AUSSDA,https://doi.org/10.11587/R4RCPI,"<p>For each file you download, you declare:<br...",False


The first five datasets from AUSSDA each have one public PDF file that is documentation. The Terms of Access metadata actually applies to data files that are not included in the dataset.

Let's see how many of these 2845 datasets are in the AUSSDA repository.

In [20]:
# df6 holds the 2845 datasets with ToA metadata and no restricted files
datasetsByPublisher = df6.value_counts(subset=['publisher']).to_frame('dataset_count')
datasetsByPublisher


publisher,dataset_count
RIN Dataverse,2126
AUSSDA,296
Harvard Dataverse,295
Root,40
Scholars Portal Dataverse,19
UAL Dataverse,12
DataverseNL,9
DataverseNO,6
UNC Dataverse,6
QDR Main Collection,6


Most of these datasets are actually in one repository, RIN Dataverse. This is encouraging, as it might make it easier to get a good picture of how ToA are being used with regard to restricted files.

In [45]:
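# AUSSDA datasets with ToA metadata and one or more restricted files
# (this reuses the name df7, replacing the dataframe from the earlier cell)
df7 = (df1.query(
        'publisher == "AUSSDA"\
        and termsOfAccess.notnull()\
        and restricted_files == "True"'))
print('Number of AUSSDA datasets with ToA metadata and one or more restricted files: %s' % (len(df7)))


Number of AUSSDA datasets with ToA metadata and one or more restricted files: 431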


So 431 datasets in the AUSSDA repository have ToA metadata and one or more restricted files.

Let's take a look at these datasets.



In [31]:
df7.head(5)

Unnamed: 0,publisher,persistentUrl,termsOfAccess,restricted_files
1676,AUSSDA,https://doi.org/10.11587/R0XDMW,"Für jede Datei, die Sie herunterladen erklären...",2
1678,AUSSDA,https://doi.org/10.11587/RF9W7K,"Für jede Datei, die Sie herunterladen erklären...",2
1679,AUSSDA,https://doi.org/10.11587/R1UAS0,"Für jede Datei, die Sie herunterladen erklären...",1
1681,AUSSDA,https://doi.org/10.11587/QZG1XH,"Für jede Datei, die Sie herunterladen erklären...",4
1683,AUSSDA,https://doi.org/10.11587/R5DPPK,"Für jede Datei, die Sie herunterladen erklären...",2


It looks like some of these datasets have restricted files that actually contain data, but we can't tell 