# Terms metadata in Dataverse repositories

As part of the work to make it easier for dataset depositors to specify how their data should be accessed and used by letting them choose from a predefined list of licenses, questions have come up about the relationship between waivers/licenses and the other Terms metadata fields (e.g. Terms of Use, Terms of Access, Conditions) that the Dataverse software ships with.

This notebook explores the Terms metadata of Dataverse repositories to show what that relationship looks like in practice. It identifies published datasets with both:
- a CC0 waiver or a license and
- metadata entered in one or more of the other Terms metadata fields

to help answer the following questions:

- When depositors publish datasets with a CC0 waiver or a license, how often are they filling the other Terms fields?
- When depositors publish datasets with a CC0 waiver or a license and fill one or more other Terms fields, are they entering metadata that conflicts or might conflict with the license?

## Considerations
- This is an exploration of depositor behavior and not necessarily what they "should" be doing.
- The data used in this exploration comes from 49 of 67 known Dataverse repositories, so it's only a sample of the population.
- Repositories have different levels of control over the quality of the metadata they publish. Some repositories allow anyone to publish datasets and intervene rarely or not at all to prevent datasets from being published with conflicting metadata. At the other end, some repositories have complete control. The level of control and expertise applied when publishing datasets should be taken into account if a goal of this "multiple licenses" work is to add functionality that helps depositors with any level of curation expertise easily apply Terms metadata that follows best practices for data sharing.

In [99]:
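import csv
from google_trans_new import google_translator
import numpy as np
import pandas as pd

# Translator instance used later to translate Indonesian Terms metadata into English
translator = google_translator()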


## Getting the data

The terms_metadata.tab file contains some basic metadata and the "Terms" metadata of all published versions of every dataset published in 49 known Dataverse repositories. Each row of the tabular file is a published version of a dataset, so there can be multiple rows (versions) per dataset. Each dataset has a unique persistentUrl, but the database ID for each dataset version, in the datasetVersionId column, is unique only within each of the 49 repositories, so versions of different datasets in different repositories can share the same datasetVersionId.

Getting the data in terms_metadata.tab:
- Download the 49 zipped files at https://doi.org/10.7910/DVN/DCDKZQ. Each zipped file contains the metadata of each published version of every dataset published in 49 known Dataverse repositories
- Using your preferred method, move all JSON files into a single folder
- Run the two scripts "get_basic_metadata.py" and "get_terms_metadata.py" at https://github.com/jggautier/dataverse_scripts/tree/master/get-dataverse-metadata/parse_metadata_fields with that folder as the input to get two CSV files, one containing the basic metadata of all datasets (publisher names, PIDs, publication dates, version numbers, etc), and one containing the Terms metadata for each version of each dataset.
- Using your preferred method, retain from the basic_metadata file only the 'publisher', 'persistentUrl', 'datasetVersionId', 'majorVersionNumber', and 'minorVersionNumber' columns.
- Using your preferred method, join both CSV files on their persistentUrl and datasetVersionId columns
- Export the results as a .tab file (or export as a CSV and convert to TAB). Because of the Dataverse software's preference for .tab files, it's easier to work with if you plan to publish this data in a Dataverse repository
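The column selection, join and export steps above can be sketched with pandas. This is a minimal sketch, not the method used to build terms_metadata.tab: the two DataFrames stand in for the CSV files produced by the scripts, and their values (including the DOIs and the extra `title` column) are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (all values are hypothetical)
basic = pd.DataFrame({
    'publisher': ['Repo A', 'Repo A', 'Repo B'],
    'persistentUrl': ['https://doi.org/10.5072/FK2/AAA',
                      'https://doi.org/10.5072/FK2/AAA',
                      'https://doi.org/10.5072/FK2/BBB'],
    'datasetVersionId': [1, 2, 1],
    'majorVersionNumber': [1, 2, 1],
    'minorVersionNumber': [0, 0, 0],
    'title': ['a', 'a', 'b'],  # example of a column that is not retained
})
terms = pd.DataFrame({
    'persistentUrl': ['https://doi.org/10.5072/FK2/AAA',
                      'https://doi.org/10.5072/FK2/AAA',
                      'https://doi.org/10.5072/FK2/BBB'],
    'datasetVersionId': [1, 2, 1],
    'license': ['CC0', 'CC0', 'NONE'],
    'termsOfUse': ['CC0 Waiver', 'CC0 Waiver', 'CC BY 4.0'],
})

# Retain only the five basic metadata columns named in the steps above
keep = ['publisher', 'persistentUrl', 'datasetVersionId',
        'majorVersionNumber', 'minorVersionNumber']

# Join on both keys; validate raises if a key pair appears more than once
joined = basic[keep].merge(
    terms,
    on=['persistentUrl', 'datasetVersionId'],
    how='inner',
    validate='one_to_one',
)

# Tab-separated text; pass a file path as the first argument to write a .tab file
tab_text = joined.to_csv(sep='\t', index=False)
```

Joining on both columns matters because datasetVersionId alone is not unique across repositories.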

In [2]:
data = pd.read_csv('terms_metadata.tab', sep='\t', na_filter = False)

In [3]:
# Check data
print('Number of datasets: %s' % (len(pd.unique(data['persistentUrl']))))
print('Number of dataset versions: %s' %(len(data)))
      

Number of datasets: 133253
Number of dataset versions: 347241


In [None]:
data.head(5)


To keep things simple, let's look at the metadata of only the latest published version of each dataset.

In [4]:
# Get only metadata for the latest versions of each dataset
latestversion = (
    data.loc[data.groupby('persistentUrl')['datasetVersionId'].idxmax()]
    .sort_values(by=['publisher'])
    .reset_index(drop=True)
)
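To make the selection logic concrete, here is a small sketch on made-up data. It shows the approach used in this notebook (highest datasetVersionId per persistentUrl) next to an alternative that uses the version numbers instead. The alternative assumes that majorVersionNumber and minorVersionNumber are numeric, which may not hold for data read with na_filter=False.

```python
import pandas as pd

# Hypothetical versions: dataset A has three published versions, dataset B has one
versions = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:A', 'doi:A', 'doi:B'],
    'datasetVersionId': [10, 11, 12, 7],
    'majorVersionNumber': [1, 1, 2, 1],
    'minorVersionNumber': [0, 1, 0, 0],
})

# Approach used in this notebook: row with the highest datasetVersionId per dataset
by_id = versions.loc[versions.groupby('persistentUrl')['datasetVersionId'].idxmax()]

# Alternative: row with the highest (major, minor) version number per dataset
by_number = (
    versions
    .sort_values(['persistentUrl', 'majorVersionNumber', 'minorVersionNumber'])
    .drop_duplicates('persistentUrl', keep='last')
)
```

Both selections pick the same rows here. On real data they would differ only if database IDs were not assigned in publication order.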


In [None]:
# Check data
print('Number of datasets: %s' % (len(pd.unique(latestversion['persistentUrl']))))
print('Number of dataset versions: %s' %(latestversion.shape[0]))


In [None]:
latestversion.head(5)


To make data munging easier, let's replace any blank values with null. We also know that Dartmouth's metadata exports use the default value "Root" in their publisher field. So to identify datasets from that repository, let's replace all cases of "Root" in the "publisher" column with "Dartmouth".

In [32]:
# Replace any blank values with NaN
latestversion = latestversion.replace(r'^\s*$', np.nan, regex=True)

# Replace publisher "Root" with Dartmouth
latestversion['publisher'] = latestversion['publisher'].replace(['Root'],'Dartmouth')
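As an illustration of what these two replacements do, here is a sketch on a toy dataframe with made-up values. The regular expression `^\s*$` matches cells that are empty or contain only whitespace.

```python
import numpy as np
import pandas as pd

# Toy data: an empty string, a whitespace-only string, and real text
toy = pd.DataFrame({
    'publisher': ['Root', 'Harvard Dataverse', 'Root'],
    'termsOfAccess': ['', '   ', 'Contact the author'],
})

# Empty or whitespace-only cells become NaN
toy = toy.replace(r'^\s*$', np.nan, regex=True)

# Exact-match replacement of the default publisher name
toy['publisher'] = toy['publisher'].replace('Root', 'Dartmouth')
```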


## Exploring the data

What are repositories putting in their datasets' license fields? The Dataverse software ships with support to add a CC0 waiver, and the database stores that waiver in the "license" field (which isn't accessible by the depositor). But some repositories have forked their Dataverse installation to let depositors choose a CC BY license instead. So let's see what values in the license field we need to consider.

In [33]:
latestversion.license.unique()

array(['NONE', 'CC0', 'CC BY', nan, 'CCBY'], dtype=object)

Looks like the license values are CC0, CC BY, and CCBY. The software applies "NONE" when depositors indicate that they don't want the CC0 waiver applied, at which point the Terms of Use field appears in the UI. Depositors can enter anything in that field, including any type of license, but based on community discussion the CC0 waiver and the CC BY license may be the most popular ones applied to datasets, so we'll narrow our exploration to only those two.

Depositors can also enter licenses in other Terms fields, e.g. Terms of Access or Conditions, but it might be fair to assume that most datasets with licenses have that license text in their Terms of Use field, either entered automatically by the Dataverse software when the depositor chooses CC0 or CC BY, or, because of the visual prominence of the Terms of Use field, entered by the depositor when they don't choose CC0 or CC BY. So we'll look for datasets that have text in their Terms of Use field that indicates a CC0 waiver, a CC BY license or another Creative Commons license.

That is, which datasets have the CC0 waiver or CCBY licenses in their Terms of Use metadata? It should be safe and easy to assume that datasets with the words "creative commons" in their Terms of Use metadata will have some sort of Creative Commons license applied, so we'll query for that as well.

In [34]:
data_with_licenses = (
    latestversion.query(
        'termsOfUse.str.contains("CC0")\
        or termsOfUse.str.contains("CCBY")\
        or termsOfUse.str.contains("CC BY")\
        or termsOfUse.str.contains("creative commons", case = False)'
    )
)
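The same filter can be written as an explicit boolean mask, which makes it clearer how missing values are handled. This is a sketch on made-up Terms of Use values, not a replacement for the query above. Passing `na=False` treats datasets with no Terms of Use text as non-matches, and `regex=False` makes each pattern a literal substring.

```python
import numpy as np
import pandas as pd

# Hypothetical Terms of Use values, including one missing value
toy = pd.DataFrame({'termsOfUse': [
    'CC0 Waiver',
    'This work is licensed under a Creative Commons Attribution 4.0 International License',
    'Custom terms',
    np.nan,
    'CC BY 4.0',
    'CCBY',
]})

tou = toy['termsOfUse']
mask = (
    tou.str.contains('CC0', na=False, regex=False)
    | tou.str.contains('CCBY', na=False, regex=False)
    | tou.str.contains('CC BY', na=False, regex=False)
    | tou.str.contains('creative commons', case=False, na=False, regex=False)
)
toy_with_licenses = toy[mask]
```

Only the "creative commons" test ignores case, mirroring the query above, so a value such as "cc by" in lowercase would not match.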

In [35]:
# Check data
print('Number of datasets: %s' % (len(pd.unique(data_with_licenses['persistentUrl']))))


Number of datasets: 113307


Of the datasets with a CC0 waiver or CC license in their Terms of Use field, which have any text in their other Terms metadata fields?

In [36]:
data_with_licenses_and_other_terms = (
    data_with_licenses.query(
        'termsOfAccess == termsOfAccess or\
        availabilityStatus == availabilityStatus or\
        citationRequirements == citationRequirements or\
        conditions == conditions or\
        confidentialityDeclaration == confidentialityDeclaration or\
        contactForAccess == contactForAccess or\
        dataaccessPlace == dataaccessPlace or\
        depositorRequirements == depositorRequirements or\
        disclaimer == disclaimer or\
        originalArchive == originalArchive or\
        restrictions == restrictions or\
        sizeOfCollection == sizeOfCollection or\
        specialPermissions == specialPermissions or\
        studyCompletion == studyCompletion'
    )
)
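The query above relies on the fact that NaN is never equal to itself, so `field == field` is true only where a field has a value. A sketch on made-up data shows that trick next to the more explicit `notna()` form, which gives the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical datasets: A has no other Terms metadata, B and C each have one field filled
toy = pd.DataFrame({
    'persistentUrl': ['doi:A', 'doi:B', 'doi:C'],
    'termsOfAccess': [np.nan, 'Contact the author', np.nan],
    'restrictions': [np.nan, np.nan, np.nan],
    'disclaimer': [np.nan, np.nan, 'No warranty'],
})
fields = ['termsOfAccess', 'restrictions', 'disclaimer']

# The self-comparison trick for a single field: False wherever the value is NaN
self_equal = toy['termsOfAccess'] == toy['termsOfAccess']

# Explicit form: keep rows where at least one of the fields is not null
by_notna = toy[toy[fields].notna().any(axis=1)]
```

The `notna()` form also scales to the full list of Terms fields without repeating each column name twice.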


In [37]:
# Check data
print('Number of datasets: %s' % (len(pd.unique(data_with_licenses_and_other_terms['persistentUrl']))))


Number of datasets: 5804


In [None]:
data_with_licenses_and_other_terms.head(5)

5,804 datasets fit this bill.

Now we have data we can use to answer our two questions:
- When depositors add data with CC0 waivers or CC licenses, how often are they filling the other Terms fields?
- When depositors add data with CC0 waivers or CC licenses and they fill one or more other Terms fields, are they entering things that conflict or might conflict with the waiver or CC license?

Let's try to get a sense of the variety of Terms metadata entered and what kind of grouping we can do to learn how prevalent certain behaviors are.

First let's see how many unique values exist for each field in these 5,804 datasets.

In [38]:
termsFields = ['termsOfAccess', 'availabilityStatus', 'citationRequirements', 'conditions', 'confidentialityDeclaration', 'contactForAccess', 'dataaccessPlace', 'depositorRequirements', 'disclaimer', 'originalArchive', 'restrictions', 'sizeOfCollection', 'specialPermissions', 'studyCompletion']
for field in termsFields:
    print(field + ': ' + str(data_with_licenses_and_other_terms[field].nunique()))

termsOfAccess: 684
availabilityStatus: 24
citationRequirements: 562
conditions: 39
confidentialityDeclaration: 24
contactForAccess: 63
dataaccessPlace: 0
depositorRequirements: 71
disclaimer: 291
originalArchive: 22
restrictions: 59
sizeOfCollection: 18
specialPermissions: 22
studyCompletion: 3


So there are no values entered in dataaccessPlace. There are very few unique values entered in fields like studyCompletion and availabilityStatus. Lots of variation in termsOfAccess and citationRequirements.
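The per-field counts can also be computed in a single call and sorted, which makes the most varied fields easier to spot. A sketch on made-up data, using only three of the Terms fields:

```python
import numpy as np
import pandas as pd

# Hypothetical Terms metadata for four datasets
toy = pd.DataFrame({
    'termsOfAccess': ['Contact the author', 'Contact admin', 'Contact admin', np.nan],
    'dataaccessPlace': [np.nan, np.nan, np.nan, np.nan],
    'studyCompletion': ['Complete', np.nan, 'Complete', np.nan],
})
fields = ['termsOfAccess', 'dataaccessPlace', 'studyCompletion']

# nunique() ignores NaN, so a field with no values counts as 0
unique_counts = toy[fields].nunique().sort_values(ascending=False)
```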

Let's see how many of these datasets are published in each repository.


In [39]:
datasetsByPublisher = data_with_licenses_and_other_terms.value_counts(subset=['publisher']).to_frame('dataset_count')
datasetsByPublisher


publisher,dataset_count
RIN Dataverse,2278
Harvard Dataverse,1963
AUSSDA,447
Scholars Portal Dataverse,284
UNC Dataverse,156
ICRISAT Dataverse,119
World Agroforestry - Research Data Repository,97
QDR Main Collection,86
DataverseNL,54
Peking University Open Research Data Platform,53


In the top two repositories, and 4 of the top 5, publishing datasets is less centralized, meaning that control over what's published and its quality is left to depositors, with less installation-wide oversight. This increases the chance that the metadata, and in this case the Terms metadata, is created by depositors who don't apply a lot of knowledge about data licenses and their relationship to the Dataverse software's DDI-inspired Terms metadata fields.

This will make generalizing more difficult, so let's narrow our exploration to a few of the top 5 repositories listed where there's no or little installation-wide control over what's published.

In [46]:
rin_datasets = (data_with_licenses_and_other_terms.query('publisher == "RIN Dataverse"'))
print('Number of datasets: %s' %(len(rin_datasets)))


Number of datasets: 2278


What values are entered in the license fields of the 2,278 datasets in the RIN repository?

In [45]:
rin_datasets.license.unique()


array(['CC0', 'NONE'], dtype=object)

I know that the repository hasn't forked their code to let depositors choose CCBY, so this is expected. We can assume that of these 2,278 datasets, the depositors of datasets with no license (license = NONE) must have entered CC0 or a CC license in the Terms of Use metadata field. Let's see how many datasets have in their license field "CC0" and how many have "NONE".

In [51]:
rin_datasets.value_counts(subset=['license']).to_frame('dataset_count')


license,dataset_count
CC0,2230
NONE,48


Let's explore the datasets that have CC0 in either their license field or Terms of Use field. We'll save a dataframe with this info.

In [54]:
rin_datasets_cc0 = (
    rin_datasets.query(
        'license == "CC0" or\
        termsOfUse.str.contains("CC0")'
    )
)
print('Number of datasets: %s' %(len(rin_datasets_cc0)))


Number of datasets: 2230


Among the datasets we know have CC0 applied, how many unique values exist for each of the Terms fields?

In [53]:
for field in termsFields:
    print(field + ': ' + str(rin_datasets_cc0[field].nunique()))


termsOfAccess: 43
availabilityStatus: 0
citationRequirements: 0
conditions: 0
confidentialityDeclaration: 0
contactForAccess: 0
dataaccessPlace: 0
depositorRequirements: 0
disclaimer: 0
originalArchive: 1
restrictions: 0
sizeOfCollection: 0
specialPermissions: 0
studyCompletion: 0


Looks like most of the metadata is in the termsOfAccess field. We'll list the unique values entered in that field and count how many datasets have each of those values. 

In [110]:
rin_datasets_cc0_toa = rin_datasets_cc0.value_counts(subset=['termsOfAccess']).to_frame('dataset_count').reset_index(drop=False, inplace=False)
rin_datasets_cc0_toa.head(5)


,termsOfAccess,dataset_count
0,File yang diunduh dari RIN ini mungkin tidak d...,1061
1,File yang diunduh dari RIN ini mungkin tidak d...,507
2,File yang diunduh dari RIN ini mungkin tidak d...,396
3,File yang diunduh dari RIN ini mungkin tidak d...,166
4,File yang diunduh dari RIN ini mungkin tidak d...,44


Right, this repository is based in Indonesia, so we'll need to translate the metadata. Let's translate what's been entered into the Terms of Access fields of the 1,061 RIN datasets where CC0 has been applied.

In [109]:
translator.translate(rin_datasets_cc0_toa['termsOfAccess'][0], lang_src='id', lang_tgt='en')


'Files downloaded from this RIN may not be redistributed in any form (electronic, electro-magnetic or printed) without the prior consent of the data distributor.\n\nBy downloading the file, I confirm that I have read and understood each of the terms set out in the terms and conditions of use of data and other materials found below, and I agree to be bound by all of these terms and conditions.\n\nIf I do not understand or agree to all the terms and conditions, I may not use or download other data or materials.\n<br> <br>\n-----------\n<br> <br>\n<strong> Rules and Conduct </strong> <br>\n\nAs a condition of use, you promise not to use the Service for any purpose prohibited by the Terms of Use. For purposes of the Terms of Use, "Content" includes, without limitation, information, data, text, software, scripts, graphics and any interactive features that are generated, provided or made accessible to RIN or its partners on or through the Services. For example, and not as a limitation, you m

"Files downloaded from this RIN may not be redistributed in any form (electronic, electro-magnetic or printed) without the prior consent of the data distributor."

Right away, it seems like what's been entered into the Terms of Access field of at least 1,061 RIN datasets conflicts with CC0, which I would assume lets the person who's downloaded the files do whatever they want with them, even without "the prior consent of the data distributor."

But this data is hosted in Indonesia, and while CC0 gives "creators a way to waive all their copyright and related rights in their works to the fullest extent allowed by law" (https://creativecommons.org/share-your-work/public-domain/cc0/), we don't know what "the fullest extent allowed by law" means in Indonesia.
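One rough way to count how many CC0 datasets carry this kind of redistribution restriction is to search the Terms of Access text for the key phrase. This is only a heuristic sketch on made-up rows: the phrase is taken from the Indonesian text shown above, the `possible_conflict` column name is hypothetical, and a match only flags a dataset for manual review. It doesn't establish a legal conflict.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the first Terms of Access value is shortened from the RIN text
toy = pd.DataFrame({
    'license': ['CC0', 'CC0', 'CC0'],
    'termsOfAccess': [
        'File yang diunduh dari RIN ini mungkin tidak didistribusikan ulang dalam bentuk apapun',
        'hubungi admin',
        np.nan,
    ],
})

# "tidak didistribusikan ulang" translates to "not redistributed"
phrase = 'tidak didistribusikan ulang'

# Flag CC0 datasets whose Terms of Access mention the redistribution restriction
toy['possible_conflict'] = (
    toy['license'].eq('CC0')
    & toy['termsOfAccess'].str.contains(phrase, case=False, na=False, regex=False)
)
```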


Now let's look at the other RIN datasets that may be using CC licenses.


In [117]:
rin_datasets_cc = (rin_datasets.query('license == "NONE"'))
print('Number of datasets: %s' %(len(rin_datasets_cc)))


Number of datasets: 48


Which CC licenses are these 48 datasets using? "NONE" is in their license fields, so there must be some text in their Terms of Use fields about a CC license:

In [125]:
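# Count datasets per unique Terms of Use value (these are the license = NONE datasets)
rin_datasets_cc_tou = rin_datasets_cc.value_counts(subset=['termsOfUse']).to_frame('dataset_count').reset_index(drop=False, inplace=False)
# max_colwidth=None shows the full text; -1 is deprecated in recent pandas versions
with pd.option_context('display.max_colwidth', None):
    display(rin_datasets_cc_tou)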


,termsOfUse,dataset_count
0,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/"">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.",16
1,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/"">Creative Commons Attribution-NonCommercial 4.0 International License</a>.",6
2,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/""><img alt=""Lisensi Creative Commons"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png"" /></a><br />Ciptaan disebarluaskan di bawah <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/"">Lisensi Creative Commons Atribusi-NonKomersial-TanpaTurunan 4.0 Internasional</a>.",5
3,"<a rel=""license"" href=""http://creativecommons.org/lic enses/by-nd/4.0/""><img alt=""Creative Commons License"" style=""borderwidth:0"" src=""https://i.creativecommons.org/l/ by-nd/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/lic enses/by-nd/4.0/"">Creative Commons AttributionNoDerivatives 4.0 International License</a>",5
4,"<a rel=""license""\nhref=""http://creativecommons.org/li\ncenses/by-nc-nd/4.0/""><img\nalt=""Creative Commons License""\nstyle=""border-width:0""\nsrc=""https://i.creativecommons.org/l\n/by-nc-nd/4.0/88x31.png"" /></a><br\n/>This work is licensed under a <a\nrel=""license""\nhref=""http://creativecommons.org/li\ncenses/by-nc-nd/4.0/"">Creative\nCommons AttributionNonCommercial-NoDerivatives 4.0\nInternational License</a>.",3
5,"<a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/"">Creative Commons Attribution 4.0 International License</a>.",2
6,"<a rel=""license""\nhref=""http://creativecommons.org/lic\nenses/by/4.0/""><img alt=""Creative\nCommons License"" style=""borderwidth:0""\nsrc=""https://i.creativecommons.org/l/\nby/4.0/88x31.png"" /></a><br />This\nwork is licensed under a <a\nrel=""license""\nhref=""http://creativecommons.org/lic\nenses/by/4.0/"">Creative Commons\nAttribution 4.0 International\nLicense</a>.",2
7,"Copyright (c) 2020 Valencia Matthew Anis, Harijanto Sabijono, Stanley Kho Walandouw\n\n<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/"">Creative Commons Attribution-NonCommercial 4.0 International License</a>.",1
8,"Copyright (c) 2020 Enggar Wahyuning Pahlawan, Anita Wijayanti, Suhendro\n\n<a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/"">Creative Commons Attribution 4.0 International License</a>.",1
9,"Copyright (c) 2020 Een Samawati Miharja\n\n<a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/"">Creative Commons Attribution 4.0 International License</a>.",1


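Scanning the rows, we see that all 48 datasets use a Creative Commons Attribution license, either CC BY itself or one of its variants (CC BY-NC, CC BY-ND, CC BY-NC-ND), with 16 datasets using the same exact text. What does the Terms metadata of those 48 datasets look like?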
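Instead of scanning the rows by eye, the license variant can be pulled out of the license URL in each Terms of Use value. The sketch below is one possible approach, with a hypothetical helper named `cc_variant`. It strips all whitespace first because some of the values above have spaces or line breaks inside the URL (for example "lic enses").

```python
import re

# Matches e.g. creativecommons.org/licenses/by-nc-nd/4.0
LICENSE_URL = re.compile(r'creativecommons\.org/licenses/([a-z-]+)/(\d+\.\d+)')

def cc_variant(terms_of_use):
    """Return a label such as 'CC BY-NC-ND 4.0', or None if no CC license URL is found."""
    # Remove spaces and line breaks so that broken URLs still match
    compact = re.sub(r'\s+', '', terms_of_use)
    match = LICENSE_URL.search(compact)
    if match is None:
        return None
    return 'CC ' + match.group(1).upper() + ' ' + match.group(2)

# Shortened examples based on the Terms of Use values shown above
clean = '<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">'
broken_space = '<a rel="license" href="http://creativecommons.org/lic enses/by-nd/4.0/">'
broken_newline = '<a rel="license"\nhref="http://creativecommons.org/lic\nenses/by/4.0/">'

variants = [cc_variant(text) for text in (clean, broken_space, broken_newline)]
```

Applied to the termsOfUse column (for example with `Series.map`), this would allow counting datasets per license variant rather than per unique text.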

In [126]:
for field in termsFields:
    print(field + ': ' + str(rin_datasets_cc[field].nunique()))


termsOfAccess: 30
availabilityStatus: 0
citationRequirements: 0
conditions: 0
confidentialityDeclaration: 1
contactForAccess: 1
dataaccessPlace: 0
depositorRequirements: 0
disclaimer: 0
originalArchive: 0
restrictions: 3
sizeOfCollection: 0
specialPermissions: 0
studyCompletion: 0


A few more fields have values, compared to RIN's CC0 datasets.

Let's see what's in the Terms of Access fields first.

In [131]:
rin_datasets_cc_toa = rin_datasets_cc.value_counts(subset=['termsOfAccess']).to_frame('dataset_count').reset_index(drop=False, inplace=False)
# max_colwidth=None shows the full text; -1 is deprecated in recent pandas versions
with pd.option_context('display.max_colwidth', None):
    display(rin_datasets_cc_toa)


,termsOfAccess,dataset_count
0,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc/4.0/"">Creative Commons Attribution-NonCommercial 4.0 International License</a>.",7
1,"<a rel=""license"" href=""http://creativecommons.org/lic enses/by-nd/4.0/""><img alt=""Creative Commons License"" style=""borderwidth:0"" src=""https://i.creativecommons.org/l/ by-nd/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/lic enses/by-nd/4.0/"">Creative Commons AttributionNoDerivatives 4.0 International License</a>",5
2,"<a rel=""license""\nhref=""http://creativecommons.org/li\ncenses/by-nc-nd/4.0/""><img\nalt=""Creative Commons License""\nstyle=""border-width:0""\nsrc=""https://i.creativecommons.org/l\n/by-nc-nd/4.0/88x31.png"" /></a><br\n/>This work is licensed under a <a\nrel=""license""\nhref=""http://creativecommons.org/li\ncenses/by-nc-nd/4.0/"">Creative\nCommons AttributionNonCommercial-NoDerivatives 4.0\nInternational License</a>.",3
3,"<a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by/4.0/"">Creative Commons Attribution 4.0 International License</a>.",2
4,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/"">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.\n\nFile yang diunduh dari RIN ini mungkin tidak didistribusikan ulang dalam bentuk apapun (elektronik, elektro-magnetik atau cetak) tanpa izin dari distributor data sebelumnya.\n\nDengan mendownload file, saya mengonfirmasi bahwa saya telah membaca dan memahami setiap istilah yang ditetapkan dalam persyaratan dan ketentuan penggunaan data dan materi lain yang ditemukan di bawah ini, dan saya setuju untuk terikat oleh semua persyaratan dan ketentuan tersebut.\n\nJika saya tidak mengerti atau menyetujui semua persyaratan dan ketentuan, saya tidak boleh menggunakan atau mendownload data atau materi lainnya.\n<br><br>\n-----------\n<br><br>\n<strong>Aturan dan Perilaku</strong><br>\n\nSebagai syarat penggunaan, Anda berjanji untuk tidak menggunakan Layanan untuk tujuan apa pun yang dilarang oleh Persyaratan Penggunaan. Untuk tujuan Persyaratan Penggunaan, ""Konten"" mencakup, tanpa batasan, informasi, data, teks, perangkat lunak, skrip, grafik, dan fitur interaktif apa pun yang dihasilkan, disediakan, atau dibuat dapat diakses oleh RIN atau mitranya pada atau melalui Layanan . Sebagai contoh, dan bukan sebagai batasan, Anda tidak boleh (atau mengizinkan orang lain untuk) melakukan (a) mengambil tindakan atau (b) mengunggah, mendownload, mengirim, mengirimkan atau mendistribusikan atau memfasilitasi distribusi konten apapun dengan menggunakan komunikasi apapun. 
layanan atau layanan lain yang tersedia pada atau melalui Layanan, bahwa:\n<ul>\n <li>melanggar hak paten, merek dagang, rahasia dagang, hak cipta, hak publisitas atau hak lain dari orang atau entitas lain;</li>\n <li>melanggar hukum, mengancam, kasar, melecehkan, memfitnah, memfitnah, menipu, menipu, melanggar privasi orang lain, menyiksa, cabul, menyinggung, atau profan;</li>\n <li>merupakan iklan yang tidak sah atau tidak diminta, spam atau e-mail massal (""spamming"");</li>\n <li> berisi virus perangkat lunak atau kode komputer, file, atau program lain yang dirancang atau dimaksudkan untuk mengganggu, merusak, membatasi atau mengganggu fungsi perangkat lunak, perangkat keras, atau perangkat telekomunikasi yang sesuai atau merusak atau mendapatkan akses tidak sah ke sistem apapun, data atau informasi lain dari RIN atau pihak ketiga; atau Selain itu, Anda tidak boleh: (i) mengambil tindakan yang memaksakan atau memaksakan (sebagaimana ditentukan oleh RIN atas kebijakannya sendiri) beban besar yang tidak masuk akal atau tidak proporsional terhadap infrastruktur RIN (atau penyedia pihak ketiga); (ii) mengganggu atau mencoba mengganggu pelaksanaan Layanan atau kegiatan yang dilakukan oleh Dinas; atau (iii) mengabaikan semua tindakan yang mungkin digunakan RIN untuk mencegah atau membatasi akses ke Layanan (atau akun lain, sistem komputer atau jaringan yang terhubung ke Layanan).</li>\n</ul>\n\nAnda harus mematuhi semua hukum dan peraturan lokal, negara bagian, nasional dan internasional yang berlaku.",2
5,"<a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/""><img alt=""Creative Commons License"" style=""border-width:0"" src=""https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png"" /></a><br />This work is licensed under a <a rel=""license"" href=""http://creativecommons.org/licenses/by-nc-nd/4.0/"">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.\n\nFile yang diunduh dari RIN ini mungkin tidak didistribusikan ulang dalam bentuk apapun (elektronik, elektro-magnetik atau cetak) tanpa izin dari distributor data sebelumnya. Dengan mendownload file, saya mengonfirmasi bahwa saya telah membaca dan memahami setiap istilah yang ditetapkan dalam persyaratan dan ketentuan penggunaan data dan materi lain yang ditemukan di bawah ini, dan saya setuju untuk terikat oleh semua persyaratan dan ketentuan tersebut. Jika saya tidak mengerti atau menyetujui semua persyaratan dan ketentuan, saya tidak boleh menggunakan atau mendownload data atau materi lainnya.\n\n-----------\n\nAturan dan Perilaku\nSebagai syarat penggunaan, Anda berjanji untuk tidak menggunakan Layanan untuk tujuan apa pun yang dilarang oleh Persyaratan Penggunaan. Untuk tujuan Persyaratan Penggunaan, ""Konten"" mencakup, tanpa batasan, informasi, data, teks, perangkat lunak, skrip, grafik, dan fitur interaktif apa pun yang dihasilkan, disediakan, atau dibuat dapat diakses oleh RIN atau mitranya pada atau melalui Layanan . Sebagai contoh, dan bukan sebagai batasan, Anda tidak boleh (atau mengizinkan orang lain untuk) melakukan (a) mengambil tindakan atau (b) mengunggah, mendownload, mengirim, mengirimkan atau mendistribusikan atau memfasilitasi distribusi konten apapun dengan menggunakan komunikasi apapun. 
layanan atau layanan lain yang tersedia pada atau melalui Layanan, bahwa:\n\nmelanggar hak paten, merek dagang, rahasia dagang, hak cipta, hak publisitas atau hak lain dari orang atau entitas lain;\nmelanggar hukum, mengancam, kasar, melecehkan, memfitnah, memfitnah, menipu, menipu, melanggar privasi orang lain, menyiksa, cabul, menyinggung, atau profan;\nmerupakan iklan yang tidak sah atau tidak diminta, spam atau e-mail massal (""spamming"");\nberisi virus perangkat lunak atau kode komputer, file, atau program lain yang dirancang atau dimaksudkan untuk mengganggu, merusak, membatasi atau mengganggu fungsi perangkat lunak, perangkat keras, atau perangkat telekomunikasi yang sesuai atau merusak atau mendapatkan akses tidak sah ke sistem apapun, data atau informasi lain dari RIN atau pihak ketiga; atau Selain itu, Anda tidak boleh: (i) mengambil tindakan yang memaksakan atau memaksakan (sebagaimana ditentukan oleh RIN atas kebijakannya sendiri) beban besar yang tidak masuk akal atau tidak proporsional terhadap infrastruktur RIN (atau penyedia pihak ketiga); (ii) mengganggu atau mencoba mengganggu pelaksanaan Layanan atau kegiatan yang dilakukan oleh Dinas; atau (iii) mengabaikan semua tindakan yang mungkin digunakan RIN untuk mencegah atau membatasi akses ke Layanan (atau akun lain, sistem komputer atau jaringan yang terhubung ke Layanan).\nAnda harus mematuhi semua hukum dan peraturan lokal, negara bagian, nasional dan internasional yang berlaku.",2
6,"File yang diunduh dari RIN ini mungkin tidak didistribusikan ulang dalam bentuk apapun (elektronik, elektro-magnetik atau cetak) tanpa izin dari distributor data sebelumnya. Dengan mendownload file, saya mengonfirmasi bahwa saya telah membaca dan memahami setiap istilah yang ditetapkan dalam persyaratan dan ketentuan penggunaan data dan materi lain yang ditemukan di bawah ini, dan saya setuju untuk terikat oleh semua persyaratan dan ketentuan tersebut. Jika saya tidak mengerti atau menyetujui semua persyaratan dan ketentuan, saya tidak boleh menggunakan atau mendownload data atau materi lainnya.\n\n-----------\n\nAturan dan Perilaku\nSebagai syarat penggunaan, Anda berjanji untuk tidak menggunakan Layanan untuk tujuan apa pun yang dilarang oleh Persyaratan Penggunaan. Untuk tujuan Persyaratan Penggunaan, ""Konten"" mencakup, tanpa batasan, informasi, data, teks, perangkat lunak, skrip, grafik, dan fitur interaktif apa pun yang dihasilkan, disediakan, atau dibuat dapat diakses oleh RIN atau mitranya pada atau melalui Layanan . Sebagai contoh, dan bukan sebagai batasan, Anda tidak boleh (atau mengizinkan orang lain untuk) melakukan (a) mengambil tindakan atau (b) mengunggah, mendownload, mengirim, mengirimkan atau mendistribusikan atau memfasilitasi distribusi konten apapun dengan menggunakan komunikasi apapun. 
layanan atau layanan lain yang tersedia pada atau melalui Layanan, bahwa:\n\nmelanggar hak paten, merek dagang, rahasia dagang, hak cipta, hak publisitas atau hak lain dari orang atau entitas lain;\nmelanggar hukum, mengancam, kasar, melecehkan, memfitnah, memfitnah, menipu, menipu, melanggar privasi orang lain, menyiksa, cabul, menyinggung, atau profan;\nmerupakan iklan yang tidak sah atau tidak diminta, spam atau e-mail massal (""spamming"");\nberisi virus perangkat lunak atau kode komputer, file, atau program lain yang dirancang atau dimaksudkan untuk mengganggu, merusak, membatasi atau mengganggu fungsi perangkat lunak, perangkat keras, atau perangkat telekomunikasi yang sesuai atau merusak atau mendapatkan akses tidak sah ke sistem apapun, data atau informasi lain dari RIN atau pihak ketiga; atau Selain itu, Anda tidak boleh: (i) mengambil tindakan yang memaksakan atau memaksakan (sebagaimana ditentukan oleh RIN atas kebijakannya sendiri) beban besar yang tidak masuk akal atau tidak proporsional terhadap infrastruktur RIN (atau penyedia pihak ketiga); (ii) mengganggu atau mencoba mengganggu pelaksanaan Layanan atau kegiatan yang dilakukan oleh Dinas; atau (iii) mengabaikan semua tindakan yang mungkin digunakan RIN untuk mencegah atau membatasi akses ke Layanan (atau akun lain, sistem komputer atau jaringan yang terhubung ke Layanan).\nAnda harus mematuhi semua hukum dan peraturan lokal, negara bagian, nasional dan internasional yang berlaku",2
7,hubungi admin,2
8,"This document is restricted, please contact the author for full access.",1
9,"Bila berminat dengan data saya, harap menghubungi saya via email",1


Most of these datasets have the same CCBY text entered in their Terms of Access fields. Scrolling down, we see 2 datasets in the fifth row with extra text in Indonesian. What does it say?


In [130]:
translator.translate(rin_datasets_cc_toa['termsOfAccess'][4], lang_src='id', lang_tgt='en')

'<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"> <img alt = "Creative Commons License" style = "border-width: 0" src = "https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /> </a> <br /> This work is licensed under a <a rel = "license" href = "http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License </a>.\n\nFiles downloaded from this RIN may not be redistributed in any form (electronic, electro-magnetic or printed) without the prior consent of the data distributor.\n\nBy downloading the file, I confirm that I have read and understood each of the terms set out in the terms and conditions of use of data and other materials found below, and I agree to be bound by all of these terms and conditions.\n\nIf I do not understand or agree to all the terms and conditions, I may not use or download other data or materials.\n<br> <br>\n-----------\n<br> <br>\n<strong> Rules and Conduct </str

After the CCBY links and images, the text reads: "Files downloaded from this RIN may not be redistributed in any form (electronic, electro-magnetic or printed) without the prior consent of the data distributor." This sentence and the remaining text look a lot like what was entered in the Terms of Access fields of the CC0 datasets we looked at earlier.

Again, this seems to conflict with the CCBY license these datasets link to (CC BY-NC-ND 4.0, https://creativecommons.org/licenses/by-nc-nd/4.0/). That license does limit redistribution, but its limits don't include getting the data distributor's consent.
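Reading these values is easier once the license markup is separated from the free text that follows it. The sketch below is not part of the original analysis: the function name `split_license_and_text` and the shortened sample string are hypothetical, and stripping tags with a regular expression is a rough shortcut that is only adequate for simple markup like the snippet shown above.

```python
import re

# Matches the href of any link that points at a Creative Commons license page.
LICENSE_HREF = re.compile(r'href\s*=\s*"(https?://creativecommons\.org/licenses/[^"]+)"')
# Crude tag matcher; an assumption that the markup is simple and well formed.
TAG = re.compile(r"<[^>]+>")


def split_license_and_text(terms):
    """Return (license URLs found in the markup, text with tags removed)."""
    urls = sorted(set(LICENSE_HREF.findall(terms)))
    # Replace each tag with a space, then collapse runs of whitespace.
    plain = " ".join(TAG.sub(" ", terms).split())
    return urls, plain


# Hypothetical, shortened stand-in for one translated termsOfAccess value.
sample = (
    '<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">'
    "Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.\n\n"
    "Files downloaded from this RIN may not be redistributed in any form "
    "(electronic, electro-magnetic or printed) without the prior consent of the data distributor."
)

urls, plain = split_license_and_text(sample)
print(urls)
print(plain)
```

Seeing the license URL next to the plain text makes it easier to compare the stated license with any restriction added after it.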

Let's see how many of the 48 CCBY datasets have this text about consent. "tanpa izin dari distributor data sebelumnya" means "without prior permission of the data distributor".


In [142]:
rin_datasets_cc_toa.query(
    'termsOfAccess.str.contains("tanpa izin dari distributor data sebelumnya")'
)


Unnamed: 0,termsOfAccess,dataset_count
4,"<a rel=""license"" href=""http://creativecommons....",2
5,"<a rel=""license"" href=""http://creativecommons....",2
6,File yang diunduh dari RIN ini mungkin tidak d...,2
12,"<a rel=""license"" href=""http://creativecommons....",1
21,File yang diunduh dari RIN ini mungkin tidak d...,1
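Each row above is a distinct Terms of Access value, so the number of datasets is the sum of `dataset_count` over the matching rows, not the number of rows. Here is a minimal sketch of that tally. It assumes a toy frame named `toy` whose texts are shortened stand-ins and whose counts are copied from the rows above, plus one non-matching row ("hubungi admin"); it is not the notebook's actual `rin_datasets_cc_toa` frame.

```python
import pandas as pd

PHRASE = "tanpa izin dari distributor data sebelumnya"

# Shortened stand-ins for the two kinds of matching text seen in the output.
LICENSE_HTML = '<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"></a> '
RESTRICTION = (
    "File yang diunduh dari RIN ini mungkin tidak didistribusikan ulang dalam bentuk apapun "
    "(elektronik, elektro-magnetik atau cetak) " + PHRASE + "."
)

# Toy frame (hypothetical): counts follow rows 4, 5, 6, 12, 21 and 7 of the output.
toy = pd.DataFrame(
    {
        "termsOfAccess": [
            LICENSE_HTML + RESTRICTION,
            LICENSE_HTML + RESTRICTION,
            RESTRICTION,
            LICENSE_HTML + RESTRICTION,
            RESTRICTION,
            "hubungi admin",
        ],
        "dataset_count": [2, 2, 2, 1, 1, 2],
    }
)

# regex=False treats the phrase as a literal string rather than a pattern.
has_phrase = toy["termsOfAccess"].str.contains(PHRASE, regex=False)

rows_with_phrase = int(has_phrase.sum())
datasets_with_phrase = int(toy.loc[has_phrase, "dataset_count"].sum())
print(rows_with_phrase, datasets_with_phrase)
```

On the toy frame this gives 5 matching rows but 8 datasets, which is why the totals are taken from `dataset_count`.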


Just 8 of the 48 CCBY datasets have metadata with this distribution restriction. What about the other 40 datasets?

In [143]:
# Keep the rows whose Terms of Access text does not contain the consent phrase.
# Alternative using query():
# rin_datasets_cc_toa.query(
#     'termsOfAccess.str.contains("tanpa izin dari distributor data sebelumnya") == False'
# )
rin_datasets_cc_toa[
    rin_datasets_cc_toa["termsOfAccess"].str.contains("tanpa izin dari distributor data sebelumnya") == False
]




Unnamed: 0,termsOfAccess,dataset_count
0,"<a rel=""license"" href=""http://creativecommons....",7
1,"<a rel=""license"" href=""http://creativecommons....",5
2,"<a rel=""license""\nhref=""http://creativecommons...",3
3,"<a rel=""license"" href=""http://creativecommons....",2
7,hubungi admin,2
8,"This document is restricted, please contact th...",1
9,"Bila berminat dengan data saya, harap menghubu...",1
10,jika membutuhkan data ini silahkan hubungi ema...,1
11,"<a rel=""license"" href=""http://creativecommons....",1
13,hubungi Dede Sunarya (081320076678),1
