## Goals
Objective 2 of the Harvard Data Commons project seeks to make it easier for users of the Harvard Dataverse Repository to publish, find, and re-use the "workflow" datasets that the repository hosts. The group working on enhancing the repository's functionality toward this end defines a "workflow" dataset as any dataset that contains at least one of several types of files indicating that the published data is meant to be computationally reproducible.

For research into the reusability of workflow datasets, see "Trisovic, A., Lau, M.K., Pasquier, T. et al. A large-scale study on research code quality and execution. Sci Data 9, 60 (2022). https://doi.org/10.1038/s41597-022-01143-6"

This notebook aims to provide benchmarks for how findable these workflow datasets have been. After changes are made to the repository to improve their discoverability, the methods explored here might help indicate how effective those changes were, by comparing these benchmarks with measurements taken after the changes have been in place for some time.

Two possible indicators of the discoverability of workflow datasets are the number of times datasets' files have been downloaded and the number of times that their web pages have been viewed.

The Harvard Dataverse Repository uses Google Analytics to track page views, so we can look there for metrics about how often all datasets are viewed. To find which datasets would be considered workflow datasets and to find out how often files in all datasets, including workflow datasets, are downloaded, we can look in the repository's database.

So let's try to answer the following questions about file downloads:
- What has been the average number of total file downloads per dataset, across all datasets in the repository?
- What has been the average number of total file downloads per workflow dataset?

And let's try to answer the following questions about page views:
- On average, how often is each dataset's page viewed?
- On average, how often is each workflow dataset's page viewed?
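
As a rough illustration of the averages these questions ask for, here is a minimal Python sketch. The record layout, field names, DOIs, and counts are all hypothetical placeholders, not values from the repository; real records would come from the database and Google Analytics.

```python
from statistics import mean

# Hypothetical per-dataset records. Field names, DOIs, and counts are
# placeholders for illustration only, not real repository data.
datasets = [
    {"doi": "doi:10.7910/DVN/AAAAAA", "downloads": 120, "page_views": 300,
     "workflow_files": ["analysis.ipynb"]},
    {"doi": "doi:10.7910/DVN/BBBBBB", "downloads": 30, "page_views": 90,
     "workflow_files": []},
    {"doi": "doi:10.7910/DVN/CCCCCC", "downloads": 60, "page_views": 150,
     "workflow_files": ["Makefile"]},
]


def average_metric(records, metric, workflow_only=False):
    """Return the mean of `metric` over all records, or only over records
    that list at least one workflow file. Returns None if nothing matches."""
    selected = [
        r for r in records
        if not workflow_only or r["workflow_files"]
    ]
    if not selected:
        return None
    return mean(r[metric] for r in selected)


avg_downloads_all = average_metric(datasets, "downloads")
avg_downloads_workflow = average_metric(datasets, "downloads", workflow_only=True)
avg_views_all = average_metric(datasets, "page_views")
avg_views_workflow = average_metric(datasets, "page_views", workflow_only=True)
```

Computing the same four numbers before and after a change to the repository would give one simple way to compare the two periods, assuming the same datasets and a comparable time window are used for both.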







### Query the database for information about each published dataset:
- DOI
- Publication Date
- Total number of file downloads
- A list of its "workflow" files, if any

Query for finding "workflow" files, which also returns the DOI of the dataset each file is in:

```
select
    datasetversion.dataset_id as dataset_database_id,
    concat('https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/', dvobject.identifier) as dataset_url,
    case
        when dvobject.publicationdate is not null then 'Published'
        when dvobject.publicationdate is null then 'Unpublished'
    end as publication_state,
    datafile.id as file_database_id,
    datafile.contenttype as file_contenttype,
    filemetadata.label as file_name,
    case
        when (filemetadata.label ilike '%.ipynb' or datafile.contenttype = 'application/x-ipynb+json') then 'Jupyter Notebook'
        -- Any file ending in .rmd is matched by the label test, including those stored as text/markdown
        when (filemetadata.label ilike '%.rmd' or datafile.contenttype = 'text/x-r-markdown') then 'R Notebook'
        when (filemetadata.label ilike '%.cwl') then 'Common Workflow Language'
        when (filemetadata.label ilike '%.ga') then 'Galaxy'
        when (filemetadata.label ilike '%.wdl') then 'WDL'
        when (filemetadata.label ilike 'makefile') then 'Makefile'
        else 'None'
    end as type_of_code_file
from datafile
join filemetadata on filemetadata.datafile_id = datafile.id
join datasetversion on datasetversion.id = filemetadata.datasetversion_id
join dataset on dataset.id = datasetversion.dataset_id
join dvobject on dvobject.id = dataset.id
where
    dataset.harvestingclient_id is null and
    (
        filemetadata.label ilike any (array['%.ipynb', '%.rmd', '%.cwl', '%.ga', '%.wdl']) or
        datafile.contenttype in ('application/x-ipynb+json', 'text/x-r-markdown') or
        (datafile.contenttype = 'text/markdown' and filemetadata.label ilike '%.rmd') or
        (filemetadata.label ilike 'makefile')
    )
order by dataset.id
```
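
The classification rules in the query's `case` expression can also be sketched in Python, which may be handy for checking individual file names outside the database. This is only an illustrative mirror of the SQL above: the function name `classify_workflow_file` is hypothetical, and the sketch assumes that `ilike '%.ext'` amounts to a case-insensitive suffix match and that `ilike 'makefile'` amounts to a case-insensitive exact match.

```python
def classify_workflow_file(file_name, content_type=None):
    """Mirror the SQL `case` expression: return the type of workflow file,
    or 'None' if the file matches none of the rules.

    Hypothetical helper for illustration; rules are checked in the same
    order as in the query.
    """
    name = file_name.lower()  # ilike is case-insensitive

    if name.endswith(".ipynb") or content_type == "application/x-ipynb+json":
        return "Jupyter Notebook"
    if name.endswith(".rmd") or content_type == "text/x-r-markdown":
        return "R Notebook"
    if name.endswith(".cwl"):
        return "Common Workflow Language"
    if name.endswith(".ga"):
        return "Galaxy"
    if name.endswith(".wdl"):
        return "WDL"
    if name == "makefile":  # exact match, no wildcard in the SQL pattern
        return "Makefile"
    return "None"


# Example usage with made-up file names
examples = ["Analysis.IPYNB", "report.Rmd", "pipeline.cwl", "Makefile", "data.csv"]
labels = [classify_workflow_file(name) for name in examples]
```

Because the `Makefile` rule is an exact match, a file named `Makefile.txt` would not be classified as a workflow file under these rules.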