<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Questions" data-toc-modified-id="Questions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Questions</a></span></li><li><span><a href="#Import-modules" data-toc-modified-id="Import-modules-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import modules</a></span></li><li><span><a href="#Get-data-from-repository-database" data-toc-modified-id="Get-data-from-repository-database-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get data from repository database</a></span></li><li><span><a href="#Read-in-data" data-toc-modified-id="Read-in-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Read in data</a></span></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Analysis</a></span></li></ul></div>

## Questions

- How often have people published datasets in Root dataverse?
    - What are the account types of those users? (shib, orcid, google, github, builtin)
        - Among those users with accounts that are not shib accounts, which have .edu in their email addresses?
    - How often have people created multiple datasets in Root? (For example, User1 publishes a dataset in Root, then publishes one or more additional datasets in Root. How often does this happen?)

## Import modules

In [3]:
import csv
import pandas as pd


## Get data from repository database

Query for getting the data from the repository's PostgreSQL database:
```sql
select
	dvobject.id as deposit_id, dtype as deposit_type, dvobject.createdate as deposit_createdate,
		publicationdate as deposit_publicationdate, dvobject.owner_id as deposit_parent_dataverse,
	dvobject.creator_id as user_account_id_of_deposit, authenticateduser.createdtime as user_account_createdate,
		authenticateduserlookup.authenticationproviderid as user_account_type, affiliation,
		split_part(email,'@',2) as account_email_type
from dvobject
join authenticateduser on authenticateduser.id = dvobject.creator_id
join authenticateduserlookup on authenticateduserlookup.authenticateduser_id = authenticateduser.id
where dvobject.id not in(
	select id
	from dataset
	where harvestingclient_id is not null
)
and dtype != 'DataFile'
and publicationdate is not null
```
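
The `split_part(email,'@',2)` expression in the query keeps only the second `@`-separated field, so the export carries each account's email domain rather than the full address. For comparison, here is a minimal pandas sketch of the same step. The `emails` Series holds made-up addresses and is not part of the exported data.

```python
import pandas as pd

# Made-up addresses, for illustration only
emails = pd.Series(['a.user@gmail.com', 'b.user@wjh.harvard.edu', 'c.user@icu.ac.jp'])

# Split on '@' and keep the second field, like split_part(email, '@', 2)
account_email_type = emails.str.split('@').str[1]
print(account_email_type.tolist())
```

One difference to keep in mind: for a value with no `@`, `split_part` returns an empty string, while this pandas version yields `NaN`.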

## Read in data

In [41]:
# na_filter=False keeps empty fields as empty strings instead of converting them to NaN
rawDataDF = pd.read_csv('user_account_publish_data.csv', na_filter=False)
print(rawDataDF.shape)
rawDataDF.head(5)


(41522, 10)


Unnamed: 0,deposit_id,deposit_type,deposit_createdate,deposit_publicationdate,deposit_parent_dataverse,user_account_id_of_deposit,user_account_createdate,user_account_type,affiliation,account_email_type
0,2839889,Dataverse,2016-06-08 22:45:59.18,2016-06-08 23:17:02.726,1,14473,2000-01-01 00:00:00,builtin,Stanford University,gmail.com
1,2840637,Dataverse,2016-06-13 15:20:38.27,2017-06-14 17:11:03.269,1,14508,2000-01-01 00:00:00,builtin,,gmail.com
2,644,Dataverse,2009-07-27 03:59:59.42,2011-06-15 07:43:23.703,1,1417,2000-01-01 00:00:00,builtin,International Christian University,icu.ac.jp
3,2840679,Dataverse,2016-06-13 15:31:31.37,2017-06-23 16:35:36.095,1,14507,2000-01-01 00:00:00,builtin,,jhu.edu
4,646,Dataverse,2009-07-29 16:22:43.88,2009-07-29 17:18:41.474,1,1425,2000-01-01 00:00:00,builtin,Harvard University,wjh.harvard.edu
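
With `na_filter = False`, `pd.read_csv` leaves empty fields as empty strings, which is why the blank affiliations above are not shown as `NaN`. The sketch below shows the difference on a small in-memory CSV. The rows are made up and only reuse two column names from the export.

```python
import io
import pandas as pd

# Two made-up rows: the first has no parent dataverse, the second has no affiliation
csv_text = (
    'deposit_id,deposit_parent_dataverse,affiliation\n'
    '1,,Harvard University\n'
    '2,1,\n'
)

with_filter = pd.read_csv(io.StringIO(csv_text))
without_filter = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(with_filter['affiliation'].isna().sum())        # blank field read as NaN
print((without_filter['affiliation'] == '').sum())    # blank field kept as ''
print(without_filter['deposit_parent_dataverse'].tolist())
```

A side effect is that a numeric column containing any blank field is read as strings. That may be why the filter in the Analysis section compares `deposit_parent_dataverse` to `"1"` rather than `1`, although the notebook does not say so.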


## Analysis

In [42]:
# Keep datasets published directly in Root dataverse (parent dataverse ID 1)
# that were created after 2018-10-31, i.e. over the last two years
datasetsInRootDF = rawDataDF.query('deposit_type == "Dataset"\
                 & deposit_parent_dataverse == "1"\
                 & deposit_createdate > "2018-10-31"')
print(datasetsInRootDF.shape)
datasetsInRootDF.head(5)


(3109, 10)


Unnamed: 0,deposit_id,deposit_type,deposit_createdate,deposit_publicationdate,deposit_parent_dataverse,user_account_id_of_deposit,user_account_createdate,user_account_type,affiliation,account_email_type
163,3825664,Dataset,2020-05-05 05:28:11.229,2020-05-05 05:30:37.154,1,31879,2019-11-25 13:43:00.243,orcid,Universitas Negeri Makassar,unm.ac.id
261,3801653,Dataset,2020-04-10 21:32:10.307,2020-04-10 21:34:56.049,1,4066,2000-01-01 00:00:00,builtin,,gmail.com
298,3806075,Dataset,2020-04-14 14:03:29.458,2020-04-14 14:22:58.251,1,23385,2018-06-04 15:24:46.985,orcid,Vrije Universiteit Amsterdam,vu.nl
314,3983285,Dataset,2020-07-13 07:55:40.256,2020-07-25 10:37:29.992,1,36947,2020-07-13 07:53:46.988,builtin,Liverpool University Hospitals NHS Foundation ...,liverpoolft.nhs.uk
315,3893531,Dataset,2020-06-24 01:14:46.562,2020-06-24 01:15:34.277,1,36500,2020-06-24 01:07:05.904,orcid,Fiona Stanley Hospital,gmail.com
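
The questions about account types and `.edu` email addresses could be approached along the following lines. This is only a sketch: `sampleDF` is a made-up stand-in for `datasetsInRootDF`, and matching on domains that end in `.edu` is an assumption about what "have .edu in their email addresses" means (it would not match a domain such as `uq.edu.au`).

```python
import pandas as pd

# Made-up rows standing in for datasetsInRootDF (values are illustrative only)
sampleDF = pd.DataFrame({
    'user_account_id_of_deposit': [101, 102, 103, 104, 105],
    'user_account_type': ['shib', 'orcid', 'builtin', 'google', 'builtin'],
    'account_email_type': ['harvard.edu', 'gmail.com', 'jhu.edu', 'gmail.com', 'vu.nl'],
})

# Number of datasets per account type
accountTypeCounts = sampleDF['user_account_type'].value_counts()
print(accountTypeCounts)

# Among datasets from non-shib accounts, flag email domains ending in .edu
nonShibDF = sampleDF[sampleDF['user_account_type'] != 'shib']
hasEduEmail = nonShibDF['account_email_type'].str.endswith('.edu')
print(hasEduEmail.sum(), 'of', len(nonShibDF), 'non-shib datasets have a .edu email domain')
```

These counts are per dataset. To count users instead, the rows could first be reduced with `drop_duplicates('user_account_id_of_deposit')`.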


In [43]:
print(datasetsInRootDF['deposit_createdate'].min())
print(datasetsInRootDF['deposit_createdate'].max())

2018-11-01 12:44:01.87
2020-10-31 20:53:47.3
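
The date filter and the minimum and maximum above operate on strings. That works here because the timestamps are zero-padded and ordered from year to second, so alphabetical order matches chronological order. A sketch of the same check on parsed timestamps, using made-up values in the same format:

```python
import pandas as pd

# Made-up timestamps in the same format as deposit_createdate
created = pd.Series([
    '2018-10-30 09:00:00.1',
    '2018-11-01 12:44:01.87',
    '2020-10-31 20:53:47.3',
])

# Parse to datetime64 so the comparison, min() and max() use real timestamps
createdDT = pd.to_datetime(created)
recent = createdDT[createdDT > pd.Timestamp('2018-10-31')]
print(recent.min())
print(recent.max())
```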


In [44]:
print(datasetsInRootDF.shape)

(3109, 10)
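
One possible way to approach the question about people who publish more than one dataset in Root is to count datasets per user account. In this sketch, `sampleDF` is a made-up stand-in for `datasetsInRootDF` and keeps only the two columns it needs.

```python
import pandas as pd

# Made-up rows standing in for datasetsInRootDF (values are illustrative only)
sampleDF = pd.DataFrame({
    'deposit_id': [1, 2, 3, 4, 5, 6],
    'user_account_id_of_deposit': [101, 101, 102, 103, 103, 103],
})

# Number of distinct datasets per user account
datasetsPerUser = sampleDF.groupby('user_account_id_of_deposit')['deposit_id'].nunique()

# User accounts with more than one dataset in Root
repeatUsers = datasetsPerUser[datasetsPerUser > 1]
print(len(repeatUsers), 'of', len(datasetsPerUser), 'users published more than one dataset')

# Distribution: how many users published 1, 2, 3, ... datasets
print(datasetsPerUser.value_counts().sort_index())
```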
