# Harvard Dataverse accounts that are used to deposit into the "Root" dataverse

## Purpose

To prevent people from publishing spam in Harvard Dataverse, repository staff are considering publishing workflows that prevent new user accounts from being used to deposit into the "Root" dataverse until repository staff can verify that account creators won't deposit spam.

To help estimate the number of "Root"-depositor accounts that staff may have to vet each day, this notebook analyzes two years of user account data to answer two questions:

1. On average, how many of the Harvard Dataverse accounts created each day are later used to deposit one or more datasets or dataverses in Harvard Dataverse?
2. Which institutional Shibboleth systems are used most often to create Harvard Dataverse user accounts that are later used to deposit datasets or dataverses in Harvard Dataverse?

## Getting the data

The "raw" data is extracted from Harvard Dataverse's PostgreSQL database with the following query and exported as [useraccount_info_of_root_depositors.csv](https://github.com/jggautier/ux-research/blob/main/notebooks/hdv_root_depositors/useraccount_info_of_root_depositors.csv):
```sql
select distinct on(authenticateduser.id)
	authenticateduserlookup.authenticationproviderid,
	authenticateduser.affiliation,
	authenticateduser.createdtime as account_create_date
from authenticateduser
join dvobject on dvobject.creator_id = authenticateduser.id
join authenticateduserlookup ON authenticateduserlookup.authenticateduser_id = authenticateduser.id
where dvobject.owner_id = 1
and authenticateduser.createdtime > '2018-09-30'
and authenticateduser.createdtime <= '2020-09-30'
```

The file useraccount_info_of_root_depositors.csv contains information about Harvard Dataverse user accounts that were created during a two-year period (October 1, 2018 - September 30, 2020) and that have deposited a dataverse or dataset directly in Harvard Dataverse's "Root" dataverse, whose database ID is 1. Email addresses are excluded because they aren't needed for this analysis.
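
The stated window and the query's bounds can differ slightly if `createdtime` stores timestamps rather than plain dates (an assumption here; the column type isn't shown in this notebook). In that case `> '2018-09-30'` also matches accounts created during September 30, 2018, and `<= '2020-09-30'` drops accounts created after midnight on September 30, 2020. The sketch below uses made-up timestamps to compare the query-style bounds with a half-open interval that matches the stated window exactly:

```python
import pandas as pd

# Hypothetical timestamps, chosen to sit on both edges of the window
created = pd.Series(pd.to_datetime([
    '2018-09-30 15:00',  # day before the window, but later than midnight on Sept. 30
    '2018-10-01 00:00',  # first instant of the window
    '2020-09-30 15:00',  # last day of the window
    '2020-10-01 00:00',  # first instant after the window
]))

# Bounds as written in the query: > '2018-09-30' and <= '2020-09-30'
query_style = (created > pd.Timestamp('2018-09-30')) & (created <= pd.Timestamp('2020-09-30'))

# Half-open interval covering October 1, 2018 through September 30, 2020
start, end = pd.Timestamp('2018-10-01'), pd.Timestamp('2020-10-01')
half_open = (created >= start) & (created < end)

print(query_style.tolist())
print(half_open.tolist())
```

If the column does hold timestamps, the difference affects at most two days out of roughly 730, so the daily averages below would barely move.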

## Missing information

The repository's database doesn't retain the information needed to figure out which user accounts are used to deposit spam datasets and dataverses. Since this investigation doesn't include user accounts that are known to create spam, the investigation's results could be thought of as conservative estimates of the number of user accounts created each day that deposit into the repository's "Root" dataverse.

## Get average number of shib and non-shib "Root"-depositor accounts created each day

### Import modules and read in and clean data

In [4]:
# Import modules
import datetime as dt
from functools import reduce
import pandas as pd

# Read data into dataframe
accountInfo = pd.read_csv('useraccount_info_of_root_depositors.csv')

# Make sure values in account_create_date column are interpreted as dates
accountInfo['account_create_date'] = pd.to_datetime(accountInfo['account_create_date'])


### Copy and transform data to get daily averages

In [9]:
def get_daily_avgs(df):
    # In the authenticationproviderid column, rename non-shib login options to "nonshib"
    df = df.replace(
    {'authenticationproviderid': {
        'builtin': 'nonshib',
        'github': 'nonshib',
        'orcid': 'nonshib',
        'google': 'nonshib'}})
    
    # Use account_create_date column to add "day of week" column
    df['day_of_week'] = df['account_create_date'].dt.day_name()

    # Drop account_create_date and affiliation columns
    df = df.drop(columns=['account_create_date', 'affiliation'])

    # Get count of shib and nonshib accounts for each day of the week
    df = df.groupby(
        ['day_of_week', 'authenticationproviderid']).agg(
        {'authenticationproviderid': ['count']})
    df = df.unstack()

    # Combine multi-indexed columns
    df.columns = ['_'.join(col).strip() for col in df.columns.values]

    # Rename columns
    df.columns = ['nonshib_accounts_created', 'shib_accounts_created']

    # Reorder rows by day of week
    dayOfWeek = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
    df = df.reindex(dayOfWeek)

    # Replace count of accounts created with daily average over two years (or 104 of each day, ignoring leap year effects)
    df = df.apply(
        lambda x: x/104 if x.name in ['nonshib_accounts_created', 'shib_accounts_created'] else x)

    # Round averages to two decimal places
    df = df.round(2)

    # Rename columns
    df.columns = ['nonshib_average', 'shib_average']
    
    return df

averageDailyCount = get_daily_avgs(accountInfo)
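
The divisor of 104 is an approximation. As a rough check, the sketch below counts how often each weekday actually occurs between October 1, 2018 and September 30, 2020, assuming both endpoints are included. It is not part of the original analysis, and the variable names are illustrative:

```python
import pandas as pd

# Every calendar day in the study window (both endpoints included)
days = pd.date_range('2018-10-01', '2020-09-30', freq='D')

# How many times each weekday occurs in the window
weekday_occurrences = days.day_name().value_counts()

print(len(days))
print(weekday_occurrences)
```

The window spans 731 days because it includes February 29, 2020, so three weekdays occur 105 times and the other four occur 104 times. Dividing by 105 instead of 104 lowers an average by roughly 1%, which is small enough that the rounded figures remain a reasonable estimate.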


### Results

In [10]:
# Show data (with styled table title)
averageDailyCount.style.set_caption(
    'Average number of "Root"-depositor accounts created in Harvard Dataverse each day').set_table_styles([{
    'selector': 'caption',
    'props': [
        ('color', 'black'),
        ('font-size', '16px')]}])

day_of_week,nonshib_average,shib_average
Sunday,2.21,0.39
Monday,4.45,1.08
Tuesday,4.27,1.13
Wednesday,4.47,1.06
Thursday,4.02,1.3
Friday,3.7,0.89
Saturday,1.8,0.41


- Between 4 and 6 "Root"-depositor accounts are created each weekday, and between 2 and 3 are created each weekend day.
- About a fifth of the accounts created each day are Shib accounts.
- Because the data doesn't include "Root"-depositor accounts that are known to deposit spam, this might be a conservative estimate of the number of user accounts that staff will have to verify each day.
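
The totals and the Shib share in the bullets above can be reproduced from the results table. The sketch below re-enters the averages by hand; the `total_average` and `shib_share` columns are added only for illustration:

```python
import pandas as pd

# Daily averages copied from the results table above
averages = pd.DataFrame(
    {'nonshib_average': [2.21, 4.45, 4.27, 4.47, 4.02, 3.7, 1.8],
     'shib_average': [0.39, 1.08, 1.13, 1.06, 1.3, 0.89, 0.41]},
    index=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

# Total accounts per day and the share of them created through Shibboleth
averages['total_average'] = averages['nonshib_average'] + averages['shib_average']
averages['shib_share'] = (averages['shib_average'] / averages['total_average']).round(2)

# Share of Shib accounts across the whole week
overall_shib_share = averages['shib_average'].sum() / averages['total_average'].sum()

print(averages)
print(round(overall_shib_share, 2))
```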

## Find the institutional shib providers that are used most often to create "Root"-depositor accounts

Staff are considering creating [group role assignments](http://guides.dataverse.org/en/latest/user/dataverse-management.html?highlight=groups#roles-permissions) to let accounts that are created using certain institutional shib providers, such as Harvard University's HarvardKey, deposit datasets and dataverses into "Root" without staff vetting. This should streamline data publishing for the users who are least likely to deposit spam.

To help decide which institutional shib providers should automatically be given permission to deposit in "Root", let's see which institutional shib providers are used most often to create "Root"-depositor accounts.

### Copy and transform data to get counts by institutional shib provider

In [2]:
def get_shib_counts(df):
    # Keep only "shib" accounts (i.e. rows that have "shib" in the authenticationproviderid column)
    df = df[df.authenticationproviderid == 'shib']

    # Drop account_create_date and authenticationproviderid columns
    df = df.drop(columns=['account_create_date', 'authenticationproviderid'])

    # Get count of shib accounts by institution
    df = df.groupby(['affiliation']).agg({'affiliation': ['count']})

    # Combine multi-indexed columns
    df.columns = ['_'.join(col).strip() for col in df.columns.values]

    # Add index column
    df.reset_index(drop=False, inplace=True)

    # Rename columns
    df.columns = ['shib_institution', 'number_of_accounts']

    # Sort dataframe by number_of_accounts, then shib_institution
    df.sort_values(by=['number_of_accounts', 'shib_institution'], inplace=True, ascending=[False, True])

    # Reset index column
    df.reset_index(drop=True, inplace=True)

    return df

shibInstitutionCounts = get_shib_counts(accountInfo)
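
To show the shape of the output on a small scale, here is a minimal sketch that runs a similar transformation on hypothetical rows. The helper `count_shib_affiliations` and its `cumulative_share` column are illustrative additions, not part of the notebook. A cumulative share is one possible way to see how much of the Shib traffic a short list of providers would cover:

```python
import pandas as pd

# Hypothetical rows, not taken from the real CSV
toy = pd.DataFrame({
    'authenticationproviderid': ['shib', 'shib', 'shib', 'shib', 'builtin', 'orcid'],
    'affiliation': ['University B', 'University A', 'University B',
                    'University C', 'University A', None],
})

def count_shib_affiliations(df):
    """Count shib accounts per affiliation and add a cumulative share column."""
    shib = df[df['authenticationproviderid'] == 'shib']
    counts = (
        shib.groupby('affiliation').size()
        .reset_index(name='number_of_accounts')
        .rename(columns={'affiliation': 'shib_institution'})
        .sort_values(by=['number_of_accounts', 'shib_institution'],
                     ascending=[False, True])
        .reset_index(drop=True))
    # Fraction of all shib accounts covered by this row and the rows above it
    counts['cumulative_share'] = (
        counts['number_of_accounts'].cumsum() / counts['number_of_accounts'].sum())
    return counts

toy_counts = count_shib_affiliations(toy)
print(toy_counts)
```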


### Results

In [3]:
# Show top five institutional shib providers used
shibInstitutionCounts.head(5)


index,shib_institution,number_of_accounts
0,Harvard University,144
1,Massachusetts Institute of Technology,20
2,Stanford University,16
3,University of Toronto,13
4,Michigan State University,10


- It's no surprise that most "Root"-depositor accounts created through Shibboleth used Harvard University's HarvardKey or MIT Touchstone.
- Run the cell below to see counts for the rest of the institutional shib providers used to create "Root"-depositor accounts.

In [4]:
# Show counts for all institutional shib providers used to create "Root"-depositor accounts
shibInstitutionCounts.style.set_table_styles(
    [dict(selector='th',
          props=[('text-align', 'left')])]).set_properties(
    subset=['shib_institution'], **{'text-align': 'left'})

index,shib_institution,number_of_accounts
0,Harvard University,144
1,Massachusetts Institute of Technology,20
2,Stanford University,16
3,University of Toronto,13
4,Michigan State University,10
5,New York University,10
6,University of Washington,10
7,Boston University,9
8,Columbia University,9
9,ETH Zürich,9
