## Goals
In the Harvard Dataverse Repository, estimate the number of dataset authors who have some affiliation with Harvard University. Consider only the authors of published datasets (and not draft or deaccessioned datasets).

The author metadata fields are made up of four child fields: author name, author affiliation, author identifier scheme, and author identifier. The author name field must be filled, while the other three fields are technically optional.

Depositors must create an account in order to create and publish datasets, and they can enter anything they like in the author fields. Depositors can deposit datasets on behalf of dataset authors and do not need to be one of the authors.

To estimate the number of distinct dataset authors, we'll look at what dataset depositors enter in the author metadata fields. Some datasets have no author metadata; in those cases we'll try to determine the affiliations of the depositors.

We'll finish with a review of the limitations of this method of estimating the number of Harvard-affiliated dataset authors.

## Import Python libraries

In [24]:
#from functools import reduce
#import numpy as np
import os
import pandas as pd
#import re

## Import and check data
We'll be using data exported from the repository's database, saved in CSV files. Database queries for these files are in the same directory as the CSV files.

In [4]:
os.chdir('data_and_queries/data_in_multiple_tables')

In [5]:
authorsDF = pd.read_csv('dataset_authors.csv', dtype={'author_identifier': str}, na_filter = False)
authorsDF.head()

Unnamed: 0,dataset_doi_url,author_name,author_affiliation,author_identifier_scheme,author_identifier
0,https://doi.org/10.7910/DVN/0UTWWV,"Altman, Dan",Georgia State University,,
1,https://doi.org/10.7910/DVN/ZE14ST,"Moch, Jonathan",Harvard University,,
2,https://doi.org/10.7910/DVN/ZZ2LBN,Addison Kwame Thomas,"Kwame Nkrumah University of Science and Technology Faculty of Biosciences, Kumasi, Ghana",ORCID,0000-0003-2129-061X
3,https://doi.org/10.7910/DVN/OE1NW0,"Stager, Lawrence E.",Harvard University,,
4,https://doi.org/10.7910/DVN/STDSFL,"Stager, Lawrence E.",Harvard University,,


## Inspect the data
First, let's try to get a count of distinct dataset authors, whether or not they are affiliated with Harvard.

Let's deduplicate the authorsDF dataframe using the information in the four author fields. Each row of the deduplicated dataframe should represent a distinct author. We'll have to remove the dataset_doi_url column first, since one person can be an author of multiple datasets. And we'll have to remove the placeholder rows for datasets that have no author metadata.

In [6]:
# Remove dataset_doi_url column from authorsDF and deduplicate rows using values in the remaining columns
authorsDeduplicatedDF = authorsDF.drop(columns=['dataset_doi_url']).drop_duplicates().reset_index(drop=True)

# For datasets that have no author name, the database has the string "N/A" in the author name field.
# Let's remove that "N/A" author from the authorsDeduplicatedDF.
authorsDeduplicatedDF = (
    authorsDeduplicatedDF
        .query(
            'author_name != "N/A"'
    )
)

# Print count of rows in authorsDeduplicatedDF
print(f'Number of distinct authors: {len(authorsDeduplicatedDF):,}')

Number of distinct authors: 47,551
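To see why the dataset_doi_url column has to go before deduplicating, here is a minimal sketch on made-up data. The toy DOIs and the `toyDF` and `toyDeduplicatedDF` names are assumptions for illustration and are not part of the exported data.

```python
import pandas as pd

# Toy data (made up for illustration): one author on two datasets,
# plus one dataset with the "N/A" placeholder author
toyDF = pd.DataFrame({
    'dataset_doi_url': ['doi:A', 'doi:B', 'doi:C'],
    'author_name': ['Stager, Lawrence E.', 'Stager, Lawrence E.', 'N/A'],
    'author_affiliation': ['Harvard University', 'Harvard University', ''],
})

# With the DOI column present, every row is unique, so nothing is dropped
rowsWithDoi = len(toyDF.drop_duplicates())

# Without it, the two identical author rows collapse into one,
# and the query removes the "N/A" placeholder
toyDeduplicatedDF = (
    toyDF
        .drop(columns=['dataset_doi_url'])
        .drop_duplicates()
        .query('author_name != "N/A"')
        .reset_index(drop=True)
)

print(rowsWithDoi, len(toyDeduplicatedDF))  # prints: 3 1
```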


Of those distinct authors, let's create a new dataframe that contains only authors with an affiliation from Harvard (including schools, departments and other groups within Harvard). We'll use lists and values that I've found useful during earlier reviews of the repository's datasets.

In [7]:
# Create list of short affiliation values that indicate a Harvard affiliation only when they match the whole field exactly
harvardAffiliationExact = ['CfA', 'HKS', 'SAO']

# Create new dataframe that contains only authors with affiliation metadata that indicates a Harvard affiliation
harvardAffiliatedAuthorsDF = (
    authorsDeduplicatedDF
        .query(
            'author_affiliation.str.contains("Harvard") or\
            author_affiliation.str.contains("Berkman") or\
            author_affiliation.str.contains("Center for Geographic Analysis") or\
            author_affiliation.str.contains("Center for the History of Medicine") or\
            author_affiliation.str.contains("Francis A. Countway Library of Medicine") or\
            author_affiliation.str.contains("HMDC") or\
            author_affiliation.str.contains("Institute for Quantitative Social Science") or\
            author_affiliation.str.contains("IQSS") or\
            author_affiliation.str.contains("Smithsonian Astrophysical Observatory") or\
            author_affiliation.str.contains("Social Science One") or\
            author_affiliation.str.contains("T.H. Chan School of Public Health") or\
            author_affiliation.str.contains("Murray Research Archive") or\
            author_affiliation.str.contains("Radcliffe College") or\
            author_affiliation in @harvardAffiliationExact'
    )
)

# Print count of rows in the new dataframe
print(f'Number of distinct Harvard-affiliated authors: {len(harvardAffiliatedAuthorsDF):,}')

Number of distinct Harvard-affiliated authors: 2,152


2,152 is about 4.5 percent of 47,551. Maybe that's where Danny's 5 percent figure came from.
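As a quick check of that percentage, the two counts printed above can be divided directly. The variable names here are made up for the example.

```python
# Counts copied from the outputs above
distinctAuthors = 47551
harvardAffiliatedAuthors = 2152

# Share of distinct authors with a Harvard affiliation
share = harvardAffiliatedAuthors / distinctAuthors
print(f'{share:.1%}')  # prints: 4.5%
```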

Let's look at the datasets that have no author metadata. These were published before a Dataverse software update made the author metadata mandatory. We can see which accounts were used to deposit those datasets by joining in information from the dataset_depositors.csv file.

In [8]:
# Create dataframe that contains only the datasets that have no author metadata
noAuthorsDF = (
    authorsDF
        .query(
            'author_name == "N/A"'
    )
)

# Print count of rows in the new dataframe
print(f'Number of datasets with no author metadata: {len(noAuthorsDF):,}')

Number of datasets with no author metadata: 4,146


In [9]:
# Read in dataset_depositors.csv as a new dataframe
depositorsDF = pd.read_csv('dataset_depositors.csv')

# Join the noAuthorsDF with the depositorsDF
datasetsWithNoAuthorsDF = pd.merge(
    noAuthorsDF,
    depositorsDF,
    how='inner',
    on='dataset_doi_url')

# Drop the author metadata columns since they don't contain information
datasetsWithNoAuthorsDF = datasetsWithNoAuthorsDF.drop(columns=[
    'author_name', 'author_affiliation', 'author_identifier_scheme', 'author_identifier'])

# Make sure the join worked by counting the number of rows of the new dataframe, which should be the same as noAuthorsDF's row count
print(f'Number of rows in joined datasetsWithNoAuthorsDF dataframe: {len(datasetsWithNoAuthorsDF):,}')

Number of rows in joined datasetsWithNoAuthorsDF dataframe: 4,146
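Matching row counts after an inner join is a useful check, but the counts could also match by coincidence if some datasets had no depositor row while others matched more than once. A stricter check, sketched here on made-up stand-ins for noAuthorsDF and depositorsDF, uses the `indicator` and `validate` parameters of `pd.merge`. It assumes each dataset has at most one depositor row, which the exported data may or may not guarantee.

```python
import pandas as pd

# Toy stand-ins (made up) for noAuthorsDF and depositorsDF
toyNoAuthorsDF = pd.DataFrame({'dataset_doi_url': ['doi:A', 'doi:B', 'doi:C']})
toyDepositorsDF = pd.DataFrame({
    'dataset_doi_url': ['doi:A', 'doi:B'],
    'depositor_id': [1, 5291],
})

# A left join keeps every dataset. indicator=True adds a _merge column that
# records where each row came from, and validate raises a MergeError if a
# DOI appears more than once among the depositors.
checkedDF = pd.merge(
    toyNoAuthorsDF,
    toyDepositorsDF,
    how='left',
    on='dataset_doi_url',
    indicator=True,
    validate='many_to_one')

# Rows marked "left_only" are datasets with no matching depositor
unmatchedDF = checkedDF[checkedDF['_merge'] == 'left_only']
print(f'Datasets with no depositor row: {len(unmatchedDF):,}')  # prints 1 for the toy data
```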


Let's preview this dataframe:

In [10]:
datasetsWithNoAuthorsDF.head()

Unnamed: 0,dataset_doi_url,depositor_account_type,depositor_id,depositor_firstname,depositor_lastname,depositor_affiliation,depositor_email
0,https://doi.org/10.7910/DVN/M0C8T2,builtin,1,Dataverse,Admin,Dataverse.org,dataverseadmin@iq.harvard.edu
1,https://doi.org/10.7910/DVN/N619OT,builtin,1,Dataverse,Admin,Dataverse.org,dataverseadmin@iq.harvard.edu
2,https://doi.org/10.7910/DVN/28195,builtin,5291,Alvaro,Lima,,alvaroelima@gmail.com
3,https://doi.org/10.7910/DVN/22SIWX,builtin,1,Dataverse,Admin,Dataverse.org,dataverseadmin@iq.harvard.edu
4,https://doi.org/10.7910/DVN/1IJEWF,builtin,1,Dataverse,Admin,Dataverse.org,dataverseadmin@iq.harvard.edu


We need to find the accounts that are affiliated with Harvard:
- Check if the depositor_email field contains 'harvard'
- Check if the depositor_affiliation field contains 'Harvard' or the name of any school, department or other group within Harvard. We can reuse the query we used earlier to find Harvard-affiliated authors.

We'll also remove the dataset_doi_url column and deduplicate the dataframe so it contains distinct, Harvard-affiliated depositors.

In [11]:
# Create new dataframe that contains only depositors whose email address or affiliation metadata indicates a Harvard affiliation
depositorsOfDatasetsWithNoAuthorsDF = (
    datasetsWithNoAuthorsDF
        .query(
            'depositor_email.str.contains("harvard") or\
            depositor_affiliation.str.contains("Harvard") or\
            depositor_affiliation.str.contains("Berkman") or\
            depositor_affiliation.str.contains("Center for Geographic Analysis") or\
            depositor_affiliation.str.contains("Center for the History of Medicine") or\
            depositor_affiliation.str.contains("Francis A. Countway Library of Medicine") or\
            depositor_affiliation.str.contains("HMDC") or\
            depositor_affiliation.str.contains("Institute for Quantitative Social Science") or\
            depositor_affiliation.str.contains("IQSS") or\
            depositor_affiliation.str.contains("Smithsonian Astrophysical Observatory") or\
            depositor_affiliation.str.contains("Social Science One") or\
            depositor_affiliation.str.contains("T.H. Chan School of Public Health") or\
            depositor_affiliation.str.contains("Murray Research Archive") or\
            depositor_affiliation.str.contains("Radcliffe College") or\
            depositor_affiliation in @harvardAffiliationExact'
    )
)

# Remove dataset_doi_url column from depositorsOfDatasetsWithNoAuthorsDF and deduplicate rows using values in the remaining columns
depositorsOfDatasetsWithNoAuthorsDF = depositorsOfDatasetsWithNoAuthorsDF.drop(columns=['dataset_doi_url']).drop_duplicates().reset_index(drop=True)

# Print count of rows in the new dataframe
print(f'Number of Harvard-affiliated depositors of datasets with no author metadata: {len(depositorsOfDatasetsWithNoAuthorsDF):,}')

Number of Harvard-affiliated depositors of datasets with no author metadata: 36


In [12]:
# Preview new dataframe
depositorsOfDatasetsWithNoAuthorsDF.head()

Unnamed: 0,depositor_account_type,depositor_id,depositor_firstname,depositor_lastname,depositor_affiliation,depositor_email
0,builtin,1,Dataverse,Admin,Dataverse.org,dataverseadmin@iq.harvard.edu
1,builtin,770,Marie,McCormick,Harvard School of Public Health,mmccormi@hsph.harvard.edu
2,builtin,148,Meghan,Dolan,Harvard,dolan2@fas.harvard.edu
3,shib-harvardkey,172,Diane,Sredl,Harvard University,sredl@fas.harvard.edu
4,builtin,7413,Zhanina,Boyadzhieva,,zlboyadz@gsd.harvard.edu
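The affiliation test is written out twice above, once for authors and once for depositors. One way to keep the two in sync is a small helper that builds the mask from a single list of terms. This is only a sketch: `is_harvard_affiliation` and `harvardAffiliationContains` are hypothetical names, and unlike the original queries it escapes the terms, so a "." matches only a literal period, and it treats missing affiliations as empty strings.

```python
import re
import pandas as pd

# Substrings and exact values copied from the two queries above
harvardAffiliationContains = [
    'Harvard', 'Berkman', 'Center for Geographic Analysis',
    'Center for the History of Medicine',
    'Francis A. Countway Library of Medicine', 'HMDC',
    'Institute for Quantitative Social Science', 'IQSS',
    'Smithsonian Astrophysical Observatory', 'Social Science One',
    'T.H. Chan School of Public Health', 'Murray Research Archive',
    'Radcliffe College']
harvardAffiliationExact = ['CfA', 'HKS', 'SAO']

def is_harvard_affiliation(affiliations):
    """Hypothetical helper: boolean mask for a Series of affiliation strings."""
    # Escape each term so characters like "." are matched literally
    pattern = '|'.join(re.escape(term) for term in harvardAffiliationContains)
    # Treat missing affiliations as empty strings
    affiliations = affiliations.fillna('')
    return (
        affiliations.str.contains(pattern, regex=True)
        | affiliations.isin(harvardAffiliationExact)
    )

toyAffiliations = pd.Series(['Harvard Medical School', 'HKS', 'MIT', None, 'IQSS, Cambridge'])
print(is_harvard_affiliation(toyAffiliations).tolist())
# prints: [True, True, False, False, True]
```

With a helper like this, the author filter would become `authorsDeduplicatedDF[is_harvard_affiliation(authorsDeduplicatedDF['author_affiliation'])]`, and the depositor filter would combine the same mask with the email check.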


## Limitations


This estimation doesn't account for:
 - author names that are the same but are really different people
 - the same author having variations in the spellings or orders of their name or affiliation or typos in their identifier (like a mistyped ORCID ID)

We can see an example of this in the harvardAffiliatedAuthorsDF dataframe. Let's find the rows that have the same names but different values in other author fields:

In [40]:
# From the harvardAffiliatedAuthorsDF, create new dataframe with rows where the values in the author_name field are the same
duplicateNamesDF = (
    harvardAffiliatedAuthorsDF[harvardAffiliatedAuthorsDF
        .duplicated(['author_name'], keep=False)]
        .sort_values('author_name', ascending=True)
        .reset_index(drop=True)
)

# Preview duplicateNamesDF
duplicateNamesDF.head(10)

Unnamed: 0,author_name,author_affiliation,author_identifier_scheme,author_identifier
0,Ahmed Fahmy,Harvard Medical School,,
1,Ahmed Fahmy,Harvard University,,
2,"Ansolabehere, Stephen",Harvard,,
3,"Ansolabehere, Stephen",Harvard University,ORCID,0000-0001-5240-9084
4,"Ansolabehere, Stephen",Harvard University,,
5,"B. Baleato, Suso","Harvard University, University of Santiago de Compostela",,
6,"B. Baleato, Suso","Harvard University, Universidade de Santiago de Compostela",,
7,"Baker, Caitlin M.",Harvard University,ORCID,0000-0002-9782-4959
8,"Baker, Caitlin M.",Harvard University,ORCID,https://orcid.org/0000-0002-9782-4959
9,"Balan, Pablo",Harvard,,


It's likely that these 10 rows represent at most 5 authors, but the deduplication isn't accounting for the differently spelled affiliations and various ORCID ID formatting.
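Some of this variation could be reduced by normalizing values before deduplicating. The sketch below uses only the standard library. `normalize_orcid` and `normalize_text` are hypothetical helpers, and the two rows are copied from the preview above.

```python
import re

def normalize_orcid(identifier):
    """Hypothetical helper: reduce an ORCID to its bare 0000-0000-0000-0000 form."""
    match = re.search(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]', identifier.strip(), re.IGNORECASE)
    # Leave anything that doesn't look like an ORCID untouched
    return match.group(0).upper() if match else identifier

def normalize_text(value):
    """Hypothetical helper: collapse runs of whitespace and ignore case."""
    return ' '.join(value.split()).casefold()

# Two rows from the preview above that differ only in ORCID formatting
rows = [
    ('Baker, Caitlin M.', 'Harvard University', '0000-0002-9782-4959'),
    ('Baker, Caitlin M.', 'Harvard University', 'https://orcid.org/0000-0002-9782-4959'),
]

# After normalizing, both rows produce the same key
distinct = {(normalize_text(n), normalize_text(a), normalize_orcid(i)) for n, a, i in rows}
print(len(distinct))  # prints: 1
```

In the notebook, helpers like these could be applied to the author columns with `Series.map` before calling `drop_duplicates`. They would not reconcile "Harvard" with "Harvard University" or a row that has an ORCID with one that doesn't; that would need a mapping of affiliation variants or some manual review.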

The same is true when considering the affiliations of depositors.

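These variations mean the counts above likely overstate the number of distinct people wherever one person appears under several spellings. In the other direction, different people who share a name and have identical values in the other fields are counted only once.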

Accounting for these variations would improve the accuracy of the estimate and would take