## Zenodo Metadata Initial Bulk Extraction

### What does this code do?

This code searches Zenodo through the Zenodo API for UAB-affiliated datasets. It then downloads the metadata records and converts them into a format suitable for a Digital Commons bulk upload, with the appropriate headers. The output is saved as an .xlsx file, which must be manually resaved as an .xls file for the actual upload. The file also requires manual curation before it is ready for upload, with special attention to:

- Cleaning up affiliations, for example:
  - Unifying names for a single institution: University of Alabama - Birmingham, UNIVERSITY OF ALABAMA AT BIRMINGHAM &rarr; University of Alabama at Birmingham
  - Un-abbreviating institution names: NYU &rarr; New York University
  - Removing extraneous location details: California Digital Library, Oakland, United States of America &rarr; California Digital Library
- Checking for names and keywords written in all caps
- Checking for special characters or accents that are not formatted properly
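
Some of these checks can be partly automated before the manual pass. The sketch below is one possible approach and is not part of the original workflow: `AFFILIATION_MAP`, `normalize_affiliation`, and `is_all_caps` are hypothetical names, and the mapping would need to be extended by hand as new variants turn up.

```python
import re

# Hypothetical lookup of known variants (lowercased) -> preferred form.
AFFILIATION_MAP = {
    "university of alabama - birmingham": "University of Alabama at Birmingham",
    "university of alabama at birmingham": "University of Alabama at Birmingham",
    "nyu": "New York University",
}

def normalize_affiliation(affiliation):
    """Return the preferred form of a known affiliation, else the input unchanged."""
    # Collapse repeated whitespace and ignore case when looking up the variant
    key = re.sub(r"\s+", " ", affiliation).strip().lower()
    return AFFILIATION_MAP.get(key, affiliation)

def is_all_caps(text):
    """True if the text contains letters and every letter is uppercase."""
    return text.isupper()
```

Unknown affiliations pass through untouched, so anything the mapping misses still reaches the manual review step.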

### What datasets are included?

We want to find datasets where at least one author is affiliated with the University of Alabama at Birmingham. Since Zenodo does not use or require ROR IDs, the best way to find these datasets is by searching the creator affiliation field for the query string "university AND alabama AND birmingham". Since [Zenodo hosts copies of Dryad datasets](https://blog.zenodo.org/2020/03/10/dryad-and-zenodo-our-path-ahead/), this search will locate datasets in both Zenodo and Dryad. As of 11/14/2024, there are 128 total datasets, 41 of which are from Dryad. The Dryad datasets belong to the Zenodo community "dryad", allowing us to isolate them if needed.
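
As a rough illustration of isolating the Dryad records, the sketch below splits a list of records by community. It assumes each record stores its communities under `metadata.communities` as a list of dicts with an `id` key, which is the shape the `add_repo` helper in this notebook relies on. The helper name `in_community` and the sample records are made up for the example.

```python
def in_community(record, community_id):
    """True if a Zenodo record dict lists the given community id."""
    # Records without any community have no 'communities' key at all
    communities = record.get("metadata", {}).get("communities") or []
    return any(c.get("id") == community_id for c in communities)

# Made-up sample records, for illustration only
sample_hits = [
    {"id": 1, "metadata": {"communities": [{"id": "dryad"}]}},
    {"id": 2, "metadata": {"communities": [{"id": "some-other-community"}]}},
    {"id": 3, "metadata": {}},
]

dryad_hits = [r for r in sample_hits if in_community(r, "dryad")]
zenodo_only = [r for r in sample_hits if not in_community(r, "dryad")]
```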

### Import the data as a json

The code below uses an access token (from Claire Warner; you can replace it with your own) and the search query `creators.affiliation:(+university +alabama +birmingham)`. The `size` parameter sets the maximum number of results returned. There are currently 128 results, so it is set to 200 to leave headroom in case new datasets appear.

We are left with `records`, the API response in json format.

In [None]:
import requests

ACCESS_TOKEN = '5lRvVDSnCTTXgdFWLuCN7HLAK2UWKjUbJwPCEiWJxirzVT3VfLsAeHnhflmt'
search_query = 'creators.affiliation:(+university +alabama +birmingham)'

response = requests.get('https://zenodo.org/api/records/',
                        params={'q': search_query,
                                'access_token': ACCESS_TOKEN,
                                'size': 200, # should be 128, add headroom
                                'type' : 'dataset'
                                })
response.raise_for_status()  # Stop early on an HTTP error instead of parsing an error response

records = response.json()
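
Because `size` caps the number of results, it can be worth confirming that nothing was cut off. The sketch below is a minimal check, assuming the response reports the total number of matches as an integer at `hits.total` alongside the list at `hits.hits`; verify that against the live response before relying on it. The function name `check_complete` and the sample response are hypothetical.

```python
def check_complete(records):
    """Compare the number of returned hits with the reported total.

    Returns (returned, total). Assumes records['hits']['total'] is an integer.
    """
    returned = len(records["hits"]["hits"])
    total = records["hits"]["total"]
    if returned < total:
        print(f"Warning: only {returned} of {total} records were returned; "
              "raise `size` or page through the results.")
    return returned, total

# Made-up response, for illustration only
fake_records = {"hits": {"total": 3, "hits": [{"id": 1}, {"id": 2}]}}
returned, total = check_complete(fake_records)
```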

### Generate the necessary dataframe

The data from the API request comes in json format. We want to convert it to a pandas dataframe so that we can work with it more easily. This is done using the [`pd.json_normalize()`](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html) function. This results in a dataframe with a *lot* of columns. Some of these columns are irrelevant to our purposes (ex: view and download statistics) so we then remove them using the column names and the `drop()` function. 

We then save this dataframe in csv format. This allows us to easily view the metadata downloaded from Zenodo. We use the `utf-8-sig` encoding to best preserve symbols and special characters.
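
To make the flattening concrete, here is a small sketch on a made-up, heavily trimmed record (real Zenodo records carry many more fields). It shows how nested keys become dotted column names, how list values such as the creators stay as lists inside a single cell, and how a column is then dropped by name.

```python
import pandas as pd

# Made-up sample record, for illustration only
sample = [{
    "title": "Example dataset",
    "metadata": {
        "publication_date": "2024-01-01",
        "creators": [{"name": "Doe, Jane",
                      "affiliation": "University of Alabama at Birmingham"}],
    },
    "stats": {"views": 10},
}]

# Nested dictionaries are flattened into dotted column names
flat = pd.json_normalize(sample)
print(sorted(flat.columns))

# Irrelevant columns are removed by name
trimmed = flat.drop(columns=["stats.views"])
```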

In [None]:
import pandas as pd

### MAKE THE DATAFRAME

# json_normalize accepts the whole list of records and returns one row per record,
# with a unique index (concatenating row by row would give every row the index 0)
df = pd.json_normalize(records['hits']['hits'])

### DROP UNWANTED COLUMNS

unwanted_cols = ['conceptrecid', 'recid', 'revision', 'files', 'owners', 'status', 'state', 'submitted', 'metadata.title', 'metadata.resource_type.title', 
                 'metadata.resource_type.type', 'metadata.relations.version', 'links.self', 'links.doi', 'links.self_doi', 'links.self_doi_html', 
                 'links.parent', 'links.self_iiif_manifest', 'links.self_iiif_sequence', 'links.files', 'links.media_files', 'links.archive', 'links.archive_media', 
                 'links.latest', 'links.latest_html', 'links.versions', 'links.draft', 'links.reserve_doi', 'links.access_links', 'links.access_grants', 'links.access_users',
                 'links.access_request', 'links.access', 'links.communities-suggestions', 'links.requests', 'stats.downloads', 'stats.unique_downloads',
                 'stats.views', 'stats.unique_views', 'stats.version_downloads', 'stats.version_unique_downloads', 'stats.version_unique_views', 'stats.version_views'
                ]

df = df.drop(columns=unwanted_cols)

### Rename and save to csv file

df_input = df

csv_path = 'raw-data/zenodo_expanded_raw.csv'

df_input.to_csv(csv_path, index=False, encoding='utf-8-sig')

### Helper functions

We will need a series of functions to extract and reformat the information from the dataframe `df_input` and put it into new columns in the dataframe `df_output`, which will then be used to make the Digital Commons batch upload file. 

In [None]:
import csv

def csv_to_dict(file_path):
    '''Imports data from a 2-column csv file, where the first column contains dictionary keys and the second column contains the corresponding values.
    Outputs the resulting dictionary. Used with the relation_types.csv file to generate the strings used for the relation type of a related item.'''
    result_dict = {}
    with open(file_path, mode='r', newline='', encoding='utf-8') as csvfile:
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            key = row[0]  # First column as key
            value = row[1]  # Second column as value
            result_dict[key] = value
    return result_dict

### Creating the relation type dictionary
relation_dict = csv_to_dict('relation_types.csv')

def add_col(df1, col1name, df2, col2name):
    '''Copies a column (col1name) out of df1 and adds it to df2. The name of the column in df2 can be specified using the col2name variable.
    The contents of the column are not altered. Returns df2 with the new column added.'''
    extracted_col = df1[col1name]
    df2 = pd.concat([df2, extracted_col.rename(col2name)], axis=1)
    return df2

def list_to_string(lst):
    '''Takes in a list of strings ['a', 'b', 'c'] and returns a single string containing the list elements, separated by commas 'a, b, c'. '''
    if isinstance(lst, list):  # Check if the value is a list
        return ', '.join(lst)
    else:
        return ""  # Return an empty string for non-list values (e.g. missing keywords)

def url_to_html(url, link_text=None):
    '''Converts a url in string form to a html formatted string for a hyperlinked url, with an optional alternate link text.'''
    if not url:
        return ''  # Return empty string if no URL is provided
    link_text = link_text or url  # Use the URL as the link text if no text is provided
    return f'<a href="{url}">{link_text}</a>'

### Dictionary containing the Zenodo terms for licenses as keys, with the values being a list containing the license URL as well as the html formatted text corresponding to each license.
license_dict = {"mit-license" : ["https://opensource.org/license/mit", "<p>This data is available under the MIT License</p>"],
                "cc-zero" : ["https://creativecommons.org/public-domain/cc0/", "<p>This data is public domain under the CC0 1.0 License</p>"],
                "cc-by-4.0" : ["http://creativecommons.org/licenses/by/4.0/", "<p>This data is available under the CC-BY 4.0 License</p>"],
                "cc-by-2.0" : ["http://creativecommons.org/licenses/by/2.0/", "<p>This data is available under the CC-BY 2.0 License</p>"],
                "cc-by" : ["https://creativecommons.org/licenses/by/1.0/", "<p>This data is available under the CC-BY License</p>"]
}

def add_license(df1, df2):
    '''Adds licensing information columns to df2. Finds the value of the column metadata.license.id in df1
    and uses it as a key for license_dict to retrieve the url of the license and the string we want displayed in html.
    If there is no value in that row, there is no license shown. We assume this means the data is restricted.'''
    licenses = [] # List for the license url
    access = [] # List for the string/text explaining access
    for license in df1['metadata.license.id']:
        if pd.notnull(license):
            licenses.append(license_dict[license][0])
            access.append(license_dict[license][1])
        else:
            licenses.append('') # Append blank string if there is no license for restricted data
            access.append('<p>Access to this data is restricted.</p>')
    df2['distribution_license'] = licenses
    df2['access_link'] = access
    return df2

def add_repo(df1, df2):
    '''Adds the repository in which the data is stored. Defaults to Zenodo, unless the dataset is in the Dryad community.'''
    repo_list = []
    for communities in df1['metadata.communities']: # Iterates through metadata.communities in df1
        # Each cell is either a list of community dicts or NaN when no community is listed
        if isinstance(communities, list) and any(c.get('id') == 'dryad' for c in communities):
            repo_list.append('<p>Dryad</p>')
        else:
            repo_list.append('<p>Zenodo</p>')
    df2['external_rep'] = repo_list
    return df2

def separate_name(name):
    '''Separates a name into First, Middle (if applicable) and Last, returns these elements as separate strings.'''
    middle = ""
    
    # Check if the name contains a comma (indicating "last, first" or "last, first m." format)
    if ',' in name:
        parts = name.split(", ") # Split into a list: "last, first" becomes ["last", "first"]
        last = parts[0] # Last name is everything before the comma
        
        # Check if there's an element after the comma
        if len(parts) > 1:
            first_and_middle = parts[1].split() # Split whatever was after the comma with spaces
            first = first_and_middle[0] # First name will be the first part of that
            
            # Assign middle if available
            if len(first_and_middle) > 1: # If there is a second part
                middle = first_and_middle[1].replace(".", "") # Assign middle name as second part, remove . because DC will add it
        else:
            # Set first to an empty string if there's nothing after the comma
            first = ""
    else: # If the name had no comma, you can assume it is in First Last or First Middle Last format
        parts = name.split() # Split on whitespace into a list
        
        first = parts[0] # First name is first element of list
        last = parts[-1] # Last name is the last element of the list 
        
        # Check if there is a middle name/initial 
        if len(parts) > 2: 
            middle = parts[1].replace(".", "") # Remove period if it exists
    
    return first, middle, last
    
def reformat_name(name):
    '''Checks if a name string is in "Last, First" format. If it is, returns "First Last".'''
    # Check if the name contains a comma
    if ', ' in name:
        # Split on the first comma only, so names containing extra commas do not raise an error
        last, first = name.split(", ", 1)
        # Return the string in "first last" format
        return f"{first} {last}"
    else:
        # Return the name unchanged if there's no comma
        return name
    
def add_orcid(df1, df2):
    '''Finds authors with ORCIDs and lists them in html format for each dataset, along with hyperlinked urls.'''
    orcid_pairs = [] # List for lists of author/orcid pairs where each element corresponds to a different dataset 
    for index, row in df1.iterrows():
        pairs = [] # List for appending author/orcid pairs within one dataset
        for author in row['metadata.creators']: # Iterate through the list of authors in each row of the metadata.creators column
            if 'orcid' in author:  # Check if ORCID is present
                first_last_name = reformat_name(author['name'])
                pairs.append('<p>' + first_last_name + ' <a href="https://orcid.org/' + author['orcid'] + '">' +author['orcid']+ '</a></p>')
        orcid_pairs.append("".join(pairs))  # Join the name-ORCID pairs; each is already wrapped in its own <p> tag
    # Add this list as a new column in df2
    df2['orcid'] = orcid_pairs
    return df2

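def to_html(string):
    '''Takes in a string. If the string is not in html already (assume first character is <) wrap it in <p> ... </p>.
    Missing or empty values are returned as an empty string.'''
    if not isinstance(string, str) or not string:
        return ""
    if string[0] != "<":
        string = "<p>" + string + "</p>"
    return string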

def add_funders(df1, df2):
    '''Adds funder information from df1 to df2. Adds funder name, and optionally DOI, grant title, and grant number.'''
    funders = []
    for index, row in df1.iterrows():
        funder = []
        # Iterate through the grants if they are present
        if isinstance(row['metadata.grants'], list):
            for grant in row['metadata.grants']:
                if pd.notnull(grant):  # Check if grant is not null
                    funder.append('<p>Funder: ' + grant['funder']['name'])
                    if 'doi' in grant['funder']:
                        funder.append('<br>Funder DOI: <a href="https://doi.org/' + grant['funder']['doi'] + '">' +grant['funder']['doi']+ '</a>')
                    if 'title' in grant:
                        funder.append('<br>' + grant['title'])
                    if 'code' in grant:
                        funder.append('<br>' + grant['code'])
                    funder.append('</p>')
        else:
            # If it's not a list put an empty cell
            if pd.notnull(row['metadata.grants']):
                funder.append('')
        # Append the joined funder information to the list
        funders.append("".join(funder))
    # Add the new 'fundref' column to df2
    df2['fundref'] = funders
    return df2

def add_related_items(df1, df2):
    '''Adds related items from df1 to df2 with nice html formatting. Includes relation type taken from relation_dict dictionary. Formats PID appropriately if it is a DOI or other URL.'''
    items = []
    for index, row in df1.iterrows():
        item = []
        # Iterate through the related identifiers if they are present
        if isinstance(row['metadata.related_identifiers'], list):
            item.append('<p>')
            count =0
            for id in row['metadata.related_identifiers']:
                if pd.notnull(id):  
                    if count > 0:
                        item.append('<br>')
                    count += 1
                    if id['relation']:
                        item.append(relation_dict[id['relation']] + ': ')
                    if id['scheme'] == 'url':
                        url = id['identifier']
                        item.append( '<a href="'+ url + '">' + url + '</a>')
                    if id['scheme'] == 'doi':
                        doi = id['identifier']
                        item.append('<a href="https://doi.org/' + doi + '">' + doi + '</a>')
                    if id['scheme'] != 'url' and id['scheme'] != 'doi':
                        item.append(id['identifier'])
            item.append('</p>')
        else:
            # If it's not a list put an empty cell
            if pd.notnull(row['metadata.related_identifiers']):
                item.append('')
        # Append the joined related item information to the list
        items.append("".join(item))
    # Add the new 'related_data' column to df2
    df2['related_data'] = items
    return df2

def add_creators(df1, df2):
    '''Adds creator information. Note that this adds an arbitrary number of creators but Digital Commons has a number cap for authors so you may need to manually curate after.
    This function will make a new dataframe df3 and append it on to df2.'''
    # Create a list to hold all rows of data for the new DataFrame
    expanded_data = []
    
    # Process each row in df1
    for _, row in df1.iterrows():
        row_data = {}
        creators = row['metadata.creators']
        
        # Populate the row_data dictionary with each author's name and affiliation
        for i, creator in enumerate(creators):
            author_index = i + 1
            name = creator.get('name', "")
            institution = creator.get('affiliation', "")  # Use 'institution' instead of 'affiliation' (DC nomenclature)
            
            # Use the separate_name function to split names
            first_name, middle_name, last_name = separate_name(name)
            
            # Assign names and institution to the row_data dictionary
            row_data[f'author{author_index}_fname'] = first_name
            row_data[f'author{author_index}_mname'] = middle_name
            row_data[f'author{author_index}_lname'] = last_name
            row_data[f'author{author_index}_institution'] = institution  # Change 'affl' to 'institution'
        
        # Append row_data dictionary to expanded_data list
        expanded_data.append(row_data)

    # Convert expanded_data list of dictionaries into a new DataFrame
    df3 = pd.DataFrame(expanded_data)
    
    # Fill missing values with empty strings for any columns where data is missing
    df3 = df3.fillna("")
    
    df2_reset = df2.reset_index(drop=True)
    df3_reset = df3.reset_index(drop=True)
    
    # Concatenate the two DataFrames along the columns
    df_out = pd.concat([df2_reset, df3_reset], axis=1)
    
    return df_out
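
`csv_to_dict` treats every row of `relation_types.csv` as a key/value pair, so a header row would be loaded as an ordinary entry. The file itself is not shown in this notebook. As a hedged illustration, the rows below are guesses based on Zenodo relation identifiers and on the labels visible in the notebook output (such as "Is cited by"). The sketch parses them from an in-memory string instead of a file.

```python
import csv
import io

# Hypothetical file contents: relation identifier, display label
sample_csv = (
    "isCitedBy,Is cited by\n"
    "isSupplementTo,Is supplement to\n"
    "isSupplementedBy,Is supplemented by\n"
)

# Same parsing logic as csv_to_dict, applied to the in-memory text
relation_lookup = {}
for row in csv.reader(io.StringIO(sample_csv)):
    relation_lookup[row[0]] = row[1]

# The label is placed before each related-item link, followed by a colon
prefix = relation_lookup["isCitedBy"] + ": "
```

A relation identifier missing from the file raises a `KeyError` in `add_related_items`, so the CSV needs a row for every relation type that appears in the data.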




### Build output dataframe

In [None]:
### BUILD OUTPUT DATAFRAME

# title (take a copy so that adding columns does not trigger pandas' SettingWithCopyWarning)
df_output = df_input[['title']].copy()

# orcid
df_output = add_orcid(df1=df_input, df2=df_output)

# publication_date
df_output = add_col(df_input, "metadata.publication_date", df_output, "publication_date")

# abstract
df_output = add_col(df_input, "metadata.description", df_output, "abstract")
df_output['abstract'] = df_output['abstract'].apply(to_html)

# keywords
df_output = add_col(df_input, "metadata.keywords", df_output, "keywords")
df_output["keywords"] = df_output["keywords"].apply(list_to_string)

# disciplines
df_output["disciplines"] = "" #make blank column, we will need to fill in the values

#source_publication
df_output["source_publication"] = "" #make blank column, we will need to fill in the values

# related_data
df_output = add_related_items(df1=df_input, df2=df_output)

# source_fulltext_url
df_output = add_col(df_input, "doi_url", df_output, "source_fulltext_url")

# external_rep
df_output = add_repo(df1=df_input, df2=df_output)

# distribution_license and access_link
df_output = add_license(df1=df_input, df2=df_output)

# fundref (funder information)
df_output = add_funders(df1=df_input, df2=df_output)

# author info
df_output = add_creators(df1=df_input, df2=df_output)

display(df_output)

df_output.to_excel('batch-upload/zenodo_batch_upload.xlsx', index=False)


Unnamed: 0,title,orcid,publication_date,abstract,keywords,disciplines,source_publication,related_data,source_fulltext_url,external_rep,...,author47_lname,author47_institution,author48_fname,author48_mname,author48_lname,author48_institution,author49_fname,author49_mname,author49_lname,author49_institution
0,Alzheimer's disease risk gene BIN1 induces Tau...,"<p>Yuliya Voskobiynyk <a href=""https://orcid.o...",2020-08-19,<p>Genome-wide association studies identified ...,,,,"<p>Is cited by: <a href=""https://doi.org/10.75...",https://doi.org/10.5061/dryad.rbnzs7h8z,<p>Dryad</p>,...,,,,,,,,,,
1,Data from: The effect of Speed of Processing t...,,2015-08-04,<p>Older adults experience cognitive deficits ...,"peripheral, Useful Field of View, cognitive in...",,,"<p>Is cited by: <a href=""https://doi.org/10.13...",https://doi.org/10.5061/dryad.4fn70,<p>Dryad</p>,...,,,,,,,,,,
2,Data for Cell-type-specific alternative splici...,"<p>Emma F. Jones <a href=""https://orcid.org/00...",2024-06-25,<p><span><strong>data.tar.gz </strong>contains...,,,,"<p>Is compiled by: <a href=""https://doi.org/10...",https://doi.org/10.5281/zenodo.12535061,<p>Zenodo</p>,...,,,,,,,,,,
3,Data for Altered Glia-Neuron Communication in ...,"<p>Tabea Soelter <a href=""https://orcid.org/00...",2023-11-28,<p><strong>data.tar.gz contains all files from...,"Alzheimer's disease, neurodegeneration, cell-c...",,,"<p>Is supplemented by: <a href=""https://doi.or...",https://doi.org/10.5281/zenodo.10214497,<p>Zenodo</p>,...,,,,,,,,,,
4,Data for Long-read RNA sequencing identifies r...,"<p>Emma F. Jones <a href=""https://orcid.org/00...",2023-12-14,<p><span>data_minus_bam.tar.gz contains all fi...,"long-read RNA sequencing, brain, sex, alternat...",,,"<p>Is supplement to: <a href=""https://github.c...",https://doi.org/10.5281/zenodo.10381745,<p>Zenodo</p>,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124,"Dataset for ""What is Gab? A Bastion of Free Sp...",,2018-09-13,<p>This dataset was used for this project: &qu...,,,,,https://doi.org/10.5281/zenodo.3460400,<p>Zenodo</p>,...,,,,,,,,,,
125,"Public Dataset for ""Large Scale Crowdsourcing ...",,2020-02-21,<p>Dataset for the &quot;Large Scale Crowdsour...,,,,,https://doi.org/10.5281/zenodo.3678559,<p>Zenodo</p>,...,,,,,,,,,,
126,Transposon DNA sequences facilitate the tissue...,,2023-05-24,<p>The uploaded files are in the&nbsp;fasta fo...,"Horizontal gene transfer, circulating tumor DN...",,,,https://doi.org/10.5281/zenodo.7958520,<p>Zenodo</p>,...,,,,,,,,,,
127,Functional connectivity in the face of congeni...,"<p>Pinar Demirayak <a href=""https://orcid.org/...",2019-09-06,<p>Results of diffusion tensor imaging analysi...,,,,,https://doi.org/10.5281/zenodo.3401600,<p>Zenodo</p>,...,,,,,,,,,,
