# Mashup data

## Read and Filter Zenodo 

**Zenodo output data labels**

**title**: the title of the record\
**id**: the identifier number assigned to the record in Zenodo\
**doi**: the complete DOI of the record in Zenodo\
**creators**: a list of the creators of the record\
**orcid**: a list of the ORCID iDs of the creators of the record\
**date**: the publication date of the record\
**description**: the description in the metadata of the record\
**type**: the type of the record, extracted from the title of the resource type in the metadata\
**broader_type**: the type of the record, extracted from the type of the resource type in the metadata\
**rights**: the access rights to the record\
**publisher**: the publisher, if mentioned: the journal for journal articles and peer reviews, the meeting for conference papers and presentations, the imprint publisher for books, or the university for theses; otherwise it is zenodo\
**relation**: the parent PID values of the record's version relations\
**communities**: the ids of the communities mentioned in the metadata

In [1]:
import numpy as np
import pandas as pd

# Read Zenodo dataset
zen_ds = pd.read_json(path_or_buf='../datasets/ZenodoData.json')

In [39]:
# Safe getter: walks nested dictionaries key by key and returns `default` if a level is missing
def safe_nested_get(d, *keys, default='zenodo'):
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    return d
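
A quick check of how the getter behaves, as a sketch on hypothetical metadata dictionaries (the journal title and the record contents below are invented for illustration). The function body is repeated so that the snippet runs on its own.

```python
def safe_nested_get(d, *keys, default='zenodo'):
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    return d

# Hypothetical records, invented for illustration
with_journal = {'journal': {'title': 'Example Journal'}}
without_journal = {'title': 'A record without journal information'}
journal_not_dict = {'journal': 'Example Journal'}

found = safe_nested_get(with_journal, 'journal', 'title')
missing = safe_nested_get(without_journal, 'journal', 'title')
wrong_shape = safe_nested_get(journal_not_dict, 'journal', 'title')
custom = safe_nested_get(without_journal, 'journal', 'title', default=None)

print(found)        # Example Journal
print(missing)      # zenodo
print(wrong_shape)  # zenodo
print(custom)       # None
```

Note that a missing key never yields `NaN` unless `default=np.nan` is passed explicitly, so a later `pd.isna` test on the result only makes sense with that explicit default.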


In [40]:
def extract_metadata_info(metadata):
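    if not isinstance(metadata, dict): # Return placeholder values when the metadata is not a dictionary
        # Keys and their order must match the Series returned at the end of the function,
        # because the result is assigned to the dataframe columns by position
        return pd.Series({
            'name' : np.nan,
            'orcid': np.nan,
            'date': np.nan,
            'description' : np.nan,
            'type' : np.nan,
            'broad_type' : np.nan,
            'access_right' : np.nan,
            'publisher' : 'zenodo',
            'relation' : np.nan,
            'communities' : np.nan
        })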
    
    # Create a list of creators
    creators = [creator.get('name', np.nan) for creator in metadata.get('creators', [])]
    # Create a list of creators' orcid numbers
    orcids = [creator.get('orcid', np.nan) for creator in metadata.get('creators', [])]
    pub_date = metadata.get('publication_date', np.nan) 
    description = metadata.get('description', np.nan) # Retrieve description for potential use cases
    # Default to an empty dict so the chained .get() does not fail when 'resource_type' is missing
    res_type = metadata.get('resource_type', {}).get('title', np.nan) # Get the type by title of the type
    broad_type = metadata.get('resource_type', {}).get('type', np.nan) # Get the broader type
    rights = metadata.get('access_right', np.nan)
    publisher = 'zenodo' # Put the default value to zenodo
    communities = [community.get('id', np.nan) for community in metadata.get('communities', [])]

    relation = []
    # Extract pid_value of relations in metadata
    version_list = metadata.get('relations', {}).get('version', [])
    for version in version_list:
        parent = version.get('parent', {})
        pid_val = parent.get('pid_value')
        if pid_val:
            relation.append(pid_val)
            
    if res_type == 'Journal article' or res_type == 'Peer review':
        publisher = safe_nested_get(metadata, 'journal', 'title')

    elif res_type == 'Conference paper' or res_type == 'Presentation':
        # Use the meeting title when the resource type is a conference paper or a presentation
        publisher = safe_nested_get(metadata, 'meeting', 'title')

    elif res_type == 'Book chapter' or res_type == 'Book':
        # Ask for NaN instead of the 'zenodo' default so that the fallback below can trigger
        publisher = safe_nested_get(metadata, 'imprint', 'publisher', default=np.nan)
        if pd.isna(publisher):
            publisher = safe_nested_get(metadata, 'thesis', 'place')

    elif res_type == 'Thesis' or 'thesis' in metadata:
        # Also covers records whose type is not Thesis but whose publisher
        # information is stored under the 'thesis' key
        publisher = safe_nested_get(metadata, 'thesis', 'university')
           
    return pd.Series({
        'name' : creators,
        'orcid': orcids,
        'date': pub_date,
        'description' : description,
        'type' : res_type,
        'broad_type' : broad_type,
        'access_right' : rights,
        'publisher' : publisher,
        'relation' : relation,
        'communities' : communities
    })
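
The relation step can be checked in isolation. The sketch below repeats that loop in a hypothetical helper, `extract_parent_pids`, and runs it on an invented metadata dictionary; the nesting (`relations` → `version` → `parent` → `pid_value`) follows the keys used above, while the `index` and `pid_type` fields and all values are assumptions made for illustration.

```python
def extract_parent_pids(metadata):
    """Collect the parent pid_value of every version relation (hypothetical helper)."""
    pids = []
    for version in metadata.get('relations', {}).get('version', []):
        pid_val = version.get('parent', {}).get('pid_value')
        if pid_val:  # Skip versions without a parent PID
            pids.append(pid_val)
    return pids

# Invented record: one version with a parent PID, one without
sample_metadata = {
    'relations': {
        'version': [
            {'index': 0, 'parent': {'pid_type': 'recid', 'pid_value': '1000000'}},
            {'index': 1, 'parent': {}},
        ]
    }
}

with_relations = extract_parent_pids(sample_metadata)
no_relations = extract_parent_pids({})

print(with_relations)  # ['1000000']
print(no_relations)    # []
```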


In [None]:
# Apply the function
zen_ds[['creators', 'orcid', 'date', 'description', 'type', 'broader_type', 'rights', 'publisher',
        'relation', 'communities']] = zen_ds['metadata'].apply(extract_metadata_info)
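
A small sketch of what this assignment does, on a toy dataframe with invented records and a hypothetical two-field extractor, `split_info`. When the applied function returns a `pd.Series`, `Series.apply` produces a dataframe with one column per key; assigning that dataframe to a list of column names then pairs the columns by position, so the key names in the returned Series do not have to match the target names, but their order does.

```python
import pandas as pd

def split_info(metadata):
    # Hypothetical extractor returning two fields under its own key names
    return pd.Series({
        'name': [c.get('name') for c in metadata.get('creators', [])],
        'date': metadata.get('publication_date'),
    })

# Toy data, invented for illustration
toy = pd.DataFrame({
    'title': ['Record A', 'Record B'],
    'metadata': [
        {'creators': [{'name': 'Doe, Jane'}], 'publication_date': '2020-01-01'},
        {'creators': [], 'publication_date': '2021-06-30'},
    ],
})

expanded = toy['metadata'].apply(split_info)
print(list(expanded.columns))  # ['name', 'date']

# Positional pairing: 'name' -> 'creators', 'date' -> 'date'
toy[['creators', 'date']] = expanded
print(list(toy.columns))  # ['title', 'metadata', 'creators', 'date']
```

This is why every Series returned by the extraction function should carry the same keys in the same order.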

In [67]:
column_lst = ['title', 'id', 'doi', 'creators', 'orcid', 'date', 'description', 'type', 'broader_type', 'rights', 
              'publisher', 'relation', 'communities']
norm_zen_ds = zen_ds[column_lst].copy() # Copy the needed columns into a new dataframe (avoids SettingWithCopyWarning)
norm_zen_ds['src_repo'] = 'zenodo' # Add the flag column for source repository "zenodo"
norm_zen_ds.head()


|  | title | id | doi | creators | orcid | date | description | type | broader_type | rights | publisher | relation | communities | src_repo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Il Progetto ACCESs: esperienze di accessibilit... | 7956878 | 10.5281/zenodo.7956878 | [Zanchi, Anna] | [nan] | 2023-05-22 | `<p>Tesi di laurea magistrale del corso di Arti...` | Thesis | publication | open | Alma Mater Studiorum Università di Bologna | [7956877] | [] | zenodo |
| 1 | La Chouffe DMP New | 6411449 | 10.5281/zenodo.6411449 | [Chiara Catizone, Giulia Venditti, Davide Brem... | [0000-0003-2445-2426, 0000-0001-7696-7574, 000... | 2022-04-04 | `<p>This DMP has been created fo managing data ...` | Output management plan | publication | open | zenodo | [6411448] | [argos] | zenodo |
| 2 | Footactile rhythmics: protocols and data colle... | 5504259 | 10.5281/zenodo.5504259 | [Dall'Osso, Giorgio] | [0000-0002-4219-7513] | 2021-09-13 | `<p>The data shared refer to research investiga...` | Dataset | dataset | open | Alma Mater Studiorum - Università di Bologna | [5504258] | [] | zenodo |
| 3 | La Chouffe DMP | 6411382 | 10.5281/zenodo.6411382 | [Chiara Catizone] | [nan] | 2022-04-04 | This DMP has been created fo managing data rep... | Output management plan | publication | open | zenodo | [6411381] | [argos] | zenodo |
| 4 | Addressing the Challenges of Health Data Stand... | 15358180 | 10.5281/zenodo.15358180 | [Marfoglia, Alberto, Arcobelli, Valerio Antoni... | [0009-0000-5857-2376, 0000-0002-1262-9899, 000... | 2025-05-07 | `<p>This table presents the data extraction fro...` | Dataset | dataset | open | zenodo | [15358179] | [] | zenodo |


## Output

In [68]:
norm_zen_ds.to_csv('../mashup_dataset/mashup.csv', index=False)
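
One thing to keep in mind when reading this file back: CSV has no list type, so list-valued columns such as `creators` or `communities` are written as their string representation. The sketch below shows this on an invented one-row dataframe, using an in-memory buffer instead of a file, and restores the list with `ast.literal_eval`; it is an illustration under those assumptions, not part of the pipeline above.

```python
import ast
import io

import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'title': ['Record A'],
    'creators': [['Doe, Jane', 'Roe, Richard']],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
raw_value = reloaded.loc[0, 'creators']
print(type(raw_value).__name__)  # str

# Parse the string representation back into a Python list
reloaded['creators'] = reloaded['creators'].apply(ast.literal_eval)
parsed_value = reloaded.loc[0, 'creators']
print(parsed_value)  # ['Doe, Jane', 'Roe, Richard']
```

Lists that contain `nan`, as the `orcid` column can, are written as `[nan]`, which `ast.literal_eval` rejects, so those columns would need separate handling.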