# Mashup data

## Read and Filter Zenodo 

**Zenodo output data labels**

**title**: the title of the record\
**id**: the identifier number assigned to the record in Zenodo\
**doi**: the complete DOI of the record in Zenodo\
**creators**: a list of the creators of the record\
**orcid**: a list of the ORCID iDs of the creators of the record\
**date**: the publication date of the record\
**description**: the description in the metadata of the record\
**resource_type**: the type of the record, taken from the title of the resource type in the metadata\
**type**: the broader type of the record, taken from the type of the resource type in the metadata\
**rights**: the access rights to the record\
**publisher**: the journal, meeting, or book publisher if mentioned, or the university for a thesis; otherwise it is zenodo\
**relation**: the parent pid values of the record's version relations\
**communities**: the community ids mentioned in the metadata

In [149]:
import numpy as np
import pandas as pd

# Read Zenodo dataset
zen_ds = pd.read_json(path_or_buf='../datasets/ZenodoData.json')

In [150]:
%qtconsole

In [151]:
# Safe getter: gets from deeper layers
def safe_nested_get(d, *keys, default='zenodo'):
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    return d
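
To see how the getter behaves, the sketch below calls it on two hypothetical metadata records. The dictionaries are made up for illustration and are not taken from the Zenodo dump; the function is repeated only so that the snippet runs on its own.

```python
def safe_nested_get(d, *keys, default='zenodo'):
    """Walk nested dictionaries key by key, falling back to `default`."""
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    return d


# Hypothetical records, for illustration only
with_journal = {'journal': {'title': 'Example Journal'}}
without_journal = {'title': 'A record with no journal block'}

found = safe_nested_get(with_journal, 'journal', 'title')
missing = safe_nested_get(without_journal, 'journal', 'title')
print(found)    # Example Journal
print(missing)  # zenodo
```

When any level of the path is absent, the walk stops and the default `'zenodo'` is returned, which is what makes it usable as a publisher fallback.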


In [152]:
def extract_metadata_info(metadata):
    if not isinstance(metadata, dict): # control if the argument is a dictionary
        return pd.Series({ # Same keys, in the same order, as the regular return value below
            'name' : np.nan,
            'orcid': np.nan,
            'date': np.nan,
            'description' : np.nan,
            'resource_type' : np.nan,
            'type' : np.nan,
            'access_right' : np.nan,
            'publisher' : 'zenodo',
            'relation' : np.nan,
            'communities' : np.nan
        })
    
    # Create a list of creators
    creators = [creator.get('name', np.nan) for creator in metadata.get('creators', [])]
    # Create a list of creators' orcid numbers
    orcids = [creator.get('orcid', np.nan) for creator in metadata.get('creators', [])]
    pub_date = metadata.get('publication_date', np.nan) 
    description = metadata.get('description', np.nan) # Retrieve description for potential use cases
    res_type = metadata.get('resource_type', {}).get('title', np.nan) # Get the type by title of the type
    broad_type = metadata.get('resource_type', {}).get('type', np.nan) # Get the broader type
    rights = metadata.get('access_right', np.nan)
    publisher = 'zenodo' # Put the default value to zenodo
    communities = [community.get('id', np.nan) for community in metadata.get('communities', [])]

    relation = []
    # Extract pid_value of relations in metadata
    version_list = metadata.get('relations', {}).get('version', [])
    for version in version_list:
        parent = version.get('parent', {})
        pid_val = parent.get('pid_value')
        if pid_val:
            relation.append(pid_val)
            
    if res_type == 'Journal article' or res_type == 'Peer review':
        publisher = safe_nested_get(metadata, 'journal', 'title')

    elif res_type == 'Conference paper' or res_type == 'Presentation': # Use the meeting title when the resource
        publisher = safe_nested_get(metadata, 'meeting', 'title')      # type is a conference paper or presentation

    elif res_type == 'Book chapter' or res_type == 'Book':
        publisher = safe_nested_get(metadata, 'imprint', 'publisher', default=np.nan)
        if pd.isna(publisher): # Fall back to the thesis place when no imprint publisher is given
            publisher = safe_nested_get(metadata, 'thesis', 'place')

    elif res_type == 'Thesis' or 'thesis' in metadata: # In case the type is not thesis but the publishers' 
                                                       # information is in the 'thesis' key
        publisher = safe_nested_get(metadata, 'thesis', 'university')
           
    return pd.Series({
        'name' : creators,
        'orcid': orcids,
        'date': pub_date,
        'description' : description,
        'resource_type' : res_type,
        'type' : broad_type,
        'access_right' : rights,
        'publisher' : publisher,
        'relation' : relation,
        'communities' : communities
    })
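
The relation extraction assumes a particular nesting inside the metadata. A minimal sketch on a hypothetical fragment (the identifier value is made up) shows which entries end up in the `relation` list:

```python
# Hypothetical metadata fragment; the nesting mirrors what the function above expects
metadata = {
    'relations': {
        'version': [
            {'parent': {'pid_value': '7956877'}},
            {'parent': {}},  # an entry without a parent identifier is skipped
        ]
    }
}

relation = [
    version['parent']['pid_value']
    for version in metadata.get('relations', {}).get('version', [])
    if version.get('parent', {}).get('pid_value')
]
print(relation)  # ['7956877']
```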


In [153]:
# Apply the function
zen_ds[['creators', 'orcid', 'date', 'description', 'resource_type', 'type', 'rights', 'publisher',
        'relation', 'communities']] = zen_ds['metadata'].apply(extract_metadata_info)

In [154]:
column_lst = ['title', 'id', 'doi', 'creators', 'orcid', 'date', 'description', 'resource_type', 'doi_url', 'type', 'rights', 
              'publisher', 'relation', 'communities']
norm_zen_ds = zen_ds[column_lst].copy() # Get the needed columns as an independent dataframe
norm_zen_ds['src_repo'] = 'zenodo' # Add the flag column for source repository "zenodo"
norm_zen_ds = norm_zen_ds.rename(columns={'doi_url':'url'})
norm_zen_ds.head()


Unnamed: 0,title,id,doi,creators,orcid,date,description,resource_type,url,type,rights,publisher,relation,communities,src_repo
0,Il Progetto ACCESs: esperienze di accessibilit...,7956878,10.5281/zenodo.7956878,"[Zanchi, Anna]",[nan],2023-05-22,<p>Tesi di laurea magistrale del corso di Arti...,Thesis,https://doi.org/10.5281/zenodo.7956878,publication,open,Alma Mater Studiorum Università di Bologna,[7956877],[],zenodo
1,La Chouffe DMP New,6411449,10.5281/zenodo.6411449,"[Chiara Catizone, Giulia Venditti, Davide Brem...","[0000-0003-2445-2426, 0000-0001-7696-7574, 000...",2022-04-04,<p>This DMP has been created fo managing data ...,Output management plan,https://doi.org/10.5281/zenodo.6411449,publication,open,zenodo,[6411448],[argos],zenodo
2,Footactile rhythmics: protocols and data colle...,5504259,10.5281/zenodo.5504259,"[Dall'Osso, Giorgio]",[0000-0002-4219-7513],2021-09-13,<p>The data shared refer to research investiga...,Dataset,https://doi.org/10.5281/zenodo.5504259,dataset,open,Alma Mater Studiorum - Università di Bologna,[5504258],[],zenodo
3,La Chouffe DMP,6411382,10.5281/zenodo.6411382,[Chiara Catizone],[nan],2022-04-04,This DMP has been created fo managing data rep...,Output management plan,https://doi.org/10.5281/zenodo.6411382,publication,open,zenodo,[6411381],[argos],zenodo
4,Addressing the Challenges of Health Data Stand...,15358180,10.5281/zenodo.15358180,"[Marfoglia, Alberto, Arcobelli, Valerio Antoni...","[0009-0000-5857-2376, 0000-0002-1262-9899, 000...",2025-05-07,<p>This table presents the data extraction fro...,Dataset,https://doi.org/10.5281/zenodo.15358180,dataset,open,zenodo,[15358179],[],zenodo
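
Assigning a column to a frame obtained by selecting columns from another frame, as in the cell above, can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of the usual remedy, an explicit `.copy()`, on a made-up frame (the column values are placeholders):

```python
import pandas as pd

# Toy frame standing in for the full Zenodo table (values are made up)
full = pd.DataFrame({'title': ['a', 'b'], 'doi_url': ['u1', 'u2'], 'extra': [1, 2]})

# .copy() makes the selection an independent DataFrame, so adding a
# column or renaming does not touch `full`
subset = full[['title', 'doi_url']].copy()
subset['src_repo'] = 'zenodo'
subset = subset.rename(columns={'doi_url': 'url'})

print(list(subset.columns))  # ['title', 'url', 'src_repo']
print(list(full.columns))    # ['title', 'doi_url', 'extra']
```

Reassigning the result of `rename` instead of using `inplace=True` keeps each step explicit about which object it modifies.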


There are 276 duplicated rows in the Zenodo dataset (by id and doi).

In [155]:
zen_ds2 = norm_zen_ds.drop_duplicates(subset=['id'])
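
A duplicate count like the one quoted above can be obtained with `DataFrame.duplicated`. The sketch below uses made-up records, so the numbers are not those of the Zenodo dataset:

```python
import pandas as pd

# Made-up records: the second and third rows share the same id and doi
records = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'doi': ['10.0000/a', '10.0000/b', '10.0000/b', '10.0000/c'],
    'title': ['A', 'B', 'B (again)', 'C'],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = int(records.duplicated(subset=['id', 'doi']).sum())
deduplicated = records.drop_duplicates(subset=['id'])

print(n_duplicates)       # 1
print(len(deduplicated))  # 3
```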

In [156]:
zen_ds2['type'].unique()

array(['publication', 'dataset', 'software', 'other', 'poster',
       'presentation', 'image', 'model', 'event', 'lesson', 'video',
       'workflow', 'physicalobject'], dtype=object)

## Read and Filter AMS Acta
**AMS Acta output data labels**


In [157]:
ams_ds = pd.read_json(path_or_buf='../datasets/amsacta_filtered_affiliation_or_orcid_doubles.json')

In [158]:
ams_col = ['title', 'doi', 'creators', 'monograph_type', 'type', 'date', 'uri', 'publisher', 'eprintid', 
          'abstract', 'issn' ]

In [159]:
ams_ds_filt = ams_ds[ams_col].copy() # Work on a copy so later column assignments do not warn

In [160]:
ams_ds_filt['date'] = pd.to_datetime(ams_ds_filt['date']).dt.date # Normalize datetime



In [161]:
def name_getter(creators_raw):
    # creators_raw is already the list we want
    if not isinstance(creators_raw, list):
        creators_raw = []

    creators, orcids = [], []

    for person in creators_raw:
        if not isinstance(person, dict):
            continue
        name     = person.get("name", {})
        family   = name.get("family")
        given    = name.get("given")

        parts = [p for p in (family, given) if p]      # drop None/empty
        if parts:
            creators.append(", ".join(parts[::-1]))    # reversed, so the result is "Given, Family"

        orcids.append(person.get("orcid", np.nan))

    return pd.Series({"creators": creators or np.nan,
                      "orcid":    orcids    or np.nan})


In [162]:
ams_ds_filt[["creators", "orcid"]] = ams_ds_filt["creators"].apply(name_getter) # Apply the function



### AMSACTA column renames
* resource_type is monograph_type in the original dataset
* url is uri in the original dataset
* id is eprintid in the original dataset
* description is abstract in the original dataset

In [163]:
ams_ds_filt = ams_ds_filt.rename(columns={'monograph_type':'resource_type','uri':'url', 'eprintid':'id', 'abstract':'description'})
ams_ds_filt['src_repo'] = 'amsacta'


Some records are duplicated under different ids. We keep them for further investigation.

In [164]:
ams_ds_filt.head()

Unnamed: 0,title,doi,creators,resource_type,type,date,url,publisher,id,description,issn,orcid,src_repo
0,Introduzione alla Fisica del Terreno,10.6092/unibo/amsacta/2616,"[Giuliano, Vitali]",manual,monograph,2009-09-30,https://amsacta.unibo.it/id/eprint/2616,Asterisco,2616,,,[0000-0002-7866-5534],amsacta
1,A Female Musician or Dancer of Iron Age in Sou...,10.6092/unibo/amsacta/2953,"[Angela, Bellia]",,preprint,1970-01-01,https://amsacta.unibo.it/id/eprint/2953,,2953,The excavations conducted by Paola Zancani Mon...,,[0000-0002-1517-6012],amsacta
2,Gli strumenti musicali nelle immagini della Gr...,10.6092/unibo/amsacta/2955,"[Angela, Bellia]",,preprint,1970-01-01,https://amsacta.unibo.it/id/eprint/2955,,2955,Questo percorso didattico è dedicato agli stru...,,[0000-0002-1517-6012],amsacta
3,"Mito, musica e rito nelle raffigurazioni music...",10.6092/unibo/amsacta/2957,"[Angela, Bellia]",,conference_item,2008-07-01,https://amsacta.unibo.it/id/eprint/2957,,2957,I pinakes locresi sono tavolette votive in ter...,,[0000-0002-1517-6012],amsacta
4,Le raffigurazioni musicali nella coroplastica ...,10.6092/unibo/amsacta/2958,"[Angela, Bellia]",,conference_item,1970-01-01,https://amsacta.unibo.it/id/eprint/2958,,2958,"Nell’ambito degli studi archeologici, le ricer...",,[0000-0002-1517-6012],amsacta


## Read Software Heritage data

In [165]:
swh_ds = pd.read_json(path_or_buf='../datasets/unibo_repositories_swh.json')

In [166]:
def author_getter(creators_raw):
    # creators_raw is already the list we want
    if not isinstance(creators_raw, list):
        creators_raw = []

    creators = []

    for person in creators_raw:
        if not isinstance(person, dict):
            continue
        name = person.get("name")

        if name and name != 'GitHub':  # skip missing names and the generic 'GitHub' author
            creators.append(name)    # get name

    return pd.Series({"creators": creators or np.nan})


In [167]:
swh_ds[["creators"]] = swh_ds["authors"].apply(author_getter) # Apply the function

In [168]:
swh_ds = swh_ds[['url', 'creators']].copy()
swh_ds['src_repo'] = 'software heritage'

In [169]:
# Get the title from the url tail
swh_ds['title'] = swh_ds['url'].str.extract(r'.*/(.*?)$')
PATTERN = (
      r'(?<=[a-z0-9])(?=[A-Z])'    # lower-or-digit → Upper
    r'|(?<=[A-Za-z])(?=\d)'        # letter         → digit
    r'|[-_.]'                      # hyphen, underscore or dot
)

swh_ds['title'] = (
    swh_ds['title']
      .replace(PATTERN, ' ', regex=True)   # insert / swap for space
      .replace(r'\s+', ' ', regex=True)    # collapse doubles
)
swh_ds.head()

Unnamed: 0,url,creators,src_repo,title
0,https://dei-gitlab.dei.unibo.it/mengozzi/thesi...,[Mattia Mengozzi],software heritage,thesis git
1,https://github.com/CVLAB-Unibo/Learning2AdaptF...,"[Alessio Tonioni, Alessio Tonioni, atonioni]",software heritage,Learning 2 Adapt For Stereo
2,https://github.com/unibo-bigdata/101-hadoop-hd...,"[Enrico Gallinucci, unknown]",software heritage,101 hadoop hdfs Riccardo Salvatori
3,https://github.com/rrnextUsername/it.unibo.esa...,[Mattia Piretti],software heritage,it unibo esame sprint 7 refactoring
4,https://bitbucket.org/shapournemati_unibo/cart...,"[Shapour Nemati, shapournemati_unibo]",software heritage,cartag android git
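
The title clean-up can be tried on single strings with the standard `re` module. This is a sketch under assumptions: `title_from_url` is a hypothetical helper, the URLs are invented, and the trailing-slash handling and final `strip()` are additions that the notebook cell does not perform.

```python
import re

PATTERN = (
    r'(?<=[a-z0-9])(?=[A-Z])'   # lower-case letter or digit followed by upper-case
    r'|(?<=[A-Za-z])(?=\d)'     # letter followed by digit
    r'|[-_.]'                   # hyphen, underscore or dot
)


def title_from_url(url):
    """Turn the last path segment of a URL into a space-separated title."""
    tail = url.rstrip('/').rsplit('/', 1)[-1]
    spaced = re.sub(PATTERN, ' ', tail)
    return re.sub(r'\s+', ' ', spaced).strip()


# Hypothetical URLs, for illustration only
print(title_from_url('https://example.org/group/Learning2AdaptForStereo'))
# Learning 2 Adapt For Stereo
print(title_from_url('https://example.org/user/cartag-android.git'))
# cartag android git
```

The two lookaround alternatives match empty positions, so the replacement inserts a space there, while the character class swaps separators for a space.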


## Iris dump

#### First file consists of ids and creators

In [170]:
iris1 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_CON_PERSON.csv', dtype={'ITEM_ID':'str'})
iris1['creators'] = iris1['LAST_NAME']+', '+iris1['FIRST_NAME'] # Build the name column as "LAST_NAME, FIRST_NAME"

# Groupby and create list of creators and orcid numbers for each record
iris1_agg = iris1.groupby(["ITEM_ID"]).agg({
    'creators' : lambda x: list(x.unique()),
    'ORCID' : lambda  x: list(x.unique()),
}).reset_index()

In [171]:
iris1_agg = iris1_agg.sort_values(['ITEM_ID']).astype('str')
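
To make the shape of the aggregated table concrete, here is the same group-and-collect step on a few made-up author rows (names and the ORCID iD are placeholders, not IRIS data):

```python
import pandas as pd

# Made-up author rows: item '10' has two authors, item '11' has one
people = pd.DataFrame({
    'ITEM_ID': ['10', '10', '11'],
    'LAST_NAME': ['ROSSI', 'BIANCHI', 'VERDI'],
    'FIRST_NAME': ['ANNA', 'LUCA', 'SARA'],
    'ORCID': ['0000-0000-0000-0001', None, None],
})
people['creators'] = people['LAST_NAME'] + ', ' + people['FIRST_NAME']

# One row per item, with the distinct creators and ORCID iDs collected into lists
per_item = people.groupby('ITEM_ID').agg({
    'creators': lambda x: list(x.unique()),
    'ORCID': lambda x: list(x.unique()),
}).reset_index()

print(per_item.loc[0, 'creators'])  # ['ROSSI, ANNA', 'BIANCHI, LUCA']
print(len(per_item))                # 2
```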

#### Second file consists of some records that are not present in the first file

In [172]:
iris2 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_DESCRIPTION.csv', dtype={'ITEM_ID':'str'})
iris2 = iris2[['ITEM_ID', 'DES_ALLPEOPLE']]

In [173]:
rows_only_in_iris2 = iris2[~iris2['ITEM_ID'].isin(iris1_agg['ITEM_ID'])].copy() # Keep the records that are not in the first file
rows_only_in_iris2['creators'] = rows_only_in_iris2['DES_ALLPEOPLE'].str.split(';') # Make a list of creators from the DES_ALLPEOPLE column to add to the main dataframe

In [174]:
iris = pd.concat([iris1_agg, rows_only_in_iris2]).astype('str') # Concatenate the non-overlapping records

In [175]:
print(rows_only_in_iris2.shape)
print(iris1_agg.shape)
print(iris.shape)

(40400, 3)
(371806, 3)
(412206, 4)
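
The printed shapes add up (40400 + 371806 = 412206 rows) because the second part contains only ids missing from the first. The same anti-join and concatenation, sketched on made-up frames:

```python
import pandas as pd

# Made-up stand-ins for the aggregated first file and the second file
first = pd.DataFrame({'ITEM_ID': ['1', '2'], 'creators': [['A'], ['B']]})
second = pd.DataFrame({'ITEM_ID': ['2', '3'], 'DES_ALLPEOPLE': ['B', 'C; D']})

# Keep only the rows of `second` whose ITEM_ID is absent from `first`
only_in_second = second[~second['ITEM_ID'].isin(first['ITEM_ID'])].copy()
only_in_second['creators'] = only_in_second['DES_ALLPEOPLE'].str.split(';')

combined = pd.concat([first, only_in_second], ignore_index=True)

# Row counts add up because the two parts do not overlap on ITEM_ID
assert len(combined) == len(first) + len(only_in_second)
print(combined['ITEM_ID'].tolist())  # ['1', '2', '3']
```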


#### Third file consists of identifiers

In [176]:
iris3 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_IDENTIFIER.csv', dtype='str')
iris3.dropna(subset='ITEM_ID', inplace=True)
iris3_id = iris3[['ITEM_ID', 'IDE_DOI', 'IDE_URL']]

In [177]:
iris = pd.merge(left=iris, right=iris3_id, on='ITEM_ID', how='left') # Add the doi and url where they exist

#### Fourth file consists of titles, dates (year), and owning collections

In [178]:
iris4 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_MASTER_ALL.csv', dtype='str')
iris4 = iris4[['ITEM_ID','DATE_ISSUED_YEAR', 'TITLE', 'OWNING_COLLECTION_DES']]

In [179]:
iris = pd.merge(left=iris, right=iris4, on='ITEM_ID', how='outer') # merge the fourth iris data set (titles and date)
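
The identifier file is joined with `how='left'` and the title file with `how='outer'`, which differ in which ids survive. A small sketch on made-up frames, using the `indicator` option of `pd.merge` to show where each row came from:

```python
import pandas as pd

# Made-up frames: item '3' appears only on the right-hand side
left = pd.DataFrame({'ITEM_ID': ['1', '2'], 'creators': ['A', 'B']})
right = pd.DataFrame({'ITEM_ID': ['2', '3'], 'TITLE': ['Second', 'Third']})

left_join = pd.merge(left, right, on='ITEM_ID', how='left')
outer_join = pd.merge(left, right, on='ITEM_ID', how='outer', indicator=True)

print(len(left_join))   # 2: only ids already present on the left
print(len(outer_join))  # 3: ids from either side
print(outer_join[['ITEM_ID', '_merge']])
```

With an outer join, records that exist only in the right-hand file are kept and get missing values in the left-hand columns.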

#### Fifth file consists of publisher names

In [180]:
iris5 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_PUBLISHER.csv', dtype='str')
iris5 = iris5[['ITEM_ID', 'PUB_NAME']]

In [181]:
iris = pd.merge(left=iris, right=iris5, on='ITEM_ID', how='outer')

In [182]:
iris1_agg.rename(columns={'ITEM_ID':'id', 'NAME':'creators', 'ORCID':'orcid'}, inplace=True)

In [183]:
rows_only_in_iris5 = iris5[~iris5['ITEM_ID'].isin(iris['ITEM_ID'])].copy()

In [184]:
rows_only_in_iris5

Unnamed: 0,ITEM_ID,PUB_NAME


#### Sixth file consists of relations, from which the ISSN is taken

In [185]:
iris6 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_RELATION.csv', dtype='str')
iris6['REL_ISSN_IN_ERIH_PLUS'] = iris6['REL_ISSN_IN_ERIH_PLUS'].replace({'0':'','1':''})
iris6['issn'] = iris6['REL_ISSN'] + iris6['REL_ISSN_IN_ERIH_PLUS']
iris6 = iris6[['ITEM_ID','issn']]

In [186]:
iris = pd.merge(left=iris, right=iris6, on='ITEM_ID', how='left')

In [187]:
iris.drop(['DES_ALLPEOPLE'], axis=1, inplace=True)
iris = iris.rename(columns={'ITEM_ID':'id','ORCID':'orcid','IDE_DOI':'doi', 'IDE_URL':'url', 'OWNING_COLLECTION_DES':'type','DATE_ISSUED_YEAR':'date', 'TITLE':'title', 'PUB_NAME':'publisher'})
iris['src_repo'] = 'iris'
iris.head()

Unnamed: 0,id,creators,orcid,doi,url,date,title,type,publisher,issn,src_repo
0,1,"['CHIUSOLI, ALESSANDRO']",[nan],,,2004,Il verde nelle aree urbane,2.01 Capitolo / saggio in libro,EDAGRICOLE-EDIZIONI AGRICOLE DE UK IL SOLE 24 ORE,,iris
1,10,"['POGGI, VALENTINA']",[nan],,,2005,SAMUEL RICHARDSON. LA VITA. PROFILO STORICO CR...,2.01 Capitolo / saggio in libro,GARZANTI,,iris
2,100,"['SEBASTIANI, ALBERTO']",['0000-0001-8197-2888'],,,2005,Io mangio la mela? Io mangio la mela!,1.01 Articolo in rivista,,0012-3382,iris
3,1000,"['QUARANTA, MARILISA', 'OTTANI, VITTORIA']",[nan],,,2005,Metalloprotesis activation in spontaneously br...,4.02 Riassunto (Abstract),,0176-8638,iris
4,10000,"['OMICINI, ANDREA']",['0000-0002-6655-3869'],,http://ceur-ws.org/Vol-1382/paper11.pdf,2015,Coordination of Large-Scale Socio-Technical Sy...,4.01 Contributo in Atti di convegno,"Sun SITE Central Europe, RWTH Aachen University",1613-0073,iris


## Mashup

In [188]:
mashup = pd.concat([zen_ds2, ams_ds_filt, swh_ds, iris], join='outer', ignore_index=True) # zen_ds2 is the deduplicated Zenodo table

In [189]:
mashup.head()

Unnamed: 0,title,id,doi,creators,orcid,date,description,resource_type,url,type,rights,publisher,relation,communities,src_repo,issn
0,Il Progetto ACCESs: esperienze di accessibilit...,7956878,10.5281/zenodo.7956878,"[Zanchi, Anna]",[nan],2023-05-22,<p>Tesi di laurea magistrale del corso di Arti...,Thesis,https://doi.org/10.5281/zenodo.7956878,publication,open,Alma Mater Studiorum Università di Bologna,[7956877],[],zenodo,
1,La Chouffe DMP New,6411449,10.5281/zenodo.6411449,"[Chiara Catizone, Giulia Venditti, Davide Brem...","[0000-0003-2445-2426, 0000-0001-7696-7574, 000...",2022-04-04,<p>This DMP has been created fo managing data ...,Output management plan,https://doi.org/10.5281/zenodo.6411449,publication,open,zenodo,[6411448],[argos],zenodo,
2,Footactile rhythmics: protocols and data colle...,5504259,10.5281/zenodo.5504259,"[Dall'Osso, Giorgio]",[0000-0002-4219-7513],2021-09-13,<p>The data shared refer to research investiga...,Dataset,https://doi.org/10.5281/zenodo.5504259,dataset,open,Alma Mater Studiorum - Università di Bologna,[5504258],[],zenodo,
3,La Chouffe DMP,6411382,10.5281/zenodo.6411382,[Chiara Catizone],[nan],2022-04-04,This DMP has been created fo managing data rep...,Output management plan,https://doi.org/10.5281/zenodo.6411382,publication,open,zenodo,[6411381],[argos],zenodo,
4,Addressing the Challenges of Health Data Stand...,15358180,10.5281/zenodo.15358180,"[Marfoglia, Alberto, Arcobelli, Valerio Antoni...","[0009-0000-5857-2376, 0000-0002-1262-9899, 000...",2025-05-07,<p>This table presents the data extraction fro...,Dataset,https://doi.org/10.5281/zenodo.15358180,dataset,open,zenodo,[15358179],[],zenodo,


## Output

In [191]:
mashup.to_csv('../mashup_dataset/mashup.csv', index=False)