# Mashup data

## Read and Filter Zenodo 

**Zenodo output data labels**

**title**: the title of the record\
**id**: the identifier assigned to the record in Zenodo\
**doi**: the complete DOI of the record in Zenodo\
**creators**: a list of the creators of the record\
**orcid**: a list of the ORCID iDs of the creators of the record\
**date**: the publication date of the record\
**description**: the description in the metadata of the record\
**resource_type**: the type of the record, taken from the title of the resource type in the metadata\
**type**: the broader type of the record, taken from the type of the resource type in the metadata\
**rights**: the access rights of the record\
**publisher**: the journal title, meeting title, or book publisher when mentioned, or the university for a thesis; otherwise it is `zenodo`\
**relation**: the PID of the parent record in the version relation\
**communities**: the community ids mentioned in the metadata\
**keywords**: a list of keywords of the record\
**src_repo**: `zenodo`\
**swh_id**: the Software Heritage id

In [52]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Read Zenodo dataset
zen_ds = pd.read_json(path_or_buf='../datasets/ZenodoData.json')

In [53]:
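# Safe getter: walks nested dictionaries and falls back to a default value
def safe_nested_get(d, *keys, default='zenodo'):
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    # Only strings can be stripped; return any other value unchanged
    return d.strip() if isinstance(d, str) else d


def remove_html_tags_bs(text):
    # Strip HTML tags from strings; return other values (e.g. NaN) unchanged
    if isinstance(text, str):
        return BeautifulSoup(text, "html.parser").get_text()
    return text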
    
def extract_metadata_info(metadata):
    if not isinstance(metadata, dict): # return empty values when the argument is not a dictionary
        # Keys and their order must match the Series returned at the end of the function
        return pd.Series({
            'name' : np.nan,
            'orcid': np.nan,
            'date': np.nan,
            'description' : np.nan,
            'resource_type' : np.nan,
            'type' : np.nan,
            'access_right' : np.nan,
            'publisher' : 'zenodo',
            'relation' : np.nan,
            'communities' : np.nan,
            'keywords' : np.nan
        })
    
    # Create a list of creators
    creators = [str(creator.get('name', '')).strip() for creator in metadata.get('creators', [])]
    # Create a list of creators' orcid numbers
    orcids = [safe_nested_get(creator, 'orcid') for creator in metadata.get('creators', [])]
    pub_date = metadata.get('publication_date', np.nan)
    # Retrieve the description for potential use cases and remove HTML tags from the text
    description = remove_html_tags_bs(metadata.get('description', np.nan))

    resource = metadata.get('resource_type') or {} # Tolerate a missing resource type
    res_type = str(resource.get('title', '')).strip().lower() # Get the type by title of the type
    broad_type = str(resource.get('type', '')).strip().lower() # Get the broader type
    rights = str(metadata.get('access_right', '')).strip()
    publisher = 'zenodo' # Put the default value to zenodo
    communities = [str(community.get('id', '')).strip() for community in metadata.get('communities', [])]
    keywords = metadata.get('keywords', np.nan)

    relation = []
    # Extract pid_value of relations in metadata
    version_list = metadata.get('relations', {}).get('version', [])
    for version in version_list:
        parent = version.get('parent', {})
        pid_val = parent.get('pid_value')
        if pid_val:
            relation.append(pid_val)
            
    # res_type is lower-cased above, so compare against lower-case labels
    if res_type in ('journal article', 'peer review'):
        publisher = safe_nested_get(metadata, 'journal', 'title')

    elif res_type in ('conference paper', 'presentation'): # Get the meeting title when the resource type
        publisher = safe_nested_get(metadata, 'meeting', 'title')  # is a conference paper or presentation

    elif res_type in ('book chapter', 'book'):
        publisher = safe_nested_get(metadata, 'imprint', 'publisher')
        if pd.isna(publisher) or publisher == 'zenodo': # No imprint publisher found
            publisher = safe_nested_get(metadata, 'thesis', 'place')

    elif res_type == 'thesis' or 'thesis' in metadata: # In case the type is not thesis but the publisher's
                                                       # information is in the 'thesis' key
        publisher = safe_nested_get(metadata, 'thesis', 'university')
           
    return pd.Series({
        'name' : creators,
        'orcid': orcids,
        'date': pub_date,
        'description' : description,
        'resource_type' : res_type,
        'type' : broad_type,
        'access_right' : rights,
        'publisher' : publisher,
        'relation' : relation,
        'communities' : communities,
        'keywords' : keywords
    })
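
A function that returns a `pd.Series` expands into one column per key when it is used with `Series.apply`, which is why both return branches of `extract_metadata_info` need the same keys in the same order. The following is a minimal sketch of that mechanism; `split_record` and the toy metadata are hypothetical and much smaller than the real function and data.

```python
import pandas as pd

def split_record(meta):
    # Hypothetical, much-reduced stand-in for extract_metadata_info
    if not isinstance(meta, dict):
        # Fallback branch: same keys, same order as the main branch
        return pd.Series({"name": [], "date": None})
    names = [c.get("name", "") for c in meta.get("creators", [])]
    return pd.Series({"name": names, "date": meta.get("publication_date")})

# Toy records; the values are invented for illustration
toy = pd.DataFrame({"metadata": [
    {"creators": [{"name": "Rossi, Anna"}], "publication_date": "2023-05-22"},
    None,  # a record whose metadata is missing
]})

# Each returned Series becomes one row; its keys become the columns
expanded = toy["metadata"].apply(split_record)
print(expanded.columns.tolist())
```

If the two branches returned different keys, the expanded frame would contain the union of both key sets, and assigning it to a fixed list of target columns would no longer line up.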


In [4]:
# Apply the function
zen_ds[['creators', 'orcid', 'date', 'description', 'resource_type', 'type', 'rights', 'publisher',
        'relation', 'communities', 'keywords']] = zen_ds['metadata'].apply(extract_metadata_info)

zen_ds['swh_id'] = zen_ds['swh'].apply(lambda x: x.get('swhid') if isinstance(x, dict) else None)
zen_ds['swh_id'] = zen_ds['swh_id'].str.extract(r':([^:]+);path.*$')

column_lst = ['title', 'id', 'doi', 'creators', 'orcid', 'date', 'description', 'resource_type', 'doi_url', 'type', 'rights', 
              'publisher', 'relation', 'communities', 'swh_id', 'keywords']

norm_zen_ds = zen_ds[column_lst].copy() # Copy the needed columns so later assignments do not touch a slice of zen_ds
norm_zen_ds['src_repo'] = 'zenodo' # Add the flag column for source repository "zenodo"
norm_zen_ds = norm_zen_ds.rename(columns={'doi_url':'url'})
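
The regular expression used for `swh_id` keeps only the hash part of the Software Heritage identifier. Here is a small sketch of its effect, assuming identifiers of the form `swh:1:dir:<hash>;path=...`; the sample value is invented for illustration.

```python
import pandas as pd

# Hypothetical SWHID values; the hash and path are invented for illustration
sample = pd.Series([
    "swh:1:dir:0123456789abcdef0123456789abcdef01234567;path=/",
    None,  # a record without a Software Heritage entry
])

# Same pattern as in the notebook: capture the colon-free text that sits
# directly before the ';path' qualifier
swh_hash = sample.str.extract(r':([^:]+);path.*$', expand=False)
print(swh_hash.tolist())
```

Identifiers without a `;path` qualifier do not match the pattern and end up as missing values.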


There are 276 duplicated rows in the Zenodo dataset (same id and doi).
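
A count like this can be obtained with `DataFrame.duplicated`. The sketch below uses toy data with invented ids and DOIs, not the Zenodo records themselves.

```python
import pandas as pd

# Toy records; ids and DOIs are invented for illustration
toy = pd.DataFrame({
    "id":  [1, 1, 2, 3, 3],
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/c", "10.1/c"],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = toy.duplicated(subset=["id", "doi"]).sum()
# drop_duplicates() keeps the first row of each id
deduped = toy.drop_duplicates(subset=["id"])
print(n_dupes, len(deduped))
```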

In [58]:
zen_ds2 = norm_zen_ds.drop_duplicates(subset=['id'])

## Read and Filter AMS Acta

**AMS Acta output data labels**

**title**\
**doi**\
**creators**\
**monograph_type:** This field has values only when the type = monograph\
**type**\
**date**\
**uri**\
**publisher**\
**eprintid**\
**abstract**\
**issn**

In [8]:
ams_ds = pd.read_json(path_or_buf='../datasets/amsacta_filtered_affiliation_or_orcid_doubles.json')

In [30]:
ams_ds.columns

Index(['refereed', 'dir', 'uri', 'creators', 'monograph_type', 'publisher',
       'title', 'date', 'conditions_berlin', 'projectidvalid',
       'item_issues_count', 'allow_print', 'userid', 'ispublished',
       'allow_redistribution', 'doi', 'eprintid', 'pages',
       'metadata_visibility', 'date_type', 'status_changed', 'datestamp',
       'rev_number', 'eprint_status', 'allow_save', 'keywords', 'lastmod',
       'type', 'place_of_pub', 'subjects', 'documents', 'series',
       'full_text_status', 'abstract', 'structures', 'event_location',
       'pres_type', 'event_type', 'event_dates', 'event_title', 'publication',
       'succeeds', 'projecttype', 'isbn', 'official_url', 'pagerange',
       'series_number', 'issn', 'number', 'volume', 'jurisdiction',
       'fundingprogramme', 'projectacronym', 'projectname', 'projectid',
       'book_curators', 'book_title', 'funder', 'referencetext', 'relatedid',
       'series_curators', 'curators', 'editors', 'contributors', 'id_number',
 

In [38]:
ams_col = ['title', 'doi', 'creators', 'monograph_type', 'type', 'date', 'uri', 'publisher', 'eprintid', 
          'abstract', 'issn', 'keywords']
ams_ds_filt = ams_ds[ams_col].copy() # Work on a copy so the assignments below do not touch a slice of ams_ds
ams_ds_filt['date'] = pd.to_datetime(ams_ds_filt['date']).dt.date # Normalize datetime


In [39]:
def name_getter(creators_raw):
    # Treat anything that is not a list (e.g. NaN) as "no creators"
    if not isinstance(creators_raw, list):
        creators_raw = []

    creators, orcids = [], []

    for person in creators_raw:
        if not isinstance(person, dict):
            continue
        name = person.get("name")
        if not isinstance(name, dict):
            name = {}
        family = (name.get("family") or "").strip()
        given = (name.get("given") or "").strip()

        parts = [p for p in (family, given) if p]      # drop None/empty
        if parts:
            creators.append(", ".join(parts[::-1]))    # reversed order: "Given, Family"

        orcids.append(person.get("orcid", np.nan))

    return pd.Series({"creators": creators or np.nan,
                      "orcid":    orcids    or np.nan})


In [40]:
ams_ds_filt[["creators", "orcid"]] = ams_ds_filt["creators"].apply(name_getter) # Apply the function



### AMSACTA column renames
* resource_type is monograph_type in the original dataset
* url is uri in the original dataset
* id is eprintid in the original dataset
* description is abstract in the original dataset

In [59]:
ams_ds_filt = ams_ds_filt.rename(columns={'monograph_type':'resource_type','uri':'url', 'eprintid':'id', 'abstract':'description'})
ams_ds_filt['src_repo'] = 'amsacta'


There are some duplicated records with different ids. We keep them for further investigations.

In [42]:
ams_ds_filt.head()

| | title | doi | creators | resource_type | type | date | url | publisher | id | description | issn | keywords | orcid | src_repo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Introduzione alla Fisica del Terreno | 10.6092/unibo/amsacta/2616 | [Giuliano, Vitali] | manual | monograph | 2009-09-30 | https://amsacta.unibo.it/id/eprint/2616 | Asterisco | 2616 | | | terreno, suolo, idrologia | [0000-0002-7866-5534] | amsacta |
| 1 | A Female Musician or Dancer of Iron Age in Sou... | 10.6092/unibo/amsacta/2953 | [Angela, Bellia] | | preprint | 1970-01-01 | https://amsacta.unibo.it/id/eprint/2953 | | 2953 | The excavations conducted by Paola Zancani Mon... | | ladden sistrum;cymbals; raschiatoio; musical o... | [0000-0002-1517-6012] | amsacta |
| 2 | Gli strumenti musicali nelle immagini della Gr... | 10.6092/unibo/amsacta/2955 | [Angela, Bellia] | | preprint | 1970-01-01 | https://amsacta.unibo.it/id/eprint/2955 | | 2955 | Questo percorso didattico è dedicato agli stru... | | didattica museale musicale; museo virtuale; | [0000-0002-1517-6012] | amsacta |
| 3 | Mito, musica e rito nelle raffigurazioni music... | 10.6092/unibo/amsacta/2957 | [Angela, Bellia] | | conference_item | 2008-07-01 | https://amsacta.unibo.it/id/eprint/2957 | | 2957 | I pinakes locresi sono tavolette votive in ter... | | pinakes; Locri; lyra; aulos; rito; tartaruga; ... | [0000-0002-1517-6012] | amsacta |
| 4 | Le raffigurazioni musicali nella coroplastica ... | 10.6092/unibo/amsacta/2958 | [Angela, Bellia] | | conference_item | 1970-01-01 | https://amsacta.unibo.it/id/eprint/2958 | | 2958 | Nell’ambito degli studi archeologici, le ricer... | | aulos; tympanon; arpa; triadi; pinakes; pinax;... | [0000-0002-1517-6012] | amsacta |


## Read Software Heritage data
Software Heritage dataset consists only of these keys:\
**url**\
**creators**\
**dir_id**\
We extracted the title from the last segment of the url, so it is only an approximation of the real title.
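
To illustrate why a title derived from the url is only approximate, here is a minimal sketch with the standard `re` module rather than pandas. The helper `title_from_url` and both urls are hypothetical; the splitting pattern mirrors the one used in the cells of this section.

```python
import re

# Split at lower/digit → upper, at letter → digit, and at '-', '_' or '.'
PATTERN = re.compile(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Za-z])(?=\d)|[-_.]')

def title_from_url(url):
    # Hypothetical helper: take the last path segment and split it into words
    tail = url.rstrip('/').rsplit('/', 1)[-1]
    spaced = PATTERN.sub(' ', tail)
    return re.sub(r'\s+', ' ', spaced).strip()

# Invented repository urls
print(title_from_url("https://example.org/user/MyCoolRepo2.git"))
print(title_from_url("https://example.org/user/data-tools_v2"))
```

The first url yields `My Cool Repo 2 git` and the second `data tools v 2`: suffixes such as `.git` and version tags survive as separate words, which is the imprecision mentioned above.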

In [60]:
swh_ds = pd.read_json(path_or_buf='../datasets/unibo_repositories_swh.json')

In [61]:
def author_getter(creators_raw):
    # creators_raw is already the list we want
    if not isinstance(creators_raw, list):
        creators_raw = []

    creators = []

    for person in creators_raw:
        if not isinstance(person, dict):
            continue
        name = person.get("name")

        if name and name != 'GitHub':   # skip missing names and the generic "GitHub" entry
            creators.append(name)    # get name

    return pd.Series({"creators": creators or np.nan})


In [62]:
swh_ds[["creators"]] = swh_ds["authors"].apply(author_getter) # Apply the function
swh_ds = swh_ds[['url', 'creators', 'dir_id']].copy() # Copy so the new columns are not set on a slice
swh_ds['type'] = 'software'
swh_ds['src_repo'] = 'software heritage'
# Get the title from the url tail
swh_ds['title'] = swh_ds['url'].str.extract(r'.*/(.*?)$')
PATTERN = (
    r'(?<=[a-z0-9])(?=[A-Z])'      # lower-case letter or digit → upper-case letter
    r'|(?<=[A-Za-z])(?=\d)'        # letter → digit
    r'|[-_.]'                      # hyphen, underscore, or dot
)

swh_ds['title'] = (
    swh_ds['title']
      .replace(PATTERN, ' ', regex=True)   # insert / swap for space
      .replace(r'\s+', ' ', regex=True)    # collapse doubles
)
swh_ds.rename(columns={'dir_id':'swh_id'}, inplace=True)
swh_ds.head()

| | url | creators | swh_id | type | src_repo | title |
|---|---|---|---|---|---|---|
| 0 | https://dei-gitlab.dei.unibo.it/mengozzi/thesi... | [Mattia Mengozzi] | 9f0aedc03bbcf0bb1f449684b52dd7b8aa3f8f92 | software | software heritage | thesis git |
| 1 | https://github.com/CVLAB-Unibo/Learning2AdaptF... | [Alessio Tonioni, Alessio Tonioni, atonioni] | 60cb80d0fb74814ea3f238ee419eff99451d989b | software | software heritage | Learning 2 Adapt For Stereo |
| 2 | https://github.com/unibo-bigdata/101-hadoop-hd... | [Enrico Gallinucci, unknown] | e686a076dfa12e1496bbfc61617caa7fc19818a2 | software | software heritage | 101 hadoop hdfs Riccardo Salvatori |
| 3 | https://github.com/rrnextUsername/it.unibo.esa... | [Mattia Piretti] | 72ce2acb790ed9f8ecd0eada9bb9a0c5522f6785 | software | software heritage | it unibo esame sprint 7 refactoring |
| 4 | https://bitbucket.org/shapournemati_unibo/cart... | [Shapour Nemati, shapournemati_unibo] | 8dfddb6a7358ee960815f4a6855664ca7df47532 | software | software heritage | cartag android git |


## Iris dump

#### First file: ids and creators

In [63]:
iris1 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_CON_PERSON.csv', dtype={'ITEM_ID':'str'})
iris1['creators'] = iris1['LAST_NAME']+', '+iris1['FIRST_NAME'] # Build the name column from last name and first name

# Group by record and create lists of creators and orcid numbers for each record
iris1_agg = iris1.groupby(["ITEM_ID"]).agg({
    'creators' : lambda x: list(x.unique()),
    'ORCID' : lambda x: list(x.unique()),
}).reset_index()
iris1_agg = iris1_agg.sort_values(['ITEM_ID']).astype('str') # Note: astype('str') also turns the lists into their string form

#### Second file: records that are not present in the first file

In [64]:
iris2 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_DESCRIPTION.csv', dtype={'ITEM_ID':'str'})
iris2 = iris2[['ITEM_ID', 'DES_ALLPEOPLE']]

rows_only_in_iris2 = iris2[~iris2['ITEM_ID'].isin(iris1_agg['ITEM_ID'])].copy() # Select the records that are not in the first file
rows_only_in_iris2['creators'] = rows_only_in_iris2['DES_ALLPEOPLE'].str.split(';') # Make a list of creators from the DES_ALLPEOPLE column to add to the main dataframe
iris = pd.concat([iris1_agg, rows_only_in_iris2]).astype('str') # Concatenate the non-overlapping records
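
The selection of records that appear only in the second file can be sketched on toy data as follows. All names and ids are invented, and the extra `strip()` on each name is an addition for the example, not part of the notebook code.

```python
import pandas as pd

# Toy stand-ins for the two Iris files; all values are invented
first = pd.DataFrame({"ITEM_ID": ["1", "2"],
                      "creators": [["ROSSI, ANNA"], ["BIANCHI, LUCA"]]})
second = pd.DataFrame({"ITEM_ID": ["2", "3"],
                       "DES_ALLPEOPLE": ["Bianchi, Luca",
                                         "Verdi, Maria; Neri, Paolo"]})

# Keep only the records of the second file that the first one lacks
only_in_second = second[~second["ITEM_ID"].isin(first["ITEM_ID"])].copy()

# Split the semicolon-separated names and trim the surrounding spaces
only_in_second["creators"] = (
    only_in_second["DES_ALLPEOPLE"].str.split(";")
    .apply(lambda names: [n.strip() for n in names])
)

combined = pd.concat([first, only_in_second], ignore_index=True)
print(combined[["ITEM_ID", "creators"]])
```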

#### Third file: identifiers (doi, pmid)

In [65]:
iris3 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_IDENTIFIER.csv', dtype='str')
iris3.dropna(subset='ITEM_ID', inplace=True)
iris3_id = iris3[['ITEM_ID', 'IDE_DOI', 'IDE_URL', 'IDE_PMID']]
iris = pd.merge(left=iris, right=iris3_id, on='ITEM_ID', how='left') # Add doi, url and pmid where they exist

#### Fourth file: titles and dates (year)

In [66]:
iris4 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_MASTER_ALL.csv', dtype='str')
iris4 = iris4[['ITEM_ID','DATE_ISSUED_YEAR', 'TITLE', 'OWNING_COLLECTION_DES']]
iris = pd.merge(left=iris, right=iris4, on='ITEM_ID', how='outer') # merge the fourth iris data set (titles and date)

#### Fifth file: publishers

In [67]:
iris5 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_PUBLISHER.csv', dtype='str')
iris5 = iris5[['ITEM_ID', 'PUB_NAME']]
iris = pd.merge(left=iris, right=iris5, on='ITEM_ID', how='outer')

#### Sixth file: relations, from which the issn is taken

In [68]:
iris6 = pd.read_csv('../datasets/POSTPROCESS-iris-data-2025-05-27/ODS_L1_IR_ITEM_RELATION.csv', dtype='str')
iris6['REL_ISSN_IN_ERIH_PLUS'] = iris6['REL_ISSN_IN_ERIH_PLUS'].replace({'0':'','1':''})
iris6['issn'] = iris6['REL_ISSN'] + iris6['REL_ISSN_IN_ERIH_PLUS']
iris6 = iris6[['ITEM_ID','issn']]
iris = pd.merge(left=iris, right=iris6, on='ITEM_ID', how='left')
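
A merge on `ITEM_ID` keeps every row of the left table, but it also repeats a record whenever the right table holds more than one row for the same id. This is a sketch on invented data of why a later `drop_duplicates(subset='id', keep='first')` is needed to get back to one row per record.

```python
import pandas as pd

# Toy data; item 1 has two identifier rows, item 2 has none
items = pd.DataFrame({"ITEM_ID": ["1", "2"], "title": ["A", "B"]})
identifiers = pd.DataFrame({"ITEM_ID": ["1", "1"],
                            "IDE_DOI": ["10.1/x", "10.1/y"]})

# The left merge keeps item 2 (with a missing DOI) and repeats item 1
merged = pd.merge(items, identifiers, on="ITEM_ID", how="left")
print(len(items), len(merged))

# Keeping the first row per id restores one row per record,
# at the cost of dropping the second identifier
one_per_item = merged.drop_duplicates(subset="ITEM_ID", keep="first")
print(one_per_item)
```

Which identifier survives depends on the row order of the right table, so it may be worth checking the discarded rows before relying on the result.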

In [69]:
iris.drop(['DES_ALLPEOPLE'], axis=1, inplace=True)

iris = iris.rename(columns={'ITEM_ID':'id','ORCID':'orcid','IDE_DOI':'doi', 'IDE_URL':'url', 'OWNING_COLLECTION_DES':'type','DATE_ISSUED_YEAR':'date', 'TITLE':'title', 
                            'IDE_PMID':'pmid','PUB_NAME':'publisher'})
iris['src_repo'] = 'iris'
iris = iris.drop_duplicates(subset='id', keep='first')
iris.head()

| | id | creators | orcid | doi | url | pmid | date | title | type | publisher | issn | src_repo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ['CHIUSOLI, ALESSANDRO'] | [nan] | | | | 2004 | Il verde nelle aree urbane | 2.01 Capitolo / saggio in libro | EDAGRICOLE-EDIZIONI AGRICOLE DE UK IL SOLE 24 ORE | | iris |
| 1 | 10 | ['POGGI, VALENTINA'] | [nan] | | | | 2005 | SAMUEL RICHARDSON. LA VITA. PROFILO STORICO CR... | 2.01 Capitolo / saggio in libro | GARZANTI | | iris |
| 2 | 100 | ['SEBASTIANI, ALBERTO'] | ['0000-0001-8197-2888'] | | | | 2005 | Io mangio la mela? Io mangio la mela! | 1.01 Articolo in rivista | | 0012-3382 | iris |
| 3 | 1000 | ['QUARANTA, MARILISA', 'OTTANI, VITTORIA'] | [nan] | | | | 2005 | Metalloprotesis activation in spontaneously br... | 4.02 Riassunto (Abstract) | | 0176-8638 | iris |
| 4 | 10000 | ['OMICINI, ANDREA'] | ['0000-0002-6655-3869'] | | http://ceur-ws.org/Vol-1382/paper11.pdf | | 2015 | Coordination of Large-Scale Socio-Technical Sy... | 4.01 Contributo in Atti di convegno | Sun SITE Central Europe, RWTH Aachen University | 1613-0073 | iris |


## Mashup

In [71]:
mashup = pd.concat([norm_zen_ds, ams_ds_filt, swh_ds, iris], join='outer', ignore_index=True)
mashup.head()

| | title | id | doi | creators | orcid | date | description | resource_type | url | type | rights | publisher | relation | communities | swh_id | keywords | src_repo | issn | pmid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Il Progetto ACCESs: esperienze di accessibilit... | 7956878 | 10.5281/zenodo.7956878 | [Zanchi, Anna] | [zenodo] | 2023-05-22 | Tesi di laurea magistrale del corso di Arti Vi... | thesis | https://doi.org/10.5281/zenodo.7956878 | publication | open | Alma Mater Studiorum Università di Bologna | [7956877] | [] | | [art, cultural heritage, accessibility, deaf p... | zenodo | | |
| 1 | La Chouffe DMP New | 6411449 | 10.5281/zenodo.6411449 | [Chiara Catizone, Giulia Venditti, Davide Brem... | [0000-0003-2445-2426, 0000-0001-7696-7574, 000... | 2022-04-04 | This DMP has been created fo managing data rep... | output management plan | https://doi.org/10.5281/zenodo.6411449 | publication | open | zenodo | [6411448] | [argos] | | | zenodo | | |
| 2 | Footactile rhythmics: protocols and data colle... | 5504259 | 10.5281/zenodo.5504259 | [Dall'Osso, Giorgio] | [0000-0002-4219-7513] | 2021-09-13 | The data shared refer to research investigatin... | dataset | https://doi.org/10.5281/zenodo.5504259 | dataset | open | Alma Mater Studiorum - Università di Bologna | [5504258] | [] | | [haptic, protocol, design, advanced design, be... | zenodo | | |
| 3 | La Chouffe DMP | 6411382 | 10.5281/zenodo.6411382 | [Chiara Catizone] | [zenodo] | 2022-04-04 | This DMP has been created fo managing data rep... | output management plan | https://doi.org/10.5281/zenodo.6411382 | publication | open | zenodo | [6411381] | [argos] | | | zenodo | | |
| 4 | Addressing the Challenges of Health Data Stand... | 15358180 | 10.5281/zenodo.15358180 | [Marfoglia, Alberto, Arcobelli, Valerio Antoni... | [0009-0000-5857-2376, 0000-0002-1262-9899, 000... | 2025-05-07 | This table presents the data extraction from t... | dataset | https://doi.org/10.5281/zenodo.15358180 | dataset | open | zenodo | [15358179] | [] | | [Health Data Standard, FHIR, OMOP-CDM, openEHR] | zenodo | | |


## Output

In [72]:
mashup.to_csv('../mashup_dataset/mashup_v4.csv', index=False)
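
Columns that hold Python lists (such as `creators`) are written to CSV as their string form, so they come back as plain strings when the file is read again. Below is a minimal sketch of restoring them with `ast.literal_eval`, using an in-memory buffer and an invented record instead of the real file.

```python
import ast
import io

import pandas as pd

# Invented record with a list-valued column
frame = pd.DataFrame({"title": ["Toy record"],
                      "creators": [["Rossi, Anna", "Bianchi, Luca"]]})

# Round trip through CSV, using a buffer in place of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# At this point the column holds strings, not lists
was_string = isinstance(loaded.loc[0, "creators"], str)

# literal_eval turns each string back into a Python list
loaded["creators"] = loaded["creators"].apply(ast.literal_eval)
print(was_string, loaded.loc[0, "creators"])
```

Values such as `[nan]` are not valid Python literals, so `ast.literal_eval` raises an error on them; such cells would need to be cleaned or skipped first.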