## Populate an RDF database

This notebook walks through the main steps to download CSV files, process them, and build an RDF dataset from them according to an ontology.

To measure execution time in Jupyter notebooks: <code>pip install ipython-autotime</code>

In [167]:
# required libraries
import pandas as pd
import os
import ast
import unicodedata
import hashlib
import re
from pathlib import Path
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [168]:
# parameters and URLs
path = str(Path(os.path.abspath(os.getcwd())).parent.parent.absolute())
print(path)
grammyUrl = path + '/csv/the_grammy_awards_mapped_uppercase.csv'
print(grammyUrl)
albumsUrl = path + '/csv/musicoset_metadata/albums.csv'
print(albumsUrl)
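# NOTE: songsUrl and artistsUrl both point to albums.csv, which looks like a copy-paste slip;
# neither variable is used in the rest of this notebook
songsUrl = path + '/csv/musicoset_metadata/albums.csv'
print(songsUrl)
artistsUrl = path + '/csv/musicoset_metadata/albums.csv'
print(artistsUrl)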
tracksUrl = path + '/csv/musicoset_metadata/tracks.csv'
print(tracksUrl)
songInChartUrl = path + '/csv/musicoset_popularity/song_chart.csv'
print(songInChartUrl)

# saving folder
savePath =  path + '/PopulateRDFdb/PopulateGrammyCategories/'

c:\Users\fgall\Desktop\MELODY
c:\Users\fgall\Desktop\MELODY/csv/the_grammy_awards_mapped_uppercase.csv
c:\Users\fgall\Desktop\MELODY/csv/musicoset_metadata/albums.csv
c:\Users\fgall\Desktop\MELODY/csv/musicoset_metadata/albums.csv
c:\Users\fgall\Desktop\MELODY/csv/musicoset_metadata/albums.csv
c:\Users\fgall\Desktop\MELODY/csv/musicoset_metadata/tracks.csv
c:\Users\fgall\Desktop\MELODY/csv/musicoset_popularity/song_chart.csv
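
The paths above are built by string concatenation, which is why the printed values mix <code>\</code> and <code>/</code> separators. A minimal sketch of the same layout using the <code>/</code> operator of <code>pathlib</code> follows. The helper name <code>dataset_paths</code> and the base directory are assumptions made for illustration; the relative file names are the ones used above.

```python
from pathlib import PurePosixPath

def dataset_paths(base):
    """Return the CSV locations used in this notebook, relative to `base`.

    `dataset_paths` is a hypothetical helper, not part of the notebook.
    """
    csv = PurePosixPath(base) / "csv"
    return {
        "grammy": csv / "the_grammy_awards_mapped_uppercase.csv",
        "albums": csv / "musicoset_metadata" / "albums.csv",
        "tracks": csv / "musicoset_metadata" / "tracks.csv",
        "song_chart": csv / "musicoset_popularity" / "song_chart.csv",
    }

paths = dataset_paths("/home/user/MELODY")  # assumed base directory
for name, location in paths.items():
    print(name, location)
```

<code>PurePosixPath</code> is used here only to keep the printed result identical on every machine; in the notebook itself <code>pathlib.Path</code> would pick the separator of the host system.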


## Grammy & candidates/winners (Songs, Artists, Albums)

In [169]:
# Load the CSV files into memory
# Disable pandas' default NaN markers: otherwise the literal string 'NA' (e.g. the country code of Namibia)
# would be parsed as NaN. Only '_' is treated as a missing value.
grammy = pd.read_csv(grammyUrl, sep=',', keep_default_na=False, na_values=['_'])
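
To see what these two options change, here is a minimal sketch on a tiny in-memory CSV. The sample rows are invented for illustration; only the <code>keep_default_na</code> and <code>na_values</code> arguments come from the cell above.

```python
import io
import pandas as pd

# Tiny in-memory CSV, invented for illustration
csv_text = "code,name\nNA,Namibia\n_,Unknown\nIT,Italy\n"

# Default behaviour: the string 'NA' is one of pandas' built-in NaN markers
default = pd.read_csv(io.StringIO(csv_text))

# Notebook behaviour: built-in markers disabled, only '_' means "missing"
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=['_'])

print(default['code'].isna().tolist())
print(strict['code'].isna().tolist())
print(strict['code'].iloc[0])
```

With the defaults the first row is read as missing, while with the notebook's options <code>'NA'</code> survives as a string and only the <code>'_'</code> row becomes NaN.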

In [None]:
album = pd.read_csv(albumsUrl, sep='\t', index_col='album_id', keep_default_na=False, na_values=['_'])
album.info()
# List that stores the (Grammy ID, album, isWinner) triples.
# It therefore holds each specific Grammy and the corresponding winning/nominated album
matched_pairs_grammy_album = []

<class 'pandas.core.frame.DataFrame'>
Index: 26519 entries, 5n1GSzC1Reao29ScnpLYqp to 6wf7Rh10EoTaqZMdN2xRlI
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          26519 non-null  object
 1   billboard     26519 non-null  object
 2   artists       26519 non-null  object
 3   popularity    26519 non-null  int64 
 4   total_tracks  26519 non-null  int64 
 5   album_type    26519 non-null  object
 6   image_url     26519 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.6+ MB


In [171]:
grammy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6323 entries, 0 to 6322
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   year      6323 non-null   int64 
 1   category  6323 non-null   object
 2   nominee   6323 non-null   object
 3   workers   6323 non-null   object
 4   winner    6323 non-null   bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 203.9+ KB


We need to install <code>RDFLib</code>

<code>pip3 install rdflib </code> [Documentation](https://rdflib.readthedocs.io/en/stable/gettingstarted.html)

In [172]:
# Load the required libraries
from rdflib import Graph, Literal, RDF, URIRef, Namespace
# rdflib knows about some namespaces, like FOAF
from rdflib.namespace import FOAF, XSD, SKOS, RDFS


In [174]:
# Construct the MELODY ontology namespace, which is not known by RDFLib
ME = Namespace("http://www.dei.unipd.it/~gdb/ontology/melody#")

#create the graph
g = Graph()

# Bind the namespaces to a prefix for more readable output
g.bind("xsd", XSD)
g.bind("mel", ME)
g.bind("skos", SKOS)
g.bind("rdfs", RDFS)

In [None]:
# Unique ID in the format YYYY_CAT_HASH
def create_grammy_id(year, category, title, artist, is_winner):
    
    year = str(year)
    category = str(category) if pd.notna(category) else ''
    title = str(title) if pd.notna(title) else ''
    artist = str(artist) if pd.notna(artist) else ''
    is_winner = str(is_winner) if pd.notna(is_winner) else ''

    # Clean and normalize the data
    def clean_text(text):
        # Remove special characters and convert to lowercase
        return re.sub(r'[^\w\s-]', '', text).lower().strip()
    
    # Build a single string concatenating all the fields
    full_string = f"{year}_{clean_text(category)}_{clean_text(title)}_{clean_text(artist)}_{is_winner}"
    
    # Generate a truncated SHA-256 hash
    hash_object = hashlib.sha256(full_string.encode())
    short_hash = hash_object.hexdigest()[:8]
    
    # Build the final ID
    category_abbr = ''.join(word[0] for word in clean_text(category).split()[:3])
    final_id = f"{year}_{category_abbr}_{short_hash}"
    
    return final_id


def normalize_uri(name):
    # Strip accents and special characters
    name = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore').decode('ASCII')
    name = name.replace(" ", "-")
    name = name.replace(",", "").replace("'", "")
    name = re.sub(r'&(\w)', lambda match: "&" + match.group(1).upper(), name)
    name = name.replace("&", "n")
    name = re.sub(r'/(\w)', lambda match: "/" + match.group(1).upper(), name)
    name = name.replace("/", "")

    return name

# Function to extract the artist names
def extract_artist_name_fast(artists_column):
    return artists_column.str.extract(r"'[^']*': '([^']*)'")[0]  # Extract the name directly

# Convert a string to a bool
def string_to_bool(s):
    if isinstance(s, bool):
        return s
    return s.lower() == "true"
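
The two main helpers can be tried in isolation. The sketch below repeats <code>normalize_uri</code> as defined above and adds <code>short_id</code>, a hypothetical condensed variant of <code>create_grammy_id</code> written only for this example; all sample inputs are invented.

```python
import hashlib
import re
import unicodedata

def clean_text(text):
    # Remove special characters and convert to lowercase
    return re.sub(r'[^\w\s-]', '', str(text)).lower().strip()

def short_id(year, category, *parts):
    """Hypothetical condensed variant of create_grammy_id (YYYY_CAT_HASH)."""
    payload = "_".join([str(year), clean_text(category), *map(clean_text, parts)])
    digest = hashlib.sha256(payload.encode()).hexdigest()[:8]
    abbr = ''.join(word[0] for word in clean_text(category).split()[:3])
    return f"{year}_{abbr}_{digest}"

def normalize_uri(name):
    # Same steps as the notebook's normalize_uri
    name = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore').decode('ASCII')
    name = name.replace(" ", "-")
    name = name.replace(",", "").replace("'", "")
    name = re.sub(r'&(\w)', lambda match: "&" + match.group(1).upper(), name)
    name = name.replace("&", "n")
    name = re.sub(r'/(\w)', lambda match: "/" + match.group(1).upper(), name)
    name = name.replace("/", "")
    return name

print(short_id(1962, "Album Of The Year", "Some Title"))
print(normalize_uri("Rock & Roll"))
print(normalize_uri("Best Pop Duo/Group Performance"))
print(normalize_uri("Beyoncé"))
```

The ID is deterministic: the same inputs always give the same eight-character hash suffix, so re-running the notebook does not change the URIs. Truncating SHA-256 to eight hexadecimal characters leaves a small collision risk, which is acceptable for a dataset of this size but is worth keeping in mind.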

## Add the Grammy individuals and build the list of albums that won or were nominated for a Grammy

In [None]:
%%time 
#measure execution time


# Dictionary to track Grammy IDs for unique year-category combinations
grammy_id_lookup = {}

#iterate over the grammy dataframe
for index, row in grammy.iterrows():

    # Create a unique key for year and category
    year_category_key = (row['year'], normalize_uri(row['category']))

    # Check if this year-category combination has been seen before
    if year_category_key in grammy_id_lookup:
        # Use the previously created Grammy ID
        grammy_id = grammy_id_lookup[year_category_key]
    else:
        # Create a new Grammy ID
        grammy_id = create_grammy_id(
            row['year'],       
            row['category'],
            row['nominee'],
            row['workers'],
            row['winner']
        )
        
        # Store this Grammy ID for the year-category combination
        grammy_id_lookup[year_category_key] = grammy_id


    # Create the node to add to the Graph
    # the node has the namespace + the grammy_id as URI
    current_grammy = URIRef(ME[grammy_id])

    if False:
        print(row)
        print(grammy_id) # e.g. 1959_a_a98ae627
        print(ME.Grammy) # http://www.dei.unipd.it/~gdb/ontology/melody#Grammy
        print(current_grammy) # http://www.dei.unipd.it/~gdb/ontology/melody#1959_a_a98ae627

    # Add the Grammy, then link it to its category and to the year data property
    g.add((current_grammy, RDF.type, ME.Grammy))
    g.add((current_grammy, ME['hasCategory'], Literal(normalize_uri(row['category']))))
    g.add((current_grammy, ME['year'], Literal(row['year'], datatype=XSD.gYear)))

    # If the Grammy category name contains "album"
    if "album" in row['category'].lower():

        # Extract the artist names from album['artists']:
        # parse the string into a dictionary and take its first value.
        # The column is computed only once instead of at every iteration
        if 'artist_name' not in album.columns:
            album['artist_name'] = album['artists'].apply(lambda x: list(ast.literal_eval(x).values())[0])
         
        isWinner = string_to_bool(row['winner'])

        # Clean row['workers'] and obtain a list of names
        worker_names = re.sub(r'\([^)]*\)', '', row['workers'])  # Remove the text in parentheses
        worker_names = [name.strip() for name in worker_names.split(',')]  # Split on commas and trim whitespace

        # Keep the albums that match by title and by (at least one) artist.
        # matched_album thus contains only the albums that won or were nominated for a Grammy;
        # albums with the same title are disambiguated by comparing the artist names in grammy and album
        matched_album = album[
            (album['billboard'].str.lower() == row['nominee'].lower().rstrip('.,').strip()) & 
            (album['artist_name'].str.lower().isin([name.lower() for name in worker_names]))
        ]

        if not matched_album.empty:
            #print(matched_album)
            for album_row in matched_album.itertuples(index=True):
                print(grammy_id, album_row.Index, isWinner)
                matched_pairs_grammy_album.append((grammy_id, album_row, isWinner))


 




1962_a_1ec091a2 3M0RdVAdvPETjcEvPj7AI0 True
1962_b_95f0c0db 3M0RdVAdvPETjcEvPj7AI0 True
1965_b_c7f33116 381STOo2xuzaUPS3GaGiwi True
1966_b_a62bd200 0eBIK8q43cUJ4SiYCoDVNu True
1966_b_7684d930 4wXA2W3Ody5SO5wKp2Y4Rk True
1966_b_884d0753 27SMV8TOEzD5NvCI6dK7Xc True
1967_a_b42e3d67 6lmuhQkwg4qkjytuho7Oxz True
1967_b_5e6d2afb 41q01T7MnMnWy3a3Swa63g True
1968_a_be05f7c6 6QaVfG1pHYl1z15ZxkvVDW True
1968_b_e946962d 5Ju2gORF8m4VA9TZn006kv True
1968_b_f0dbae4f 6QaVfG1pHYl1z15ZxkvVDW True
1968_b_8b1065b9 7b4LQX82Gnt2kZ9I5MUZil True
1969_a_d4807848 29tTA46kurlOioRkjBqOMS True
1969_b_f1b3418a 4TJIdlY9hGSSTO1kUs1neh True
1969_b_fa27b73f 056tSaBR2WyN1nnmfIzkEE True
1969_b_95d79f96 4ALVyY1OJCf30LCBwhkzOd True
1970_b_4263c8ff 5WBx64FIN04CvM2T1MGrUN True
1970_b_d39d4a4d 0ETFjACtuP2ADo6LFhL6HN True
1970_b_b8b724fa 03iFLgmgkLT7X5gnXVPID5 True
1970_b_430227f0 0o2ZKR3DbPg23bt11WiWhS True
1971_a_25293063 0JwHz5SSvpYWuuCNbtYZoV True
1971_b_300869e8 0JwHz5SSvpYWuuCNbtYZoV True
1971_b_1716b439 1wVO8nHzgcim0IBz
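
The matching rule used in the loop above can be isolated as two small functions. This is a minimal sketch on invented sample values: <code>clean_worker_names</code> and <code>is_match</code> are hypothetical names, and unlike the loop the sketch also drops empty names left over after the split.

```python
import re

def clean_worker_names(workers):
    """Drop parenthesised roles, then split the remaining text on commas."""
    without_roles = re.sub(r'\([^)]*\)', '', workers)
    # Assumption: empty fragments are discarded (the notebook keeps them)
    return [name.strip() for name in without_roles.split(',') if name.strip()]

def is_match(nominee, workers, album_title, album_artist):
    """Case-insensitive title match plus at least one matching artist name."""
    same_title = album_title.lower() == nominee.lower().rstrip('.,').strip()
    names = {name.lower() for name in clean_worker_names(workers)}
    return same_title and album_artist.lower() in names

# Invented sample values
workers = "Jane Doe (Artist), John Roe (Producer)"
print(clean_worker_names(workers))
print(is_match("Some Album.", workers, "some album", "jane doe"))
print(is_match("Some Album.", workers, "some album", "Another Artist"))
```

The comparison is exact after lowercasing, so spelling variants of a title or an artist name are not matched. A fuzzy comparison would be one way to relax this, at the cost of possible false positives.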

In [177]:
%%time
# serialize the graph in the Turtle format and save it to disk
print("--- saving serialization ---")
with open(savePath + 'grammy.ttl', 'w', encoding='utf-8') as file:
    file.write(g.serialize(format='turtle'))


--- saving serialization ---
CPU times: total: 172 ms
Wall time: 307 ms


# Referential integrity
Note that RDF works under the open-world assumption, so we cannot guarantee referential integrity between entities: a triple may point to a resource that is never described anywhere in the dataset.
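
Since the triple store will not reject such references, a separate audit can look for them. The following is a minimal sketch that uses plain tuples in place of an RDF graph; the prefixed names, including the class <code>mel:Artist</code>, are assumptions made for illustration.

```python
# Triples as plain (subject, predicate, object) tuples, standing in for a graph
triples = [
    ("album:A1", "rdf:type", "mel:Album"),
    ("album:A1", "mel:hasMainArtist", "artist:X1"),
    ("artist:X1", "rdf:type", "mel:Artist"),
    ("album:A2", "rdf:type", "mel:Album"),
    ("album:A2", "mel:hasMainArtist", "artist:X2"),  # artist:X2 is never described
]

def dangling_objects(triples, predicate):
    """Objects of `predicate` that never occur as the subject of any triple."""
    subjects = {s for s, _, _ in triples}
    return sorted({o for _, p, o in triples if p == predicate and o not in subjects})

print(dangling_objects(triples, "mel:hasMainArtist"))
```

A dangling reference is not an error in RDF; the check only reports places where the dataset is incomplete with respect to what the application expects.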

## Album

Let us generate the RDF data for the albums, linking each album to its songs, its artists and its Grammys.

In [178]:
albums = pd.read_csv(albumsUrl, sep='\t', index_col='album_id', keep_default_na=False, na_values=['_'])
albums.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26519 entries, 5n1GSzC1Reao29ScnpLYqp to 6wf7Rh10EoTaqZMdN2xRlI
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          26519 non-null  object
 1   billboard     26519 non-null  object
 2   artists       26519 non-null  object
 3   popularity    26519 non-null  int64 
 4   total_tracks  26519 non-null  int64 
 5   album_type    26519 non-null  object
 6   image_url     26519 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.6+ MB


In [179]:
tracks = pd.read_csv(tracksUrl, sep='\t', index_col='album_id', keep_default_na=False, na_values=['_'])
tracks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20405 entries, 2fYhqwDWXjbpjaIJPEfKFw to 3pBArpt3QcnvVj58hl6Ghe
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   song_id                 20405 non-null  object
 1   track_number            20405 non-null  int64 
 2   release_date            20405 non-null  object
 3   release_date_precision  20405 non-null  object
dtypes: int64(1), object(3)
memory usage: 797.1+ KB


People are modeled with the FOAF ontology. 
Refer to [FOAF Documentation](http://xmlns.com/foaf/spec/)

In [180]:
songInChart = pd.read_csv(songInChartUrl, sep='\t', index_col='song_id', keep_default_na=False, na_values=['_'])
songInChart.info()
songInChart.index

<class 'pandas.core.frame.DataFrame'>
Index: 250392 entries, 3e9HZxeyfWwjeyPAMmWSSQ to 1TRvdHDqCIcTQpHTZbFttC
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   rank_score      250392 non-null  int64 
 1   peak_position   250392 non-null  int64 
 2   weeks_on_chart  250392 non-null  int64 
 3   week            250392 non-null  object
dtypes: int64(3), object(1)
memory usage: 9.6+ MB


Index(['3e9HZxeyfWwjeyPAMmWSSQ', '5p7ujcrUXASCNwRaWNHR1C',
       '2xLMifQCjDGFmkHkpNLD9h', '3KkXRkHbMCARz0aVfEt68P',
       '1rqqCSm0Qe4I9rUvWncaom', '0bYg9bo50gSsH3LtXe2SQn',
       '5hslUAKq9I9CG2bAulFkHN', '2EjXfH91m7f8HiJN1yQg97',
       '0MMSmg7zyo6pOKZrfHUOqu', '6Z924AupOiJLdnAKH6UgCu',
       ...
       '1959PESdlKUTrT3Tx8wXkq', '0eAvTDkEwjRMdD8gdMqB18',
       '2DUJXCCBYYdGO73SjzMH1I', '0gkxAsG2Lng71KBT31TZNd',
       '73m1WoUPFEtJKo0WihM5dW', '6vPS75nWOKkuH5WTLD8hDc',
       '2BCjoFA0nJlXBLRf3164bN', '7Gk1QKi2BAZCnrYlrYEDjC',
       '7rxev6ErpRE5VYaamjs4T3', '1TRvdHDqCIcTQpHTZbFttC'],
      dtype='object', name='song_id', length=250392)

In [181]:
#create a new graph
g = Graph()
ME = Namespace("http://www.dei.unipd.it/~gdb/ontology/melody#")
g.bind("xsd", XSD)
g.bind("mel", ME)
g.bind("skos", SKOS)
g.bind("rdfs", RDFS)

# Link songs to albums

In [None]:
%%time 
#measure execution time

#iterate over the album dataframe
for index, row in albums.iterrows():
    # Create the node to add to the Graph

    if False:
        print(row)
        print(index) # e.g. 5n1GSzC1Reao29ScnpLYqp
        print(ME.Album) # http://www.dei.unipd.it/~gdb/ontology/melody#Album
        print(current_album) # http://www.dei.unipd.it/~gdb/ontology/melody#5n1GSzC1Reao29ScnpLYqp

    # rows of tracks holding all the songs of the current album (row)
    album_tracks = tracks[tracks.index == index]
    #print(album_tracks['song_id'])
    #print(songInChart.index)
    #break
    
    if not album_tracks.empty:
            # Iterate through all tracks for this album
            for _, track in album_tracks.iterrows():
                # If the album's song is among the Top 100 chart songs
                # (exact label lookup instead of a substring scan of the whole index)
                match = track['song_id'] in songInChart.index
                if match:
                    #print(track['song_id'])
                    song_id = track['song_id']
                    current_song = URIRef(ME[song_id])
                    current_album = URIRef(ME[index])
                    g.add((current_album, RDF.type, ME.Album))
                    g.add((current_album, ME['name'], Literal(row['name'], datatype=XSD.string)))
                    g.add((current_album, ME['containsSong'], current_song))
                    if row['total_tracks'] is not None and isinstance(row['total_tracks'], int):
                        g.add((current_album, ME['totalTracks'], Literal(row['total_tracks'], datatype=XSD.positiveInteger)))
                    else:
                        g.add((current_album, ME['totalTracks'], Literal(0, datatype=XSD.positiveInteger)))
                    print(f"\nAlbum {index} contains song {song_id}\nSo we add Album {row['name']} ")

                    # Add the links to the artists
                    name_dict = ast.literal_eval(row['artists'])
                    for artist_id in name_dict.keys():
                        current_artist = URIRef(ME[artist_id])
                        g.add((current_album, ME['hasMainArtist'], current_artist))
                        print(f"Artist ID: {artist_id}")

                    # Link the albums that won or were nominated for a Grammy (winner/candidated)
                    for grammy_id, album_row, winner in matched_pairs_grammy_album:
                        if album_row.Index == index:
                            album_id = album_row.Index
                            current_album = URIRef(ME[album_id])
                            current_grammy = URIRef(ME[grammy_id])


                            if (winner):
                                g.add((current_album, ME['winner'], current_grammy))
                                print(f"{album_id} won {grammy_id}")
                            else:
                                g.add((current_album, ME['candidated'], current_grammy))
                                print(f"{album_id} lost {grammy_id}")

                else:
                    print(f"\nAlbum {index} does not contain songs from Top 100")
    else:
        print(f"\nNo tracks found for album {index}")

 





    
    
    


Album 5n1GSzC1Reao29ScnpLYqp contains song 2MShy1GSSgbmGUxADNIao5
So we add Album Dying To Live 
Artist ID: 46SHBwWsqBkxI7EeeBEQG7

Album 5n1GSzC1Reao29ScnpLYqp contains song 6Gg25EZRbQd4IHiJz2KSY0
So we add Album Dying To Live 
Artist ID: 46SHBwWsqBkxI7EeeBEQG7

Album 5n1GSzC1Reao29ScnpLYqp contains song 1XJBDDpeuZaj5fvmwhdIMw
So we add Album Dying To Live 
Artist ID: 46SHBwWsqBkxI7EeeBEQG7

Album 5n1GSzC1Reao29ScnpLYqp contains song 7HdNB8nvJOBwHa8hIkzvxp
So we add Album Dying To Live 
Artist ID: 46SHBwWsqBkxI7EeeBEQG7

Album 6UYZEYjpN1DYRW0kqFy9ZE contains song 3EQ9QP2E7wjYQba8OSPBst
So we add Album Championships 
Artist ID: 20sxb77xiYeusSH8cVdatc

Album 6UYZEYjpN1DYRW0kqFy9ZE contains song 6pUBrvKfJL4qmn95mxU7rC
So we add Album Championships 
Artist ID: 20sxb77xiYeusSH8cVdatc

Album 6UYZEYjpN1DYRW0kqFy9ZE contains song 45CCe4gu08OYG1I4MH8TU6
So we add Album Championships 
Artist ID: 20sxb77xiYeusSH8cVdatc

Album 6UYZEYjpN1DYRW0kqFy9ZE contains song 1bx4Jw8A7GQtImIQGccI6D
So we add
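
The loop above filters <code>tracks</code> once per album and tests each song against the chart index. The same join can be expressed with lookups that are built once, as in this minimal sketch; the sample rows and the names <code>songs_by_album</code>, <code>charted</code> and <code>charted_songs</code> are invented for illustration.

```python
from collections import defaultdict

# Invented sample rows standing in for tracks.csv and song_chart.csv
track_rows = [("album1", "songA"), ("album1", "songB"), ("album2", "songC")]
chart_song_ids = ["songA", "songA", "songC"]  # a song appears once per charted week

# Build both lookups a single time
songs_by_album = defaultdict(list)
for album_id, song_id in track_rows:
    songs_by_album[album_id].append(song_id)
charted = set(chart_song_ids)

def charted_songs(album_id):
    """Songs of `album_id` that appear at least once in the chart."""
    return [song for song in songs_by_album.get(album_id, []) if song in charted]

print(charted_songs("album1"))
print(charted_songs("album2"))
print(charted_songs("album3"))
```

Set and dictionary lookups take roughly constant time, so the cost of the join grows with the number of tracks rather than with the number of tracks times the size of the chart.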

In [183]:
%%time
# serialize the graph in the Turtle format and save it to disk
print("--- saving serialization ---")
with open(savePath + 'albums.ttl', 'w', encoding='utf-8') as file:
    file.write(g.serialize(format='turtle'))

--- saving serialization ---
CPU times: total: 859 ms
Wall time: 1.11 s
