## Populate Songs

This notebook populates the RDF database with the songs that appeared in the Billboard Hot 100, starting from the following __.csv__ files:
- [songs.csv](../../csv/musicoset_metadata/songs.csv)
- [song_chart.csv](../../csv/musicoset_popularity/song_chart.csv)
- [acoustic_features.csv](../../csv/musicoset_songfeatures/acoustic_features.csv)
- [the_grammy_awards_mapped.csv](../../csv/the_grammy_awards_mapped.csv)

To match entries across the different datasets, we use the __fuzzywuzzy__ package for fuzzy string matching.
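
As a rough illustration of threshold-based fuzzy matching, the sketch below scores a query against a set of candidate titles and keeps the best candidate only if it reaches a cutoff. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzywuzzy scorer, so the scores may differ from those obtained in this notebook. The names `similarity` and `best_match`, the sample titles and the IDs are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings (difflib stand-in)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match(query, choices, threshold):
    """Return (title, score, key) of the best candidate, or None below the threshold.

    `choices` maps a key (e.g. a song ID) to a title.
    """
    best = None
    for key, title in choices.items():
        score = similarity(query.lower(), title.lower())
        if best is None or score > best[1]:
            best = (title, score, key)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Hypothetical sample data: song ID -> title
titles = {"s1": "Hello", "s2": "Halo", "s3": "Yellow"}

print(best_match("hello", titles, threshold=90))    # ('Hello', 100, 's1')
print(best_match("goodbye", titles, threshold=90))  # None
```

A high threshold trades recall for precision: fewer associations are found, but those that are found are less likely to be wrong.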

Before running the code you need to install the following packages:
- <code>pip install fuzzywuzzy[speedup]</code>
- <code>pip install pandas</code>
- <code>pip install rdflib</code>


In [11]:
# required libraries
import pandas as pd
import os
from pathlib import Path
from fuzzywuzzy import fuzz, process
import hashlib
import re
from rdflib import Graph, Literal, RDF, URIRef, Namespace
from rdflib.namespace import XSD

### Path of the songs

In [None]:
# parameters and URLs
path = str(Path(os.path.abspath(os.getcwd())).parent.parent.absolute())
songsUrl = path + '/csv/musicoset_metadata/songs.csv'
songInChartUrl = path + '/csv/musicoset_popularity/song_chart.csv'
acousticFeaturesUrl = path + '/csv/musicoset_songfeatures/acoustic_features.csv'
grammyUrl = path + '/csv/the_grammy_awards_mapped_uppercase.csv'

# saving folder
savePath =  path + '/PopulateRDFdb/PopulateSongs/'

# print the paths
print(f'''
    Songs path: {songsUrl}
    Chart path: {songInChartUrl}
    Acoustic features path: {acousticFeaturesUrl}
    Grammy path: {grammyUrl}
      ''')

### Load the songs

In [None]:
# Load the CSV files in memory
songs = pd.read_csv(songsUrl, sep='\t', index_col='song_id')
acousticFeatures = pd.read_csv(acousticFeaturesUrl, sep='\t', index_col='song_id')
songCharts = pd.read_csv(songInChartUrl, sep='\t', index_col='song_id')
grammy = pd.read_csv(grammyUrl, sep=',', keep_default_na=False, na_values=['_'])

songs.info()
acousticFeatures.info()
songCharts.info()
grammy.info()

### Namespaces and binding

In [14]:
# Construct the melody ontology namespace not known by RDFlib
MEL = Namespace("http://www.dei.unipd.it/~gdb/ontology/melody#")

# Create the graph
g = Graph()

# Bind the namespaces to a prefix for more readable output
g.bind("xsd", XSD)
g.bind("mel", MEL)

### Unique id functions

These functions create unique IDs for the GRAMMYs and the charts.

In [15]:
def create_grammy_id(year, category):
    
    year = str(year)
    category = str(category) if pd.notna(category) else ''

    # Clean and normalize the data
    def clean_text(text):
        # Remove special characters and convert to lowercase
        return re.sub(r'[^\w\s-]', '', text).lower().strip()
    
    # Build a single string concatenating all the data
    full_string = f"{year}_{clean_text(category)}"
    
    # Generate a truncated SHA-256 hash
    hash_object = hashlib.sha256(full_string.encode())
    short_hash = hash_object.hexdigest()[:8]
    
    # Build the final ID
    category_abbr = ''.join(word[0] for word in clean_text(category).split()[:3])
    final_id = f"{year}_{category_abbr}_{short_hash}"
    
    return final_id


def create_chart_id(date):
    return "bil_" + str(date).replace("-", "")
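
To get a feeling for what these IDs look like, the following self-contained sketch mirrors the two functions above using only the standard library. As a simplifying assumption, the `pd.notna` check is replaced by a plain `None` check; the names `sketch_grammy_id` and `sketch_chart_id` and the sample inputs are hypothetical.

```python
import hashlib
import re


def clean_text(text):
    # Drop special characters and lowercase the text
    return re.sub(r'[^\w\s-]', '', text).lower().strip()


def sketch_grammy_id(year, category):
    # Simplified stand-in: a plain None check replaces pd.notna
    year = str(year)
    category = '' if category is None else str(category)
    cleaned = clean_text(category)
    # Truncated SHA-256 hash of "<year>_<cleaned category>"
    short_hash = hashlib.sha256(f"{year}_{cleaned}".encode()).hexdigest()[:8]
    # Initials of the first three words of the category
    abbr = ''.join(word[0] for word in cleaned.split()[:3])
    return f"{year}_{abbr}_{short_hash}"


def sketch_chart_id(date):
    return "bil_" + str(date).replace("-", "")


grammy_id = sketch_grammy_id(2019, "Record Of The Year")
print(grammy_id)                      # starts with "2019_rot_", followed by 8 hex characters
print(sketch_chart_id("2019-01-05"))  # bil_20190105
```

Because the hash depends only on the year and the cleaned category, the same pair always yields the same ID, which lets different notebooks refer to the same GRAMMY node.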


### Songs to remove from the RDF dataset
Some songs are present in the [songs.csv](../../csv/musicoset_metadata/songs.csv) file but not in the [song_chart.csv](../../csv/musicoset_popularity/song_chart.csv) file. These songs are not mapped in the RDF dataset because they are not of interest to us: we only consider songs that have appeared at least once in the Billboard Hot 100.

In [None]:
songsIDToRemove = songs[~songs.index.isin(songCharts.index)].index
print(songsIDToRemove.to_list())
songs = songs[~songs.index.isin(songsIDToRemove)]

### Remix matching
Using the titles and artist names, identify whether a remix has an original version in the Billboard Hot 100 song dataset.
The potential original songs are those that include at least one artist featured in the remix.

In [None]:
print("--- remix matching ---")

found = 0 # The number of found associations
searchThreshold = 90
for index, row in songs.iterrows():
    # Consider only the songs that contain the word "remix" in the title
    if "remix" in row["song_name"].lower():
        # Collect the songs sharing at least one artist with the remix: these are the possible original songs
        remixArtists = eval(row["artists"]).keys()
        possibleOriginalSongs = []
        for index2, row2 in songs.iterrows():
            # Skip the remix itself
            if index2 == index:
                continue
            if any(artist_id in remixArtists for artist_id in eval(row2["artists"]).keys()):
                possibleOriginalSongs.append(row2["song_name"])

        # If there are possible original songs, try to find the one that best matches the title
        if len(possibleOriginalSongs) > 0:
            result = process.extractOne(row['song_name'], songs[songs['song_name'].isin(possibleOriginalSongs)]['song_name'])
            if result[1] >= searchThreshold: # original song found
                print("Adding to the graph: " + row['song_name'] + " --> " + result[0])
                # Add the triple in the graph
                g.add((URIRef(MEL[index]), MEL['isRemixOf'], URIRef(MEL[result[2]])))
                found += 1

print("Number of REMIX/song associations found: ", found)
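
The artist-overlap test used to select candidate originals can be isolated as in the sketch below. It assumes that each value of the `artists` column is a string holding a Python dict literal that maps artist IDs to names, and it parses the string with `ast.literal_eval`, which only accepts literals, instead of `eval`. The helper names and the sample values are hypothetical.

```python
import ast


def artist_ids(artists_field):
    """Parse a dict literal such as "{'a1': 'Name'}" and return its keys as a set."""
    return set(ast.literal_eval(artists_field))


def share_artist(artists_a, artists_b):
    """True if the two songs have at least one artist ID in common."""
    return not artist_ids(artists_a).isdisjoint(artist_ids(artists_b))


# Hypothetical values of the `artists` column
remix = "{'a1': 'Artist One', 'a2': 'Artist Two'}"
original = "{'a1': 'Artist One'}"
unrelated = "{'a3': 'Artist Three'}"

print(share_artist(remix, original))   # True
print(share_artist(remix, unrelated))  # False
```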

### Mapping the songs to the GRAMMYs

The mapping is based on the **title of the songs**. Some cleaning is necessary before the actual matching that is performed using the *FuzzyWuzzy* library.

In [None]:
print("--- song to grammy mapping ---")

found = 0 # Count the number of found associations
searchThreshold = 98

# Clean the song titles so that they match the grammy nominee field
words_to_remove = ['version', 'theme', 'single', 'mono'] # list of words that cause noise in the titles
pattern = r'\b(?:' + '|'.join(words_to_remove) + r')\b'
cleanSongsTitlesDf = songs['song_name'].str.lower() # all the titles in lowercase for better matching
cleanSongsTitlesDf = cleanSongsTitlesDf.str.replace(pattern, '', regex=True).str.strip() # remove noisy words
cleanSongsTitlesDf = cleanSongsTitlesDf.str.replace(r'-.*$', '', regex=True).str.strip() # remove everything from the first "-" onwards

# Add song-grammy mapping
for grammyIndex, grammyRow in grammy.iterrows():
    
    # Skip the categories containing the word "album" or "artist" (we want to map just the songs)
    # and the rows with an empty 'workers' field (it means it is a grammy for an artist)
    if all(keyword not in grammyRow['category'].lower() for keyword in ["album", "artist"]) and grammyRow['workers']:
        
        # clean the nominee field for a better match
        nominee = grammyRow['nominee'].lower() # nominee string to lowercase
        nominee = re.sub(pattern, '', nominee) # Remove noisy words
        nominee = nominee.split('.')[0] # Clear what comes after the "."

        # Try to find the associated song in the song dataset
        result = process.extractOne(nominee, cleanSongsTitlesDf, scorer=fuzz.ratio)
        
        # If it is a match according to the search threshold
        if result[1] >= searchThreshold:
            found += 1
            # Add the triple to the graph
            Grammy = URIRef(MEL[create_grammy_id(
                grammyRow['year'], 
                grammyRow['category'])])
            
            if grammyRow['winner']:
                g.add((URIRef(MEL[result[2]]), MEL['winner'], Grammy))
            else:
                g.add((URIRef(MEL[result[2]]), MEL['candidated'], Grammy))
            
            print("Adding to the graph: " + result[0] + " --> " + nominee + " " + str(result[1]))
            
print("Found associations between grammy and songs: ", found)
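
The effect of the cleaning steps is easier to see on single strings. The sketch below applies the same regular expressions with the standard `re` module instead of the pandas string methods; `clean_title`, `clean_nominee` and the sample titles are hypothetical names and values used only for illustration.

```python
import re

WORDS_TO_REMOVE = ['version', 'theme', 'single', 'mono']
PATTERN = r'\b(?:' + '|'.join(WORDS_TO_REMOVE) + r')\b'


def clean_title(title):
    """Lowercase, drop the noisy words, then drop everything from the first '-' onwards."""
    cleaned = title.lower()
    cleaned = re.sub(PATTERN, '', cleaned).strip()
    cleaned = re.sub(r'-.*$', '', cleaned).strip()
    return cleaned


def clean_nominee(nominee):
    """Lowercase, drop the noisy words, then keep only what precedes the first '.'."""
    cleaned = nominee.lower()
    cleaned = re.sub(PATTERN, '', cleaned)
    return cleaned.split('.')[0]


print(clean_title("Yesterday - Mono Version"))  # yesterday
print(clean_title("Theme From Shaft"))          # from shaft
print(clean_nominee("Rolling In The Deep."))    # rolling in the deep
```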

### Song feature mapping
The acoustic features and the metadata of each song are added to the graph.

In [None]:
print("--- song features mapping ---")

# keyMap list used to convert the numeric key of the song into its note name
keyMap = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

for index, row in acousticFeatures.iterrows():
    # Process only the songs that are present in the songs pandas dataframe.
    if index in songs.index:
        Song = URIRef(MEL[index])
        songMetadata = songs[songs.index == index]
        g.add((Song, RDF.type, MEL.Song))
        g.add((Song, MEL['acousticness'], Literal(row['acousticness'], datatype=XSD.float)))
        
        bpm = round(row['tempo'])
        if bpm > 0:
            g.add((Song, MEL['bpm'], Literal(bpm, datatype=XSD.positiveInteger)))
        
        g.add((Song, MEL['danceability'], Literal(row['danceability'], datatype=XSD.float)))
        g.add((Song, MEL['valence'], Literal(row['valence'], datatype=XSD.float)))
        
        duration = int(row['duration_ms'])
        if duration > 0:
            g.add((Song, MEL['duration'], Literal(duration, datatype=XSD.positiveInteger)))
        
        # Add key and mode if present
        if int(row['key']) != -1 and row['mode'] in {0, 1}:
            g.add((Song, MEL['key'], Literal((keyMap[int(row['key'])] + ("m" if row['mode'] == 0 else "")), lang="en")))
        
        # Add song name
        songName = str(songMetadata['song_name'].iloc[0])
        g.add((Song, MEL['name'], Literal(songName, datatype=XSD.string)))

        # Add song type and check if it is compliant
        if(songMetadata['song_type'].iloc[0] in {"Collaboration", "Solo"}):
            g.add((Song, MEL['songType'], Literal(songMetadata['song_type'].iloc[0], lang="en")))
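
The key and mode conversion can be tried in isolation with the sketch below, which follows the same rule as the cell above: the numeric key selects a note name, and an `m` suffix is appended when the mode is 0 (minor). The function name `key_label` is hypothetical.

```python
KEY_MAP = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def key_label(key, mode):
    """Return e.g. 'C#m' for key=1, mode=0, or None when key or mode is not usable."""
    if int(key) == -1 or mode not in {0, 1}:
        return None
    return KEY_MAP[int(key)] + ("m" if mode == 0 else "")


print(key_label(9, 0))   # Am
print(key_label(0, 1))   # C
print(key_label(-1, 1))  # None
```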


### Song to artist mapping

Each song is linked to one or more artists.

In [None]:
print("--- song to artist mapping ---")

for index, row in songs.iterrows():
    Song = URIRef(MEL[index])
    # Add artist/s
    for artist_id in eval(row["artists"]).keys():
        g.add((Song, MEL['sungBy'], URIRef(MEL[artist_id])))

### Song to Billboard mapping
The songs in the [songs.csv](../../csv/musicoset_metadata/songs.csv) file are taken from the Billboard Hot 100 charts of the last 56 years; therefore, each song is mapped to the Billboard charts in which it appeared.

In [None]:
print("--- song to billboard mapping ---")

for index, row in songCharts.iterrows():
    Song = URIRef(MEL[index])
    Membership = URIRef(MEL["m_" + index + "_" + str(row['rank_score'])])
    Billboard = URIRef(MEL[create_chart_id(row['week'])])

    g.add((Membership, RDF.type, MEL.Membership))
    g.add((Billboard, RDF.type, MEL.BillboardHot100))
    g.add((Song, MEL['classified'], Membership))
    g.add((Membership, MEL['position'], Literal(row['rank_score'], datatype=XSD.positiveInteger)))
    g.add((Membership, MEL['classifiedIn'], Billboard))
    g.add((Billboard, MEL['week'], Literal(row['week'], datatype=XSD.date)))


### Save the turtle serialization

In [None]:
print("--- saving serialization ---")

with open(savePath + 'songs.ttl', 'w', encoding="utf-8") as file:
    file.write(g.serialize(format='turtle'))