
## Dataset Description

Our analysis is based on the **Spotify Million Playlist Dataset (MPD)**, introduced as part of the RecSys Challenge 2018 by Spotify Research [1]. The dataset contains **1,000,000 user-generated playlists**, sampled from over 4 billion public playlists created on the Spotify platform between January 2010 and November 2017. Each playlist includes metadata such as playlist title, number of tracks, number of albums, and duration, as well as detailed track-level information including track name, artist name, album name, and track duration. In total, the MPD comprises **over 2 million unique tracks by nearly 300,000 artists**. Playlists were sampled with randomization and manually filtered to ensure quality and remove offensive content [2].

In this report we use the first 1,000 playlists as the foundation for our network.

To enrich the network with semantic information, we scraped lyrics for the tracks associated with each artist from the Genius website [3] using the `lyricsgenius` client for the Genius API [4]. For each artist, we collected the full text of their songs, including verses, choruses, and bridges, and stored it in plain text files. These files were aggregated at the artist level, resulting in a single combined lyric corpus per artist.

**References**  
[1] Spotify Research. *The Million Playlist Dataset Challenge*. RecSys Challenge 2018. Available at: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge  
[2] McFee, B., et al. (2018). *The Million Playlist Dataset Challenge*. Proceedings of the ACM RecSys Challenge 2018.  
[3] https://genius.com/  
[4] https://lyricsgenius.readthedocs.io/en/master/reference/genius.html  

## Loading the playlist data into dictionaries

In [3]:
import networkx as nx
from collections import defaultdict
from itertools import combinations
from pathlib import Path
import lyricsgenius
import os
import re
import time
import json

# Number of playlists to process from the slice files
NUM_PLAYLISTS = 1000

# Folder containing the original MPD slice files
folder_path = Path(r"/Users/noa/Desktop/02805 - Social Graphs/playlist_data/")

# folder with artist/lyrics files
artist_folder = Path(r"/Users/noa/Desktop/02805 - Social Graphs/artist_lyrics_cleaned")

# Load all mpd slice JSON files in the folder and merge their playlists
file_list = sorted(folder_path.glob("mpd.slice.*.json"))
playlists = []
for fp in file_list:
    with open(fp, 'r', encoding='utf-8') as f:
        data = json.load(f)
        playlists.extend(data.get("playlists", []))

print(f"Loaded {len(file_list)} files, total playlists merged: {len(playlists)}")

Loaded 1 files, total playlists merged: 1000


We filter the playlists using the following criteria:
- A playlist must contain between 20 and 100 songs (inclusive).
  - We expect playlists with more than 100 songs to be less curated and closer to a random collection, and therefore to lack a clear theme.
  - Playlists with fewer than 20 songs might not contain enough songs to carry thematic meaning.
- A playlist must contain at least 6 different artists.
  - This ensures variety within each playlist.
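
The two criteria can be expressed as a single predicate. The sketch below is only illustrative: `keep_playlist` and its parameter names are hypothetical, and the toy playlists are made-up data that mimic the `tracks` and `artist_name` fields of the MPD slice files.

```python
def keep_playlist(playlist, min_tracks=20, max_tracks=100, min_artists=6):
    """Return True if the playlist passes both filtering criteria."""
    tracks = playlist.get("tracks", [])
    # Criterion 1: playlist length within the inclusive range
    if not (min_tracks <= len(tracks) <= max_tracks):
        return False
    # Criterion 2: enough distinct artists
    artists = {t["artist_name"] for t in tracks if t.get("artist_name")}
    return len(artists) >= min_artists


# Toy playlists (made-up data)
varied = {"tracks": [{"artist_name": f"artist_{i % 8}"} for i in range(25)]}
too_short = {"tracks": [{"artist_name": f"artist_{i}"} for i in range(10)]}
one_artist = {"tracks": [{"artist_name": "solo"} for _ in range(30)]}

print(keep_playlist(varied), keep_playlist(too_short), keep_playlist(one_artist))
```

Only the first toy playlist passes: the second is too short and the third has a single artist.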

In [4]:
# Accumulators
artist_songs = defaultdict(set)
artist_playlists = defaultdict(set)
artist_albums = defaultdict(set)
artist_durations = defaultdict(list)
edge_playlists = defaultdict(set)

def normalize_artist_name(name):
    if not name:
        return None
    # strip first: after the replacement, surrounding spaces would already be underscores
    return name.strip().replace(' ', '_')

included_playlists = 0

for pl in playlists[:NUM_PLAYLISTS]:
    pid = pl.get("pid")
    tracks = pl.get("tracks", [])
    # filter playlists by track count and unique artist count
    if not (20 <= len(tracks) <= 100):
        continue

    # build normalized set of unique artists for this playlist
    unique_artists = {normalize_artist_name(t["artist_name"]) for t in tracks if t.get("artist_name")}
    unique_artists = {a for a in unique_artists if a}  # drop Nones/empty
    if len(unique_artists) < 6:
        continue

    included_playlists += 1

    # collect songs, albums, durations, playlist membership per artist
    for t in tracks:
        raw_artist = t.get("artist_name")
        artist = normalize_artist_name(raw_artist)
        if not artist:
            continue
        track_name = t.get("track_name")
        album_name = t.get("album_name")
        duration = t.get("duration_ms")

        if track_name:
            artist_songs[artist].add(track_name)
        if album_name:
            artist_albums[artist].add(album_name)
        if duration:
            artist_durations[artist].append(duration)
        artist_playlists[artist].add(pid)

    # record this playlist for every pair of (normalized) artists it contains
    for a, b in combinations(sorted(unique_artists), 2):
        edge_playlists[(a, b)].add(pid)

print(f"Included playlists: {included_playlists}")


Included playlists: 583


These filters removed 417 of the 1,000 playlists (about 42%), leaving 583. Note that the lyrics attached to each artist as an attribute were scraped from Genius for the tracks in all 1,000 playlists, not only for the playlists that passed the filters.

## Scrape the lyrics

Note: this code was run on the HPC, and all concatenated lyrics were saved as .txt files.

We took the following considerations into account:
- Avoid duplicates: only fetch lyrics for unique songs per artist.
- Rate limits: the Genius API is rate-limited, so we added delays between requests.
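
Both considerations can be sketched without any network access. In the example below, `fetch_with_backoff` and `fetch_unique` are hypothetical helpers, `fetch` stands for any callable (for instance a wrapper around `genius.search_song`), and the exponential backoff is an assumption made for illustration; the scraping cell in this notebook simply sleeps for a fixed second between requests.

```python
import time


def fetch_with_backoff(fetch, key, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(key), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(key)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)


def fetch_unique(fetch, keys, **kwargs):
    """Fetch each distinct key once, preserving first-seen order."""
    results = {}
    for key in keys:
        if key not in results:
            results[key] = fetch_with_backoff(fetch, key, **kwargs)
    return results


# Demo with a fake fetcher that fails on its first call (made-up data)
calls = []

def flaky_fetch(key):
    calls.append(key)
    if len(calls) == 1:
        raise RuntimeError("simulated rate limit")
    return f"lyrics for {key}"

delays = []  # record requested delays instead of actually sleeping
lyrics = fetch_unique(flaky_fetch, ["Song A", "Song B", "Song A"], sleep=delays.append)
print(lyrics)
print(delays)
```

Passing `sleep` as a parameter keeps the demo instantaneous; in real use the default `time.sleep` would apply.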

In [None]:
import json
from pathlib import Path
import lyricsgenius
import os
import re
import time
from collections import defaultdict

# Folder to save lyrics
lyrics_folder = Path("artist_lyrics")
lyrics_folder.mkdir(exist_ok=True)

# Setup Genius API
# Read the access token from the environment rather than hard-coding it in the notebook
# (the variable name GENIUS_ACCESS_TOKEN is an assumed convention)
GENIUS_ACCESS_TOKEN = os.environ["GENIUS_ACCESS_TOKEN"]

# Initialize Genius client
genius = lyricsgenius.Genius(
    GENIUS_ACCESS_TOKEN,
    remove_section_headers=True,   # cleans up [Verse], etc.
    timeout=15,
    retries=3
)

artist_lyrics = defaultdict(str)

for artist, songs in artist_songs.items():
    for track_name in songs:
        try:
            song = genius.search_song(track_name, artist)
            if song and song.lyrics:
                artist_lyrics[artist] += "\n" + song.lyrics
        except Exception as e:
            print(f"Error retrieving {track_name} by {artist}: {e}")
        time.sleep(1)  # Avoid hitting rate limits

# Save lyrics to files
for artist, lyrics in artist_lyrics.items():
    safe_name = re.sub(r'[^\w\s-]', '', artist).strip().replace(' ', '_')
    file_path = lyrics_folder / f"{safe_name}.txt"
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(lyrics)

## Clean and load lyrics for all artists

In [None]:
import re
from pathlib import Path
from typing import Iterable, Union

def remove_metadata_blocks_from_text(text: str) -> str:
    """
    Remove metadata blocks from a text.
    Returns the cleaned text.
    """
    lines = text.splitlines(keepends=True)
    out_lines = []
    i = 0

    # Match a line that starts with optional whitespace, then digits, then 'Contributors'
    pattern = re.compile(r'^\s*\d+\s+Contributors')  # start of a metadata block

    # Match any run of non-space characters, starting at a word boundary, that ends with
    # 'Embed' and has at least one character before it
    embed_pat = re.compile(r'\b\S+Embed\b', re.IGNORECASE)

    while i < len(lines):
        # Remove tokens like '87Embed' but NOT a standalone 'Embed'. The whole token is
        # removed, so a lyric word fused to the marker (e.g. 'line87Embed') is dropped too.
        line = embed_pat.sub('', lines[i])

        if pattern.match(line):
            # skip this metadata line and subsequent non-blank lines
            i += 1
            while i < len(lines) and lines[i].strip() != "":
                i += 1
            # if there's a blank line, preserve it (to keep stanza breaks)
            if i < len(lines) and lines[i].strip() == "":
                out_lines.append(lines[i])
                i += 1
        else:
            out_lines.append(line)
            i += 1

    return "".join(out_lines)


def process_file(path: Union[str, Path], inplace: bool = True, backup: bool = True, encoding: str = "utf-8") -> str:
    """
    Process a single file. If inplace is True, overwrite the file (optionally making a .bak backup).
    Returns the cleaned text.
    """
    p = Path(path)
    text = p.read_text(encoding=encoding)
    cleaned = remove_metadata_blocks_from_text(text)

    if inplace:
        if backup:
            bak = p.with_suffix(p.suffix + ".bak")
            bak.write_text(text, encoding=encoding)
        p.write_text(cleaned, encoding=encoding)

    return cleaned


def process_paths(paths: Iterable[Union[str, Path]], **kwargs) -> None:
    """
    Process multiple files or directories. If a directory is supplied, all files
    inside (non-recursive) will be processed. kwargs are passed to process_file.
    """
    for p in paths:
        p = Path(p)
        if p.is_dir():
            for child in p.iterdir():
                if child.is_file():
                    process_file(child, **kwargs)
        elif p.is_file():
            process_file(p, **kwargs)
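
As a quick check of what the two regular expressions in `remove_metadata_blocks_from_text` match, the snippet below applies them to a few made-up strings. The name `header_pat` is used here only for readability; the sample strings are assumptions about what scraped Genius text can look like.

```python
import re

# Same two patterns as in remove_metadata_blocks_from_text
header_pat = re.compile(r'^\s*\d+\s+Contributors')
embed_pat = re.compile(r'\b\S+Embed\b', re.IGNORECASE)

# A leading count followed by 'Contributors' marks the start of a metadata block
print(bool(header_pat.match("12 Contributors")))
print(bool(header_pat.match("Contributors")))

# A token fused with 'Embed' is removed as a whole; a standalone 'Embed' is kept
print(repr(embed_pat.sub('', "Last line87Embed")))
print(repr(embed_pat.sub('', "Embed")))
```

The third call returns `'Last '`: because the pattern consumes the entire token, the lyric word `line` is removed together with `87Embed`.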

In [None]:
# Clean all files in a folder
PATH = '/Users/noa/Desktop/02805 - Social Graphs/artist_lyrics_cleaned' # TODO
process_paths([PATH], inplace=True, backup=False, encoding="utf-8")

Now we create the dictionary of lyrics for each artist.

In [5]:
lyrics_dict = {}

# Load lyrics for artists
for txt_file in artist_folder.glob("*.txt"):
    artist_name = txt_file.stem  # filename without extension
    with open(txt_file, 'r', encoding='utf-8') as f:
        lyrics_dict[artist_name.lower()] = f.read()

## Creating the undirected co-occurrence graph
From the MPD, we construct an **artist co-occurrence network** where nodes represent artists and edges represent co-occurrence in playlists. An edge between two artists indicates that they appear together in at least one playlist. 

**Nodes** represent artists.
Node attributes:
- songs: list of unique track names
- playlists: list of playlist IDs the artist appears in
- num_playlists: number of playlists
- num_songs: number of unique songs
- avg_song_duration: average track duration in milliseconds
- albums: list of unique album names
- lyrics: combined lyrics text, if available (e.g. for artists like 2Pac)

**Edges** represent co-occurrence in playlists.
Edge attributes:
- shared_playlists: list of playlist IDs that contain both artists
- weight: number of shared playlists
- co_occurrence_count: number of playlists in which the two artists appear together (identical to weight)

We remove nodes where:
- the Genius API could not find lyrics for the artist
- the artist has fewer than four songs in the playlists that passed the filters

Finally, we keep only the largest connected component as the main graph.
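
To make the edge construction concrete, the following dependency-free sketch computes the shared-playlist sets and edge weights for three made-up playlists. The playlist IDs and artist names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Made-up playlists: playlist ID -> artists appearing in it
toy_playlists = {
    0: ["A", "B", "C"],
    1: ["A", "B"],
    2: ["B", "C", "C"],  # a repeated artist counts once per playlist
}

shared = defaultdict(set)
for pid, artists in toy_playlists.items():
    # sorted() gives each undirected pair a single canonical key
    for a, b in combinations(sorted(set(artists)), 2):
        shared[(a, b)].add(pid)

# Edge weight = number of playlists the two artists share
weights = {pair: len(pids) for pair, pids in shared.items()}
print(weights)
```

Artists A and B share playlists 0 and 1, so their edge has weight 2, while A and C only share playlist 0.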

In [6]:
# Build the graph
G = nx.Graph()

# Add nodes with attributes
for artist in artist_songs.keys():
    num_playlists = len(artist_playlists[artist])
    num_songs = len(artist_songs[artist])
    avg_duration = sum(artist_durations[artist]) / len(artist_durations[artist]) if artist_durations[artist] else None
    lyrics = lyrics_dict.get(artist.lower(), None)

    G.add_node(artist,
               songs=list(artist_songs[artist]),
               albums=list(artist_albums[artist]),
               playlists=list(artist_playlists[artist]),
               num_playlists=num_playlists,
               num_songs=num_songs,
               avg_song_duration=avg_duration,
               lyrics=lyrics)

# Add edges with attributes
for (a, b), pls in edge_playlists.items():
    G.add_edge(a, b,
               shared_playlists=list(pls),
               weight=len(pls),
               co_occurrence_count=len(pls))

# remove nodes where lyrics attribute is missing or empty
nodes_to_remove = [n for n, attrs in G.nodes(data=True) if not attrs.get('lyrics')]

# remove nodes from the graph where num_songs is less than 4 
nodes_to_remove += [n for n, attrs in G.nodes(data=True) if attrs.get('num_songs', 0) < 4]

G.remove_nodes_from(nodes_to_remove) 

# keep largest connected component only
if not nx.is_connected(G):
    largest_cc = max(nx.connected_components(G), key=len)
    G = G.subgraph(largest_cc).copy()

print(f"After removing artists with no lyrics: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

After removing artists with no lyrics: 1015 nodes, 63840 edges


## Backbone

Because our network is very dense (63,840 edges among 1,015 nodes, an average degree of about 126), we filter out some nodes and edges by extracting the backbone of the graph.
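
The text does not state which backbone method is used. As one common option for weighted networks, the sketch below implements the disparity filter on a plain dictionary of edge weights; the function name `disparity_backbone`, the significance level `alpha=0.05`, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def disparity_backbone(weights, alpha=0.05):
    """Keep edges that are statistically significant for at least one endpoint.

    weights: dict mapping (u, v) -> positive weight of an undirected edge.
    """
    strength = defaultdict(float)  # sum of incident edge weights per node
    degree = defaultdict(int)      # number of incident edges per node
    for (u, v), w in weights.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # a node with a single edge offers no basis for a test
        # Probability of a normalized weight at least this large under the null model
        return (1.0 - w / strength[node]) ** (k - 1)

    return {
        edge: w
        for edge, w in weights.items()
        if min(p_value(edge[0], w), p_value(edge[1], w)) < alpha
    }


# Toy star graph (made-up data): one heavy edge and nine light ones around hub "H"
toy = {("H", "X"): 100}
toy.update({("H", f"n{i}"): 1 for i in range(9)})
backbone = disparity_backbone(toy)
print(backbone)
```

In this toy graph only the heavy edge survives, since it carries most of the hub's total weight while each light edge is unremarkable.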

Should we perhaps use this function to compute modularity?

In [None]:
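def modularity(G, partition):
    """
    Compute the modularity of a partition using Eq. 9.12:
        M = sum over communities c of [ L_c / L - (k_c / (2L))^2 ]
    where L is the total number of edges, L_c the number of edges inside
    community c, and k_c the total degree of the nodes in c.
    """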
    L = G.number_of_edges()  # total edges
    degrees = dict(G.degree())

    M = 0.0
    for community in partition:
        # Internal edges in community
        L_c = G.subgraph(community).number_of_edges()
        # Sum of degrees in community
        k_c = sum(degrees[node] for node in community)
        M += (L_c / L) - (k_c / (2 * L)) ** 2

    return M
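
As a sanity check of the formula implemented above, the sketch below restates it for a plain edge list without any dependencies and evaluates it on a made-up toy graph. The name `modularity_from_edges` is hypothetical and is not part of the notebook.

```python
def modularity_from_edges(edges, partition):
    """Modularity M = sum_c [L_c / L - (k_c / 2L)^2] for an undirected edge list."""
    L = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    M = 0.0
    for community in partition:
        members = set(community)
        # Edges with both endpoints inside the community
        L_c = sum(1 for u, v in edges if u in members and v in members)
        # Total degree of the community's nodes
        k_c = sum(degree.get(node, 0) for node in members)
        M += L_c / L - (k_c / (2 * L)) ** 2
    return M


# Two triangles joined by a single bridge edge (made-up toy graph)
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = modularity_from_edges(edges, [{1, 2, 3}, {4, 5, 6}])
single = modularity_from_edges(edges, [{1, 2, 3, 4, 5, 6}])
print(split, single)
```

Splitting at the bridge gives L = 7, and L_c = 3 and k_c = 7 for each triangle, so M = 2 (3/7 - 1/4) = 5/14, roughly 0.357. Putting all nodes in one community gives M = 0.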