# Motivation

The music industry is shaped not only by the artists who perform but also by the vast networks of collaborators who contribute to their work, particularly songwriters and producers. Award ceremonies like the Grammys mark key milestones in artists' careers and offer a natural starting point for exploring deeper patterns of collaboration and influence within the industry.

In this project, we focus on the top Grammy categories (Album of the Year, Record of the Year, Song of the Year, and Best New Artist) because they are genre-agnostic and represent the most culturally significant and commercially visible names in music. Starting from these nominees, we construct two graphs. In the writer graph, nodes are writers, and an edge connects two writers who have both worked on the same artist's discography. In the artist graph, nodes are artists, and an edge connects two artists who have worked with the same writer. To explore how these patterns evolve over time, we build a separate pair of graphs for each decade. Our goal is to uncover how closely connected the top tier of the music world really is.
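
The two graph definitions above can be illustrated with a minimal sketch. The credits below are made-up toy data, and the helper `co_occurrence_edges` is a hypothetical name introduced only for this illustration. It uses plain dictionaries rather than networkx to stay dependency-free, and it is not the code that generates the project's networks.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy credits: (artist, writer) pairs, not taken from the real data.
credits = [
    ("Artist A", "Writer 1"),
    ("Artist A", "Writer 2"),
    ("Artist B", "Writer 2"),
    ("Artist B", "Writer 3"),
    ("Artist C", "Writer 4"),
]


def co_occurrence_edges(pairs, group_index, node_index):
    """Link every two nodes that share a group; weight = number of shared groups."""
    groups = defaultdict(set)
    for pair in pairs:
        groups[pair[group_index]].add(pair[node_index])

    edges = defaultdict(int)
    for members in groups.values():
        # Sorting gives each undirected edge a single canonical (u, v) key.
        for u, v in combinations(sorted(members), 2):
            edges[(u, v)] += 1
    return dict(edges)


# Writer graph: two writers are linked when they wrote for the same artist.
writer_edges = co_occurrence_edges(credits, group_index=0, node_index=1)

# Artist graph: two artists are linked when they share a writer.
artist_edges = co_occurrence_edges(credits, group_index=1, node_index=0)

print(writer_edges)
print(artist_edges)
```

In this toy example, Writer 2 is the only writer shared between two artists, so the artist graph has a single edge, while the writer graph links Writer 2 to both of the writers it shares an artist with.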

# Basic Stats
In total, we gathered 548 distinct artists across the five Grammy nominee lists after removing duplicates. We then used MusicBrainz to gather information on each artist: we went through each artist's discography, collected all of their releases, and extracted the writers credited on those releases. This produced the following key datasets:
- artist dataset, containing information about each artist
- songs dataset, where each entry is a song with its own id and the id of the corresponding artist. Each entry also has some general information, such as the release year and the duration.
- writerships dataset, where each row is a relationship between a release id and a writer's id
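
As a rough sketch of how these tables relate, the snippet below resolves each writership to an artist through the shared recording id and derives two simple per-artist statistics. The column names follow the notebook code (`recording_mbid`, `artist_mbid`, `writer_id`), but the rows are invented for illustration and the variable names are hypothetical. The notebook itself does this kind of join with pandas.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature versions of songs.csv and writerships.csv.
# Column names follow the notebook code; the rows are made up.
SONGS_CSV = """recording_mbid,artist_mbid,title,first_release_year
r1,a1,Song One,1975
r2,a1,Song Two,1982
r3,a2,Song Three,1982
"""
WRITERSHIPS_CSV = """recording_mbid,writer_id
r1,w1
r1,w2
r2,w1
r3,w2
"""

song_rows = list(csv.DictReader(io.StringIO(SONGS_CSV)))
writership_rows = list(csv.DictReader(io.StringIO(WRITERSHIPS_CSV)))

# Index songs by recording id so each writership can be resolved to its artist.
artist_of = {row["recording_mbid"]: row["artist_mbid"] for row in song_rows}

# Distinct writers credited on each artist's songs.
writers_per_artist = defaultdict(set)
for row in writership_rows:
    artist = artist_of.get(row["recording_mbid"])
    if artist is not None:  # skip writerships with no matching song
        writers_per_artist[artist].add(row["writer_id"])

# Number of songs per artist.
songs_per_artist = defaultdict(int)
for row in song_rows:
    songs_per_artist[row["artist_mbid"]] += 1

for artist in sorted(songs_per_artist):
    print(artist, songs_per_artist[artist], sorted(writers_per_artist[artist]))
```

Counts like these (songs per artist, distinct writers per artist) are one plausible starting point for the dataset statistics discussed in this section.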

ChatGPT prompt for stats:

"  
[upload all the relevant datasets (artists.csv, songs.csv, writerships.csv, song_genres.csv, etc.)]
Given the datasets for our project, what sort of statistics can we make from them? Our project description says the following:
#### Basic stats. Let's understand the dataset better
Write about your choices in data cleaning and preprocessing
Write a short section that discusses the dataset stats

Please look at our data and suggest what stats would be relevant to show and possible visualizations. Provide python code for this  
"

In [10]:
import itertools
import json
from pathlib import Path

import networkx as nx
import pandas as pd

# ─── CONFIG ─────────────────────────────────────────────
DATA_DIR = Path('data')
OUT_DIR = Path('public/data')
OUT_DIR.mkdir(parents=True, exist_ok=True)

print("Loading datasets...")

# ─── LOAD CSVs ───────────────────────────────────────────
# Core datasets
songs = pd.read_csv(DATA_DIR / 'songs.csv')
writerships = pd.read_csv(DATA_DIR / 'writerships.csv')
artists_all = pd.read_csv(DATA_DIR / 'artists_all.csv') 
genres = pd.read_csv(DATA_DIR / 'artist_genres.csv')

# ─── DATA PREPARATION ───────────────────────────────────
print("Preparing data...")

# Clean IDs by stripping whitespace
songs['recording_mbid'] = songs['recording_mbid'].astype(str).str.strip()
writerships['recording_mbid'] = writerships['recording_mbid'].astype(str).str.strip()
writerships['writer_id'] = writerships['writer_id'].astype(str).str.strip()
songs['artist_mbid'] = songs['artist_mbid'].astype(str).str.strip()

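# Create decade column
# Note: to_numeric(..., errors='coerce') yields floats, so labels come out as e.g. '1920.0s'.
# The writer decades further down are built the same way, so the two stay comparable.
songs['first_release_year'] = pd.to_numeric(songs['first_release_year'], errors='coerce')
songs['decade'] = ((songs['first_release_year'] // 10) * 10).astype(str) + 's'
# Rows with a missing year get the string label 'nans', so they are dropped via first_release_year
songs = songs.dropna(subset=['first_release_year', 'decade', 'recording_mbid', 'artist_mbid'])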

# Filter to valid recordings
valid_recordings = songs[['recording_mbid', 'decade', 'artist_mbid', 'title', 'first_release_year']].dropna()

# Create lookup dictionaries
id_to_name = artists_all.set_index('artist_mbid')['name'].to_dict()

# Create genre lookup
genre_lookup = genres.groupby('artist_mbid')['genre'].apply(list).to_dict()

# Define default genre groups (for use if an artist has multiple genres)
genre_priority = {
    'pop': 1, 
    'rock': 2, 
    'r&b': 3, 
    'hip hop': 4, 
    'country': 5,
    'electronic': 6,
    'jazz': 7,
    'folk': 8,
    'classical': 9,
    'latin': 10
}

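def get_primary_genre(genres_list):
    """Return the first main genre matched by any entry of genres_list.

    Genres are scanned in list order; for each one, the keys of genre_priority
    are tried in insertion order using a case-insensitive substring match.
    Falls back to the first listed genre, or "unknown" for an empty list.
    """
    if not genres_list:
        return "unknown"

    # Try to find a main genre from our priority list
    for genre in genres_list:
        for main_genre in genre_priority:
            if main_genre in genre.lower():
                return main_genre

    # Otherwise return the first genre
    return genres_list[0]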

# ─── DETERMINE WRITER'S FIRST DECADE ────────────────────
print("Determining each writer's first decade of work...")

# Merge songs and writerships to get year information for each writer contribution
writer_years = (
    songs[['recording_mbid', 'first_release_year', 'decade']]
    .merge(writerships[['recording_mbid', 'writer_id']], on='recording_mbid')
    .dropna(subset=['first_release_year', 'writer_id'])
)

# Find the earliest song year for each writer
writer_first_year = writer_years.groupby('writer_id')['first_release_year'].min().reset_index()
writer_first_year['decade'] = ((writer_first_year['first_release_year'] // 10) * 10).astype(str) + 's'

# Create a lookup from writer to their first decade
writer_to_first_decade = dict(zip(writer_first_year['writer_id'], writer_first_year['decade']))

print(f"Found first decade for {len(writer_to_first_decade)} writers")

# ─── BUILD NETWORKS BY DECADE ───────────────────────────────
print("Building networks by decade...")

# Get all decades from the songs dataset
all_decades = sorted(songs['decade'].unique())

for decade in all_decades:
    print(f"Processing {decade}...")
    
    # Get writers who first appeared in this decade (a set, for fast membership tests below)
    decade_writers = {w for w, d in writer_to_first_decade.items() if d == decade}
    
    if len(decade_writers) == 0:
        print(f"  No writers first appeared in {decade}, skipping")
        continue
    
    print(f"  {decade}: {len(decade_writers)} writers first appeared in this decade")
    
    # Get all writerships for these writers (from ANY decade)
    decade_writerships = writerships[writerships['writer_id'].isin(decade_writers)]
    
    if len(decade_writerships) == 0:
        print(f"  No writerships for writers from {decade}, skipping")
        continue
    
    # Get all recordings these writers worked on
    decade_recordings = decade_writerships['recording_mbid'].unique()
    
    # Get unique artists these writers worked with
    decade_song_artists = songs[songs['recording_mbid'].isin(decade_recordings)]['artist_mbid'].unique()
    
    # Merge to get recording details for all songs by these writers
    decade_data = (
        songs[songs['recording_mbid'].isin(decade_recordings)]
        .merge(writerships, on='recording_mbid')
    )
    
    print(f"  {decade}: {len(decade_writers)} writers, {len(decade_recordings)} recordings, {len(decade_song_artists)} artists")
    
    # ─── WRITER NETWORK ───────────────────────────────────
    # For writers from this decade, who collaborated with whom
    writer_G = nx.Graph()
    
    # Group by recording to find writers who worked on the same song
    for rec_id, group in decade_data.groupby('recording_mbid'):
        writers = sorted(set(group['writer_id'].dropna()))
        
        # Only include writers who first appeared in this decade
        decade_only_writers = [w for w in writers if w in decade_writers]
        
        # Add edges between all writers who worked on this recording
        for w1, w2 in itertools.combinations(decade_only_writers, 2):
            if writer_G.has_edge(w1, w2):
                writer_G[w1][w2]['weight'] += 1
                writer_G[w1][w2]['songs'].append(group['title'].iloc[0])
            else:
                writer_G.add_edge(w1, w2, weight=1, songs=[group['title'].iloc[0]])
    
    # Create nodes and links for the JSON output
    writer_nodes = []
    for writer_id in writer_G.nodes():
        degree = writer_G.degree(writer_id, weight='weight')
        writer_genres = genre_lookup.get(writer_id, [])
        primary_genre = get_primary_genre(writer_genres)
        
        writer_nodes.append({
            'id': writer_id,
            'name': id_to_name.get(writer_id, writer_id),
            'value': degree,
            'decade': decade,
            'genre': writer_genres,
            'group': primary_genre
        })
    
    writer_links = []
    for u, v, d in writer_G.edges(data=True):
        writer_links.append({
            'source': u,
            'target': v, 
            'value': d['weight'],
            'songs': d.get('songs', [])
        })
    
    # Save writer network
    writer_network_file = OUT_DIR / f"writer-network-{decade}.json"
    with open(writer_network_file, 'w', encoding='utf-8') as f:
        json.dump({'nodes': writer_nodes, 'links': writer_links}, f, indent=2, ensure_ascii=False)
    
    print(f"  ✓ Writer network saved: {writer_network_file}")
    
    # ─── ARTIST NETWORK ───────────────────────────────────
    # Artists who share writers from this decade
    artist_G = nx.Graph()
    
    # Group by writer to find artists who share writers from this decade
    for writer_id, group in decade_data[decade_data['writer_id'].isin(decade_writers)].groupby('writer_id'):
        artists = sorted(set(group['artist_mbid'].dropna()))
        
        # Add edges between all artists who share this writer
        for a1, a2 in itertools.combinations(artists, 2):
            if artist_G.has_edge(a1, a2):
                artist_G[a1][a2]['weight'] += 1
            else:
                artist_G.add_edge(a1, a2, weight=1)
    
    # Create nodes and links for the JSON output
    artist_nodes = []
    for artist_id in artist_G.nodes():
        degree = artist_G.degree(artist_id, weight='weight')
        artist_genres = genre_lookup.get(artist_id, [])
        primary_genre = get_primary_genre(artist_genres)
        
        artist_nodes.append({
            'id': artist_id,
            'name': id_to_name.get(artist_id, artist_id),
            'value': degree,
            'decade': decade,
            'genre': artist_genres,
            'group': primary_genre
        })
    
    artist_links = []
    for u, v, d in artist_G.edges(data=True):
        artist_links.append({
            'source': u,
            'target': v, 
            'value': d['weight']
        })
    
    # Save artist network
    artist_network_file = OUT_DIR / f"artist-network-{decade}.json"
    with open(artist_network_file, 'w', encoding='utf-8') as f:
        json.dump({'nodes': artist_nodes, 'links': artist_links}, f, indent=2, ensure_ascii=False)
    
    print(f"  ✓ Artist network saved: {artist_network_file}")

print("\n🎉 All network data generated and saved to:", OUT_DIR)
print("\nNetwork statistics:")
for decade in all_decades:
    writer_file = OUT_DIR / f"writer-network-{decade}.json"
    artist_file = OUT_DIR / f"artist-network-{decade}.json"
    
    if writer_file.exists() and artist_file.exists():
        with open(writer_file, 'r', encoding='utf-8') as f:
            writer_data = json.load(f)
        with open(artist_file, 'r', encoding='utf-8') as f:
            artist_data = json.load(f)
            
        print(f"  {decade}: {len(writer_data['nodes'])} writers, {len(writer_data['links'])} writer connections, " + 
              f"{len(artist_data['nodes'])} artists, {len(artist_data['links'])} artist connections")

Loading datasets...
Preparing data...
Determining each writer's first decade of work...
Found first decade for 4434 writers
Building networks by decade...
Processing 1900.0s...
  No writers first appeared in 1900.0s, skipping
Processing 1920.0s...
  1920.0s: 1 writers first appeared in this decade
  1920.0s: 1 writers, 210 recordings, 2 artists
  ✓ Writer network saved: public/data/writer-network-1920.0s.json
  ✓ Artist network saved: public/data/artist-network-1920.0s.json
Processing 1930.0s...
  1930.0s: 4 writers first appeared in this decade
  1930.0s: 4 writers, 218 recordings, 7 artists
  ✓ Writer network saved: public/data/writer-network-1930.0s.json
  ✓ Artist network saved: public/data/artist-network-1930.0s.json
Processing 1940.0s...
  1940.0s: 47 writers first appeared in this decade
  1940.0s: 47 writers, 17297 recordings, 133 artists
  ✓ Writer network saved: public/data/writer-network-1940.0s.json
  ✓ Artist network saved: public/data/artist-network-1940.0s.json
Processin