---
# **Visual Analytics Final Project**
## **A Visual Music Recommender System Powered by Spotify Data**
---
### Members:
- Marina Castellano Blanco NIA 242409
- Júlia Othats-Dalès Gibert NIA 254435


## **1. Libraries**

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
import numpy as np

In [3]:
sns.set(style="whitegrid")

## **2. Loading and Merging**

### 2.1 Spotify Dataset: Modern + Classic

In [5]:
data1 = pd.read_csv('data/spotify_data clean.csv')
data2 = pd.read_csv('data/track_data_final.csv')

# Convert duration to seconds
if 'track_duration_min' in data1.columns:
    data1['track_duration_sec'] = (data1['track_duration_min'] * 60).round().astype(int)
    data1.drop(columns=['track_duration_min'], inplace=True)

if 'track_duration_ms' in data2.columns:
    data2['track_duration_sec'] = (data2['track_duration_ms'] / 1000).round().astype(int)
    data2.drop(columns=['track_duration_ms'], inplace=True)

# Concatenate datasets
df_spotify = pd.concat([data1, data2], axis=0).reset_index(drop=True)

df_spotify.head()

Unnamed: 0,track_id,track_name,track_number,track_popularity,explicit,artist_name,artist_popularity,artist_followers,artist_genres,album_id,album_name,album_release_date,album_total_tracks,album_type,track_duration_sec
0,3EJS5LyekDim1Tf5rBFmZl,Trippy Mane (ft. Project Pat),4,0,True,Diplo,77.0,2812821.0,moombahton,5QRFnGnBeMGePBKF2xTz5z,"d00mscrvll, Vol. 1",2025-10-31,9,album,93
1,1oQW6G2ZiwMuHqlPpP27DB,OMG!,1,0,True,Yelawolf,64.0,2363438.0,"country hip hop, southern hip hop",4SUmmwnv0xTjRcLdjczGg2,OMG!,2025-10-31,1,single,184
2,7mdkjzoIYlf1rx9EtBpGmU,Hard 2 Find,1,4,True,Riff Raff,48.0,193302.0,,3E3zEAL8gUYWaLYB9L7gbp,Hard 2 Find,2025-10-31,1,single,153
3,67rW0Zl7oB3qEpD5YWWE5w,Still Get Like That (ft. Project Pat & Starrah),8,30,True,Diplo,77.0,2813710.0,moombahton,5QRFnGnBeMGePBKF2xTz5z,"d00mscrvll, Vol. 1",2025-10-31,9,album,101
4,15xptTfRBrjsppW0INUZjf,ride me like a harley,2,0,True,Rumelis,48.0,8682.0,dark r&b,06FDIpSHYmZAZoyuYtc7kd,come closer / ride me like a harley,2025-10-30,2,single,143
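
The conversion above uses `.astype(int)`, which raises an error if any duration is missing, and the concatenation may contain the same track twice if both files overlap. Neither condition is confirmed for these files. The following is a minimal sketch of a more defensive variant under those assumptions; the helper `to_seconds` and the toy frames are hypothetical and only stand in for `data1` and `data2`.

```python
import pandas as pd


def to_seconds(df: pd.DataFrame, column: str, factor: float) -> pd.DataFrame:
    """Hypothetical helper: return a copy where `column` is replaced by
    `track_duration_sec`, stored as a nullable integer so NaN survives."""
    out = df.copy()
    if column in out.columns:
        out["track_duration_sec"] = (out[column] * factor).round().astype("Int64")
        out = out.drop(columns=[column])
    return out


# Toy frames with invented values (not taken from the real CSV files)
toy1 = pd.DataFrame({"track_id": ["a", "b"], "track_duration_min": [1.55, None]})
toy2 = pd.DataFrame({"track_id": ["b", "c"], "track_duration_ms": [93000, 184400]})

combined = pd.concat(
    [
        to_seconds(toy1, "track_duration_min", 60),
        to_seconds(toy2, "track_duration_ms", 1 / 1000),
    ],
    ignore_index=True,
)

# Count tracks that appear in both sources (same track_id)
n_dupes = int(combined.duplicated(subset="track_id").sum())
print(combined)
print("duplicated track_id rows:", n_dupes)
```

Whether duplicated `track_id` rows should be dropped or kept depends on the later analysis, so the sketch only counts them.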


### 2.2 Cleaning artist names and merging with metadata

In [6]:
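# Load artist metadata and build a normalized (trimmed, lowercase) join key
artist_meta = pd.read_csv("data/top10k-spotify-artist-metadata.csv")
artist_meta['artist_name'] = artist_meta['artist'].str.strip().str.lower()

# Keep only the main artist: drop any trailing "ft.", "feat." or "featuring" credit
df_spotify['main_artist_clean'] = df_spotify['artist_name'].str.extract(
    r'^(.*?)(?:\s*\(?\s*(?:ft\.|feat\.|featuring)\s+.*)?$', flags=re.IGNORECASE
)[0].str.strip().str.lower()

# Left merge: tracks without matching metadata are kept, with country = NaN
df_spotify = df_spotify.merge(
    artist_meta[['artist_name', 'country']],
    left_on='main_artist_clean',
    right_on='artist_name',
    how='left'
)

# Both frames contain 'artist_name', so the merge suffixes them with _x / _y.
# Drop the helper key and the metadata copy, then restore the original name.
df_spotify = df_spotify.drop(columns=['main_artist_clean', 'artist_name_y'])
df_spotify = df_spotify.rename(columns={'artist_name_x': 'artist_name'})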

artist_meta.head()

Unnamed: 0.1,Unnamed: 0,index,artist,gender,age,type,country,city_1,district_1,city_2,district_2,city_3,district_3,artist_name
0,0,0,Drake,male,33,person,CA,,,Toronto,,,,drake
1,1,1,Post Malone,male,25,person,US,,,Syracuse,,,,post malone
2,2,2,Ed Sheeran,male,29,person,GB,,,Halifax,,,,ed sheeran
3,3,3,J Balvin,male,35,person,CO,,,Medellín,,,,j balvin
4,4,4,Bad Bunny,male,26,person,PR,,,San Juan,,,,bad bunny
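
To illustrate what the name-cleaning pattern and the left merge do, here is a small sketch on invented artist names. It also shows one possible safeguard, assuming the metadata file could list the same artist more than once (not verified here): deduplicating the key and passing `validate="many_to_one"` so that the merge cannot multiply track rows. The names `FEAT_PATTERN`, `meta`, `tracks` and `artist_key` are hypothetical.

```python
import re

import pandas as pd

# Same pattern as in the cell above: capture everything before an optional
# "ft." / "feat." / "featuring" credit (optionally opened by a parenthesis)
FEAT_PATTERN = r'^(.*?)(?:\s*\(?\s*(?:ft\.|feat\.|featuring)\s+.*)?$'

names = pd.Series(
    ["Diplo", "Drake feat. Rihanna", "Bad Bunny (ft. J Balvin)", "  Ed Sheeran "]
)
clean = (
    names.str.extract(FEAT_PATTERN, flags=re.IGNORECASE)[0].str.strip().str.lower()
)

# Toy metadata with a deliberately duplicated artist (invented rows)
meta = pd.DataFrame(
    {"artist_name": ["drake", "diplo", "drake"], "country": ["CA", "US", "CA"]}
)
# Keep one row per artist and rename the key to avoid _x / _y suffixes
meta_unique = meta.drop_duplicates(subset="artist_name").rename(
    columns={"artist_name": "artist_key"}
)

tracks = pd.DataFrame({"artist_name": names, "main_artist_clean": clean})
merged = tracks.merge(
    meta_unique,
    left_on="main_artist_clean",
    right_on="artist_key",
    how="left",
    validate="many_to_one",  # raises if the right-hand key is not unique
).drop(columns=["main_artist_clean", "artist_key"])

print(clean.tolist())
print(merged)
```

One limitation of the pattern is that it has no word boundary before `ft.`, so a name that happens to contain `ft.` followed by a space inside a word would be cut at that point.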


### 2.3 Top Songs 2010-2019 Dataset

In [7]:
df_top = pd.read_csv("data/top10s.csv", encoding='latin1')
df_top = df_top.drop(df_top.columns[0], axis=1) # Drop unnamed index column

df_top.head()

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78
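
The cell above reads the file as Latin-1 and then drops the leading unnamed index column. As a sketch of an alternative, assuming that first column carries no information beyond a row counter, `index_col=0` followed by `reset_index(drop=True)` gives the same frame. The in-memory CSV below is invented for the illustration.

```python
import io

import pandas as pd

# Invented CSV bytes encoded as Latin-1, with an unnamed leading index column
raw = (
    ",title,artist,year\n"
    "1,Song A,Café Band,2010\n"
    "2,Song B,Other Artist,2011\n"
).encode("latin1")

# Approach used in the notebook: read everything, then drop the first column
dropped = pd.read_csv(io.BytesIO(raw), encoding="latin1")
dropped = dropped.drop(dropped.columns[0], axis=1)

# Alternative: treat the first column as the index and discard it
via_index = pd.read_csv(
    io.BytesIO(raw), encoding="latin1", index_col=0
).reset_index(drop=True)

print(dropped.equals(via_index))
print(via_index)
```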


## **3. Data Cleaning**