# Spotify Music Recommendation: Content-Based Filtering
**Dataset:**
### Top 10000 Songs on Spotify 1950-Present
The best and biggest songs from ARIA & Billboard charts spanning 7 decades.
https://www.kaggle.com/datasets/joebeachcapital/top-10000-spotify-songs-1960-now/data

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix
import random
import math

## Load Dataset

In [2]:
file_path = '/content/drive/MyDrive/Colab Notebooks/DBS Coding Camp 2025/MLT/Proyek Kedua/top_10000_1950-now.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Track URI,Track Name,Artist URI(s),Artist Name(s),Album URI,Album Name,Album Artist URI(s),Album Artist Name(s),Album Release Date,Album Image URL,...,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Time Signature,Album Genres,Label,Copyrights
0,spotify:track:0vNPJrUrBnMFdCs8b2MTNG,Fader,spotify:artist:4W48hZAnAHVOC2c8WH8pcq,The Temper Trap,spotify:album:0V59MMtgoruvEqMv18KAOH,Conditions (Tour Edition),spotify:artist:4W48hZAnAHVOC2c8WH8pcq,The Temper Trap,2009,https://i.scdn.co/image/ab67616d0000b273f86ae8...,...,0.0353,0.000101,0.69,0.0752,0.158,134.974,4.0,,Liberation Records,"C 2010 Liberation Music, P 2010 Liberation Music"
1,spotify:track:0NpvdCO506uO58D4AbKzki,Sherry,spotify:artist:6mcrZQmgzFGRWf7C0SObou,Frankie Valli & The Four Seasons,spotify:album:0NUEQILaBzavnzcMEs4buZ,The Very Best of Frankie Valli & The 4 Seasons,spotify:artist:6mcrZQmgzFGRWf7C0SObou,Frankie Valli & The Four Seasons,2003-01-14,https://i.scdn.co/image/ab67616d0000b273b96c21...,...,0.0441,0.626,0.0,0.113,0.734,117.562,4.0,,Rhino,C © 2004 Bob Gaudio & Frankie Valli d/b/a The ...
2,spotify:track:1MtUq6Wp1eQ8PC6BbPCj8P,I Took A Pill In Ibiza - Seeb Remix,"spotify:artist:2KsP6tYLJlTBvSUxnwlVWa, spotify...","Mike Posner, Seeb",spotify:album:1Tz3Ai1guEFf4hV3d9i17K,"At Night, Alone.",spotify:artist:2KsP6tYLJlTBvSUxnwlVWa,Mike Posner,2016-05-06,https://i.scdn.co/image/ab67616d0000b273a19be7...,...,0.111,0.0353,8e-06,0.0843,0.71,101.969,4.0,,"Monster Mountain, LLC / Island","C © 2016 Island Records, a division of UMG Rec..."
3,spotify:track:59lq75uFIqzUZcgZ4CbqFG,Let Go for Tonight,spotify:artist:7qRll6DYV06u2VuRPAVqug,Foxes,spotify:album:5AQ7uKRSpAv7SNUl4j24ru,Glorious (Deluxe),spotify:artist:7qRll6DYV06u2VuRPAVqug,Foxes,2014-05-12,https://i.scdn.co/image/ab67616d0000b273ae5c7d...,...,0.0632,0.0429,2e-06,0.326,0.299,140.064,4.0,,Sign Of The Times Records,P (P) 2014 Sign Of The Times Limited under exc...
4,spotify:track:7KdcZQ3GJeGdserhK61kfv,The Way I Want To Touch You,spotify:artist:7BEfMxbaqx6dOpbtlEqScm,Captain & Tennille,spotify:album:3GUxesVyOehInaxJyCTh6d,Love Will Keep Us Together,spotify:artist:7BEfMxbaqx6dOpbtlEqScm,Captain & Tennille,1975-01-01,https://i.scdn.co/image/ab67616d0000b273e21a28...,...,0.0248,0.624,0.000112,0.343,0.597,111.29,4.0,,A&M,"C © 1975 A&M Records, P This Compilation ℗ 197..."


## Data Understanding & Cleaning

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 35 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Track URI             10000 non-null  object 
 1   Track Name            9998 non-null   object 
 2   Artist URI(s)         9998 non-null   object 
 3   Artist Name(s)        9998 non-null   object 
 4   Album URI             9998 non-null   object 
 5   Album Name            9998 non-null   object 
 6   Album Artist URI(s)   9998 non-null   object 
 7   Album Artist Name(s)  9998 non-null   object 
 8   Album Release Date    9998 non-null   object 
 9   Album Image URL       9996 non-null   object 
 10  Disc Number           10000 non-null  int64  
 11  Track Number          10000 non-null  int64  
 12  Track Duration (ms)   10000 non-null  int64  
 13  Track Preview URL     9937 non-null   object 
 14  Explicit              10000 non-null  bool   
 15  Popularity          

In [4]:
df.shape

(10000, 35)

The dataset has shape (10000, 35): 10,000 songs and 35 columns (metadata plus audio features).

In [5]:
# Audio feature columns used for similarity
audio_features = ['Danceability', 'Energy', 'Valence', 'Tempo', 'Loudness',
                  'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness']

# Drop rows with missing values in the audio features, Artist Name(s), Track Name, or Artist Genres
df = df.dropna(subset=audio_features + ['Artist Name(s)', 'Track Name', 'Artist Genres']).reset_index(drop=True)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9446 entries, 0 to 9445
Data columns (total 35 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Track URI             9446 non-null   object 
 1   Track Name            9446 non-null   object 
 2   Artist URI(s)         9446 non-null   object 
 3   Artist Name(s)        9446 non-null   object 
 4   Album URI             9446 non-null   object 
 5   Album Name            9446 non-null   object 
 6   Album Artist URI(s)   9446 non-null   object 
 7   Album Artist Name(s)  9446 non-null   object 
 8   Album Release Date    9446 non-null   object 
 9   Album Image URL       9446 non-null   object 
 10  Disc Number           9446 non-null   int64  
 11  Track Number          9446 non-null   int64  
 12  Track Duration (ms)   9446 non-null   int64  
 13  Track Preview URL     9397 non-null   object 
 14  Explicit              9446 non-null   bool   
 15  Popularity           

In [6]:
# Extract the "primary artist" (first listed name) from the 'Artist Name(s)' column
def get_primary_artist(artist_str):
    if isinstance(artist_str, str) and artist_str.strip() != '':
        return artist_str.split(',')[0].strip()
    return ''

# Extract the "primary genre" (first listed genre) from the 'Artist Genres' column
def get_primary_genre(genres):
    if isinstance(genres, str) and genres.strip() != '':
        return genres.split(',')[0].strip()
    return ''

# Apply the extraction to a copy of the DataFrame, c_df
c_df = df.copy()
c_df['Primary Artist'] = c_df['Artist Name(s)'].apply(get_primary_artist)
c_df['Primary Genre'] = c_df['Artist Genres'].apply(get_primary_genre)

# Create a unique ID for each track after cleaning
df_clean = c_df.copy().reset_index(drop=True)
df_clean['Track ID'] = df_clean.index
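
As a quick check of the extraction logic above, the sketch below applies the same first-item rule to a few sample strings. `first_item` is a hypothetical helper name that does not appear in the notebook, and the sample strings are illustrative only.

```python
def first_item(value, sep=','):
    """Return the first `sep`-delimited item of a string, or '' for anything else."""
    if isinstance(value, str) and value.strip():
        return value.split(sep)[0].strip()
    return ''

# Illustrative inputs: a multi-artist string, a single artist,
# an empty string, a missing value, and a name that contains a comma.
samples = ['Mike Posner, Seeb', 'Foxes', '', float('nan'), 'Earth, Wind & Fire']
for s in samples:
    print(repr(s), '->', repr(first_item(s)))
```

The last sample shows a limitation of splitting on commas: an artist name that itself contains a comma is cut at the first comma.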

In [7]:
# Drop duplicate titles
# If a Track Name appears more than once, keep only the entry with the highest Popularity
df_clean = df_clean.sort_values('Popularity', ascending=False).drop_duplicates(subset=['Track Name'])
df_clean = df_clean.reset_index(drop=True)
# Reassign Track ID after dropping duplicates
df_clean['Track ID'] = df_clean.index

In [14]:
df_clean.shape

(7813, 38)
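
Deduplicating on `Track Name` alone treats different songs that happen to share a title as duplicates. The sketch below, on made-up data, compares that key with a title plus primary artist key. The toy rows and the names `by_title` and `by_title_artist` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Made-up rows: two different artists share the title 'Faded'.
toy = pd.DataFrame({
    'Track Name': ['Faded', 'Faded', 'Hello'],
    'Primary Artist': ['Artist A', 'Artist B', 'Artist C'],
    'Popularity': [80, 60, 70],
})

ranked = toy.sort_values('Popularity', ascending=False)

# Key used in the notebook: title only.
by_title = ranked.drop_duplicates(subset=['Track Name'])

# Alternative key: title plus primary artist.
by_title_artist = ranked.drop_duplicates(subset=['Track Name', 'Primary Artist'])

print(len(by_title), len(by_title_artist))
```

Title-only deduplication keeps titles unique, which a title-keyed lookup relies on. Switching to the wider key would therefore also require a lookup keyed on both title and artist.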

## Data Preparation

Normalize the audio features with MinMaxScaler so that each one lies in the range 0 to 1. Convert the 'Artist Genres' column into TF-IDF vectors (at most 200 features) to capture genre information. Convert the audio feature matrix to sparse format for efficiency. Finally, stack the sparse audio matrix and the genre TF-IDF matrix into a single combined feature matrix.

In [8]:
# Scale the audio features to the range 0-1 with MinMaxScaler
scaler = MinMaxScaler()
df_clean[audio_features] = scaler.fit_transform(df_clean[audio_features])

# Build TF-IDF vectors from the 'Artist Genres' column
vectorizer = TfidfVectorizer(stop_words='english', max_features=200)  # cap the dimensionality
genre_tfidf = vectorizer.fit_transform(df_clean['Artist Genres'])  # sparse matrix shape=(n_songs, n_genre)

# Convert the audio feature matrix to sparse format
audio_matrix = csr_matrix(df_clean[audio_features].values)

# Combine the audio features with the genre TF-IDF features
combined_features = hstack([audio_matrix, genre_tfidf])  # shape: (n_songs, n_audio + n_genre)
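
To make the shape of the combined matrix concrete, here is a minimal sketch of the same scale, vectorize, and stack steps on three made-up rows. The toy values, the two-column audio subset, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Made-up rows with two audio features and a comma-separated genre string.
toy = pd.DataFrame({
    'Energy': [0.2, 0.8, 0.5],
    'Tempo': [90.0, 150.0, 120.0],
    'Artist Genres': ['dance pop,pop', 'dutch house,edm', 'pop'],
})

# Scale each audio column to [0, 1].
audio = MinMaxScaler().fit_transform(toy[['Energy', 'Tempo']])

# TF-IDF with the default tokenizer, which splits on word boundaries.
vec = TfidfVectorizer()
genres = vec.fit_transform(toy['Artist Genres'])

# Stack audio columns and genre columns side by side.
combined = hstack([csr_matrix(audio), genres]).tocsr()

print(sorted(vec.vocabulary_))  # the learned genre tokens
print(combined.shape)           # (rows, audio columns + genre tokens)
```

Note that the default tokenizer splits multi-word genres such as `dutch house` into separate word tokens, so the TF-IDF columns represent genre words rather than whole genre labels.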

### Compute the Similarity Matrix
Compute the cosine similarity between every pair of rows (songs) in the combined feature matrix. The result, sim_matrix, has shape (n_songs, n_songs).

In [9]:
# Cosine similarity (sparse)
sim_matrix = cosine_similarity(combined_features, dense_output=False)
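
As a sanity check on what this call computes, the sketch below compares `cosine_similarity` with the textbook formula (dot product divided by the product of the vector norms) on a tiny made-up matrix. The matrix `X` and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Rows 0 and 1 are identical; row 2 is orthogonal to both.
X = csr_matrix(np.array([[1.0, 0.0, 1.0],
                         [1.0, 0.0, 1.0],
                         [0.0, 2.0, 0.0]]))

S = cosine_similarity(X, dense_output=False)  # sparse (3, 3) result

# Manual cosine similarity between rows 0 and 2.
a = X[0].toarray().ravel()
b = X[2].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(S.shape)          # one row and one column per input row
print(S[0, 1], S[0, 2], manual)
```

Identical rows score 1 and orthogonal rows score 0. Because the scaled audio features are non-negative and rarely all zero, most song pairs likely have a nonzero similarity, so the sparse output may save little memory compared with a dense array.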

### Mappings for Recommendation
- track_to_idx: maps a track title to its Track ID (index)
- idx_to_track: maps a Track ID to its track title
- idx_to_artist: maps a Track ID to its primary artist
- idx_to_genre: maps a Track ID to its primary genre

In [10]:
# Lookup mappings for recommendation
track_to_idx = pd.Series(df_clean['Track ID'].values, index=df_clean['Track Name']).to_dict()
idx_to_track = pd.Series(df_clean['Track Name'].values, index=df_clean['Track ID']).to_dict()
idx_to_artist = pd.Series(df_clean['Primary Artist'].values, index=df_clean['Track ID']).to_dict()
idx_to_genre = pd.Series(df_clean['Primary Genre'].values, index=df_clean['Track ID']).to_dict()

# Recommendation function: given a track title and k, return the top k most similar tracks
def recommend(track_name, top_k=10):
    if track_name not in track_to_idx:
        print(f"Track '{track_name}' was not found in the dataset.")
        return []
    idx = track_to_idx[track_name]
    sim_scores = list(enumerate(sim_matrix[idx].toarray().ravel()))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    recommendations = []
    count = 0
    for i, score in sim_scores:
        if i == idx:
            continue
        recommendations.append((idx_to_track[i], idx_to_artist[i], idx_to_genre[i], score))
        count += 1
        if count >= top_k:
            break
    return recommendations

# Test the recommender
example_track = df_clean['Track Name'].iloc[0]
print("Example track:", example_track)
print("5 similar tracks:")
for track, artist, genre, score in recommend(example_track, top_k=5):
    print(f"{track} - {artist} [{genre}] (score: {score:.4f})")

Example track: Espresso
5 similar tracks:
Misery - Maroon 5 [pop] (score: 0.9978)
Up All Night - Khalid [pop] (score: 0.9965)
Sweet but Psycho - Ava Max [pop] (score: 0.9959)
One More Night - Maroon 5 [pop] (score: 0.9958)
Teenage Dream - Katy Perry [pop] (score: 0.9958)
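
The function above sorts every score to find the top k. As an alternative sketch, the similarity row for a single query can be computed on demand and the top k selected with `np.argpartition`, which avoids both the full sort and the precomputed n-by-n matrix. `top_k_similar` and the toy `features` array are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(features, idx, k):
    """Return (index, score) pairs for the k rows most similar to row idx."""
    # Similarity of the query row against all rows (dense 1-D array).
    scores = cosine_similarity(features[idx:idx + 1], features).ravel()
    scores[idx] = -np.inf                      # exclude the query itself
    k = min(k, len(scores) - 1)
    top = np.argpartition(-scores, k - 1)[:k]  # top k, in arbitrary order
    top = top[np.argsort(-scores[top])]        # order by descending score
    return [(int(i), float(scores[i])) for i in top]

# Made-up feature rows: row 1 is closest to row 0, then row 3, then row 2.
features = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [0.7, 0.7]])

print(top_k_similar(features, 0, 2))
```

The slice `features[idx:idx + 1]` keeps the query two-dimensional, which works for both NumPy arrays and SciPy CSR matrices.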


## Model Evaluation

In [11]:
# Evaluation metrics (Precision@10, Recall@10, NDCG@10) based on Primary Genre
def precision_at_k(relevant, recommended, k):
    recommended_k = recommended[:k]
    if not recommended_k:
        return 0.0
    true_positives = sum([1 for r in recommended_k if r in relevant])
    return true_positives / k


def recall_at_k(relevant, recommended, k):
    recommended_k = recommended[:k]
    if not relevant:
        return 0.0
    true_positives = sum([1 for r in recommended_k if r in relevant])
    return true_positives / len(relevant)


def ndcg_at_k(relevant, recommended, k):
    dcg = 0.0
    for i, rec in enumerate(recommended[:k]):
        if rec in relevant:
            dcg += 1 / math.log2(i + 2)
    ideal_rels = min(len(relevant), k)
    idcg = sum([1 / math.log2(i + 2) for i in range(ideal_rels)])
    return dcg / idcg if idcg > 0 else 0.0
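
To see how the three metrics behave, here is a small worked example on made-up labels. The functions are compact restatements of the definitions above so that the block runs on its own, and the `relevant` and `recommended` values are illustrative.

```python
import math

def precision_at_k(relevant, recommended, k):
    # Fraction of the k slots filled with relevant items.
    return sum(r in relevant for r in recommended[:k]) / k

def recall_at_k(relevant, recommended, k):
    # Fraction of all relevant items found in the top k.
    if not relevant:
        return 0.0
    return sum(r in relevant for r in recommended[:k]) / len(relevant)

def ndcg_at_k(relevant, recommended, k):
    # Binary-relevance DCG, normalized by the best achievable DCG.
    dcg = sum(1 / math.log2(i + 2)
              for i, r in enumerate(recommended[:k]) if r in relevant)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

relevant = {'A', 'B', 'C', 'D'}   # four relevant items
recommended = ['A', 'X', 'B']     # hits at ranks 1 and 3
k = 3

p = precision_at_k(relevant, recommended, k)   # 2 hits / 3 slots
r = recall_at_k(relevant, recommended, k)      # 2 hits / 4 relevant
n = ndcg_at_k(relevant, recommended, k)        # (1 + 1/2) / (1 + 1/log2(3) + 1/2)
print(round(p, 4), round(r, 4), round(n, 4))
```

The miss at rank 2 lowers NDCG more than a miss at rank 3 would, because earlier positions carry larger weights.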

In [12]:
# Sample queries for evaluation (100 random tracks)
random.seed(42)
sample_indices = random.sample(list(df_clean['Track ID']), 100)
eval_results = []
K = 10

for q_idx in sample_indices:
    # Define the relevant set as all songs with the same Primary Genre (excluding the query itself)
    query_genre = idx_to_genre[q_idx]
    relevant_tracks = set(
        df_clean.loc[
            (df_clean['Primary Genre'] == query_genre) & (df_clean['Track ID'] != q_idx),
            'Track Name'
        ].values
    )
    query_track = idx_to_track[q_idx]
    recs = [name for name, artist, genre, _ in recommend(query_track, top_k=K)]
    prec = precision_at_k(relevant_tracks, recs, K)
    rec_score = recall_at_k(relevant_tracks, recs, K)
    ndcg = ndcg_at_k(relevant_tracks, recs, K)
    eval_results.append((prec, rec_score, ndcg))

# Average the metrics over all sampled queries
eval_df = pd.DataFrame(eval_results, columns=['Precision@10', 'Recall@10', 'NDCG@10'])
print("\nAverage evaluation results:")
print(eval_df.mean())


Average evaluation results:
Precision@10    0.702000
Recall@10       0.138689
NDCG@10         0.752749
dtype: float64


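A Precision@10 of 0.70 means that, on average, about 7 of every 10 recommended songs share the query's primary genre, and an NDCG@10 of 0.75 indicates that those matches tend to appear near the top of the list. These are solid results for content-based filtering on this dataset (7,813 songs after cleaning), with the caveat that genre TF-IDF is itself one of the input features, so the metric partly rewards a signal the model was given directly. The lower Recall@10 (0.14) is expected: a genre can contain far more than 10 songs while only 10 recommendations are returned, so recall is capped at 10 divided by the number of relevant songs.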
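
The ceiling on recall can be made explicit with a few lines of arithmetic. In the sketch below, `max_recall_at_k` is a hypothetical helper and the genre sizes are made-up values chosen only to show the trend.

```python
def max_recall_at_k(n_relevant, k):
    """Best possible recall when only k items can be returned."""
    if n_relevant <= 0:
        return 0.0
    return min(n_relevant, k) / n_relevant

# Hypothetical genre sizes: small, medium, large.
for n_relevant in (5, 50, 500):
    print(n_relevant, max_recall_at_k(n_relevant, 10))
```

Even a perfect recommender cannot exceed this bound, so Recall@10 is best read relative to the size of each query's genre rather than as an absolute score.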

In [13]:
# Recommend 10 songs
query = input("Enter a song title (Track Name): ")
results = recommend(query, top_k=10)
if results:
    print(f"\n10 songs similar to '{query}':")
    for i, (track, artist, genre, score) in enumerate(results, start=1):
        print(f"{i}. {track} - {artist} [{genre}] (similarity: {score:.4f})")

Enter a song title (Track Name): Faded

10 songs similar to 'Faded':
1. Million Voices - Radio Edit - Otto Knows [edm] (similarity: 0.9287)
2. Satisfaction - Uk Radio Edit - Benny Benassi [dutch house] (similarity: 0.9168)
3. Summertime Sadness (Lana Del Rey Vs. Cedric Gervais) - Cedric Gervais Remix - Lana Del Rey [art pop] (similarity: 0.9066)
4. Something New - Axwell /\ Ingrosso [edm] (similarity: 0.8986)
5. Hurricane - Radio Edit - Dzeko & Torres [dutch house] (similarity: 0.8945)
6. Young And Beautiful [Lana Del Rey vs. Cedric Gervais] - Cedric Gervais Remix Radio Edit - Lana Del Rey [art pop] (similarity: 0.8913)
7. Tsunami (Jump) - Radio Edit - DVBBS [canadian electronic] (similarity: 0.8898)
8. Tsunami (Jump) [feat. Tinie Tempah] - DVBBS [canadian electronic] (similarity: 0.8898)
9. Legacy - Radio Edit - Nicky Romero [big room] (similarity: 0.8886)
10. You’ll Be Mine - Havana Brown [australian dance] (similarity: 0.8848)
