# Spotify Song Recommendations - Muchammad Wildan Alkautsar

## Import Library

In [1]:
import numpy as np 
import pandas as pd
import zipfile
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score, f1_score

Libraries used in this project.

## Loading the Data

In [2]:
with zipfile.ZipFile("Spotify.zip", "r") as z:
    file_list = z.namelist()

print("Files in the ZIP:", file_list)

Files in the ZIP: ['top10s.csv']


In [3]:
with zipfile.ZipFile("Spotify.zip", "r") as z:
    with z.open("top10s.csv") as f:  
        df = pd.read_csv(f, encoding='latin1')

# Show first 5 rows
df.head().set_index('Unnamed: 0')

Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


The first five rows of the dataset.

## Univariate Exploratory Data Analysis


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  603 non-null    int64 
 1   title       603 non-null    object
 2   artist      603 non-null    object
 3   top genre   603 non-null    object
 4   year        603 non-null    int64 
 5   bpm         603 non-null    int64 
 6   nrgy        603 non-null    int64 
 7   dnce        603 non-null    int64 
 8   dB          603 non-null    int64 
 9   live        603 non-null    int64 
 10  val         603 non-null    int64 
 11  dur         603 non-null    int64 
 12  acous       603 non-null    int64 
 13  spch        603 non-null    int64 
 14  pop         603 non-null    int64 
dtypes: int64(12), object(3)
memory usage: 70.8+ KB


Summary of the columns, their non-null counts, and their data types.

Data dictionary

- **title**: The title of the song  
- **artist**: The artist of the song  
- **top genre**: The genre of the song  
- **year**: The year the song appeared on the Billboard chart  
- **bpm**: Beats per minute - the tempo of the song  
- **nrgy**: The energy of the song - higher values mean more energetic (fast, loud)  
- **dnce**: The danceability of the song - higher values mean it's easier to dance to  
- **dB**: Decibel - the loudness of the song  
- **live**: Liveness - likelihood that the song was recorded with a live audience  
- **val**: Valence - higher values mean a more positive sound (happy, cheerful)  
- **dur**: The duration of the song  
- **acous**: The acousticness of the song - likelihood that the song is acoustic  
- **spch**: Speechiness - higher values mean more spoken words  
- **pop**: Popularity - higher values mean more popular  

In [5]:
df.describe()

,Unnamed: 0,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
count,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0,603.0
mean,302.0,2014.59204,118.545605,70.504146,64.379768,-5.578773,17.774461,52.225539,224.674959,14.3267,8.358209,66.52073
std,174.215384,2.607057,24.795358,16.310664,13.378718,2.79802,13.102543,22.51302,34.130059,20.766165,7.483162,14.517746
min,1.0,2010.0,0.0,0.0,0.0,-60.0,0.0,0.0,134.0,0.0,0.0,0.0
25%,151.5,2013.0,100.0,61.0,57.0,-6.0,9.0,35.0,202.0,2.0,4.0,60.0
50%,302.0,2015.0,120.0,74.0,66.0,-5.0,12.0,52.0,221.0,6.0,5.0,69.0
75%,452.5,2017.0,129.0,82.0,73.0,-4.0,24.0,69.0,239.5,17.0,9.0,76.0
max,603.0,2019.0,206.0,98.0,97.0,-2.0,74.0,98.0,424.0,99.0,48.0,99.0


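Descriptive statistics for the numeric columns. The minimum row shows a bpm of 0 and a loudness of -60 dB, which may indicate at least one record with placeholder values.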


In [159]:
len(df)

603

The dataset contains 603 rows.

## Data Preprocessing


In [160]:
df.isna().sum()

Unnamed: 0    0
title         0
artist        0
top genre     0
year          0
bpm           0
nrgy          0
dnce          0
dB            0
live          0
val           0
dur           0
acous         0
spch          0
pop           0
dtype: int64

The dataset has no missing values.

In [161]:
df.duplicated().sum()

0

The dataset has no fully duplicated rows.
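`df.duplicated()` compares entire rows, so it does not reveal whether the same title appears more than once, which matters later when songs are looked up by title. The following is a minimal sketch on a made-up three-row frame (the `toy` data and variable names are assumptions for illustration, not part of the dataset) showing the difference between the two checks:

```python
import pandas as pd

# Hypothetical toy data: two rows share a title but differ in artist
toy = pd.DataFrame({
    "title": ["Hello", "Hello", "Stay"],
    "artist": ["Artist X", "Artist Y", "Artist Z"],
})

# Whole-row comparison: no row is an exact copy of another
full_row_duplicates = int(toy.duplicated().sum())

# Comparison on the title column only: the second "Hello" is flagged
title_duplicates = int(toy.duplicated(subset="title").sum())

print(full_row_duplicates, title_duplicates)
```

Running the same `duplicated(subset="title")` check on `df` would show whether any title in the real data is repeated.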

## Data Preparation


In [162]:
df['top genre'].unique()

array(['neo mellow', 'detroit hip hop', 'dance pop', 'pop',
       'canadian pop', 'hip pop', 'barbadian pop', 'atl hip hop',
       'australian pop', 'indie pop', 'art pop', 'colombian pop',
       'big room', 'british soul', 'chicago rap', 'acoustic pop',
       'permanent wave', 'boy band', 'baroque pop', 'celtic rock',
       'electro', 'complextro', 'canadian hip hop', 'candy pop',
       'alaska indie', 'folk-pop', 'metropopolis', 'house',
       'australian hip hop', 'electropop', 'australian dance',
       'hollywood', 'canadian contemporary r&b',
       'irish singer-songwriter', 'tropical house', 'belgian edm',
       'french indie pop', 'hip hop', 'danish pop', 'latin',
       'canadian latin', 'electronic trap', 'edm', 'electro house',
       'downtempo', 'brostep', 'contemporary country', 'moroccan pop',
       'escape room', 'alternative r&b'], dtype=object)

df['top genre'].unique() returns an array of unique genre names present in the "top genre" column of the dataset.

In [163]:
df = df[['title', 'top genre']]

Because the recommendations are based on genre alone, only the title and top genre columns are needed.

In [164]:
df

,title,top genre
0,"Hey, Soul Sister",neo mellow
1,Love The Way You Lie,detroit hip hop
2,TiK ToK,dance pop
3,Bad Romance,dance pop
4,Just the Way You Are,pop
...,...,...
598,Find U Again (feat. Camila Cabello),dance pop
599,Cross Me (feat. Chance the Rapper & PnB Rock),pop
600,"No Brainer (feat. Justin Bieber, Chance the Ra...",dance pop
601,Nothing Breaks Like a Heart (feat. Miley Cyrus),dance pop


In [165]:
from sklearn.feature_extraction.text import TfidfVectorizer
 
tf = TfidfVectorizer()

# Fit the vectorizer and transform the genres into a matrix
tfidf_matrix = tf.fit_transform(df['top genre']) 
 
# Inspect the shape of the TF-IDF matrix
tfidf_matrix.shape 

(603, 55)

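The code uses TfidfVectorizer() to convert the "top genre" column into a TF-IDF matrix. The tfidf_matrix.shape command then returns the dimensions of the resulting matrix. Here the result is (603, 55): 603 rows (songs) and 55 columns, one for each distinct word token found in the genre labels. The columns are word tokens rather than whole genre names, so a label such as "dance pop" contributes to both the "dance" and the "pop" column.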

## Model Development with Content-Based Filtering


In [167]:
df_tfidf = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=tf.get_feature_names_out(),  # Column names from the fitted vectorizer
    index=df.title
)

# Check the number of features and songs available
num_features = df_tfidf.shape[1]
num_games = df_tfidf.shape[0]

# Take a sample no larger than the available columns and rows
df_tfidf.sample(min(22, num_features), axis=1).sample(min(10, num_games), axis=0)

title,moroccan,rap,room,barbadian,soul,trap,hip,pop,detroit,irish,...,contemporary,colombian,songwriter,folk,downtempo,band,country,hollywood,neo,danish
Supplies,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Body Say,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
International Love,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Light It Up (feat. Nyla & Fuse ODG) [Remix],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
We Are One (Ole Ola) [The Official 2014 FIFA World Cup Song],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
This Town,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sexy Bitch (feat. Akon),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Where Have You Been,0.0,0.0,0.0,0.965627,0.0,0.0,0.0,0.259933,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waves - Robin Schulz Radio Edit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Wide Awake,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.616413,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The code converts the TF-IDF matrix into a Pandas DataFrame, where the rows represent song titles and the columns represent unique genre terms extracted by the vectorizer. It then calculates the number of features (num_features) and the number of songs (num_games) in the dataset. Finally, it selects a random subset of up to 22 features and 10 songs to sample for inspection.

In [168]:
from sklearn.metrics.pairwise import cosine_similarity
 
# Compute pairwise cosine similarity on the TF-IDF matrix
cosine_sim = cosine_similarity(tfidf_matrix) 
cosine_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 1., 1., 0.],
       ...,
       [0., 0., 1., ..., 1., 1., 0.],
       [0., 0., 1., ..., 1., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

TF-IDF (Term Frequency-Inverse Document Frequency) vectorization is a technique in Natural Language Processing (NLP) that converts text into numerical features by weighting each term by how often it appears in a document (term frequency) and how rare it is across the entire corpus (inverse document frequency). In the earlier cell, TfidfVectorizer() transforms the df['top genre'] column into a TF-IDF feature matrix. The fit_transform() method first learns the vocabulary and IDF weights by tokenizing each genre label, and then returns the data as a sparse matrix. That matrix is the input to the content-based recommendation step that follows.

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them. In the code above, cosine_similarity(tfidf_matrix) computes the pairwise cosine similarity between all rows of tfidf_matrix. The result is a 603 x 603 matrix in which entry (i, j) is the similarity between the genre vectors of the i-th and j-th songs. Because TF-IDF values are non-negative, every score lies between 0 and 1: a value of 1 means the two songs have the same genre terms, and a value of 0 means they share none. This technique is commonly used in content-based recommendation systems to find items with similar characteristics.
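As a small worked example, the sketch below computes the similarity matrix for four genre labels and checks it against a manual calculation (the dot product of length-normalised rows). The sample labels and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample labels: two identical, one overlapping, one unrelated
genres = ["dance pop", "dance pop", "pop", "big room"]
tfidf = TfidfVectorizer().fit_transform(genres)

# Pairwise cosine similarity between all rows
sim = cosine_similarity(tfidf)

# Manual version: normalise each row to unit length, then take dot products
dense = tfidf.toarray()
norms = np.linalg.norm(dense, axis=1, keepdims=True)
unit = dense / norms
manual = unit @ unit.T

print(np.round(sim, 2))
```

Identical labels score 1, "dance pop" against "pop" scores somewhere between 0 and 1 because only one term is shared, and "dance pop" against "big room" scores 0.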

In [169]:
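def song_recommendations(title, similarity_data=cosine_sim, items=df, k=5):
    # Map each title to its row position in the similarity matrix
    indices = pd.Series(items.index, index=items["title"]).drop_duplicates()

    if title not in indices:
        return f"Title '{title}' was not found in the dataset."

    idx = indices[title]
    # Pair every row position with its similarity to the input song
    sim_scores = list(enumerate(similarity_data[idx]))
    # Sort by similarity, highest first (the sort is stable, so ties keep row order)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the first sorted entry and keep the next k
    sim_scores = sim_scores[1:k+1]
    song_indices = [i[0] for i in sim_scores]

    return items.iloc[song_indices][["title", "top genre"]]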


The song_recommendations() function recommends songs based on genre similarity using cosine similarity. First, it creates a Series that maps each song title to its row index. If the given title is not found in the dataset, the function returns an error message as a string. Otherwise, it retrieves the song's index, looks up its cosine similarity with every song, and sorts the results in descending order. It then skips the first sorted entry and keeps the next k entries (five by default). Skipping the first entry is meant to exclude the input song itself, but this only works when the input song sorts first. When several songs tie at the maximum similarity, the skipped entry can be a different song and the input song can remain among the results. Finally, the function returns a DataFrame with the titles and genres of the selected songs.
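Because the slice `[1:k+1]` removes whichever entry happens to sort first, one alternative is to exclude the input song by its row position instead. The following is a minimal sketch of that idea, not the notebook's method: the function name `recommend_excluding_self`, the toy `songs` frame, and the choice to return an empty DataFrame for an unknown title are all assumptions made for illustration. It also assumes the default integer index, as in `df`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: three songs tie on genre, one differs
songs = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "top genre": ["big room", "big room", "big room", "dance pop"],
})
sim = cosine_similarity(TfidfVectorizer().fit_transform(songs["top genre"]))

def recommend_excluding_self(title, similarity, items, k=5):
    # Row positions whose title matches the query
    positions = np.flatnonzero(items["title"].to_numpy() == title)
    if positions.size == 0:
        # Unknown title: return an empty frame with the same columns
        return items.iloc[0:0][["title", "top genre"]]
    idx = positions[0]  # use the first match if a title is repeated
    # Rank all rows by similarity, highest first; stable sort keeps row order on ties
    order = np.argsort(-similarity[idx], kind="stable")
    # Remove the query row explicitly, then keep the top k
    order = order[order != idx][:k]
    return items.iloc[order][["title", "top genre"]]

print(recommend_excluding_self("C", sim, songs, k=2))
```

Returning an empty DataFrame rather than a string keeps the return type consistent, so downstream code that checks `.empty` works for both cases.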









In [170]:
recommendations = song_recommendations("Broken Arrows", k=3)
print(recommendations)

                                  title top genre
140                          Wake Me Up  big room
146  Don't You Worry Child - Radio Edit  big room
217                         Hey Brother  big room


The call song_recommendations("Broken Arrows", k=3) retrieves three songs from the top of the similarity ranking for "Broken Arrows". The output is a DataFrame containing the recommended songs along with their genres. All three are labeled "big room", so the recommendations match the input song's genre label.

## Evaluation

In [171]:
def calculate_precision(title, recommendations):
    if recommendations.empty:
        return 0.0  
    
    song_genre = df[df["title"] == title]["top genre"].values[0]
    relevant_recommendations = sum(recommendations["top genre"] == song_genre)
    
    precision = relevant_recommendations / len(recommendations)
    return precision

The calculate_precision() function evaluates a set of recommendations by calculating precision. It first checks whether the recommendations DataFrame is empty; if so, it returns 0.0. Then, it retrieves the genre of the input song and counts how many recommended songs have exactly the same genre label. Precision is the ratio of relevant recommendations (same genre) to the total number of recommendations. A higher value means that a larger share of the recommended songs match the input song's genre.
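As a worked example of the ratio, the sketch below applies the same calculation to a made-up set of three recommendations in which two match the query genre. The helper `precision_for_genre` and the toy `recs` frame are hypothetical and exist only to illustrate the arithmetic.

```python
import pandas as pd

def precision_for_genre(query_genre, recommended):
    # Share of recommended rows whose genre equals the query genre
    if recommended.empty:
        return 0.0
    return float((recommended["top genre"] == query_genre).mean())

# Hypothetical recommendations: two of three match "big room"
recs = pd.DataFrame({
    "title": ["Song X", "Song Y", "Song Z"],
    "top genre": ["big room", "big room", "edm"],
})

p = precision_for_genre("big room", recs)
print(f"Precision: {p:.2f}")
```

Two matches out of three recommendations give a precision of 2/3.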

In [173]:
# Example usage
title_input = "Broken Arrows"
recommendations = song_recommendations(title_input, k=5)
# Named `precision` to avoid shadowing sklearn's imported precision_score
precision = calculate_precision(title_input, recommendations)

print("Recommended Songs:")
print(recommendations)
print(f"\nPrecision: {precision:.2f}")

Recommended Songs:
                                  title top genre
140                          Wake Me Up  big room
146  Don't You Worry Child - Radio Edit  big room
217                         Hey Brother  big room
327                       Broken Arrows  big room
340                Heroes (we could be)  big room

Precision: 1.00


The code snippet demonstrates how to use the song recommendation system and evaluate its precision. First, it sets "Broken Arrows" as the input title and retrieves five recommended songs using the song_recommendations() function. Then, it calculates the precision score by comparing the genres of the recommendations with the input song's genre using calculate_precision(). Finally, it prints the recommended songs along with their genres and displays the precision score, formatted to two decimal places, indicating the effectiveness of the recommendation system.

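The output shows that all five recommended songs belong to the same genre as the input song, "Broken Arrows", which is "big room". Since every recommended song matches the input song's genre, the precision score is 1.00 (100%). Two caveats apply. First, "Broken Arrows" itself appears in the list (row 327), so only four of the five entries are new suggestions; this happens because several songs tie at the maximum similarity and the input song is not the first entry in the sorted list. Second, the recommendations are ranked by genre similarity and precision is measured by genre match, so a high score is expected and shows that the ranking is consistent with the genre labels rather than how well it reflects listener preferences.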