# Background

Spotify is a digital music streaming service that offers access to millions of songs, podcasts, and videos from artists around the world. Founded in Sweden in 2006, it has grown into one of the leading music streaming platforms globally. As of 2021, it had over 365 million monthly active users, including more than 165 million paying subscribers.

Spotify operates under a freemium model, offering users a basic, ad-supported service for free, while premium subscribers pay for additional features like ad-free listening and offline playback. Spotify's vast catalog, spanning virtually every genre of music, allows users to not only listen to their favorite songs but also discover new music.

## Spotify's Recommender Engine

Spotify's recommender system is one of its defining features. It plays a crucial role in curating personalized content and creating a unique listening experience for each user. The engine leverages a mixture of collaborative filtering, natural language processing (NLP), audio analysis, and deep learning techniques to recommend songs and artists to users.

Collaborative filtering identifies patterns in user behavior to make recommendations. It operates under the assumption that if users A and B both like song X, and user A also likes song Y, then it's likely that user B will also enjoy song Y.
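To make that assumption concrete, here is a minimal sketch of user-based collaborative filtering on a tiny, made-up like matrix. The users, songs, and the `recommend_for` helper are hypothetical and only illustrate the idea; this is not how Spotify implements it.

```python
import numpy as np

# Hypothetical like matrix: rows are users, columns are songs (1 = liked).
users = ["A", "B", "C"]
songs = ["X", "Y", "Z"]
likes = np.array([
    [1, 1, 0],  # user A likes X and Y
    [1, 0, 0],  # user B likes X
    [0, 0, 1],  # user C likes Z
])

def recommend_for(user, likes, users, songs):
    """Rank songs the user has not liked by similarity-weighted votes."""
    u = users.index(user)
    # Cosine similarity between the target user and every user
    # (assumes every user has liked at least one song).
    norms = np.linalg.norm(likes, axis=1)
    sims = likes @ likes[u] / (norms * norms[u])
    sims[u] = 0.0                      # ignore the user's own row
    scores = sims @ likes              # weighted votes per song
    scores[likes[u] == 1] = -np.inf    # drop songs the user already likes
    order = np.argsort(-scores)
    return [songs[i] for i in order if scores[i] > 0]

print(recommend_for("B", likes, users, songs))
```

Here user B shares song X with user A, so A's other liked song, Y, is the only suggestion for B. User C shares nothing with B, so C's likes carry no weight.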

Natural language processing (NLP) is used to analyze textual data like playlist titles, song descriptions, and artist biographies to extract meaningful patterns and associations.

Audio analysis focuses on the music itself, extracting features like tempo, rhythm, key, and loudness. This approach allows Spotify to recommend songs that are musically similar.
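As a rough sketch of what "musically similar" can mean in practice, the example below compares tracks by the distance between their audio feature vectors. The track names and feature values are invented, and `most_similar` is a hypothetical helper; only tempo and loudness are used to keep the example small.

```python
import numpy as np

# Hypothetical audio features per track: tempo (BPM) and loudness (dB).
track_names = ["ballad", "slow_jam", "dance_hit"]
features = np.array([
    [70.0, -12.0],
    [75.0, -11.0],
    [128.0, -4.0],
])

# Standardize each column so that tempo does not dominate the distance.
z = (features - features.mean(axis=0)) / features.std(axis=0)

def most_similar(index, z):
    """Return the index of the track closest to `index` in feature space."""
    dists = np.linalg.norm(z - z[index], axis=1)
    dists[index] = np.inf  # exclude the track itself
    return int(np.argmin(dists))

print(track_names[most_similar(0, z)])
```

Standardizing first matters because the features live on very different scales: a difference of 5 BPM should not automatically outweigh a difference of 5 dB.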

Deep learning, which relies on multi-layer neural networks, helps Spotify make recommendations by learning from vast amounts of data and recognizing complex patterns. These networks are trained using both explicit feedback (such as likes) and implicit feedback (such as skips, the amount of time a user spends listening to a song, or the time of day they listen).

Spotify's personalized playlists, such as Discover Weekly and Daily Mixes, are examples of the recommender engine in action. These playlists, tailored to each user's unique musical tastes and listening habits, exemplify the power of machine learning in creating personalized experiences.

Before we proceed with the workshop, I highly encourage everyone to watch this video to understand Spotify's recommender system in more detail. It will serve as the foundation for our work on recommender engines and deep learning for the rest of the week. In the video, Spotify's engineers share insights into the algorithms and datasets that power their recommendation engine.

https://www.youtube.com/watch?v=pGntmcy_HX8

Now, let's dive into the world of recommender systems by creating a simplified version of a music recommendation engine using Spotify's track data!

# Objective

The main objective of this code is to build a song recommendation engine using a Spotify tracks dataset. The dataset contains audio features for each song, such as acousticness, danceability, energy, instrumentalness, liveness, speechiness, and valence.

We will use these features to group, or cluster, the songs using the KMeans clustering algorithm, a popular machine learning algorithm for cluster analysis. The algorithm partitions the songs into clusters so that each song belongs to the cluster with the nearest mean (the centroid), which serves as the prototype of that cluster.
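The "nearest mean" idea can be sketched in a few lines of NumPy. This is a simplified illustration on toy data with hand-picked starting centroids; `kmeans_sketch` is a hypothetical helper, and the workshop itself uses scikit-learn's `KMeans` rather than this code.

```python
import numpy as np

def kmeans_sketch(points, centroids, n_iter=10):
    """Alternate between nearest-centroid assignment and mean updates."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy data: two obvious groups of (danceability, energy) pairs.
points = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
init = np.array([[0.0, 0.0], [1.0, 1.0]])  # hand-picked starting centroids

labels, centroids = kmeans_sketch(points, init)
print(labels)
print(centroids)
```

Each pass assigns every point to its closest centroid and then moves each centroid to the mean of its members, which is exactly why the centroid acts as the cluster's prototype.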

After the clusters are created, the model will recommend songs to the user based on their favorite songs. The recommendations are made by identifying the cluster that most of the user's favorite songs belong to, and then suggesting other songs from the same cluster. The assumption here is that songs within the same cluster have similar features and therefore the user is likely to enjoy them.

The final output of the code is a list of song recommendations that are similar in nature to the user's favorite songs.

# Import necessary libraries

First, we import the libraries that we will need for this project. This includes pandas for data manipulation and KMeans from sklearn.cluster for the clustering algorithm.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.cluster import KMeans

# Load the Spotify dataset

Next, we load the Spotify dataset into a pandas DataFrame. This assumes that the .csv file is located in the notebook's working directory. If it is not, provide the full path to the file.

In [2]:
# Load the Spotify dataset
# Ensure the .csv file is in the correct directory or provide a full path to the file
tracks_df = pd.read_csv("Spotify_tracks.csv")

In [3]:
tracks_df.describe()

|  | Unnamed: 0 | acousticness | danceability | disc_number | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | popularity | speechiness | tempo | time_signature | track_number | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 | 101939.0 |
| mean | 50969.0 | 0.352124 | 0.586015 | 1.032166 | 246770.8 | 0.586479 | 0.148776 | 5.270858 | 0.19764 | -9.46272 | 0.618154 | 39.782311 | 0.128841 | 118.358527 | 3.875651 | 4.60806 | 0.482813 |
| std | 29427.398883 | 0.334855 | 0.177724 | 0.566789 | 190430.3 | 0.26017 | 0.304024 | 3.577679 | 0.175391 | 6.198508 | 0.485841 | 16.790769 | 0.203324 | 30.224074 | 0.517008 | 7.181805 | 0.26169 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 1155.0 | 0.0 | 0.0 | 0.0 | 0.0 | -60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 25484.5 | 0.0407 | 0.48 | 1.0 | 184000.0 | 0.411 | 0.0 | 2.0 | 0.0956 | -11.149 | 0.0 | 29.0 | 0.0364 | 95.973 | 4.0 | 1.0 | 0.271 |
| 50% | 50969.0 | 0.238 | 0.61 | 1.0 | 216893.0 | 0.629 | 3.7e-05 | 5.0 | 0.124 | -7.599 | 1.0 | 41.0 | 0.0506 | 118.067 | 4.0 | 2.0 | 0.477 |
| 75% | 76453.5 | 0.645 | 0.714 | 1.0 | 261055.0 | 0.798 | 0.0344 | 8.0 | 0.241 | -5.509 | 1.0 | 52.0 | 0.104 | 136.045 | 4.0 | 6.0 | 0.693 |
| max | 101938.0 | 0.996 | 0.989 | 81.0 | 5505831.0 | 1.0 | 1.0 | 11.0 | 0.999 | 2.719 | 1.0 | 97.0 | 0.969 | 244.035 | 5.0 | 655.0 | 0.993 |


In [4]:
tracks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101939 entries, 0 to 101938
Data columns (total 32 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         101939 non-null  int64  
 1   acousticness       101939 non-null  float64
 2   album_id           101939 non-null  object 
 3   analysis_url       101939 non-null  object 
 4   artists_id         101939 non-null  object 
 5   available_markets  101939 non-null  object 
 6   country            101939 non-null  object 
 7   danceability       101939 non-null  float64
 8   disc_number        101939 non-null  float64
 9   duration_ms        101939 non-null  float64
 10  energy             101939 non-null  float64
 11  href               101939 non-null  object 
 12  id                 101939 non-null  object 
 13  instrumentalness   101939 non-null  float64
 14  key                101939 non-null  float64
 15  liveness           101939 non-null  float64
 16  lo

In [5]:
tracks_df.head()

|  | Unnamed: 0 | acousticness | album_id | analysis_url | artists_id | available_markets | country | danceability | disc_number | duration_ms | ... | preview_url | speechiness | tempo | time_signature | track_href | track_name_prev | track_number | uri | valence | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.294 | 0D3QufeCudpQANOR7luqdr | https://api.spotify.com/v1/audio-analysis/5qlj... | ['3mxJuHRn2ZWD5OofvJtDZY'] | ['AD', 'AE', 'AR', 'AT', 'AU', 'BE', 'BG', 'BH... | BE | 0.698 | 1.0 | 235584.0 | ... | https://p.scdn.co/mp3-preview/1b05a902da3a251d... | 0.0262 | 115.018 | 4.0 | https://api.spotify.com/v1/tracks/5qljLQuKnNJf... | track_14 | 1.0 | spotify:track:5qljLQuKnNJf4F4vfxQB0V | 0.622 | track |
| 1 | 1 | 0.863 | 1bcqsH5UyTBzmh9YizdsBE | https://api.spotify.com/v1/audio-analysis/3VAX... | ['4xWMewm6CYMstu0sPgd9jJ'] | ['AD', 'AE', 'AR', 'AT', 'AU', 'BE', 'BG', 'BH... | BE | 0.719 | 1.0 | 656960.0 | ... | https://p.scdn.co/mp3-preview/d8140736a6131cb5... | 0.922 | 115.075 | 3.0 | https://api.spotify.com/v1/tracks/3VAX2MJdmdqA... | track_3 | 3.0 | spotify:track:3VAX2MJdmdqARLSU5hPMpm | 0.589 | track |
| 2 | 2 | 0.75 | 4tKijjmxGClg4JOLAyo2qE | https://api.spotify.com/v1/audio-analysis/1L3Y... | ['3hYaK5FF3YAglCj5HZgBnP'] | ['GB'] | BE | 0.466 | 1.0 | 492840.0 | ... | https://p.scdn.co/mp3-preview/c8af28fb15185b18... | 0.944 | 79.565 | 4.0 | https://api.spotify.com/v1/tracks/1L3YAhsEMrGV... | track_4 | 4.0 | spotify:track:1L3YAhsEMrGVvCgDXj2TYn | 0.085 | track |
| 3 | 3 | 0.763 | 6FeJF5r8roonnKraJxr4oB | https://api.spotify.com/v1/audio-analysis/6aCe... | ['2KQsUB9DRBcJk17JWX1eXD'] | ['AD', 'AE', 'AR', 'AT', 'AU', 'BE', 'BG', 'BH... | BE | 0.719 | 1.0 | 316578.0 | ... | https://p.scdn.co/mp3-preview/7629b8e9f31f6e9b... | 0.938 | 112.822 | 3.0 | https://api.spotify.com/v1/tracks/6aCe9zzoZmCo... | track_9 | 1.0 | spotify:track:6aCe9zzoZmCojX7bbgKKtf | 0.533 | track |
| 4 | 4 | 0.77 | 4tKijjmxGClg4JOLAyo2qE | https://api.spotify.com/v1/audio-analysis/1Vo8... | ['3hYaK5FF3YAglCj5HZgBnP'] | ['GB'] | BE | 0.46 | 1.0 | 558880.0 | ... | https://p.scdn.co/mp3-preview/32be593c0eb82868... | 0.943 | 81.26 | 4.0 | https://api.spotify.com/v1/tracks/1Vo802A38tPF... | track_2 | 2.0 | spotify:track:1Vo802A38tPFHmje1h91um | 0.0906 | track |


# Select necessary features for clustering

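We then select the features that we will use for clustering. These features are chosen for their relevance to the kind of music a person might like. As the `describe()` output shows, all seven lie between 0 and 1, so no single feature dominates the distance calculations that KMeans relies on.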

In [6]:
# Select only necessary features for clustering
tracks = tracks_df[["acousticness","danceability","energy","instrumentalness","liveness","speechiness","valence"]]
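Since KMeans measures Euclidean distance, it can be worth confirming that the selected columns have no missing values and sit on comparable scales before fitting. The following is a minimal sketch of such a check; `check_features` and the `demo` frame are hypothetical stand-ins, and in the notebook you would pass `tracks_df` instead.

```python
import pandas as pd

FEATURES = ["acousticness", "danceability", "energy", "instrumentalness",
            "liveness", "speechiness", "valence"]

def check_features(df, columns):
    """Report missing values and whether every column lies within [0, 1]."""
    subset = df[columns]
    return {
        "missing_values": int(subset.isna().sum().sum()),
        "all_in_unit_range": bool(((subset.min() >= 0) & (subset.max() <= 1)).all()),
    }

# Small stand-in frame with made-up values for every feature column.
demo = pd.DataFrame({name: [0.1, 0.5, 0.9] for name in FEATURES})
print(check_features(demo, FEATURES))
```

If a column such as `tempo` or `loudness` were added to the feature list, the range check would fail, which is a hint that the data should be rescaled first.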

# Initialize and train the KMeans model

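Here, we initialize the KMeans model with 5 clusters. This means that we aim to divide our data into 5 different groups. We then train our model using the selected features. Because KMeans starts from randomly chosen centroids, the cluster numbers can differ between runs unless a fixed `random_state` is passed to the model.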

In [7]:
# Initialize the KMeans model with 5 clusters
# We can adjust the number of clusters based on our understanding of the data
kmeans = KMeans(n_clusters=5)

In [8]:
# Train the model with our data
kmeans.fit(tracks)

KMeans(n_clusters=5)
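The choice of 5 clusters is a starting point rather than a rule. One common way to explore other values is the elbow method: fit the model for several values of `n_clusters` and look at where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below uses synthetic data as a stand-in for the track features; the blob centers and the range of `k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.2]])
data = np.vstack([c + rng.normal(scale=0.03, size=(50, 2)) for c in centers])

# Fit one model per candidate k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

On this toy data the inertia falls steeply up to three clusters and only marginally afterwards, which matches the three blobs that were generated. On the real track features the bend is usually less clear-cut, so treat it as a guide rather than a definitive answer.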

# Assign cluster numbers to each track

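After training the model, we store the cluster each track belongs to in the `type` column of our DataFrame, using the labels_ attribute of the KMeans model. Note that the dataset already contains a `type` column (holding the constant value `track`, as the `head()` output shows), so this assignment overwrites it rather than adding a new column.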

In [9]:
# Store the cluster number of each track in the 'type' column (this overwrites the existing 'type' values).
tracks_df["type"] = kmeans.labels_
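As a side note, `labels_` only covers the rows the model was trained on. A fitted model can also place tracks it has not seen into the nearest cluster with `predict`. The sketch below shows both on toy data; the arrays are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy training data: two groups of (danceability, energy) pairs.
train = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# labels_ holds the cluster of each training row, in row order.
print(model.labels_)

# predict assigns unseen rows to the nearest learned centroid.
new_tracks = np.array([[0.15, 0.15], [0.85, 0.85]])
new_labels = model.predict(new_tracks)
print(new_labels)
```

The actual label numbers are arbitrary; what matters is that each new track receives the same label as the training tracks it sits closest to.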

# Save the DataFrame as a new CSV file

Once we have the cluster information added to our DataFrame, we save it as a new CSV file. This way, we don't have to perform the clustering every time we want to make a recommendation.

In [10]:
# Save the dataframe with cluster information as a new CSV file
tracks_df.to_csv("result.csv", index=False)

# Load the updated data with cluster info
tracks = pd.read_csv("result.csv")

# User input for favorite songs

Here, we prompt the user to enter the ids of their favorite songs. The ids should be separated by commas.

In [11]:
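# Prompt the user to enter their favorite songs' ids
# Strip whitespace around each id so that "id1, id2" still matches, and drop empty entries
raw_ids = input('Enter comma-separated ids of your favorite songs\n> ')
ids = [track_id.strip() for track_id in raw_ids.split(',') if track_id.strip()]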

# Filter the dataframe to get the user's favorite songs
favorites = tracks[tracks['id'].isin(ids)]

Enter comma-separated ids of your favorite songs
> 3VAX2MJdmdqARLSU5hPMpm


# Identify the user's favorite cluster

After getting the user's favorite songs, we identify the cluster that these songs most frequently belong to. This is done by counting the occurrences of each cluster number in the 'type' column of the user's favorite songs.

In [12]:
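# Find out the most frequent cluster among the user's favorite songs
if favorites.empty:
    # idxmax() raises on an empty Series, so fail early with a clearer message
    raise ValueError('None of the entered ids were found in the dataset.')
clusters = favorites['type'].value_counts()
user_favorite_cluster = clusters.idxmax()

print('\nFavorite cluster:', user_favorite_cluster, '\n')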


Favorite cluster: 3 
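With a single favorite song the vote is trivial, so here is a small sketch of how the same two calls behave when several favorites fall into different clusters. The ids and cluster numbers in `favorites_demo` are made up.

```python
import pandas as pd

# Hypothetical favorites, each already tagged with a cluster in 'type'.
favorites_demo = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "type": [3, 1, 3, 3],
})

counts = favorites_demo["type"].value_counts()  # cluster -> number of favorites
favorite_cluster = counts.idxmax()              # cluster with the highest count

print(counts.to_dict())
print(favorite_cluster)
```

Three of the four favorites sit in cluster 3, so that cluster wins the vote. If two clusters tie, `idxmax()` returns whichever of them appears first in the counts.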



# Make song recommendations

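Finally, we make song recommendations based on the user's favorite cluster. We filter our DataFrame to the songs that belong to this cluster and print the first 5 as recommendations. Because `head()` simply returns the first rows in file order, the list can include songs the user already entered, as happens in the output below with track `3VAX2MJdmdqARLSU5hPMpm`.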
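One possible refinement, sketched here on a made-up catalogue, is to drop the songs the user already listed and to sample from the cluster instead of always taking its first rows. The `catalogue` frame, `favorite_ids`, and the sample size are assumptions for illustration.

```python
import pandas as pd

# Hypothetical clustered catalogue; 'type' is the cluster label.
catalogue = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f"],
    "type": [3, 3, 1, 3, 3, 1],
})
favorite_ids = ["a"]
favorite_cluster = 3

# Keep tracks from the favorite cluster that the user has not already listed.
in_cluster = catalogue["type"] == favorite_cluster
already_known = catalogue["id"].isin(favorite_ids)
candidates = catalogue[in_cluster & ~already_known]

# Sample instead of taking the first rows so results are not tied to file order.
n = min(2, len(candidates))
picks = candidates.sample(n=n, random_state=0)
print(picks["id"].tolist())
```

Capping `n` at the number of candidates avoids an error from `sample` when the cluster holds fewer songs than requested.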

In [13]:
# Get the songs that belong to the user's favorite cluster
suggestions = tracks[tracks['type'] == user_favorite_cluster]

# Print the first 5 song suggestions
print(suggestions.head())

   Unnamed: 0  acousticness                album_id  \
1           1         0.863  1bcqsH5UyTBzmh9YizdsBE   
2           2         0.750  4tKijjmxGClg4JOLAyo2qE   
3           3         0.763  6FeJF5r8roonnKraJxr4oB   
4           4         0.770  4tKijjmxGClg4JOLAyo2qE   
6           6         0.824  51g5viCaYjOW5XO4qX1RCD   

                                        analysis_url  \
1  https://api.spotify.com/v1/audio-analysis/3VAX...   
2  https://api.spotify.com/v1/audio-analysis/1L3Y...   
3  https://api.spotify.com/v1/audio-analysis/6aCe...   
4  https://api.spotify.com/v1/audio-analysis/1Vo8...   
6  https://api.spotify.com/v1/audio-analysis/1WJz...   

                   artists_id  \
1  ['4xWMewm6CYMstu0sPgd9jJ']   
2  ['3hYaK5FF3YAglCj5HZgBnP']   
3  ['2KQsUB9DRBcJk17JWX1eXD']   
4  ['3hYaK5FF3YAglCj5HZgBnP']   
6  ['4xWMewm6CYMstu0sPgd9jJ']   

                                   available_markets country  danceability  \
1  ['AD', 'AE', 'AR', 'AT', 'AU', 'BE', 'BG', 'BH...   