# 1. Problem Statement:
The Music Recommendation System (Spotify) aims to predict the likelihood that a user will enjoy a song. By analyzing the user's listening history and the properties of the music, the system generates a list of recommended tracks. The model uses the Spotify dataset, which contains audio features such as acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence, among others.


# 2. Objective of the project:

The primary objectives of this Music Recommendation System project are as follows:

- User Personalization: To create a personalized experience for users by recommending tracks based on their individual tastes and listening habits.

- Feature Utilization: To effectively use the features available in the Spotify dataset, such as acoustic properties and metadata, to inform the recommendation algorithms.

- Model Accuracy: To develop a Machine Learning model that accurately predicts user preferences, aiming for high precision and recall in the recommendations.

- Scalability: To ensure the system can handle a large number of users and songs without a decline in performance.

- User Engagement: To increase user engagement by providing relevant song recommendations that would encourage further interaction with the service.

- Algorithm Diversity: To explore and implement different recommendation algorithms and evaluate their effectiveness for this specific application.

- Data Analysis: To perform comprehensive data analysis to understand user behavior and song popularity, which in turn can improve the recommendation engine.

- Continuous Learning: To implement a system that learns over time, improving its recommendations as it gains more data on user preferences.

These objectives drive the development and iterative improvement of the music recommendation system. By achieving these goals, the project aims to deliver a robust and enjoyable user experience.

Sources: Kaggle




Importing Libraries

In [2]:
# Install the lightfm library
!pip install lightfm

import os
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from scipy.spatial.distance import cdist, cosine, euclidean, hamming
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")
import random
import ast
import lightfm
from lightfm import LightFM, cross_validation
from lightfm.evaluation import precision_at_k, auc_score
from keras.preprocessing import image
from time import time
from math import sqrt

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=808331 sha256=cc82c6a53228e9c741e614e16211d0dfb5c0b68c6c6edf5a376836680c7c07e4
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


Mounting the drive


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Read the music Data

In [4]:
# Loading the datasets
data = pd.read_csv('drive/MyDrive/Spotify_Recommendation_System/data.csv')
genre_data = pd.read_csv('drive/MyDrive/Spotify_Recommendation_System/data_by_genres.csv')
year_data = pd.read_csv('drive/MyDrive/Spotify_Recommendation_System/data_by_year.csv')

data.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,['Dennis Day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


In [5]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      17

In [6]:
data['artists'] = data['artists'].map(lambda x: x.lstrip('[').rstrip(']'))
data.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"'Sergei Rachmaninoff', 'James Levine', 'Berlin...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,'Dennis Day',0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,'KHP Kridhamardawa Karaton Ngayogyakarta Hadin...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,'Frank Parker',0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,'Phil Regan',0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665


In [7]:
data['artists'] = data['artists'].map(lambda x: x[1:-1])
data.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"Sergei Rachmaninoff', 'James Levine', 'Berline...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,1921,0.0366,80.954
1,0.963,1921,0.732,Dennis Day,0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.936
2,0.0394,1921,0.961,KHP Kridhamardawa Karaton Ngayogyakarta Hadini...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
3,0.165,1921,0.967,Frank Parker,0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
4,0.253,1921,0.957,Phil Regan,0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
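
Stripping the outer brackets and then the first and last characters cleans single-artist rows, but rows with several artists keep their inner quotes, as the first row above shows. One possible alternative, sketched here on a toy frame rather than the real `data.csv`, is to parse the original string with `ast.literal_eval` before any stripping is applied. The helper name `parse_artists` and the column `artists_clean` are hypothetical.

```python
import ast

import pandas as pd


def parse_artists(raw):
    """Turn a string such as "['A', 'B']" into 'A, B' (hypothetical helper)."""
    return ", ".join(ast.literal_eval(raw))


# Toy rows in the same format as the raw 'artists' column
toy = pd.DataFrame(
    {"artists": ["['Sergei Rachmaninoff', 'James Levine']", "['Dennis Day']"]}
)
toy["artists_clean"] = toy["artists"].map(parse_artists)
print(toy["artists_clean"].tolist())
```

This only works on the untouched column, so it would replace cells 6 and 7 rather than follow them.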


# Read Music Playlist Data
The original dataset is quite large, so I read only a random sample of about 2% of the rows for a faster run.

In [10]:
p = 0.02  # keep each data row with probability p (about 2% of the rows)

df_playlist = pd.read_csv('drive/MyDrive/Spotify_Recommendation_System/spotify_dataset.csv', on_bad_lines='skip', skiprows=lambda i: i > 0 and random.random() > p)
df_playlist.head()

Unnamed: 0,user_id,"""artistname""","""trackname""","""playlistname"""
0,9cc0cfd4d7d7885102480dd99e7a90d6,Elvis Costello,Alison,HARD ROCK 2010
1,9cc0cfd4d7d7885102480dd99e7a90d6,Cocktail Slippers,You Do Run,HARD ROCK 2010
2,9cc0cfd4d7d7885102480dd99e7a90d6,Biffy Clyro,God & Satan,IOW 2012
3,07f0fc3be95dcd878966b1f9572ff670,C418,Moog City,C418
4,07f0fc3be95dcd878966b1f9572ff670,C418,Équinoxe,C418
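
Because the sampling relies on the global `random` state, every run of the cell reads a different subset. A minimal sketch of a reproducible variant follows, assuming a seeded `random.Random` instance is acceptable; the helper `sample_csv`, the seed value, and the in-memory toy CSV are illustrative and not part of the notebook.

```python
import io
import random

import pandas as pd


def sample_csv(source, p, seed):
    """Read roughly a fraction p of the data rows; row 0 (the header) is always kept."""
    rng = random.Random(seed)  # private generator, so the sample is repeatable
    return pd.read_csv(source, skiprows=lambda i: i > 0 and rng.random() > p)


# Toy CSV with 1000 data rows standing in for spotify_dataset.csv
csv_text = "user_id,artist\n" + "\n".join(f"u{i},a{i}" for i in range(1000))

first = sample_csv(io.StringIO(csv_text), p=0.02, seed=42)
second = sample_csv(io.StringIO(csv_text), p=0.02, seed=42)
print(len(first), first.equals(second))
```

With the same seed and the same file, both reads return the same rows, which makes the later filtering steps easier to compare across runs.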


Clean up column names

In [11]:
df_playlist.columns = df_playlist.columns.str.replace('"', '')
df_playlist.columns = df_playlist.columns.str.replace('name', '')
df_playlist.columns = df_playlist.columns.str.replace(' ', '')
df_playlist.columns

Index(['user_id', 'artist', 'track', 'playlist'], dtype='object')

For the recommender system, I keep only the artists that appear at least 50 times in the sampled playlists.

In [12]:
df_playlist = df_playlist.groupby('artist').filter(lambda x : len(x)>=50)

I also keep only the users with at least 10 unique artists in their playlists, to lessen the impact of the cold-start problem.

In [13]:
df_playlist = df_playlist[df_playlist.groupby('user_id').artist.transform('nunique') >= 10]
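
The two filters use different pandas idioms: `groupby(...).filter` drops whole groups, while `transform('nunique')` broadcasts a per-user count back onto every row so it can serve as a boolean mask. The sketch below shows both on a toy frame with much smaller thresholds (3 and 2 instead of 50 and 10). Here the filters are applied independently for illustration, whereas the notebook applies the second to the output of the first.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "artist":  ["A",  "A",  "B",  "A",  "A"],
})

# Keep artists with at least 3 rows overall: A has 4 rows, B has 1
by_artist = toy.groupby("artist").filter(lambda g: len(g) >= 3)

# Keep users with at least 2 distinct artists: u1 has {A, B}, u2 has {A}
by_user = toy[toy.groupby("user_id")["artist"].transform("nunique") >= 2]

print(by_artist["artist"].unique().tolist())
print(by_user["user_id"].unique().tolist())
```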

Group by user and artist to get the frequency count, i.e. the number of times an artist appears in the playlists created by a user.

In [14]:
# Count rows per (user, artist) pair and sort by the count, highest first
df_freq = df_playlist.groupby(['user_id', 'artist']).size().reset_index(name='freq').sort_values(['freq'], ascending=False)
df_freq.head()

Unnamed: 0,user_id,artist,freq
6925,26b51e580277e131f87e4c7ee4c0887a,Vitamin String Quartet,57
30738,b1d4116e7cf150ae7d77413620f5f571,Wolfgang Amadeus Mozart,52
11433,414050deadb38aafd8d4ad22ca634055,Vitamin String Quartet,50
43213,fa849dabeb14a2800ad5130907fc5018,Ella Fitzgerald,48
22156,7ee2b92c5bcf6133b8132363e5bda960,Jamey Aebersold Play-A-Long,42


Create a DF for artists and add artist id

In [15]:
df_artist = pd.DataFrame(df_freq["artist"].unique())
df_artist = df_artist.reset_index()
df_artist = df_artist.rename(columns={'index':'artist_id', 0:'artist'})
df_artist.head()

Unnamed: 0,artist_id,artist
0,0,Vitamin String Quartet
1,1,Wolfgang Amadeus Mozart
2,2,Ella Fitzgerald
3,3,Jamey Aebersold Play-A-Long
4,4,Peggy Lee


Next, we need the listening information of the input user.

In [16]:
def GetInPut(user):
    inputArtist = pd.DataFrame(user)
    # Keep only the artists from df_artist that appear in the user's input
    Id = df_artist[df_artist['artist'].isin(inputArtist['artist'].tolist())]
    # Merge to attach artist_id to the input; the merge is implicitly on the shared 'artist' column
    inputArtist = pd.merge(Id, inputArtist)
    return inputArtist

In [17]:
user = [
            {'artist':'Ella Fitzgerald', 'freq':40},
            {'artist':'Frank Sinatra', 'freq':10},
            {'artist':'Lil Wayne', 'freq':3},
            {'artist':"The Rolling Stones", 'freq':5},
            {'artist':'Louis Armstrong', 'freq':5}
         ]

In [18]:
inputArtist = GetInPut(user)

Collaborative Filtering Song Recommendation

Similarity of users to input user

Next, we compare all users to our specified user and find the ones that are most similar. We measure how similar each user is to the input user with the Pearson correlation coefficient, which measures the strength of a linear association between two variables. The formula for this coefficient between sets X and Y with N values can be seen in the image below.

Why Pearson Correlation?

Pearson correlation is invariant to scaling and shifting, i.e. multiplying all elements by a positive constant or adding any constant to all elements. For example, for two vectors X and Y, pearson(X, Y) == pearson(X, 2 * Y + 3). This is an important property in recommendation systems: two users might rate the same items very differently in absolute terms, yet still be similar users (i.e. with similar tastes) whose ratings follow the same pattern on different scales.

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0" />

The values given by the formula range from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 a perfect negative correlation.

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
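
The coefficient can also be written with sums of squares as Sxy / sqrt(Sxx * Syy). As a small check of the properties described above, the sketch below computes it that way, compares the result with `numpy.corrcoef`, and confirms the invariance under `2 * Y + 3`. The function name `pearson` and the two toy frequency vectors are made up for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson r via the sums-of-squares form Sxy / sqrt(Sxx * Syy)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
    syy = np.sum(y ** 2) - np.sum(y) ** 2 / n
    sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    # A constant vector has zero variance, so the correlation is undefined; return 0
    if sxx <= 0 or syy <= 0:
        return 0.0
    return sxy / np.sqrt(sxx * syy)


x = [40, 10, 3, 5, 5]   # toy play counts of one user
y = [12, 4, 1, 3, 2]    # toy play counts of another user for the same artists

r = pearson(x, y)
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))               # matches NumPy
print(np.isclose(r, pearson(x, [2 * v + 3 for v in y])))    # scale and shift invariant
print(np.isclose(pearson(x, [-v for v in y]), -r))          # a negative factor flips the sign
```

Note that both vectors must cover the same artists in the same order for the result to be meaningful.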

In [36]:
def ColFilter(inputArtist, df_freq):
    # Keep only the artists from df_artist that appear in the input
    Id = df_artist[df_artist['artist'].isin(inputArtist['artist'].tolist())]
    # Merge to attach artist_id to the input (implicit merge on the shared columns)
    inputArtist = pd.merge(Id, inputArtist).sort_values(by='artist_id')
    # Attach artist_id to the user/artist frequency table
    df_freq = pd.merge(df_freq, df_artist, how='inner', on='artist')
    # Restrict to the artists the input user also has, so the two frequency lists line up
    userSubset = df_freq[df_freq['artist_id'].isin(inputArtist['artist_id'].tolist())]
    # Group by a scalar key so each group name is a plain user id
    userSubsetGroup = userSubset.groupby('user_id')
    # Look first at the users who share the most artists with the input user
    userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
    userSubsetGroup = userSubsetGroup[0:100]
    # Store the Pearson correlation in a dictionary: key = user id, value = coefficient
    pearsonCorDict = {}
    # For every user group in our subset
    for name, group in userSubsetGroup:
        # Sort by artist_id so the values are not mixed up later on
        group = group.sort_values(by='artist_id')
        # N for the formula: the number of artists both users have in common
        n = len(group)
        # Frequencies of the input user for the artists they both have in common
        temp = inputArtist[inputArtist['artist_id'].isin(group['artist_id'].tolist())]
        tempRatingList = temp['freq'].tolist()
        # Frequencies of the current user for the same artists
        tempGroupList = group['freq'].tolist()
        # Pearson correlation between the two users, x and y
        Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(n)
        Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(n)
        Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(n)
        # If the denominator is nonzero, divide; otherwise the correlation is 0
        if Sxx != 0 and Syy != 0:
            pearsonCorDict[name] = Sxy/sqrt(Sxx*Syy)
        else:
            pearsonCorDict[name] = 0
    pearsonDF = pd.DataFrame.from_dict(pearsonCorDict, orient='index')
    pearsonDF.columns = ['similarityIndex']
    pearsonDF['user_id'] = pearsonDF.index
    pearsonDF.index = range(len(pearsonDF))
    # Keep the 50 most similar users and fetch all of their artist frequencies
    topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
    topUsersRating = topUsers.merge(df_freq, on='user_id', how='inner')
    topUsersRating['weightedFreq'] = topUsersRating['similarityIndex']*topUsersRating['freq']
    # Sum the similarity and the weighted frequency per artist
    tempTopUsersRating = topUsersRating.groupby('artist_id')[['similarityIndex','weightedFreq']].sum()
    tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedFreq']
    # Creates an empty dataframe
    recommendation_df = pd.DataFrame()
    # Now we take the weighted average
    recommendation_df['weighted average freq score'] = tempTopUsersRating['sum_weightedFreq']/tempTopUsersRating['sum_similarityIndex']
    recommendation_df['artist_id'] = tempTopUsersRating.index
    # Sort the recommendation by the weighted average freq score
    recommendation_df = recommendation_df.sort_values(by='weighted average freq score', ascending=False)
    recommendation_final = df_artist.loc[df_artist['artist_id'].isin(recommendation_df.head(10)['artist_id'].tolist())]
    return recommendation_final, topUsersRating
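
The last step of the function scores each artist by a similarity-weighted average of the neighbours' frequencies. The toy example below isolates that step, assuming two hypothetical neighbours `u1` and `u2` with made-up similarity values and play counts.

```python
import pandas as pd

# Hypothetical neighbours: similarity to the input user and their artist counts
top_users = pd.DataFrame({
    "user_id":         ["u1", "u1", "u2", "u2"],
    "similarityIndex": [0.9,  0.9,  0.5,  0.5],
    "artist_id":       [1,    2,    1,    3],
    "freq":            [10,   4,    2,    8],
})

# Weight each count by how similar its user is to the input user
top_users["weightedFreq"] = top_users["similarityIndex"] * top_users["freq"]

# Per artist: sum of weights and sum of weighted counts
sums = top_users.groupby("artist_id")[["similarityIndex", "weightedFreq"]].sum()

# Weighted average = sum(similarity * freq) / sum(similarity)
score = (sums["weightedFreq"] / sums["similarityIndex"]).sort_values(ascending=False)
print(score)
```

Artist 3 ranks first (score 8.0) although only the less similar user listens to it, because the average is normalised by the total similarity of the users who have that artist. An artist known to a single neighbour can therefore outrank one shared by many.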

Get the final recommendation of artists

In [37]:
df_playlist_2  = pd.merge(df_playlist , df_artist, how='inner', on='artist')
df_playlist_2.head()

Unnamed: 0,user_id,artist,track,playlist,artist_id


Prepare the music data for Content-based Filtering

In [38]:
data['song_id']=data.index
df = data[['danceability','energy',"valence","speechiness","instrumentalness","acousticness"]]
df.index = data['song_id']
# L2-normalize each row (song) to unit length; axis=1 works on rows, not columns
df_normalized = pd.DataFrame(normalize(df, axis=1))
df_normalized.columns = df.columns
df_normalized.index = df.index
df_normalized.head()

Unnamed: 0_level_0,danceability,energy,valence,speechiness,instrumentalness,acousticness
song_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0.204439,0.154611,0.043526,0.026819,0.643359,0.719566
1,0.526206,0.219092,0.618725,0.266637,0.0,0.470308
2,0.238274,0.12059,0.028622,0.024627,0.663245,0.698114
3,0.258165,0.290084,0.154899,0.033233,2.6e-05,0.907802
4,0.382654,0.17668,0.231606,0.034787,2e-06,0.876076
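
In the output above, the squared values in each row sum to 1, which shows that `normalize(df, axis=1)` rescales every song rather than every feature. The difference between the two axes can be seen on a tiny made-up matrix:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two toy songs with two features; the second song is the first one doubled
features = np.array([[3.0, 4.0],
                     [6.0, 8.0]])

by_row = normalize(features, axis=1)  # each song (row) scaled to unit L2 norm
by_col = normalize(features, axis=0)  # each feature (column) scaled to unit L2 norm

print(by_row)
print(by_col)
```

Row-wise normalization maps both songs to the same vector `[0.6, 0.8]`, so only the proportions between features survive. That suits cosine distance, which ignores vector length anyway.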


Content-based Filtering Song Recommendation

In [39]:
def Content_filter_music_recommender(song_id, N):
    """
    Music recommender based on the cosine distance between normalized song features.

    Uses the global df_normalized (normalized song data) and data (song metadata).

    song_id: find similar songs based on the selected song
    N: number of songs to recommend (Top N)

    return: names of the Top N recommended songs
    """
    # Distance calculation approach; euclidean or hamming could be swapped in here
    distance_method = cosine
    allSongs = pd.DataFrame(df_normalized.index)
    # Exclude the selected song itself
    allSongs = allSongs[allSongs.song_id != song_id]
    allSongs["distance"] = allSongs["song_id"].apply(lambda x: distance_method(df_normalized.loc[song_id], df_normalized.loc[x]))
    # Sort by distance, then by song id, so ties go to the smaller song id
    TopNRecommendation = allSongs.sort_values(["distance"]).head(N).sort_values(by=['distance', 'song_id'])
    # Attach the song metadata and return the song names
    Recommendation = pd.merge(TopNRecommendation , data, how='inner', on='song_id')
    SongName = Recommendation['name']
    return SongName

In [40]:
SongName=Content_filter_music_recommender(3, 5)
SongName

0                     Pause Track - Live
1    StaggerLee Has His Day at the Beach
2                            Pause Track
3                           Silent Track
4                           Magic Window
Name: name, dtype: object
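
Calling the distance function once per song is slow on roughly 170,000 rows. Since every row of `df_normalized` has unit length, the cosine distance to the selected song equals 1 minus a dot product, so all distances can be computed in one matrix-vector product. The following is a sketch of that idea on toy data; the function `top_n_similar` and the toy feature values are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize


def top_n_similar(df_unit, song_id, n):
    """Return the n nearest song ids by cosine distance.

    Assumes every row of df_unit already has unit L2 norm.
    """
    # For unit vectors: cosine distance = 1 - dot product
    distances = 1.0 - df_unit.to_numpy() @ df_unit.loc[song_id].to_numpy()
    result = pd.Series(distances, index=df_unit.index, name="distance")
    # Drop the selected song itself, then take the n smallest distances
    return result.drop(song_id).sort_values(kind="stable").head(n)


toy = pd.DataFrame(
    normalize([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]], axis=1),
    index=pd.Index([0, 1, 2, 3], name="song_id"),
    columns=["danceability", "energy"],
)
print(top_n_similar(toy, song_id=0, n=2))
```

For song 0 the nearest neighbours are songs 1 and 3, in that order. The returned index could then be merged with `data` to look up song names, as the notebook function does.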