# **Introduction:**

**Research Question**

This paper endeavors to answer the following question: "How might we create a personalized music recommendation system for users based on their listening history, without being invasive or relying on personal data?"




**Background / Relevance for Study**

In today's digital age, music streaming services such as Spotify are becoming increasingly popular. However, with such a vast catalog available, users often struggle to find new music that suits their tastes. Personalized music recommendation systems have become a popular solution to this problem: given a user's listening history, these systems suggest new music that the user may enjoy. Unfortunately, many existing systems rely heavily on users' personal data (e.g., age and location), which raises privacy concerns. Presently, Spotify recommends content based both on the actual content of the songs a user likes and on the relationship each track has with other tracks, as determined by a broader set of users. Our proposed model aims to provide personalized recommendations that rely on a user's own listening history without being invasive.

Our proposed response to our main query is to create a novel music recommendation algorithm that differs from Spotify's. Spotify presently incorporates multiple recommendation methods, chiefly:


1.   Content-Based Recommendation
2.   Collaborative-Based Recommendation
3.   Popularity-Based Recommendation

Our objective is to create a new method that does not incorporate collaborative-based recommendation. The goal of this change is to enhance user privacy so that a user's listening history is not revealed to other users, directly or indirectly.

To illustrate this use case, suppose a user (Bob) has exactly one friend on Spotify (Rob). If Bob receives Norwegian death metal as a recommendation, he might infer that Rob is an avid fan of the genre. Rob may prefer to keep that private, and could plausibly opt into our algorithm, which eschews collaborative-based recommendation in favor of his privacy.


**Variables, Parameters, and Assumptions**

Our variables will include users' listening history, the genres and artists of the music they listen to, and their interactions with the music streaming service (such as liking, disliking, or skipping songs). We will assume that users' listening history reflects their music preferences to some extent. We will also assume that the music streaming service has access to a large enough database of music to make relevant recommendations.
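
As a minimal sketch of how these variables could feed a recommender that uses only a user's own history, the snippet below builds a profile vector from the audio features of tracks the user interacted with and ranks candidate tracks by cosine similarity. The interaction weights, the function names `build_user_profile` and `rank_candidates`, and the toy feature values are assumptions made for illustration; they are not taken from the dataset or from the final model.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical interaction weights (assumed values, not taken from this study).
INTERACTION_WEIGHTS = {"like": 1.0, "play": 0.5, "skip": -0.5, "dislike": -1.0}

def build_user_profile(track_features, interactions):
    """Weighted sum of the feature vectors of tracks in one user's own history."""
    weights = np.array([INTERACTION_WEIGHTS[i] for i in interactions])
    return weights @ np.asarray(track_features, dtype=float)

def rank_candidates(profile, candidate_features):
    """Return candidate indices ordered from most to least similar, plus the scores."""
    scores = cosine_similarity(profile.reshape(1, -1), candidate_features)[0]
    return np.argsort(-scores), scores

# Toy data: columns are (danceability, energy, valence), assumed to lie in [0, 1].
history = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]
profile = build_user_profile(history, ["like", "skip"])
candidates = np.array([[0.8, 0.9, 0.6], [0.1, 0.2, 0.2]])
order, scores = rank_candidates(profile, candidates)
```

No other user's data enters the computation: the profile depends only on the tracks in `history` and the catalog features in `candidates`.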


**Limitations of Data**

We source our data from a Kaggle dataset (https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify). While this dataset provides a large quantity of data, it does not contain every song on Spotify; thus, not every song on a user's playlist may be represented in the data, rendering the recommendation algorithm less accurate given its reduced information. In particular, our data source only holds information on the following genres: **[Trap, Techno, Techhouse, Trance, Psytrance, Dark Trap, DnB (drum and bass), Hardstyle, Underground Rap, Trap Metal, Emo, Rap, RnB, Pop, Hiphop].**

Each song has a set of accompanying labels with further data, such as danceability, energy, loudness, musical key, and instrumentalness (to name a few).
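
Because these labels live on different scales (loudness is measured in decibels and tempo in beats per minute, while danceability lies between 0 and 1), one plausible preprocessing step is to rescale them to a common range before comparing songs. The sketch below uses scikit-learn's `MinMaxScaler` on a toy table; the values are illustrative, and whether min-max scaling is the right choice for the final model is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows using column names from the dataset; the values are illustrative only.
songs = pd.DataFrame({
    "danceability": [0.708, 0.379, 0.749],
    "loudness": [-12.428, -28.454, -19.924],
    "tempo": [118.469, 83.972, 107.177],
})

# Rescale every column independently so its minimum maps to 0 and its maximum to 1.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(songs), columns=songs.columns)
```

Without such a step, a distance or similarity measure would be dominated by the columns with the largest numeric range, such as tempo.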

Our project and technical analysis consists of 5 major components, enumerated below:

1. Data Collection and Cleaning: 



## Imports

In [5]:
import pandas as pd
import numpy as np
import json
import re 
import sys
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt


import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util

import warnings
warnings.filterwarnings("ignore")

In [6]:
%matplotlib inline

In [7]:
# Widen the notebook container so wide tables are easier to read on laptop screens
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Data Processing
dataset link: https://www.kaggle.com/datasets/ektanegi/spotifydata-19212020

In [13]:
# data is at a song level
spotify_df = pd.read_csv('data.csv')
spotify_df.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.995,['Carl Woitschach'],0.708,158648,0.195,0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,10,0.151,-12.428,1,Singende Bataillone 1. Teil,0,1928,0.0506,118.469,0.779,1928
1,0.994,"['Robert Schumann', 'Vladimir Horowitz']",0.379,282133,0.0135,0,6KuQTIu1KoTTkLXKrwlLPV,0.901,8,0.0763,-28.454,1,"Fantasiestücke, Op. 111: Più tosto lento",0,1928,0.0462,83.972,0.0767,1928
2,0.604,['Seweryn Goszczyński'],0.749,104300,0.22,0,6L63VW0PibdM1HDSBoqnoM,0.0,5,0.119,-19.924,0,Chapter 1.18 - Zamek kaniowski,0,1928,0.929,107.177,0.88,1928
3,0.995,['Francisco Canaro'],0.781,180760,0.13,0,6M94FkXd15sOAOQYRnWPN8,0.887,1,0.111,-14.734,0,Bebamos Juntos - Instrumental (Remasterizado),0,1928-09-25,0.0926,108.003,0.72,1928
4,0.99,"['Frédéric Chopin', 'Vladimir Horowitz']",0.21,687733,0.204,0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,11,0.098,-16.829,1,"Polonaise-Fantaisie in A-Flat Major, Op. 61",1,1928,0.0424,62.149,0.0693,1928


In [14]:
# data is at an artist level
data_w_genre = pd.read_csv('data_w_genres.csv')
data_w_genre.head()

Unnamed: 0,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count,genres
0,"""Cats"" 1981 Original London Cast",0.575083,0.44275,247260.0,0.386336,0.022717,0.287708,-14.205417,0.180675,115.9835,0.334433,38.0,5,1,12,['show tunes']
1,"""Cats"" 1983 Broadway Cast",0.862538,0.441731,287280.0,0.406808,0.081158,0.315215,-10.69,0.176212,103.044154,0.268865,33.076923,5,1,26,[]
2,"""Fiddler On The Roof” Motion Picture Chorus",0.856571,0.348286,328920.0,0.286571,0.024593,0.325786,-15.230714,0.118514,77.375857,0.354857,34.285714,0,1,7,[]
3,"""Fiddler On The Roof” Motion Picture Orchestra",0.884926,0.425074,262890.962963,0.24577,0.073587,0.275481,-15.63937,0.1232,88.66763,0.37203,34.444444,0,1,27,[]
4,"""Joseph And The Amazing Technicolor Dreamcoat""...",0.605444,0.437333,232428.111111,0.429333,0.037534,0.216111,-11.447222,0.086,120.329667,0.458667,42.555556,11,1,9,[]


In [15]:
# checking for genres
data_w_genre.dtypes

artists              object
acousticness        float64
danceability        float64
duration_ms         float64
energy              float64
instrumentalness    float64
liveness            float64
loudness            float64
speechiness         float64
tempo               float64
valence             float64
popularity          float64
key                   int64
mode                  int64
count                 int64
genres               object
dtype: object

In [17]:
# each genres value is actually a string that looks like a list
data_w_genre['genres'].values[0]

"['show tunes']"

In [18]:
# use a regex to extract each single-quoted genre into a list, replacing spaces with underscores
data_w_genre['genres_upd'] = data_w_genre['genres'].apply(lambda x: [re.sub(' ','_',i) for i in re.findall(r"'([^']*)'", x)])
data_w_genre['genres_upd'].values[0][0]

'show_tunes'
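
To make the behavior of the expression above concrete, here is a small standalone sketch applying the same pattern to a few strings. The helper name `parse_genres` and the multi-genre input are hypothetical examples, not rows from the dataset.

```python
import re

def parse_genres(raw):
    """Extract single-quoted items from a list-like string; join words with underscores."""
    return [item.replace(" ", "_") for item in re.findall(r"'([^']*)'", raw)]

single = parse_genres("['show tunes']")          # ['show_tunes']
multiple = parse_genres("['dance pop', 'pop']")  # ['dance_pop', 'pop']
empty = parse_genres("[]")                       # []
```

An artist with no listed genres yields an empty list, so later steps should be prepared to handle that case.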

In [19]:
# extract artists into a list
spotify_df['artists_upd_v1'] = spotify_df['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))
spotify_df['artists'].values[0]

"['Carl Woitschach']"

In [20]:
spotify_df['artists_upd_v1'].values[0][0]

'Carl Woitschach'

In [22]:
# double check: the regex returns an empty list when an artist name contains an apostrophe, because such names are enclosed in double quotes
spotify_df[spotify_df['artists_upd_v1'].apply(lambda x: not x)].head(5)

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year,artists_upd_v1
127,0.995,"[""Sam Manning's and His Cole Jazz Orchestra""]",0.664,173333,0.283,0,42WDMm9hX0xCFkkKpt6NOY,0.874,8,0.109,-18.301,0,Bungo,0,1930-01-01,0.0807,99.506,0.688,1930,[]
180,0.984,"[""Scarlet D'Carpio""]",0.4,142443,0.19,0,4Gcc2YB0AAlzPLQhosdyAw,0.9,0,0.182,-12.062,1,Chililin Uth'aja,0,1930,0.0492,81.29,0.402,1930,[]
1244,0.506,"[""Original Broadway Cast Of 'Flahooley""]",0.519,35227,0.475,0,1Qt9zpHUfVqMNr25EU9IFL,0.071,7,0.103,-9.553,0,Prologue,0,1951-01-01,0.107,105.639,0.615,1951,[]
1478,0.809,"[""Cal Tjader's Modern Mambo Quintet""]",0.795,238200,0.386,0,5VeW5QJDW906P5knRgJWzt,0.874,1,0.106,-14.984,1,Dearly Beloved,2,1954-09-11,0.057,119.8,0.807,1954,[]
1944,0.804,"[""Screamin' Jay Hawkins""]",0.574,142893,0.401,0,6MC85zBk1dQqnywRDdzy7h,2e-05,2,0.546,-11.185,1,I Love Paris,14,1958,0.0533,89.848,0.587,1958,[]


In [23]:
# catch the special case above with a double-quote regex, then keep whichever result is non-empty
spotify_df['artists_upd_v2'] = spotify_df['artists'].apply(lambda x: re.findall(r'"(.*?)"', x))
spotify_df['artists_upd'] = np.where(spotify_df['artists_upd_v1'].apply(lambda x: not x), spotify_df['artists_upd_v2'], spotify_df['artists_upd_v1'])
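
The two-regex approach still depends on how each string is quoted. As one possible alternative (not the approach used in this notebook), Python's `ast.literal_eval` can parse the list-like string directly, which handles single-quoted and double-quoted names in the same list. The helper name `parse_list_string` is hypothetical, and the third example string is constructed for illustration.

```python
import ast

def parse_list_string(raw):
    """Parse a list-like string into a Python list, whichever quote style it uses."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # Fall back to an empty list for malformed input.
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

examples = [
    "['Carl Woitschach']",
    "[\"Screamin' Jay Hawkins\"]",
    # Constructed case: one single-quoted and one double-quoted name in the same list.
    "['Robert Schumann', \"Cal Tjader's Modern Mambo Quintet\"]",
]
parsed = [parse_list_string(s) for s in examples]
```

Under these assumptions, the helper could be applied to the column with `spotify_df['artists'].apply(parse_list_string)`.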