# Song Recommender 

### Part 1 - Define 
#### The Problem 

#### The Goal
As an avid Spotify user and music fan, I have always wondered how Spotify manages to recommend new music so well. Here, I try to build a "poor man's version" of a Spotify song recommender and see whether it can produce any solid recommendations.
#### The Data
##### Personal Lifetime Spotify Listening History 
Spotify lets you request your lifetime listening history, which records track_name, artist_name, album_name, date, time, and min_played.
##### Spotify Dataset 1921-2020
This dataset contains audio features of over 600,000 songs from 1921 to 2020. 
Audio features include: popularity, danceability, energy, valence and more. 
* https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv

### Part 2 - Exploratory Data Analysis and Cleaning

In [25]:
# Load libraries
import pandas as pd

# Helper
from ast import literal_eval

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load data
my_spotify_data = pd.read_csv("data/my_spotify_data.csv", encoding='latin-1', low_memory=False)
tracks = pd.read_csv("data/tracks.csv")

#### 2.1 - my_spotify_data

In [3]:
my_spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 328952 entries, 0 to 328951
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   track_name         323663 non-null  object
 1   artist_name        323658 non-null  object
 2   album_name         323657 non-null  object
 3   spotify_track_uri  323653 non-null  object
 4   skipped            49746 non-null   object
 5   date               328942 non-null  object
 6   time               328944 non-null  object
 7   min_played         328940 non-null  object
dtypes: object(8)
memory usage: 20.1+ MB


* Drop album_name, spotify_track_uri, skipped, date, time, and min_played, since I only need to know which songs I have listened to.

In [4]:
my_spotify_data.drop(columns = ['album_name','spotify_track_uri','skipped','date','time','min_played'], inplace=True)

In [5]:
my_spotify_data.describe()

Unnamed: 0,track_name,artist_name
count,323663,323658
unique,29653,10475
top,Deja Vu,Drake
freq,573,7065


In [6]:
my_spotify_data.isnull().sum()

track_name     5289
artist_name    5294
dtype: int64

Each column is missing in roughly 1.6% of the 328,952 rows, which is small enough that we can simply drop the rows with null values.
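Before dropping, it can help to check how much data would actually be lost. The following is a minimal sketch on made-up rows (the `plays` frame and its values are hypothetical, not taken from my_spotify_data); it compares the share of nulls per column with the share of rows that `dropna()` would remove.

```python
import pandas as pd

# Hypothetical toy listening history with a few missing values
plays = pd.DataFrame({
    "track_name": ["Sail", None, "Mon Ange", "Sail"],
    "artist_name": ["AWOLNATION", None, "Darius Denon", None],
})

# Fraction of missing values in each column
null_share = plays.isnull().mean()

# Fraction of rows dropna() would remove (rows with at least one null)
rows_lost = plays.isnull().any(axis=1).mean()

print(null_share)
print(f"rows dropped by dropna(): {rows_lost:.0%}")
```

The row-level share can be larger than any single column's share, because `dropna()` removes a row if any of its columns is null.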

In [7]:
my_spotify_data = my_spotify_data.dropna()
my_spotify_data.head()

Unnamed: 0,track_name,artist_name
0,Heaven - Originally Performed By DJ Sammy & Yanou,It's A Cover Up
2,I Can Do Anything,3OH!3
3,Sail,AWOLNATION
4,Remember the Name (feat. Styles of Beyond),Fort Minor
7,Mon Ange,Darius Denon


#### 2.2 - tracks

In [8]:
tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586672 entries, 0 to 586671
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                586672 non-null  object 
 1   name              586601 non-null  object 
 2   popularity        586672 non-null  int64  
 3   duration_ms       586672 non-null  int64  
 4   explicit          586672 non-null  int64  
 5   artists           586672 non-null  object 
 6   id_artists        586672 non-null  object 
 7   release_date      586672 non-null  object 
 8   danceability      586672 non-null  float64
 9   energy            586672 non-null  float64
 10  key               586672 non-null  int64  
 11  loudness          586672 non-null  float64
 12  mode              586672 non-null  int64  
 13  speechiness       586672 non-null  float64
 14  acousticness      586672 non-null  float64
 15  instrumentalness  586672 non-null  float64
 16  liveness          58

In [9]:
tracks.describe()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0,586672.0
mean,27.570053,230051.2,0.044086,0.563594,0.542036,5.221603,-10.206067,0.658797,0.104864,0.449863,0.113451,0.213935,0.552292,118.464857,3.873382
std,18.370642,126526.1,0.205286,0.166103,0.251923,3.519423,5.089328,0.474114,0.179893,0.348837,0.266868,0.184326,0.257671,29.764108,0.473162
min,0.0,3344.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.0,175093.0,0.0,0.453,0.343,2.0,-12.891,0.0,0.034,0.0969,0.0,0.0983,0.346,95.6,4.0
50%,27.0,214893.0,0.0,0.577,0.549,5.0,-9.243,1.0,0.0443,0.422,2.4e-05,0.139,0.564,117.384,4.0
75%,41.0,263867.0,0.0,0.686,0.748,8.0,-6.482,1.0,0.0763,0.785,0.00955,0.278,0.769,136.321,4.0
max,100.0,5621218.0,1.0,0.991,1.0,11.0,5.376,1.0,0.971,0.996,1.0,1.0,1.0,246.381,5.0


In [10]:
tracks.describe(include='O')

Unnamed: 0,id,name,artists,id_artists,release_date
count,586672,586601,586672,586672,586672
unique,586672,446474,114030,115062,19700
top,35iwgR4jXetI318WEWsa1Q,Summertime,['Die drei ???'],['3meJIgRw7YleJrmbpbJK6S'],1998-01-01
freq,1,101,3856,3856,2893


* Drop id and id_artists: id is unique for every row, and id_artists is redundant with artists.

In [11]:
tracks.drop(columns=['id','id_artists'],inplace=True)

In [12]:
tracks.isnull().sum()

name                71
popularity           0
duration_ms          0
explicit             0
artists              0
release_date         0
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
time_signature       0
dtype: int64

Only 71 rows are missing a name, so we can drop them.

In [13]:
tracks = tracks.dropna()
tracks.head()

Unnamed: 0,name,popularity,duration_ms,explicit,artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,Carve,6,126903,0,['Uli'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1
2,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],1922-03-21,0.434,0.177,1,-21.18,1,0.0512,0.994,0.0218,0.212,0.457,130.418,5
3,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918,0.104,0.397,169.98,3
4,Lady of the Evening,0,163080,0,['Dick Haymes'],1922,0.402,0.158,3,-16.9,0,0.039,0.989,0.13,0.311,0.196,103.22,4


#### 2.3 - Combining my_spotify_data and tracks
We will merge the two datasets by matching track_name and artist_name in my_spotify_data against name and artist in tracks.

In [14]:
# Convert each string-encoded list into an actual list, then keep the first entry as the main artist
tracks['artists'] = tracks['artists'].apply(literal_eval)
tracks['artist'] = tracks['artists'].apply(lambda x: x[0])
tracks.drop(columns='artists',inplace=True)
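To make the conversion step concrete, here is a small standalone sketch of what `literal_eval` does to one value. The string below is a made-up example shaped like an entry in the artists column; the second artist name is hypothetical.

```python
from ast import literal_eval

# Hypothetical raw value, shaped like an entry in the 'artists' column
raw_artists = "['Ignacio Corsini', 'Another Artist']"

# literal_eval parses the string into a real Python list
# (it only accepts Python literals, so it does not run arbitrary code)
artist_list = literal_eval(raw_artists)

# The first entry is treated as the main artist
main_artist = artist_list[0]

print(type(artist_list).__name__, main_artist)
```

Taking only the first artist is a simplification: a track whose main artist is listed second would not match the listening history.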

In [15]:
data = pd.merge(my_spotify_data, tracks, left_on = ['track_name','artist_name'], right_on = ['name','artist']) 
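Note that `pd.merge` performs an inner join by default, so plays without an exact name-and-artist match in tracks are silently dropped. One way to see how many plays are lost is a left merge with `indicator=True`. The sketch below uses made-up toy frames (`history` and `catalog` are hypothetical stand-ins for my_spotify_data and tracks), so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy stand-ins for my_spotify_data and tracks (values are made up)
history = pd.DataFrame({
    "track_name": ["Sail", "Sail", "Unknown Song"],
    "artist_name": ["AWOLNATION", "AWOLNATION", "Nobody"],
})
catalog = pd.DataFrame({
    "name": ["Sail"],
    "artist": ["AWOLNATION"],
    "energy": [0.436],
})

# A left merge keeps every play; indicator=True adds a '_merge' column
# that says whether each row was found in both frames or only the left one
checked = pd.merge(
    history, catalog,
    left_on=["track_name", "artist_name"],
    right_on=["name", "artist"],
    how="left",
    indicator=True,
)

# Share of plays that found a match in the catalog
match_rate = (checked["_merge"] == "both").mean()
print(f"matched {match_rate:.0%} of plays")
```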

In [16]:
data.head(5) 

Unnamed: 0,track_name,artist_name,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist
0,Sail,AWOLNATION,Sail,74,259093,0,2011-03-15,0.826,0.436,1,-9.583,1,0.0558,0.441,0.615,0.0964,0.272,119.051,4,AWOLNATION
1,Sail,AWOLNATION,Sail,74,259093,0,2011-03-15,0.826,0.436,1,-9.583,1,0.0558,0.441,0.615,0.0964,0.272,119.051,4,AWOLNATION
2,Sail,AWOLNATION,Sail,74,259093,0,2011-03-15,0.826,0.436,1,-9.583,1,0.0558,0.441,0.615,0.0964,0.272,119.051,4,AWOLNATION
3,Sail,AWOLNATION,Sail,74,259093,0,2011-03-15,0.826,0.436,1,-9.583,1,0.0558,0.441,0.615,0.0964,0.272,119.051,4,AWOLNATION
4,Sail,AWOLNATION,Sail,74,259093,0,2011-03-15,0.826,0.436,1,-9.583,1,0.0558,0.441,0.615,0.0964,0.272,119.051,4,AWOLNATION


In [17]:
data.duplicated().sum() 

294183

There are many duplicated rows because my_spotify_data records every play, so a song appears once for each time it was listened to. We keep the duplicates: the play count is what lets us separate a favorite song from an ordinary one when building the target variable.

#### 2.4 - Create Target Variable 

We need to create a target variable to predict. In our case, we would like to categorize a song as 'favorite' or 'not a favorite'. Songs whose play count falls in the top 20% are labeled 'favorite', and the remaining 80% are labeled 'not a favorite'.
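The labeling rule can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's exact code: the play counts are made up, and the cutoff is computed with `quantile` and applied in one vectorized comparison instead of being typed in by hand.

```python
import pandas as pd

# Toy play counts per track (made-up numbers)
track_freq = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e"],
    "track_freq": [1, 2, 3, 10, 50],
})

# The 0.8 quantile of play counts acts as the cutoff
threshold = track_freq["track_freq"].quantile(0.8)

# 1 = favorite (at or above the cutoff), 0 = not a favorite
track_freq["favorite_song"] = (track_freq["track_freq"] >= threshold).astype(int)

print(threshold)
print(track_freq)
```

Deriving the cutoff from the data keeps the label in sync if the listening history is refreshed and the quantile value changes.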

In [45]:
data.groupby(by=['track_name','artist_name'])['name'].count().sort_values(ascending=False).quantile(.8) # value of .8 quantile 

37.0

In [54]:
track_freq = data.groupby(by=['track_name','artist_name'])['name'].count().reset_index()
track_freq.rename(columns = {'name': 'track_freq'},inplace=True)

In [56]:
# Label a song by its play count: 1 = favorite, 0 = not a favorite
# (37.0 is the 0.8 quantile of play counts computed above)
def favorite_song(x):
    if x < 37.0:
        return 0
    else:
        return 1

In [None]:
track_freq['favorite_song'] = track_freq['track_freq'].apply(lambda x: favorite_song(x))