## Spotify Music Recommender 
After doing some research on the clustering and recommendation algorithms used by Netflix, Spotify, and other media-content-driven companies, I was interested in making a recommendation algorithm myself.  I love music, mainly southern rock, rock, bluegrass, and jazz/blues, and figured that using Spotify data to do something like this was a great idea.  I am also an avid Spotify user and have had an account for 6+ years!

So the general overview here is to import some data from Spotify or Kaggle (wherever I can find a good bit of data), do some pre-processing (i.e. normalize the data where need be, investigate correlation between features, etc.), then cluster the data to find similar groups.  After this, the next part is to construct the model and train it. I am going to use kNN and DBSCAN models to do the clustering. I'll mess around with parameters and compare their results using appropriate performance metrics. I might do some research into other ML recommendation models and try to implement something of my own and see how it compares as well.

Finally, I plan to use this predictive algorithm to produce a "small batch playlist" of songs when given a user song input. The small batch idea came from a friend, Tabb Carneal, who produces daily small batch playlists for users to listen to.  Check him out @dittylint on Instagram. Taking this further, the goal would be to build a unique playlist from a user's song input, made up of songs that the user has only lightly listened to.  This would have to take into account which songs the user has listened to, liked songs, etc., and formulate a playlist of songs that the user has not yet listened to much.
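
To make the "song in, small batch out" idea concrete, here is a rough sketch of how that step might look. Everything in it is an assumption on my part for illustration: the toy table with made-up values, the function name `small_batch_playlist`, and the choice of scikit-learn's `NearestNeighbors` for the lookup. It is not the final model.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Spotify table; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["song_a", "song_b", "song_c", "song_d", "song_e"],
    "energy": [0.90, 0.85, 0.20, 0.25, 0.88],
    "acousticness": [0.10, 0.15, 0.95, 0.90, 0.12],
    "tempo": [140.0, 138.0, 80.0, 85.0, 142.0],
})

def small_batch_playlist(df, seed_name, feature_cols, n_songs=2):
    """Return the n_songs tracks closest to seed_name in scaled feature space."""
    scaled = StandardScaler().fit_transform(df[feature_cols].to_numpy())
    # Ask for one extra neighbor because the seed song is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=n_songs + 1).fit(scaled)
    seed_pos = int(np.flatnonzero(df["name"].to_numpy() == seed_name)[0])
    _, idx = nn.kneighbors(scaled[[seed_pos]])
    # Drop the seed itself and keep the remaining closest tracks.
    neighbors = [int(i) for i in idx[0] if i != seed_pos][:n_songs]
    return df.iloc[neighbors]["name"].tolist()

playlist = small_batch_playlist(toy, "song_a", ["energy", "acousticness", "tempo"])
print(playlist)
```

Filtering out songs the user has already listened to a lot would be an extra step on top of this, and needs listening-history data that this sketch does not have.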

In [35]:
import pandas as pd
import numpy as np

music = pd.read_csv("spotify.csv")
music.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [36]:
#checking for missing values 
music.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174389 entries, 0 to 174388
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      174389 non-null  float64
 1   artists           174389 non-null  object 
 2   danceability      174389 non-null  float64
 3   duration_ms       174389 non-null  int64  
 4   energy            174389 non-null  float64
 5   explicit          174389 non-null  int64  
 6   id                174389 non-null  object 
 7   instrumentalness  174389 non-null  float64
 8   key               174389 non-null  int64  
 9   liveness          174389 non-null  float64
 10  loudness          174389 non-null  float64
 11  mode              174389 non-null  int64  
 12  name              174389 non-null  object 
 13  popularity        174389 non-null  int64  
 14  release_date      174389 non-null  object 
 15  speechiness       174389 non-null  float64
 16  tempo             17
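
As a complement to `info()`, a per-column count of missing values can be easier to scan. The snippet below is a small sketch on a toy frame I made up (the names `toy` and `missing_per_column` are placeholders); on the real data the same call would be `music.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries (made-up values).
toy = pd.DataFrame({
    "energy": [0.2, np.nan, 0.7],
    "tempo": [120.0, 98.5, np.nan],
    "name": ["a", "b", None],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("total missing:", total_missing)
```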

In [37]:
# removing all columns of type object since they are associated with artist, album, etc.
features = music.select_dtypes(exclude='object')
features.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
0,0.991,0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,12,0.0936,149.976,0.634,1920
1,0.643,0.852,150200,0.517,0,0.0264,5,0.0809,-7.261,0,7,0.0534,86.889,0.95,1920
2,0.993,0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,4,0.174,97.6,0.689,1920
3,0.000173,0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,17,0.0425,127.997,0.0422,1920
4,0.295,0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,2,0.0768,122.076,0.299,1920
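
Since the plan includes investigating correlation between features, here is a minimal sketch of one way to do it. It runs on synthetic data that I generated so the block stands alone (the `loudness` column is deliberately built to track `energy`), and the 0.8 cutoff is an arbitrary assumption. On the real data, `features.corr()` would take the place of `toy_features.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
toy_features = pd.DataFrame({
    "energy": energy,
    # Synthetic column constructed to follow energy closely.
    "loudness": -20 + 15 * energy + rng.normal(0, 0.5, 200),
    # Synthetic column that is independent of the other two.
    "key": rng.integers(0, 12, 200),
})

corr = toy_features.corr()

# Keep only the upper triangle so each pair is listed once, without the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the (arbitrary) 0.8 cutoff.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

Strongly correlated pairs carry nearly the same information, so they are candidates for dropping one column or for a dimensionality reduction step before clustering.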


In [49]:
# since I'm using k-means for clustering, I need to standardize/normalize all the data
# k-means relies on distance, so features with values much larger than
# the others would dominate the clustering of the data

col_names = list(features)
x = features.loc[:,:].values

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(x)
#x = MinMaxScaler().fit_transform(x)
frame = pd.DataFrame(x, columns = col_names)
frame.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo,valence,year
0,1.294358,0.347919,-0.434495,-0.948791,-0.270401,-0.588004,-0.058354,0.930106,-0.154111,-1.536239,-0.62605,-0.066549,1.089753,0.413903,-2.120635
1,0.378411,1.790898,-0.556689,0.12571,-0.270401,-0.510657,-0.058354,-0.721489,0.788862,-1.536239,-0.854645,-0.287113,-0.995485,1.608718,-2.120635
2,1.299622,0.626289,-0.46486,-1.088146,-0.270401,-0.589511,-1.479502,1.705763,-0.060991,0.65094,-0.991803,0.37458,-0.64145,0.621861,-2.120635
3,-1.313529,1.097814,1.275491,1.156204,-0.270401,1.804534,-0.911043,-0.460536,0.780077,0.65094,-0.397454,-0.346918,0.363273,-1.823729,-2.120635
4,-0.537536,0.950107,-0.455446,0.822485,3.698207,-0.588829,1.362794,1.057535,1.004092,-1.536239,-1.083241,-0.158725,0.167564,-0.852753,-2.120635
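
To see why the scaling matters for a distance-based method, here is a small sketch using `duration_ms` and `energy` from rows 0, 1, and 3 of the table above. Fitting the scaler on just three rows is only for illustration; the variable names are placeholders of mine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: duration_ms, energy (rows 0, 1 and 3 of music.head()).
X = np.array([
    [168333, 0.224],
    [150200, 0.517],
    [422087, 0.798],
])

# Unscaled: the Euclidean distance is almost entirely the duration gap,
# so the difference in energy has practically no influence.
raw_dist = np.linalg.norm(X[0] - X[1])
duration_gap = abs(X[0, 0] - X[1, 0])
print(raw_dist, duration_gap)

# Scaled: every column has mean 0 and unit variance, so both features
# contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(scaled_dist)
```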


In [53]:
from sklearn.decomposition import PCA

# fit PCA with as many components as there are features
pca = PCA(n_components=x.shape[1])
principalComponents = pca.fit_transform(x)
#principalDf = pd.DataFrame(data = principalComponents,
#                           columns = ['principal component 1', 'principal component 2'])
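
A natural follow-up to fitting PCA with every component is to look at how much variance each one explains and decide how many to keep. The sketch below does that on synthetic data I generated so it runs on its own (two real signals plus two noisy copies); the 95% threshold and the name `n_keep` are assumptions, not something settled in this notebook. On the real data, `pca.explained_variance_ratio_` from the fitted `pca` above would be used instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))

# Four columns: two independent signals and a noisy copy of each.
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1] + 0.1 * rng.normal(size=300),
])
X = StandardScaler().fit_transform(X)

# Keep every component so the full variance breakdown is visible.
pca = PCA(n_components=X.shape[1]).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print("components to keep:", n_keep)
```

Because the toy data only has two underlying signals, two components are enough here. The real feature set will most likely need more.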
