## Spotify Music Recommender 
After doing some research on the clustering and recommendation algorithms used by Netflix, Spotify, and other media-content-driven companies, I was interested in making a recommendation algorithm myself. I love music, mainly southern rock, rock, bluegrass, and jazz/blues, and figured that using Spotify data for this was a great idea. I am also an avid Spotify user and have had an account for 6+ years!

So the general overview here is to import some data from Spotify or Kaggle (wherever I can find a good bit of data), do some pre-processing (i.e. normalize the data where need be, investigate correlation between features, etc.), then cluster the data to find similar groups. After this, the next part is to construct the model and train it. I am going to use kNN (for nearest-neighbor lookups) and DBSCAN (for density-based clustering) to group similar songs. I'll mess around with parameters and compare their results using appropriate performance metrics. I might do some research into other ML recommendation models and try to implement something of my own to see how it compares as well. Finally, I plan to use this predictive algorithm to produce a "small batch playlist" of songs when given a user song input. The small batch idea came from a friend, Tabb Carneal, who produces daily small batch playlists for users to listen to. Check him out @dittylint on Instagram. Taking this further, the goal would be to build a unique playlist from a user-supplied song, made up of songs that the user has only lightly listened to. This would have to take into account which songs the user has listened to, their liked songs, etc., and build the playlist from songs that the user has not yet listened to much.

### Data Processing and EDA 

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf 

music = pd.read_csv("spotify.csv")
music.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,0.000522,5,0.379,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
1,0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,0.0264,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.95,1920
2,0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.8e-05,0,0.519,-12.098,1,Golfing Papa,4,1920,0.174,97.6,0.689,1920
3,0.000173,['Oscar Velazquez'],0.73,422087,0.798,0,19Lc5SfJJ5O1oaxY0fpwfh,0.801,2,0.128,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,17,1920-01-01,0.0425,127.997,0.0422,1920
4,0.295,['Mixe'],0.704,165224,0.707,1,2hJjbsLCytGsnAHfdsLejp,0.000246,10,0.402,-6.036,0,Xuniverxe,2,1920-10-01,0.0768,122.076,0.299,1920


In [2]:
#checking for missing values 
music.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174389 entries, 0 to 174388
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      174389 non-null  float64
 1   artists           174389 non-null  object 
 2   danceability      174389 non-null  float64
 3   duration_ms       174389 non-null  int64  
 4   energy            174389 non-null  float64
 5   explicit          174389 non-null  int64  
 6   id                174389 non-null  object 
 7   instrumentalness  174389 non-null  float64
 8   key               174389 non-null  int64  
 9   liveness          174389 non-null  float64
 10  loudness          174389 non-null  float64
 11  mode              174389 non-null  int64  
 12  name              174389 non-null  object 
 13  popularity        174389 non-null  int64  
 14  release_date      174389 non-null  object 
 15  speechiness       174389 non-null  float64
 16  tempo             17
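
As a complementary check, here is a minimal sketch of counting missing values per column directly, which gives a quick answer without reading through the full `info()` listing. The small `demo` frame is made up for illustration and stands in for `music`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the `music` frame, with one value knocked out on purpose
demo = pd.DataFrame({
    "tempo": [149.976, np.nan, 97.6],
    "valence": [0.634, 0.95, 0.689],
})

# number of missing entries in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# True only if the whole frame has no missing values
has_no_missing = not demo.isna().any().any()
print(has_no_missing)
```

On the real data the same two calls would be run on `music` instead of `demo`.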

In [3]:
# removing all columns of type object since they are associated with artist, album, etc.
features = music.select_dtypes(exclude='object')

# removed the popularity column because it is not a part of a song's "audio features"
# removed explicit since it is binary and does not contribute to the "audio features" of a song
# year is dropped as well
features = features.drop(columns=['popularity', 'explicit', 'year'])
features.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0.991,0.598,168333,0.224,0.000522,5,0.379,-12.628,0,0.0936,149.976,0.634
1,0.643,0.852,150200,0.517,0.0264,5,0.0809,-7.261,0,0.0534,86.889,0.95
2,0.993,0.647,163827,0.186,1.8e-05,0,0.519,-12.098,1,0.174,97.6,0.689
3,0.000173,0.73,422087,0.798,0.801,2,0.128,-7.311,1,0.0425,127.997,0.0422
4,0.295,0.704,165224,0.707,0.000246,10,0.402,-6.036,0,0.0768,122.076,0.299


This information was taken from Spotify's Web API Reference page (https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features). 

Feature Descriptions: 
    
    Acousticness - A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    
    Danceability - Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    
    Energy - Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    
    Instrumentalness - Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    
    Key - The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    
    Liveness - Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    
    Loudness - The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
    
    Mode - Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    
    Speechiness - Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    
    Tempo -  The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

    Valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
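
To make the integer-coded features easier to read during EDA, here is a small sketch that decodes `key` and `mode` into a label, following the Pitch Class mapping described above. The `PITCH_CLASSES` list and the `describe_key` helper are names introduced here for illustration; they are not part of the Spotify API.

```python
# pitch classes in standard Pitch Class notation, index 0 = C
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def describe_key(key, mode):
    """Turn Spotify's integer key and mode into a readable label."""
    if key == -1:
        return "no key detected"
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"

print(describe_key(5, 0))   # key and mode of the first row in the table above
print(describe_key(0, 1))
print(describe_key(-1, 1))
```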

### Principal Component Analysis
I am going to try to implement this on my own, using TensorFlow to do some of the math, and then with sklearn as well. I also plan to train and test on these models to compare performance.

#### Normalization

In [4]:
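from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

col_names = list(features)
x = features.to_numpy()

# scale every feature to the [0, 1] range so that large-valued columns
# such as duration_ms and tempo do not dominate the analysis
x = MinMaxScaler().fit_transform(x)
#x = StandardScaler().fit_transform(x)  # alternative: zero mean, unit variance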
frame = pd.DataFrame(x, columns = col_names)
frame.head()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,0.99498,0.605263,0.030637,0.224,0.000522,0.454545,0.379,0.741868,0.0,0.096395,0.6159,0.634
1,0.645582,0.862348,0.027237,0.517,0.0264,0.454545,0.0809,0.825918,0.0,0.054995,0.356823,0.95
2,0.996988,0.654858,0.029792,0.186,1.8e-05,0.0,0.519,0.750168,1.0,0.179197,0.40081,0.689
3,0.000174,0.738866,0.078215,0.798,0.801,0.181818,0.128,0.825135,1.0,0.043769,0.52564,0.0422
4,0.296185,0.712551,0.030054,0.707,0.000246,0.909091,0.402,0.845102,0.0,0.079094,0.501324,0.299
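
For reference, min-max scaling maps each column to the [0, 1] range via (x - min) / (max - min). The sketch below applies that formula with NumPy to a tiny made-up array. It is meant to illustrate what `MinMaxScaler` does with its default settings, not to replace it.

```python
import numpy as np

# toy data: two columns on very different scales (values are made up)
toy = np.array([[100.0, -10.0],
                [200.0,  -5.0],
                [400.0,   0.0]])

col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# (x - min) / (max - min), applied column by column via broadcasting
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

Note that this bare formula divides by zero for a constant column, so a real pipeline would need to handle that case.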


#### Finding Covariance Matrix

In [9]:
def cov(x):
    # sample covariance matrix of the columns of x (rows are observations)
    mean_x = x.mean(axis = 0)
    n_features = len(mean_x)
    den = len(x) - 1  # divide by n - 1 for the sample covariance
    covarMatrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            # subtract the column means, multiply element-wise, and add it all up
            num = np.dot((x[:,i] - mean_x[i]), (x[:,j] - mean_x[j]))
            covarMatrix[i,j] = num / den
    return covarMatrix

covarMatrix = tf.convert_to_tensor(cov(x))
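
As a sanity check, a hand-written covariance like the one above should agree with NumPy's built-in `np.cov` (with `rowvar=False` so that columns are treated as variables). The sketch below shows the equivalent vectorized computation on random stand-in data, since the real `x` is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((50, 4))   # stand-in for x: 50 rows, 4 features

# center each column, then form (X^T X) / (n - 1)
centered = data - data.mean(axis=0)
cov_manual = centered.T @ centered / (len(data) - 1)

# NumPy's reference implementation; rowvar=False means columns are variables
cov_numpy = np.cov(data, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```

The vectorized form avoids the double Python loop, which matters on a matrix with roughly 174,000 rows.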

#### Computing Eigenvalues and Eigenvectors 
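
One way to carry out this step is sketched below using NumPy's `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix. It uses random stand-in data rather than the Spotify features, and NumPy rather than TensorFlow, so treat it as one possible approach under those assumptions rather than the final implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 3))            # stand-in for the scaled feature matrix
cov_matrix = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# reorder so the direction with the largest variance comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # columns are the principal directions

# project the centered data onto the principal directions
projected = (data - data.mean(axis=0)) @ eigenvectors
print(eigenvalues)
```

Each eigenvalue is the variance of the data along the matching eigenvector, so sorting in descending order puts the most informative directions first.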

#### Sklearn Principal Component Analysis

In [8]:
from sklearn.decomposition import PCA
pca = PCA(n_components=len(x[0]))
principalComponents = pca.fit_transform(x)
#principalDf = pd.DataFrame(data = principalComponents)
   #          , columns = ['principal component 1', 'principal component 2'])
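
A natural next step is to look at how much variance each component explains, which helps decide how many components to keep. The sketch below reads the `explained_variance_ratio_` attribute of a fitted `PCA` object on random stand-in data. The name `pca_demo` and the 0.95 threshold are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.random((200, 5))            # stand-in for the scaled feature matrix

pca_demo = PCA(n_components=data.shape[1])
pca_demo.fit(data)

# fraction of total variance captured by each component, largest first
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# smallest number of components whose cumulative share reaches the threshold
threshold = 0.95                       # arbitrary cut-off for illustration
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(ratios, n_keep)
```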
