In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Genre-based Recommendation Engine
1. Determine the features of a song, and which columns are irrelevant for decision-making.
2. Cluster the songs on the selected features to determine `genre`.
3. Analyze the data based on the new clusters and label them as genres. List all the unique genres.
4. Apply PCA to project the new clusters into two-dimensional space.
5. Take an input which is one of the unique genres from step 3. Return 10 random songs of that genre.
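
Step 5 can be sketched ahead of time. The snippet below is only a minimal illustration, assuming a DataFrame that already has a `genre` column produced by steps 2 and 3. The function name `recommend_by_genre`, the toy rows, and the genre labels are hypothetical and do not come from the dataset.

```python
import pandas as pd

def recommend_by_genre(songs, genre, n=10, seed=None):
    """Return up to `n` random rows of `songs` whose `genre` equals `genre`."""
    matches = songs[songs["genre"] == genre]
    if matches.empty:
        raise ValueError(f"Unknown genre: {genre!r}")
    # Sample without replacement; cap n so that small genres do not raise.
    return matches.sample(n=min(n, len(matches)), random_state=seed)

# Toy data standing in for the labeled clusters (hypothetical values).
toy = pd.DataFrame({
    "name": [f"song_{i}" for i in range(25)],
    "genre": ["classical"] * 12 + ["tango"] * 13,
})
picks = recommend_by_genre(toy, "tango", n=10, seed=0)
print(picks["name"].tolist())
```

Capping `n` at the number of matches is a design choice here: a genre with fewer than 10 songs returns everything it has instead of failing.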

## Analyzing the sourced raw data
We import the data and inspect the columns in the .csv file, as well as their data types (strings or numerics). Using the official Spotify developer website, we can note what each column represents and how the columns relate to each other.

In [2]:
raw_data = pd.read_csv("data/data.csv")
raw_data.head()

Unnamed: 0,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
0,0.995,['Carl Woitschach'],0.708,158648,0.195,0,6KbQ3uYMLKb5jDxLF7wYDD,0.563,10,0.151,-12.428,1,Singende Bataillone 1. Teil,0,1928,0.0506,118.469,0.779,1928
1,0.994,"['Robert Schumann', 'Vladimir Horowitz']",0.379,282133,0.0135,0,6KuQTIu1KoTTkLXKrwlLPV,0.901,8,0.0763,-28.454,1,"Fantasiestücke, Op. 111: Più tosto lento",0,1928,0.0462,83.972,0.0767,1928
2,0.604,['Seweryn Goszczyński'],0.749,104300,0.22,0,6L63VW0PibdM1HDSBoqnoM,0.0,5,0.119,-19.924,0,Chapter 1.18 - Zamek kaniowski,0,1928,0.929,107.177,0.88,1928
3,0.995,['Francisco Canaro'],0.781,180760,0.13,0,6M94FkXd15sOAOQYRnWPN8,0.887,1,0.111,-14.734,0,Bebamos Juntos - Instrumental (Remasterizado),0,1928-09-25,0.0926,108.003,0.72,1928
4,0.99,"['Frédéric Chopin', 'Vladimir Horowitz']",0.21,687733,0.204,0,6N6tiFZ9vLTSOIxkj8qKrd,0.908,11,0.098,-16.829,1,"Polonaise-Fantaisie in A-Flat Major, Op. 61",1,1928,0.0424,62.149,0.0693,1928


In [3]:
raw_data.info() # To see whether the numerical data are already represented as numerics and not strings

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169909 entries, 0 to 169908
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   acousticness      169909 non-null  float64
 1   artists           169909 non-null  object 
 2   danceability      169909 non-null  float64
 3   duration_ms       169909 non-null  int64  
 4   energy            169909 non-null  float64
 5   explicit          169909 non-null  int64  
 6   id                169909 non-null  object 
 7   instrumentalness  169909 non-null  float64
 8   key               169909 non-null  int64  
 9   liveness          169909 non-null  float64
 10  loudness          169909 non-null  float64
 11  mode              169909 non-null  int64  
 12  name              169909 non-null  object 
 13  popularity        169909 non-null  int64  
 14  release_date      169909 non-null  object 
 15  speechiness       169909 non-null  float64
 16  tempo             16

From [this page](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/), we can derive the meaning for each column above. I'll write them down in this cell so we won't need to look back and forth on that page.
1. `Acousticness` represents a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
2. `Danceability` represents how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. 1.0 is most danceable.
3. `Energy` represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. 1.0 is most energetic.
4. `Instrumentalness` predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the track has no vocal content.
5. `Liveness` represents the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
6. `Loudness` represents the loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Values typically range between -60 and 0 dB.
7. `Speechiness` represents the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 are tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
8. `Valence` represents the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric).
9. `Tempo` represents the overall estimated tempo of a track in beats per minute (BPM).
10. `Mode` indicates the modality of the track: 0 = minor, 1 = major.
11. `Explicit` flags explicit content: 0 = no explicit content, 1 = explicit content.
12. `Key` represents the estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
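
To make the integer encodings above concrete, here is a small sketch that decodes `key`, `mode`, and `speechiness` into readable labels. The helper names `describe_key` and `speechiness_band` and the band labels are hypothetical. The thresholds and the pitch mapping follow the descriptions in the list, with boundary values assigned to the middle band as an assumption.

```python
# Pitch Class notation: index 0 is C, 1 is C♯/D♭, and so on up to 11 (B).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def describe_key(key, mode):
    """Turn the integer `key` and `mode` columns into a label such as 'F minor'."""
    if key == -1:  # sentinel: no key was detected
        return "unknown key"
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"

def speechiness_band(value):
    """Bucket speechiness using the 0.33 and 0.66 thresholds listed above."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"

# First row of the sample output: key=10, mode=1, speechiness=0.0506
print(describe_key(10, 1), "|", speechiness_band(0.0506))
# Third row of the sample output: key=5, mode=0, speechiness=0.929
print(describe_key(5, 0), "|", speechiness_band(0.929))
```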
---
Although the data has already been cleaned, we can double-check whether any column has missing (`NaN`) values. The loop below prints each column name along with the number of missing values in it.

In [4]:
for colname in raw_data.columns:
    print(f"Column {colname} has {raw_data[colname].isnull().sum()} missing values.")

Column acousticness has 0 missing values.
Column artists has 0 missing values.
Column danceability has 0 missing values.
Column duration_ms has 0 missing values.
Column energy has 0 missing values.
Column explicit has 0 missing values.
Column id has 0 missing values.
Column instrumentalness has 0 missing values.
Column key has 0 missing values.
Column liveness has 0 missing values.
Column loudness has 0 missing values.
Column mode has 0 missing values.
Column name has 0 missing values.
Column popularity has 0 missing values.
Column release_date has 0 missing values.
Column speechiness has 0 missing values.
Column tempo has 0 missing values.
Column valence has 0 missing values.
Column year has 0 missing values.
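
A `NaN` count does not catch sentinel values. Per the column notes, `key` is -1 when no key was detected, so such rows still count as non-missing. Below is a minimal sketch of an extra check on a toy frame whose values are made up for illustration. The same expressions could be run on `raw_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "key": [10, -1, 5, 1],
    "tempo": [118.469, 83.972, 107.177, None],
})

# NaN counts for every column in one call, equivalent to the loop above.
nan_counts = toy.isnull().sum()

# Rows where `key` holds the documented "no key detected" sentinel.
undetected_keys = int((toy["key"] == -1).sum())

print(nan_counts.to_dict(), undetected_keys)
```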


In [5]:
# Keep identifiers and audio features; duration_ms, release_date and year are dropped.
df = raw_data[['id', 'name', 'artists', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'explicit', 'key', 'liveness', 'loudness', 'mode', 'popularity','speechiness', 'tempo', 'valence']]

## Begin to cluster the data
Before we instantiate the clustering on our feature set, we need to determine which value of k is optimal using the elbow method.
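
One caveat before running the elbow search: k-means relies on Euclidean distance, so columns measured on larger scales (such as `tempo` in BPM or `loudness` in dB) can outweigh the audio features that range from 0 to 1. The elbow cell in this notebook does not rescale its inputs. As an optional sketch, not part of the original workflow, z-score standardization with plain pandas could look like this. The toy values are copied from the sample rows shown earlier, and `scaled` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    "energy": [0.195, 0.0135, 0.22, 0.13, 0.204],
    "loudness": [-12.428, -28.454, -19.924, -14.734, -16.829],
    "tempo": [118.469, 83.972, 107.177, 108.003, 62.149],
})

# Z-score each column: subtract the mean, then divide by the standard deviation.
# ddof=0 uses the population standard deviation.
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
```

After this step every column has mean 0 and unit variance, so each feature contributes on a comparable scale to the distances k-means minimizes.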

In [None]:
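from sklearn.cluster import KMeans # Clustering library
%matplotlib inline

# Every column after id, name and artists is a numeric feature
features = df.iloc[:, 3:]
distortions = []

k_values = range(10, 30)
for k in k_values:
    # Fixed seed and explicit n_init so repeated runs give the same curve
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeanModel.fit(features)
    # inertia_ is the sum of squared distances to the nearest cluster center
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16, 8))
plt.plot(k_values, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()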
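
Reading the elbow off the plot is a judgment call. One common heuristic, offered here as an assumption and not as this notebook's method, picks the k whose point lies farthest from the straight line joining the first and last points of the curve. The function name `elbow_k` and the distortion values below are made up for illustration.

```python
def elbow_k(k_values, distortions):
    """Return the k whose (k, distortion) point is farthest from the chord
    joining the first and last points of the curve.

    Assumes at least two points and that the first and last distortions differ.
    """
    ks = list(k_values)
    x1, y1 = ks[0], distortions[0]
    x2, y2 = ks[-1], distortions[-1]
    # Normalize both axes to [0, 1] so neither scale dominates the distance.
    xs = [(k - x1) / (x2 - x1) for k in ks]
    ys = [(d - y2) / (y1 - y2) for d in distortions]
    # In normalized coordinates the chord runs from (0, 1) to (1, 0),
    # i.e. the line x + y = 1, so the distance is proportional to |x + y - 1|.
    gaps = [abs(x + y - 1) for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]

# Made-up distortions that drop quickly and then flatten out.
example_k = range(2, 9)
example_distortions = [1000, 600, 350, 300, 280, 270, 265]
print(elbow_k(example_k, example_distortions))  # prints 4
```

With the notebook's variables, the call would be `elbow_k(k_values, distortions)`. The result is best treated as a starting point to compare against the plot, not as a definitive answer.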