# Decision Exercise: Mods 19 & 20

## Clustering Songs from Spotify

**Background**

The popular music platform Spotify is known for leaning heavily on data to drive decision making and innovation. One innovative feature is the tailor-made playlists the platform provides to its users. The [Discover Weekly](https://www.spotify.com/us/discoverweekly/) and [Made For You](https://support.spotify.com/us/article/made-for-you-playlists/) playlists are two such examples. These playlists are curated by a finely tuned algorithm based on implicit and explicit feedback provided by the user.

>**Side Discussion**
>
>As it turns out, data generated from listening habits reveals more than a user's taste in music: it can say quite a bit about a user's lifestyle and personal life. For example, suppose a Spotify user typically listens to upbeat music early in the morning between 5 and 7am, Disney singalong songs from 7 to 9am, and classical music during the day. You could reasonably assume that this user:
>
>    - Exercises (upbeat music 5-7am)
>    - Has young kids (Disney songs 7-9am), and so is likely between 18 and 40 years old
>    - Works or studies full time (classical music)
>
>You could speculate further by investigating the deeper associations between users and listening behavior, but that is a topic for another time.

To generate these playlists, Spotify needs to find songs that are similar in profile to the user's "Liked" songs, but dissimilar enough that the playlists contain novel recommendations the user has not encountered before. For example, if a user indicates that they enjoy John Lennon's solo song "Imagine", it's reasonable to assume they already know his work with _The Beatles_, so recommending those songs would not help them discover new music.
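
One way to picture the "similar but novel" requirement is as a ranking with a filter. The sketch below is illustrative only and is not Spotify's actual method. It assumes each song is summarized by a few audio features (the feature names match columns in the dataset used later; the values, track names, and the `recommend` helper are hypothetical), ranks candidates by cosine similarity to a liked song, and skips artists the user presumably already knows.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(liked, catalog, known_artists, top_n=1):
    """Rank catalog songs by similarity to `liked`, skipping known artists."""
    candidates = [
        (cosine_similarity(liked, song["features"]), song["track"])
        for song in catalog
        if song["artist"] not in known_artists  # the "novelty" filter
    ]
    return [track for _, track in sorted(candidates, reverse=True)[:top_n]]


# Hypothetical feature vectors: (danceability, energy, acousticness, valence)
liked_profile = (0.55, 0.26, 0.91, 0.17)
catalog = [
    {"track": "Song A", "artist": "The Beatles", "features": (0.54, 0.27, 0.90, 0.18)},
    {"track": "Song B", "artist": "Artist X", "features": (0.50, 0.30, 0.85, 0.20)},
    {"track": "Song C", "artist": "Artist Y", "features": (0.90, 0.95, 0.02, 0.90)},
]

picks = recommend(
    liked_profile,
    catalog,
    known_artists={"John Lennon", "The Beatles"},
)
print(picks)
```

"Song A" is the closest match but is filtered out because its artist is already known, so the quieter, acoustic "Song B" ranks ahead of the energetic "Song C".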

**Prompt**

What strategies are available for profiling music for the purpose of finding similar songs? How would clustering and predictive modeling be helpful for this use case?

**Data Source**


The data for this exercise comes from [Kaggle: The Spotify Hit Predictor Dataset](https://www.kaggle.com/theoverman/the-spotify-hit-predictor-dataset), where a description of each variable can also be found.

Import our packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

List our data files in the variable `datasets`

In [2]:
path = './datasets/'
datasets = [
    'dataset-of-60s.csv',
    'dataset-of-70s.csv',
    'dataset-of-80s.csv',
    'dataset-of-90s.csv',
    'dataset-of-00s.csv',
    'dataset-of-10s.csv'
]

Read each dataset into a DataFrame, adding a `decade` column and renaming the column `target` to `hit_song`. Then combine the results into a single DataFrame and view its head

In [3]:
df_dict = {}
for d_idx, name in enumerate(datasets):
    frame = pd.read_csv(path + name)
    # Label each row with its decade, e.g. 'dataset-of-60s.csv' -> '60s'
    frame['decade'] = name.split('-')[-1].replace('.csv', '')
    # Rename the label column so its meaning is explicit
    df_dict[d_idx] = frame.rename(columns={'target': 'hit_song'})

# Concatenate once instead of growing a DataFrame inside a loop
df = pd.concat(list(df_dict.values()))

df.head()

| | track | artist | uri | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | chorus_hit | sections | hit_song | decade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jealous Kind Of Fella | Garland Green | spotify:track:1dtKN6wwlolkM8XZy2y9C1 | 0.417 | 0.62 | 3 | -7.727 | 1 | 0.0403 | 0.49 | 0.0 | 0.0779 | 0.845 | 185.655 | 173533 | 3 | 32.94975 | 9 | 1 | 60s |
| 1 | Initials B.B. | Serge Gainsbourg | spotify:track:5hjsmSnUefdUqzsDogisiX | 0.498 | 0.505 | 3 | -12.475 | 1 | 0.0337 | 0.018 | 0.107 | 0.176 | 0.797 | 101.801 | 213613 | 4 | 48.8251 | 10 | 0 | 60s |
| 2 | Melody Twist | Lord Melody | spotify:track:6uk8tI6pwxxdVTNlNOJeJh | 0.657 | 0.649 | 5 | -13.392 | 1 | 0.038 | 0.846 | 4e-06 | 0.119 | 0.908 | 115.94 | 223960 | 4 | 37.22663 | 12 | 0 | 60s |
| 3 | Mi Bomba Sonó | Celia Cruz | spotify:track:7aNjMJ05FvUXACPWZ7yJmv | 0.59 | 0.545 | 7 | -12.058 | 0 | 0.104 | 0.706 | 0.0246 | 0.061 | 0.967 | 105.592 | 157907 | 4 | 24.75484 | 8 | 0 | 60s |
| 4 | Uravu Solla | P. Susheela | spotify:track:1rQ0clvgkzWr001POOPJWx | 0.515 | 0.765 | 11 | -3.515 | 0 | 0.124 | 0.857 | 0.000872 | 0.213 | 0.906 | 114.617 | 245600 | 4 | 21.79874 | 14 | 0 | 60s |
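
After combining the files, a quick sanity check is to count the rows and the share of hits per decade. The sketch below shows the idea on tiny in-memory stand-ins rather than the real files. The values are made up, and only the `hit_song` and `decade` column names come from the DataFrame above.

```python
import pandas as pd

# Tiny in-memory stand-ins for two decade files (hypothetical values, not real data)
frames = {
    "60s": pd.DataFrame({"track": ["a", "b", "c"], "hit_song": [1, 0, 0]}),
    "70s": pd.DataFrame({"track": ["d", "e"], "hit_song": [1, 1]}),
}

# Tag each frame with its decade, then combine, mirroring the loading step
combined = pd.concat(
    [frame.assign(decade=decade) for decade, frame in frames.items()],
    ignore_index=True,
)

# Number of rows and share of hits per decade
summary = combined.groupby("decade")["hit_song"].agg(rows="size", hit_rate="mean")
print(summary)
```

On the real data, the same `groupby` call can be applied to `df` directly. A decade with no rows, or with a hit rate far from the others, would suggest a file was skipped or mislabeled.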


Optionally, save the combined DataFrame to the csv file `spotify-songs.csv`

In [4]:
df.to_csv('./datasets/spotify-songs.csv', index=False)