# Data Munging and Modeling

This notebook cleans and processes audio-feature data for a set of songs, then ranks those songs by how similar they are to a reference song. The sample contains 194 songs, one of which is the reference song, "Clocks" by Coldplay.

## Imports

In [1]:
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.simplefilter(action='ignore')

## Data

In [2]:
tracks = pd.read_csv('./data/tracks.csv')
tracks.head()

Unnamed: 0,name,acousticness,analysis_url,artist,danceability,duration_ms,energy,genre_1,genre_10,genre_2,...,mode,n_genres,popularity,speechiness,tempo,time_signature,track_href,type,uri,valence
0,Clocks,0.599,https://api.spotify.com/v1/audio-analysis/0BCP...,Coldplay,0.577,307880,0.749,permanent wave,,pop,...,0,2,75,0.0279,130.969,4,https://api.spotify.com/v1/tracks/0BCPKOYdS2jb...,audio_features,spotify:track:0BCPKOYdS2jbQ8iyB56Zns,0.261
1,Stranger Things (feat. OneRepublic),0.234,https://api.spotify.com/v1/audio-analysis/4sJq...,Kygo,0.604,221440,0.661,edm,,pop,...,1,3,69,0.0375,107.016,4,https://api.spotify.com/v1/tracks/4sJqSKPc5fZ5...,audio_features,spotify:track:4sJqSKPc5fZ5OZ8JiVI44N,0.506
2,Save Tonight (feat. Solamay),0.292,https://api.spotify.com/v1/audio-analysis/2PWK...,Robin Schulz,0.631,216933,0.74,dance pop,,deep euro house,...,1,6,55,0.112,121.962,4,https://api.spotify.com/v1/tracks/2PWKdiDjYkkO...,audio_features,spotify:track:2PWKdiDjYkkOdInPxGrP51,0.396
3,Burning Bridges,0.051,https://api.spotify.com/v1/audio-analysis/3qd6...,OneRepublic,0.579,257573,0.739,dance pop,,neo mellow,...,1,7,47,0.0258,97.976,4,https://api.spotify.com/v1/tracks/3qd6wKpFQf4J...,audio_features,spotify:track:3qd6wKpFQf4JljGU7y18CJ,0.235
4,My Shadow,0.0222,https://api.spotify.com/v1/audio-analysis/7f20...,Keane,0.49,289360,0.416,neo mellow,,piano rock,...,1,4,43,0.0301,98.135,4,https://api.spotify.com/v1/tracks/7f208BAaf1Hw...,audio_features,spotify:track:7f208BAaf1Hw1dCKBwZe8M,0.0449


In [3]:
tracks.shape

(194, 32)

In [4]:
# The reference song
tracks[(tracks['artist'] == 'Coldplay')]

Unnamed: 0,name,acousticness,analysis_url,artist,danceability,duration_ms,energy,genre_1,genre_10,genre_2,...,mode,n_genres,popularity,speechiness,tempo,time_signature,track_href,type,uri,valence
0,Clocks,0.599,https://api.spotify.com/v1/audio-analysis/0BCP...,Coldplay,0.577,307880,0.749,permanent wave,,pop,...,0,2,75,0.0279,130.969,4,https://api.spotify.com/v1/tracks/0BCPKOYdS2jb...,audio_features,spotify:track:0BCPKOYdS2jbQ8iyB56Zns,0.261


## Cleaning

In [5]:
# Drop duplicate songs
tracks.drop_duplicates(subset='name', inplace=True)

In [6]:
# Index by song titles
tracks.set_index('name', inplace=True)

In [7]:
# A list of features to use in a model
features_list = ['acousticness', 
                 'danceability', 
                 'energy', 
                 'instrumentalness', 
                 'liveness', 
                 'loudness', 
                 'mode', 
                 'popularity', 
                 'speechiness', 
                 'tempo', 
                 'time_signature', 
                 'valence']

A song can have anywhere from 1 to more than 10 genres. For this reason there are multiple columns, 'genre_1', 'genre_2', and so on, up to the number of genres held by the most heavily tagged song in the dataframe. This also means that the higher-numbered columns are mostly blank for songs with fewer genres. The goal of this cleaning step is to find every unique genre in the dataframe and build one consolidated column per genre, so that each song has a 1 under every genre it belongs to and a 0 under every genre it does not.

In [8]:
# Collect every genre that appears in any of the genre columns
genre_set = set()
# Loop through each genre column
for i in range(tracks['n_genres'].max()):
    # Add the non-null genres in this column to the running set
    genre_set.update(tracks[f'genre_{i + 1}'].dropna().unique())
    # Create dummy columns from the original genre column
    tracks = pd.get_dummies(tracks, columns=[f'genre_{i + 1}'])

genre_set

{'acoustic pop',
 'alternative dance',
 'alternative rock',
 'ambient worship',
 'art rock',
 'australian dance',
 'australian electropop',
 'australian pop',
 'big room',
 'boy band',
 'britpop',
 'celtic rock',
 'chamber pop',
 'dance pop',
 'dance-punk',
 'deep euro house',
 'deep house',
 'downtempo',
 'dream pop',
 'edm',
 'electro house',
 'electropop',
 'europop',
 'folk-pop',
 'gauze pop',
 'indie anthem-folk',
 'indie folk',
 'indie pop',
 'indie rock',
 'indietronica',
 'irish pop',
 'irish rock',
 'latin alternative',
 'latin rock',
 'melancholia',
 'metropopolis',
 'mexican rock',
 'modern rock',
 'neo mellow',
 'neo-singer-songwriter',
 'neo-synthpop',
 'new rave',
 'new wave pop',
 'oxford indie',
 'permanent wave',
 'piano rock',
 'pop',
 'pop rap',
 'pop rock',
 'post-grunge',
 'post-teen pop',
 'progressive electro house',
 'rock',
 'rock en espanol',
 'singer-songwriter',
 'stomp and holler',
 'tropical house',
 'uk pop',
 'viral pop'}

In [9]:
# Loop through each unique genre
for genre in genre_set:
    # A list to store all column names that belong to a specific genre
    same_genre_cols = []
    # Loop through columns in the dataframe
    for col in tracks.columns:
        # Filter only the genre dummy columns
        if col[:5] == 'genre':
            # Filter only the columns which begin with genre_1 through genre_9
            if col[7] == '_':
                # Filter only the columns which are part of the unique genre
                if genre == col[8:]:
                    # Add those column names to same_genre_cols
                    same_genre_cols.append(col)

            # Filter only the columns which begin with genre_10 or higher
            else:
                # Filter only the columns which are part of the unique genre
                if genre == col[9:]:
                    # Add those column names to same_genre_cols
                    same_genre_cols.append(col)
    
    # Create a new column of all zeros for the unique genre
    tracks[genre] = 0
    # Loop through the columns belonging to that genre
    for col in same_genre_cols:
        # Add each column to the final genre column
        tracks[genre] += tracks[col]
    
    # Add the name of the newly compiled genre column to the features list
    features_list.append(genre)
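
For comparison, the same multi-hot layout can be produced more compactly. The sketch below is not the notebook's method: it uses a made-up three-song frame (the song names, genres, and the variable names `toy`, `long_genres`, and `genre_flags` are all hypothetical) and relies on `melt` and `pd.crosstab` rather than the dummy-column loop above.

```python
import pandas as pd

# Toy frame laid out like the notebook's data: one genre_N column per slot,
# with a missing value where a song has fewer genres. All values are made up.
toy = pd.DataFrame(
    {
        "genre_1": ["pop", "edm", "rock"],
        "genre_2": ["rock", "pop", None],
    },
    index=pd.Index(["Song A", "Song B", "Song C"], name="name"),
)

# Reshape to one row per (song, genre) pair and drop the empty slots
long_genres = (
    toy.reset_index()
    .melt(id_vars="name", value_name="genre")
    .dropna(subset=["genre"])
)

# Count genre occurrences per song, cap at 1, and restore any song that
# had no genres at all (it would otherwise vanish from the crosstab)
genre_flags = (
    pd.crosstab(long_genres["name"], long_genres["genre"])
    .clip(upper=1)
    .reindex(toy.index, fill_value=0)
)

print(genre_flags)
```

Each row of `genre_flags` is a song and each column a unique genre, which is the same shape the loop above builds column by column.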

## Preprocessing

In [10]:
# A dataframe with only the relevant features (copied so later assignments do not touch `tracks`)
model_data = tracks[features_list].copy()
model_data.head()

name,acousticness,danceability,energy,instrumentalness,liveness,loudness,mode,popularity,speechiness,tempo,...,indie rock,irish pop,boy band,electro house,pop rap,neo-singer-songwriter,australian electropop,celtic rock,australian dance,acoustic pop
Clocks,0.599,0.577,0.749,0.0112,0.183,-7.215,0,75,0.0279,130.969,...,0,0,0,0,0,0,0,0,0,0
Stranger Things (feat. OneRepublic),0.234,0.604,0.661,0.0,0.0951,-5.914,1,69,0.0375,107.016,...,0,0,0,0,0,0,0,0,0,0
Save Tonight (feat. Solamay),0.292,0.631,0.74,0.0,0.244,-6.206,1,55,0.112,121.962,...,0,0,0,0,0,0,0,0,0,0
Burning Bridges,0.051,0.579,0.739,0.0,0.083,-5.266,1,47,0.0258,97.976,...,0,0,0,0,1,0,0,0,0,0
My Shadow,0.0222,0.49,0.416,9.5e-05,0.0938,-8.025,1,43,0.0301,98.135,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# Scale the popularity ratings between 0 and 1
model_data['popularity'] = model_data['popularity'] / 100

In [12]:
# Scale the time_signature, tempo, and loudness between 0 and 1
minmax = MinMaxScaler()
model_data[['time_signature', 'tempo', 'loudness']] = minmax.fit_transform(model_data[['time_signature', 'tempo', 'loudness']])
model_data.head()

name,acousticness,danceability,energy,instrumentalness,liveness,loudness,mode,popularity,speechiness,tempo,...,indie rock,irish pop,boy band,electro house,pop rap,neo-singer-songwriter,australian electropop,celtic rock,australian dance,acoustic pop
Clocks,0.599,0.577,0.749,0.0112,0.183,0.522962,0,0.75,0.0279,0.518852,...,0,0,0,0,0,0,0,0,0,0
Stranger Things (feat. OneRepublic),0.234,0.604,0.661,0.0,0.0951,0.675375,1,0.69,0.0375,0.329993,...,0,0,0,0,0,0,0,0,0,0
Save Tonight (feat. Solamay),0.292,0.631,0.74,0.0,0.244,0.641167,1,0.55,0.112,0.447836,...,0,0,0,0,0,0,0,0,0,0
Burning Bridges,0.051,0.579,0.739,0.0,0.083,0.751289,1,0.47,0.0258,0.258716,...,0,0,0,0,1,0,0,0,0,0
My Shadow,0.0222,0.49,0.416,9.5e-05,0.0938,0.428069,1,0.43,0.0301,0.25997,...,0,0,0,0,0,0,0,0,0,0
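
As a small illustration of what `MinMaxScaler` does to each column, the sketch below compares it with the formula `(x - min) / (max - min)` computed by hand. The loudness values and the variable names `loudness`, `scaled`, and `by_hand` are made up for the example and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up loudness values in dB, shaped as a single feature column
loudness = np.array([[-10.0], [-7.5], [-5.0]])

# Scale with scikit-learn
scaled = MinMaxScaler().fit_transform(loudness)

# Same result by hand: (x - min) / (max - min), computed per column
col_min = loudness.min(axis=0)
col_max = loudness.max(axis=0)
by_hand = (loudness - col_min) / (col_max - col_min)

print(scaled.ravel())   # smallest value maps to 0, largest to 1
print(np.allclose(scaled, by_hand))
```

Because the scaling depends only on each column's own minimum and maximum, the three scaled features end up on the same 0 to 1 range as the other audio features.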


## Modeling

In [13]:
# Convert the dataframe to a sparse matrix
tracks_sparse = sparse.csr_matrix(model_data)
# Calculate the cosine distance (1 - cosine similarity) between each pair of songs
recommender = pairwise_distances(tracks_sparse, metric='cosine')
# Convert the results to a dataframe
recommender_df = pd.DataFrame(recommender, columns=model_data.index, index=model_data.index)
recommender_df.head()

name,Clocks,Stranger Things (feat. OneRepublic),Save Tonight (feat. Solamay),Burning Bridges,My Shadow,Fake It,Keep on Walking,Corazón Atómico,Here's To Us,Higher Than The Sun,...,Stop For A Minute,Hielo,Who Am I,Lend Me Some Light,Red Eye,The Driver,Looking Back,Infinito - Live,Bend & Break,Torn Apart (Bastille Vs. GRADES)
Clocks,0.0,0.356805,0.458474,0.523509,0.668838,0.296777,0.584784,0.540068,0.462764,0.584541,...,0.572466,0.55184,0.340328,0.576059,0.45039,0.35456,0.543407,0.614317,0.559869,0.32378
Stranger Things (feat. OneRepublic),0.356805,0.0,0.173444,0.466076,0.567497,0.409591,0.533171,0.633259,0.408747,0.508981,...,0.613521,0.622392,0.451092,0.536534,0.387045,0.4384,0.610339,0.541588,0.478555,0.427203
Save Tonight (feat. Solamay),0.458474,0.173444,0.0,0.456216,0.638686,0.501072,0.521431,0.693,0.393459,0.586684,...,0.682847,0.687031,0.540492,0.608877,0.48725,0.532084,0.665936,0.606291,0.567017,0.515232
Burning Bridges,0.523509,0.466076,0.456216,0.0,0.293379,0.545375,0.470167,0.737672,0.430403,0.29668,...,0.355334,0.732858,0.479292,0.639796,0.402216,0.560047,0.381381,0.643526,0.276917,0.552874
My Shadow,0.668838,0.567497,0.638686,0.293379,0.0,0.685834,0.603685,0.74839,0.762543,0.046323,...,0.125809,0.745485,0.580525,0.711265,0.455696,0.709223,0.16168,0.609056,0.045919,0.695295


Now that we have the pairwise cosine distances, where a smaller value means a more similar song, we can rank all the songs by how close they are to "Clocks" by Coldplay.
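
A minimal sketch on made-up vectors (the song labels, values, and variable names are hypothetical) illustrating that `pairwise_distances` with `metric='cosine'` returns 1 minus the cosine similarity, so sorting in ascending order puts the closest song first. It drops the reference song by label, which is one alternative to slicing off the first row by position.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Made-up feature vectors for three hypothetical songs
toy = pd.DataFrame(
    [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]],
    index=["Reference", "Close match", "Far match"],
)

distance = pairwise_distances(toy, metric="cosine")
similarity = cosine_similarity(toy)

# Cosine distance is 1 - cosine similarity
assert np.allclose(distance, 1 - similarity)

# Rank the other songs by distance to the reference, smallest first
distance_df = pd.DataFrame(distance, index=toy.index, columns=toy.index)
ranked = distance_df["Reference"].drop("Reference").sort_values()

print(ranked)
```

Here "Far match" shares no non-zero feature with the reference, so its distance is 1, while "Close match" points in nearly the same direction and lands near 0.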

In [14]:
# Rank the other songs by cosine distance to the reference song (smallest distance = most similar),
# dropping the reference song itself, whose distance to itself is 0
ranked_choices = recommender_df['Clocks'].drop('Clocks').sort_values()

In [15]:
# Final result
ranked_choices

name
Leaving It Up to You                                             0.242107
Hold My Girl                                                     0.279054
Sugarcoat                                                        0.287675
Four Walls (The Ballad Of Perry Smith)                           0.287968
Fake It                                                          0.296777
Go Gentle                                                        0.300129
Wonder of the World                                              0.302137
Pretty Shining People                                            0.302926
Stupid Love Song - 2006 Demo                                     0.304166
Things We Lost In The Fire                                       0.305939
Way Back When                                                    0.306193
These Streets                                                    0.310803
An Act Of Kindness                                               0.310846
Secret Tattoo                    