In this first notebook, I use Spotify track data to find songs that are similar to a given song. Similar songs are simply those whose vector embeddings are close to the embedding of the given song. I construct each embedding from standardized versions of audio attributes such as "acousticness", ..., "valence".

In [1]:
import numpy as np
import pandas as pd
import glob
import tqdm
from collections import OrderedDict
import langid # I want English songs

In [3]:
csv = pd.read_csv("public/data/tracks.csv")  # a single fixed path, so no glob expansion is needed

In [4]:
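def scale_0_1(df, col):
    # Despite the name, this standardizes the column in place to zero mean and
    # unit standard deviation (a z-score); it does not map values into [0, 1].
    df.loc[:, col] = (df[col] - df[col].mean()) / df[col].std()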
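
As a quick sanity check on what this function does, here is a minimal sketch on a made-up column. The toy DataFrame and its values are assumptions for illustration and are not taken from the tracks data. After the call, the column should have mean 0 and sample standard deviation 1, while individual values can still fall outside [0, 1].

```python
import pandas as pd

def scale_0_1(df, col):
    # Same z-score standardization as in the notebook cell
    df.loc[:, col] = (df[col] - df[col].mean()) / df[col].std()

# Hypothetical toy data: five tempo values in BPM
toy = pd.DataFrame({"tempo": [60.0, 90.0, 120.0, 150.0, 180.0]})
scale_0_1(toy, "tempo")

col_mean = toy["tempo"].mean()
col_std = toy["tempo"].std()   # pandas uses the sample std (ddof=1) by default
print(col_mean, col_std)
print(toy["tempo"].min(), toy["tempo"].max())
```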

In [5]:
attrs = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence"]
for attr in attrs:
    scale_0_1(csv, attr) # Standardize each attribute to zero mean and unit variance
csv.loc[:, "name"] = csv["name"].str.lower()
csv = csv.drop_duplicates(subset=attrs)

In [6]:
# langid.classify is a fast way to check the language of the song name
def check_en(name):
    try:
        return langid.classify(name)[0] == "en"
    except Exception:  # e.g. a non-string input such as a missing name
        return False

tqdm.tqdm.pandas()
csv = csv[csv["name"].progress_apply(check_en)]

100%|██████████| 566113/566113 [07:06<00:00, 1325.91it/s]


In [7]:
song_name = "i'll do anything"
matches = csv["name"].str.contains(song_name, regex=False)  # literal substring match
rows = csv.loc[matches, ["name", "id", "artists"] + attrs]
vec = rows[attrs].iloc[2].to_numpy()  # third match: the Jason Mraz track
rows

Unnamed: 0,name,id,artists,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
56455,i'll do anything for you,3CuKcfH8BJ5The5G3llaKD,['Denroy Morgan'],-0.730608,1.013868,1.091461,-0.425109,-0.713601,0.714253,1.006913,-0.110027,0.976856
458469,i'll do anything,2mmqUU64P1BhJNvH1jtofT,['D-Train'],-1.164622,1.519579,0.075277,-0.425099,-0.817765,0.149738,-0.155446,0.047277,0.775049
555315,i'll do anything,0hK8yn7I0oqrqXljVVGPla,['Jason Mraz'],-0.948475,0.038568,0.670697,-0.42512,0.749028,1.188775,-0.193246,1.500604,0.375314
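
The cell above only picks the query vector. The notion of similarity described at the top (songs whose embeddings are close to the query's) could then be sketched as a Euclidean nearest-neighbour lookup. This is an assumption about the next step, not code from the notebook: the helper `nearest_songs`, the toy matrix, and the choice of Euclidean distance are all hypothetical.

```python
import numpy as np

def nearest_songs(matrix, query, k=3):
    """Return positions and distances of the k rows of `matrix` closest to `query`."""
    dists = np.linalg.norm(matrix - query, axis=1)  # Euclidean distance per row
    order = np.argsort(dists)[:k]                   # positions of the k smallest
    return order, dists[order]

# Made-up standardized attribute vectors (4 songs x 3 attributes)
toy_matrix = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.1, -0.1, 0.0],
    [-2.0, 0.5, 1.5],
])
toy_query = toy_matrix[0]  # use the first song as the query

idx, dist = nearest_songs(toy_matrix, toy_query, k=3)
print(idx, dist)
```

With the notebook's data this might be called as `nearest_songs(csv[attrs].to_numpy(), vec)`, with the hits looked up via `csv.iloc[idx]`. The first hit is the query song itself at distance 0.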


In [17]:
csv2 = csv[["acousticness","danceability","energy","instrumentalness","valence","liveness","speechiness"]]

In [18]:
from sklearn.manifold import TSNE

In [19]:
csv2.to_numpy().shape

(215035, 7)

In [20]:
X_embedded = TSNE(n_components=3, verbose=1, n_jobs=8).fit_transform(csv2[:10000].to_numpy())

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.013s...
[t-SNE] Computed neighbors for 10000 samples in 0.375s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for sample 10000 / 10000
[t-SNE] Mean sigma: 0.303204
[t-SNE] KL divergence after 250 iterations with early exaggeration: 83.139069
[t-SNE] KL divergence after 1000 iterations: 1.429928


In [21]:
X_embedded.shape

(10000, 3)

In [22]:
csv3 = csv[:10000].copy()  # copy, so adding columns later does not write into a slice of csv

In [26]:
csv3['x'] = X_embedded[:, 0]
csv3['y'] = X_embedded[:, 1]
csv3['z'] = X_embedded[:, 2]