# Find similar songs

- Compute the Euclidean and Manhattan distances between all of the songs.
- Look at whether the distances between each pair of songs match your perceived differences.
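
One way to compute both distance matrices is sketched below. It assumes the audio features sit in a DataFrame indexed by song name. The three-song `songs` frame and the `distance_frame` helper are hypothetical stand-ins, with values taken from the dataset used later in this notebook. `pairwise_distances` from Scikit-Learn is one option among several.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in: three songs with values taken from the dataset used below
songs = pd.DataFrame(
    {
        "energy": [0.849, 0.661, 0.140],
        "tempo": [120.014, 104.504, 129.318],
        "valence": [0.844, 0.760, 0.388],
    },
    index=["My Band", "The Real Slim Shady", "The Girl From Ipanema"],
)


def distance_frame(features, metric):
    """Return a square DataFrame of pairwise distances, labelled by song."""
    distances = pairwise_distances(features.to_numpy(), metric=metric)
    return pd.DataFrame(distances, index=features.index, columns=features.index)


euclidean = distance_frame(songs, "euclidean")
manhattan = distance_frame(songs, "manhattan")

print(euclidean.round(3))
print(manhattan.round(3))
```

Each cell holds the distance between the song in its row and the song in its column, so the diagonal is zero and the matrix is symmetric.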

You’ll notice that some of the features are on a much larger scale than others. Loudness ranges from -33 to 0, while energy only ranges from 0 to 1. This means that when computing a distance, a difference in loudness can weigh up to about 33 times more than a difference in energy. To fix this, we need to scale the data and compute the distances again.
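
To see the effect of scale concretely, the sketch below computes the Euclidean distance between two songs by hand, using the energy, tempo and valence values from the dataset loaded further down. Tempo is measured in beats per minute, so here it plays the same dominating role described above for loudness. The variable names are illustrative.

```python
import math

# (energy, tempo, valence) for two songs from the dataset used in this notebook
my_band = (0.849, 120.014, 0.844)
slim_shady = (0.661, 104.504, 0.760)

# Squared difference contributed by each feature
squared_diffs = [(a - b) ** 2 for a, b in zip(my_band, slim_shady)]
euclidean = math.sqrt(sum(squared_diffs))

# Share of the squared distance that comes from tempo alone
tempo_share = squared_diffs[1] / sum(squared_diffs)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Share of squared distance due to tempo: {tempo_share:.4%}")
```

For this pair, tempo accounts for more than 99.9% of the squared distance, so energy and valence barely influence the result until the features are scaled.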

To scale our data we will transform it using Scikit-Learn.

Scikit-Learn is one of Python’s most popular libraries for getting started with Machine Learning. Most of the data transformations and algorithms that a Data Science learner might want to use are available in Scikit-Learn. Like Pandas, it is built on top of NumPy (Scikit-Learn also relies on SciPy), which means it generally integrates well with our main exploratory tools, including Matplotlib, and performs quickly.

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
cwd = os.getcwd()

In [15]:
csv_file = os.path.join(cwd, "df_audio_features_10.csv")
df = pd.read_csv(csv_file, index_col='song_name', nrows=10)
# df.drop(['artist', 'id', 'html'], axis=1, inplace=True)

In [16]:
df = df[['energy', 'tempo', 'valence']]
df

song_name,energy,tempo,valence
My Band,0.849,120.014,0.844
The Real Slim Shady,0.661,104.504,0.76
Águas De Março,0.339,143.418,0.491
The Girl From Ipanema,0.14,129.318,0.388
"Paint It, Black",0.795,158.691,0.612
Sultans Of Swing,0.794,148.174,0.931
Space Raiders - Charlotte de Witte Remix,0.731,131.997,0.0598
In Silence,0.845,128.009,0.198
"Wiegenlied, Op. 49, No. 4 (Arr. for Cello and Piano) [Brahms Lullaby]",0.00833,61.541,0.211
Nocturne en mi bémol majeur opus 9 n°2: Ballade en Sol Mineur No.1,0.0451,61.494,0.071


In [20]:
# scale our data

from sklearn.preprocessing import MinMaxScaler

# initialize the transformer (optionally, set parameters)
#     feature_range=(0, 1), which is also the default, scales every feature to lie between 0 and 1
transformer = MinMaxScaler(feature_range=(0,1))

# fit the transformer to the data: fit stores the minimum and maximum of each column
transformer.fit(df)

# use the transformer to transform the data
scaled_audio_features = transformer.transform(df)

# reconvert the transformed data back to a DataFrame
scaled_df = pd.DataFrame(scaled_audio_features,index=df.index,columns=df.columns)

scaled_df

song_name,energy,tempo,valence
My Band,1.0,0.602076,0.900138
The Real Slim Shady,0.776369,0.442503,0.803719
Águas De Março,0.393341,0.842866,0.494949
The Girl From Ipanema,0.156625,0.697799,0.376722
"Paint It, Black",0.935766,1.0,0.633838
Sultans Of Swing,0.934576,0.891797,1.0
Space Raiders - Charlotte de Witte Remix,0.859636,0.725362,0.0
In Silence,0.995242,0.684332,0.158632
"Wiegenlied, Op. 49, No. 4 (Arr. for Cello and Piano) [Brahms Lullaby]",0.0,0.000484,0.173554
Nocturne en mi bémol majeur opus 9 n°2: Ballade en Sol Mineur No.1,0.043739,0.0,0.012856
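
As a cross-check on what `MinMaxScaler` does, the sketch below applies the min-max formula, (x - min) / (max - min), column by column with plain Pandas, compares the result with Scikit-Learn, and then recomputes one Euclidean distance on the scaled values. The `features` frame copies the values from the unscaled table above. Some index labels are shortened for brevity, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Feature values copied from the unscaled table above (some song names shortened)
features = pd.DataFrame(
    {
        "energy": [0.849, 0.661, 0.339, 0.14, 0.795, 0.794, 0.731, 0.845, 0.00833, 0.0451],
        "tempo": [120.014, 104.504, 143.418, 129.318, 158.691, 148.174, 131.997, 128.009, 61.541, 61.494],
        "valence": [0.844, 0.76, 0.491, 0.388, 0.612, 0.931, 0.0598, 0.198, 0.211, 0.071],
    },
    index=[
        "My Band", "The Real Slim Shady", "Águas De Março", "The Girl From Ipanema",
        "Paint It, Black", "Sultans Of Swing", "Space Raiders", "In Silence",
        "Wiegenlied", "Nocturne",
    ],
)

# Min-max scaling by hand: (x - column minimum) / (column maximum - column minimum)
manual_scaled = (features - features.min()) / (features.max() - features.min())

# The same transformation with Scikit-Learn, for comparison
sklearn_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(features)
print(np.allclose(manual_scaled.to_numpy(), sklearn_scaled))

# Euclidean distance between two songs, before and after scaling
a, b = "My Band", "The Real Slim Shady"
raw_distance = np.linalg.norm(features.loc[a].to_numpy() - features.loc[b].to_numpy())
scaled_distance = np.linalg.norm(
    manual_scaled.loc[a].to_numpy() - manual_scaled.loc[b].to_numpy()
)
print(f"raw: {raw_distance:.3f}, scaled: {scaled_distance:.3f}")
```

After scaling, every feature can contribute at most 1 to the difference between two songs, so tempo no longer drowns out energy and valence.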
