# 02 - Preprocessing and Feature Selection

In this notebook, we define the preprocessing steps and select the audio features used for clustering. Rather than repeating the full analysis from the previous notebook, we focus on:

* Dropping duplicates and rows with missing values
* Selecting certain audio features for clustering
* Demonstrating feature scaling with StandardScaler

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler

DATASET_PATH = Path("..") / "data" / "raw" / "spotify_tracks.csv"

df = pd.read_csv(DATASET_PATH)
df.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


## 1. Remove duplicates and handle missing values in certain audio features

In [6]:
len_before = len(df)
dup_count = df.duplicated().sum()
print(f"Entries in dataset: {len_before}\nDuplicate count: {int(dup_count)}")

Entries in dataset: 114000
Duplicate count: 0


In [None]:
# Drop any duplicate rows
df = df.drop_duplicates()
len_after = len(df)
print(f"Entries in dataset before: {len_before}\nEntries in dataset after: {len_after}")

Entries in dataset before: 114000
Entries in dataset after: 114000
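
One caveat about this check: `df.duplicated()` compares every column, and the preview above shows an `Unnamed: 0` column that looks like a saved row index. If that column is unique per row (an assumption, not verified here), no two rows can match exactly, so the count is always 0. A minimal sketch on made-up toy data shows how the `subset` argument changes the result:

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0 and 1 describe the same track,
# but the unique index-like column makes them differ as full rows.
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "track_id": ["a", "a", "b"],
        "energy": [0.5, 0.5, 0.9],
    }
)

# Comparing all columns finds nothing because "Unnamed: 0" is unique.
full_row_dups = int(toy.duplicated().sum())

# Comparing only the content columns reveals the repeated track.
content_cols = [c for c in toy.columns if c != "Unnamed: 0"]
content_dups = int(toy.duplicated(subset=content_cols).sum())

print(f"Full-row duplicates: {full_row_dups}")
print(f"Duplicates ignoring the index column: {content_dups}")
```

Whether the real dataset contains such repeats is not established in this notebook; the sketch only shows how to check.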


### Select audio feature columns

We will select the Spotify audio features that will be used as input to the clustering algorithm. These values are numeric and describe the sound, not labels such as popularity or genre.

In [15]:
audio_feature_columns = [
    "danceability",
    "energy",
    "loudness",
    "speechiness",
    "acousticness",
    "instrumentalness",
    "liveness",
    "valence",
    "tempo",
    "duration_ms",
]

# We will only get features that are in the dataset columns
available_features = [c for c in audio_feature_columns if c in df.columns]
missing_features = [c for c in audio_feature_columns if c not in df.columns]

available_features, missing_features

(['danceability',
  'energy',
  'loudness',
  'speechiness',
  'acousticness',
  'instrumentalness',
  'liveness',
  'valence',
  'tempo',
  'duration_ms'],
 [])

In [17]:
# Drop any rows with missing values in the selected audio features
len_before_na = len(df)
df_clean = df.dropna(subset=available_features)
len_after_na = len(df_clean)
print(f"Length before cleaning: {len_before_na}\nLength after cleaning: {len_after_na}")

Length before cleaning: 114000
Length after cleaning: 114000


The row count is unchanged by the cleaning. This means that no row has a missing value in any of the selected audio feature columns.
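
A per-column count of missing values makes this kind of check explicit. The following is a minimal sketch on made-up toy data (the frame `toy` and its values are hypothetical), showing what `isna().sum()` reports and how `dropna(subset=...)` reacts when gaps do exist:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and a few deliberate gaps.
toy = pd.DataFrame(
    {
        "energy": [0.5, np.nan, 0.9],
        "tempo": [120.0, 98.0, np.nan],
        "track_genre": ["acoustic", None, "rock"],
    }
)
features = ["energy", "tempo"]

# Number of missing values in each selected feature column.
na_counts = toy[features].isna().sum()

# Rows are dropped only when a gap falls inside the selected columns;
# missing values in other columns (track_genre here) are ignored.
cleaned = toy.dropna(subset=features)

print(na_counts.to_dict())
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```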

## 2. Prepare feature matrix and apply scaling

In [None]:
X = df_clean[available_features].values  # use only the columns confirmed to exist

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled[:3] # Output the first few rows of the scaled feature matrix

array([[ 0.62924424, -0.71714792,  0.30082834,  0.55184753, -0.85020151,
        -0.50410861,  0.75874327,  0.92930586, -1.14186279,  0.02457516],
       [-0.84590798, -1.88997974, -1.78474412, -0.07899331,  1.8317324 ,
        -0.50409391, -0.59121068, -0.79868969, -1.48971712, -0.73085898],
       [-0.74218634, -1.12266943, -0.2932884 , -0.27382571, -0.31549883,
        -0.50411187, -0.50716686, -1.36568823, -1.528312  , -0.16033174]])

In [21]:
# Check approximate means and standard deviations after scaling
import numpy as np
np.mean(X_scaled, axis=0), np.std(X_scaled, axis=0)

(array([ 4.06879209e-16, -2.11417628e-16, -1.27648379e-16,  7.97802370e-18,
        -9.57362844e-17, -1.59560474e-17,  1.13686838e-16,  1.59560474e-16,
        -4.98626481e-16,  2.19395652e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

After scaling, each feature has a mean very close to 0 and a standard deviation of 1. This matters because K-Means relies on Euclidean distance: without scaling, features with larger ranges (such as `duration_ms` or `tempo`) would dominate the distance computation.
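
To make the effect on distances concrete, here is a small NumPy-only sketch on made-up values (the matrix `X_toy` is hypothetical). It standardizes each column as z = (x - mean) / std using the population standard deviation, which is the same computation `StandardScaler` performs by default, and compares Euclidean distances before and after:

```python
import numpy as np

# Hypothetical toy matrix: column 0 is on a 0-1 scale (like danceability),
# column 1 is in milliseconds (like duration_ms).
X_toy = np.array(
    [
        [0.10, 200_000.0],
        [0.90, 201_000.0],
        [0.15, 260_000.0],
    ]
)

# Standardize per column with the population std (ddof=0).
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)


def euclidean(matrix, i, j):
    """Euclidean distance between rows i and j of a matrix."""
    return float(np.linalg.norm(matrix[i] - matrix[j]))


# Raw distances are driven almost entirely by the millisecond column.
raw_01 = euclidean(X_toy, 0, 1)
raw_02 = euclidean(X_toy, 0, 2)

# After standardization both columns contribute on a comparable scale.
scaled_01 = euclidean(Z, 0, 1)
scaled_02 = euclidean(Z, 0, 2)

print(f"Raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"Scaled: d(0,1)={scaled_01:.3f}  d(0,2)={scaled_02:.3f}")
```

In the raw data, row 0 looks far closer to row 1 than to row 2 even though rows 0 and 1 differ strongly in the first column; after standardization the two distances are of similar size.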

## Summary
In this notebook we:
* Dropped duplicate rows from the raw dataset.
* Picked a set of audio features for clustering: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration (`duration_ms`).
* Removed rows with missing values in these features.
* Scaled the selected features using StandardScaler.

These steps match the preprocessing that will be implemented in the project code, and this notebook serves as a reference for how the clustering input is constructed.
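
As a rough illustration of how these steps could be bundled, the sketch below wraps them in a single function. The name `build_feature_matrix`, its signature, and its return values are hypothetical and are not taken from the project code; the toy frame at the end exists only to exercise the function.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

AUDIO_FEATURES = [
    "danceability", "energy", "loudness", "speechiness", "acousticness",
    "instrumentalness", "liveness", "valence", "tempo", "duration_ms",
]


def build_feature_matrix(df, feature_columns=AUDIO_FEATURES):
    """Hypothetical helper mirroring the notebook's preprocessing steps."""
    # Keep only the requested features that exist in the frame.
    available = [c for c in feature_columns if c in df.columns]
    # Drop exact duplicate rows, then rows with gaps in the selected features.
    df_clean = df.drop_duplicates().dropna(subset=available)
    # Standardize each feature to zero mean and unit variance.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_clean[available].to_numpy())
    return X_scaled, scaler, df_clean


# Toy usage with made-up values: one duplicate row and one row with a gap.
toy = pd.DataFrame(
    {
        "energy": [0.5, 0.5, None, 0.9],
        "tempo": [120.0, 120.0, 98.0, 140.0],
    }
)
X_scaled, scaler, toy_clean = build_feature_matrix(toy)
print(X_scaled.shape)
```

Returning the fitted scaler alongside the matrix is a design choice in this sketch: it allows the same transformation to be applied to new tracks later with `scaler.transform`.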