# PCA

Principal Component Analysis is a technique for dimensionality reduction when a data set is very high-dimensional (i.e., contains many features). The goal is to simplify the data set while retaining as much information or variability in the data as possible.

Clustering is an unsupervised machine learning technique that creates groups in the data. For this clustering analysis, we attempt to cluster based on time signature.

Clustering is susceptible to a phenomenon known as the curse of dimensionality: as the number of features grows, distance metrics become less meaningful, so clustering becomes harder to perform and less accurate. Reducing the dimension of the data set may therefore make clustering more effective. Our approach is to perform k-means clustering on the data, apply principal component analysis to create a transformed (simpler) data set, perform k-means clustering again on the PCA-transformed data, and finally compare evaluation metrics for the two clustering schemes.
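To make the distance-metric point concrete, here is a minimal sketch on synthetic uniform data (not the Spotify data). The function name `relative_contrast` and the chosen dimensions are illustrative assumptions. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; this contrast tends to shrink as the dimension grows.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min over distances from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))  # synthetic data, assumed uniform
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
    return float((dists.max() - dists.min()) / dists.min())

# Compare a low-dimensional, a moderate, and a very high-dimensional setting.
for n_dims in (2, 14, 1000):
    print(n_dims, round(relative_contrast(1000, n_dims), 3))
```

A small contrast means the nearest and farthest neighbors are almost equally far away, which is what makes distance-based methods such as k-means less discriminating.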

In [1]:
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples, rand_score, adjusted_rand_score
from preprocessing import preprocessing_data
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

df_raw = pd.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")
df = preprocessing_data(df_raw)
genres = df['track_genre'].unique()

X_train, X_test, y_train, y_test = train_test_split(df.drop(['track_genre'], axis=1),
                                                    df.track_genre, test_size=0.3)

In [2]:
df_clustering = df.drop(columns='track_genre')
# Note: no rescaling happens in this cell; the features are used as returned by
# preprocessing_data, and the assignment below leaves their values unchanged.
num_cols = df_clustering.columns.values.tolist()
num_cols.remove('time_signature')
df_clustering[num_cols] = df_clustering.drop(columns='time_signature')

In [3]:
set(df_clustering['time_signature'])

{-6.727453200452815,
 -4.982634088861116,
 -1.4929958656777174,
 0.25182324591398175,
 1.9966423575056809}

In [4]:
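# One cluster per distinct time_signature value (five in total).
# n_init=10 keeps the current default explicitly and avoids the n_init FutureWarning.
kmeans = KMeans(n_clusters=5, n_init=10)
y_kmeans = kmeans.fit_predict(df_clustering.drop(columns='time_signature'))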

In [5]:
# Relabel clusters from {0, 1, 2, 3, 4} to {0, 1, 3, 4, 5}.
# Cluster IDs are arbitrary, so this does not change the silhouette score.
kmeans.labels_[kmeans.labels_ > 1] += 1
set(kmeans.labels_)

{0, 1, 3, 4, 5}

In [6]:
print(silhouette_score(df_clustering.drop(columns='time_signature'), kmeans.labels_))

0.12571039801491907


In [7]:
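# PCA via SVD; this assumes the feature columns are already centered.
# The rows of pca_V are the principal directions. full_matrices=False avoids
# building an n x n matrix for pca_U, which is not needed below.
pca_U, pca_d, pca_V = np.linalg.svd(df_clustering.drop(columns='time_signature'),
                                    full_matrices=False)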

In [8]:
prop_var = np.square(pca_d) / sum(np.square(pca_d))
scree_data = pd.DataFrame(
{"PC": 1 + np.arange(0, prop_var.shape[0]),
"variability_explained": prop_var.round(4),
"cumulative_variability_explained": prop_var.cumsum().round(4)
})
scree_data.head(20)

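|   | PC | variability_explained | cumulative_variability_explained |
|---|----|-----------------------|----------------------------------|
| 0 | 1  | 0.2228 | 0.2228 |
| 1 | 2  | 0.14   | 0.3627 |
| 2 | 3  | 0.104  | 0.4667 |
| 3 | 4  | 0.085  | 0.5516 |
| 4 | 5  | 0.0749 | 0.6266 |
| 5 | 6  | 0.0634 | 0.6899 |
| 6 | 7  | 0.0603 | 0.7502 |
| 7 | 8  | 0.0584 | 0.8086 |
| 8 | 9  | 0.0546 | 0.8632 |
| 9 | 10 | 0.0466 | 0.9098 |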


In [9]:
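# Scree plot for the first 14 components; the x-axis uses the PC numbers 1..14
# from scree_data rather than the 0-based positions.
px.line(scree_data.iloc[:14], x="PC", y="variability_explained",
        labels={"PC": "PC",
                "variability_explained": "Proportion explained"},
        width=600, height=400)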

We performed Principal Component Analysis on the data to prepare for clustering based on time signature. The scree table and plot indicate that the first two principal components capture about 36.27% of the variability in the data, and after that, each principal component makes a small, slowly decreasing contribution. Keeping only the first two principal components would therefore produce a data set that does not capture nearly enough information from the original data to be usable. Moreover, if we want to retain most of the information in the original data, say 90%, we would need to keep the first ten principal components, which is only a modest reduction from the fourteen original features.
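The "how many components for 90%?" step can be sketched as a small helper. This is a minimal illustration: the function name `components_needed` is hypothetical, and the input list simply copies the ten `variability_explained` values displayed in the scree table above (the remaining components are not shown there, so they are omitted here).

```python
import numpy as np

def components_needed(prop_var, threshold=0.90):
    """Smallest number of leading PCs whose cumulative variance share reaches threshold."""
    cumulative = np.cumsum(np.asarray(prop_var, dtype=float))
    if cumulative[-1] < threshold:
        raise ValueError("threshold is not reached by the supplied components")
    # searchsorted returns the first index where cumulative >= threshold (0-based).
    return int(np.searchsorted(cumulative, threshold) + 1)

# Proportions copied from the scree table (first ten components only).
table_prop_var = [0.2228, 0.14, 0.104, 0.085, 0.0749,
                  0.0634, 0.0603, 0.0584, 0.0546, 0.0466]

print(components_needed(table_prop_var, 0.90))  # 10
print(components_needed(table_prop_var, 0.85))  # 9
```

In the notebook itself, the same helper could be called on the full `prop_var` array computed from `pca_d`.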

In [10]:
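# Project the data onto the first nine principal directions (rows of pca_V).
# Despite its name, X_train_pca is computed from all of df_clustering, not X_train.
X_train_pca = np.dot(df_clustering.drop(columns='time_signature'), pca_V[:9].T)
X_train_pca = pd.DataFrame(X_train_pca, columns=[f'PC{x}' for x in range(1, 10)])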
X_train_pca.head()

|   | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0.412704 | 0.04323 | -1.101601 | -0.461544 | -1.519312 | 0.408231 | 2.863886 | 0.878602 | -0.041124 |
| 1 | -2.925138 | 0.138393 | -1.32063 | -1.543834 | -0.46879 | 0.636707 | 0.820311 | 0.487203 | -0.099153 |
| 2 | -1.394686 | -0.255342 | -0.014123 | -2.01123 | -0.554694 | 0.661321 | 1.360074 | 0.917805 | 0.60141 |
| 3 | -2.851148 | -0.529071 | -0.547554 | -2.795799 | -1.290943 | 0.265698 | 0.400213 | -2.066422 | -1.358516 |
| 4 | -1.062244 | -0.629026 | -0.890981 | -2.168618 | -1.663503 | 0.997489 | 1.327824 | -0.362974 | 0.622166 |


In [11]:
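# Same k-means setup as before, now fitted on the nine PCA-transformed features.
# n_init=10 keeps the current default explicitly and avoids the n_init FutureWarning.
kmeans_new = KMeans(n_clusters=5, n_init=10)
y_kmeans = kmeans_new.fit_predict(X_train_pca)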

In [12]:
# Relabel clusters from {0, 1, 2, 3, 4} to {0, 1, 3, 4, 5}.
# Cluster IDs are arbitrary, so this does not change the silhouette score.
kmeans_new.labels_[kmeans_new.labels_ > 1] += 1
set(kmeans_new.labels_)

{0, 1, 3, 4, 5}

In [13]:
print(silhouette_score(X_train_pca, kmeans_new.labels_))

0.1561980917030609


Keeping the first nine principal components and clustering on the PCA-transformed data set, we obtain only a slightly higher silhouette score of 0.1562, compared to 0.1257 before performing PCA. Because k-means is run without a fixed random seed, these values vary somewhat from run to run. The silhouette score measures how tightly and distinctly the data is clustered. It ranges from -1 to 1: values near 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. It is one measure of effectiveness for a clustering algorithm. The silhouette scores for both clustering schemes suggest that the effects of PCA are minimal for this data set, so we will not employ it in the main analysis of our classification of track genres.
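To show what the silhouette score computes, here is a minimal NumPy-only sketch of the standard definition, s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster. The function name `silhouette_manual` and the four toy points are illustrative assumptions; for real data, `sklearn.metrics.silhouette_score` as used above is the appropriate tool.

```python
import numpy as np

def silhouette_manual(X, labels):
    """Mean silhouette over all points (Euclidean distance, needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix: O(n^2) memory, so only for small inputs.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # Mean distance to own cluster, excluding the point itself (distance 0).
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy data: two tight pairs of points, far apart from each other.
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(silhouette_manual(toy_points, [0, 0, 1, 1]))  # sensible grouping, close to 1
print(silhouette_manual(toy_points, [0, 1, 0, 1]))  # mixed-up grouping, negative
print(silhouette_manual(toy_points, [0, 0, 3, 3]))  # renamed labels, same score as the first
```

The last call illustrates that renaming cluster IDs, as the relabeling cells above do, leaves the score unchanged.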