# Visualizing Spotify 01. Part 1: Principal Component Analysis

This notebook documents all the steps needed to create the final data visualization.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import altair as alt

In [2]:
# Remove later
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

## Load the data

In [3]:
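# Convert a duration from milliseconds to minutes (default) or seconds
def convert_miliseconds(miliseconds, conversion = 'minutes'):
    if conversion == 'minutes':
        return miliseconds / 60000
    elif conversion == 'seconds':
        return miliseconds / 1000
    # Fail loudly instead of silently returning None for an unknown unit
    raise ValueError(f"Unsupported conversion: {conversion!r}")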

In [4]:
# René Pérez Joglar, a.k.a. Residente, was the vocalist and
# songwriter of Calle 13. In this project I treat both
# names as the same artist.
def find_residente_calle13(artist):
    if artist in ('Calle 13', 'Residente'):
        return 'Residente/Calle 13'
    return artist

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv")

# Apply initial functions
df['artist'] = df['artist'].apply(find_residente_calle13)
df['duration_minutes'] = df['duration_ms'].apply(convert_miliseconds, args = ('minutes',))
df = df.drop(columns='duration_ms')

# Remove rows that repeat the same song name and artist.
# For example, "Growing Pains" by Alessia Cara appears twice.
df = df.drop_duplicates(subset=['name', 'artist'], keep='first')
df = df.reset_index(drop=True)

print(f'This dataset has {df.shape[1]} attributes and {df.shape[0]} instances')

df.head()

This dataset has 18 attributes and 504 instances


| | name | artist | album | URI | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | artist_popularity | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | On My Wave | Keiynan Lonsdale | Rainbow Boy | 01iaXaqUjZVsdLp2yF3OQ9 | 0.486 | 0.773 | 6 | -4.199 | 1 | 0.245 | 0.132 | 7e-06 | 0.084 | 0.439 | 120.145 | 4 | 44 | 4.1177 |
| 1 | 96000 | Anthony Ramos | In The Heights (Original Motion Picture Soundt... | 0CpE5SeQkHQPYiWX0psxf4 | 0.479 | 0.614 | 7 | -7.001 | 1 | 0.336 | 0.013 | 0.0 | 0.162 | 0.406 | 171.93 | 4 | 72 | 5.764517 |
| 2 | Pessimist | Greta Isaac | Pessimist | 0IkBQt9vSLLwoX0knkusSl | 0.539 | 0.524 | 4 | -8.279 | 1 | 0.236 | 0.0473 | 0.000349 | 0.0806 | 0.729 | 174.047 | 3 | 39 | 3.20045 |
| 3 | Girl Next Door | Alessia Cara | The Pains Of Growing | 0JjJGeUbFqCRe4nKNVCAz9 | 0.705 | 0.516 | 0 | -5.77 | 1 | 0.12 | 0.583 | 0.0 | 0.078 | 0.422 | 115.74 | 4 | 81 | 3.376667 |
| 4 | Somebody Else | Alessia Cara | In The Meantime | 0O9ijXvoYpXDjqhaYZtA2X | 0.718 | 0.721 | 4 | -5.525 | 1 | 0.0429 | 0.019 | 0.0 | 0.144 | 0.963 | 108.07 | 4 | 81 | 3.5769 |
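
The cleaning steps in cell 5 can also be written with vectorized pandas operations instead of `apply`. The following is a minimal sketch on a made-up frame: the title "Song A" and all durations are invented for illustration, and only the column names and the artist labels come from the notebook. It is an alternative way to express the same steps, not the code the notebook actually runs.

```python
import pandas as pd

# Toy data (hypothetical rows): one exact duplicate and one song credited
# to both "Calle 13" and "Residente".
toy = pd.DataFrame({
    "name": ["Growing Pains", "Growing Pains", "Song A", "Song A"],
    "artist": ["Alessia Cara", "Alessia Cara", "Calle 13", "Residente"],
    "duration_ms": [180000, 180000, 240000, 240000],
})

# Merge both artist names into a single label
merged_label = "Residente/Calle 13"
toy["artist"] = toy["artist"].replace({"Calle 13": merged_label, "Residente": merged_label})

# Milliseconds to minutes, computed on the whole column at once
toy["duration_minutes"] = toy["duration_ms"] / 60_000

# Drop the old column, then keep the first row of each (name, artist) pair
toy = (
    toy.drop(columns="duration_ms")
       .drop_duplicates(subset=["name", "artist"], keep="first")
       .reset_index(drop=True)
)

print(toy)
```

Merging the artist names before removing duplicates matters here: the two "Song A" rows only become duplicates once both share the same artist label.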


## Feature selection

In [6]:
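# Correlations are only defined for numeric columns; recent pandas versions
# raise an error if text columns such as 'name' or 'artist' are included.
corr = df.select_dtypes(include='number').corr()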
corr.style.background_gradient(cmap='coolwarm', vmin=-1, vmax=1)

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | artist_popularity | duration_minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| danceability | 1.0 | 0.127612 | 0.021153 | 0.176819 | -0.090161 | 0.07542 | -0.201326 | -0.215876 | -0.048589 | 0.444468 | -0.184825 | 0.227778 | 0.038105 | -0.190073 |
| energy | 0.127612 | 1.0 | 0.001113 | 0.767766 | -0.117527 | -0.040563 | -0.705972 | -0.173311 | 0.200344 | 0.383683 | 0.048875 | 0.209125 | 0.060131 | -0.001182 |
| key | 0.021153 | 0.001113 | 1.0 | 0.039409 | -0.162088 | 0.065226 | -0.030663 | -0.042879 | 0.020586 | 0.050317 | -0.000699 | -0.006607 | 0.026773 | 0.021748 |
| loudness | 0.176819 | 0.767766 | 0.039409 | 1.0 | -0.109441 | -0.108657 | -0.601774 | -0.369076 | 0.162241 | 0.335391 | 0.048199 | 0.112241 | 0.173017 | -0.072898 |
| mode | -0.090161 | -0.117527 | -0.162088 | -0.109441 | 1.0 | -0.05309 | 0.133898 | -0.033003 | -0.107783 | -0.030758 | 0.056471 | -0.04477 | -0.095725 | 0.020642 |
| speechiness | 0.07542 | -0.040563 | 0.065226 | -0.108657 | -0.05309 | 1.0 | 0.031802 | -0.092995 | 0.068295 | 0.159426 | 0.154653 | 0.024017 | -0.083269 | -0.027039 |
| acousticness | -0.201326 | -0.705972 | -0.030663 | -0.601774 | 0.133898 | 0.031802 | 1.0 | 0.201692 | -0.164617 | -0.200288 | -0.092962 | -0.201023 | -0.073937 | 0.019949 |
| instrumentalness | -0.215876 | -0.173311 | -0.042879 | -0.369076 | -0.033003 | -0.092995 | 0.201692 | 1.0 | -0.062551 | -0.202694 | -0.068367 | -0.058719 | -0.157964 | 0.009066 |
| liveness | -0.048589 | 0.200344 | 0.020586 | 0.162241 | -0.107783 | 0.068295 | -0.164617 | -0.062551 | 1.0 | 0.082365 | 0.04009 | 0.050474 | -0.032819 | -0.069929 |
| valence | 0.444468 | 0.383683 | 0.050317 | 0.335391 | -0.030758 | 0.159426 | -0.200288 | -0.202694 | 0.082365 | 1.0 | -0.044667 | 0.133129 | -0.041914 | -0.177563 |


> In the project ["Unsupervised Anomaly Detection on Spotify data 🎵: K-Means vs Local Outlier Factor"](https://github.com/isaacarroyov/spotify_anomalies_kmeans-lof), I chose 6 variables that weren't highly correlated. For this data visualization I'm going to use highly correlated features instead. Why? PCA keeps the directions of greatest variance, and highly correlated features share most of their variance, so a couple of components can retain most of the information they carry.
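
To make the argument about correlated features concrete, here is a small sketch on synthetic data (not the Spotify dataset). It assumes four features driven by one shared latent factor as a stand-in for "highly correlated" and four independent features as the contrast; the noise level of 0.3 and the sample size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples = 500

# Highly correlated: every feature is the same latent factor plus small noise
latent = rng.normal(size=(n_samples, 1))
correlated = latent + 0.3 * rng.normal(size=(n_samples, 4))

# Uncorrelated: four independent features
independent = rng.normal(size=(n_samples, 4))

# Share of the total variance kept by the first two principal components
ratio_correlated = PCA(n_components=2).fit(correlated).explained_variance_ratio_.sum()
ratio_independent = PCA(n_components=2).fit(independent).explained_variance_ratio_.sum()

print(f"correlated features:  {ratio_correlated:.2f}")
print(f"independent features: {ratio_independent:.2f}")
```

With correlated inputs the two components keep nearly all of the variance, while with independent inputs they keep only a little more than half, which is the reason for picking correlated features here.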


In [7]:
df_pca = df[['energy', 'loudness', 'acousticness', 'valence']]
df_pca.head()

| | energy | loudness | acousticness | valence |
|---|---|---|---|---|
| 0 | 0.773 | -4.199 | 0.132 | 0.439 |
| 1 | 0.614 | -7.001 | 0.013 | 0.406 |
| 2 | 0.524 | -8.279 | 0.0473 | 0.729 |
| 3 | 0.516 | -5.77 | 0.583 | 0.422 |
| 4 | 0.721 | -5.525 | 0.019 | 0.963 |


## Dimensionality Reduction: Principal Component Analysis

In [8]:
from sklearn.decomposition import PCA

> I'm going to reduce the four-dimensional dataset (energy, loudness, acousticness and valence) to a two-dimensional one.
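
For readers who want to see what this reduction does under the hood, the sketch below reproduces a two-component PCA by hand with a singular value decomposition and compares it with scikit-learn. The input is random synthetic data, not the notebook's `df_pca`, and the variable names are illustrative. Component signs are arbitrary in PCA, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic 4-feature data with some correlation between columns
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# 1. Center each feature
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto the first two directions
Z_manual = X_centered @ Vt[:2].T

# 4. Share of variance explained by each of the two components
ratio_manual = (S**2 / np.sum(S**2))[:2]

# Same reduction with scikit-learn
pca = PCA(n_components=2).fit(X)
Z_sklearn = pca.transform(X)

print(np.allclose(np.abs(Z_manual), np.abs(Z_sklearn), atol=1e-6))
print(ratio_manual, pca.explained_variance_ratio_)
```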

In [9]:
pca = PCA(n_components = 2)

# Create new features
pca.fit(df_pca.values)
df[['z1','z2']] = pca.transform(df_pca.values)

# Percentage of variance explained by each of the components
z1_explained_variance, z2_explained_variance = pca.explained_variance_ratio_
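
One thing worth checking with this setup: PCA is sensitive to feature scale, and `loudness` is measured in decibels while the other three features lie between 0 and 1. The sketch below uses made-up data with similar scales (the numbers are invented, only the column names match the notebook) to show how the loadings change when a `StandardScaler` is applied first. Whether to standardize is a judgment call; the results in this notebook were produced without it.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Hypothetical data: loudness spans roughly -20 to -5 dB, the rest stay in [0, 1]
energy = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 2, n),
    "acousticness": np.clip(1 - energy + rng.normal(0, 0.2, n), 0, 1),
    "valence": rng.uniform(0, 1, n),
})

# PCA on the raw values versus PCA on standardized values
raw_pca = PCA(n_components=2).fit(toy.values)
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(toy.values)

# Loadings: how much each original feature contributes to each component
raw_loadings = pd.DataFrame(raw_pca.components_, columns=toy.columns, index=["z1", "z2"])
scaled_loadings = pd.DataFrame(
    scaled_pca.named_steps["pca"].components_, columns=toy.columns, index=["z1", "z2"]
)

print(raw_loadings.round(2))
print(scaled_loadings.round(2))
```

On the raw toy data the first component is almost entirely `loudness`, because its variance is far larger than that of the 0-to-1 features. After standardization the correlated features contribute more evenly.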

## Create visualization

> The sole purpose of this data visualization is to showcase the entire distribution of the dataset in an eye-pleasing way. The variables mapped to colour and size were chosen for aesthetic reasons.

In [10]:
top_10_artists = df.groupby('artist').count().reset_index()[['artist','name']]\
                 .sort_values("name", ascending=False)\
                 .rename(columns={"name":"numbers_songs"}).head(10)
list_top_10_artist = top_10_artists.artist.values.tolist()

In [11]:
def label_top_ten_artist(artist, list_top_ten_artists):
    if artist in list_top_ten_artists:
        return artist
    return "Other"

df['artist_top_ten'] = df['artist'].apply(label_top_ten_artist, args=(list_top_10_artist,))
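
The ranking and labelling in cells 10 and 11 can also be expressed with `value_counts` and `Series.where`. This is a minimal sketch on a made-up frame (the artist names "A", "B", "C" and the top-2 cutoff are placeholders), shown as an alternative rather than as the notebook's own approach.

```python
import pandas as pd

# Toy data (hypothetical): artist A has 3 songs, B has 2, C has 1
toy = pd.DataFrame({
    "name": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "artist": ["A", "B", "A", "C", "B", "A"],
})

# Artists ranked by number of songs; keep the top 2 (top 10 in the notebook)
top_artists = toy["artist"].value_counts().head(2).index.tolist()

# Keep the artist name where it is in the top list, otherwise label it "Other"
toy["artist_top"] = toy["artist"].where(toy["artist"].isin(top_artists), "Other")

print(top_artists)
print(toy)
```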

In [12]:
# Create layers of information
chart = alt.Chart(data=df).mark_circle()\
        .encode(x= alt.X(shorthand='z1:Q', axis= alt.Axis(title=[f'Z₁ (explains {round(z1_explained_variance * 100,1)}% ','of the variance)'])),
                y= alt.Y(shorthand='z2:Q', axis= alt.Axis(title=[f'Z₂ (explains {round(z2_explained_variance * 100,1)}% ','of the variance)'])),
                opacity = alt.Opacity(shorthand='valence:Q',
                                      scale= alt.Scale(range=[0.2,0.9]),
                                      legend= alt.Legend(orient='none', direction='horizontal', title="Song's Positivity",legendX=450,legendY=-100)),
                size= alt.Size(shorthand='acousticness:Q',scale= alt.Scale(range=[50,550]),
                               legend= alt.Legend(orient='none', direction='horizontal',title='Acousticness',legendX= 450, legendY=-45)),
                color= alt.Color(shorthand='artist_top_ten',legend= alt.Legend(orient='none', title="Top 10 artists", columns=3,legendX=-160, legendY=-100),
                                 scale = alt.Scale(domain=list_top_10_artist + ["Other"],
                                                   range = ['#F94144','#F3722C','#F8961E','#F9844A','#F9C74F','#90BE6D',
                                                            '#43AA8B','#7AB8B6','#8CA5BA','#ADD8EB','#70E5FF']
                                                  )
                                
                                ),
                tooltip = [alt.Tooltip(shorthand='name', title= "Song"), alt.Tooltip(shorthand='artist', title= "Artist"),
                           alt.Tooltip(shorthand='energy', title= "Energy"), alt.Tooltip(shorthand='loudness', title= "Loudness"),
                           alt.Tooltip(shorthand='acousticness', title= "Acousticness"), alt.Tooltip(shorthand='valence', title= "Song's Positivity")]
        )
# Configure plot
chart = chart.properties(width=1080*0.8, height=1080*0.6, 
                         title = alt.TitleParams(text=["Visualizing Spotify: The 5, 4, 3, 2 and 1"],
                                                 anchor="start", offset=30,
                                                 fontSize=45,
                                                 subtitle= ["5 years of music: All the songs I listened to during my time in college (2016 - 2021)",
                                                            "4 Spotify audio features: Energy, Loudness, Acousticness and Song's Positivity",
                                                            "3 encodings: Song's positivity (opacity), Acousticness (size) and whether the artist of the song is part of",
                                                            "the top 10 or not (colour)",
                                                            "2-dimensional representation: Principal Component Analysis (PCA) was used for",
                                                            "Dimensionality Reduction",
                                                            "1 user: Me"," ", "Visualization by Isaac Arroyo (@unisaacarroyov)"], 
                                                 subtitleFontSize= 25, subtitlePadding=10,
                                                 color = 'white', subtitleColor='white'
                            
                                                )
)\
.configure(font='Palatino', background='#001219',)\
.configure_view(stroke=None)\
.configure_legend(titleFontSize=23, labelFontSize=20, titleColor='white', labelColor='white',
                  symbolFillColor='white', symbolStrokeColor='white', symbolStrokeWidth=0.5 )\
.configure_axisY(titleAngle=0, titlePadding=80)\
.configure_axisX(titleX=200)\
.configure_axis(grid=False, ticks=False, labels=False, domain=False, titleFontSize = 30, titleColor='white')

chart = chart.interactive()
chart

In [13]:
import datapane as dp

dp.Report(dp.Plot(chart)).upload(name="Visualizing Spotify: The 5, 4, 3, 2 and 1", visibility= dp.Visibility.PORTFOLIO)

Report successfully uploaded. View and share your report [here](https://datapane.com/u/unisaacarroyov/reports/E7Pwzy3/visualizing-spotify-the-5-4-3-2-and-1/), or edit your report [here](https://datapane.com/u/unisaacarroyov/reports/E7Pwzy3/visualizing-spotify-the-5-4-3-2-and-1/edit/).