# "Spotify Album Data Visualization"
> ""

- toc: true
- branch: master
- badges: true
- comments: true
- author: Hamel Jon Walbrin
- categories: [fastpages, jupyter, spotify, spotipy]

## Overview

This script creates nice data visualizations from the track data (i.e. Spotify audio features) and cover art for a given album, both obtained from the Spotify Web API. Specifically, it uses several packages (spotipy, sklearn, plotly etc.) to:

1. Extract album data from Spotify
2. Perform k-means clustering on album cover image pixels to approximate the 3 most dominant colors, which are then used to set the color properties of the data plots
3. Visualize the data as a track dissimilarity matrix and as polar plots that display the audio features for each track

There are a lot of detailed spotipy tutorials out there already, and so the goal here was to take a slightly more artistic approach and to create some nice data visualizations that can be easily turned into a poster, birthday card etc.

Here I used the 1998 classic 'American Water' by Silver Jews. RIP David Berman.

Note: Because fastpages does not support figure printing, the figures shown here are screenshots (simply uncomment the 'fig.show()' lines at the end of each code section to see them when running the script).

# 1. Extract and format album data

## Basic set-up

To run, you'll need to have already installed the packages shown below, and obtained your client ID and client secret for the Spotify Web API (e.g. see here: https://developer.spotify.com).


In [1]:
import numpy as np
import pandas as pd

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import matplotlib.image as img
import urllib.request
from PIL import Image

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import euclidean_distances
from scipy.cluster.vq import kmeans
from scipy.cluster.vq import vq

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

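# Spotify client credentials flow
client = 'add-your-client-ID'   # placeholder: replace with your client ID
secret = 'add-your-secret-key'  # placeholder: replace with your client secret
spotify = spotipy.Spotify(
    client_credentials_manager=SpotifyClientCredentials(client_id=client, client_secret=secret))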
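
As an alternative to pasting the client ID and secret into the script, they can be kept in environment variables. The sketch below is not part of the original script: the helper `load_spotify_credentials` is hypothetical, and it assumes the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET`, which spotipy's `SpotifyClientCredentials` also looks for when it is created without arguments.

```python
import os

def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) from a mapping of environment variables.

    Hypothetical helper for illustration only; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]

# Example with a stand-in mapping (made-up values, not real credentials)
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
client, secret = load_spotify_credentials(demo_env)
```

The returned pair can then be passed as `client_id` and `client_secret` exactly as in the cell above, which keeps the secret out of a published notebook.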

## Extract and format album data

Load the album data (the artist and album names should be exactly as they appear on Spotify). First, we obtain the unique album ID and use it to get the audio feature data for each album track. We then create abbreviated track names (for tidier plotting later on) and put all of this information in a pandas dataframe.

The script was set up for album cover art with at least 3 distinct colors. Optionally, you can change the 'kClusters' variable to see if this gives a better solution for a particular album cover (but note that the script is set up to always apply the 3 most dominant colors to data plots, regardless of total cluster number). 

In [3]:
## User presets ##
kClusters = 3
artist_name = 'Silver Jews'
album_name = 'American Water'

##

# Search and return album info from spotify
search_string = artist_name.replace(' ', '+') + '+' + album_name.replace(' ', '+')
results = spotify.search(q=search_string, type='album')

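# Get album ID (search result whose name matches album_name exactly)
album_idx = None
for t, item in enumerate(results['albums']['items']):
    if item['name'] == album_name:
        album_idx = t
if album_idx is None:
    # Fail with a clear message instead of an obscure indexing error
    raise ValueError("No album named '%s' found in the search results" % album_name)
album_id = results['albums']['items'][album_idx]['id']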
  
# Get track IDs, initials of track names
album_tracks = spotify.album_tracks(album_id, limit=50, offset=0, market=None)
track_id_list = [item['id'] for item in album_tracks['items']]
# Abbreviate each track name to the upper-cased first character of each word
track_name_initials = ["".join(word[0] for word in item['name'].split()).upper()
                       for item in album_tracks['items']]

# Get track features, create pandas dataframe that includes abbreviated track names
track_feats = spotify.audio_features(track_id_list)
track_name_initials_df = pd.DataFrame(track_name_initials)
track_feats_df = pd.DataFrame(track_feats)
track_feats_df = pd.concat([track_name_initials_df, track_feats_df], axis=1)

# Show track_feats_df
track_feats_df.head()

Unnamed: 0,0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,RR,0.575,0.48,0,-10.517,1,0.0237,0.468,0.000539,0.0727,0.536,105.384,audio_features,6lquLzGE5CRoq2htyr2QGS,spotify:track:6lquLzGE5CRoq2htyr2QGS,https://api.spotify.com/v1/tracks/6lquLzGE5CRo...,https://api.spotify.com/v1/audio-analysis/6lqu...,238333,4
1,S&JF,0.629,0.513,9,-10.643,0,0.0292,0.738,0.113,0.0799,0.465,102.568,audio_features,4DYR2Y7YkUyQSKzn9GUppR,spotify:track:4DYR2Y7YkUyQSKzn9GUppR,https://api.spotify.com/v1/tracks/4DYR2Y7YkUyQ...,https://api.spotify.com/v1/audio-analysis/4DYR...,198467,4
2,NS,0.378,0.928,7,-8.804,1,0.0412,0.0654,0.622,0.228,0.69,105.335,audio_features,5PhkyQqXlkLMXiDniiouT1,spotify:track:5PhkyQqXlkLMXiDniiouT1,https://api.spotify.com/v1/tracks/5PhkyQqXlkLM...,https://api.spotify.com/v1/audio-analysis/5Phk...,137093,4
3,FD,0.552,0.342,6,-11.895,1,0.0331,0.413,0.127,0.0989,0.195,136.85,audio_features,5E4AIOCPNZ8EGg6ebjFCPs,spotify:track:5E4AIOCPNZ8EGg6ebjFCPs,https://api.spotify.com/v1/tracks/5E4AIOCPNZ8E...,https://api.spotify.com/v1/audio-analysis/5E4A...,241440,4
4,P,0.72,0.779,9,-9.232,1,0.0368,0.321,0.0508,0.189,0.508,109.447,audio_features,13PfXqO69Am4goxVchrzej,spotify:track:13PfXqO69Am4goxVchrzej,https://api.spotify.com/v1/tracks/13PfXqO69Am4...,https://api.spotify.com/v1/audio-analysis/13Pf...,283387,4
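
The abbreviation step can be tried on its own, without calling the API. This is a small sketch of the same logic, wrapped in a hypothetical helper named `track_initials`, applied to three track names from the album.

```python
def track_initials(track_name):
    """Abbreviate a track name to the upper-cased first character of each word."""
    return "".join(word[0] for word in track_name.split()).upper()

for name in ["Random Rules", "Smith & Jones Forever", "People"]:
    print(name, "->", track_initials(name))
# Random Rules -> RR
# Smith & Jones Forever -> S&JF
# People -> P
```

Because the name is split on whitespace, a standalone symbol such as '&' counts as a word and is kept in the abbreviation.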


## Subset and scale features of interest

Here we select the audio features of interest (i.e. excluding 'key' and 'mode', as they are not continuous) and min-max scale them (i.e. to values within the range [0, 1]). Scaling is important for the Euclidean distance estimation (below), since features with large raw values (e.g. 'tempo' and 'duration_ms') would otherwise dominate the distances, and it also helps when plotting the polar plots. From this, we create a feature array for subsequent analysis/plotting.

In [4]:
# Specify features of interest, create numpy array 
features_of_interest=['danceability','energy','loudness','speechiness','acousticness','instrumentalness',
               'liveness','valence','tempo','duration_ms']
feat_array = np.array(track_feats_df[features_of_interest])

# Minmax-scale features
scaler = MinMaxScaler()
feat_array = scaler.fit_transform(feat_array)

# Check array dimensions (tracks x features)
feat_array.shape

(12, 10)
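
To see why the scaling matters, here is a minimal sketch with made-up numbers (not the album data). It applies the min-max formula column by column using plain numpy, which gives the same [0, 1] range as `MinMaxScaler` with its default settings, and compares a distance before and after scaling.

```python
import numpy as np

# Made-up values for 3 tracks x 2 features (tempo in BPM, valence in [0, 1])
feats = np.array([[100.0, 0.2],
                  [120.0, 0.5],
                  [140.0, 0.8]])

# Column-wise min-max scaling: (x - min) / (max - min)
col_min = feats.min(axis=0)
col_max = feats.max(axis=0)
scaled = (feats - col_min) / (col_max - col_min)

# Distance between the first and last track, before and after scaling
raw_dist = np.linalg.norm(feats[0] - feats[2])
scaled_dist = np.linalg.norm(scaled[0] - scaled[2])
print(raw_dist, scaled_dist)
```

Before scaling, the distance is almost entirely due to tempo (about 40, with valence contributing next to nothing). After scaling, both features contribute equally. Note that this bare formula divides by zero for a feature that is constant across all tracks, so it is only meant as an illustration.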

## Download album cover image and extract pixel-wise RGB values

Next, album cover image pixel-wise RGB values are stored as a pandas dataframe (N pixels x 3 RGB values).

In [5]:
# Download album cover image
album_cover_url = results['albums']['items'][album_idx]['images'][0]['url'] # images idx 0-2: 640, 300, 64 pixel versions, respectively
urllib.request.urlretrieve(album_cover_url, 'albumNew.jpg')

# Create data frame of pixel-wise RGB values
album_rgb = img.imread('albumNew.jpg')  # (height x width x 3) array
album_rgb = album_rgb.astype(np.double) # double needed for kmeans, below

# Flatten to one row per pixel (N pixels x 3), keeping the image's row-by-row pixel order
album_rgb_df = pd.DataFrame(album_rgb.reshape(-1, 3),
                            columns=['red', 'green', 'blue'])
# Show dataframe header
album_rgb_df.head()

Unnamed: 0,red,green,blue
0,167.0,172.0,178.0
1,172.0,176.0,185.0
2,173.0,177.0,186.0
3,171.0,175.0,184.0
4,172.0,174.0,186.0
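
The pixel table is simply the image array with its two spatial axes flattened into one. A minimal sketch on a made-up 2 x 2 image (assuming 3 color channels, as in a JPEG cover) shows the shape change:

```python
import numpy as np

# A made-up 2 x 2 "image": each pixel holds (red, green, blue)
tiny_img = np.array([[[255, 0, 0], [0, 255, 0]],
                     [[0, 0, 255], [255, 255, 255]]], dtype=np.double)

# Collapse the two spatial axes into one: (height, width, 3) -> (height * width, 3)
pixels = tiny_img.reshape(-1, 3)
print(pixels.shape)  # (4, 3)
```

Pixels are ordered row by row, so `pixels[1]` is the second pixel of the first image row.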


# 2. K-means clustering 

Clustering is performed on the pixel-wise RGB data for the cover image. The centroid RGB values for each of the k clusters are taken as an approximation of the most dominant colors. This works well for images with a few, relatively homogeneous regions of color, and less well for more complex images. This implementation was based on a GeeksForGeeks post, so credit to them for this (https://www.geeksforgeeks.org/extract-dominant-colors-of-an-image-using-python/). The dominant colors are then put into descending order (i.e. the cluster with the most pixels first) and plotted for a quick check of how well the clustering has done. The original album cover is shown too.


In [10]:
# Get k-means centroids and cluster sizes
centroids, _ = kmeans(album_rgb_df[['red',
                                    'green',
                                    'blue']], kClusters)
idx, _ =  vq(album_rgb_df,centroids)
cluster_sizes= np.bincount(idx)

# Get dominant colors (centroid RGB values)
dc = []   
for cluster_center in centroids:
    red, green, blue = cluster_center
    dc.append((
        int(round(red)),
        int(round(green)),
        int(round(blue))))    

# Descending sort cluster info ready for plotting
dc_ord_idx = cluster_sizes.argsort()[::-1]  # cluster indices, largest cluster first
dc = [dc[i] for i in dc_ord_idx]            # reorder centroid colors to match
bar_colors = ["rgb(%i, %i, %i)" % c for c in dc]  # plotly color strings
    
# Plot dominant color bars
fig = go.Figure(go.Bar(y=cluster_sizes[dc_ord_idx],
                marker_color=bar_colors),
               )
fig.update_layout(title_text='Pixel count per dominant color',
                 height = 300,
                 width = 400,
                 )
fig.update_xaxes(tickmode='linear',tick0=1,dtick=1)
# fig.show()

# Show cover image
fig=px.imshow(album_rgb)
fig.layout.xaxis.showticklabels = False
fig.layout.yaxis.showticklabels = False
# fig.show()

![](publishedfigures/PixelCountPlot.PNG)

![](publishedfigures/AlbumCover.PNG)
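
The ordering step is easier to follow on a toy example. The sketch below uses made-up pixels and made-up centroids (no clustering is actually run), assigns each pixel to its nearest centroid with plain numpy (the job that scipy's `vq` does in the cell above), and then sorts the centroid colors by cluster size.

```python
import numpy as np

# Made-up pixels: 5 dark blue, 3 off-white, 1 red
pixels = np.array([[10, 20, 90]] * 5 + [[240, 240, 230]] * 3 + [[200, 30, 30]],
                  dtype=np.double)
# Made-up centroids, deliberately not in size order
centroids = np.array([[200.0, 30.0, 30.0],
                      [12.0, 22.0, 88.0],
                      [238.0, 241.0, 229.0]])

# Nearest-centroid assignment by Euclidean distance
dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Pixel count per cluster, then cluster indices from largest to smallest
sizes = np.bincount(labels, minlength=len(centroids))
order = sizes.argsort()[::-1]

# Plotly-style color strings, most dominant first
colors = ["rgb(%i, %i, %i)" % tuple(int(v) for v in centroids[i].round())
          for i in order]
print(colors)
# ['rgb(12, 22, 88)', 'rgb(238, 241, 229)', 'rgb(200, 30, 30)']
```

Passing `minlength` to `np.bincount` is a small safeguard added here so that a cluster with no pixels still gets a count of zero.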

# 3. Data visualization

## Track-to-track dissimilarity

Here we convert the feature array into a Euclidean distance matrix (a symmetrical N tracks x N tracks matrix) that provides a measure of track-to-track dissimilarity (i.e. each matrix entry shows the dissimilarity for a pair of tracks, based on their audio features). The color map is defined by the 3 most dominant colors from the cover image (high dissimilarity is depicted with the most dominant color, and low dissimilarity with the 3rd most dominant color). Here we visualize with plotly.

In [13]:
# Create dissimilarity matrix (euclidean distance)
eucDist_feat = euclidean_distances(feat_array,feat_array)

# Plot dissimilarity matrix
fig = go.Figure(data=[go.Heatmap(z=eucDist_feat,
                                 colorscale = [
                                    [0, "rgb(%i, %i, %i)" % (dc[2][0],dc[2][1], dc[2][2])],
                                    [0.5, "rgb(%i, %i, %i)" % (dc[1][0],dc[1][1], dc[1][2])],
                                    [1, "rgb(%i, %i, %i)" % (dc[0][0],dc[0][1], dc[0][2])]])],
               )

fig.update_yaxes(autorange="reversed")
fig.layout.xaxis.showticklabels = False

fig.update_layout(title = 'Track dissimilarity',
                  title_x = 0.5,
                  title_xref = "container",
                  autosize=False,
                  width=540,
                  height=500,
                  yaxis = dict(
                               tickmode = "array",
                               ticktext = track_name_initials,
                               tickvals = list(range(0,len(track_name_initials))),     
                               )
                 )
fig.update_traces(colorbar_tick0=-1,
                  colorbar_dtick='L0.5', 
                  colorbar_tickmode='linear',
                  selector=dict(type='heatmap')
                 )

# fig.show()

![](publishedfigures/TrackDissimilarityMatrix.PNG) 
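
For reference, the distance matrix can also be written out by hand. This is a minimal numpy sketch on made-up feature values, not a replacement for sklearn's `euclidean_distances`; it just makes the definition (square root of the summed squared feature differences) and the symmetry of the result explicit.

```python
import numpy as np

# Made-up scaled features for 3 tracks x 2 features
feats = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

# Pairwise differences via broadcasting: shape (3, 3, 2)
diffs = feats[:, None, :] - feats[None, :, :]
# Euclidean distance: square root of the summed squared differences
dist = np.sqrt((diffs ** 2).sum(axis=2))
print(dist.round(3))
```

The diagonal is zero (each track is identical to itself) and `dist[i, j]` equals `dist[j, i]`, which is why the heatmap is mirrored across its diagonal.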

## Track-wise feature plotting

Finally, we plot the audio features for each track as a polar bar plot (i.e. each bar is an audio feature, where bar length shows the scaled value for that feature). The plotting area is constrained to 4 columns, and the minimum number of rows required (to fit a plot for each track) is determined from the number of album tracks. Prior to using the plotly command 'make_subplots', we need to format the 'specs' parameter based on the number of rows/columns required for the plotting area.

In [14]:
# Use most dominant color 
t_color_str = ["rgb(%i, %i, %i)" % (dc[0][0],dc[0][1], dc[0][2])]

# Generate subplot indices based on maximum of 4 columns and N tracks
n_sub_col = 4
n_sub_row = -(-len(track_name_initials) // n_sub_col)
temp_sub_indices = np.indices((n_sub_row,n_sub_col))
sub_indices_row = temp_sub_indices[0].reshape(n_sub_col*n_sub_row)
sub_indices_col = temp_sub_indices[1].reshape(n_sub_col*n_sub_row)

# Make specs variable for below (n_sub_row x n_sub_col grid of barpolar subplots)
specs = [[{'type': 'barpolar'} for _ in range(n_sub_col)]
         for _ in range(n_sub_row)]

# Make sub-plots
fig = make_subplots(
    rows=n_sub_row, cols=n_sub_col,
    specs=specs,         
    subplot_titles = track_name_initials)

# Iteratively add the polar plot for each track
for ti in range(len(track_name_initials)):
    fig.add_trace(
        go.Barpolar(
        r=feat_array[ti],
        theta=features_of_interest,
        marker_color=t_color_str*len(features_of_interest)),  # one color entry per feature bar
        row=sub_indices_row[ti]+1,col=sub_indices_col[ti]+1,
    )
    
# Figure formatting
fig.update_polars(angularaxis_tickvals = [" "]*12,
                  angularaxis_showgrid=False,radialaxis_showgrid=False,
                  radialaxis_showline=False, radialaxis_showticklabels=False,
                  radialaxis_autorange=False,
                  bgcolor = "#FFFFFF",
                 )
fig.update_annotations(font_size=12)
fig.update_layout(showlegend=False,
                  title = 'Feature plots',
                  title_x = 0.5,
                  title_xref = "container",
                 )
# fig.show()

![](publishedfigures/FeaturePlots.PNG)
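
The grid arithmetic used above can be checked in isolation. The sketch below wraps it in a hypothetical helper, `subplot_positions`, which returns the number of rows and the 1-based (row, column) position of each subplot, filled row by row as in the figure.

```python
def subplot_positions(n_plots, n_cols=4):
    """Return the row count and 1-based (row, col) positions, filled row by row."""
    n_rows = -(-n_plots // n_cols)  # ceiling division without importing math
    positions = [(i // n_cols + 1, i % n_cols + 1) for i in range(n_plots)]
    return n_rows, positions

n_rows, positions = subplot_positions(12)
print(n_rows)         # 3
print(positions[:5])  # [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1)]
```

With 13 tracks the same call would return 4 rows, with the last row holding a single plot.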

## Feature plot legend

To keep things tidy, let's plot a separate 'legend' polar plot with the feature labels. 

In [15]:
# Define color as 3rd most dominant (to set apart from track-wise plots)
t_color_str = ["rgb(%i, %i, %i)" % (dc[2][0],dc[2][1], dc[2][2])]

# Create plot and assign features of interest as feature labels
fig = go.Figure(go.Barpolar(
    r= np.array([.5,.25,.5,.25,.5,.25,.5,.25,.5,.25]), # dummy values 
    theta=features_of_interest,
     marker_color=t_color_str*len(features_of_interest),
    ))
fig.update_polars(angularaxis_showgrid=False,radialaxis_showgrid=False,
                  radialaxis_showline=False, radialaxis_showticklabels=False,
                  bgcolor = "#FFFFFF",
                 )
fig.update_layout(autosize=False,
                  width=400,
                  height=400,
                  margin_l=150,
                 )
# fig.show()

![](publishedfigures/PolarPlotLegend.PNG)