
<span class = "myhighlight">Objective.</span> The goal of this project is to implement a k-means clustering algorithm in Python, a technique often used in machine learning, and apply it to data analysis. Along the way, we write functions that use lists, sets, dictionaries, sorting, and graph data structures for computational problem solving and analysis.


In [1]:
import csv
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from operator import index

First, we create a Client Credentials Flow manager, which is used for server-to-server authentication, by passing the necessary parameters to the `SpotifyClientCredentials` class in spotipy's [Spotify OAuth](https://github.com/spotipy-dev/spotipy/blob/master/spotipy/oauth2.py#L261) module. We provide a client ID and client secret to the constructor. This authorization flow does not require user interaction.
    

In [2]:
# Set client id and client secret
# (read from environment variables so the secret is not stored in the notebook)
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']

# Spotify authentication
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

Next, we want the full details of each track in a playlist. The following function takes a track ID and the ID or URI of the track's main artist, and returns the track's metadata, artist information, and audio features.


In [3]:
# Get playlist song features and artist info
def playlistTracks(id, artist_ids):
    meta = sp.track(id)
    features = sp.audio_features(id)
    artist_info = sp.artist(artist_ids)

    # Metadata
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # Main artist name, popularity, genre
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]

    # Track features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    valence = features[0]['valence']
    key = features[0]['key']
    mode = features[0]['mode']
    time_signature = features[0]['time_signature']

    return [name, album, artist, release_date, length, popularity, 
            artist_pop, artist_genres, acousticness, danceability, 
            energy, instrumentalness, liveness, loudness, speechiness, 
            tempo, valence, key, mode, time_signature]

Choose a playlist to analyze by copying its URL from the Spotify player. The following code extracts the playlist ID from that link and uses the `playlist_tracks` method to retrieve the track ID and main artist URI for each track in the playlist.



In [107]:
# Spotify playlist url (earlier candidates kept for reference; only the last link is used)
# playlist_link = "https://open.spotify.com/playlist/4lSykOrQfnAiCgtHKVudTT"
# playlist_link = "https://open.spotify.com/playlist/45wMUm1iuvHyPyzN9Lm9oL?si=f335973792784554"
playlist_link = "https://open.spotify.com/playlist/7JJd5q4ZPK0P1Q4atTcpkR?si=4feffccdf23846a7"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

# Extract song ids and artists from playlist (one API call, reused for both lists)
playlist_items = sp.playlist_tracks(playlist_URI)["items"]
track_ids = [item["track"]["id"] for item in playlist_items]
artist_uris = [item["track"]["artists"][0]["uri"] for item in playlist_items]
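
A single call to `playlist_tracks` returns only one page of results, so a long playlist may need several requests. The sketch below shows one way to parse the playlist ID from a URL and to follow the `next` links with spotipy's `next` method. It assumes `sp` is the authenticated client created earlier; the helper names `playlist_id_from_url` and `all_playlist_items` are illustrative and not part of spotipy.

```python
from urllib.parse import urlparse


def playlist_id_from_url(url):
    """Return the last path segment of a playlist URL, ignoring any query string."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def all_playlist_items(sp, playlist_id):
    """Collect the items from every page of a playlist.

    Assumes `sp` behaves like a spotipy.Spotify client: `playlist_tracks`
    returns a dict with "items" and "next" keys, and `next` fetches the
    following page of a paged result.
    """
    page = sp.playlist_tracks(playlist_id)
    items = list(page["items"])
    while page.get("next"):
        page = sp.next(page)
        items.extend(page["items"])
    return items
```

With these helpers, `all_playlist_items(sp, playlist_id_from_url(playlist_link))` could stand in for the single `sp.playlist_tracks(...)["items"]` lookup.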

The following code loops through each track ID in the playlist and extracts additional song information by calling the function we created above. From there, we can create a pandas data frame by passing in the extracted information and giving the column header names we want. 

In [108]:
# Loop over track ids and their main artists
tracks = []
for track_id, artist_uri in zip(track_ids, artist_uris):
    time.sleep(.5)  # brief pause between API requests
    tracks.append(playlistTracks(track_id, artist_uri))

In [109]:
# Create dataframe
df = pd.DataFrame(
    tracks, columns=['name', 'album', 'artist', 'release_date',
                     'length', 'popularity', 'artist_pop', 'artist_genres',
                     'acousticness', 'danceability', 'energy',
                     'instrumentalness', 'liveness', 'loudness',
                     'speechiness', 'tempo', 'valence', 'key', 'mode',
                     'time_signature'])

# Save to csv file
df.to_csv("spotify.csv", sep=',')

------------------------------------------------------

### The Data

In [65]:
# df = pd.read_csv('playlists.csv', encoding_errors='ignore', index_col=0, header=0)
# df_copy = df.copy(deep = True)
# df_copy = df_copy.drop(df2[df2.playlist_name != 'but my feet in bottega'].index)
df.head(2)

Unnamed: 0,name,album,artist,release_date,length,popularity,artist_pop,artist_genres,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,key,mode,time_signature
0,Sweet Dream,Sweet Dream,Alessia Cara,2021-07-15,181881,49,76,"[canadian contemporary r&b, canadian pop, danc...",0.252,0.76,0.532,0.0,0.252,-9.177,0.127,123.064,0.5,0,1,4
1,Worry No More,California,Diplo,2018-03-23,202563,46,79,"[edm, electro house, house, moombahton, ninja,...",0.0185,0.592,0.731,0.0,0.726,-5.794,0.0404,166.028,0.392,7,0,4






[**Metadata.**](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track)
- `name`: The name of the track.
- `album`: The name of the album on which the track appears.
- `artist`: The name of the artist who performed the track.
- `release_date`: The date the album was first released.
- `length`: The track length in milliseconds.
- `popularity`: The popularity of the track. Values are between 0 and 100. The popularity is calculated by an algorithm based on the total number of plays the track has had and how recent those plays are.


[**Artists.**](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artist)
- `artist_pop`: The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist's popularity is calculated from the popularity of all the artist's tracks.
- `artist_genres`: A list of the genres the artist is associated with.


[**Audio Features.** ](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features)
- `acousticness`: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
- `danceability`: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm, beat strength, and regularity.
- `energy`: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity.
- `instrumentalness`: Predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the track contains no vocal content.
- `liveness`: Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
- `loudness`: The overall loudness of a track in decibels (dB). Values are averaged across the entire track and typically range between -60 and 0 dB.
- `speechiness`: Detects the presence of spoken words in a track. The more speech-like the recording, the closer to 1.0.
- `tempo`: The overall estimated speed or pace of a track in beats per minute (BPM).
- `valence`: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric).
- `key`: The key the track is in. If no key was detected, the value is -1.
- `mode`: The modality (major or minor) of a track. Major is represented by 1 and minor is 0.
- `time_signature`: An estimated time signature (how many beats are in each measure), ranging from 3 to 7 to indicate time signatures of "3/4" to "7/4".



How many songs do we have?

In [66]:
# Number of rows and columns
rows, cols = df.shape
print(f'Number of songs: {rows}')
print(f'Number of attributes per song: {cols}')

Number of songs: 100
Number of attributes per song: 20


In [67]:
# Build an "artist - name" label for a song
def getMusicName(elem):
    return f"{elem['artist']} - {elem['name']}"

# Select song and get track info
anySong = df.loc[15]
anySongName = getMusicName(anySong)
print('name:', anySongName)

name: filous - Already Gone (feat. Emily Warren)


-----------------------

### Spotify Songs - Similarity Search




Below, we create a query to retrieve similar elements based on Euclidean distance. In mathematics, the Euclidean distance between two points is the length of the line segment between the two points. In this sense, the closer the distance is to 0, the more similar the songs are.
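
For two points $p$ and $q$ with $n$ features each, the distance is $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$. As a minimal sketch, the function below computes it for two plain sequences. The name `euclidean_distance` is illustrative, and the sample values are the danceability, energy, and valence of the two rows shown in the data preview.

```python
import math


def euclidean_distance(p, q):
    """Length of the straight line between points p and q (equal-length sequences)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# (danceability, energy, valence) for the two previewed songs
song_a = (0.76, 0.532, 0.5)
song_b = (0.592, 0.731, 0.392)

print(euclidean_distance(song_a, song_a))  # 0.0: a song is identical to itself
print(euclidean_distance(song_a, song_b))  # a small positive distance
```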



#### [KNN Algorithm](https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook)


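The k-Nearest Neighbors (KNN) algorithm searches for the $k$ elements closest to a query point. Unlike a range query, it does not need a predefined radius: the search returns exactly $k$ neighbors, however far away they are. The function below also returns the $k$ most distant elements, which we use to find dissimilar songs.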



In [68]:
# K-query
def knnQuery(queryPoint, arrCharactPoints, k):

    # Copy of dataframe indices and data
    tmp = arrCharactPoints.copy(deep = True)
    query_vals = queryPoint.tolist()
    dist_vals = []

    # Iterate through each row and select
    for index, row in tmp.iterrows():
        feature_vals = row.values.tolist()
        sum_diff_sqr = sum(
            abs(feature_vals[i] - query_vals[i]) ** 2 for i in range(len(query_vals)))
        
        euc_dist = sum_diff_sqr ** 0.5
        dist_vals.append(euc_dist)

    tmp['distance'] = dist_vals
    tmp = tmp.sort_values('distance')

    return tmp.head(k).index, tmp.tail(k).index
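
The same query can be sketched without pandas, using only a dictionary and a heap-based selection. This is an illustrative alternative under stated assumptions, not the notebook's implementation: `knn_query` and the sample points are hypothetical, and the points are assumed to be stored as a dict of key to feature tuple with the query point already excluded.

```python
import heapq
import math


def knn_query(query_point, points, k):
    """Return (k nearest keys, k farthest keys) for a dict of key -> feature tuple."""
    distances = {key: math.dist(query_point, features)
                 for key, features in points.items()}
    nearest = heapq.nsmallest(k, distances, key=distances.get)
    farthest = heapq.nlargest(k, distances, key=distances.get)
    return nearest, farthest


# Distances from the origin are 1, 2, 5, and 10 respectively
points = {"b": (1.0, 0.0), "c": (0.0, 2.0), "d": (3.0, 4.0), "e": (6.0, 8.0)}
nearest, farthest = knn_query((0.0, 0.0), points, k=2)
print(nearest)   # ['b', 'c']
print(farthest)  # ['e', 'd']
```

One difference from `knnQuery`: here the farthest keys are listed most distant first, whereas `tmp.tail(k)` keeps ascending order.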

In [69]:
# Execute KNN removing the query point
def querySimilars(df, columns, idx, func, param):
    arr = df[columns].copy(deep = True)
    queryPoint = arr.loc[idx]
    arr = arr.drop([idx])
    return func(queryPoint, arr, param)

**KNN Query Example.** 

Our function lets us build personalized query points and change the columns to explore other options. For example, restricting the columns to danceability, energy, and valence and using a query point with danceability = 1, energy = 1, and valence = 1 would return the $k$ songs closest to the maximum of all three attributes.

Let's search for the $k=5$ songs most similar to the query point songIndex = 1, comparing acousticness, danceability, energy, speechiness, valence, and tempo.


In [210]:
# Selecting song and attributes
songIndex = 1 # query point, selected song
columns = ['acousticness', 'danceability', 'energy', 'speechiness', 'valence','tempo']

# Selecting query parameters
func, param = knnQuery, 5

# Querying
response = querySimilars(df, columns, songIndex, func, param)

In [211]:
similar_songs = {}
nonsimilar_songs = {}

for song_index in df.index:
    
    # Select query point and get song name 
    query_song = df.loc[song_index]
    query_song_name = getMusicName(query_song)

    # Querying
    response = querySimilars(df, columns, song_index, func, param)
    similar_ids = response[0]
    nonsimilar_ids = response[1]

    for idx in similar_ids:
        song_name = getMusicName(df.loc[idx])
        if song_name in similar_songs:
            similar_songs[song_name] += 1
        else:
            similar_songs[song_name] = 1

    for idx in nonsimilar_ids:
        song_name = getMusicName(df.loc[idx])
        if song_name in nonsimilar_songs:
            nonsimilar_songs[song_name] += 1
        else:
            nonsimilar_songs[song_name] = 1
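
The tallying pattern used above (increment if the key exists, otherwise start at 1) is what `collections.Counter` provides out of the box. A minimal sketch, assuming the neighbor names have already been collected into lists; `count_appearances` and the sample names are hypothetical:

```python
from collections import Counter


def count_appearances(neighbour_lists):
    """Count how often each name appears across several neighbour lists."""
    counts = Counter()
    for names in neighbour_lists:
        counts.update(names)
    return counts


neighbour_lists = [["song A", "song B"], ["song B", "song C"], ["song B", "song A"]]
counts = count_appearances(neighbour_lists)
print(counts.most_common(2))  # [('song B', 3), ('song A', 2)]
```

`most_common` also returns the entries sorted by count, which would replace the separate `sorted(...)` step when printing the results.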

NON SIMILAR SONG COUNT:

In [215]:
nonsimilar_songs = dict(
    sorted(nonsimilar_songs.items(), key=lambda item: item[1], reverse=True))
print('NON SIMILAR SONG COUNT:')
for song, song_count in nonsimilar_songs.items():
    if song_count >= 8:
        print(song, ':', song_count)

NON SIMILAR SONG COUNT:
Landon Cube - Eighties : 57
K.Flay - Giver : 53
Lana Del Rey - Radio : 53
Big Gigantic - You’re The One : 51
NITTI - All In (feat. Jimmy Levy) : 50
Cautious Clay - Cold War : 50
Post Malone - I Cannot Be (A Sadder Song) (with Gunna) : 49
Julia Michaels - Jump (with Trippie Redd) : 47
Arizona Zervas - HOLY TRINITY (feat. Rich The Kid) : 47
Lxst - Talk To Me : 43


SIMILAR SONG COUNT:

In [216]:
similar_songs = dict(sorted(similar_songs.items(), key=lambda item: item[1], reverse=True))
print('SIMILAR SONG COUNT')
for song, song_count in similar_songs.items():
    if song_count >= 8:
        print(song, ':', song_count)

SIMILAR SONG COUNT
Felix Cartal - Stop Being Yourself : 9
K.Flay - Good News : 9
filous - Already Gone (feat. Emily Warren) : 8
Juice WRLD - Rockstar In His Prime : 8
Dan Talevski - If I Ain't Got You : 8
24kGoldn - In My Head (feat. Travis Barker) : 8
44phantom - freak : 8


-------------------------------------

In [126]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

In [127]:
#for i in df['artist_genres']:

def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [136]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(df['artist_genres'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]  # get_feature_names() was removed in scikit-learn 1.2
#genre_df.drop(columns='genre|unknown') # Drop unknown genre
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]



genre|acoustic       0.0
genre|alt            0.0
genre|alternative    0.0
genre|art            0.0
genre|atl            0.0
                    ... 
genre|uk             0.0
genre|underground    0.0
genre|vapor          0.0
genre|viral          0.0
genre|wave           0.0
Name: 0, Length: 93, dtype: float64
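
Note that the columns are individual words (`alt`, `art`, `uk`) rather than whole genre names, because the joined genre strings are split into word tokens. The sketch below reproduces the weighting by hand, assuming scikit-learn's documented `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, smoothed IDF $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and L2 normalization of each row. The function name `tfidf_rows` and the sample genre strings are illustrative.

```python
import math
import re
from collections import Counter


def tfidf_rows(documents):
    """Return one {token: weight} dict per document, L2-normalised."""
    # Lowercase and keep tokens of at least two word characters
    token_lists = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in token_lists:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["canadian pop dance pop", "edm electro house", "pop r&b"])
print(sorted(rows[2]))  # ['pop']: the one-letter tokens 'r' and 'b' are dropped
```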

In [137]:
# Normalization
pop = df[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.758065
1,0.806452
2,0.080645
3,0.419355
4,0.806452
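
Min-max scaling maps each value linearly onto the range 0 to 1 using $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, which is what `MinMaxScaler` computes with its default feature range. A minimal pure-Python sketch follows; the function name `min_max_scale` and the sample values are illustrative and not taken from the dataset.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([20, 50, 80]))  # [0.0, 0.5, 1.0]
```

The same rescaling could be applied to a column such as `tempo`, which is measured in BPM and would otherwise outweigh the 0-to-1 features in a Euclidean distance.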


------------------------------------------

#### Neo4j Visuals

In [217]:
# %%cmd
# python example.py