# Spotify Playlist Analysis


<span class = "myhighlight">Objective.</span> The goal of this project is to implement in Python a k-means clustering algorithm, a technique often used in machine learning, and to apply it to data analysis. Along the way, we write functions that use lists, sets, dictionaries, sorting, and graph data structures for computational problem solving and analysis.


In [89]:
import csv
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from operator import index

First, we create a Client Credentials Flow Manager used in server-to-server authentication by passing the necessary parameters to the [Spotify OAuth](https://github.com/spotipy-dev/spotipy/blob/master/spotipy/oauth2.py#L261) class. We provide a client id and client secret to the constructor of this authorization flow, which does not require user interaction.
    

In [90]:
# Set client id and client secret
# (read from environment variables so the credentials are not stored in the notebook)
import os
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']

# Spotify authentication
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

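Now, we want to get the full details of the tracks of a playlist, which can be identified by a playlist ID, URI, or URL. The following function handles one song at a time: it takes a track ID together with the ID or URI of the track's main artist and collects the track metadata, artist information, and audio features for that song.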


In [91]:
# Get playlist song features and artist info
def playlistTracks(id, artist_ids):
    meta = sp.track(id)
    features = sp.audio_features(id)
    artist_info = sp.artist(artist_ids)

    # Metadata
    name = meta['name']
    track_id = meta['id']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    artist_id = meta['album']['artists'][0]['id']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # Main artist name, popularity, genre
    artist_pop = artist_info["popularity"]
    artist_genres = artist_info["genres"]

    # Track features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    valence = features[0]['valence']
    key = features[0]['key']
    mode = features[0]['mode']
    time_signature = features[0]['time_signature']

    return [name, track_id, album, artist, artist_id, release_date, length, popularity, 
            artist_pop, artist_genres, acousticness, danceability, 
            energy, instrumentalness, liveness, loudness, speechiness, 
            tempo, valence, key, mode, time_signature]

Choose a specific playlist to analyze by copying the URL from the Spotify Player interface. Using that link, the following code uses the playlist_tracks method to retrieve a list of IDs and corresponding artists for each track from the playlist. 



In [92]:
# Spotify playlist url (an earlier candidate is kept for reference but not used)
# playlist_link = "https://open.spotify.com/playlist/4lSykOrQfnAiCgtHKVudTT"
playlist_link = "https://open.spotify.com/playlist/1nvpVNmzL7Vi1pXcQEiaLx?si=a62187de23924f4c"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

# Extract song ids and artists from playlist (one API call, reused for both lists)
items = sp.playlist_tracks(playlist_URI)["items"]
track_ids = [item["track"]["id"] for item in items]
artist_uris = [item["track"]["artists"][0]["uri"] for item in items]
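`playlist_tracks` returns one page of results per call (up to 100 items by default), so a longer playlist would be cut off by the cell above. Here is a minimal sketch of collecting every page, assuming a client that exposes `playlist_tracks` and `next` the way spotipy's `Spotify` class does. The helper name `all_playlist_items` and the `FakeClient` stand-in are hypothetical and exist only so that the example runs offline.

```python
def all_playlist_items(client, playlist_id):
    """Collect the items from every page of a playlist (hypothetical helper)."""
    page = client.playlist_tracks(playlist_id)
    items = list(page["items"])
    # A paged result carries a "next" reference until the last page is reached
    while page.get("next"):
        page = client.next(page)
        items.extend(page["items"])
    return items


class FakeClient:
    """Offline stand-in that mimics the two client calls used above."""

    def __init__(self, pages):
        self.pages = pages

    def playlist_tracks(self, playlist_id):
        return self.pages[0]

    def next(self, page):
        return self.pages[page["next"]]


# Two fake pages; here "next" holds the index of the following page, or None
pages = [
    {"items": [{"track": {"id": "a"}}, {"track": {"id": "b"}}], "next": 1},
    {"items": [{"track": {"id": "c"}}], "next": None},
]
items = all_playlist_items(FakeClient(pages), "some_playlist_id")
track_ids_demo = [item["track"]["id"] for item in items]
print(track_ids_demo)
```

With a real client, the same helper would be called as `all_playlist_items(sp, playlist_URI)`.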

The following code loops through each track ID in the playlist and extracts additional song information by calling the function we created above. From there, we can create a pandas data frame by passing in the extracted information and giving the column header names we want. 

In [93]:
# Loop over track ids
tracks = []
for i in range(len(track_ids)):
    time.sleep(.5)
    track = playlistTracks(track_ids[i], artist_uris[i])
    tracks.append(track)
    
# Save the playlist name for each song
playlist_name = sp.user_playlist(user = None, playlist_id = playlist_URI, fields="name")
for x in tracks:
    x.append(playlist_name['name'])

In [94]:
# Create dataframe
df = pd.DataFrame(
    tracks, columns=['name', 'track_id', 'album', 'artist', 'artist_id','release_date',
                     'length', 'popularity', 'artist_pop', 'artist_genres',
                     'acousticness', 'danceability', 'energy',
                     'instrumentalness', 'liveness', 'loudness',
                     'speechiness', 'tempo', 'valence', 'key', 'mode',
                     'time_signature', 'playlist'])
# Save to csv file
df.to_csv("spotify.csv", sep=',')

------------------------------------------------------

### The Data






[**Metadata.**](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track)
- `name`: The name of the track.
- `album`: The name of the album on which the track appears.
- `artist`: The name of the artist who performed the track.
- `release_date`: The date the album was first released.
- `length`: The track length in milliseconds.
- `popularity`: The popularity of the track. Values are between 0 and 100. The popularity is calculated by an algorithm based on the total number of plays the track has had and how recent those plays are.


[**Artists.**](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artist)
- `artist_pop`: The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist's popularity is calculated from the popularity of all the artist's tracks.
- `artist_genres`: A list of the genres the artist is associated with.


[**Audio Features.** ](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features)
- `acousticness`: A confidence measure from 0.0 to 1.0 of whether the track is acoustic.
- `danceability`: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm, beat strength, and regularity.
- `energy`: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity.
- `instrumentalness`: Predicts whether a track contains no vocals. The closer the value is to 1.0, the more likely the track contains no vocal content.
- `liveness`: Detects the presence of an audience in the recording. Higher values represent an increased probability that the track was performed live.
- `loudness`: The overall loudness of a track in decibels (dB). Values are averaged across the entire track and typically range between -60 and 0 dB.
- `speechiness`: Detects the presence of spoken words in a track. The more speech-like the recording, the closer to 1.0.
- `tempo`: The overall estimated speed or pace of a track in beats per minute (BPM).
- `valence`: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric).
- `key`: The key the track is in. If no key was detected, the value is -1.
- `mode`: The modality (major or minor) of a track. Major is represented by 1 and minor is 0.
- `time_signature`: An estimated time signature (how many beats are in each measure), ranging from 3 to 7 to indicate time signatures from "3/4" to "7/4".



In [95]:
# df = pd.read_csv('playlists.csv', encoding_errors='ignore', index_col=0, header=0)
df.head(2)

|   | name | track_id | album | artist | artist_id | release_date | length | popularity | artist_pop | artist_genres | ... | instrumentalness | liveness | loudness | speechiness | tempo | valence | key | mode | time_signature | playlist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 AM | 3g3RCV5ImXwzHpKwM2iunc | Pure Infinity | SwaVay | 29gIYsdyccGoUc6qgkZeTK | 2019-05-24 | 198577 | 55 | 51 | [atl hip hop, indie hip hop, underground hip hop] | ... | 9.8e-05 | 0.362 | -12.353 | 0.0727 | 126.799 | 0.184 | 7 | 1 | 4 | but my feet in bottega |
| 1 | Golden Child | 04QWC97Dvd9g0IEDoyUDBX | Lady Wrangler | Shaboozey | 3y2cIKLjiOlp1Np37WiUdH | 2018-10-05 | 177773 | 46 | 56 | [pop rap] | ... | 2e-06 | 0.36 | -8.848 | 0.29 | 151.029 | 0.365 | 0 | 1 | 4 | but my feet in bottega |


How many songs do we have?

In [96]:
# Number of rows and columns
rows, cols = df.shape
print(f'Number of songs: {rows}')
print(f'Number of attributes per song: {cols}')

Number of songs: 100
Number of attributes per song: 23


In [97]:
# Build an 'artist - name' label for a song
def getMusicName(elem):
    return f"{elem['artist']} - {elem['name']}"

# Select song and get track info
anySong = df.loc[15]
anySongName = getMusicName(anySong)
print('name:', anySongName)

name: Lil Yachty - Yacht Club (feat. Juice WRLD)


-----------------------

## Spotify Songs - Similarity Search




Below, we create a query to retrieve similar elements based on Euclidean distance. In mathematics, the Euclidean distance between two points is the length of the line segment between the two points. In this sense, the closer the distance is to 0, the more similar the songs are.
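As a small worked example of the distance itself, the sketch below computes the Euclidean distance between made-up feature vectors. The function name and the numbers are illustrative only and do not come from the playlist data.

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up (danceability, energy, valence) values for three songs
song_a = [0.8, 0.6, 0.5]
song_b = [0.8, 0.6, 0.5]
song_c = [0.5, 0.2, 0.5]

print(euclidean_distance(song_a, song_b))            # identical features -> 0.0
print(round(euclidean_distance(song_a, song_c), 3))  # sqrt(0.3^2 + 0.4^2) = 0.5
```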



#### [KNN Algorithm](https://www.kaggle.com/code/leomauro/spotify-songs-similarity-search/notebook)


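The k-Nearest Neighbors (KNN) algorithm searches for the $k$ elements closest to a query point. No radius is fixed in advance; the distance to the $k$-th nearest element determines how far the search reaches. The function below also returns the $k$ furthest elements for contrast.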



In [98]:
def knnQuery(queryPoint, arrCharactPoints, k):
    queryVals = queryPoint.tolist()
    distVals = []
    
    # Copy of dataframe indices and data
    tmp = arrCharactPoints.copy(deep = True)  
    for index, row in tmp.iterrows():
        feat = row.values.tolist()
        
        # Calculate sum of squared differences
        ssd = sum(abs(feat[i] - queryVals[i]) ** 2 for i in range(len(queryVals)))
        
        # Get euclidean distance
        distVals.append(ssd ** 0.5)
        
    tmp['distance'] = distVals
    tmp = tmp.sort_values('distance')
    
    # K closest and furthest points
    return tmp.head(k).index, tmp.tail(k).index

In [99]:
# Execute KNN removing the query point
def querySimilars(df, columns, idx, func, param):
    arr = df[columns].copy(deep = True)
    queryPoint = arr.loc[idx]
    arr = arr.drop([idx])
    return func(queryPoint, arr, param)
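One caveat when choosing columns for this query: `tempo` is measured in BPM (often above 100), while most other audio features lie between 0 and 1, so an unscaled Euclidean distance can be driven largely by tempo. The sketch below, in plain Python with made-up numbers, runs the same kind of nearest-neighbour search before and after min-max scaling each column. The function names are illustrative and not part of the notebook.

```python
import math

def min_max_scale(rows):
    """Rescale each column of a list of rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]

def nearest(rows, query_idx, k):
    """Indices of the k rows closest to rows[query_idx], excluding the query."""
    q = rows[query_idx]
    dists = [(math.dist(q, r), i) for i, r in enumerate(rows) if i != query_idx]
    return [i for _, i in sorted(dists)[:k]]

# Made-up (energy, tempo) pairs
songs = [
    [0.90, 120.0],  # 0: query
    [0.10, 121.0],  # 1: very different energy, almost the same tempo
    [0.88, 128.0],  # 2: almost the same energy, slightly different tempo
    [0.50, 180.0],  # 3
]

print(nearest(songs, 0, 1))                 # raw values: tempo dominates -> [1]
print(nearest(min_max_scale(songs), 0, 1))  # scaled values -> [2]
```

Whether scaling is appropriate depends on how much weight tempo should carry in the comparison; the point is only that the choice affects which songs count as similar.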

**KNN Query Example.** 

Our function allows us to create personalized query points and modify the columns to explore other options. For example, the following code selects a specific set of song attributes and then searches for the $k$ songs closest to the query point in that attribute space, along with the $k$ songs furthest from it.

Let's search for $k=3$ similar songs to a query point $\textrm{songIndex} = 6$.

In [100]:
# Select song and column attributes
songIndex = 6 # query point
columns = ['acousticness', 'danceability', 'energy', 'speechiness', 'valence','tempo']

# Set query parameters
func, param = knnQuery, 3

# Implement query
response = querySimilars(df, columns, songIndex, func, param)

print("---- Query Point ----")
print(getMusicName(df.loc[songIndex]))
print('---- k = 3 similar songs ----')
for track_id in response[0]:
    track_name = getMusicName(df.loc[track_id])
    print(track_name)
print('---- k = 3 nonsimilar songs ----')
for track_id in response[1]:
    track_name = getMusicName(df.loc[track_id])
    print(track_name)

---- Query Point ----
YG - Sober (feat. Roddy Ricch & Post Malone)
---- k = 3 similar songs ----
Future - WAIT FOR U (feat. Drake & Tems)
Juice WRLD - Fast
The Game - Eazy
---- k = 3 nonsimilar songs ----
Fresco Trey - Key To My Heart
6LACK - Switch
Post Malone - Candy Paint


The code below implements the same idea as above, but queries each track in a given playlist instead of a single defined query point.

In [101]:
similar = {} # Similar songs count
nonsimilar = {} # Non-similar songs count

for trackID in df.index:
    response = querySimilars(df, columns, trackID, func, param)

    # Get similar song ids and info
    for similarID in response[0]:
        track = getMusicName(df.loc[similarID])
        if track in similar:
            similar[track] += 1
        else:
            similar[track] = 1

    # Get non-similar song ids and info
    for nonsimilarID in response[1]:
        track = getMusicName(df.loc[nonsimilarID])
        if track in nonsimilar:
            nonsimilar[track] += 1
        else:
            nonsimilar[track] = 1
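The two tallying branches above follow the standard counting pattern, which `collections.Counter` from the standard library expresses more compactly. Here is a small sketch with made-up neighbour lists standing in for the `querySimilars` responses.

```python
from collections import Counter

# Made-up query results: for each query song, the names of its nearest songs
responses = [
    ["Song A", "Song B", "Song C"],
    ["Song A", "Song C", "Song D"],
    ["Song A", "Song B", "Song D"],
]

similar_counts = Counter()
for neighbours in responses:
    similar_counts.update(neighbours)  # adds 1 for every name in the list

# most_common() returns (name, count) pairs sorted by descending count
for song, count in similar_counts.most_common(2):
    print(song, ':', count)
```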

NON SIMILAR SONG COUNT:

In [102]:
nonsimilar = dict(sorted(nonsimilar.items(), key=lambda item: item[1], reverse=True))
print('---- NON SIMILAR SONG COUNTS ----')
for song, songCount in nonsimilar.items():
    if songCount >= 8:
        print(song, ':', songCount)

---- NON SIMILAR SONG COUNTS ----
Post Malone - Internet : 55
AG Club - Memphis : 54
Post Malone - Candy Paint : 50
Boslen - DENY (feat. Tyla Yaweh) : 50
6LACK - Switch : 46
Fresco Trey - Key To My Heart : 45


SIMILAR SONG COUNT:

In [103]:
similar = dict(sorted(similar.items(), key=lambda item: item[1], reverse=True))
print('---- SIMILAR SONG COUNTS ----')
for song, song_count in similar.items():
    if song_count >= 8:
        print(song, ':', song_count)

---- SIMILAR SONG COUNTS ----
Lil Mosey - Noticed : 8
Post Malone - Paranoid : 8


-------------------------------------

##  Spotify Song Recommendation System

We implement a content-based filtering approach to Spotify song recommendation, following a [Medium article series](https://towardsdatascience.com/part-iii-building-a-song-recommendation-system-with-spotify-cf76b52705e7) on building a Spotify song recommendation system.

In [104]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

First, we want to check for song duplicates in the playlist. The following code uses the `drop_duplicates` function in **pandas** to drop duplicate songs while building an underlying dataframe with all unique content.

In [105]:
# Duplicates of songs across playlists
playlistDF = df.copy(deep = True)
playlistDF[['artist','name','playlist']].head(3)

|   | artist | name | playlist |
|---|---|---|---|
| 0 | SwaVay | 2 AM | but my feet in bottega |
| 1 | Shaboozey | Golden Child | but my feet in bottega |
| 2 | The Game | Eazy | but my feet in bottega |


In [109]:
# Drop song duplicates
def drop_duplicates(df):
    df['artists_song']=df.apply(lambda row: row['artist']+' - '+row['name'],axis=1)
    return df.drop_duplicates('artists_song')

songDF = drop_duplicates(playlistDF)
print(len(pd.unique(songDF.artists_song)) == len(songDF))

True


For the audio features, we can categorize each attribute into four general categories as follows.

- **Mood**: Danceability, Energy, Tempo, Valence
- **Properties**: Instrumentalness, Loudness, Speechiness
- **Context**: Acousticness, Liveness
- **Metadata**: key, mode, time_signature

    

In [110]:
songDF = songDF[[
    'name', 'track_id', 'release_date', 'popularity', # Track Metadata
    'artist', 'artist_id', 'artist_pop', 'artist_genres', # Artist Info
    'danceability', 'energy', 'valence', 'tempo', # Audio Features - Mood
    'instrumentalness', 'loudness', 'speechiness', # Audio Features - Properties
    'acousticness', 'liveness', # Audio Features - Context
    'key', 'mode', 'time_signature' # Audio Features - Metadata
]]

### Feature Generation


Feature engineering is an integral part of recommender systems. We build the following steps into the feature generation pipeline.

#### 1. Sentiment Analysis

The following code performs a simple sentiment analysis using the subjectivity and polarity scores from the TextBlob package. Subjectivity, on a scale from 0 to 1, measures how much of the text is personal opinion rather than factual information (0 is fully objective, 1 is fully subjective). Polarity, on a scale from -1 to 1, measures how negative or positive the sentiment is.

In [111]:
# Get subjectivity & polarity using textblob
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

# Categorize polarity & subjectivity score
def getAnalysis(score, task="polarity"):
    if task == "subjectivity":
        # Note: both comparisons use 1/3, so "medium" is returned
        # only when the score is exactly 1/3
        if score < 1/3:
            return "low"
        elif score > 1/3:
            return "high"
        else:
            return "medium"
    else:
        if score < 0:
            return 'Negative'
        elif score == 0:
            return 'Neutral'
        else:
            return 'Positive'

# Perform sentiment analysis on text
def sentiment_analysis(df, text_col):
    df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
    df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
    return df

In [112]:
sentimentDF = sentiment_analysis(songDF, "name")
sentimentDF[['name', 'artist', 'subjectivity', 'polarity']].head(3)

|   | name | artist | subjectivity | polarity |
|---|---|---|---|---|
| 0 | 2 AM | SwaVay | low | Neutral |
| 1 | Golden Child | Shaboozey | high | Positive |
| 2 | Eazy | The Game | low | Neutral |


#### 2. One-Hot Encoding

We now use one-hot encoding to include the sentiment of a song as input. One-hot encoding converts a categorical variable into a numeric form that a model can work with: each category becomes its own column, which holds 1 (True) for rows in that category and 0 (False) otherwise.


![](https://iq.opengenus.org/content/images/2022/01/TW5m0aJ.png)

In [113]:
# Create One Hot Encoded features of a specific column
def ohe_prep(df, column, new_name):
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    
    return tf_df # One-hot encoded features 

In [114]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe_prep(sentimentDF, 'subjectivity','subject')
subject_ohe.iloc[0]

subject|high    0
subject|low     1
Name: 0, dtype: uint8
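To make the encoding concrete, here is a plain-Python sketch of what one-hot encoding does to a categorical column; `pd.get_dummies` performs the equivalent operation on a DataFrame column. The function name `one_hot` and the sample labels are illustrative.

```python
def one_hot(values, prefix):
    """Map each categorical value to a row of 0/1 indicator columns."""
    categories = sorted(set(values))
    columns = [f"{prefix}|{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

columns, rows = one_hot(["low", "high", "low"], "subject")
print(columns)  # ['subject|high', 'subject|low']
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column of the category that row belongs to.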

#### 3. TF-IDF



Spotify's genres are imbalanced, with some more prevalent than others. Therefore, we weigh the importance of each genre to prevent overemphasizing some types and underestimating others. 

The Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a document within a corpus: $\text{TF-IDF} = \text{Term Frequency}\times\text{Inverse Document Frequency}$.

The term frequency (TF) is the number of times a term appears in a document divided by that document's total word count. The inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents that contain the term, so terms that appear in few documents receive a higher weight. Here, the genre list of each song's artist is treated as one document.
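A sketch of the computation on a toy corpus, using the textbook definition $\text{idf} = \log(N / \text{df})$, where $N$ is the number of documents and $\text{df}$ is the number of documents containing the term. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers differ from this simplified version. The toy genre lists are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each "document" is one genre list split into words
docs = [
    "atl hip hop".split(),
    "pop rap".split(),
    "indie hip hop".split(),
]

def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
# "atl" appears in one document, "hip" in two, so "atl" gets the larger weight
print(weights[0])
```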


In [115]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songDF['artist_genres'].apply(lambda x: " ".join(x)))

# Genres dataframe
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]  # get_feature_names() was removed in scikit-learn 1.2
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]



genre|afrofuturism    0.000000
genre|alternative     0.000000
genre|atl             0.303252
genre|australian      0.000000
genre|bass            0.000000
                        ...   
genre|vapor           0.000000
genre|viral           0.000000
genre|virginia        0.000000
genre|west            0.000000
genre|york            0.000000
Name: 0, Length: 68, dtype: float64

#### 4. Normalization


We need to normalize the popularity variable and the audio features to the range 0 to 1. We use `MinMaxScaler` from scikit-learn, which linearly rescales each column so that its minimum maps to 0 and its maximum maps to 1.


In [116]:
# artist_pop distribution descriptive stats
print(songDF['artist_pop'].describe())

count    100.00000
mean      77.03000
std       12.23297
min       46.00000
25%       71.00000
50%       80.50000
75%       88.00000
max       97.00000
Name: artist_pop, dtype: float64


Next, we normalize `artist_pop`; the same scaling is applied to the audio features in the final pipeline. The normalization of each attribute depends only on that attribute's maximum and minimum values.


In [117]:
# Normalization
pop = songDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

|   | artist_pop |
|---|---|
| 0 | 0.098039 |
| 1 | 0.196078 |
| 2 | 0.54902 |
| 3 | 0.705882 |
| 4 | 0.098039 |
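These values can be checked by hand with the min-max formula $x' = (x - \min) / (\max - \min)$. According to the `describe()` output above, `artist_pop` ranges from 46 to 97, and the first two songs shown earlier have `artist_pop` values of 51 and 56. The helper name `min_max` is illustrative.

```python
def min_max(value, low, high):
    """Min-max normalization: map [low, high] linearly onto [0, 1]."""
    return (value - low) / (high - low)

# artist_pop ranges from 46 to 97 in the describe() output
low, high = 46, 97
for artist_pop in (51, 56, 97):
    print(artist_pop, round(min_max(artist_pop, low, high), 6))
```

The first two results, 0.098039 and 0.196078, match the first two rows of the scaled output.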


#### Feature Generation

Finally, we define the following function, which generates all of the features above and concatenates them into a new data frame. Every feature group other than the genre TF-IDF columns is multiplied by a fixed weight (between 0.2 and 0.5 in the code below) before concatenation. The result is the final feature set used to generate recommendations.


In [149]:
def create_feature_set(df, float_cols):
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['artist_genres'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]  # get_feature_names() was removed in scikit-learn 1.2
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "name")

    # One-hot encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization - scale popularity columns
    pop = df[["artist_pop","popularity"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio feature columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    final.insert(loc=0, column='track_id', value=df['track_id'].values) # Add track id
    
    return final # Final set of features 

In [150]:
# Save data and generate features
float_cols = songDF.dtypes[songDF.dtypes == 'float64'].index.values
complete_feature_set = create_feature_set(songDF, float_cols=float_cols)

# songDF.to_csv("../data/allsong_data.csv", index = False)
#complete_feature_set.to_csv("../data/complete_feature.csv", index = False)
complete_feature_set.head(3)



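|   | track_id | genre\|afrofuturism | genre\|alternative | genre\|atl | genre\|australian | genre\|bass | genre\|baton | genre\|bc | genre\|brooklyn | genre\|cali | ... | key\|4 | key\|5 | key\|6 | key\|7 | key\|8 | key\|9 | key\|10 | key\|11 | mode\|0 | mode\|1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3g3RCV5ImXwzHpKwM2iunc | 0.0 | 0.0 | 0.303252 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |
| 1 | 04QWC97Dvd9g0IEDoyUDBX | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |
| 2 | 6Ab81Bs9fcOwaTYuBsUUpI | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |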


### Content-based Filtering Recommendation


The next step is to perform content-based filtering based on the song features. To do so, we combine all songs in a playlist into one summary vector. Then, we compute the similarity between the summarized playlist vector and every song in the database that is not in the playlist. Finally, we use this similarity measure to retrieve the most relevant songs outside the playlist and recommend them.



#### Choose Playlist


In this part, we test the data with the *but my feet in bottega* playlist in the dataset.


In [151]:
testDF = playlistDF[playlistDF['playlist'] == "but my feet in bottega"]

#### Extract features

The next step is to generate the features. We first use the `track_id` to separate the songs that are in the playlist from those that are not. Then, we add the feature vectors of all songs in the playlist together to form a summary vector.




In [156]:
# Summarize playlist into a single vector
def generate_playlist_feature(feat_set, playlist_df):
    
    # Find song features in the playlist
    feat_set_playlist = feat_set[feat_set['track_id'].isin(playlist_df['track_id'].values)]    
    
    # Find all non-playlist song features
    feat_set_nonplaylist = feat_set[~feat_set['track_id'].isin(playlist_df['track_id'].values)]
    feat_set_playlist_final = feat_set_playlist.drop(columns = "track_id")
    
    # Single vector feature summarizing playlist
    return feat_set_playlist_final.sum(axis = 0), feat_set_nonplaylist

> In other words, this vector describes the whole playlist as if it is one song.



In [182]:
# Generate the features
feat_set_pl, feat_set_nonpl = generate_playlist_feature(complete_feature_set, testDF)
# Non-playlist features feat_set_nonpl.head()

100


In [185]:
# Summarized playlist features
complete_feature_set
feat_set_pl

genre|afrofuturism     0.596351
genre|alternative      0.627411
genre|atl              6.138455
genre|australian       0.848195
genre|bass             0.331550
                        ...    
key|9                  2.500000
key|10                 4.500000
key|11                 2.500000
mode|0                18.500000
mode|1                31.500000
Length: 98, dtype: float64


#### Find similarity

We use the `cosine_similarity()` function from scikit-learn to measure the similarity between each song and the summarized playlist vector.
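To show what this comparison computes, here is a plain-Python sketch that sums made-up song vectors into a playlist summary and ranks made-up candidate songs by cosine similarity. The names `cosine`, `playlist_songs`, and `candidates` are illustrative and not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up feature vectors for the songs in a playlist
playlist_songs = [[0.2, 0.9, 0.0], [0.4, 0.7, 0.0]]
# Summary vector: column-wise sum, as in generate_playlist_feature
summary = [sum(col) for col in zip(*playlist_songs)]

# Made-up candidate songs that are not in the playlist
candidates = {"candidate_1": [0.3, 0.8, 0.0], "candidate_2": [0.0, 0.1, 0.9]}
ranked = sorted(candidates, key=lambda name: cosine(candidates[name], summary), reverse=True)
print(ranked)
```

Because cosine similarity depends only on direction and not on magnitude, summing the playlist vectors gives the same ranking as averaging them would.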


In [None]:
# Generate recommendations based on the songs in a specific playlist
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
    '''
    
    # The feature set identifies songs by 'track_id'; copy so that adding 'sim' does not modify a slice of df
    non_playlist_df = df[df['track_id'].isin(nonplaylist_features['track_id'].values)].copy()
    # Find cosine similarity between the playlist and the complete song set
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('track_id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    # Top 40 recommendations for that playlist
    return non_playlist_df_top_40

----------------------------------------------------

In [None]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import cluster, decomposition

songs = pd.read_csv("data/spotify.dat")
labels = songs.values[:,1]
X = songs.values[:,2:]

# Note: despite the variable name, this fits affinity propagation, not k-means
kmeans = cluster.AffinityPropagation(preference=-200)
kmeans.fit(X)

# Group the song labels by predicted cluster
predictions = {}
for p, n in zip(kmeans.predict(X), labels):
    predictions.setdefault(p, []).append(n)

for p in predictions:
    print("Category", p)
    print("-----")
    for n in predictions[p]:
        print(n)

    print("")
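The cell above clusters the songs with scikit-learn's `AffinityPropagation`. Since the stated objective mentions k-means, the following is a minimal from-scratch sketch of Lloyd's algorithm on made-up two-dimensional points. It seeds the centroids with the first `k` points so that the result is reproducible, which is a simplification; practical implementations choose seeds more carefully (for example with k-means++). The function name `k_means` and the sample points are illustrative.

```python
import math

def k_means(points, k, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    assignment = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
    return assignment, centroids

# Made-up (energy, valence) points forming two loose groups
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Unlike affinity propagation, k-means requires the number of clusters `k` to be chosen up front.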

---------------------------------------------------------------


### Similar Artists Web Visual


First, we want to find the most frequently occurring artist in a given playlist. We use the value_counts function to get a sequence containing counts of unique values sorted in descending order. 


In [25]:
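# Count tracks per (artist, artist_id) pair, sorted in descending order
tallyArtists = df.value_counts(["artist", "artist_id"]).reset_index(name='counts')
# Row 1 is Post Malone, who ties with Juice WRLD (row 0) at 9 tracks
topArtist = tallyArtists['artist_id'][1]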
tallyArtists.head(4)

|   | artist | artist_id | counts |
|---|---|---|---|
| 0 | Juice WRLD | 4MCBfE4596Uoi2O4DtmEMz | 9 |
| 1 | Post Malone | 246dkjvS1zLTtiykXe5h60 | 9 |
| 2 | SAINt JHN | 0H39MdGGX6dbnnQPt6NQkZ | 4 |
| 3 | Quavo | 0VRj0yCOv2FXJNP47XQnx5 | 3 |


#### Links Dataset

We can retrieve an artist and that artist's related artists by passing the artist ID to the `artist` and `artist_related_artists` methods of the spotipy client. The returned list of similar artists is sorted by a similarity score based on listener data.

In [26]:
# create links table
a = sp.artist(topArtist)
ra = sp.artist_related_artists(topArtist)

# dictionary of lists 
links_dict = {"source_name":[],"source_id":[],"target_name":[],"target_id":[]};
for artist in ra['artists']:
    links_dict["source_name"].append(a['name'])
    links_dict["source_id"].append(a['id'])
    links_dict["target_name"].append(artist['name'])
    links_dict["target_id"].append(artist['id'])

Let’s take it a step further and query the API for the artists related to the first four of those related artists. In other words, we collect two generations of similar artists, starting from the most frequent artist in the given playlist.

In [27]:
for i in range(0, 4):
    a = sp.artist(links_dict['target_id'][i])
    ra = sp.artist_related_artists(links_dict['target_id'][i])
    time.sleep(.5)
    for artist in ra['artists']:
        links_dict["source_name"].append(a['name'])
        links_dict["source_id"].append(a['id'])
        links_dict["target_name"].append(artist['name'])
        links_dict["target_id"].append(artist['id'])

# Convert links dict to dataframe
links = pd.DataFrame(links_dict) 

# Export to excel sheet             
links.to_excel("links.xlsx", index = False)
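The two cells above amount to a two-level expansion of the related-artists graph: first the seed artist's neighbours, then the neighbours of the first few of those. Here is an offline sketch of that idea, with a made-up lookup table standing in for `sp.artist_related_artists`. The names `related`, `build_links`, and `max_expand`, and the single-letter artists, are hypothetical.

```python
# Made-up adjacency table standing in for the related-artists API call
related = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["E"],
}

def build_links(seed, max_expand):
    """Return (source, target) edges for the seed and its first related artists."""
    first_gen = related.get(seed, [])
    links = [(seed, target) for target in first_gen]
    # Second generation: expand only the first max_expand related artists
    for source in first_gen[:max_expand]:
        links.extend((source, target) for target in related.get(source, []))
    return links

demo_links = build_links("A", max_expand=4)
print(demo_links)

# Node list for a "points" table: every artist that appears in any edge
demo_nodes = sorted({name for edge in demo_links for name in edge})
print(demo_nodes)
```

The edge list plays the role of the links table, and the de-duplicated node list plays the role of the points table built in the next section.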

In [28]:
links.head()

|   | source_name | source_id | target_name | target_id |
|---|---|---|---|---|
| 0 | Post Malone | 246dkjvS1zLTtiykXe5h60 | Rae Sremmurd | 7iZtZyCzp3LItcw1wtPI3D |
| 1 | Post Malone | 246dkjvS1zLTtiykXe5h60 | Huncho Jack | 6extd4B6hl8VTmnlhpl2bY |
| 2 | Post Malone | 246dkjvS1zLTtiykXe5h60 | Tyla Yaweh | 1MXZ0hsGic96dWRDKwAwdr |
| 3 | Post Malone | 246dkjvS1zLTtiykXe5h60 | A$AP Ferg | 5dHt1vcEm9qb8fCyLcB3HL |
| 4 | Post Malone | 246dkjvS1zLTtiykXe5h60 | 6LACK | 4IVAbR2w4JJNJDDRFP3E83 |


#### Points Dataset

In [29]:
# create "points" table             
all_artist_ids = list(set(links_dict['source_id'] + links_dict['target_id']))

In [30]:
# dictionary of lists 
points_dict = {"id":[],"name":[],"followers":[],"popularity":[],"url":[],"image":[]};

for id in all_artist_ids:
    time.sleep(.5)
    a = sp.artist(id)
    points_dict['id'].append(id)
    points_dict['name'].append(a['name'])
    points_dict['followers'].append(a['followers']['total'])
    points_dict['popularity'].append(a['popularity'])
    points_dict['url'].append(a['external_urls']['spotify'])
    points_dict['image'].append(a['images'][0]['url'])

# Convert points dict to dataframe
points = pd.DataFrame(points_dict) 

# Export to excel sheet             
points.to_excel("points.xlsx", index = False)

In [31]:
points.head()

|   | id | name | followers | popularity | url | image |
|---|---|---|---|---|---|---|
| 0 | 7mX72Bq2iXNr8fZdu23fQL | Boslen | 50072 | 54 | https://open.spotify.com/artist/7mX72Bq2iXNr8f... | https://i.scdn.co/image/ab6761610000e5ebc1e495... |
| 1 | 5zctI4wO9XSKS8XwcnqEHk | Lil Mosey | 4882787 | 71 | https://open.spotify.com/artist/5zctI4wO9XSKS8... | https://i.scdn.co/image/ab6761610000e5ebe1ca9d... |
| 2 | 6oMuImdp5ZcFhWP0ESe6mG | Migos | 12799222 | 78 | https://open.spotify.com/artist/6oMuImdp5ZcFhW... | https://i.scdn.co/image/ab6761610000e5ebf4593f... |
| 3 | 7pFeBzX627ff0VnN6bxPR4 | Desiigner | 3416681 | 64 | https://open.spotify.com/artist/7pFeBzX627ff0V... | https://i.scdn.co/image/ab6761610000e5ebc527ef... |
| 4 | 2RDOrhPqAM4jzTRCEb19qX | Sheck Wes | 1362631 | 66 | https://open.spotify.com/artist/2RDOrhPqAM4jzT... | https://i.scdn.co/image/ab6761610000e5eb170952... |


The following visualization is based on the [Spotify Similar Artists API](https://unboxed-analytics.com/data-technology/visualizing-rap-communities-wtih-python-spotifys-api/) article and was created with Flourish Studio.


In [34]:
%%html

<div class="flourish-embed flourish-network" data-src="visualisation/12232729"><script src="https://public.flourish.studio/resources/embed.js"></script></div>

------------------------------------------

#### Neo4js Visuals