# Spotify Song Recommendation

For my program, I use the following three libraries:

- Spotipy for retrieving data and interacting with the official Spotify Web API
- Pandas for data analysis
- Seaborn for data visualization


In [6]:
import spotipy
import pandas as pd
import matplotlib.pyplot as plot
import seaborn as sns
from spotipy import util
from spotipy.oauth2 import SpotifyClientCredentials
from matplotlib import style

# Change Pandas display settings
pd.set_option("display.width", 800)
pd.set_option("display.max_columns", 50)

# Change Seaborn default settings
sns.set_context('talk')
style.use('ggplot')

First, I register my app with Spotify and pass the app's client ID and client secret to the Spotipy library.

In [2]:
API_LIMIT = 50 # Spotify API call limit

user_id = "xx" # client ID of the registered Spotify app
secret = "xx" # client secret of the registered Spotify app

# Create client by feeding credentials to Spotipy library
client_credentials_manager = SpotifyClientCredentials(client_id=user_id, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
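
Hardcoding the ID and secret in a notebook makes them easy to leak. As a minimal sketch, the credentials could instead be read from environment variables. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a failed API call later
        raise RuntimeError(
            "Missing environment variable: {}".format(missing.args[0])
        ) from None


# Hypothetical usage: a plain dict stands in for os.environ
client_id, client_secret = load_spotify_credentials(
    {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
)
```

The returned pair can then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` in place of the literal strings.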

In this program, I analyze a Spotify playlist that I like. Note that for the purpose of this project, all songs on the playlist belong to the same genre.

I call the Spotify API to get the playlist ID and all the tracks in it, along with basic details such as track name, artist name and ID, album name and track popularity.

In [None]:
username = 'xx'  # Spotify username
playlist_name = 'asia indie'  # playlist name

playlists_results = sp.user_playlists(username) # Get all my playlists
playlist_ids = [playlist['id'] for playlist in playlists_results['items']
                if playlist['name'] == playlist_name] # Get id of the playlist I want to use

# Raise exception when playlist is not found
if not playlist_ids:
    raise Exception("Playlist {} not found".format(playlist_name))
    
# Get tracks from playlist
tracks_results = sp.user_playlist(username, playlist_ids[0])

# Note that only the id and name of the main (first-listed) artist are stored; popularity is the track's own
df_tracks = pd.DataFrame([[t["track"]["id"], t["track"]["name"], t["track"]["artists"][0]["id"],
                           t["track"]["artists"][0]["name"], t["track"]["album"]["name"], t["track"]["popularity"]]
                          for t in tracks_results['tracks']['items']],
                        columns=["id", "song_name", "artist_id", "artist_name", "album_name", "popularity"])

# Normalize popularity by scaling down by 100
df_tracks["popularity_norm"] = df_tracks["popularity"] / 100.
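
The playlist response is paged, so the code above only sees the first page of tracks. For a longer playlist, the remaining pages would have to be followed. Below is a minimal sketch of that loop, assuming each page is a dict with `items` and `next` keys; the helper name `collect_all_items` and the fake pages are hypothetical.

```python
def collect_all_items(first_page, fetch_next):
    """Gather the "items" of every page of a paged API response.

    `fetch_next` takes the current page and returns the following one.
    """
    items = []
    page = first_page
    while page:
        items.extend(page["items"])
        # "next" is None (or missing) on the last page
        page = fetch_next(page) if page.get("next") else None
    return items


# Hypothetical two-page response used only to exercise the helper
later_pages = {"page-2": {"items": ["track-3"], "next": None}}
first = {"items": ["track-1", "track-2"], "next": "page-2"}
all_items = collect_all_items(first, lambda page: later_pages[page["next"]])
print(all_items)
```

With Spotipy, `sp.next` can play the role of `fetch_next`, for example `collect_all_items(tracks_results['tracks'], sp.next)`.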

Apart from details about the tracks, I also want to get details about the artists such as music genres and popularity.

In [None]:
def _get_artists_df(sp, artist_ids):
    # A helper method to get artist's information with pagination (since API call limit is 50)
    artist_list = []

    while artist_ids:
        artists_results = sp.artists(artist_ids[:API_LIMIT])
        artist_list += [[t["id"], t["genres"], t["popularity"]] for t in artists_results["artists"]]

        artist_ids = artist_ids[API_LIMIT:] # to move on with the next 50 ids

    df_artists = pd.DataFrame(artist_list, columns=["artist_id", "artist_genres", "artist_popularity"])

    df_artists["artist_popularity_norm"] = df_artists["artist_popularity"] / 100.

    return df_artists


artist_ids = df_tracks["artist_id"].unique().tolist() # get ids of artists whose songs appear in my playlist
df_artists = _get_artists_df(sp, artist_ids) # store info about these artists in a dataframe
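
The slicing loop in `_get_artists_df` is a general batching pattern: take at most 50 ids, call the API, move on to the next 50. A minimal sketch of the same idea as a reusable generator (the name `chunked` is hypothetical):

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# 120 placeholder ids split into batches of at most 50
batches = list(chunked(list(range(120)), 50))
print([len(batch) for batch in batches])  # -> [50, 50, 20]
```

Unlike the `while` loop, this version leaves the input list untouched, which helps when the ids are needed again later.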

Next, I call the Spotify API to get detailed features of my tracks:

* acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

* danceability: how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

* duration_ms: duration of the track in milliseconds.

* energy: a measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

* instrumentalness: predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

* key: the key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

* liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

* loudness: the overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.

* mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

* speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

* tempo: overall estimated tempo of a track in beats per minute (BPM).

* time_signature: estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

* valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

All these details were then stored inside different Pandas dataframes and later merged into one single dataframe to simplify the analyzing process.
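
To make the encoded features above easier to read, here is a minimal sketch that decodes `key`/`mode` into a note name and `speechiness` into the three bands described in the list. The function names are hypothetical, and treating any value outside 0-11 as an unknown key is an assumption.

```python
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Turn a pitch class (0-11) and a mode (1 = major, 0 = minor) into text."""
    if not 0 <= key <= 11:
        return "unknown key"  # assumption: anything outside 0-11 means no key
    return "{} {}".format(PITCH_CLASSES[key], "major" if mode == 1 else "minor")


def describe_speechiness(value):
    """Map a speechiness score to the bands 0.33 and 0.66 from the list above."""
    if value > 0.66:
        return "mostly spoken word"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


print(describe_key(9, 0))          # -> A minor
print(describe_speechiness(0.05))  # -> mostly music
```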


In [None]:
def _get_features_df(sp, track_ids):
    # A helper method to get track's features with pagination and return a DataFrame

    feature_list = []
    while track_ids:
        features_results = sp.audio_features(track_ids[:API_LIMIT])

        feature_list += features_results

        track_ids = track_ids[API_LIMIT:]

    df_features = pd.DataFrame(feature_list)[["id", "analysis_url", "duration_ms", "acousticness", "danceability",
                                              "energy", "instrumentalness", "liveness", "loudness", "valence",
                                              "speechiness", "key", "mode", "tempo", "time_signature"]]
    # normalize tempo from the 24-200 BPM range to 0-1
    df_features["tempo_norm"] = (df_features["tempo"] - 24) / 176.

    return df_features

track_ids = df_tracks["id"].unique().tolist() # get rid of duplicates
df_features = _get_features_df(sp, track_ids) # get tracks's features

# Create a df for current playlist to include track features and artist info
df_cur = df_features.merge(df_tracks, on="id")
df_cur = df_cur.merge(df_artists, on="artist_id")

# Create a new column with full name of the song
df_cur["full_name"] = df_cur["artist_name"] + " -- " + df_cur["song_name"]

# Sort songs by popularity
df_cur.sort_values("popularity", inplace=True, ascending=False)

df_cur["time_signature"] = df_cur["time_signature"].astype(pd.api.types.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
df_cur["key"] = df_cur["key"].astype(pd.api.types.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
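
The `tempo_norm` column is a min-max scaling: subtracting the lower bound (24 BPM) and dividing by the width of the range (200 - 24 = 176) maps tempo onto the same 0-1 scale as the other features. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def min_max_scale(value, low, high):
    """Map `value` from the range [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


print(min_max_scale(24, 24, 200))   # -> 0.0
print(min_max_scale(112, 24, 200))  # -> 0.5
print(min_max_scale(200, 24, 200))  # -> 1.0
```

The same function with `low=0` and `high=100` reproduces the popularity normalization used earlier.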

After getting all the information I need, I use Seaborn to plot some graphs in order to visualize the data and hopefully discover some insights into my music taste.

In [None]:
"""Create distplot charts using Seaborn"""
def _distribution_plot(df, key, label, x_limits):
    ax = sns.distplot(df[[key]], bins=30, label=label)
    if x_limits is not None:
        ax.set_xlim(*x_limits)
    plot.title(key)
    plot.legend()
    plot.show()

x_limits = {"duration_ms": None, "loudness": (-60, 0), "tempo": (24, 200), "popularity": (0, 100),
            "artist_popularity": (0, 100)}

for key in ["duration_ms", "acousticness", "danceability", "energy", "instrumentalness", "liveness",
                    "loudness", "valence", "speechiness", "tempo", "popularity", "artist_popularity"]:
    _distribution_plot(df_cur, key, label="My Indie Playlist", x_limits=x_limits.get(key, (0, 1)))


"""Create countplot charts using Seaborn"""
def _count_plot(df, key, label):
    ax = sns.countplot(data=df, x=key, palette="tab20")
    ax.set_title(label)
    plot.show()

for key in ["key", "time_signature", "mode"]:
    _count_plot(df_cur, key, label="My Indie Playlist")

"""Create boxplot using Seaborn"""
ax = sns.boxplot(data=df_cur[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
                              "valence", "speechiness", "artist_popularity_norm", "popularity_norm", "tempo_norm"]])
ax.set_title("My Indie Playlist")
plot.show()


From the graphs (included in the README), I tend to prefer songs with low instrumentalness, speechiness and liveness, medium tempo, medium-to-high danceability and a duration of around 220 seconds. Song popularity and artist popularity both span a wide range, implying that I have no strong preference for either. Valence is centered around 0.4, suggesting that I listen to both cheerful and sad songs.
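
Readings taken from a plot can be cross-checked with a few summary statistics. The sketch below computes the median and quartiles of a feature; the helper name `summarize` and the sample durations are hypothetical, not values from my playlist.

```python
import statistics


def summarize(values):
    """Return the median and the first/third quartiles of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": statistics.median(values), "q1": q1, "q3": q3}


# Hypothetical durations in milliseconds, converted to seconds before summarizing
durations_ms = [180000, 200000, 215000, 220000, 225000, 240000, 300000]
summary = summarize([ms / 1000 for ms in durations_ms])
print(summary)
```

A narrow gap between `q1` and `q3` backs up a claim like "around 220 seconds", while a wide gap (as with popularity) suggests no clear preference.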

I wondered how my playlist compares against other Spotify songs of the same genre, so I called the Spotify API to get a sample of the "Indie Hiphop" genre that I like. To keep the project small, I restricted the sample to Korean and Vietnamese songs only; my original playlist also includes songs from these two countries.

In [None]:
""" Get 1000 tracks of my favorite genres from Spotify API to later compare audio features"""
number_of_tracks = 1000
genres = {"vietnamese hip hop", "k-indie"} # my favorite genres

search_runs = int(number_of_tracks / API_LIMIT) # number of paginated search calls made per genre

search_list = []
for i in range(search_runs):
    for genre in genres:
        search_results = sp.search('genre:"{}"'.format(genre), type="track", limit=API_LIMIT, offset=API_LIMIT*i)

        search_list += [[t["id"], t["name"], t["artists"][0]["id"], t["artists"][0]["name"],
                            t["album"]["name"], t["popularity"]]
                           for t in search_results['tracks']['items']]

df_search = pd.DataFrame(search_list,
                         columns=["id", "song_name", "artist_id", "artist_name", "album_name", "popularity"])
df_search["popularity_norm"] = df_search["popularity"] / 100 # normalize popularity
track_ids = df_search["id"].unique().tolist() # get unique track ids
df_features = _get_features_df(sp, track_ids) # get songs' features
artist_ids = df_search["artist_id"].unique().tolist() # get unique artist ids
df_artists = _get_artists_df(sp, artist_ids) # get artist info

df_sample = df_features.merge(df_search, on="id")
df_sample = df_sample.merge(df_artists, on="artist_id")
df_sample["full_name"] = df_sample["artist_name"] + " -- " + df_sample["song_name"]
df_sample.sort_values("popularity", inplace=True, ascending=False) # sort by song popularity

# Convert time_signature and key to category
df_sample["time_signature"] = df_sample["time_signature"].astype(pd.api.types.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
df_sample["key"] = df_sample["key"].astype(pd.api.types.CategoricalDtype(categories=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))


# Repeat the same plot analysis on both my playlist and the Indie Playlist sample from Spotify
def _distribution_plot_2(df, df_other, key, labels, x_limits):
    ax = sns.distplot(df[[key]], bins=30, label=labels[0])
    if x_limits is not None:
        ax.set_xlim(*x_limits)
    ax = sns.distplot(df_other[[key]], bins=30, label=labels[1])
    if x_limits is not None:
        ax.set_xlim(*x_limits)
    plot.title(key)
    plot.legend()
    plot.show()

def _count_plot_2(df, df_other, key, labels):
    fig, ax = plot.subplots(1, 2)
    sns.countplot(data=df, x=key, ax=ax[0], palette="tab20")
    ax[0].set_title(labels[0])
    sns.countplot(data=df_other, x=key, ax=ax[1], palette="tab20")
    ax[1].set_title(labels[1])
    plot.show()

for key in ["duration_ms", "acousticness", "danceability", "energy", "instrumentalness", "liveness",
            "loudness", "valence", "speechiness", "tempo", "popularity", "artist_popularity"]:
    _distribution_plot_2(df_cur, df_sample, key,
              labels = ["My Indie Playlist", "1000 Indie songs"],
              x_limits = x_limits.get(key, (0, 1)))

for key in ["key", "time_signature", "mode"]:
    _count_plot_2(df_cur, df_sample, key, labels=["My Indie Playlist", "1000 Indie songs"])

fig, ax = plot.subplots(2, 1)
sns.boxplot(data=df_cur[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
                         "valence", "speechiness", "artist_popularity_norm", "popularity_norm",
                         "tempo_norm"]], ax=ax[0])
ax[0].set_title("My Indie Playlist")
sns.boxplot(data=df_sample[["acousticness", "danceability", "energy", "instrumentalness", "liveness",
                           "valence", "speechiness", "artist_popularity_norm", "popularity_norm",
                           "tempo_norm"]], ax=ax[1])
ax[1].set_title("1000 Indie songs")
plot.show()


The graphs show that my playlist differs from the 1000 Indie songs in these aspects:

- I like songs with slightly higher acousticness/speechiness/loudness/artist popularity

- I like songs with relatively lower danceability/tempo/valence

- I prefer songs in keys (1, 4, 5) and like songs in keys (2, 11) least
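
The key comparison can also be made numerically rather than by eye. Since the two sets differ in size, it is the share of each key that should be compared, not the raw count. A minimal sketch with hypothetical helper names and made-up key lists:

```python
from collections import Counter


def key_shares(keys):
    """Return the fraction of tracks that fall in each key."""
    counts = Counter(keys)
    return {key: count / len(keys) for key, count in counts.items()}


def share_difference(mine, reference):
    """Share of each key in `mine` minus its share in `reference`."""
    mine_shares, ref_shares = key_shares(mine), key_shares(reference)
    all_keys = sorted(set(mine_shares) | set(ref_shares))
    return {key: mine_shares.get(key, 0) - ref_shares.get(key, 0) for key in all_keys}


# Made-up key columns: positive values mean the key is over-represented in `mine`
diff = share_difference(mine=[1, 1, 4, 5], reference=[1, 2, 2, 11])
print(diff)
```

With the real data, `mine` and `reference` would be the `key` columns of `df_cur` and `df_sample` converted to lists.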


## Creating a new playlist

After gaining some insights into my own music preferences, I applied the filters below to the dataframe of 1000 songs in order to keep only tracks that I would potentially like:

- 0.4 <= danceability <= 0.8
- duration between the 10th and 90th percentiles of the original playlist's duration
- instrumentalness <= 0.1
- key in (1,4,5,6,8,10)
- 80 <= tempo <= 110
- 0.3 <= valence <= 0.5
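
The filters above amount to one yes/no test per track. The sketch below expresses them as a single predicate over a plain dict, independent of Pandas. The names `duration_bounds` and `passes_filters` and the sample values are hypothetical, and the percentile method used here may differ slightly from the one Pandas applies.

```python
import statistics


def duration_bounds(durations_ms):
    """Return the 10th and 90th percentiles of a list of durations."""
    deciles = statistics.quantiles(durations_ms, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def passes_filters(track, duration_range):
    """Check one track (a dict of audio features) against all six filters."""
    low, high = duration_range
    return (
        0.4 <= track["danceability"] <= 0.8
        and low <= track["duration_ms"] <= high
        and track["instrumentalness"] <= 0.1
        and track["key"] in {1, 4, 5, 6, 8, 10}
        and 80 <= track["tempo"] <= 110
        and 0.3 <= track["valence"] <= 0.5
    )


# Made-up playlist durations and a made-up candidate track
bounds = duration_bounds([180000, 200000, 215000, 220000, 225000, 240000, 300000])
candidate = {"danceability": 0.6, "duration_ms": 221000, "instrumentalness": 0.0,
             "key": 4, "tempo": 95.0, "valence": 0.4}
print(passes_filters(candidate, bounds))  # -> True
```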

In [None]:
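def _apply_filters(df, condition, label):
    # Keep only rows matching `condition` and report how many songs the filter dropped
    before = len(df)
    dropped_songs = df[~condition]["full_name"].head().tolist()
    df = df[condition]
    print("Filter on {}: {} -> {} songs (dropped e.g. {})".format(label, before, len(df), dropped_songs))
    return df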

df_new_sample = df_sample.drop_duplicates(["full_name"], keep="first")

# drop songs already in my original playlist
df_new_sample = _apply_filters(df_new_sample,
                          condition=~(df_new_sample["full_name"]).isin((df_cur["full_name"]).tolist()),
                          label="name")

df_new_sample = _apply_filters(df_new_sample, 
                          condition=(df_new_sample["danceability"].between(0.4, 0.8)),
                          label="danceability")

df_new_sample = _apply_filters(df_new_sample,
                          condition=(df_new_sample["instrumentalness"] <= 0.1),
                          label="instrumentalness")

df_new_sample = _apply_filters(df_new_sample,
                          condition=(df_new_sample["tempo"].between(80, 110)),
                          label="tempo")

df_new_sample = _apply_filters(df_new_sample,
                          condition=(df_new_sample["key"].isin([1, 4, 5, 6, 8, 10])),
                          label="key")

df_new_sample = _apply_filters(df_new_sample,
                          condition=(df_new_sample["duration_ms"].between(*df_cur["duration_ms"].quantile([0.10, 0.90]))),
                          label="duration_ms")

After running this code, I got a dataframe of 197 tracks. I then used Spotipy again to create a new Spotify playlist called "Asia Indie Trial" from these 197 tracks.

In [None]:
playlist_name_new = "Asia Indie Trial"

track_ids = df_new_sample["id"].unique().tolist() # get unique track ids of 197 songs

# set up authentication token
scope = 'playlist-modify-public'
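# Pass the app credentials explicitly; otherwise Spotipy expects them in environment variables
token = util.prompt_for_user_token(username, scope, client_id=user_id, client_secret=secret,
                                   redirect_uri='http://localhost:8888/callback/')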

if token:
    sp = spotipy.Spotify(auth=token)
    sp.trace = False
    playlists = sp.user_playlist_create(username, playlist_name_new, public=True, description='')
    playlist_id = playlists["id"]
    
    while track_ids:
        results = sp.user_playlist_add_tracks(username, playlist_id, track_ids[:API_LIMIT])
        print(results)
        track_ids = track_ids[API_LIMIT:]

else:
    print("Can't get token for", username)
