# Song Gathering  
**JC Nacpil 2021/09/06**

In this notebook, we will build a database of Kpop songs with audio features using the Spotify Web API and Spotipy package. The output files will be used for `KpopSongRecommender`.

## Set-up

### Importing libraries

In [None]:
# Library for accessing Spotify API
import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth

# Scientific and vector computation for python
import numpy as np

# Data manipulation and analysis
import pandas as pd

# Utility functions for this notebook
from utils import repeatAPICall

# Progress bar
from tqdm import tqdm

# Cosine similarity calculation
from sklearn.metrics.pairwise import cosine_similarity

# Deep copy of python data structures
from copy import deepcopy

# Plotting library
import matplotlib.pyplot as plt


### Setting up Spotify API

The following are the Spotify API credentials `CLIENT_ID` and `CLIENT_SECRET` for our application. These allow us to access data from Spotify through the <a href='https://developer.spotify.com/documentation/web-api/'>Web API</a>. It is recommended to register your own application and manage these credentials at <a href = "https://developer.spotify.com/dashboard/">My Dashboard</a>.

In [None]:
CLIENT_ID = "dc7ef763416f49aca20c740e46bd1f79"
CLIENT_SECRET = "056f146106544a828574e8e903286fb7"

token = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
cache_token = token.get_access_token(as_dict = False)
sp = spotipy.Spotify(cache_token)
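
If you register your own application, one option is to keep the credentials out of the notebook and read them from environment variables instead. The sketch below assumes the variables are named `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names Spotipy itself looks for); `load_credentials` is a hypothetical helper and not part of Spotipy.

```python
import os

def load_credentials(env=os.environ):
    """Read Spotify API credentials from environment variables (hypothetical helper)."""
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running the notebook."
        )
    return client_id, client_secret

# Usage (uncomment once the variables are set):
# CLIENT_ID, CLIENT_SECRET = load_credentials()
```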

### Utility functions

For this notebook, we will use `repeatAPICall`, a function that repeats an API call until a request succeeds or the retry limit is reached.

In [None]:
def repeatAPICall(func, args, max_retry = 5):
    """
    Repeatedly calls spotipy func until a successful API request is made.
    
    Parameters:
        func : func
            Spotipy client function for making api calls
        args: dict
            Arguments to pass to func; Key: parameter, Value: parameter value
            Check Spotipy API of specified func for details
        max_retry: int
            Maximum iterations before prompting user to retry or skip
        
    Returns:
        result: dict
            Result of a successful API call, or None
        success: bool
            True if API call is successful, False otherwise
    """

    success = False
    res = None
    
    i = 0
    while i < max_retry:
        try: 
            res = func(**args)
            success = True
            return res, success
        except Exception as err:
            # Catch Exception instead of using a bare except so that Ctrl+C still interrupts the loop
            print("Error in API call ({}); retrying".format(err))
            i += 1
        
        if i >= max_retry:
            reset_loop = input("Max retry limit reached. Retry {} more time/s? (Y/N) ".format(max_retry)).upper()
            
            # Reset the index of the loop if the user chooses to retry
            i = 0 if reset_loop == 'Y' else max_retry
    return res, success
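
To see the retry pattern in isolation, here is a minimal non-interactive sketch. `retry_call` and `flaky` are hypothetical names used only for this illustration; unlike `repeatAPICall`, the sketch never prompts the user and simply gives up after `max_retry` failed attempts.

```python
def retry_call(func, args, max_retry=5):
    """Call func(**args) until it succeeds or max_retry attempts have failed."""
    for attempt in range(max_retry):
        try:
            return func(**args), True
        except Exception as err:
            print("Attempt {} failed: {}".format(attempt + 1, err))
    return None, False

# A stand-in for an API call that fails twice before succeeding
calls = {"count": 0}

def flaky(name):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"name": name}

result, success = retry_call(flaky, {"name": "BLACKPINK"})
print(result, success)
```

The return value has the same `(result, success)` shape as `repeatAPICall`, so callers can check `success` before using `result`.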

## Step 1: Getting playlists of a given category

In this step, we will gather playlists that are categorized as k-pop. We can use this as a starting point to gather an initial list of kpop artists.

This cell gets the list of playlist categories (with ID) available in Spotify. Let's set the country code to PH so we can get PH-specific results. 

In [None]:
all_categories = sp.categories(limit = 50, country = 'PH')

categories = all_categories['categories']['items']
for cat in categories:
    print("Category: {} | ID : {}".format(cat['name'],cat['id']))

This shows that the K-pop category has `id = kpop`!

Note for future implementation: OPM has `id = opm`

This next cell gathers the playlists for the `kpop` category and saves it to a DataFrame.

In [None]:
kpop_playlists_result = sp.category_playlists('kpop', country='PH', limit = 50)
kpop_playlists = kpop_playlists_result['playlists']['items']
while kpop_playlists_result['playlists']['next']:
    kpop_playlists_result = sp.next(kpop_playlists_result['playlists'])
    kpop_playlists.extend(kpop_playlists_result['playlists']['items'])

playlists = [playlist['name'] for playlist in kpop_playlists]
playlist_uris = [playlist['uri'] for playlist in kpop_playlists]

playlist_df = pd.DataFrame(zip(playlists,playlist_uris),columns = ["playlist", "playlist_uri"]).drop_duplicates().reset_index(drop=True)

In [None]:
# Save the playlist list to .csv file
filename = 'Data/playlists.csv'
playlist_df.to_csv(filename, index=False)

## Step 2: Collecting artists from playlists

In this step, we will use the list of playlists and gather all the artists that appear in them. Since we're using k-pop playlists, we assume that most of the artists we get from this step are k-pop. 

**Note:** Usually there will be some non-kpop artists appearing in this list, such as Dua Lipa or Ed Sheeran. These are usually artists that appear on k-pop collabs (ex. Dua Lipa and BLACKPINK - Kiss and Make Up)

In [None]:
# Load the existing playlist data
playlist_dir = 'Data/playlists.csv'
playlist_df = pd.read_csv(playlist_dir)

In [None]:
# Loop through playlists to build a list of artists

# Get the list of unique identifiers for each playlist
playlists = playlist_df.playlist.values
playlist_uris = playlist_df.playlist_uri.values

# Create dataframe to store artist data
artist_cols = ['artist','artist_uri']
artist_df = pd.DataFrame(columns = artist_cols)

for playlist,uri in zip(playlists, playlist_uris):
    
    print("Current playlist: {}".format(playlist))
    
    playlist_result, success = repeatAPICall(sp.playlist_tracks,{'playlist_id':uri})
    if not success: 
        print("Error in playlist {}".format(playlist))
        continue

    # Remove entries in playlist_result['items'] where the track is None
    playlist_result['items'] = [track for track in playlist_result['items'] if track['track'] is not None]

    # Skip the playlist if there are any errors
    try:
        artist_uris = [track['track']['artists'][0]['uri'] for track in playlist_result['items']]
        artists = [track['track']['artists'][0]['name'] for track in playlist_result['items']]
    except Exception: 
        print("Error in playlist {}".format(playlist))
        continue
    
    temp_df = pd.DataFrame(zip(artists,artist_uris),columns = artist_cols)
    artist_df = pd.concat([artist_df.reset_index(drop=True), temp_df.reset_index(drop=True)]).drop_duplicates()

# Reset the index of our resulting dataframe
artist_df = artist_df.drop_duplicates().reset_index(drop=True)
artist_df
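
Note that a single `sp.playlist_tracks` call returns at most one page of results (100 items by default), so longer playlists are truncated in the cell above. A paginated result can be walked in the same way as the playlist list in Step 1. The sketch below shows the idea with a stub in place of the Spotify client; `collect_items` and `fake_next` are hypothetical names, and with a real client `get_next` would be `sp.next`.

```python
def collect_items(first_page, get_next):
    """Gather 'items' from a paginated result by following its 'next' links."""
    items = list(first_page["items"])
    page = first_page
    while page["next"]:
        page = get_next(page)
        if page is None:  # nothing came back; stop with what we have
            break
        items.extend(page["items"])
    return items

# Stub pages standing in for API responses
pages = {
    "page2": {"items": ["c", "d"], "next": "page3"},
    "page3": {"items": ["e"], "next": None},
}
first = {"items": ["a", "b"], "next": "page2"}

def fake_next(page):
    return pages.get(page["next"])

print(collect_items(first, fake_next))
```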

At this point, we have around 900 artists in our database! We save our current output as `artists.csv`. In the following cells, we will extend the list by getting related acts for every artist in our current list.

In [None]:
# Save the existing artist data
artists_dir = 'Data/artists.csv'
artist_df.to_csv(artists_dir, index = False)

## Step 3: Extending artist data by gathering related artists

In this step, we will extend our current list of artists by adding their related artists. This runs for a set number of iterations, so after getting the initial batch of related artists, we can gather even more artists from that new batch.

There are two important steps to improve runtime and avoid repeating work. First, we label each artist with `temp_processed` (bool), which indicates whether we have already processed that artist's related artists. We set this initially to `False` and update it to `True` when an iteration has finished.

Second, we keep only artists whose genres match the ones we are interested in. `sp.artist_related_artists()` returns up to 20 related artists for a given artist, which can blow up our list exponentially and add artists that we don't want. For example, `BLACKPINK` and `BTS` are related to `Dua Lipa` and `Halsey`, respectively, as they feature together on collabs. However, if we keep these two results and get additional related artists based on them, we are likely to get more pop artists (~20 each) unrelated to the genre we are looking for. `genre_filter` is a list of substrings that we match against an artist's own genre list to decide whether to keep that artist in our list.
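
As a small illustration of the genre filter, the sketch below applies a substring test to hand-written artist dictionaries. The genre lists are made up for the example, and `matches_genre` is a hypothetical helper; it checks each genre separately, whereas the loop in this notebook tests against the joined genre string.

```python
genre_filter = ['k-pop', 'k-rap']

def matches_genre(artist, genre_filter):
    """Return True if any of the artist's genres contains a filter substring."""
    return any(substring in genre
               for genre in artist['genres']
               for substring in genre_filter)

# Illustrative data only; real genre lists come from the Spotify API
related = [
    {'name': 'Artist A', 'genres': ['k-pop', 'k-pop girl group']},
    {'name': 'Artist B', 'genres': ['dance pop', 'uk pop']},
    {'name': 'Artist C', 'genres': ['k-rap', 'korean r&b']},
]

kept = [artist['name'] for artist in related if matches_genre(artist, genre_filter)]
print(kept)
```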

In [None]:
# Load the existing artist data to extend
# For testing: randomly sample rows
artists_dir = 'Data/artists.csv'
artist_extended_df = pd.read_csv(artists_dir)
artist_extended_df

In the next cell, the code will go through each artist and get a list of related artists. 

In [None]:
# Add a 'temp_processed' column to artist_extended_df indicating whether an artist has already been processed by this loop
artist_extended_df['temp_processed'] = False

# Filter for genres
# We use 'k-pop' and 'k-rap' as the genre substrings
# We can also add 'korean' to match korean artists that are not considered k-pop (ex. OSTs)
genre_filter = ['k-pop', 'k-rap']
# genre_filter = ['k-','korean']

# Keep track of iteration progress (see sense check section below)
artist_count = [len(artist_extended_df.artist_uri.values)]
iter_count = [0]
removed_count = [0]

# Set maximum iterations
max_iter = 15

for i in range(max_iter):
    print("Current iter: {}".format(i+1))
    rel_artists = []
    rel_artist_uris = []
    
    # Create temporary df to store artists to be processed
    temp_df = artist_extended_df.copy()
    temp_df = temp_df[temp_df.temp_processed == False]
    
    # If temp df is empty, end the loop
    if temp_df.empty:
        print("No more artists to be processed! Breaking loop.")
        break
    
    artists = temp_df.artist.values
    artist_uris = temp_df.artist_uri.values
    
    print("Iter: {} | Artists count: {} | To be processed : {}".format(i+1, len(artist_extended_df.artist_uri.values), len(artist_uris) ))
    
    # Track number of artists removed per iteration
    total_removed = 0
    
    # Loop through artists 
    for artist,uri in zip(artists,artist_uris):
        
        # Get related artists for the current artist
        rel_artists_result, success = repeatAPICall(sp.artist_related_artists,{'artist_id':uri})
        if not success: 
            print("Skipping to next artist.")
            continue    
        
        # Remove artists whose genres do not contain the substrings in genre_filter
        old_count = len(rel_artists_result['artists'])
        rel_artists_result['artists'] = [rel_artist for rel_artist in rel_artists_result['artists'] if any(genre in ''.join(rel_artist['genres']) for genre in genre_filter)]
        new_count = len(rel_artists_result['artists'])
        
        # Track number of removed artists
        removed = old_count - new_count
        total_removed += removed
        
        rel_artists.extend([artist['name'] for artist in rel_artists_result['artists']])
        rel_artist_uris.extend([artist['uri'] for artist in rel_artists_result['artists']])

    # Create dataframe of related artists that were gathered
    rel_artist_df = pd.DataFrame(zip(rel_artists,rel_artist_uris),columns = ["artist", "artist_uri"]).drop_duplicates()
    rel_artist_df['temp_processed'] = False
    
    # At this step, all the entries in artist_extended_df have been processed and are labelled accordingly
    artist_extended_df['temp_processed'] = True
    
    # Combine artist_extended_df and rel_artist_df
    # Drop duplicates and keep the first value
    # When an artist appears in both artist_extended_df and rel_artist_df
    # (with different temp_processed values), this keeps the already-processed row
    
    artist_extended_df = pd.concat([artist_extended_df.reset_index(drop=True), rel_artist_df.reset_index(drop=True)]).drop_duplicates(subset = ['artist', 'artist_uri'], keep = 'first')
    
    # Add metrics to array
    iter_count.append(i+1)
    artist_count.append(len(artist_extended_df.artist_uri.values))
    removed_count.append(total_removed)

print("Done! Final count: {}".format(artist_count[-1]))
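
The loop above is essentially a breadth-first expansion over related artists. The sketch below reproduces the idea on a tiny hand-made graph, using plain Python sets instead of DataFrames. `related` and `expand` are hypothetical stand-ins for the API call and the loop, and the genre filter is left out for brevity.

```python
def expand(seed, related, max_iter=15):
    """Grow a set of artists by repeatedly adding their related artists."""
    known = set(seed)
    to_process = set(seed)   # plays the role of temp_processed == False
    counts = [len(known)]
    for _ in range(max_iter):
        if not to_process:
            break
        found = set()
        for artist in to_process:
            found.update(related.get(artist, []))
        to_process = found - known   # only new artists still need processing
        known |= to_process
        counts.append(len(known))
    return known, counts

# Toy "related artists" graph
related = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': ['B'],
}

known, counts = expand(['A'], related)
print(sorted(known), counts)
```

Once an iteration finds no new artists, the set of artists to process is empty and the expansion stops, which is the plateau that the sense check looks for.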

Here's our final list of artists!

In [None]:
artist_extended_df

### Sense Check
This cell plots the total number of artists gathered (blue) and the number of related artists removed (orange) as a function of the number of iterations. We see that the blue line generally plateaus, indicating that we have reached a reasonable upper limit of artists that can be gathered. We also see that the number of artists removed is large for `i = 1`, at around 8000. This means that the first iteration removes a large number of non-kpop artists. Without removing these artists at each iteration, the list would keep growing with artists outside the genre instead of converging.

In [None]:
plt.plot(iter_count,artist_count, label = "Artists in list")
plt.plot(iter_count[1:], removed_count[1:], label = "Removed in processing (not in genre filter)")
plt.xlabel("Number of iterations")
plt.ylabel("Artist count")
plt.legend()
plt.tight_layout()

In [None]:
# Save the existing artist data
# We drop the temp_processed column before writing to csv
artists_extended_dir = 'Data/artists_extended.csv'
artist_extended_df.drop('temp_processed', axis = 1).to_csv(artists_extended_dir, index = False)

## Step 4: Loading top tracks per artist
From our list of k-pop artists, we then get their top 10 tracks. This gives us a reasonable number of songs for our Kpop Song Recommender! 

In [None]:
# Load the existing artist data 
artists_dir = 'Data/artists.csv' 
artists_dir = 'Data/artists_extended.csv' # Comment out this line to use the initial artists dataset instead
artist_df = pd.read_csv(artists_dir)

In [None]:
artists = artist_df.artist.values
artist_uris = artist_df.artist_uri.values

# Loop through artist to build a list of tracks from their top 10 songs

# Create dataframe to store artist data
artist_cols = ['artist', 'artist_uri']
track_cols = ['track','track_uri','popularity']
track_df = pd.DataFrame(columns = artist_cols + track_cols)

for artist,uri in tqdm(zip(artists, artist_uris), total = len(artist_uris)):
    
    # print("Current artist: {}".format(artist))
    
    top10_result, success = repeatAPICall(sp.artist_top_tracks,{'artist_id':uri,'country':'PH'})
    if not success: 
        print("Skipping to next artist.")
        continue

    # Remove entries in top10_result['tracks'] where the track is None (currently disabled)
    #top10_result['tracks'] = [track for track in top10_result['tracks'] if track is not None]

    # Skip the artist if there are any errors
    try:
        track_uris = [track['uri'] for track in top10_result['tracks']]
        tracks = [track['name'] for track in top10_result['tracks']]
        popularity = [track['popularity'] for track in top10_result['tracks']]
    except Exception: 
        print("Error in artist {}".format(artist))
        continue
    
    temp_df = pd.DataFrame(zip(tracks,track_uris, popularity),columns = track_cols)
    # Set the artist and artist_uri columns
    temp_df['artist'] = artist
    temp_df['artist_uri'] = uri
    track_df = pd.concat([track_df.reset_index(drop=True), temp_df.reset_index(drop=True)]).drop_duplicates()

Note: if a track has multiple artists and appears in the top tracks of more than one of them, it will show up multiple times. In the cell below, we see that the number of rows is greater than the number of unique `track_uri` values.

We will keep these duplicates for now, but keep this in mind when post-processing the data. 

In [None]:
print("Number of track_df rows: {}\nNumber of unique track_uri: {}".format(len(track_df), track_df['track_uri'].nunique()))
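
When post-processing, one way to collapse these duplicates is to keep a single row per `track_uri`. Here is a minimal sketch on made-up rows; `demo_df` is a hypothetical frame and the URIs are placeholders, not real Spotify URIs.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'artist': ['Artist A', 'Artist B', 'Artist A'],
    'track': ['Collab Song', 'Collab Song', 'Solo Song'],
    'track_uri': ['uri:1', 'uri:1', 'uri:2'],
})

# Keep the first row seen for each track_uri
unique_tracks = demo_df.drop_duplicates(subset='track_uri', keep='first').reset_index(drop=True)
print(len(demo_df), len(unique_tracks))
```

Keeping the first row discards the other artists credited on a shared track, so this is only appropriate when one row per song is enough.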

In [None]:
# Save all tracks to file
tracks_dir = 'Data/tracks_top10.csv'
track_df.to_csv(tracks_dir, index = False)

## Step 5: Getting audio features per track

In this last step, we will get the audio features for each track in our database using Spotify's Audio Features functionality. These include a song's danceability, tempo, energy, key, time_signature, liveness, etc. In the main notebook, these will be used as a basis to recommend kpop songs that are similar to a user's top tracks in terms of these features.
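
This notebook does not show how the main notebook compares tracks. As one possibility, suggested by the `cosine_similarity` import at the top, two feature vectors can be compared by the cosine of the angle between them. The sketch below uses plain Python and made-up feature values; `cosine` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up (danceability, energy, valence) values for two tracks
track_a = [0.80, 0.70, 0.60]
track_b = [0.75, 0.65, 0.70]
print(cosine(track_a, track_b))
```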

In [None]:
# Load the track data
tracks_dir = 'Data/tracks_top10.csv'
track_df = pd.read_csv(tracks_dir)

tracks = track_df.track.values
track_uris = track_df.track_uri.values

track_df

In this next cell, we will go through each track and generate its audio features (saved as a dataframe). Each track is identified by its unique `track_uri`.

`sp.audio_features()` takes a list of track IDs (maximum of 100). We will loop through the list of track URIs in batches of 100 to minimize the number of API requests.
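
The batching itself is just slicing in steps of `batch_size`. A minimal sketch with placeholder URIs (`batches` and `demo_uris` are hypothetical names used only here):

```python
def batches(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 placeholder URIs are split into batches of 100, 100 and 50
demo_uris = ['uri:{}'.format(n) for n in range(250)]
print([len(batch) for batch in batches(demo_uris)])
```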

In [None]:
batch_size = 100

# This list of columns is taken directly from the keys of a feature dictionary
features_cols = ['danceability', 
                 'energy', 
                 'key', 
                 'loudness', 
                 'mode', 
                 'speechiness', 
                 'acousticness', 
                 'instrumentalness', 
                 'liveness', 
                 'valence', 
                 'tempo', 
                 'type', 
                 'id', 
                 'uri', 
                 'track_href', 
                 'analysis_url', 
                 'duration_ms', 
                 'time_signature']

features_df = pd.DataFrame(columns = features_cols)

for i in tqdm(range(0, len(track_uris), batch_size)):
    
    # Select the current batch
    track_uris_batch = track_uris[i:i+batch_size]
    
    features_result, success = repeatAPICall(sp.audio_features,{'tracks':track_uris_batch})
    if not success: 
        print("Skipping to next batch.")
        continue
    
    # Deepcopy the list of dictionaries to be modified
    # This is necessary for this particular structure
    features_dicts = deepcopy(features_result)
    
    
    # Drop None in features_dict
    # This will mean that some of our songs will not have features
    if any(d is None for d in features_dicts):
        
        print("Batch: {} to {} | Some songs do not have features; dropping from list.".format(i+1, i+batch_size)) 
        print("Count: {}".format(len(features_dicts)))
        features_dicts = [d for d in features_dicts if d is not None]
        print("New count: {}".format(len(features_dicts)))

    temp_df = pd.DataFrame.from_records(features_dicts) 
    features_df = pd.concat([features_df.reset_index(drop=True), temp_df.reset_index(drop=True)])
    
    temp_df_count = len(temp_df.index)
    if temp_df_count != batch_size:
        print("Batch: {} to {} | Dataframe rows count: {}".format(i+1, i+batch_size, temp_df_count)) 


# Rename 'uri' to 'track_uri' and drop duplicates based on track_uri
# Then reset the index
features_df = features_df.rename(columns={'uri':'track_uri'}).drop_duplicates(subset=['track_uri']).reset_index(drop=True)
features_df

Finally, we left join the features to `track_df` using `track_uri`.

In [None]:
# Merge features to track_df by track_uri
# Note: some rows will not have features. We keep them for now to retain the track info
track_features_df = track_df.merge(features_df, on='track_uri', how='left').reset_index(drop = True)
track_features_df
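
To illustrate what a left join does to tracks without features, here is a minimal sketch on made-up data. `tracks_demo` and `features_demo` are hypothetical frames, not the notebook's data.

```python
import pandas as pd

tracks_demo = pd.DataFrame({'track': ['Song 1', 'Song 2'],
                            'track_uri': ['uri:1', 'uri:2']})
features_demo = pd.DataFrame({'track_uri': ['uri:1'],
                              'tempo': [120.0]})

merged = tracks_demo.merge(features_demo, on='track_uri', how='left')

# Rows without a match keep their track info and get NaN in the feature columns
missing = merged[merged['tempo'].isna()]
print(missing['track'].tolist())
```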

In [None]:
# Save tracks with features to file
tracks_features_dir = 'Data/tracks_top10_features.csv'
track_features_df.to_csv(tracks_features_dir, index = False)

## Done!

After running this notebook, you should have the following updated files in your `Data` folder:
1. playlists.csv
2. artists.csv
3. artists_extended.csv
4. tracks_top10.csv
5. tracks_top10_features.csv

The following code cells try to load all albums by an artist. This is more computationally expensive than the previous segment, where we only got an artist's top 10 tracks.

In [None]:
# Load the existing artist data 
artists_dir = 'Data/artists.csv' 
artists_dir = 'Data/artists_extended.csv' # Comment out this line to use the initial artists dataset instead
artist_df = pd.read_csv(artists_dir)

In [None]:
artists = artist_df.artist.values
artist_uris = artist_df.artist_uri.values

# Loop through artist to build a list of albums

# Create dataframe to store artist data
artist_cols = ['artist', 'artist_uri']
album_cols = ['album','album_uri','release_date','release_date_precision','total_tracks']
album_df = pd.DataFrame(columns = artist_cols + album_cols)

for artist,uri in tqdm(zip(artists, artist_uris), total = len(artist_uris)):
    
    func_params = {
        'artist_id':uri,
        'country':'PH',
        'album_type':'album,single,compilation'  # comma-separated string of album types
    }
    
    albums_result, success = repeatAPICall(sp.artist_albums, func_params)
    
    if not success: 
        print("Skipping to next artist.")
        continue

    albums_list = albums_result['items']
    while albums_result['next']:
        # Go to the next page of albums
        next_result, success = repeatAPICall(sp.next,{'result':albums_result})
        if not success: 
            # Stop paging instead of looping on a failed (None) result
            print("Could not load the next page; keeping the albums gathered so far.")
            break
        albums_result = next_result
        albums_list.extend(albums_result['items'])

    # Skip the artist if there are any errors
    try:
        album_uris = [album['uri'] for album in albums_list]
        albums = [album['name'] for album in albums_list]
        release_dates = [album['release_date'] for album in albums_list]
        release_date_precisions = [album['release_date_precision'] for album in albums_list]
        totals = [album['total_tracks'] for album in albums_list]

    except Exception: 
        print("Error in artist {}".format(artist))
        continue
    
    temp_df = pd.DataFrame(zip(albums,album_uris, release_dates, release_date_precisions, totals),columns = album_cols)
    # Set the artist and artist_uri columns
    temp_df['artist'] = artist
    temp_df['artist_uri'] = uri
    
    album_df = pd.concat([album_df.reset_index(drop=True), temp_df.reset_index(drop=True)]).drop_duplicates()
album_df = album_df.reset_index(drop = True)

album_df 

In [None]:
album_df.sample(50)