# **Extraction of Playlist Data using Spotipy**
To build our dataset, we created our own Spotify playlist containing songs from different eras, genres, languages, etc. We did this by merging several of Spotify's playlists.

After the dataset playlist was made, we used <u>Spotipy</u>, a Python client library for Spotify's Web API, to extract detailed information about each track (e.g. audio features, track popularity, artist popularity, etc.). The [final dataset](https://open.spotify.com/playlist/1aA8TSi48YqaGdXNDqGrVV?si=5b086c85d48248b6) consists of 9942 tracks.

This Jupyter Notebook will take you through the process of track data extraction:
> 1. Extraction of Playlist Tracks
> 2. Obtaining Audio Features and Artist Information for each track
> 3. Writing to CSV file

---

## **Import Necessary Libraries**
- *spotipy* is vital for the extraction of any data from Spotify
- *SpotifyClientCredentials* is necessary to obtain authorisation from Spotify


In [3]:
import spotipy
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
import time # time.sleep() is used throughout data extraction to avoid rate-limit ("max retries") errors from the API

client_id = 'YOUR_CLIENT_ID' # replace with your own client_id (keep it out of shared notebooks)
client_secret = 'YOUR_CLIENT_SECRET' # replace with your own client_secret

# Obtain authorisation from Spotify
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager, retries=0) 
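
Credentials written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are stored in environment variables: SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are the variable names Spotipy itself looks for, while load_credentials() is a hypothetical helper written for this illustration and is not part of Spotipy.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper for illustration; raises KeyError naming any missing variable.
    """
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    # Treat unset and empty variables the same way
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return env[names[0]], env[names[1]]

# Usage in the notebook (assumes both variables were set before Jupyter was started):
# client_id, client_secret = load_credentials()
# client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
```

Passing a plain dictionary as env makes the helper easy to try out without touching the real environment.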

## **Initialise Lists and DataFrame**

In [28]:
dataset_URI = 'spotify:playlist:1aA8TSi48YqaGdXNDqGrVV' # unique URI of our dataset playlist
track_id = []
artist_id = []
track_names = []
album_names = []
artist_genres = []
artist_pop = []
track_pop = []
album_date = []
audio_features_artists = []

In [36]:
# column names for audio features, artist information, track URI, artist URI
col_names = ['Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration', 'Time Signature',
             'Artist Genres', 'Artist Name', 'Artist Popularity', 'Track URI', 'Artist URI']

# create an empty dataframe with the above column names
all_tracks = pd.DataFrame(columns=col_names)

## **Helper Functions**

### **Extract Playlist Tracks**

In [4]:
def get_playlist_tracks(pl_URI, lim, offs):
    """
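    Extracts information about each track on one page of the dataset playlist and
    appends it to the module-level lists (track_id, artist_id, track_names, ...).

    For tracks with multiple artists, only the main (i.e. first-listed) artist is recorded,
    so the artist popularity retrieved later refers to that artist.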
    """

    # iterate through playlist tracks
    for track in sp.playlist_tracks(pl_URI, limit=lim, offset=offs)["items"]:
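        # only take in non-local tracks (i.e. tracks hosted on Spotify rather than local files);
        # also skip items whose track entry is missing (e.g. tracks that are no longer available)
        if not track['is_local'] and track['track'] is not None: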
            ## URI
            track_uri = track["track"]["uri"] # get track's URI
            
            ## Track Information
            # Track name
            track_name = track["track"]["name"] # get track's name
            # Popularity of the track
            t_pop = track["track"]["popularity"] # get track's popularity index
            
            ## Artist Information
            artist_uri = track["track"]["artists"][0]["uri"] # get URI of the main (first-listed) artist
            
            ## Album Information
            # Album name
            album = track["track"]["album"]["name"] # get album name
            # Album release date; depending on its precision this is 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'
            date = track["track"]["album"]["release_date"] # get album release date (its precision is given by "release_date_precision")

            track_id.append(track_uri)
            artist_id.append(artist_uri)
            track_names.append(track_name)
            album_names.append(album)
            track_pop.append(t_pop)
            album_date.append(date)
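
The field extraction in get_playlist_tracks() can also be written as a small function that takes one playlist item and returns its fields, which makes it possible to try the logic on a hand-written item without calling the API. This is a sketch under assumptions: parse_playlist_item() is a hypothetical helper, and example_item is made-up data that only mimics the keys used above.

```python
def parse_playlist_item(item):
    """Return the fields this notebook collects for one playlist item, or None to skip it.

    Hypothetical helper for illustration; it is not part of Spotipy.
    """
    track = item.get("track")
    # Skip local files, items without track data, and tracks without any artist
    if item.get("is_local") or track is None or not track.get("artists"):
        return None
    return {
        "track_uri": track["uri"],
        "track_name": track["name"],
        "track_popularity": track["popularity"],
        "artist_uri": track["artists"][0]["uri"],  # main (first-listed) artist only
        "album_name": track["album"]["name"],
        "album_release_date": track["album"]["release_date"],
    }

# Made-up item shaped like one entry of sp.playlist_tracks(...)["items"]
example_item = {
    "is_local": False,
    "track": {
        "uri": "spotify:track:EXAMPLE",
        "name": "Example Song",
        "popularity": 57,
        "artists": [{"uri": "spotify:artist:MAIN"}, {"uri": "spotify:artist:FEATURED"}],
        "album": {"name": "Example Album", "release_date": "1999-05"},
    },
}

parsed = parse_playlist_item(example_item)
print(parsed["artist_uri"])                                     # spotify:artist:MAIN
print(parse_playlist_item({"is_local": True, "track": None}))  # None
```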
    

### **Get audio features and artist information of the tracks**

In [34]:
def get_audio_artist(tracks, artists, df):
    """
    Retrieves audio features, artist name, artist genres and artist popularity of the tracks in batches of 50 tracks.

    sp.artists() is used to extract artist information. Only artist name, artist genres, artist popularity
    are extracted from the dictionary of artist information.

    sp.audio_features() is used to extract audio features. Only the audio features are extracted
    with the other track information filtered out.

    Returns dataframe containing all the above columns.
    """

    for i in range(0, len(tracks), 50):
        # slicing past the end of a list is safe, so the last batch is simply shorter
        track_interval = tracks[i:i+50]
        artist_interval = artists[i:i+50]
        
        ## Retrieve artist information (i.e. popularity, genres, name)
        # Get artist objects
        artist_info = sp.artists(artist_interval) # returns a dict whose "artists" entry is a list of artist objects, one per given URI
        # Genres, name, popularity of each artist (in the key order of the API response)
        artist_info = [[v for k, v in d.items() if k in ['name', 'genres', 'popularity']] for d in artist_info["artists"]]
        for a in artist_info:
            a[0] = ", ".join(a[0]) # join the artist's list of genres into one comma-separated string

        ## Retrieve audio features
        audio_features = sp.audio_features(track_interval)
        audio_features = [[v for k, v in d.items() if k not in ['type', 'id', 'uri', 'track_href', 'analysis_url']] for d in audio_features]
        
        ## Combine retrieved information into one list (i.e. a row in the dataframe)
        audio_artist = [] # store all rows
        batch = i
        for j in range(len(audio_features)):
            row = audio_features[j]+ artist_info[j] + [tracks[batch]] + [artists[batch]]
            audio_artist.append(row)
            batch += 1

        print(f"{min(i+50, len(tracks))} tracks done", end="; ")
        time.sleep(2)

        ## Add audio_artist rows into the dataframe
        audio_artist = pd.DataFrame(data=audio_artist, columns=col_names)
        df = pd.concat([df, audio_artist], join="outer", ignore_index=True)

    return df
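
The list comprehensions in get_audio_artist() pick values in whatever key order the API response happens to use, so the rows only line up with col_names as long as that order does not change. A minimal sketch of a more explicit alternative follows. It assumes the response dictionaries use the lowercase key names listed in FEATURE_KEYS, and that an entry of the audio features list may be None when no features are available for a track. build_row() and FEATURE_KEYS are hypothetical names introduced for this example.

```python
# Audio feature keys in the same order as the first 13 entries of col_names
FEATURE_KEYS = ["danceability", "energy", "key", "loudness", "mode", "speechiness",
                "acousticness", "instrumentalness", "liveness", "valence", "tempo",
                "duration_ms", "time_signature"]

def build_row(features, artist, track_uri, artist_uri):
    """Build one dataframe row by looking up each value by name (hypothetical helper)."""
    if features is None:
        # No audio features for this track: keep the row, but leave the values empty
        feature_values = [None] * len(FEATURE_KEYS)
    else:
        feature_values = [features.get(key) for key in FEATURE_KEYS]
    artist_values = [", ".join(artist["genres"]), artist["name"], artist["popularity"]]
    return feature_values + artist_values + [track_uri, artist_uri]

# Made-up data; the keys are deliberately in a different order than FEATURE_KEYS
example_features = {key: 0.5 for key in reversed(FEATURE_KEYS)}
example_features["tempo"] = 120.0
example_features["type"] = "audio_features"  # extra key that is simply ignored
example_artist = {"popularity": 80, "name": "Example Artist", "genres": ["pop", "rock"]}

row = build_row(example_features, example_artist, "spotify:track:T", "spotify:artist:A")
print(len(row))   # 18, the number of entries in col_names
print(row[10])    # 120.0 (Tempo)
print(row[13])    # pop, rock (Artist Genres)
```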

## **Extract data using API**
First, all the tracks are extracted from the playlist using the get_playlist_tracks() function, one page of 100 tracks (the maximum per request) at a time.

In [30]:
for i in range(0,9901,100):
    get_playlist_tracks(dataset_URI, 100, i)
    print(f"{i+100} done", end="; ")

    if (i+100) % 1000 == 0:
        time.sleep(5)
        print()

100 done; 200 done; 300 done; 400 done; 500 done; 600 done; 700 done; 800 done; 900 done; 1000 done; 
1100 done; 1200 done; 1300 done; 1400 done; 1500 done; 1600 done; 1700 done; 1800 done; 1900 done; 2000 done; 
2100 done; 2200 done; 2300 done; 2400 done; 2500 done; 2600 done; 2700 done; 2800 done; 2900 done; 3000 done; 
3100 done; 3200 done; 3300 done; 3400 done; 3500 done; 3600 done; 3700 done; 3800 done; 3900 done; 4000 done; 
4100 done; 4200 done; 4300 done; 4400 done; 4500 done; 4600 done; 4700 done; 4800 done; 4900 done; 5000 done; 
5100 done; 5200 done; 5300 done; 5400 done; 5500 done; 5600 done; 5700 done; 5800 done; 5900 done; 6000 done; 
6100 done; 6200 done; 6300 done; 6400 done; 6500 done; 6600 done; 6700 done; 6800 done; 6900 done; 7000 done; 
7100 done; 7200 done; 7300 done; 7400 done; 7500 done; 7600 done; 7700 done; 7800 done; 7900 done; 8000 done; 
8100 done; 8200 done; 8300 done; 8400 done; 8500 done; 8600 done; 8700 done; 8800 done; 8900 done; 9000 done; 
9100 done;
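
The loop above hard-codes the upper bound of the offsets, so it has to be adjusted whenever the playlist size changes. A sketch of an alternative that keeps requesting pages until one comes back shorter than the page size is shown here. fetch_all_items() and fake_fetch() are hypothetical names, and a plain list stands in for the playlist so that the example runs without the API.

```python
def fetch_all_items(fetch_page, page_size=100):
    """Collect items page by page until a page is shorter than page_size.

    fetch_page(limit, offset) is any callable that returns a list of items.
    """
    items = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        items.extend(page)
        if len(page) < page_size:  # a short (or empty) page means the end was reached
            break
        offset += page_size
    return items

# Stand-in for the API: a fake playlist of 250 items
fake_playlist = list(range(250))

def fake_fetch(limit, offset):
    return fake_playlist[offset:offset + limit]

all_items = fetch_all_items(fake_fetch)
print(len(all_items))  # 250
```

With Spotipy, fetch_page could be a function that returns sp.playlist_tracks(dataset_URI, limit=limit, offset=offset)["items"], with a time.sleep() between pages as in the loop above.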

---

Next, the audio features and artist information of the tracks are retrieved from Spotify using the API. This is done by the get_audio_artist() function, which is called on slices of 1000 tracks.

In [39]:
for i in range(0, 9901, 1000):
    print(f"Retrieving information for tracks {i} to {i+1000}")
    all_tracks = get_audio_artist(track_id[i:i+1000], artist_id[i:i+1000], all_tracks)
    print()

Retrieving information for tracks 0 to 1000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks done; 550 tracks done; 600 tracks done; 650 tracks done; 700 tracks done; 750 tracks done; 800 tracks done; 850 tracks done; 900 tracks done; 950 tracks done; 1000 tracks done; 
Retrieving information for tracks 1000 to 2000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks done; 550 tracks done; 600 tracks done; 650 tracks done; 700 tracks done; 750 tracks done; 800 tracks done; 850 tracks done; 900 tracks done; 950 tracks done; 1000 tracks done; 
Retrieving information for tracks 2000 to 3000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks done; 550 tracks 
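
The fixed time.sleep() pauses used in this notebook are one way to stay under the API rate limit. Another common approach, sketched here under assumptions, is to retry a failed call with exponentially growing delays. call_with_backoff() is a hypothetical helper, not a Spotipy function, and it catches every Exception for simplicity; in the notebook it would be more appropriate to catch only spotipy.SpotifyException.

```python
import time

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, 4*base_delay, ... and retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice before succeeding.
# The delays are recorded in a list instead of actually sleeping.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return "ok"

recorded_delays = []
result = call_with_backoff(flaky, sleep=recorded_delays.append)
print(result, recorded_delays)  # ok [1.0, 2.0]
```

In the notebook, a call such as sp.audio_features(track_interval) could be wrapped as call_with_backoff(lambda: sp.audio_features(track_interval)).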

## **Write to CSV File**

In [42]:
all_tracks['Track Name'] = track_names
all_tracks['Album'] = album_names
all_tracks['Album Release Date'] = album_date
all_tracks['Track Popularity'] = track_pop

# Reorder columns
all_tracks = all_tracks[['Track Name', 'Artist Name', 'Album', 'Album Release Date', 'Artist Genres',
                         'Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Duration', 'Time Signature',
                         'Artist Popularity', 'Track Popularity', 'Track URI', 'Artist URI']]

In [44]:
all_tracks.to_csv('dataset.csv', index=True) # write to 'dataset.csv'; index=True keeps the row index as the first column
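
Assigning a list as a dataframe column only works when the list has exactly as many values as the dataframe has rows, so the column assignments above require both extraction steps to have covered the same tracks. A small check like the following, run before the assignments, reports which list is out of step. check_lengths() is a hypothetical helper written for this illustration.

```python
def check_lengths(n_rows, **columns):
    """Raise ValueError if any named list does not contain exactly n_rows values."""
    mismatched = {name: len(values) for name, values in columns.items()
                  if len(values) != n_rows}
    if mismatched:
        raise ValueError(f"expected {n_rows} values per column, got {mismatched}")

# Passes silently: both lists have three values
check_lengths(3, track_names=["a", "b", "c"], track_pop=[10, 20, 30])

# Usage in the notebook:
# check_lengths(len(all_tracks), track_names=track_names, album_names=album_names,
#               album_date=album_date, track_pop=track_pop)
```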