# **Editing Kaggle Dataset for Time Comparison**
In order to conduct a clearer analysis of Spotify tracks across time, we used a Kaggle dataset that was specifically designed to comprise tracks from a wide range of eras.

However, this dataset did not contain the feature <u>'artist popularity'</u>. It was thus necessary to use the Spotify API to extract the artist popularity values of these tracks and add them to the dataset.

This Jupyter Notebook walks through the process of adding 'artist popularity' to the Kaggle dataset.

In [90]:
import csv
import spotipy
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
import time # time.sleep() is used throughout data extraction to pace requests and avoid "Max Retries" (rate limit) errors from the API

import os # credentials are read from the environment so that they are not stored in the notebook

client_id = os.environ['SPOTIPY_CLIENT_ID'] # our personal client_id
client_secret = os.environ['SPOTIPY_CLIENT_SECRET'] # our personal client_secret

# Obtain authorisation from Spotify
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager, retries=0) 

## **Retrieve Artist URI and Track URI from CSV file**
> Artist URI is necessary to retrieve the 'artist popularity' metric\
> Track URI is necessary to _filter out local files/tracks_ that do not exist in the Spotify database

In [121]:
df = pd.read_csv('top_10000_1960-now.csv')

artist_id = []
with open('top_10000_1960-now.csv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    header = next(reader) # skip the header row

    # For tracks with multiple artists, the main (i.e. the first) artist was chosen as the metric of popularity

    for line in reader:
        if 'local' not in line[0]: # line[0] is the Track URI; filter out local tracks
            artist_id.append(line[2].split(',')[0]) # only use the main artist
        else:
            df.drop(df.loc[df['Track URI'] == line[0]].index, inplace=True) # drop local tracks from the DataFrame
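
The same filtering could also be expressed with pandas alone, which keeps the DataFrame and the list of artist IDs in step without reading the file twice. The sketch below is an alternative rather than the method used in this notebook. It runs on a tiny made-up DataFrame, and every column name other than 'Track URI' is an assumption. The artist column is therefore selected by position (third column), as in the loop above.

```python
import pandas as pd

# Tiny stand-in for the real CSV. Only 'Track URI' is a column name taken from the
# notebook; 'Track Name' and 'Artist URI(s)' are assumed for illustration.
demo = pd.DataFrame({
    "Track URI": ["spotify:track:AAA", "spotify:local:X:::Y:180", "spotify:track:BBB"],
    "Track Name": ["Song A", "Song Y", "Song B"],
    "Artist URI(s)": ["spotify:artist:111, spotify:artist:222", "", "spotify:artist:333"],
})

# Drop local tracks: same substring test as the loop above
is_local = demo["Track URI"].str.contains("local", regex=False)
demo = demo[~is_local]

# Keep only the main (first) artist of each remaining track
artist_ids = demo.iloc[:, 2].str.split(",").str[0].tolist()
print(artist_ids)
```

Because the rows are filtered before the IDs are extracted, the list and the DataFrame always have the same length.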

## **Helper Function to retrieve 'artist popularity'**

In [122]:
def get_artist_pop(artists, pop_list):
    """Append the popularity of every artist in `artists` to `pop_list`, preserving order."""

    # The artists are requested in batches of 50
    for i in range(0, len(artists), 50):
        artist_interval = artists[i:i+50] # slicing stops at the end of the list, so the last batch may be shorter

        ## Retrieve artist information (i.e. popularity, genres, name)
        # Get the artist objects for this batch; each object contains detailed info on one artist
        artist_info = sp.artists(artist_interval)

        # Retrieve artist popularity
        pop_list.extend(d['popularity'] for d in artist_info["artists"])

        print(f"{min(i+50, len(artists))} tracks done", end="; ")
        time.sleep(2) # pause between requests
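
The batching logic can be checked without touching the network. The sketch below is only an illustration: `FakeSpotify` and `chunked` are hypothetical names, and the fake client merely imitates the `{"artists": [...]}` shape that the helper reads from `sp.artists()`. Its popularity values are made up.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class FakeSpotify:
    """Hypothetical stand-in for the spotipy client (no network access)."""

    def __init__(self):
        self.calls = 0

    def artists(self, ids):
        self.calls += 1
        assert len(ids) <= 50  # the helper never sends more than 50 IDs per request
        # Made-up popularity: the length of the ID string
        return {"artists": [{"id": i, "popularity": len(i)} for i in ids]}


fake = FakeSpotify()
ids = [f"artist{n}" for n in range(120)]

popularity = []
for batch in chunked(ids):
    popularity.extend(a["popularity"] for a in fake.artists(batch)["artists"])

print(fake.calls, len(popularity))
```

With 120 IDs the loop makes three requests (50, 50 and 20 IDs) and returns one value per ID, in the original order.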

## **Extract data using API**

In [123]:
artist_pop = []

for i in range(0, len(artist_id), 1000): # step through all artist IDs, 1000 at a time
    print(f"Retrieving artist popularity for tracks {i} to {i+1000}")
    get_artist_pop(artist_id[i:i+1000], artist_pop)
    print()

Retrieving artist popularity for tracks 0 to 1000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks done; 550 tracks done; 600 tracks done; 650 tracks done; 700 tracks done; 750 tracks done; 800 tracks done; 850 tracks done; 900 tracks done; 950 tracks done; 1000 tracks done; 
Retrieving artist popularity for tracks 1000 to 2000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks done; 550 tracks done; 600 tracks done; 650 tracks done; 700 tracks done; 750 tracks done; 800 tracks done; 850 tracks done; 900 tracks done; 950 tracks done; 1000 tracks done; 
Retrieving artist popularity for tracks 2000 to 3000
50 tracks done; 100 tracks done; 150 tracks done; 200 tracks done; 250 tracks done; 300 tracks done; 350 tracks done; 400 tracks done; 450 tracks done; 500 tracks
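
Since many tracks share the same main artist, one possible variation (not what this notebook does) is to look up each distinct artist once and map the results back onto the tracks. The sketch below uses made-up IDs and popularity values, and the name `unique_in_order` is hypothetical. In practice the `lookup` dictionary would be filled from the API responses.

```python
def unique_in_order(ids):
    """Return the distinct IDs, keeping the order of first appearance."""
    return list(dict.fromkeys(ids))


# One entry per track; made-up artist IDs
track_artists = ["a1", "a2", "a1", "a3", "a2"]

unique_ids = unique_in_order(track_artists)

# Stand-in for the API results: one made-up popularity value per distinct artist
lookup = dict(zip(unique_ids, [80, 55, 10]))

# Map the per-artist values back to one value per track
track_pop = [lookup[a] for a in track_artists]
print(unique_ids, track_pop)
```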

## **Write to new CSV file**

In [125]:
df["Artist Popularity"] = artist_pop # add artist popularity column
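
Assigning a plain list to a column is positional, so it works even though dropping the local tracks left gaps in the index. It does require the list to have exactly one value per remaining row. A minimal sketch of a check that could precede the assignment, using a made-up DataFrame and made-up values:

```python
import pandas as pd

# Made-up data: three tracks, of which the middle one is dropped (as a local track would be)
demo = pd.DataFrame({"Track URI": ["t1", "t2", "t3"]}).drop(index=[1])
demo_pop = [71, 12]  # made-up popularity values, one per remaining row

# Fail early with a clear message if extraction stopped part-way
if len(demo_pop) != len(demo):
    raise ValueError(f"expected {len(demo)} popularity values, got {len(demo_pop)}")

demo["Artist Popularity"] = demo_pop  # positional: first value goes to the first row, and so on
print(demo)
```

pandas raises a `ValueError` of its own when the lengths differ. The explicit check only makes the cause easier to read.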

In [126]:
df.to_csv('top_10000_1960-now_updated.csv', index=True) # write to 'top_10000_1960-now_updated.csv'
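
Because `index=True` writes the DataFrame index as an unnamed first column, that column has to be treated as the index when the file is read back. The sketch below shows the round trip on made-up data, with an in-memory buffer standing in for the file on disk.

```python
import io

import pandas as pd

# Made-up data with a gap in the index, as left behind by dropping a row
demo = pd.DataFrame(
    {"Track URI": ["t1", "t3"], "Artist Popularity": [71, 12]},
    index=[0, 2],
)

buffer = io.StringIO()  # stands in for 'top_10000_1960-now_updated.csv'
demo.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an 'Unnamed: 0' column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```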