## Download data from the Spotify API

This notebook downloads and stores all the data required for our analysis. As a reminder, the goal of this project is to analyse the audio features of my Spotify tracks to discover whether I favour a specific audio profile (for example high-energy music you can groove to) or whether, as I like to believe, my taste spans a variety of audio profiles.

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import time 
import numpy as np
import os, pickle

### User credentials and authentication

In [None]:
# reading from csv that contains client id and client secret
spotify_client_info = pd.read_csv('/Users/admin/Documents/spotify_client_info.csv')

In [None]:
client_id = spotify_client_info.iloc[0,0]
client_secret = spotify_client_info.iloc[0,1]

In [None]:
# Authenticate with the Authorization Code flow; the "user-library-read" scope
# is needed to read the current user's saved tracks
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=client_id,
    client_secret=client_secret,
    redirect_uri='http://localhost:8888/callback',
    scope="user-library-read"
))

### Track IDs collection

First, I need to collect all my track IDs and gather them in a list.

In [None]:
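def getTrackIDs(user, playlist_id):
    """Return the track IDs found on the first page of a playlist."""
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    # Only the first page of playlist tracks is read here; longer playlists need paging
    for item in playlist['tracks']['items']:
        track = item['track']
        # Removed or unavailable tracks can come back as None
        if track is not None:
            ids.append(track['id'])
    return ids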

In [None]:
results = sp.current_user_saved_tracks(50)
results['items'][0]['track']['id']

In [None]:
ids = []

# Get the current user's saved tracks, 50 per page (the maximum this endpoint allows)
results = sp.current_user_saved_tracks(limit=50)
while results:
    for item in results['items']:
        ids.append(item['track']['id'])

    # Follow the 'next' link until the last page has been read
    results = sp.next(results) if results['next'] else None

ids
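
The paging pattern used above can be tried without network access. The sketch below is only an illustration under stated assumptions: `FakeClient` is a hypothetical stand-in for the `sp` object that mimics just the shape of the paged responses (`items` and `next`), and `collect_saved_track_ids` is an illustrative helper name that is not part of spotipy.

```python
class FakeClient:
    """Hypothetical stand-in for `sp`: serves pre-built pages instead of calling Spotify."""

    def __init__(self, track_ids):
        self._ids = list(track_ids)
        self._limit = 20

    def _page(self, offset):
        chunk = self._ids[offset:offset + self._limit]
        next_offset = offset + self._limit
        return {
            'items': [{'track': {'id': track_id}} for track_id in chunk],
            # The real API puts a URL here; an offset is enough for this sketch.
            # None signals that the last page has been served.
            'next': next_offset if next_offset < len(self._ids) else None,
        }

    def current_user_saved_tracks(self, limit=20):
        self._limit = limit
        return self._page(0)

    def next(self, result):
        if result['next'] is None:
            return None
        return self._page(result['next'])


def collect_saved_track_ids(client, limit=50):
    """Gather the IDs from every page by following the 'next' field."""
    ids = []
    results = client.current_user_saved_tracks(limit=limit)
    while results:
        ids.extend(item['track']['id'] for item in results['items'])
        results = client.next(results) if results['next'] else None
    return ids


fake = FakeClient([f"id{n}" for n in range(120)])
collected = collect_saved_track_ids(fake)
print(len(collected))  # 120 IDs gathered over three pages of 50, 50 and 20
```

Keeping the loop in a function that takes the client as an argument makes it easy to check the paging logic against a stub before running it on a real library.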
        


In [None]:
#Saving the IDs list into a pkl file for easier and quicker access

with open('spotify_tracks_ids.pkl', 'wb') as e:
    pickle.dump(ids, e)

### Audio features collection

In [None]:
#Loading the ids from the file
with open('spotify_tracks_ids.pkl', 'rb') as f:
    ids = pickle.load(f)

In [None]:
len(ids)

There are 5984 songs to process. I chose to request the audio features in batches of 100 songs, the maximum the audio features endpoint accepts per call, so as not to exceed Spotify's rate limit. For each song, I collect:
- metadata (name, album, artist, release date, length and popularity)
- audio features (acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, time_signature, key, mode, uri), as documented at https://developer.spotify.com/documentation/web-api/reference/get-audio-features

All this information is grouped in a dataframe that will be exported to a pkl file for further analysis later on.
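
As a quick check of the batching arithmetic, the sketch below splits a list into chunks of 100 and counts them. `chunked` is an illustrative helper (not part of spotipy), and the placeholder IDs merely stand in for real track IDs.

```python
import math


def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


placeholder_ids = [f"track_{n}" for n in range(5984)]
batches = list(chunked(placeholder_ids, 100))

# The number of batches is the track count divided by the batch size, rounded up
assert len(batches) == math.ceil(len(placeholder_ids) / 100)

print(len(batches))      # 60 batches
print(len(batches[-1]))  # the last batch holds the remaining 84 IDs
```

With one audio features request per batch, the 5984 songs need 60 such requests.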

In [None]:
import pandas as pd
import time
from spotipy.exceptions import SpotifyException

def exponential_backoff_retry(func, *args, max_retries=3, base_delay=1, **kwargs):
    """Call func, retrying with exponentially growing delays on HTTP 429 responses."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except SpotifyException as e:
            if e.http_status != 429:
                raise  # re-raise anything that is not a rate-limit error
            delay = base_delay * 2 ** attempt
            print(f"Rate limited. Retrying in {delay} seconds.")
            time.sleep(delay)
    print("Max retries exceeded. Giving up on this request.")
    return None

def get_user_saved_track_features(sp, ids, start_track_id=0, batch_size=100):
    feature_keys = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                    'liveness', 'loudness', 'speechiness', 'tempo', 'valence',
                    'time_signature', 'key', 'mode', 'uri']
    all_tracks = []

    # Walk through the IDs in batches; sp.audio_features accepts up to 100 IDs per call
    for start in range(start_track_id, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        batch_features = exponential_backoff_retry(sp.audio_features, batch)
        if not batch_features:
            print("Skipping batch due to error")
            continue

        # The features come back in the same order as the IDs (None when unavailable),
        # so pairing them keeps the metadata and features of a track on the same row
        for track_id, features in zip(batch, batch_features):
            if features is None or all(value is None for value in features.values()):
                print(f"Skipping track ID {track_id}: no audio features available")
                continue

            meta = exponential_backoff_retry(sp.track, track_id)
            if meta is None:
                print(f"Skipping track ID {track_id}: metadata could not be fetched")
                continue

            row = [meta['name'],
                   meta['album']['name'],
                   meta['album']['artists'][0]['name'],
                   meta['album']['release_date'],
                   meta['duration_ms'],
                   meta['popularity']]
            row += [features[key] for key in feature_keys]
            all_tracks.append(row)
            print(f"Processed track ID {track_id}, {features['uri']}")

        time.sleep(1)  # Pause for 1 second per batch to avoid rate limiting

    return all_tracks


In [None]:
all_track_features = get_user_saved_track_features(sp, ids, start_track_id=0, batch_size=100)
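
To get a feel for how long the retry helper may wait on a rate-limited request, the sketch below lists the delays it would sleep through with its default arguments (`max_retries=3`, `base_delay=1`). `backoff_delays` is a hypothetical helper written only for this illustration; it reproduces the `base_delay * 2 ** attempt` formula without calling the API.

```python
def backoff_delays(max_retries=3, base_delay=1):
    """Return the delay (in seconds) slept after each rate-limited attempt."""
    return [base_delay * 2 ** attempt for attempt in range(max_retries)]


delays = backoff_delays()
print(delays)       # [1, 2, 4]
print(sum(delays))  # 7 seconds of waiting in the worst case before giving up
```

Raising `max_retries` by one doubles the longest wait, so the total grows quickly with the retry count.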

In [None]:
with open('spotify_alltracks.pkl', 'wb') as f:
    pickle.dump(all_track_features, f)

In [None]:
with open('spotify_alltracks.pkl', 'rb') as j:
    all_track_features = pickle.load(j)

In [None]:
# Drop any empty entries before building the dataframe
all_track_features = [row for row in all_track_features if row is not None]

In [None]:
df = pd.DataFrame(all_track_features, columns=['name', 'album', 'artist', 'release_date',
                                               'length', 'popularity', 'acousticness', 'danceability',
                                               'energy', 'instrumentalness', 'liveness', 'loudness',
                                               'speechiness', 'tempo', 'valence', 'time_signature',
                                               'key', 'mode', 'uri'])

In [None]:
# Save the dataframe to a pkl file for the analysis notebooks
with open('spotify_df.pkl', 'wb') as f:
    pickle.dump(df, f)