In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter
import requests
import bs4
import sqlite3
import time
from getpass import getpass
from urllib.parse import urljoin
from IPython.display import HTML

HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js">
</script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')

<div style="background-color: #6d03ba; padding: 10px 0;">
    <center><h1 style="color: white; font-family:Roboto; font-weight:bold">SPOTIFY DATA EXTRACTION VIA API AND WEB SCRAPING</h1></center>
</div> 

This notebook documents how we used the Spotify Web API, in roughly 35-40 calls, together with web scraping to extract detailed data on artists, songs, and playlists. The main obstacles were obtaining access tokens through a developer account, staying within Spotify's strict request quota, and working around tokens that expire after one hour, which shaped how we batched and paced our requests. The steps below walk through the extraction process for developers interested in digital music analysis.

<div style="background-color: #fc8709; padding: 10px 0;">
    <center><h2 style="color: white; font-family:Roboto; font-weight:bold">STEP 1: Create Access Token</h2></center>
</div>

Every Spotify API call requires a bearer access token.<br>
The token is obtained by sending the client ID and client secret to Spotify's **<i>"token"</i>** endpoint.

In [2]:
# INPUT YOUR CLIENT ID HERE
client_id = getpass()

 ········


In [3]:
# INPUT YOUR CLIENT SECRET HERE
client_secret = getpass()

 ········


In [4]:
def get_token(client_id, client_secret):
    """
    Perform a Spotify API call and return an access token.
    
    Parameters:
    -----------
    client_id: str
        Hexadecimal string that represents client id.
    client_secret: str
        Hexadecimal string that serves as unique client password.
        
    Returns:
    --------
    api_key: str
        String that contains the Access Token.
    """
    url = "https://accounts.spotify.com/api/token"
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    params = {
        'grant_type': 'client_credentials',
        'client_id': client_id,
        'client_secret': client_secret
    }
    response = requests.post(url, headers=headers, data=params)
    # Fail early with a clear HTTP error instead of a KeyError below
    response.raise_for_status()
    api_key = response.json()['access_token']
    return api_key

In [5]:
api_key = get_token(client_id, client_secret)

<div style="background-color: #fc8709; padding: 10px 0;">
    <center><h2 style="color: white; font-family:Roboto; font-weight:bold">STEP 2: Retrieve Tracks from All Genres</h2></center>
</div>

To create a diverse pool of songs for PCA, we first ask Spotify's API for its list of available genres.<br>
We then call the recommendations endpoint once per genre, which gives us a list of track ids to use in the next step.
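
As a minimal sketch of how the per-genre request URL can be assembled, the helper below encodes the query parameters with the standard library so that genre names containing spaces or symbols stay valid. The names `BASE_URL` and `recommendations_url` are hypothetical and not part of the notebook; the `seed_genres` and `limit` parameters are the ones used in the cell that follows.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook's recommendations call
BASE_URL = 'https://api.spotify.com/v1/recommendations'


def recommendations_url(genre, limit=10):
    """Build the recommendations URL for a single seed genre."""
    # urlencode escapes characters such as spaces and '&' in the genre name
    query = urlencode({'seed_genres': genre, 'limit': limit})
    return f'{BASE_URL}?{query}'


print(recommendations_url('hip-hop'))
```

The same effect can be had by passing a `params` dictionary to `requests.get`, which encodes the query string in the same way.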

In [6]:
# Extract list of available genres on Spotify
genres = requests.get('https://api.spotify.com/v1/recommendations/'
                      'available-genre-seeds',
                      headers={'Authorization': f'Bearer {api_key}'}
                     ).json()['genres']

# Extract track_ids recommendations for each genre
track_ids = set()
for genre in genres:
    time.sleep(2)
    response = requests.get(f'https://api.spotify.com/v1/recommendations?'
                            f'seed_genres={genre}&'
                            f'limit=10',
                            headers={'Authorization': f'Bearer {api_key}'}
                           ).json()
    for track in response['tracks']:
        track_ids.add(track['id'])
track_ids = list(track_ids)

<div style="background-color: #1ccf54; padding: 10px 0;">
    <center><h3 style="color: white; font-family:Roboto; font-weight:bold">Get the audio features per track</h3></center>
</div>

The **<i>audio-features</i>** API call returns the audio features of each track, given its id.<br>
Web scraping the track's public page then gives us the name of the song and the artist.

In [7]:
def audio_features(track_ids):
    """
    Return a dataframe that contains information on each track
    
    Parameters:
    -----------
    track_ids: array_like
        An iterable of strings that correspond to id's of Spotify tracks
        
    Returns:
    --------
    df_tracks: pandas.DataFrame
        Dataframe that contains audio_features, url, track_name, artist
        for each track
    """
    # Group track ids in batches of 100
    batches = [','.join(track_ids[i:i+100]) for i in
               range(0, len(track_ids), 100)]
    # Get the audio features for each track
    stats = []
    # Use a separate loop variable so the track_ids argument is not shadowed
    for batch in batches:
        time.sleep(2)
        url = f'https://api.spotify.com/v1/audio-features?ids={batch}'
        headers = {'Authorization': f'Bearer {api_key}'}
        content = requests.get(url, headers=headers).json()['audio_features']
        stats.extend(content)
    # Build a dataframe with one row per track and one column per feature
    df_tracks = pd.DataFrame(stats)
    # Retrieve artists and track_name of tracks
    artists = []
    track_name = []
    base_url = 'https://open.spotify.com/track/'
    links = [urljoin(base_url, id_) for id_ in df_tracks['id']]
    for link in links:
        time.sleep(2)
        content = requests.get(link).content
        soup = bs4.BeautifulSoup(content, 'html.parser')
        artists.append(soup.find('a').text)
        track_name.append(soup.find('h1').text)
    df_tracks['artist'] = artists
    df_tracks['track_name'] = track_name
    return df_tracks

In [8]:
df_tracks = audio_features(track_ids)
df_tracks.drop_duplicates(subset='id', inplace=True, ignore_index=True)
df_tracks.to_csv('track_pool.csv', index=False)

<div style="background-color: #fc8709; padding: 10px 0;">
    <center><h2 style="color: white; font-family:Roboto; font-weight:bold">STEP 3: Retrieve User Top Tracks for Implementation</h2></center>
</div>

Create a new access token to avoid errors due to expired tokens.<br>
The data extracted here will be used to assign musical archetypes to users.
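
Instead of re-entering credentials by hand, token renewal can also be automated. The sketch below is one possible approach, not the method used in this notebook: it assumes the token response carries an `expires_in` field (lifetime in seconds) next to `access_token`, and the names `TokenCache`, `fetch`, `margin`, and `clock` are hypothetical.

```python
import time


class TokenCache:
    """Cache an access token and renew it shortly before it expires."""

    def __init__(self, fetch, margin=60, clock=time.monotonic):
        # fetch: callable returning a dict with 'access_token' and
        # 'expires_in' (assumed shape of the token response)
        self._fetch = fetch
        self._margin = margin  # renew this many seconds before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, fetching a new one only when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload['access_token']
            self._expires_at = now + payload['expires_in'] - self._margin
        return self._token
```

In this notebook, `fetch` could be a small function that posts to the token URL and returns the full JSON response; each API call would then use `cache.get()` in place of a fixed `api_key`.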

In [9]:
# INPUT YOUR CLIENT ID HERE
client_id = getpass()

 ········


In [10]:
# INPUT YOUR CLIENT SECRET HERE
client_secret = getpass()

 ········


In [11]:
api_key = get_token(client_id, client_secret)

In [12]:
df_user = pd.read_csv('user_tracks.csv')
user_track_ids = df_user['track_id'].to_list()
user_tracks = audio_features(user_track_ids)
user_tracks['user'] = df_user['Name'].to_list()
user_tracks.to_csv('top_tracks.csv', index=False)

<div style="background-color: #1ccf54; padding: 10px 0;">
    <center><h1 style="color: white; font-family:Roboto; font-weight:bold">Recommendations</h1></center>
</div>

The following recommendations are aimed at developers who are new to the Spotify API.
 
Start by familiarizing yourself with the endpoints the Spotify API offers. They cover artists, tracks, playlists, and more.
 
Next, learn how token management works. Knowing how to request, refresh, and use access tokens keeps your API interactions efficient.
 
Respect the rate limits. Pacing your calls, for example by pausing between requests, prevents interruptions in your data collection.
 
Read the documentation for each endpoint's parameters and responses before calling it. This is key to efficiency and the most reliable way to avoid request timeouts and 429 (Too Many Requests) errors.
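
One way to handle a 429 response is to wait and retry. The sketch below is illustrative only and assumes the response carries a `Retry-After` header with the number of seconds to wait, falling back to a fixed pause when it is missing. The names `get_with_retry`, `send`, `max_attempts`, and `default_wait` are hypothetical.

```python
import time


def get_with_retry(send, max_attempts=5, default_wait=2.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code != 429:
            return response
        if attempt == max_attempts:
            break
        # Assumed header: seconds to wait before the next request
        retry_after = response.headers.get('Retry-After')
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            wait = default_wait  # header missing or not a number
        sleep(wait)
    raise RuntimeError(f'Still rate limited after {max_attempts} attempts')
```

A call from this notebook could then be wrapped as `get_with_retry(lambda: requests.get(url, headers=headers))`.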
 
Request only the data you need. This reduces unnecessary data transfer and keeps your application efficient.