# Spotify API Feature Extraction

* Spotify Million Playlist Dataset Challenge [Homepage](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge)
* [Spotify Web API docs](https://developer.spotify.com/documentation/web-api/reference/#/)

**Community Examples**
* [Extracting song lists](https://github.com/tojhe/recsys-spotify/blob/master/processing/songlist_extraction.py)
* [construct audio features with Spotify API](https://github.com/tojhe/recsys-spotify/blob/master/processing/audio_features_construction.py)
* [Using Spotify API](https://towardsdatascience.com/extracting-song-data-from-the-spotify-api-using-python-b1e79388d50)

#### After reading through these, create a new Spotify App and get the client ID and secret
![](img/spotify-dev-console.png)

This example reads the credentials from a local JSON file. If you are concerned about exposing API keys, see [GCP Secret Manager](https://cloud.google.com/secret-manager).

Below is an example of reading the credentials if you add the JSON file to Secret Manager (keys: `secret`, `id`):

```python
import json

from google.cloud import secretmanager

# Note: copy/paste this resource name from Secret Manager in the console
SECRET_VERSION = 'projects/934903580331/secrets/spotify-creds1/versions/1'

sm_client = secretmanager.SecretManagerServiceClient()

# Fetch the secret version and decode its payload
response = sm_client.access_secret_version(request={"name": SECRET_VERSION})

# The payload holds the JSON file contents, with keys `id` and `secret`
payload = json.loads(response.payload.data.decode("UTF-8"))
```
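
Instead of pasting the full resource name, it can be assembled from its parts. The sketch below is a plain string helper, not part of the notebook; `secret_version_name` and the example project and secret IDs are hypothetical names used for illustration.

```python
def secret_version_name(project: str, secret_id: str, version: str = "latest") -> str:
    """Build the resource name that access_secret_version expects.

    `project` may be a project ID or project number; "latest" is the alias
    for the most recent version of the secret.
    """
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"


# Hypothetical values, shown only to illustrate the format
example_name = secret_version_name("my-project", "spotify-creds1", "1")
print(example_name)
```

The resulting string can be passed as the `name` field of the request shown above.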

### Package Installation

In [1]:
# ! pip3 install -U spotipy google-cloud-storage google-cloud-aiplatform gcsfs --user -q
# ! pip3 install --user kfp google-cloud-pipeline-components --upgrade -q
# !pip3 install --user -q google-cloud-secret-manager

In [2]:
! python3 -c "import kfp; print('KFP SDK version: {}'.format(kfp.__version__))"
! python3 -c "import google_cloud_pipeline_components; print('google_cloud_pipeline_components version: {}'.format(google_cloud_pipeline_components.__version__))"
! python3 -c "import google.cloud.aiplatform; print('aiplatform SDK version: {}'.format(google.cloud.aiplatform.__version__))"

KFP SDK version: 1.8.19
google_cloud_pipeline_components version: 1.0.39
aiplatform SDK version: 1.22.0


### Constants - setup to your config

In [3]:
PROJECT_ID = 'hybrid-vertex' #update
LOCATION = 'us-central1' 

BUCKET_NAME = 'matching-engine-content'
# BUCKET_NAME = 'spotify-million-playlists'

VERSION = 6
PIPELINE_VERSION = 'v6' # pipeline code
PIPELINE_TAG = f'{PIPELINE_VERSION}-spotify-feature-enrich'
print("PIPELINE_TAG:", PIPELINE_TAG)

PIPELINE_TAG: v6-spotify-feature-enrich


Create Bucket if needed

In [4]:
# !gsutil mb -l $LOCATION gs://$BUCKET_NAME

In [5]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import re
from tqdm import tqdm

import os

import pandas as pd
import numpy as np
import json

import time

import gcsfs

# GCP
from google.cloud import aiplatform
from google.cloud import storage

# Pipelines
from typing import Any, Callable, Dict, NamedTuple, Optional, List
from google_cloud_pipeline_components import aiplatform as gcc_aip
from google_cloud_pipeline_components.types import artifact_types

# Kubeflow SDK
# TODO: fix these
from kfp.v2 import dsl
import kfp
import kfp.v2.dsl
from kfp.v2.google import client as pipelines_client
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, component)



### Clients & credentials

Set up the Vertex AI client for pipelines.

Your Spotify credentials should be stored in a JSON file (keys: `id`, `secret`).
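
As a rough sketch of what that file might look like and how it could be validated before use: the helper name `load_spotify_creds` and the placeholder values are assumptions made for this example, and the demo writes a throwaway file to a temporary directory rather than touching a real `spotify-creds.json`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("id", "secret")


def load_spotify_creds(path):
    """Load the credentials file and fail early if a key is missing or empty."""
    with open(path) as f:
        creds = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise KeyError(f"credentials file is missing: {missing}")
    return creds


# Demo with placeholder values written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "spotify-creds.json")
    with open(demo_path, "w") as f:
        json.dump({"id": "your-client-id", "secret": "your-client-secret"}, f)
    demo_creds = load_spotify_creds(demo_path)

print(sorted(demo_creds))
```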

In [6]:
# # Setup clients
aiplatform.init(
    project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_NAME
)

In [7]:
# Get spotify credentials
# This file has id and secret stored as attributes

with open('spotify-creds.json') as creds:
    spotify_creds = json.load(creds)

# Create Pipeline Components

### Audio Features

[Link to audio features API and related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features)

In [8]:
@kfp.v2.dsl.component(
    base_image="python:3.9",
    packages_to_install=['fsspec', 'google-cloud-bigquery',
                         'google-cloud-storage',
                         'gcsfs',
                         'spotipy','requests','db-dtypes',
                         'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.17.4',
                        ])
def call_spotify_api_audio(
    project: str,
    location: str,
    client_id: str,
    batch_size: int,
    batches_to_store: int,
    target_table: str,
    client_secret: str,
    unique_table: str,
    sleep_param: float,
) -> NamedTuple('Outputs', [('done_message', str),]):
    print(f'pip install complete')
    import os
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    import pandas as pd
    import json
    import time
    from google.cloud import storage
    import gcsfs
    import numpy as np
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    from google.cloud import bigquery
    import pandas_gbq

    # print(f'package import complete')

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    
    bq_client = bigquery.Client(
      project=project, location=location
    )

    logging.info(f'spotipy auth complete')
    def spot_audio_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10 )
        ############################################################################
        # Create Track Audio Features DF
        ############################################################################

        #Audio features
        uri = [u.replace('"', '') for u in uri] #fix the quotes 
        a_feats = sp.audio_features(uri)
        features = pd.json_normalize(a_feats, )#.to_dict('list')
        features['track_uri'] = uri
        return features

    bq_client = bigquery.Client(
      project=project, location=location
    )

    query = f"select distinct track_uri from `{unique_table}`" 

    count = 1
    uri_batch = []

    #refactor
    schema = [{'name':'danceability', 'type': 'FLOAT'},
            {'name':'energy', 'type': 'FLOAT'},
            {'name':'key', 'type': 'FLOAT'},
            {'name':'loudness', 'type': 'FLOAT'},
            {'name':'mode', 'type': 'INTEGER'},
            {'name':'speechiness', 'type': 'FLOAT'},
            {'name':'acousticness', 'type': 'FLOAT'},
            {'name':'instrumentalness', 'type': 'FLOAT'},
            {'name':'liveness', 'type': 'FLOAT'},
            {'name':'valence', 'type': 'FLOAT'},
            {'name':'followers', 'type': 'FLOAT'},
            {'name':'tempo', 'type': 'FLOAT'},
            {'name':'type', 'type': 'STRING'},
            {'name':'id', 'type': 'STRING'},
            {'name':'uri', 'type': 'STRING'},
            {'name':'track_href', 'type': 'STRING'},
            {'name':'analysis_url', 'type': 'STRING'},
            {'name':'duration_ms_y', 'type': 'INTEGER'},
            {'name':'time_signature', 'type': 'INTEGER'},
            {'name':'track_uri', 'type': 'STRING'},
    ]

    tracks = bq_client.query(query).result().to_dataframe()
    track_list = tracks.track_uri.to_list()
    logging.info('finished downloading tracks')
    uri_list_length = len(track_list)

    def write_batches(frames):
        # Append the accumulated feature rows to the target BigQuery table
        try:
            pd.concat(frames).to_gbq(
                destination_table=target_table,
                project_id=project,
                location=location,
                table_schema=schema,
                progress_bar=False,
                reauth=False,
                if_exists='append',
            )
        except pandas_gbq.gbq.InvalidSchema:
            logging.info('invalid schema, skipping')

    pending = []  # per-batch DataFrames fetched but not yet written
    for start in range(0, uri_list_length, batch_size):
        # One API call per batch of `batch_size` track URIs
        uri_batch = track_list[start:start + batch_size]
        try:
            try:
                pending.append(spot_audio_features(uri_batch, client_id, client_secret))
            except ReadTimeout:
                logging.info('Spotify timed out... trying again...')
                pending.append(spot_audio_features(uri_batch, client_id, client_secret))
        except (RequestException, spotipy.exceptions.SpotifyException) as err:
            # Skip the failed batch instead of re-appending stale data
            logging.info(f'Spotify API error, skipping batch: {err}')
        time.sleep(sleep_param)

        # Accumulate batches on the machine before writing to BQ
        if pending and len(pending) >= batches_to_store:
            write_batches(pending)
            pending = []
            logging.info(f'{min(start + batch_size, uri_list_length)} of {uri_list_length} complete!')

    # Write whatever is left after the final (possibly partial) batch
    if pending:
        write_batches(pending)

    logging.info('audio features appended')

    return (
      f'DONE',
    )
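
The batching pattern used by this component (split the URI list into fixed-size batches, accumulate several batches locally, then write them out in one go) can be sketched without Spotify or BigQuery. In the sketch below, `fetch` and `write` are hypothetical stand-ins for the API call and the BigQuery append, and `enrich_in_batches` is a name invented for this example.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def enrich_in_batches(
    uris: Sequence[str],
    batch_size: int,
    batches_to_store: int,
    fetch: Callable[[List[str]], List[dict]],
    write: Callable[[List[dict]], None],
) -> int:
    """Fetch rows batch by batch and write every `batches_to_store` batches."""
    pending: List[dict] = []
    batches_pending = 0
    written = 0
    for batch in chunked(uris, batch_size):
        pending.extend(fetch(batch))
        batches_pending += 1
        if batches_pending >= batches_to_store:
            write(pending)
            written += len(pending)
            pending, batches_pending = [], 0
    if pending:  # final flush so the last partial group is not lost
        write(pending)
        written += len(pending)
    return written


# Demo with fake fetch/write functions
writes: List[List[dict]] = []
demo_uris = [f"spotify:track:{i}" for i in range(7)]
total = enrich_in_batches(
    demo_uris,
    batch_size=3,
    batches_to_store=2,
    fetch=lambda batch: [{"track_uri": u} for u in batch],
    write=lambda rows: writes.append(list(rows)),
)
print(total, [len(w) for w in writes])
```

The final flush after the loop is the part that is easy to forget: without it, any rows collected since the last write are silently dropped.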

In [9]:
# ### DEBUG

# def spot_track_features(uri, client_id, client_secret):

#     # Authenticate
#     client_credentials_manager = SpotifyClientCredentials(
#         client_id=client_id, 
#         client_secret=client_secret
#     )
#     sp = spotipy.Spotify(
#         client_credentials_manager = client_credentials_manager, 
#         requests_timeout=3, 
#         retries=3 )

#     ############################################################################
#     # Create Track Audio Features DF
#     ############################################################################ 

#     artists = sp.artists(uri)

#     features = pd.json_normalize(artists).to_dict('list')
#     artist_pop = []
#     artist_genres = []
#     followers = []
#     id_list = uri
#     for artist in artists['artists']:
#         if artist is not None:
#             artist_pop.append(artist['popularity'])
#             artist_genres.append(artist['genres'])
#             followers.append(artist['followers']['total'])
#         else:
#             artist_pop.append(-1)
#             artist_genres.append('unknown')


#     features["artist_pop"] = artist_pop
#     features["genres"] = artist_genres
#     features['followers'] = followers
#     features['artist_uri'] = id_list
#     audio_df = pd.DataFrame(features)
#     audio_df['genres'] = audio_df['genres'].astype(str)
#     return audio_df
    
# spot_track_features(['spotify:artist:000h2XLY65iWC9u5zgcL1M', 'spotify:artist:000xagx3GkcunHTFdB4ly0'], spotify_creds['id'], 
#                     spotify_creds['secret'])

### Artists 

[Link to artist API and related features we will pull](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-an-artist)
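
To make the target columns concrete, here is a small sketch of flattening a "several artists" response into rows with `artist_pop`, `genres`, `followers`, and `artist_uri`. The response below is a hand-written fake that mimics only the fields used here, `artist_rows` is a hypothetical helper, and the `-1` / `'unknown'` placeholders for missing artists are an assumption borrowed from the debug cell above.

```python
from typing import Any, Dict, List


def artist_rows(uris: List[str], response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a several-artists style response into one flat row per URI."""
    rows = []
    for uri, artist in zip(uris, response["artists"]):
        if artist is None:
            # Placeholder values for artists the API could not resolve
            rows.append({"artist_pop": -1, "genres": "unknown",
                         "followers": -1, "artist_uri": uri})
        else:
            rows.append({
                "artist_pop": artist["popularity"],
                "genres": str(artist["genres"]),  # stored as a string column
                "followers": artist["followers"]["total"],
                "artist_uri": uri,
            })
    return rows


# Fake response for illustration; real responses carry more fields
fake_response = {
    "artists": [
        {"popularity": 61, "genres": ["indie pop"],
         "followers": {"href": None, "total": 1200}},
        None,
    ]
}
rows = artist_rows(["spotify:artist:aaa", "spotify:artist:bbb"], fake_response)
for row in rows:
    print(row)
```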

In [10]:
### Artist tracks api call

@kfp.v2.dsl.component(
    base_image="python:3.9",
    packages_to_install=['fsspec', 'google-cloud-bigquery',
                         'google-cloud-storage',
                         'gcsfs',
                         'spotipy','requests','db-dtypes',
                         'numpy','pandas','pyarrow','absl-py', 'pandas-gbq==0.17.4',
                        'google-cloud-secret-manager'])
def call_spotify_api_artist(
    project: str,
    location: str,
    unique_table: str,
    batch_size: int,
    batches_to_store: int,
    client_id: str,
    client_secret: str,
    sleep_param: float,
    target_table: str,
) -> NamedTuple('Outputs', [('done_message', str),]):
    print(f'pip install complete')
    import os
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials
    import re
    import warnings
    warnings.simplefilter(action='ignore', category=FutureWarning)
    import pandas as pd
    import json
    import time
    from google.cloud import storage
    import gcsfs
    import numpy as np
    from requests.exceptions import ReadTimeout, HTTPError, ConnectionError, RequestException
    from absl import logging
    from google.cloud import bigquery
    import pandas_gbq

    logging.set_verbosity(logging.INFO)
    logging.info(f'package import complete')

    storage_client = storage.Client(
        project=project
    )
    
    logging.info(f'spotipy auth complete')
    
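    def spot_track_features(uri, client_id, client_secret):

        # Authenticate
        client_credentials_manager = SpotifyClientCredentials(
            client_id=client_id, 
            client_secret=client_secret
        )
        sp = spotipy.Spotify(
            client_credentials_manager = client_credentials_manager, 
            requests_timeout=10, 
            retries=10 )
        ############################################################################
        # Create Artist Features DF
        ############################################################################
        # artists api call
        uri = [u.replace('"', '') for u in uri] # fix the quotes
        artists = sp.artists(uri)['artists']

        # One row per requested URI, using the column names in `schema` below.
        # Artists the API cannot resolve come back as None and get placeholders.
        rows = []
        for artist_uri, artist in zip(uri, artists):
            if artist is None:
                rows.append({'artist_pop': -1, 'genres': 'unknown',
                             'followers': -1, 'artist_uri': artist_uri})
            else:
                rows.append({'artist_pop': artist['popularity'],
                             'genres': str(artist['genres']),
                             'followers': artist['followers']['total'],
                             'artist_uri': artist_uri})
        return pd.DataFrame(rows)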
        

    bq_client = bigquery.Client(
      project=project, location=location
    )


    query = f"select distinct artist_uri from `{unique_table}`"
    

    schema = [{'name': 'artist_pop', 'type': 'INTEGER'},
            {'name':'genres', 'type': 'STRING'},
            {'name':'followers', 'type': 'INTEGER'},
            {'name':'artist_uri', 'type': 'STRING'}
    ]
    count = 1
    uri_batch = []

    ats = bq_client.query(query).result().to_dataframe()
    artist_set = ats.artist_uri.to_list()
    uri_list_length = len(artist_set)
    logging.info('finished downloading artists')

    def write_batches(frames):
        # Append the accumulated feature rows to the target BigQuery table
        try:
            pd.concat(frames).to_gbq(
                destination_table=target_table,
                project_id=project,
                location=location,
                table_schema=schema,
                progress_bar=False,
                reauth=False,
                if_exists='append',
            )
        except pandas_gbq.gbq.InvalidSchema:
            logging.info('invalid schema, skipping')

    pending = []  # per-batch DataFrames fetched but not yet written
    for start in range(0, uri_list_length, batch_size):
        # One API call per batch of `batch_size` artist URIs
        uri_batch = artist_set[start:start + batch_size]
        try:
            try:
                pending.append(spot_track_features(uri_batch, client_id, client_secret))
            except ReadTimeout:
                logging.info('Spotify timed out... trying again...')
                pending.append(spot_track_features(uri_batch, client_id, client_secret))
        except (RequestException, spotipy.exceptions.SpotifyException) as err:
            # Skip the failed batch instead of re-appending stale data
            logging.info(f'Spotify API error, skipping batch: {err}')
        time.sleep(sleep_param)

        # Accumulate batches on the machine before writing to BQ
        if pending and len(pending) >= batches_to_store:
            write_batches(pending)
            pending = []
            logging.info(f'{min(start + batch_size, uri_list_length)} of {uri_list_length} complete!')

    # Write whatever is left after the final (possibly partial) batch
    if pending:
        write_batches(pending)
    return (
          f'DONE',
      )
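
Both components retry a timed-out call once. If a single retry turns out not to be enough, one option is a small retry wrapper with exponential backoff. The sketch below is generic Python and not spotipy's own retry mechanism; `call_with_retries` and its parameters are hypothetical names, and the built-in `TimeoutError` stands in for whatever exception the real call raises.

```python
import time
from typing import Any, Callable, Tuple, Type


def call_with_retries(
    func: Callable[..., Any],
    *args: Any,
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call func(*args), retrying on `retry_on` with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake call that fails twice before succeeding
calls = {"n": 0}


def flaky(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return [u.upper() for u in batch]


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, ["a", "b"], retries=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the wrapper easy to test.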

## Build Pipeline

In [11]:
from typing import Dict

@kfp.v2.dsl.pipeline(
  name=f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-')
)
def pipeline(
    project: str,
    location: str,
    unique_table: str,
    target_table_audio: str,
    target_table_artist: str,
    batch_size: int,
    batches_to_store: int,
    sleep_param: float,
    spotify_id: str = spotify_creds['id'],
    spotify_secret: str = spotify_creds['secret'],
    ):


    call_spotify_api_artist_op = call_spotify_api_artist(
        project=project,
        location=location,
        client_id=spotify_id,
        client_secret=spotify_secret,
        batch_size=batch_size,
        sleep_param=sleep_param,
        unique_table=unique_table,
        target_table=target_table_artist,
        batches_to_store=batches_to_store,
    ).set_display_name("Get Artist Features From Spotify API")

    call_spotify_api_audio_op = call_spotify_api_audio(
        project=project,
        location=location,
        client_id=spotify_id,
        client_secret=spotify_secret,
        batch_size=batch_size,
        sleep_param=sleep_param,
        unique_table=unique_table,
        target_table=target_table_audio,
        batches_to_store=batches_to_store,
    ).set_display_name("Get Track Audio Features From Spotify API")

### Compile the pipeline to json
This can also be stored on GCS for broader orchestration purposes

In [12]:
kfp.v2.compiler.Compiler().compile(
  pipeline_func=pipeline, 
  package_path='custom_container_pipeline_spec.json',
)
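
Because `spotify_id` and `spotify_secret` are pipeline parameters whose defaults come from `spotify_creds`, those values may end up inside the compiled JSON. Before sharing or uploading the file, it can be worth checking for them. The sketch below searches any JSON file for a given string; `find_values` and `spec_leaks` are hypothetical helpers, and the toy spec is a made-up structure used only for the demo, not the real compiled format.

```python
import json
import os
import tempfile


def find_values(node, needle, path=""):
    """Return the JSON paths of all string values containing `needle`."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_values(value, needle, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_values(value, needle, f"{path}/{index}")
    elif isinstance(node, str) and needle in node:
        hits.append(path)
    return hits


def spec_leaks(spec_path, needle):
    """Load a JSON file and report where `needle` appears in it."""
    with open(spec_path) as f:
        return find_values(json.load(f), needle)


# Demo on a toy file (made-up structure, placeholder secret)
toy_spec = {"runtimeConfig": {"parameters": {"spotify_secret": {"stringValue": "s3cr3t"}}}}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_spec.json")
    with open(toy_path, "w") as f:
        json.dump(toy_spec, f)
    leaks = spec_leaks(toy_path, "s3cr3t")
    clean = spec_leaks(toy_path, "not-present")

print(leaks, clean)
```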



### Set the pipeline parameters

Use a dictionary whose keys and value types match the parameters defined by your pipeline

In [13]:
# jtotten-project #
GCS_BUCKET = 'matching-engine-content'
BQ_DATASET = 'mdp_eda_test'



PIPELINE_PARAMETERS = dict(
    project = PROJECT_ID,
    location = 'us-central1',
    unique_table = f'{PROJECT_ID}.{BQ_DATASET}.tracks_unique', 
    target_table_audio = f'{PROJECT_ID}.{BQ_DATASET}.audio_features',
    target_table_artist = f'{PROJECT_ID}.{BQ_DATASET}.artist_features',
    batch_size = 50,
    batches_to_store = 100000,
    sleep_param = 0.05,
)

PIPELINE_PARAMETERS

{'project': 'hybrid-vertex',
 'location': 'us-central1',
 'unique_table': 'hybrid-vertex.mdp_eda_test.tracks_unique',
 'target_table_audio': 'hybrid-vertex.mdp_eda_test.audio_features',
 'target_table_artist': 'hybrid-vertex.mdp_eda_test.artist_features',
 'batch_size': 50,
 'batches_to_store': 100000,
 'sleep_param': 0.05}
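
A quick local check can catch a mistyped parameter before the job is submitted. This is a minimal sketch, assuming the expected types mirror the pipeline signature above; `EXPECTED_TYPES` and `check_parameters` are hypothetical names introduced for this example.

```python
# Expected types, copied by hand from the pipeline function signature
EXPECTED_TYPES = {
    "project": str,
    "location": str,
    "unique_table": str,
    "target_table_audio": str,
    "target_table_artist": str,
    "batch_size": int,
    "batches_to_store": int,
    "sleep_param": float,
}


def check_parameters(params, expected):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for name, expected_type in expected.items():
        if name not in params:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(params[name], expected_type):
            actual = type(params[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    for name in params:
        if name not in expected:
            problems.append(f"unexpected parameter: {name}")
    return problems


good_params = {
    "project": "hybrid-vertex",
    "location": "us-central1",
    "unique_table": "hybrid-vertex.mdp_eda_test.tracks_unique",
    "target_table_audio": "hybrid-vertex.mdp_eda_test.audio_features",
    "target_table_artist": "hybrid-vertex.mdp_eda_test.artist_features",
    "batch_size": 50,
    "batches_to_store": 100000,
    "sleep_param": 0.05,
}
bad_params = dict(good_params, batch_size="50")

print(check_parameters(good_params, EXPECTED_TYPES))
print(check_parameters(bad_params, EXPECTED_TYPES))
```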

In [14]:
job = aiplatform.PipelineJob(display_name = f'spotify-feature-enrichment-{PIPELINE_TAG}'.replace('_', '-'),
                             template_path = 'custom_container_pipeline_spec.json',
                             pipeline_root = f'gs://{BUCKET_NAME}/{VERSION}',
                             parameter_values = PIPELINE_PARAMETERS,
                             project = PROJECT_ID,
                             location = LOCATION,
                              enable_caching=False)

job.submit()

Creating PipelineJob
PipelineJob created. Resource name: projects/934903580331/locations/us-central1/pipelineJobs/spotify-feature-enrichment-v6-spotify-feature-enrich-20230224234340
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/934903580331/locations/us-central1/pipelineJobs/spotify-feature-enrichment-v6-spotify-feature-enrich-20230224234340')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/spotify-feature-enrichment-v6-spotify-feature-enrich-20230224234340?project=934903580331
