# 01b_Collection: Querying Spotipy for Audio Features + Analysis

**Description**: Using the Spotipy wrapper, retrieving audio features and audio analysis from the Spotify API.

**Disclaimer**: Since certain processes within this notebook require API keys (which are not stored within this notebook), or datbase access credentials, it is not possible to run every cell from start to finish. If you'd like to do so, you'll need to request Spotify API access with client credentials [here](https://developer.spotify.com/dashboard/login), and reach out to receive access to the SQL database referenced."

## Table of Contents

1. [Retrieving Audio Features](#1)
2. [Retrieving Audio Analysis](#2)

In [1]:
import json
import pickle
import re
import time

import numpy as np
import pandas as pd

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

from sqlalchemy import create_engine

from library import mk_msong_list

**Disclaimer:** Client Credentials are listed for illustrative purposes only. You will not be able to replicate the information contained here without actual API access credentials.

In [2]:
client_credentials_manager = SpotifyClientCredentials(client_id="xXXXXXxxxXXXXXxxxxXXXxx",
                                                          client_secret="xXXXXXxxxXXXXXxxxxXXXxx")
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

#### Testing Connection

In [6]:
sp.audio_features('62bOmKYxYg7dhrC6gH9vFn')

[{'danceability': 0.61,
  'energy': 0.926,
  'key': 8,
  'loudness': -4.843,
  'mode': 0,
  'speechiness': 0.0479,
  'acousticness': 0.031,
  'instrumentalness': 0.0012,
  'liveness': 0.0821,
  'valence': 0.861,
  'tempo': 172.638,
  'type': 'audio_features',
  'id': '62bOmKYxYg7dhrC6gH9vFn',
  'uri': 'spotify:track:62bOmKYxYg7dhrC6gH9vFn',
  'track_href': 'https://api.spotify.com/v1/tracks/62bOmKYxYg7dhrC6gH9vFn',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/62bOmKYxYg7dhrC6gH9vFn',
  'duration_ms': 200400,
  'time_signature': 4}]

<a name="1"></a>
### 1. Retrieving Audio Features

#### Loading in Song List

In [3]:
with open('../pickle/song_list.pkl', 'rb+') as f:
    song_list = pickle.load(f)

#### 1a. Retrieving Audio Features for Every Song in `song_list`

Adding `try` & `except` statements, along with `sleep` times was necessary to keep audio feature extraction going uninterrupted.

In [4]:
def get_song_feat(song_list):
    '''
    Retrieve audio features for every song in `song_list`.
    '''
    song_feat = []
    for i in song_list:
        if isinstance(i, dict):
            id_list = []
            for k in i['tracks']:
                id_list.append(k['id'])
        else:
            continue
        try:
            song_feat.append(sp.audio_features(id_list))
            time.sleep(1)
        except:
            time.sleep(5)
            song_feat.append(sp.audio_features(id_list))
    return song_feat

In [6]:
song_feat = get_song_feat(song_list)

In [36]:
len(song_feat)

2444

In [38]:
with open('../pickle/song_feat.pkl', 'wb+') as f:
    pickle.dump(song_feat, f)

<a name="2"></a>
## 2. Retrieving Audio Analysis

Due to the size of a combined resulting dict, I saved every song's corresponding audio analysis into a separate json file. This made it easier to eventually load in, perform EDA, and manipulate, as I could do it on a single song basis (the total size of audio analysis files utilized is roughly 10GB).

In [7]:
def get_aa(song_list):
    '''
    Retrieve audio analysis for every song in `song_list` and store each in separate .json file.
    '''
    count = 0
    for i in song_list:
        if isinstance(i, dict):
            for k in i['tracks']:
                try:
                    analysis = sp.audio_analysis(k['id'])
                except:
                    time.sleep(5)
                    analysis = sp.audio_analysis(k['id'])
                with open('../data/audio_analysis/{}.json'.format(k['id']), 'w') as f:
                    json.dump(analysis, f)
        count += 1
        if count % 5000 == 0:
            print('song {} finished'.format(count))
    print('finished')

In [9]:
get_aa(song_list)

### 2a. Checking to See Which Songs Weren't Grabbed in Audio Analysis

In [47]:
df = pd.read_csv('../data/analysis_list.txt', delimiter=" ", header=None)

In [53]:
df = df.apply(lambda x: x.str.rstrip('.json'), 1)

In [82]:
df.shape

(23129, 1)

#### Connecting to PostgreSQL to Retreive Song Listing

For ease of access anywhere, I established a PostgreSQL database, where I've been storing all of the tables created during the data collection process

**Disclaimer**: The credentials listed are for illustrative purposes only. Please reach out if you'd like to connect to this database.

In [64]:
engine = create_engine('postgresql://postgres:xxxxXXXXXxxxx@xxxxxxxxxus-west-2.compute.amazonaws.com:5432/postgres')
engine.connect()

<sqlalchemy.engine.base.Connection at 0x7fa6daec65f8>

In [65]:
song_id_list = pd.read_sql("""
                            SELECT * FROM song_list
                            """, con=engine)

In [84]:
no_aa = song_id_list[~song_id_list['song_id'].isin(df[0])]    

In [85]:
no_aa.shape

(1506, 2)

Looks like 1506 titles weren't reterieved in the Audio Analysis API pull. However, I did not consider this a large enough amount to stop processing and grab additional songs.

#### Creating Array of Song Titles + Unique ID

I'll need this for a ton of different things, including checking my progress on grabbing the audio analysis.

In [12]:
master_song_list = mk_msong_list.mk_msong_list(song_list)

##### Size of `master_song_list`

In [13]:
len(master_song_list)

23888

In [8]:
23888 * 403

9626864

##### Dumping Master Song List to `.json`

In [40]:
with open('../data/master_song_list.json', 'w+') as f:
    json.dump(master_song_list, f)

In [23]:
with open('../data/master_song_list.json', 'r') as f:
    master_song_list = json.load(f)

#### Next notebook: 02_transforming_feats