# Get audio features of tracks

https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/

>- `acousticness`	`float`	A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
>- `analysis_url`	`string`	An HTTP URL to access the full audio analysis of this track. An access token is required to access this data.
>- `danceability`	`float`	Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
>- `duration_ms`	`int`	The duration of the track in milliseconds.
>- `energy`	`float`	Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
>- `id`	`string`	The Spotify ID for the track.
>- `instrumentalness`	`float`	Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
>- `key`	`int`	The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
>- `liveness`	`float`	Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
>- `loudness`	`float`	The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
>- `mode`	`int`	Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
>- `speechiness`	`float`	Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
>- `tempo`	`float`	The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
>- `time_signature`	`int`	An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
>- `track_href`	`string`	A link to the Web API endpoint providing full details of the track.
>- `type`	`string`	The object type: “audio_features”.
>- `uri`	`string`	The Spotify URI for the track.
>- `valence`	`float`	A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
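The `key`, `mode` and `speechiness` descriptions above translate into simple lookups. The following is a minimal sketch, not part of the Spotify API: `key_name` and `speechiness_bucket` are hypothetical helper names, and the bucket labels are made up for illustration. The thresholds are the ones quoted in the field list.

```python
# Pitch-class names for the `key` field (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_name(key, mode):
    """Return a label such as 'A minor' (hypothetical helper)."""
    if not 0 <= key < len(PITCH_CLASSES):
        return "unknown"  # any value outside 0..11 has no pitch class
    # mode: major is represented by 1 and minor by 0
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


def speechiness_bucket(value):
    """Map speechiness to the three ranges described in the field list."""
    if value > 0.66:
        return "speech"  # probably made entirely of spoken words
    if value >= 0.33:
        return "mixed"   # may contain both music and speech
    return "music"       # most likely music and other non-speech-like tracks


print(key_name(9, 0))           # A minor
print(speechiness_bucket(0.5))  # mixed
```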

In [2]:
%run ../credentials/credentials.ipynb
CLIENT_ID = Client_ID
CLIENT_SECRET = Client_Secret

In [2]:
import requests
import pandas as pd
import time as tm
import concurrent.futures
import functools
import os
import glob

from time import time

In [3]:
def auth():
    AUTH_URL = 'https://accounts.spotify.com/api/token'

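    # POST the client credentials to obtain an access token
    # ('Retry-After' is a response header, so it is not sent in the request body)
    auth_response = requests.post(AUTH_URL, {
        'grant_type': 'client_credentials',
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
    })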

    # convert the response to JSON
    auth_response_data = auth_response.json()

    # save the access token
    access_token = auth_response_data['access_token']
    
    headers = {
        'Authorization': 'Bearer {token}'.format(token=access_token)
    }
    
    return headers

In [4]:
headers = auth()

BASE_URL = 'https://api.spotify.com/v1/'

# Track ID from the URI
track_id = '6y0igZArWVi6Iz0rj35c1Y'

# actual GET request with proper header
r = requests.get(BASE_URL + 'audio-features/' + track_id, headers=headers)
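The track ID used above is the last part of a track URI of the form `spotify:track:<id>`. A minimal sketch of that extraction, where `track_id_from_uri` is a hypothetical helper name and not something the notebook defines:

```python
def track_id_from_uri(uri):
    """Return the last colon-separated part of a URI such as 'spotify:track:<id>'."""
    return uri.rsplit(":", 1)[-1]


print(track_id_from_uri("spotify:track:6y0igZArWVi6Iz0rj35c1Y"))
# 6y0igZArWVi6Iz0rj35c1Y
```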

In [5]:
track_uri = pd.read_csv('../data/track_uris.csv')

In [75]:
include = [
    'uri',
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'duration_ms',
    'time_signature']
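A list like this can be used to keep only the wanted fields of one audio-features object. The sketch below uses a shortened list (`wanted`) and a made-up response dictionary (`sample`); neither the values nor the placeholder URI are real API output.

```python
# Shortened, illustrative version of the `include` list
wanted = ["uri", "danceability", "energy"]

# Made-up response for illustration only
sample = {
    "danceability": 0.5,
    "energy": 0.7,
    "type": "audio_features",
    "uri": "spotify:track:EXAMPLE",
    "track_href": "EXAMPLE",
}

# Keep the wanted keys, in the order given by `wanted`
row = {key: sample[key] for key in wanted if key in sample}
print(list(row))  # ['uri', 'danceability', 'energy']
```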

In [7]:
# lim = int(len(track_uri)/1)
# def threading(start_index, tracks_per_worker, headers):
#     #print(f'start{start_index}')
#     lst = []
#     failed = []
#     for i, track in enumerate(track_uri['0'][:lim][start_index : start_index + tracks_per_worker]):
#         track_id = track.replace('spotify:track:',"")
#         r = requests.get(BASE_URL + 'audio-features/' + track_id, headers=headers)     
#         if r.status_code == 200:
#             r1 = r.json()
#             D = {key:value for key, value in r1.items() if key in include}
#             D['uri'] = track
#             lst.append(D)
#         else:
#             tm.sleep(30)
#             failed.append(track)
#             print(r.status_code, track)
        
#         pd.DataFrame(lst).to_csv(f'../data/audio-features/tracks{start_index}-{ start_index + tracks_per_worker}.csv', index = None)
#         if failed != []:
#             pd.DataFrame(failed).to_csv(f'../data/audio-features/failed/tracks-failed.csv', mode='a', header=False, index = None)
#     #print(f'end{start_index}')
#     return lst

# max_workers = 3
# min_index = 0 # min: 0
# end_index = len(track_uri) # max: 345707
# number_of_tracks = end_index - min_index
# tracks_per_worker = int(number_of_tracks/max_workers) 
# start_index = range(min_index, end_index, tracks_per_worker)

In [8]:
files = glob.glob('../data/audio-features/*.csv')
file = max(files , key = os.path.getctime)
lim_low = int(os.path.basename(file).split('-')[0])  # works with both '/' and '\\' path separators
lim_low

977600

In [9]:
headers = auth()

In [66]:
lim = len(track_uri)
lim_low = 0
print(lim_low, lim)
failed = []
step=100
time0 = time()
for i in range(lim_low,lim,step):
    L = i
    U = min(i + step, lim)  # the last batch may be shorter than `step`
    print(L,U)
    track_ids = ','.join(list(track_uri['0'][L:U].str.replace('spotify:track:',"")))
    r = requests.get(BASE_URL + 'audio-features/?ids=' + track_ids, headers=headers)
    if r.status_code == 200:
        # skip null entries in the response
        D = pd.DataFrame([el for el in r.json()['audio_features'] if el is not None])
        D.to_csv(f'../data/audio-features/{L}-{U}tracks.csv', index = None)
    else:
        print(r.status_code)
        tm.sleep(30)
        failed.append(track_ids)
        # append only this batch, so earlier failures are not written to the file again
        pd.DataFrame([track_ids]).to_csv(f'../data/audio-features/failed/tracks-failed.csv', mode='a', header=False, index = None)
    
#     if L%10000 == 0:
#         tm.sleep(6)
#         print(f'{L} - {(time()-time0)/60:0.2f}min')
    
#     if L%100000 == 0:
#         tm.sleep(30)
#         print(f'{L} - {(time()-time0)/60:0.2f}min')
#         headers = auth()    

2237600 2237700
2237600 2237700
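The loop above sleeps a fixed 30 seconds after any non-200 response. When the API rate-limits a client it answers with status 429 and a `Retry-After` header giving the number of seconds to wait, so the wait can follow that value instead. The following is only a sketch under that assumption: `get_with_retry` is a hypothetical helper, and the `get` and `sleep` parameters exist so that `requests.get` and `time.sleep` can be swapped out.

```python
import time


def get_with_retry(get, url, headers, max_attempts=3, default_wait=30, sleep=time.sleep):
    """Call get(url, headers=headers) and retry while the status is 429.

    Returns the last response, whatever its status code.
    """
    for attempt in range(1, max_attempts + 1):
        response = get(url, headers=headers)
        if response.status_code != 429 or attempt == max_attempts:
            return response
        # Wait for the number of seconds the server asked for,
        # falling back to `default_wait` if the header is missing.
        sleep(int(response.headers.get("Retry-After", default_wait)))
```

In the notebook this would be called as `get_with_retry(requests.get, BASE_URL + 'audio-features/?ids=' + track_ids, headers)`.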


In [67]:
step=100
all_tracks = []
for i in range(0,len(track_uri),step):
    L = i
    U = min(i + step, len(track_uri))  # the last batch may be shorter than `step`
    el = f'{L}-{U}tracks.csv'
    all_tracks.append(el)
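The download loop and this cell compute the same batch bounds, so the logic could live in one place. A minimal sketch, where `batch_bounds` and `expected_filenames` are hypothetical helper names and the file-name pattern is the one used when saving the batches:

```python
def batch_bounds(total, step=100):
    """Yield (lower, upper) index pairs that cover range(total) in steps of `step`."""
    for lower in range(0, total, step):
        yield lower, min(lower + step, total)


def expected_filenames(total, step=100):
    """File names the download loop would write for `total` tracks."""
    return [f"{lower}-{upper}tracks.csv" for lower, upper in batch_bounds(total, step)]


print(expected_filenames(250))
# ['0-100tracks.csv', '100-200tracks.csv', '200-250tracks.csv']
```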

In [68]:
all_tracks_downloaded = []
for i, file in enumerate(os.listdir(r'../data/audio-features/')):
    if file.endswith('.csv'):
        all_tracks_downloaded.append(file)

In [75]:
len(all_tracks), len(all_tracks_downloaded),all_tracks[:1], all_tracks_downloaded[:1]

(22623, 22623, ['0-100tracks.csv'], ['0-100tracks.csv'])

In [55]:
len(all_tracks_downloaded)

22622


In [70]:
missed = []
for el in all_tracks:
    if el not in all_tracks_downloaded:
        missed.append(el)

if missed == []:
    print('Nothing missed')
else:
    print("Missed tracks", missed)

Nothing missed
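The same check can be written as a set difference, which avoids scanning the downloaded list once per expected file. This is a sketch with made-up file lists (`expected_files`, `downloaded_files`), not the notebook's data:

```python
# Made-up lists for illustration
expected_files = ["0-100tracks.csv", "100-200tracks.csv", "200-250tracks.csv"]
downloaded_files = ["0-100tracks.csv", "200-250tracks.csv"]

# Files that were expected but are not on disk
missing_files = sorted(set(expected_files) - set(downloaded_files))
print(missing_files)  # ['100-200tracks.csv']
```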


# Combine all files 

In [9]:
import numpy as np

In [29]:
arr = np.empty((22624*100,18), dtype=object)
data = pd.DataFrame()
t0=time()
row_counter = 0
for i, file in enumerate(os.listdir(r'../data/audio-features/')):
    if file.endswith('.csv'):
        data_to_append = pd.read_csv(f'../data/audio-features/{file}').to_numpy()
        arr[row_counter:row_counter+len(data_to_append),:] = data_to_append
        if i % 1000==0:  # matches the progress output below, which is printed every 1000 files
            print(f'{i} finished in {(time()-t0)/60:0.2f} min')
        row_counter = row_counter+len(data_to_append)

0 finished in 0.00 min
1000 finished in 0.04 min
2000 finished in 0.08 min
3000 finished in 0.12 min
4000 finished in 0.15 min
5000 finished in 0.19 min
6000 finished in 0.23 min
7000 finished in 0.27 min
8000 finished in 0.31 min
9000 finished in 0.35 min
10000 finished in 0.39 min
11000 finished in 0.43 min
12000 finished in 0.48 min
13000 finished in 0.52 min
14000 finished in 0.56 min
15000 finished in 0.61 min
16000 finished in 0.67 min
17000 finished in 0.72 min
18000 finished in 0.77 min
19000 finished in 0.82 min
20000 finished in 0.87 min
21000 finished in 0.92 min
22000 finished in 0.97 min
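The cell above preallocates a NumPy array, which requires knowing the number of rows and columns in advance. An alternative, shown here only as a sketch, is to collect the per-file frames in a list and stack them once with `pd.concat`. `combine_csvs` is a hypothetical helper name, and no claim is made about which approach is faster on this data.

```python
import glob
import os

import pandas as pd


def combine_csvs(folder):
    """Read every CSV in `folder` and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    # pd.concat raises ValueError on an empty list, so the folder must contain CSVs
    return pd.concat(frames, ignore_index=True)
```

Because the frames keep their column names, the separate step of reading the columns from the first file is not needed with this approach.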


In [35]:
t0=time()
row_counter=0
for i, file in enumerate(os.listdir(r'../data/audio-features/')):
    if file.endswith('.csv'):
        data_to_append = pd.read_csv(f'../data/audio-features/{file}')
        if i % 5000==0:
            print(f'{i} finished in {(time()-t0)/60:0.2f} min')
        row_counter = row_counter+len(data_to_append)
print(row_counter)

0 finished in 0.00 min
5000 finished in 0.18 min
10000 finished in 0.36 min
15000 finished in 0.55 min
20000 finished in 0.74 min
2262190


In [38]:
arr.shape

(2262400, 18)

In [61]:
for file in os.listdir(r'../data/audio-features/')[:1]:
    if file.endswith('.csv'):
        cols = pd.read_csv(f'../data/audio-features/{file}').columns

In [62]:
cols

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature'],
      dtype='object')

In [68]:
df = pd.DataFrame(arr, columns = cols)

In [69]:
df.dropna().shape

(2262190, 18)

In [70]:
# drop the preallocated rows that were never filled
df = df.dropna()

In [71]:
df.shape

(2262190, 18)

In [78]:
df[include].to_csv('../data/audio-features-combines.csv', index=None)

In [79]:
df_1 = pd.read_csv('../data/audio-features-combines.csv')

In [83]:
df_1.dtypes

uri                  object
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
time_signature        int64
dtype: object