# What's in this notebook?
Here I go through the process of getting all the data needed for the project. Before starting in the notebook, I made two playlists on Spotify consisting of roughly 1000 songs each -- one playlist had only ballet pieces, while the other had only non-ballet pieces. All of the data is retrieved from Spotify's Tracks API and Playlist API. I got data on song ids (Playlist API), track features (Track API), and track analysis (Track API). The steps taken were roughly:
1. Get song ids from each playlist
2. Get song features 
3. Get song analysis

Note: spotipy is amazing so I accessed the spotify API through their library!

Other note: functions defined in this notebook are only meant to be run in classes in which a Spotify spotipy object has already been instantiated (and named sp)

## Set up
Importing stuff, creating a Spotify spotipy object

In [3]:
pip install spotipy

Collecting spotipy
  Downloading https://files.pythonhosted.org/packages/59/46/3c957255c96910a8a0e2d9c25db1de51a8676ebba01d7966bedc6e748822/spotipy-2.4.4.tar.gz
Building wheels for collected packages: spotipy
  Building wheel for spotipy (setup.py) ... [?25ldone
[?25h  Stored in directory: /Users/hannah/Library/Caches/pip/wheels/76/28/19/a86ca9bb0e32dbd4a4f580870250f5aeef852870578e0427e6
Successfully built spotipy
Installing collected packages: spotipy
Successfully installed spotipy-2.4.4
Note: you may need to restart the kernel to use updated packages.


In [83]:
import config
import pandas as pd
import spotipy
import time
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
client_credentials_manager = SpotifyClientCredentials(client_id=config.client_id, 
                                                          client_secret=
                                                          config.client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## 1: Get Song IDs
Final product: Two lists of song ids -- one for the ballet playlist and one for the non-ballet playlist.

In [93]:
ballet_playlistID = '4T0uU1QrXbK2vHp2zMCDQK'
nonballet_playlistID = '0vKpBpWCRsWEo17m13HA4g'
test = '7tR5BxigI7TWjf6IFgdpQ3'

# get song ids, given a playlist ID, username and the length of the playlist
def get_song_ids(username, playlistID, playlist_length):
    song_ids = []

    # need to paginate (only 100 ids given per request)
    for i in range(0, playlist_length, 100):
        song_ids.extend([x['track']['id'] for x in 
                       sp.user_playlist_tracks(username, 
                                               playlistID, offset=i, 
                                               fields='items(track(id))')['items']])
        time.sleep(.35)
    return song_ids

In [95]:
ballet_ids = get_song_ids("hparker96", ballet_playlistID, 1041)

In [96]:
nonballet_ids = get_song_ids("hparker96", nonballet_playlistID, 1084)

## 2: Get Song Features
A dataframe of song features, with the ID as the index. This will be merged with the dataframe to be produced in the next section, to get our final dataframe. 

In [133]:
# get song features, given list of ids to get the features of
def get_features(ids):
    features = []
    i=0
    
    # can only get 50 at a time
    while i < len(ids):
        features.extend(sp.audio_features(tracks=ids[i:i+50]))
        i += 50
        time.sleep(.35)
    
    # above will only run if len(ids) > 50
    if len(ids) <= 50:
        features.extend(sp.audio_features(tracks=ids))

    return features

In [154]:
# takes a bit for lots of ids!
ballet_features = get_features(ballet_ids)

In [136]:
nonballet_features = get_features(nonballet_ids)

In [161]:
# create a dataframe of ballet features, set 'ballet' column 
# to 1 (ie, yes, it's a ballet song)
ballet_features_df = pd.DataFrame(ballet_features)
ballet_features_df['ballet'] = 1

In [163]:
# create a dataframe of ballet features, set 'ballet' column 
# to 0 (ie, no, it's not a ballet song)
nonballet_features_df = pd.DataFrame(nonballet_features)
nonballet_features_df['ballet'] = 0

In [166]:
# stick them together
feature_df = pd.concat([ballet_features_df,nonballet_features_df])

In [169]:
# get rid of columns that aren't useful
feature_df.drop(['analysis_url', 'track_href', 'type', 'uri'], axis=1, inplace=True)

In [171]:
# set id as the index, and we're done with this part!
feature_df.set_index('id', inplace=True)

In [200]:
feature_df

Unnamed: 0_level_0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,ballet
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
20BHajwSZPTbGbsFSxurtU,0.948,0.1610,142160,0.00474,0.946000,3,0.1220,-41.235,1,0.0356,100.496,4,0.0321,1
36Jo9Y3bVip2mLvBNbgNaN,0.506,0.1610,91173,0.05640,0.838000,1,0.1120,-35.230,1,0.0394,132.687,3,0.0365,1
4PJDfvW1Bw3a8fZy0C6Bjq,0.967,0.1050,141293,0.08010,0.816000,11,0.6760,-22.946,0,0.0366,66.158,3,0.0347,1
1Bi0JfOAavN55nKBYBoONe,0.949,0.3970,87653,0.32800,0.141000,8,0.8020,-20.748,0,0.0413,103.014,1,0.2310,1
1lxgxl9HWcFbx7FYjH0yFl,0.978,0.3810,60440,0.33600,0.868000,9,0.1100,-23.742,1,0.0614,106.060,4,0.0797,1
69ftg7xbry9efOJuA2rr0a,0.947,0.1240,315613,0.04580,0.884000,9,0.2770,-27.224,1,0.0381,88.513,4,0.0379,1
4bUmsjPY6B6UnrJWiqXjaX,0.970,0.1230,144293,0.01880,0.238000,9,0.3800,-31.280,0,0.0470,78.477,4,0.0375,1
7plfKTRZxPXE05euh0wQmQ,0.960,0.2980,142493,0.20900,0.794000,2,0.2710,-24.783,1,0.0448,175.134,4,0.2930,1
0QKbNkXMmZ7xgNHGqdkEnL,0.989,0.2640,61853,0.00240,0.586000,10,0.2040,-37.576,1,0.0445,53.377,4,0.3950,1
5ytXZyg6GzZQWkBvAsjkHN,0.949,0.0746,239667,0.02030,0.890000,11,0.1070,-31.690,1,0.0436,79.035,4,0.0359,1


## 3: Get Song Analysis
Final product: A dataframe of analysis of each feature, with the ID of the song as the index. To be merged with dataframe above. Note -- this data is messy -- each entry will not be in a format that's really properly suited to a table (things like sections, bars, beats, etc. have lists of data abouch each section/bar/beat/etc.). I'll have to figure out in EDA what the best way to aggregate that data is! That will be key to creating a good classifier.

In [187]:
# this will take approximately 8,000 years
# okay so maybe not that long, but 1041 songs took 1506.5 seconds
analysis_nonballet = []
start = time.time()
for ID in nonballet_ids:
    analysis_nonballet.append(sp.audio_analysis(ID))
    time.sleep(.35)
end = time.time()
print(f'took {end - start} seconds to process {len(test_ids)} ids')

retrying ...1secs
took 1817.9702677726746 seconds to process 12 ids


In [183]:
analysis_ballet = analysis

In [191]:
len(analysis_ballet)

1041

In [192]:
ballet_analysis_df = pd.DataFrame(analysis_ballet)
nonballet_analysis_df = pd.DataFrame(analysis_nonballet)
display(ballet_analysis_df)
display(nonballet_analysis_df)

Unnamed: 0,bars,beats,meta,sections,segments,tatums,track
0,"[{'start': 2.37543, 'duration': 2.51862, 'conf...","[{'start': 1.04131, 'duration': 0.67416, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 45.2725, 'confiden...","[{'start': 0.0, 'duration': 0.98136, 'confiden...","[{'start': 1.04131, 'duration': 0.44944, 'conf...","{'num_samples': 3134628, 'duration': 142.16, '..."
1,"[{'start': 0.34114, 'duration': 0.95762, 'conf...","[{'start': 0.34114, 'duration': 0.47535, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 4.93738, 'confiden...","[{'start': 0.0, 'duration': 0.50649, 'confiden...","[{'start': 0.34114, 'duration': 0.3169, 'confi...","{'num_samples': 2010372, 'duration': 91.17333,..."
2,"[{'start': 0.59006, 'duration': 1.91355, 'conf...","[{'start': 0.59006, 'duration': 0.95548, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 12.75043, 'confide...","[{'start': 0.0, 'duration': 0.80068, 'confiden...","[{'start': 0.59006, 'duration': 0.63698, 'conf...","{'num_samples': 3115518, 'duration': 141.29333..."
3,"[{'start': 1.88453, 'duration': 2.37082, 'conf...","[{'start': 0.65235, 'duration': 0.61783, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 10.14338, 'confide...","[{'start': 0.0, 'duration': 0.5581, 'confidenc...","[{'start': 0.65235, 'duration': 0.30892, 'conf...","{'num_samples': 1932756, 'duration': 87.65333,..."
4,"[{'start': 1.4702, 'duration': 2.24442, 'confi...","[{'start': 0.30137, 'duration': 0.59597, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 8.15456, 'confiden...","[{'start': 0.0, 'duration': 0.27918, 'confiden...","[{'start': 0.30137, 'duration': 0.29798, 'conf...","{'num_samples': 1332702, 'duration': 60.44, 's..."
5,"[{'start': 0.15514, 'duration': 2.41201, 'conf...","[{'start': 0.15514, 'duration': 0.61601, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 11.89862, 'confide...","[{'start': 0.0, 'duration': 0.11188, 'confiden...","[{'start': 0.15514, 'duration': 0.308, 'confid...","{'num_samples': 6959274, 'duration': 315.61333..."
6,"[{'start': 1.32633, 'duration': 2.94129, 'conf...","[{'start': 0.63096, 'duration': 0.69537, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 23.57802, 'confide...","[{'start': 0.0, 'duration': 0.34776, 'confiden...","[{'start': 0.63096, 'duration': 0.46358, 'conf...","{'num_samples': 3181668, 'duration': 144.29333..."
7,"[{'start': 0.06638, 'duration': 1.40132, 'conf...","[{'start': 0.06638, 'duration': 0.35342, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 10.15933, 'confide...","[{'start': 0.0, 'duration': 0.1078, 'confidenc...","[{'start': 0.06638, 'duration': 0.17671, 'conf...","{'num_samples': 3141978, 'duration': 142.49333..."
8,"[{'start': 4.24318, 'duration': 4.48652, 'conf...","[{'start': 0.86141, 'duration': 1.163, 'confid...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 15.45974, 'confide...","[{'start': 0.0, 'duration': 0.79429, 'confiden...","[{'start': 0.86141, 'duration': 0.77533, 'conf...","{'num_samples': 1363866, 'duration': 61.85333,..."
9,"[{'start': 2.27963, 'duration': 3.33022, 'conf...","[{'start': 0.58977, 'duration': 0.84112, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 36.73111, 'confide...","[{'start': 0.0, 'duration': 0.34816, 'confiden...","[{'start': 0.58977, 'duration': 0.56075, 'conf...","{'num_samples': 5284650, 'duration': 239.66667..."


Unnamed: 0,bars,beats,meta,sections,segments,tatums,track
0,"[{'start': 3.40298, 'duration': 2.37878, 'conf...","[{'start': 2.18132, 'duration': 0.62219, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 32.83717, 'confide...","[{'start': 0.0, 'duration': 1.7258, 'confidenc...","[{'start': 2.18132, 'duration': 0.3111, 'confi...","{'num_samples': 4563468, 'duration': 206.96, '..."
1,"[{'start': 1.25338, 'duration': 1.12142, 'conf...","[{'start': 1.25338, 'duration': 0.3756, 'confi...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 9.11914, 'confiden...","[{'start': 0.0, 'duration': 1.95356, 'confiden...","[{'start': 1.25338, 'duration': 0.1878, 'confi...","{'num_samples': 4570230, 'duration': 207.26667..."
2,"[{'start': 2.37602, 'duration': 1.58219, 'conf...","[{'start': 1.29048, 'duration': 0.54989, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 8.328, 'confidence...","[{'start': 0.0, 'duration': 1.47592, 'confiden...","[{'start': 1.29048, 'duration': 0.3666, 'confi...","{'num_samples': 6680856, 'duration': 302.98667..."
3,"[{'start': 1.43403, 'duration': 3.81262, 'conf...","[{'start': 0.55681, 'duration': 0.87722, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 13.83044, 'confide...","[{'start': 0.0, 'duration': 0.52041, 'confiden...","[{'start': 0.55681, 'duration': 0.43861, 'conf...","{'num_samples': 9759330, 'duration': 442.6, 's..."
4,"[{'start': 1.94563, 'duration': 1.41706, 'conf...","[{'start': 0.81862, 'duration': 0.39316, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 28.70306, 'confide...","[{'start': 0.0, 'duration': 0.34086, 'confiden...","[{'start': 0.81862, 'duration': 0.19658, 'conf...","{'num_samples': 3892854, 'duration': 176.54667..."
5,"[{'start': 3.22699, 'duration': 3.12225, 'conf...","[{'start': 1.65445, 'duration': 0.80309, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 12.66092, 'confide...","[{'start': 0.0, 'duration': 1.01977, 'confiden...","[{'start': 1.65445, 'duration': 0.40155, 'conf...","{'num_samples': 7888314, 'duration': 357.74667..."
6,[],[],"{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 13.06113, 'confide...","[{'start': 0.0, 'duration': 0.6444, 'confidenc...",[],"{'num_samples': 9537360, 'duration': 432.53333..."
7,"[{'start': 2.69999, 'duration': 1.36709, 'conf...","[{'start': 1.98695, 'duration': 0.35828, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 13.6933, 'confiden...","[{'start': 0.0, 'duration': 2.29714, 'confiden...","[{'start': 1.98695, 'duration': 0.17914, 'conf...","{'num_samples': 4307100, 'duration': 195.33333..."
8,"[{'start': 4.15367, 'duration': 2.99023, 'conf...","[{'start': 1.93689, 'duration': 0.72495, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 43.65186, 'confide...","[{'start': 0.0, 'duration': 1.56281, 'confiden...","[{'start': 1.93689, 'duration': 0.36248, 'conf...","{'num_samples': 11000010, 'duration': 498.8666..."
9,"[{'start': 1.92111, 'duration': 2.01877, 'conf...","[{'start': 0.42632, 'duration': 0.49008, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 34.13498, 'confide...","[{'start': 0.0, 'duration': 0.13918, 'confiden...","[{'start': 0.42632, 'duration': 0.24504, 'conf...","{'num_samples': 12989541, 'duration': 589.0948..."


In [194]:
analysis_df = pd.concat([ballet_analysis_df, nonballet_analysis_df])

In [195]:
analysis_df

Unnamed: 0,bars,beats,meta,sections,segments,tatums,track
0,"[{'start': 2.37543, 'duration': 2.51862, 'conf...","[{'start': 1.04131, 'duration': 0.67416, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 45.2725, 'confiden...","[{'start': 0.0, 'duration': 0.98136, 'confiden...","[{'start': 1.04131, 'duration': 0.44944, 'conf...","{'num_samples': 3134628, 'duration': 142.16, '..."
1,"[{'start': 0.34114, 'duration': 0.95762, 'conf...","[{'start': 0.34114, 'duration': 0.47535, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 4.93738, 'confiden...","[{'start': 0.0, 'duration': 0.50649, 'confiden...","[{'start': 0.34114, 'duration': 0.3169, 'confi...","{'num_samples': 2010372, 'duration': 91.17333,..."
2,"[{'start': 0.59006, 'duration': 1.91355, 'conf...","[{'start': 0.59006, 'duration': 0.95548, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 12.75043, 'confide...","[{'start': 0.0, 'duration': 0.80068, 'confiden...","[{'start': 0.59006, 'duration': 0.63698, 'conf...","{'num_samples': 3115518, 'duration': 141.29333..."
3,"[{'start': 1.88453, 'duration': 2.37082, 'conf...","[{'start': 0.65235, 'duration': 0.61783, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 10.14338, 'confide...","[{'start': 0.0, 'duration': 0.5581, 'confidenc...","[{'start': 0.65235, 'duration': 0.30892, 'conf...","{'num_samples': 1932756, 'duration': 87.65333,..."
4,"[{'start': 1.4702, 'duration': 2.24442, 'confi...","[{'start': 0.30137, 'duration': 0.59597, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 8.15456, 'confiden...","[{'start': 0.0, 'duration': 0.27918, 'confiden...","[{'start': 0.30137, 'duration': 0.29798, 'conf...","{'num_samples': 1332702, 'duration': 60.44, 's..."
5,"[{'start': 0.15514, 'duration': 2.41201, 'conf...","[{'start': 0.15514, 'duration': 0.61601, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 11.89862, 'confide...","[{'start': 0.0, 'duration': 0.11188, 'confiden...","[{'start': 0.15514, 'duration': 0.308, 'confid...","{'num_samples': 6959274, 'duration': 315.61333..."
6,"[{'start': 1.32633, 'duration': 2.94129, 'conf...","[{'start': 0.63096, 'duration': 0.69537, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 23.57802, 'confide...","[{'start': 0.0, 'duration': 0.34776, 'confiden...","[{'start': 0.63096, 'duration': 0.46358, 'conf...","{'num_samples': 3181668, 'duration': 144.29333..."
7,"[{'start': 0.06638, 'duration': 1.40132, 'conf...","[{'start': 0.06638, 'duration': 0.35342, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 10.15933, 'confide...","[{'start': 0.0, 'duration': 0.1078, 'confidenc...","[{'start': 0.06638, 'duration': 0.17671, 'conf...","{'num_samples': 3141978, 'duration': 142.49333..."
8,"[{'start': 4.24318, 'duration': 4.48652, 'conf...","[{'start': 0.86141, 'duration': 1.163, 'confid...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 15.45974, 'confide...","[{'start': 0.0, 'duration': 0.79429, 'confiden...","[{'start': 0.86141, 'duration': 0.77533, 'conf...","{'num_samples': 1363866, 'duration': 61.85333,..."
9,"[{'start': 2.27963, 'duration': 3.33022, 'conf...","[{'start': 0.58977, 'duration': 0.84112, 'conf...","{'analyzer_version': '4.0.0', 'platform': 'Lin...","[{'start': 0.0, 'duration': 36.73111, 'confide...","[{'start': 0.0, 'duration': 0.34816, 'confiden...","[{'start': 0.58977, 'duration': 0.56075, 'conf...","{'num_samples': 5284650, 'duration': 239.66667..."


In [197]:
analysis_df['id'] = ballet_ids+nonballet_ids

In [201]:
analysis_df.set_index('id', inplace=True)

## Final Dataframe
Almost there! Just have to merge and pickle the data, then we're done!

PS - this pickle is giant and therefore git won't take it, so it's going in the .gitignore

In [203]:
df = analysis_df.merge(feature_df, left_index=True, right_index=True)

In [204]:
import pickle
pickle.dump(df, open( "data.pickle", "wb" ) )