# Data acquisition notebook

Uses:

- Spotify API
- Spotipy

## 1. Setup Spotipy

In [1]:
import pandas as pd  # used later to build the dataframes
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

cid = "c7bb8703bc95445ca48b090c017cdc63"
secret = "49b2047ed46d4ed5b47d6b6523884b90"

client_cred_manager = SpotifyClientCredentials(client_id = cid, client_secret = secret)
sp = spotipy.Spotify(client_credentials_manager = client_cred_manager)
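
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the credentials were exported as environment variables beforehand. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` follow the convention Spotipy documents for its credential managers.

```python
import os

def load_spotify_credentials(environ=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper: raises a clear error instead of silently
    passing None to the credentials manager.
    """
    try:
        return environ["SPOTIPY_CLIENT_ID"], environ["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

# Usage in the setup cell (commented out so the sketch runs without credentials):
# cid, secret = load_spotify_credentials()
# client_cred_manager = SpotifyClientCredentials(client_id = cid, client_secret = secret)
```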

Data acquisition is split into two steps.

First we create a dataframe with track IDs, then a second dataframe with the audio features for those tracks.

## 2. Collect track IDs

In [35]:
import timeit
start = timeit.default_timer()

# lists to store extracted information
artist_names = []
track_names = []
track_ids = []
popularities = []


# some info
# for sp.search:
    # q : query
    # type : album, artist, track, playlist
    # limit : how many items to return per request
    # offset : index of the first item to return

# results is a dict object
# all relevant information we seek is stored under results['tracks']['items']
# where each item contains album, name, etc.
for offset in range(0, 10000, 50):
    results = sp.search(q = 'year:2019', type = 'track', limit = 50, offset = offset)
    for t in results['tracks']['items']:  # each t is a dict describing one track
        artist_names.append(t['artists'][0]['name'])  # first listed artist only
        track_names.append(t['name'])
        track_ids.append(t['id'])
        popularities.append(t['popularity'])

stop = timeit.default_timer()

print("This took", (stop-start), " seconds to run.")
        

retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...2secs
retrying ...1secs
This took 101.48866940000016  seconds to run.
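
The paging pattern in the loop above can be isolated into a small generator. This is a sketch, not part of the original run: `iter_search_items` and `fake_search` are hypothetical names, and `fake_search` stands in for `sp.search` so the example runs without network access. Unlike the original loop, it stops early when a page comes back empty.

```python
def iter_search_items(search, query, total=10000, page_size=50):
    """Yield track items page by page, using limit/offset paging."""
    for offset in range(0, total, page_size):
        page = search(q=query, type='track', limit=page_size, offset=offset)
        items = page['tracks']['items']
        if not items:  # no more results: stop instead of requesting empty pages
            break
        yield from items

def fake_search(q, type, limit, offset):
    """Stand-in for sp.search: pretends the catalogue holds 120 tracks."""
    ids = range(offset, min(offset + limit, 120))
    return {'tracks': {'items': [{'id': str(i), 'name': 'track %d' % i} for i in ids]}}

demo_ids = [t['id'] for t in iter_search_items(fake_search, 'year:2019')]
print(len(demo_ids))
```

With the real client, the call would be `iter_search_items(sp.search, 'year:2019')`.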


In [65]:
# sanity check
print("Are there any new Led Zeppelin songs? \nThere are %s new songs by the zeps..." % artist_names.count("Led Zeppelin"))

print("\nWe collected %s tracks for our records. As expected?" % len(track_names))

Are there any new Led Zeppelin songs? 
There are 0 new songs by the zeps...

We collected 10000 tracks for our records. As expected?


Okay, this looks good. Time to wrangle our data a bit.

Let's start by turning our lists into a pandas DataFrame.

In [147]:
df_tracks = pd.DataFrame(
    {
    'artist': artist_names,
    'track': track_names,
    'track_id': track_ids,
    'popularity': popularities
    }
)

In [85]:
df_tracks.sort_values("popularity", ascending= False).head()

# I only know of Billie Eilish and Sam Smith. Not bad I guess!

| | artist | track | track_id | popularity |
|---|---|---|---|---|
| 2 | Shawn Mendes | Señorita | 0TK2YIli7K1leLovkQiNik | 100 |
| 42 | Anuel AA | China | 2ksOAxtIxY8yElEWw8RhgK | 97 |
| 58 | Tones and I | Dance Monkey | 5ZULALImTm80tzUbYQYM9d | 96 |
| 6 | Billie Eilish | bad guy | 2Fxmhks0bxGSBdJ92vM42m | 96 |
| 13 | Sam Smith | How Do You Sleep? | 6b2RcmUt1g9N9mQ3CbjX2Y | 96 |


In [86]:
print(df_tracks.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
artist        10000 non-null object
track         10000 non-null object
track_id      10000 non-null object
popularity    10000 non-null int64
dtypes: int64(1), object(3)
memory usage: 312.6+ KB
None


It's important to check whether we have duplicates.
The Spotify API does not assign the same track ID to a song that appears both as a single and on an album, so each release shows up as a separate row in the dataframe. We should fix this.

In [97]:
grouped = df_tracks.groupby(["artist", "track"], as_index = True).size()
# this groups the dataframe, first by artist, then by track and measures the size of the final grouping.
# if a song is repeated then the group artist,track will be of size > 1 since track will appear at least 2 times.

In [105]:
grouped[grouped > 1] # to see which are repeated
grouped[grouped > 1].count() # to count them

1202

In [110]:
# drop duplicates, using the same (artist, track) subset as above
df_tracks.drop_duplicates(subset = ['artist', 'track'], inplace=True)


In [119]:
# make sure no repetitions are left
grouped_after_dropping = df_tracks.groupby(['artist','track'], as_index=True).size()
grouped_after_dropping[grouped_after_dropping > 1].count()

0
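
The same check can also be written with `DataFrame.duplicated`, which flags repeated rows directly. A small sketch on made-up rows (the artists, tracks, and IDs below are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'B'],
    'track': ['Song', 'Song', 'Other', 'Another'],
    'track_id': ['id1', 'id2', 'id3', 'id4'],  # same song, different IDs
})

# keep=False marks every member of a duplicated (artist, track) pair
dupes = toy[toy.duplicated(subset=['artist', 'track'], keep=False)]

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates(subset=['artist', 'track'])

print(len(dupes), len(deduped))
```

Keeping the first occurrence is an arbitrary choice: the single and album versions may carry different popularity scores, so sorting by `popularity` before dropping would be one way to keep the more popular one.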

## 3. Time to collect audio features

Now that we have collected our track information, it is time to collect the audio features for the tracks in our list.
This needs to be done in a separate API call, following the same logic as our previous pull.

In [131]:
# start timer
start = timeit.default_timer()

# initialise accumulators
rows = []
steps = 100 # the audio-features endpoint accepts at most 100 track IDs per call
NA_counter = 0 # some tracks don't have audio features

for i in range(0, len(df_tracks), steps):
    batch = df_tracks['track_id'].iloc[i:i+steps].tolist() # next 'steps' track IDs; batches don't overlap because range advances by 'steps'
    features = sp.audio_features(batch) # list with one dict of features (or None) per track
    for t in features:
        if t is None:
            NA_counter += 1
        else:
            rows.append(t)
            
            
stop = timeit.default_timer()

print("This took %s in seconds to complete." % (stop-start))

This took 12.917574500000228 in seconds to complete.
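
The batching step can be factored out into a small helper. This is a minimal sketch with hypothetical names (`chunked`, `fake_audio_features`). The stand-in function replaces `sp.audio_features` so the example runs offline, and it returns `None` for some IDs to mimic tracks without features.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_audio_features(batch):
    """Stand-in for sp.audio_features: None for IDs ending in '0'."""
    return [None if tid.endswith('0') else {'id': tid, 'tempo': 120.0} for tid in batch]

ids = ['id%d' % n for n in range(250)]
rows, missing = [], 0
for batch in chunked(ids, 100):
    for feats in fake_audio_features(batch):
        if feats is None:
            missing += 1  # count tracks without features instead of storing None
        else:
            rows.append(feats)

print(len(rows), missing)
```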


In [132]:
rows

[{'danceability': 0.715,
  'energy': 0.624,
  'key': 4,
  'loudness': -3.046,
  'mode': 0,
  'speechiness': 0.114,
  'acousticness': 0.11,
  'instrumentalness': 0,
  'liveness': 0.123,
  'valence': 0.412,
  'tempo': 158.087,
  'type': 'audio_features',
  'id': '5qmq61DAAOUaW8AUo8xKhh',
  'uri': 'spotify:track:5qmq61DAAOUaW8AUo8xKhh',
  'track_href': 'https://api.spotify.com/v1/tracks/5qmq61DAAOUaW8AUo8xKhh',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/5qmq61DAAOUaW8AUo8xKhh',
  'duration_ms': 173325,
  'time_signature': 4},
 {'danceability': 0.831,
  'energy': 0.502,
  'key': 10,
  'loudness': -4.045,
  'mode': 0,
  'speechiness': 0.046,
  'acousticness': 0.101,
  'instrumentalness': 0,
  'liveness': 0.122,
  'valence': 0.101,
  'tempo': 100.541,
  'type': 'audio_features',
  'id': '5ry2OE6R2zPQFDO85XkgRb',
  'uri': 'spotify:track:5ry2OE6R2zPQFDO85XkgRb',
  'track_href': 'https://api.spotify.com/v1/tracks/5ry2OE6R2zPQFDO85XkgRb',
  'analysis_url': 'https://api.spotify.

## Good, now let's wrangle!

Let's create a dataframe with the audio features.

In [133]:
df_audio_features = pd.DataFrame(rows)  # rows is a list of dicts: one row per track, one column per key

In [135]:
df_audio_features

df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8654 entries, 0 to 8653
Data columns (total 18 columns):
danceability        8654 non-null float64
energy              8654 non-null float64
key                 8654 non-null int64
loudness            8654 non-null float64
mode                8654 non-null int64
speechiness         8654 non-null float64
acousticness        8654 non-null float64
instrumentalness    8654 non-null float64
liveness            8654 non-null float64
valence             8654 non-null float64
tempo               8654 non-null float64
type                8654 non-null object
id                  8654 non-null object
uri                 8654 non-null object
track_href          8654 non-null object
analysis_url        8654 non-null object
duration_ms         8654 non-null int64
time_signature      8654 non-null int64
dtypes: float64(9), int64(4), object(5)
memory usage: 1.2+ MB


In [136]:
# let's drop a few columns
col_2_drop = ["type", "uri", "analysis_url", "track_href"]
df_audio_features.drop(col_2_drop, axis = 1, inplace = True)
df_audio_features.rename(columns = {"id": "track_id"}, inplace = True)

In [137]:
df_audio_features

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | track_id | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.715 | 0.624 | 4 | -3.046 | 0 | 0.1140 | 0.1100 | 0.000000 | 0.1230 | 0.412 | 158.087 | 5qmq61DAAOUaW8AUo8xKhh | 173325 | 4 |
| 1 | 0.831 | 0.502 | 10 | -4.045 | 0 | 0.0460 | 0.1010 | 0.000000 | 0.1220 | 0.101 | 100.541 | 5ry2OE6R2zPQFDO85XkgRb | 205427 | 4 |
| 2 | 0.759 | 0.540 | 9 | -6.039 | 0 | 0.0287 | 0.0370 | 0.000000 | 0.0945 | 0.750 | 116.947 | 0TK2YIli7K1leLovkQiNik | 190960 | 4 |
| 3 | 0.703 | 0.594 | 5 | -6.146 | 0 | 0.0752 | 0.3420 | 0.000000 | 0.1230 | 0.475 | 153.848 | 6fTt0CH2t0mdeB2N9XFG5r | 114893 | 4 |
| 4 | 0.695 | 0.762 | 0 | -3.497 | 1 | 0.0395 | 0.1920 | 0.002440 | 0.0863 | 0.553 | 120.042 | 21jGcNKet2qwijlDFuPiPb | 215280 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8649 | 0.786 | 0.633 | 10 | -4.493 | 1 | 0.0330 | 0.0492 | 0.000002 | 0.1140 | 0.945 | 120.005 | 5HClkiFAXSRSeMMofbNDGQ | 208320 | 4 |
| 8650 | 0.433 | 0.934 | 5 | -3.964 | 0 | 0.0481 | 0.0586 | 0.000918 | 0.6120 | 0.329 | 144.982 | 4smyx0T0iz2Ro9WFRlSo2p | 294621 | 4 |
| 8651 | 0.971 | 0.464 | 8 | -12.080 | 1 | 0.2040 | 0.1260 | 0.000000 | 0.0844 | 0.479 | 120.006 | 6BPB4N2FyFyrw91TVB1hMN | 208576 | 4 |
| 8652 | 0.639 | 0.822 | 6 | -5.433 | 1 | 0.0500 | 0.0309 | 0.000003 | 0.0286 | 0.824 | 141.989 | 3mPcoABQe29FXcYORjbVRr | 190707 | 4 |


In [145]:
# let's see if there are any duplicates still present. There shouldn't be.
any(df_audio_features.duplicated(subset = ["track_id"], keep = False))

#good

False

### Let's merge

In [148]:
df = pd.merge(df_tracks, df_audio_features, on = "track_id", how = "inner")

In [149]:
df

| | artist | track | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lizzo | Truth Hurts | 5qmq61DAAOUaW8AUo8xKhh | 94 | 0.715 | 0.624 | 4 | -3.046 | 0 | 0.1140 | 0.1100 | 0.000000 | 0.1230 | 0.412 | 158.087 | 173325 | 4 |
| 1 | Drake | Money In The Grave (Drake ft. Rick Ross) | 5ry2OE6R2zPQFDO85XkgRb | 93 | 0.831 | 0.502 | 10 | -4.045 | 0 | 0.0460 | 0.1010 | 0.000000 | 0.1220 | 0.101 | 100.541 | 205427 | 4 |
| 2 | Shawn Mendes | Señorita | 0TK2YIli7K1leLovkQiNik | 100 | 0.759 | 0.540 | 9 | -6.039 | 0 | 0.0287 | 0.0370 | 0.000000 | 0.0945 | 0.750 | 116.947 | 190960 | 4 |
| 3 | Lil Nas X | Panini | 6fTt0CH2t0mdeB2N9XFG5r | 93 | 0.703 | 0.594 | 5 | -6.146 | 0 | 0.0752 | 0.3420 | 0.000000 | 0.1230 | 0.475 | 153.848 | 114893 | 4 |
| 4 | Post Malone | Circles | 21jGcNKet2qwijlDFuPiPb | 94 | 0.695 | 0.762 | 0 | -3.497 | 1 | 0.0395 | 0.1920 | 0.002440 | 0.0863 | 0.553 | 120.042 | 215280 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9438 | La Poderosa Banda San Juan | Malabares | 5HClkiFAXSRSeMMofbNDGQ | 57 | 0.786 | 0.633 | 10 | -4.493 | 1 | 0.0330 | 0.0492 | 0.000002 | 0.1140 | 0.945 | 120.005 | 208320 | 4 |
| 9439 | Above & Beyond | There's Only You - Crystal Skies Remix | 4smyx0T0iz2Ro9WFRlSo2p | 53 | 0.433 | 0.934 | 5 | -3.964 | 0 | 0.0481 | 0.0586 | 0.000918 | 0.6120 | 0.329 | 144.982 | 294621 | 4 |
| 9440 | Iggy Azalea | Clap Back | 6BPB4N2FyFyrw91TVB1hMN | 54 | 0.971 | 0.464 | 8 | -12.080 | 1 | 0.2040 | 0.1260 | 0.000000 | 0.0844 | 0.479 | 120.006 | 208576 | 4 |
| 9441 | Dillon Carmichael | 99 Problems (Fish Ain't One) | 3mPcoABQe29FXcYORjbVRr | 49 | 0.639 | 0.822 | 6 | -5.433 | 1 | 0.0500 | 0.0309 | 0.000003 | 0.0286 | 0.824 | 141.989 | 190707 | 4 |
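
An inner merge silently drops tracks that have no audio features, and it multiplies rows if a key is repeated on either side. Below is a hedged sketch on toy data of two `pd.merge` options that make this visible: `validate` raises an error if the key relationship is not the expected one, and `indicator` records which side each row came from. The frames are invented for illustration.

```python
import pandas as pd

tracks = pd.DataFrame({'track_id': ['a', 'b', 'c'], 'popularity': [90, 80, 70]})
features = pd.DataFrame({'track_id': ['a', 'b'], 'tempo': [120.0, 98.5]})

# validate='one_to_one' raises pandas.errors.MergeError on repeated keys
inner = pd.merge(tracks, features, on='track_id', how='inner', validate='one_to_one')

# an outer merge with indicator=True adds a '_merge' column
outer = pd.merge(tracks, features, on='track_id', how='outer', indicator=True)
unmatched = outer[outer['_merge'] == 'left_only']

print(len(inner), list(unmatched['track_id']))
```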


In [155]:
# great, this works; let's save it
df.to_csv("full_dataframe_oct2.csv")
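
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A small sketch using an in-memory buffer (the toy frame is invented) of how `index=False` avoids that:

```python
import io
import pandas as pd

toy = pd.DataFrame({'artist': ['A', 'B'], 'popularity': [90, 80]})

with_index = io.StringIO()
toy.to_csv(with_index)                  # index written as a leading unnamed column
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # only the data columns are written
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```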