## Data collection using Spotipy

In this notebook, I will demonstrate how I gathered audio attribute data using the Spotipy package.

In [1]:
import numpy as np
import pandas as pd
import re
import spotipy
from sklearn.utils import shuffle
from spotipy import util
from spotipy.oauth2 import SpotifyClientCredentials

First, we import the song datasets used in the MIR Genre Predictor notebook. The cells below also appear in that main notebook, so we will run through them quickly to build our songs dataframe, which we will then use to gather audio attributes.

In [2]:
# Import the song datasets from Ranker, merge them together
dfA = pd.read_csv('lyrics1.csv')
dfB = pd.read_csv('lyrics2.csv')

frames = [dfA, dfB]

# Merge the two datasets
df_Ranker = pd.concat(frames)

# Join each song's lyric lines into a single lowercase string
groups = ['song', 'year', 'album', 'genre', 'artist', 'ranker_genre']
df_Ranker = df_Ranker.sort_values(groups).groupby(groups).lyric.apply(' '.join).apply(lambda x: x.lower()).reset_index(name='lyric')

# Clean up the lyrics
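# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
df_Ranker['lyric'] = df_Ranker['lyric'].str.replace(r'[^\w\s]', '', regex=True)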

In [3]:
# Drop unused columns in Ranker dataset
df_Ranker = df_Ranker.drop(['year', 'album', 'genre'], axis=1) # We will be using the 'ranker_genre' column instead of 'genre'

# Rename 'ranker_genre' to 'genre' and 'lyric' to 'lyrics' to match the Kaggle dataset
df_Ranker = df_Ranker.rename(index=str, columns={'ranker_genre': 'genre', 'lyric': 'lyrics'})

In [4]:
# Import the song dataset from Kaggle
df_Kaggle = pd.read_csv('songdata1.csv', dtype={'song': str, 'year': str, 'artist': str, 'genre': str, 'lyrics': str})

# Clean lyrics text
df_Kaggle['lyrics'] = df_Kaggle['lyrics'].str.replace(r'[^\w\s]', '', regex=True)  # regex=True is needed in pandas 2.0+
df_Kaggle['lyrics'] = df_Kaggle['lyrics'].str.replace('\n', ' ')
df_Kaggle['lyrics'] = df_Kaggle['lyrics'].str.lower()

# Replace dash chars with space chars
df_Kaggle['song'] = df_Kaggle['song'].str.replace('-', ' ')
df_Kaggle['artist'] = df_Kaggle['artist'].str.replace('-', ' ')

# Drop unused columns in Kaggle dataset
df_Kaggle = df_Kaggle.drop(['year'], axis=1)

In [5]:
# Merge!
frames = [df_Ranker, df_Kaggle]
songsdf = pd.concat(frames)

# Make 'song' and 'artist' columns lowercase
songsdf['song'] = songsdf['song'].str.lower()
songsdf['artist'] = songsdf['artist'].str.lower()

In [6]:
# Group some of the genres together from the Ranker dataset

# hip hop
songsdf['genre'] = np.where((songsdf['genre'] == 'Hip Hop')|
                                   (songsdf['genre'] == 'Hip-Hop')|
                                   (songsdf['genre'] == 'rhythm and blues')|
                                   (songsdf['genre'] == 'R&B'),
                                   'hip hop', 
                                   songsdf['genre'])

# punk/metal
songsdf['genre'] = np.where((songsdf['genre'] == 'screamo')|
                                   (songsdf['genre'] == 'punk rock')|
                                   (songsdf['genre'] == 'heavy metal')|
                                   (songsdf['genre'] == 'Metal'), 
                                   'punk/metal', 
                                   songsdf['genre'])

# country/folk/rock
songsdf['genre'] = np.where((songsdf['genre'] == 'Country')|
                                   (songsdf['genre'] == 'indie folk')|
                                   (songsdf['genre'] == 'Folk')|
                                   (songsdf['genre'] == 'Indie')|
                                   (songsdf['genre'] == 'classic rock'),
                                   'country/folk/rock', 
                                   songsdf['genre'])

songsdf['lyrics'] = songsdf['lyrics'].astype(str)

# Drop songs with missing lyrics and genres that are difficult to classify
songsdf = songsdf.drop(songsdf[(songsdf['lyrics'] == 'nan')|
                               (songsdf['genre'] == 'Other')|
                               (songsdf['genre'] == ' Alkebulan')|
                               (songsdf['genre'] == 'Not Available')|
                               (songsdf['genre'] == 'nan')|
                               (songsdf['genre'] == 'Electronic')|
                               (songsdf['genre'] == 'Pop')|
                               (songsdf['genre'] == 'pop')|
                               (songsdf['genre'] == 'Rock')|
                               (songsdf['genre'] == 'Jazz')].index)

# Our resulting classifications
genres = ['hip hop', 'punk/metal','country/folk/rock']
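
The three `np.where` blocks above amount to a lookup from a raw genre label to a merged genre. As a minimal sketch, the same grouping can be written as a dictionary. `GENRE_GROUPS`, `RAW_TO_GROUP`, and `normalize_genre` are hypothetical names introduced here for illustration; they are not part of the main notebook.

```python
# Hypothetical helper: the same label grouping as the np.where calls above,
# expressed as a plain dictionary lookup.
GENRE_GROUPS = {
    'hip hop': ['Hip Hop', 'Hip-Hop', 'rhythm and blues', 'R&B'],
    'punk/metal': ['screamo', 'punk rock', 'heavy metal', 'Metal'],
    'country/folk/rock': ['Country', 'indie folk', 'Folk', 'Indie', 'classic rock'],
}

# Invert the grouping so each raw label points at its merged genre
RAW_TO_GROUP = {raw: group for group, raws in GENRE_GROUPS.items() for raw in raws}

def normalize_genre(label):
    """Return the merged genre for a raw label, or the label unchanged."""
    return RAW_TO_GROUP.get(label, label)

print(normalize_genre('R&B'))   # hip hop
print(normalize_genre('Jazz'))  # Jazz (not in any group, so left as-is)
```

Assuming `songsdf` as built above, `songsdf['genre'].map(normalize_genre)` should produce the same labels as the three `np.where` calls, with the grouping kept in one place.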

Now that the songs dataframe is merged and cleaned, let's run through a gentle introduction to the Spotipy package.

## An introduction to Spotipy

We need to authorize this application using my Spotify client credentials so we can request song data.

In [7]:
client_credentials_manager = SpotifyClientCredentials(client_id='48315eb344ba44bb984931130013905c',
                                                      client_secret='secret key here')

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
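
Since a notebook is often shared, it can help to keep the client ID and secret out of the cell itself. The sketch below reads them from environment variables. `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy looks for when no arguments are passed to `SpotifyClientCredentials`; `load_credentials` is a hypothetical helper and the values shown are placeholders.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) from an environment-style mapping.

    Raises RuntimeError naming any variable that is missing or empty.
    """
    env = os.environ if env is None else env
    names = ('SPOTIPY_CLIENT_ID', 'SPOTIPY_CLIENT_SECRET')
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env[names[0]], env[names[1]]

# Example with a stand-in mapping instead of the real environment
client_id, client_secret = load_credentials({
    'SPOTIPY_CLIENT_ID': 'my-client-id',
    'SPOTIPY_CLIENT_SECRET': 'my-client-secret',
})
print(client_id)  # my-client-id
```

The returned pair can be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` exactly as in the cell above.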

Once we are authorized, we are able to retrieve song data. In the example below, we will get audio attributes from the song 'Love Lockdown' by Kanye West. The next example retrieves genres from the band Earth, Wind & Fire.

In [8]:
artist = 'kanye west'
track = 'love lockdown'

# Search for song, get the first result's id
results = sp.search(artist + ' ' + track)
track_id = results['tracks']['items'][0]['id']

# Use id to find audio features of song
sp.audio_features(tracks = [track_id])

[{'danceability': 0.756,
  'energy': 0.529,
  'key': 1,
  'loudness': -7.659,
  'mode': 0,
  'speechiness': 0.0329,
  'acousticness': 0.0539,
  'instrumentalness': 0.392,
  'liveness': 0.112,
  'valence': 0.123,
  'tempo': 119.573,
  'type': 'audio_features',
  'id': '1kxeWHF9PrCVZHvVskv8lg',
  'uri': 'spotify:track:1kxeWHF9PrCVZHvVskv8lg',
  'track_href': 'https://api.spotify.com/v1/tracks/1kxeWHF9PrCVZHvVskv8lg',
  'analysis_url': 'https://api.spotify.com/v1/audio-analysis/1kxeWHF9PrCVZHvVskv8lg',
  'duration_ms': 270307,
  'time_signature': 4}]
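
The lookup `results['tracks']['items'][0]['id']` raises an `IndexError` when the search returns no tracks. A minimal sketch of a guard for that case follows. `first_track_id` is a hypothetical helper, and the two dictionaries are stand-ins that only mimic the nesting of the search response used above.

```python
def first_track_id(results):
    """Return the id of the first track in a search response, or None."""
    items = results.get('tracks', {}).get('items', [])
    return items[0]['id'] if items else None

# Stand-in responses with the same nesting as the search result used above
found = {'tracks': {'items': [{'id': '1kxeWHF9PrCVZHvVskv8lg'}]}}
empty = {'tracks': {'items': []}}

print(first_track_id(found))  # 1kxeWHF9PrCVZHvVskv8lg
print(first_track_id(empty))  # None
```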

In [9]:
artist = 'earth, wind & fire'

# Search for tracks by this artist, then read the artist id from the first result
results = sp.search(q = 'artist:' + artist)
artist_id = results['tracks']['items'][0]['artists'][0]['id']

# Use artist id to find artist details and get the list of genres
sp.artist(artist_id)['genres']

['disco', 'funk', 'jazz funk', 'motown', 'quiet storm', 'soul']

Now that we understand the basic functions of the Spotipy package, let's use it to gather audio features for the songs in our songs dataframe. I used this process to populate the dataset used in the MIR Genre Predictor notebook.

## Fetching the audio feature attributes

First, we will create a new dataframe that will hold the audio attributes of each song in songsdf.

In [10]:
attrdf = pd.DataFrame(columns = ['track', 'artist', 'genre', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature'])
attrdf

track,artist,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature


Then, we will iterate through songsdf, find the audio attributes for each song, and add each song and its attributes to attrdf. We can then save attrdf to a .csv file for use in other notebooks.
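
The per-song step described above, turning one audio-features response into one attrdf row, can be sketched in isolation. `FEATURE_COLUMNS` and `build_row` are hypothetical names introduced for illustration. The sample values are copied from the 'Love Lockdown' output shown earlier, and the genre label is only an example.

```python
# Feature names in the same order as the attrdf columns defined above
FEATURE_COLUMNS = ['danceability', 'energy', 'key', 'loudness', 'mode',
                   'speechiness', 'acousticness', 'instrumentalness',
                   'liveness', 'valence', 'tempo', 'duration_ms',
                   'time_signature']

def build_row(track, artist, genre, features):
    """Return one attrdf row as a list, or None if features are missing."""
    if not features:
        return None
    return [track, artist, genre] + [features[name] for name in FEATURE_COLUMNS]

# Values taken from the 'Love Lockdown' output shown earlier
sample = {'danceability': 0.756, 'energy': 0.529, 'key': 1, 'loudness': -7.659,
          'mode': 0, 'speechiness': 0.0329, 'acousticness': 0.0539,
          'instrumentalness': 0.392, 'liveness': 0.112, 'valence': 0.123,
          'tempo': 119.573, 'duration_ms': 270307, 'time_signature': 4}

row = build_row('love lockdown', 'kanye west', 'hip hop', sample)
print(len(row))  # 16, one value per attrdf column
```

Keeping the feature names in a single list avoids repeating one assignment per attribute and keeps the row order tied to the column order.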

In [13]:
# Since this code was run several times over the course of this project, I shuffled songsdf each time to 
# reduce the likelihood of getting duplicates in my overall data once it's merged together.
songsdf = shuffle(songsdf)

progcounter, updatecounter, totalcount = 0, 0, 0
length = len(songsdf)

for index, row in songsdf.iterrows():
    # Get the artist, track, and genre of each song in songsdf
    artist = row['artist'].strip()
    track = row['song'].strip()
    genre = row['genre']

    progcounter += 1
    updatecounter += 1
    
    # Search for song, get the first result's id
    results = sp.search(artist + ' ' + track)
    
    try:
        track_id = results['tracks']['items'][0]['id']
        # Use id to find audio features of song
        features = sp.audio_features(tracks = [track_id])
        
        # Parse features to find each attribute
        danceability = features[0]['danceability']
        energy = features[0]['energy']
        key = features[0]['key']
        loudness = features[0]['loudness']
        mode = features[0]['mode']
        speechiness = features[0]['speechiness']
        acousticness = features[0]['acousticness']
        instrumentalness = features[0]['instrumentalness']
        liveness = features[0]['liveness']
        valence = features[0]['valence']
        tempo = features[0]['tempo']
        duration_ms = features[0]['duration_ms']
        time_signature = features[0]['time_signature']
        
        # Create a new row to be added to attrdf, then add it
        cols = [track, artist, genre, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, time_signature]
        attrdf.loc[totalcount] = cols
        totalcount += 1
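    except Exception:
        # Skip songs with no search result or no audio features;
        # unlike a bare except, this still lets Ctrl+C interrupt the loop
        pass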
    
    # Progress counter
    if updatecounter == 500:
        updatecounter = 0
        print('----    progress: {0:.3g}%, song count:'.format(100 * progcounter / length), totalcount)

print('Done -', totalcount, 'songs added.')


# Commented out code below saves attrdf to a csv file, this way we can use 
# data gathered from this notebook in other notebooks

# attrdf.to_csv('songfeaturedata.csv', encoding='utf-8', index=False)
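
The loop above requests audio features one song at a time. Because `sp.audio_features` takes a list of track ids, and the Spotify Web API documents a maximum of 100 ids per request, one possible speed-up is to collect the track ids first and send them in batches. The sketch below shows only the batching step. `chunked` is a hypothetical helper and the ids are stand-ins.

```python
def chunked(items, size=100):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in ids; in the notebook these would come from the search step
track_ids = ['id{}'.format(i) for i in range(250)]
batches = list(chunked(track_ids))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch could then be passed as `sp.audio_features(tracks=batch)`. Entries in the returned list may be `None` for tracks without features, so they would need checking before a row is built.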