# Extraction of Songs and Spotify Features Using Spotipy and the Spotify API

## Introduction

This notebook documents the initial data-gathering process. A list of artists and their most closely associated genres was compiled from Google and saved to a CSV file named `singers.csv`. Each artist in this file is looked up through the Spotify API to obtain the artist's URI. The artist's top 10 songs (across albums) are then retrieved, and each track's URI and preview URL are stored in a dataframe. Using the track URI, the features of each song are extracted and stored in a second dataframe. Finally, the preview audio is downloaded as mp3 files so that custom feature extraction can be run later on.

Warning: Some artists do not have preview URLs, whether because of regional restrictions, account restrictions or the URI not being extractable. The process therefore does not always yield the same results. The Spotify account used in this scraping process has Norway as its default country.

In [None]:
import os
import random
import time
from string import punctuation

import pandas as pd
import regex
import requests
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Read the API credentials from environment variables instead of hard-coding them in the notebook
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

The `singers.csv` file, which contains our list of singers and their most associated genres, is loaded first. It populates the dataframe on which all of the later steps are based.

In [None]:
singers = pd.read_csv('singers.csv')

The `Artist URI` is extracted using the following code. The `URI` is a unique Spotify identifier through which an artist's albums, tracks and other details can be retrieved. The code appends a `URI` column holding the `Artist URI` to the `singers` dataframe.

In [None]:
for i in range(len(singers)):
    artist = singers.loc[i, 'Singers']
    result = sp.search(artist)
    items = result['tracks']['items']
    if not items:
        # Nothing was found for this name, so the URI stays empty
        print("No search results for", artist)
        continue
    singers.loc[i, 'URI'] = items[0]['artists'][0]['uri']
    print("URI of", items[0]['artists'][0]['name'], "has been added to dataframe")
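
The URI sits several levels deep in the search response, and the list of hits can be empty. The following is a minimal sketch of pulling it out defensively, written so that it can be tried without API access. `first_artist_uri` is a hypothetical helper, and the two sample dictionaries are made up to mimic the nesting of the `sp.search` response used in this notebook.

```python
def first_artist_uri(result):
    """Return the URI of the first artist of the first track hit, or None."""
    items = result.get("tracks", {}).get("items", [])
    if not items or not items[0].get("artists"):
        return None
    return items[0]["artists"][0]["uri"]


# Made-up responses with the same nesting as result['tracks']['items'][0]['artists'][0]
hit = {"tracks": {"items": [{"artists": [{"name": "Example Artist",
                                          "uri": "spotify:artist:example"}]}]}}
miss = {"tracks": {"items": []}}

print(first_artist_uri(hit))   # spotify:artist:example
print(first_artist_uri(miss))  # None
```

Returning `None` rather than raising leaves the decision about unmatched artists to the caller, who can skip or log them.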

The following code collects the `Track URIs` and `Preview URLs` in lists. The API is queried with each `Artist URI`, and the URIs and preview URLs of the artist's top tracks are extracted and stored.

In [None]:
artist_uri = []
track_uri = []
track_preview = []
for artist in singers['URI'].dropna():
    results = sp.artist_top_tracks(artist)
    for track in results['tracks'][:10]:
        # Append to all three lists together so they stay aligned; the preview URL may be None
        artist_uri.append(artist)
        track_uri.append(track['uri'])
        track_preview.append(track.get('preview_url'))

These lists are combined into a dataframe, which is then merged with the earlier `singers` dataframe. Rows with missing values are dropped, because custom feature extraction cannot be done when the `Preview URL` is missing.

In [None]:
track_df = pd.DataFrame()
track_df['Artist'] = artist_uri
track_df['TrackURI'] = track_uri
track_df['Preview'] = track_preview
total_df = pd.merge(singers, track_df, how = 'outer', left_on='URI', right_on = 'Artist')
total_df2 = total_df.dropna()
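
To make the effect of the outer merge followed by `dropna` concrete, here is a small sketch on made-up data. The artist names, URIs and the `example.com` URL are placeholders, not values from the real dataset.

```python
import pandas as pd

# Two made-up artists; only the first one has any tracks
singers_demo = pd.DataFrame({
    "Singers": ["Artist A", "Artist B"],
    "Genre": ["Pop", "Metal"],
    "URI": ["uri:a", "uri:b"],
})
# Artist A has two tracks, one of them without a preview URL
tracks_demo = pd.DataFrame({
    "Artist": ["uri:a", "uri:a"],
    "TrackURI": ["track:1", "track:2"],
    "Preview": ["https://example.com/1.mp3", None],
})

merged = pd.merge(singers_demo, tracks_demo, how="outer",
                  left_on="URI", right_on="Artist")
kept = merged.dropna()

print(len(merged))  # 3: two tracks for Artist A, one empty row for Artist B
print(len(kept))    # 1: only the track that has a preview URL survives
```

The outer merge keeps artists without any tracks as rows full of missing values, so `dropna` removes both those artists and the individual tracks that lack a preview.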

Using the `Track URI`, Spotify's API is queried for the features to be used in the machine learning algorithms employed later. These features are stored in a new dataframe.

In [None]:
# Reset the index so that the positions 0..n-1 match the labels used with .loc below
df = total_df2[['TrackURI']].reset_index(drop=True)

feature_names = ['acousticness', 'instrumentalness', 'energy', 'speechiness',
                 'liveness', 'loudness', 'danceability', 'tempo', 'valence']

for i in range(df.shape[0]):
    try:
        # Wait a few seconds between requests to avoid hitting the API rate limit
        time.sleep(random.uniform(3, 6))
        URI = df.loc[i, 'TrackURI']
        features = sp.audio_features(URI)
        track = sp.track(URI)
        df.loc[i, 'track'] = track['name']
        for name in feature_names:
            df.loc[i, name] = features[0][name]
    except Exception as e:
        # Leave the row incomplete and move on to the next track
        print("Skipping", df.loc[i, 'TrackURI'], "-", e)
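
The loop above makes one `audio_features` request per track. According to the spotipy documentation, `audio_features` also accepts a list of up to 100 track URIs, so the requests could be batched. The sketch below shows only the batching logic and runs offline: `chunked`, `fetch_features` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sp.audio_features`.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features(track_uris, fetch, size=100):
    """Call `fetch` once per batch and collect the non-empty results."""
    rows = []
    for batch in chunked(track_uris, size):
        # Entries can be None when no features are available for a track
        rows.extend(f for f in fetch(batch) if f is not None)
    return rows


def fake_fetch(batch):
    """Stand-in for sp.audio_features: one made-up feature dict per URI."""
    return [{"uri": uri, "tempo": 120.0} for uri in batch]


uris = [f"spotify:track:{n}" for n in range(250)]
rows = fetch_features(uris, fake_fetch)
print(len(list(chunked(uris))), len(rows))  # 3 250
```

With a real client, `fetch_features(uris, sp.audio_features)` would need 3 requests for 250 tracks instead of 250.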

After the features are extracted, rows with missing values are dropped and the feature dataframe is merged with the earlier dataframe containing all the relevant track details. The result is saved as `subsampled.csv`, which is provided and can be used in further analysis.

In [None]:
# Keep only the rows for which a track name was retrieved; ~ inverts a boolean mask
new_df = df[~df['track'].isna()]
new_df2 = pd.merge(total_df2, new_df, how='left', on='TrackURI')
new_df2 = new_df2[~new_df2['track'].isna()]
new_df2.to_csv('subsampled.csv')

Unfortunately, only a few qawwali artists were found, so the data for the genres with fewer instances has to be gathered manually by looking up playlists. The code below defines a `get_data` function, which extracts the same features for the tracks of a given playlist.

In [None]:
def get_data(playlist, genre):
    # The playlist ID is the last path segment of the link, without the query string
    playlist_URI = playlist.split("/")[-1].split("?")[0]
    track_uris = [x["track"]["uri"] for x in sp.playlist_tracks(playlist_URI)["items"]]
    df = pd.DataFrame({'TrackURI': track_uris})
    feature_names = ['acousticness', 'instrumentalness', 'energy', 'speechiness',
                     'liveness', 'loudness', 'danceability', 'tempo', 'valence']
    for i in range(df.shape[0]):
        try:
            URI = df.loc[i, 'TrackURI']
            track = sp.track(URI)
            features = sp.audio_features(URI)
            df.loc[i, 'Singers'] = track['artists'][0]['name']
            df.loc[i, 'Genre'] = genre
            df.loc[i, 'Artist'] = track['artists'][0]['uri']
            df.loc[i, 'Preview'] = track['preview_url']
            df.loc[i, 'track'] = track['name']
            for name in feature_names:
                df.loc[i, name] = features[0][name]
        except Exception as e:
            # Leave the row incomplete; rows with missing values are dropped later
            print("Skipping", df.loc[i, 'TrackURI'], "-", e)

    return df
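
`get_data` takes the playlist ID from the share link by splitting the string. As a sketch of the same step using the standard library's URL parser, `playlist_id_from_link` below is a hypothetical helper; the link is the first qawwali playlist used in this notebook.

```python
from urllib.parse import urlparse


def playlist_id_from_link(link):
    """Return the last path segment of a playlist link, ignoring the query string."""
    path = urlparse(link).path
    return path.rstrip("/").split("/")[-1]


link = "https://open.spotify.com/playlist/0USwGwasJVrHVN5Xkdhw0d?si=de90c13a7b4a4d4f"
print(playlist_id_from_link(link))  # 0USwGwasJVrHVN5Xkdhw0d
```

Two links that differ only in their `si` query parameter resolve to the same ID, which is a quick way to spot the same playlist being listed twice.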

The cells below apply `get_data` to playlists that were searched for manually and combine the results into a single dataframe.

In [None]:
qawwali = "https://open.spotify.com/playlist/0USwGwasJVrHVN5Xkdhw0d?si=de90c13a7b4a4d4f"
qawwali2 = "https://open.spotify.com/playlist/0CgbguP9exGwRVdRXLMsPS?si=c902a490d29e4277"
qawwali3 = "https://open.spotify.com/playlist/0CgbguP9exGwRVdRXLMsPS?si=f25dd5cf963048e1"
ghazal = "https://open.spotify.com/playlist/23aE9YFTUW11CcyoQhbGzT?si=79d1f449158d4a01"
ghazal2 = "https://open.spotify.com/playlist/3C50t7049A6xf4w0rhfwgD?si=83083f7f5fc7460b"
edm = "https://open.spotify.com/playlist/2NIe54HdwR5msTjrlHG1Lt?si=ff336b7898a24f98"
metal = "https://open.spotify.com/playlist/27gN69ebwiJRtXEboL12Ih?si=838e862f19144616"


qawwali_df = get_data(qawwali, 'Qawwali') 
qawwali_df2 = get_data(qawwali2, 'Qawwali')
qawwali_df3 = get_data(qawwali3, 'Qawwali') 
ghazal_df = get_data(ghazal, 'Ghazal')
ghazal_df2 = get_data(ghazal2, 'Ghazal')
edm_df = get_data(edm, 'EDM')
metal_df = get_data(metal, 'Metal')

In [None]:
qawwali_df = pd.concat([qawwali_df, qawwali_df2, qawwali_df3], ignore_index=True)
ghazal_df = pd.concat([ghazal_df, ghazal_df2], ignore_index=True)

In [None]:
qawwali_df.dropna(inplace = True)
ghazal_df.dropna(inplace = True)
edm_df.dropna(inplace = True)
metal_df.dropna(inplace = True)

In [None]:
album_data = pd.concat([qawwali_df, ghazal_df, edm_df, metal_df], ignore_index = True)

This process yields a dataframe similar to the one saved to `subsampled.csv`, and the two dataframes are concatenated. The augmented dataframe is once again saved as `subsampled.csv`.

In [None]:
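# new_df2 is the dataframe that was saved to subsampled.csv earlier
subsampled = pd.concat([new_df2, album_data], ignore_index=True)
subsampled.drop_duplicates(inplace=True)
# Renumber the rows consecutively after the duplicates have been removed
subsampled.reset_index(drop=True, inplace=True)
print(subsampled['Genre'].value_counts())
subsampled.to_csv('subsampled.csv')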

The following code uses the `Preview URLs` stored in `subsampled` to download every song in the dataframe.

In [None]:
import os

# enumerate gives consecutive file numbers even if the dataframe index has gaps
for i, (_, row) in enumerate(subsampled.iterrows()):
    url = row["Preview"]
    genre = row["Genre"]
    # Keep only alphanumeric words so that the track name is safe to use in a file name
    track_name = ' '.join(regex.findall('[A-Za-z0-9]+', row["track"]))

    mp3file = requests.get(url)
    os.makedirs(f'./music/{genre}', exist_ok=True)
    with open(f'./music/{genre}/{str(i).zfill(3)}_{track_name.strip(punctuation)}.mp3', 'wb') as output:
        output.write(mp3file.content)
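
The download loop assumes that every row has a usable preview URL and builds the file name inline. The sketch below isolates those two steps so that they can be checked without downloading anything. `safe_filename` and `has_preview` are hypothetical helpers, the track names are made up, and the standard library's `re` is used here in place of `regex`.

```python
import re


def safe_filename(index, track_name):
    """Build '<zero-padded index>_<alphanumeric words>.mp3' from a track name."""
    words = re.findall(r"[A-Za-z0-9]+", track_name)
    return f"{str(index).zfill(3)}_{' '.join(words)}.mp3"


def has_preview(url):
    """True only for string URLs; None and NaN mark a missing preview."""
    return isinstance(url, str) and url.startswith("http")


print(safe_filename(7, "Mast Qalandar (Live)!"))     # 007_Mast Qalandar Live.mp3
print(safe_filename(3, "قوالی"))                     # 003_.mp3
print(has_preview(None), has_preview(float("nan")))  # False False
```

The second call shows why the numeric prefix matters: a title without any ASCII letters or digits is reduced to an empty name, and the index still keeps the file name unique.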