# Retrieving the Dataset

## 1. Introduction

This section of the project focuses on retrieving Taylor Swift's music data using the **Spotipy** package. To achieve this, I followed the guide *[How to Extract Any Artist’s Data Using Spotify’s API, Python, and Spotipy](https://betterprogramming.pub/how-to-extract-any-artists-data-using-spotify-s-api-python-and-spotipy-4c079401bc37)* and referred to the Spotipy [documentation](https://spotipy.readthedocs.io/en/2.18.0/).

Once the data was retrieved, I performed some data cleaning to organize the dataset and prepare it for analysis.

**NOTE:** The data retrieval process takes playlists as inputs, and each playlist must contain at most 100 songs. Thus, I divided Taylor Swift's discography into three playlists.

**UPDATED:**

- 02/05/2023 - Added *Midnights (3am Edition)* album

### 1.1 Importing Packages

These are the packages used in this section of the project.

In [1]:
# Import required packages
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pandas as pd
import os

### 1.2 Defining Variables

In order to retrieve data, you must have a **Spotify Developer** [account](https://developer.spotify.com).

Using my developer account, I retrieved the **Client ID** and the **Client Secret** and stored them in my local environment to keep them hidden. I also stored my Spotify username and playlist IDs as variables. A playlist ID can be copied from the link to the respective playlist.

To set environment variables, run the following commands in the terminal, replacing **'key'** and **'secret_key'** with the Client ID and Client Secret, respectively:

`export SPOTIFY_CLIENT_KEY=key`

`export SPOTIFY_CLIENT_SECRET=secret_key`

In [2]:
# Obtain Spotify client IDs, username, and playlist IDs
CLIENT_ID = os.environ.get("SPOTIFY_CLIENT_KEY")
CLIENT_SECRET = os.environ.get("SPOTIFY_CLIENT_SECRET")
SPOTIFY_USERNAME = "johncarlomaula"
PLAYLIST_IDS = ["4ea0fs3hTNhM4DEyTxgbLz", "0T7Q1ITkVJUESKSj1lw2l1", "2JqK8hHOs1wQ2tgtLCZ9KA"]
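
`os.environ.get` returns `None` when a variable is unset, so a missing credential tends to surface only later, at the authentication step. A minimal sketch of an earlier check is shown below. The helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset or empty.
    `require_env` is an illustrative helper, not a Spotipy function.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the assignments above could be written as `CLIENT_ID = require_env("SPOTIFY_CLIENT_KEY")` and `CLIENT_SECRET = require_env("SPOTIFY_CLIENT_SECRET")`.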

## 2. Defining Functions to Retrieve Data

I defined functions that take the playlist IDs as inputs and retrieve the audio features of the songs contained in those playlists.

### 2.1 Retrieving Track IDs

This function iterates through the songs in a playlist and retrieves their respective IDs. It returns a list of song IDs.

In [3]:
# Define function to retrieve track IDs from playlist
def get_track_IDs(user, playlist_id):
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids
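
The function above reads only the tracks included in the first response, which is why each playlist is kept to 100 songs or fewer. As an alternative, the sketch below pages through a playlist of any length using Spotipy's `playlist_items` and `next` methods. The function name `get_all_track_ids` is hypothetical, and `client` is assumed to be an authenticated `spotipy.Spotify` instance such as the `sp` object created in Section 3.1.

```python
def get_all_track_ids(client, playlist_id):
    """Collect track IDs from every page of a playlist.

    `client` is assumed to be an authenticated spotipy.Spotify instance.
    """
    ids = []
    page = client.playlist_items(playlist_id, limit=100)
    while page:
        for item in page["items"]:
            track = item.get("track")
            # Skip entries without a usable ID (e.g. removed tracks or local files)
            if track and track.get("id"):
                ids.append(track["id"])
        # `next` holds the URL of the following page, or None on the last page
        page = client.next(page) if page.get("next") else None
    return ids
```

With this approach a single playlist containing the whole discography would be enough, at the cost of one extra request per additional 100 tracks.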

### 2.2 Extracting Track Features

This function takes a track ID and retrieves the metadata and Spotify audio features of that track. It returns them as a list.

In [4]:
# Define function to extract track metadata and track features
def get_track_features(track_id):
    meta = sp.track(track_id)
    features = sp.audio_features(track_id)

    # Metadata
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']

    # Features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']
    valence = features[0]['valence']

    track = [name, album, artist, release_date, length, popularity, danceability, acousticness, energy,
             instrumentalness, liveness, loudness, speechiness, tempo, time_signature, valence]
    return track
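
`get_track_features` sends two requests per track. Spotipy's `audio_features` also accepts a list of track IDs, and to my understanding the underlying Spotify endpoint allows up to 100 IDs per request, so the audio features could be fetched in batches instead. The sketch below illustrates the idea. The names `chunked` and `get_audio_features_batched` are hypothetical, and `client` is assumed to be an authenticated `spotipy.Spotify` instance.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_audio_features_batched(client, track_ids, batch_size=100):
    """Fetch audio features for many tracks using one request per batch.

    `client` is assumed to be an authenticated spotipy.Spotify instance.
    The 100-ID batch size is an assumption based on the Spotify Web API docs.
    """
    features = []
    for batch in chunked(track_ids, batch_size):
        # One result per requested ID is expected; an entry may be None
        # if Spotify has no audio features for that track
        features.extend(client.audio_features(batch))
    return features
```

For the 247 tracks in this dataset, batching would reduce the audio-feature requests from 247 to 3. The metadata from `sp.track` would still need to be fetched separately.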

## 3. Retrieving the Data

### 3.1 Spotify Authentication

Using my Client ID and Client Secret, I authenticated with the Spotify API to gain access to the Spotify data.

In [5]:
# Call spotify client credentials
client_credentials_manager = SpotifyClientCredentials(CLIENT_ID, CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### 3.2 Retrieving the Data

I called the functions defined in the previous section in order to retrieve the data.

In [6]:
# Call function to retrieve Spotify IDs
IDs = []
for playlist_id in PLAYLIST_IDS:
    IDs += get_track_IDs(SPOTIFY_USERNAME, playlist_id)

# Extract features of each song
tracks = []
for track_id in IDs:
    track = get_track_features(track_id)
    tracks.append(track)

### 3.3 Creating the Dataset

Finally, I stored the dataset inside a dataframe.

In [7]:
# Create dataframe
df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'length', 'popularity',
                                    'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness',
                                    'loudness', 'speechiness', 'tempo', 'time_signature', 'valence'])

## 4. Finalizing the Dataset

Before exporting the dataset, I wanted to perform some simple data cleaning to organize the dataset for analysis.

### 4.1 Examining the Dataset

In [8]:
# View data
df.head()

| | name | album | artist | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tim McGraw | Taylor Swift | Taylor Swift | 2006-10-24 | 232106 | 60 | 0.58 | 0.575 | 0.491 | 0.0 | 0.121 | -6.462 | 0.0251 | 76.009 | 4 | 0.425 |
| 1 | Picture To Burn | Taylor Swift | Taylor Swift | 2006-10-24 | 173066 | 64 | 0.658 | 0.173 | 0.877 | 0.0 | 0.0962 | -2.098 | 0.0323 | 105.586 | 4 | 0.821 |
| 2 | Teardrops On My Guitar - Radio Single Remix | Taylor Swift | Taylor Swift | 2006-10-24 | 203040 | 62 | 0.621 | 0.288 | 0.417 | 0.0 | 0.119 | -6.941 | 0.0231 | 99.953 | 4 | 0.289 |
| 3 | A Place in this World | Taylor Swift | Taylor Swift | 2006-10-24 | 199200 | 51 | 0.576 | 0.051 | 0.777 | 0.0 | 0.32 | -2.881 | 0.0324 | 115.028 | 4 | 0.428 |
| 4 | Cold As You | Taylor Swift | Taylor Swift | 2006-10-24 | 239013 | 52 | 0.418 | 0.217 | 0.482 | 0.0 | 0.123 | -5.769 | 0.0266 | 175.558 | 4 | 0.261 |


In [9]:
# View structure of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247 entries, 0 to 246
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              247 non-null    object 
 1   album             247 non-null    object 
 2   artist            247 non-null    object 
 3   release_date      247 non-null    object 
 4   length            247 non-null    int64  
 5   popularity        247 non-null    int64  
 6   danceability      247 non-null    float64
 7   acousticness      247 non-null    float64
 8   energy            247 non-null    float64
 9   instrumentalness  247 non-null    float64
 10  liveness          247 non-null    float64
 11  loudness          247 non-null    float64
 12  speechiness       247 non-null    float64
 13  tempo             247 non-null    float64
 14  time_signature    247 non-null    int64  
 15  valence           247 non-null    float64
dtypes: float64(9), int64(3), object(4)
memory us

In [10]:
# Check for missing values
df.isna().values.any()

False

### 4.2 Cleaning the Dataset

I performed some data cleaning on the dataset, including:

- Replacing curly apostrophes with straight ones for consistency and to facilitate future string searches
- Grouping tracks into albums to match the track listing of the Taylor's Version releases (e.g., *Ronan* is included on *Red (Taylor's Version)* but not on the original)
- Renaming album titles for readability

In [11]:
# Replace curly apostrophes with straight ones for consistency and to facilitate string searches
df = df.replace("’", "'", regex = True)

In [12]:
# Group Fearless tracks
df['album'] = df['album'].replace(['Fearless Platinum Edition',
                                   'Today Was A Fairytale'], 'Fearless')

In [13]:
# Group Red tracks
df['album'] = df['album'].replace(['Red (Deluxe Edition)',
                                   'Ronan',
                                   'The Breaker',
                                   'Babe (feat. Taylor Swift)'], 'Red')

In [14]:
# Group miscellaneous tracks
df['album'] = df['album'].replace(['Sweeter Than Fiction',
                                  'The Hunger Games: Songs From District 12 And Beyond',
                                  'Only The Young (Featured in Miss Americana)',
                                  'Beautiful Ghosts (From The Motion Picture "Cats")',
                                  'Christmas Tree Farm'], 'Other')


In [15]:
# Rename album titles for readability
df['album'] = df['album'].replace(['Speak Now (Deluxe Edition)'], 'Speak Now')
df['album'] = df['album'].replace(['1989 (Deluxe Edition)'], '1989')
df['album'] = df['album'].replace(['folklore (deluxe version)'], 'folklore')
df['album'] = df['album'].replace(['evermore (deluxe version)'], 'evermore')
df['album'] = df['album'].replace(['Midnights (3am Edition)'], 'Midnights')
df['album'] = df['album'].replace(['This Love (Taylor\'s Version)'], '1989 (Taylor\'s Version)')
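
The grouping and renaming steps above can also be expressed as a single lookup table, which keeps every rule in one place. The sketch below is written in plain Python so it can run without pandas. The names `ALBUM_MAP` and `normalize_album` are hypothetical, and the entries are copied from the cells above.

```python
# Mapping from Spotify album names to the grouped names used for analysis
ALBUM_MAP = {
    'Fearless Platinum Edition': 'Fearless',
    'Today Was A Fairytale': 'Fearless',
    'Red (Deluxe Edition)': 'Red',
    'Ronan': 'Red',
    'The Breaker': 'Red',
    'Babe (feat. Taylor Swift)': 'Red',
    'Sweeter Than Fiction': 'Other',
    'The Hunger Games: Songs From District 12 And Beyond': 'Other',
    'Only The Young (Featured in Miss Americana)': 'Other',
    'Beautiful Ghosts (From The Motion Picture "Cats")': 'Other',
    'Christmas Tree Farm': 'Other',
    'Speak Now (Deluxe Edition)': 'Speak Now',
    '1989 (Deluxe Edition)': '1989',
    'folklore (deluxe version)': 'folklore',
    'evermore (deluxe version)': 'evermore',
    'Midnights (3am Edition)': 'Midnights',
    "This Love (Taylor's Version)": "1989 (Taylor's Version)",
}


def normalize_album(name):
    """Return the grouped album name for a raw Spotify album name."""
    # Straighten curly apostrophes first so the lookup keys match
    straight = name.replace("\u2019", "'")
    # Albums without an entry in the mapping are returned unchanged
    return ALBUM_MAP.get(straight, straight)
```

In pandas, the same mapping could be applied in one step with `df['album'] = df['album'].replace(ALBUM_MAP)`, provided the apostrophes have already been straightened as in cell 11.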

### 4.3 Exporting the Final Dataset

Finally, I exported the dataset to the CSV file *swift_data.csv* in the *data* directory.

In [16]:
df.to_csv('data/swift_data.csv', sep = ',', index = False)
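
As far as I know, `to_csv` does not create missing directories, so the export fails if the `data` folder does not exist yet. A minimal sketch of a guard is shown below. The helper name `ensure_parent_dir` is hypothetical.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if needed and return the path."""
    path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes this safe to repeat
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

The export could then be written as `df.to_csv(ensure_parent_dir('data/swift_data.csv'), index=False)`.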