### Spotipy - Get Playlist Content

This notebook extracts and processes the data for every song in the playlist "A Jukebox do Torres".

In [1]:
# !pip install spotipy --user
# !pip install wordcloud --user
# !pip install pandas --user
# !pip install numpy --user
# !pip install matplotlib --user
# !pip install seaborn --user
# !pip install python-dotenv --user
# !pip install openpyxl --user
# !pip install pyarrow --user

# The lines below were needed to work around bugs in the latest spotipy releases
# !pip install git+https://github.com/plamere/spotipy.git --upgrade
# !pip install urllib3 --upgrade 
# !pip install requests --upgrade 
# !pip install spotipy --upgrade

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth, SpotifyClientCredentials
import pandas as pd
import time
import numpy as np
from functions.spotipyTools import *
from dotenv import load_dotenv
import os
import openpyxl

In [2]:
load_dotenv()

True

In [3]:
# Instructions for replicating this step will be included in the repository README file

e_client_id = os.environ["client_id"]
e_client_key = os.environ["client_key"]
e_playlist_id = os.environ["playlist_id"]
e_user_id = os.environ["user_id"]

client_credentials_manager = SpotifyClientCredentials(e_client_id, e_client_key)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
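
The cell above raises a bare `KeyError` on the first variable missing from the `.env` file. A minimal sketch of a more explicit check is shown below; the helper name `read_settings` is hypothetical and is not part of this repository, while the four variable names are the ones read above.

```python
import os

# Variable names taken from the cell above
REQUIRED_KEYS = ("client_id", "client_key", "playlist_id", "user_id")


def read_settings(env=None):
    """Return the required settings as a dict, or fail listing every missing key.

    `read_settings` is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `read_settings()` after `load_dotenv()` would then report all missing variables at once instead of stopping at the first one.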

First, we use spotipy to fetch the IDs of all tracks in the playlist, along with the date each one was added. The helper functions come from the _functions_ folder.

In [4]:
all_tracks, dates_added = get_playlist_info(e_user_id, e_playlist_id, sp)

track_info_list = []

for item in all_tracks:
  track_info_list.append(get_track_info(item, sp))

df_tracks = pd.DataFrame(track_info_list, columns=['name', 'artist_name', 'album_name', 'album_date', 'album_popularity', 'track_duration', 'danceability', 'energy', 'instrumentalness', 'liveness', 'tempo','artist_url'])

df_tracks['date_added'] = dates_added
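
The implementation of `get_playlist_info` lives in the _functions_ folder and is not shown here. As a rough sketch of how such a helper could page through a playlist, the hypothetical function below relies on spotipy's `playlist_items` and `next` methods; the function name and the decision to skip items without a track ID are assumptions, not the repository's actual code.

```python
def collect_playlist_tracks(sp, playlist_id):
    """Return (track_ids, dates_added) for every track in a playlist.

    Hypothetical stand-in for get_playlist_info; `sp` is a spotipy.Spotify client.
    """
    track_ids, dates_added = [], []
    # The API returns the playlist in pages of at most 100 items
    page = sp.playlist_items(playlist_id, limit=100)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track or not track.get("id"):
                continue  # assumption: skip removed or local tracks that have no ID
            track_ids.append(track["id"])
            dates_added.append(item["added_at"])
        # Follow the "next" link until the last page has been read
        page = sp.next(page) if page.get("next") else None
    return track_ids, dates_added
```

With the client created earlier, `collect_playlist_tracks(sp, e_playlist_id)` would return two lists of equal length, which keeps the track IDs and the `date_added` values aligned.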

In [5]:
df_tracks.dtypes

name                 object
artist_name          object
album_name           object
album_date           object
album_popularity      int64
track_duration        int64
danceability        float64
energy              float64
instrumentalness    float64
liveness            float64
tempo               float64
artist_url           object
date_added           object
dtype: object

In [7]:
df_tracks.head()

Unnamed: 0,name,artist_name,album_name,album_date,album_popularity,track_duration,danceability,energy,instrumentalness,liveness,tempo,artist_url,date_added
0,Panteão,Linda Martini,Turbo Lento,2013-01-01,23,252340,0.352,0.911,0.647,0.503,158.025,https://open.spotify.com/artist/4Pv6qAkea25i2D...,2013-10-06T11:12:10Z
1,purr,Tides From Nebula,Aura,2009-01-01,0,259839,0.115,0.646,0.957,0.367,167.241,https://open.spotify.com/artist/1CzKORB9IN0EjP...,2013-10-06T11:12:29Z
2,Dew,Chon,Newborn Sun,2013-06-11,35,195242,0.282,0.897,0.813,0.202,100.089,https://open.spotify.com/artist/2JFljHPanIjYy2...,2013-10-06T11:12:31Z
3,Headache,Metz,METZ,2012-10-09,0,138800,0.248,0.942,0.0321,0.611,154.834,https://open.spotify.com/artist/18TNVFTJ6Wfeic...,2013-10-06T11:12:39Z
4,Wires,Red Fang,Murder the Mountains,2011-04-12,0,343306,0.178,0.894,0.311,0.257,132.86,https://open.spotify.com/artist/3u4HBuoQ4dgPBz...,2013-10-06T11:12:57Z


Convert the date columns to proper datetime types.

In [8]:
df_tracks['album_year'] = pd.to_datetime(df_tracks['album_date'], format='%Y-%m-%d').dt.year
df_tracks['date_added'] = pd.to_datetime(df_tracks['date_added'].str[:10], format='%Y-%m-%d')
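
Spotify reports album release dates with year, month, or day precision, so a raw value can be `2013` or `2013-06` rather than a full date, and a strict `%Y-%m-%d` format would reject those. Whether `get_track_info` already normalizes them is not visible here. The sketch below, using a hypothetical `pad_release_date` helper and made-up sample values, shows one way to pad partial dates before parsing.

```python
import pandas as pd


def pad_release_date(value):
    """Pad 'YYYY' or 'YYYY-MM' to 'YYYY-MM-DD' using 01 for the missing parts."""
    parts = value.split("-")
    parts += ["01"] * (3 - len(parts))
    return "-".join(parts)


# Illustrative values only, one per precision level
album_dates = pd.Series(["2013", "2013-06", "2012-10-09"])
album_year = pd.to_datetime(album_dates.map(pad_release_date), format="%Y-%m-%d").dt.year

# added_at timestamps are ISO 8601 strings; the first 10 characters hold the date
added_at = pd.Series(["2013-10-06T11:12:10Z"])
date_added = pd.to_datetime(added_at.str[:10], format="%Y-%m-%d")
```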

Track durations are retrieved in milliseconds, so they need to be converted to minutes and seconds. Keeping the minutes and seconds in a proper time format may be useful later for display purposes. A column with the total number of seconds is also created for statistical purposes.

In [10]:
# Split the duration (in milliseconds) into whole minutes and leftover seconds
track_minutes = np.floor(df_tracks['track_duration']/1000/60).astype(int).astype(str)
track_seconds = np.mod(df_tracks['track_duration']/1000,60).astype(int).astype(str)
# Series.append was removed in pandas 2.0; join the two string Series element-wise instead
track_time = track_minutes + ':' + track_seconds
df_tracks['track_duration'] = pd.to_datetime(track_time, format='%M:%S').dt.time
# Total duration in seconds, for statistics
df_tracks['track_duration_secs'] = df_tracks['track_duration'].apply(lambda x: x.minute * 60 + x.second)
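
As an alternative sketch, the same conversion can be done with integer arithmetic alone, which avoids the round trip through `pd.to_datetime` and also copes with tracks of 60 minutes or more (the `%M` directive only accepts 0 to 59). The first three sample durations come from the rows shown earlier; the last one is an invented value for a track longer than an hour.

```python
import pandas as pd

# Durations in milliseconds; the last value is a made-up track longer than one hour
duration_ms = pd.Series([252340, 259839, 138800, 3725000])

# Drop the fractional second, then split into minutes and leftover seconds
total_secs = duration_ms // 1000
minutes = total_secs // 60
seconds = total_secs % 60

# Zero-pad the seconds so that 5 seconds is displayed as "05"
labels = minutes.astype(str) + ":" + seconds.astype(str).str.zfill(2)
```

Here `total_secs` plays the role of `track_duration_secs`, and `labels` is a display string rather than a `datetime.time` object.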



Export to both formats:
- xlsx: for visualization software
- parquet: for faster loading in further analysis

In [None]:
df_tracks.to_excel('PlaylistTracks.xlsx', engine='openpyxl')
df_tracks.to_parquet('PlaylistTracks.parquet', engine='pyarrow')