# Spotify Data Extraction

This notebook extracts my listening activity and song data from the Spotify API, transforms it into tables, and loads them into a PostgreSQL database.

## Spotify API Access
To fetch data from Spotify, go to https://developer.spotify.com/dashboard/login, sign in as a Spotify Developer, and create an application.

In the application, there are some identifiers you'll need for an OAuth flow:

1. Under **Show Client Secret**:
 - Client ID
 - Client Secret

2. In **Edit Settings**:
 - Redirect URIs (You can add http://localhost:8888/callback if you run on localhost)
 
 
*Note that you will need an ordinary Spotify account to log in, create an app, and get the credentials.*
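
A quick check like the following can confirm that the credentials are visible to the notebook before any API call is made. It is only a minimal sketch: `missing_env_vars` is a hypothetical helper, and the variable names are the ones the configuration cell in this notebook reads from the environment.

```python
import os

# Variable names as read by the configuration cell of this notebook.
REQUIRED_VARS = [
    "SPOTIFY_CLIENT_ID",
    "SPOTIFY_CLIENT_SECRET",
    "SPOTIFY_REDIRECT_URL",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to try without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with a made-up environment: the redirect URL is missing.
example_env = {"SPOTIFY_CLIENT_ID": "abc", "SPOTIFY_CLIENT_SECRET": "def"}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` without the second argument checks the real environment, for example after `load_dotenv` has run.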

In [100]:
import os
from dotenv import load_dotenv, find_dotenv

import pandas as pd
import numpy as np
import datetime
# Import spotify wrapper
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from sqlalchemy import create_engine

load_dotenv(find_dotenv())

True

In [101]:
# Insert your Spotify credentials from Spotify Developer
SPOTIFY_CLIENT_ID = os.environ['SPOTIFY_CLIENT_ID']
SPOTIFY_CLIENT_SECRET = os.environ['SPOTIFY_CLIENT_SECRET']
REDIRECT_URL = os.environ['SPOTIFY_REDIRECT_URL']

# Insert your PostgreSQL database configuration
HOST=os.environ['DB_HOST']
DATABASE=os.environ['DB_NAME']
USER=os.environ['DB_USERNAME']
PASSWORD=os.environ['DB_PASSWORD']

In [102]:
# Create an Authorization Code flow and get the token
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
        client_id=SPOTIFY_CLIENT_ID,
        client_secret=SPOTIFY_CLIENT_SECRET,
        redirect_uri=REDIRECT_URL,
        scope="user-read-recently-played"
    ))

## Perform extraction and transformation for each table


### Played_song Table 
The [endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-recently-played) is used to get the recently played songs. It returns at most the 50 most recent tracks.


In [103]:
today = datetime.datetime.now()
ndays_ago = today - datetime.timedelta(days=30)
ndays_ago = int(ndays_ago.timestamp()) * 1000

# Structure the dictionary to store the data
songs_dict = {
    "song_id": [],
    "album_id": [],
    "played_at": [],
}

# Get user's recently-played-tracks data
recently_played_songs = sp.current_user_recently_played(after=ndays_ago)

if len(recently_played_songs['items']) != 0:
    for song in recently_played_songs['items']:
        songs_dict['song_id'].append(song["track"]['id'])
        songs_dict['album_id'].append(song["track"]["album"]["id"])
        songs_dict['played_at'].append(song["played_at"])
        
else:
    print("No records of listening")
    
played_songs_df = pd.DataFrame(songs_dict)
played_songs_df.head()

Unnamed: 0,song_id,album_id,played_at
0,7iwunjxHhRFfqKder6oDWK,2CDUJGW1bXZFX36DOvMeib,2022-06-17T07:31:19.783Z
1,02VBYrHfVwfEWXk5DXyf0T,1YgekJJTEueWDaMr7BYqPk,2022-06-17T06:35:06.070Z
2,1H7OG9HvaA6ykBRBxADJZi,0AK8S7tGGL6Cj4ED2Nq2KQ,2022-06-17T06:31:03.411Z
3,58wyJLv6yH1La9NIZPl3ne,0MGcjBIFcL2qaCrgGjIGFb,2022-06-17T06:27:17.209Z
4,6Zg0jDwwyywj8nFz8FJoRS,5ynDNXm908XzgoHBKEkNDI,2022-06-17T06:22:29.305Z


In [104]:
# Check missing values
if played_songs_df.isnull().values.any():
    raise Exception("Null value found in played songs data")
    
timestamp_df = pd.to_datetime(played_songs_df['played_at'])
# Shift the UTC timestamps by -5 hours (local time at UTC-05:00)
timestamp_df = timestamp_df + datetime.timedelta(hours=-5)
# Use the Unix time (in ms) of played_at as the primary key, since only one song can be played at a time
played_songs_table_pk = timestamp_df.astype(np.int64) // 10**6
# Format the timestamp
timestamp_df = timestamp_df.dt.strftime("%m-%d-%Y %H:%M:%S")
played_songs_df['played_at'] = timestamp_df
played_songs_df.insert(0, "played_song_id", played_songs_table_pk)

played_songs_df.head()

Unnamed: 0,played_song_id,song_id,album_id,played_at
0,1655433079783,7iwunjxHhRFfqKder6oDWK,2CDUJGW1bXZFX36DOvMeib,06-17-2022 02:31:19
1,1655429706070,02VBYrHfVwfEWXk5DXyf0T,1YgekJJTEueWDaMr7BYqPk,06-17-2022 01:35:06
2,1655429463411,1H7OG9HvaA6ykBRBxADJZi,0AK8S7tGGL6Cj4ED2Nq2KQ,06-17-2022 01:31:03
3,1655429237209,58wyJLv6yH1La9NIZPl3ne,0MGcjBIFcL2qaCrgGjIGFb,06-17-2022 01:27:17
4,1655428949305,6Zg0jDwwyywj8nFz8FJoRS,5ynDNXm908XzgoHBKEkNDI,06-17-2022 01:22:29
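
In the cell above, the key is computed after the 5-hour shift, so it is the epoch value of the shifted wall-clock time rather than of the original UTC instant. One alternative, sketched here with the standard library only, is to keep the timestamp timezone-aware, derive the key from the UTC instant, and convert to local time just for display. The fixed UTC-05:00 offset mirrors the shift used in this notebook, the function names are hypothetical, and the input format is assumed to match the `played_at` strings shown earlier.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset, mirroring the -5 hour shift in the notebook.
LOCAL_TZ = timezone(timedelta(hours=-5))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_played_at(value):
    """Parse a UTC timestamp such as '2022-06-17T07:31:19.783Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def to_epoch_ms(moment):
    """Whole milliseconds since the Unix epoch, using integer arithmetic."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


played = parse_played_at("2022-06-17T07:31:19.783Z")
local = played.astimezone(LOCAL_TZ)

print(local.strftime("%m-%d-%Y %H:%M:%S"))  # 06-17-2022 02:31:19
print(to_epoch_ms(played))                  # 1655451079783
```

The key then stays the same no matter which display offset is chosen later.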


### Album Table

The [get several albums endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-multiple-albums) is used to retrieve information about albums.
The endpoint accepts at most 20 IDs per request. To work within this limit, a nested loop calls the API with up to 20 IDs each time.
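
The batching idea can be sketched on its own, without any API call. `chunked` is a hypothetical helper and the IDs are placeholders, not real Spotify IDs.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 45 placeholder IDs split into batches of at most 20.
album_ids = [f"album_{n}" for n in range(45)]
batch_sizes = [len(batch) for batch in chunked(album_ids, 20)]
print(batch_sizes)  # [20, 20, 5]
```

The same helper would apply to any endpoint with a per-request ID limit by changing `size`.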

In [105]:
# Structure the dictionary to store the data
albums_dict = {
    'album_id': [],
    'album_name': [],
    'album_type': [],
    'popularity': [],
    'release_year': [],
    'external_url': []
}

album_ids_list = played_songs_df['album_id'].unique().tolist()

for i in range(0, len(album_ids_list), 20):
    albums_data = sp.albums(album_ids_list[i: i+20])
    
    for album in albums_data['albums']:
        albums_dict['album_id'].append(album['id'])
        albums_dict['album_name'].append(album['name'])
        albums_dict['album_type'].append(album['album_type'])
        albums_dict['popularity'].append(album['popularity'])
        albums_dict['release_year'].append(album['release_date'])
        albums_dict['external_url'].append(album['external_urls']['spotify'])

albums_df = pd.DataFrame(albums_dict)

In [106]:
# Transform the date to year
albums_df['release_year'] = pd.to_datetime(albums_df['release_year'])
albums_df['release_year'] = albums_df['release_year'].dt.year
albums_df.head()

Unnamed: 0,album_id,album_name,album_type,popularity,release_year,external_url
0,2CDUJGW1bXZFX36DOvMeib,Chẳng làm gì hết,single,8,2021,https://open.spotify.com/album/2CDUJGW1bXZFX36...
1,1YgekJJTEueWDaMr7BYqPk,An Evening With Silk Sonic,album,75,2021,https://open.spotify.com/album/1YgekJJTEueWDaM...
2,0AK8S7tGGL6Cj4ED2Nq2KQ,Comeback,single,11,2020,https://open.spotify.com/album/0AK8S7tGGL6Cj4E...
3,0MGcjBIFcL2qaCrgGjIGFb,Can We Kiss Forever?,single,61,2018,https://open.spotify.com/album/0MGcjBIFcL2qaCr...
4,5ynDNXm908XzgoHBKEkNDI,Quá Lâu,single,29,2019,https://open.spotify.com/album/5ynDNXm908XzgoH...


### Song Table - Dimension Table

This [endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-tracks) is used to extract song information. The limit for this endpoint is 50 IDs per request.

In [107]:
# Get the list of unique song IDs
song_ids_list = played_songs_df['song_id'].unique().tolist()

In [108]:
# Structure the dictionary to store the data
songs_dict = {
    'song_id': [],
    'song_name' : [],
    'album_id': [],
    'duration_ms': [],
    'popularity': [],
    'external_url': []
}

for i in range(0, len(song_ids_list), 50):
    # Get multiple songs data
    songs_data = sp.tracks(tracks=song_ids_list[i:i+50], market=None)
    
    for song in songs_data['tracks']:
        songs_dict['song_id'].append(song['id'])
        songs_dict['song_name'].append(song['name'])
        songs_dict['album_id'].append(song['album']['id'])
        songs_dict['popularity'].append(song['popularity'])
        songs_dict['duration_ms'].append(song['duration_ms'])
        songs_dict['external_url'].append(song['external_urls']['spotify'])

songs_df = pd.DataFrame(songs_dict)

In [109]:
# Check missing values
if songs_df.isnull().values.any():
    raise Exception("Null value found in songs data")

In [110]:
songs_df.head()

Unnamed: 0,song_id,song_name,album_id,duration_ms,popularity,external_url
0,7iwunjxHhRFfqKder6oDWK,Chẳng làm gì hết,2CDUJGW1bXZFX36DOvMeib,248095,21,https://open.spotify.com/track/7iwunjxHhRFfqKd...
1,02VBYrHfVwfEWXk5DXyf0T,Leave The Door Open,1YgekJJTEueWDaMr7BYqPk,242096,85,https://open.spotify.com/track/02VBYrHfVwfEWXk...
2,1H7OG9HvaA6ykBRBxADJZi,Comeback,0AK8S7tGGL6Cj4ED2Nq2KQ,218630,25,https://open.spotify.com/track/1H7OG9HvaA6ykBR...
3,58wyJLv6yH1La9NIZPl3ne,Can We Kiss Forever?,0MGcjBIFcL2qaCrgGjIGFb,187931,77,https://open.spotify.com/track/58wyJLv6yH1La9N...
4,6Zg0jDwwyywj8nFz8FJoRS,Quá Lâu,5ynDNXm908XzgoHBKEkNDI,177169,44,https://open.spotify.com/track/6Zg0jDwwyywj8nF...


### Song_Artists Table

In [111]:
# Structure the dictionary to store the data
song_artists_dict = {
    'song_id': [],
    'artist_id': []
}

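# songs_data holds only the last batch fetched in the Song Table loop,
# which covers every song while there are at most 50 unique IDs
for song in songs_data['tracks']: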
    for artist in song['artists']:
        song_artists_dict['song_id'].append(song['id'])
        song_artists_dict['artist_id'].append(artist['id'])

song_artists_df = pd.DataFrame(song_artists_dict)
song_artists_df

Unnamed: 0,song_id,artist_id
0,7iwunjxHhRFfqKder6oDWK,0Hxo49G2Y9jzrRGqQeok1m
1,02VBYrHfVwfEWXk5DXyf0T,0du5cEVh5yTK9QJze8zA0C
2,02VBYrHfVwfEWXk5DXyf0T,3jK9MiCrA42lLAdMGUZpwa
3,02VBYrHfVwfEWXk5DXyf0T,6PvvGcCY2XtUcSRld1Wilr
4,1H7OG9HvaA6ykBRBxADJZi,61GRLAWKX8C8l7pLwqJgny
...,...,...
67,2XCkJN8DibL2PyXAmoDkc6,7185Q95lPFld0aoPqO6e0U
68,62YUdF3m6k3PtHen731G6j,7LWdcPFBFcRaamGjIJbPV7
69,62YUdF3m6k3PtHen731G6j,1EpXwbpQDflfGg6juJz89j
70,6pjtGdjPvJIzV9ADVPY568,6sPQwc6lix6K1Gv64v91Ml
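
The cell above reads `songs_data`, which holds the most recent response from the batched track loop. If there were ever more than 50 unique songs, earlier batches would need to be kept as well. A minimal sketch of collecting pairs across several batches follows; `song_artist_pairs` is a hypothetical helper and the payloads are made up, keeping only the `id` and `artists` fields used in this notebook.

```python
def song_artist_pairs(track_batches):
    """Collect unique (song_id, artist_id) pairs from several track batches."""
    pairs = []
    seen = set()
    for batch in track_batches:
        for track in batch:
            for artist in track["artists"]:
                pair = (track["id"], artist["id"])
                if pair not in seen:  # skip pairs already collected
                    seen.add(pair)
                    pairs.append(pair)
    return pairs


# Made-up payloads: two batches, one artist shared between songs.
batches = [
    [{"id": "song_a", "artists": [{"id": "artist_1"}, {"id": "artist_2"}]}],
    [{"id": "song_b", "artists": [{"id": "artist_1"}]}],
]
pairs = song_artist_pairs(batches)
print(pairs)
```

The resulting list of tuples can be turned into a two-column table with the column names `song_id` and `artist_id`.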


### Song_Feature 

This [endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features) is used to extract the audio features for each song. The limit is 100 IDs per request.

In [112]:
# Get audio features data
audio_features = sp.audio_features(tracks=song_ids_list)

In [113]:
# Structure the dictionary to store the data
feature_dict = {
        "song_id": [],
        "danceability": [],
        "energy": [],
        "loudness": [],
        "speechiness": [],
        "acousticness": [],
        "instrumentalness": [],
        "liveness": [],
        "valence": [],
        "tempo": [],
}
for i in range(0, len(song_ids_list), 100):
    # Get audio features data
    audio_features = sp.audio_features(tracks=song_ids_list[i:i+100])
    
    for song in audio_features:
        if song is not None:
            feature_dict['song_id'].append(song['id'])
            feature_dict['danceability'].append(song['danceability'])
            feature_dict['energy'].append(song['energy'])
            feature_dict['loudness'].append(song['loudness'])
            feature_dict['speechiness'].append(song['speechiness'])
            feature_dict['acousticness'].append(song['acousticness'])
            feature_dict['instrumentalness'].append(song['instrumentalness'])
            feature_dict['liveness'].append(song['liveness'])
            feature_dict['valence'].append(song['valence'])
            feature_dict['tempo'].append(song['tempo'])

audio_features_df = pd.DataFrame(feature_dict)

In [114]:
# Check missing values
if audio_features_df.isnull().values.any():
    raise Exception("Null value found in song feature data")

In [115]:
audio_features_df.head()

Unnamed: 0,song_id,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,7iwunjxHhRFfqKder6oDWK,0.911,0.787,-4.178,0.0438,0.49,2e-05,0.117,0.967,126.008
1,02VBYrHfVwfEWXk5DXyf0T,0.586,0.616,-7.964,0.0324,0.182,0.0,0.0927,0.719,148.088
2,1H7OG9HvaA6ykBRBxADJZi,0.374,0.614,-7.639,0.0498,0.282,0.0,0.118,0.501,105.917
3,58wyJLv6yH1La9NIZPl3ne,0.522,0.128,-18.717,0.0357,0.894,0.026,0.0941,0.124,109.986
4,6Zg0jDwwyywj8nFz8FJoRS,0.707,0.912,-4.061,0.0859,0.183,0.0,0.102,0.944,106.011
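
The extraction loop in this section skips entries that come back as `None`, so some songs may end up without a feature row. The sketch below shows one way to record which IDs were skipped. It assumes the response list is in the same order as the requested IDs, which is an assumption rather than something this notebook verifies, and `split_features` is a hypothetical helper working on made-up data.

```python
def split_features(requested_ids, features):
    """Separate usable feature records from the IDs that returned None.

    Assumes `features` is aligned position by position with `requested_ids`.
    """
    found = [item for item in features if item is not None]
    missing = [
        song_id
        for song_id, item in zip(requested_ids, features)
        if item is None
    ]
    return found, missing


# Made-up response: the second song has no audio features.
requested = ["song_a", "song_b", "song_c"]
response = [{"id": "song_a", "tempo": 120.0}, None, {"id": "song_c", "tempo": 98.5}]

found, missing = split_features(requested, response)
print(missing)  # ['song_b']
```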


### Artist - Dimension Table

This [endpoint](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-multiple-artists) is used to extract artist data. The limit is 50 artists per request.

In [116]:
# Get the list of unique artist IDs
artist_ids_list = song_artists_df['artist_id'].unique().tolist()

In [117]:
# Get multiple artists' data and structure the dataframe for the Artist Table
artists_dict = {
    'artist_id': [],
    'artist_name': [],
    'followers': [],
    'popularity': [],
    'external_url': []
}

for i in range(0, len(artist_ids_list), 50):
    artist_data = sp.artists(artist_ids_list[i:i+50])

    for artist in artist_data['artists']:
        artists_dict['artist_id'].append(artist['id'])
        artists_dict['artist_name'].append(artist['name'])
        artists_dict['followers'].append(artist['followers']['total'])
        artists_dict['popularity'].append(artist['popularity'])
        artists_dict['external_url'].append(artist['external_urls']['spotify'])

artists_df = pd.DataFrame(artists_dict)

In [118]:
# Check missing values
if artists_df.isnull().values.any():
    raise Exception("Null value found in artist data")

In [119]:
artists_df.head()

Unnamed: 0,artist_id,artist_name,followers,popularity,external_url
0,0Hxo49G2Y9jzrRGqQeok1m,Vinh Khuat,10335,30,https://open.spotify.com/artist/0Hxo49G2Y9jzrR...
1,0du5cEVh5yTK9QJze8zA0C,Bruno Mars,41116353,86,https://open.spotify.com/artist/0du5cEVh5yTK9Q...
2,3jK9MiCrA42lLAdMGUZpwa,Anderson .Paak,2185443,77,https://open.spotify.com/artist/3jK9MiCrA42lLA...
3,6PvvGcCY2XtUcSRld1Wilr,Silk Sonic,684408,73,https://open.spotify.com/artist/6PvvGcCY2XtUcS...
4,61GRLAWKX8C8l7pLwqJgny,Aaron Kellim,1217,30,https://open.spotify.com/artist/61GRLAWKX8C8l7...


### Time - Dimension Table

In [120]:
# Get time data and structure dataframe for Time Table
played_songs_df['played_at'] = pd.to_datetime(played_songs_df['played_at'])
time_df = pd.DataFrame({"start_time": played_songs_df['played_at'],
                        "time": played_songs_df['played_at'].astype(str).str[11:],
                        "day": played_songs_df['played_at'].dt.day,
                        "month": played_songs_df['played_at'].dt.month,
                        "year": played_songs_df['played_at'].dt.year,
                        "weekday": played_songs_df['played_at'].dt.weekday})

time_df.head()

Unnamed: 0,start_time,time,day,month,year,weekday
0,2022-06-17 02:31:19,02:31:19,17,6,2022,4
1,2022-06-17 01:35:06,01:35:06,17,6,2022,4
2,2022-06-17 01:31:03,01:31:03,17,6,2022,4
3,2022-06-17 01:27:17,01:27:17,17,6,2022,4
4,2022-06-17 01:22:29,01:22:29,17,6,2022,4
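
For reference, the same decomposition can be sketched for a single timestamp with the standard library. `time_parts` is a hypothetical helper. The weekday numbering matches the table above, where Monday is 0 and Sunday is 6.

```python
from datetime import datetime


def time_parts(moment):
    """Break a datetime into the columns used by the Time table."""
    return {
        "start_time": moment,
        "time": moment.strftime("%H:%M:%S"),
        "day": moment.day,
        "month": moment.month,
        "year": moment.year,
        "weekday": moment.weekday(),  # Monday is 0, Sunday is 6
    }


row = time_parts(datetime(2022, 6, 17, 2, 31, 19))
print(row["time"], row["weekday"])  # 02:31:19 4 (a Friday)
```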


## Load transformed data into database

To load the data into the database, first create a temporary table and use the `.to_sql` method to load data into it. Then insert the rows from the temporary table into the base table, skipping rows whose primary key already exists.
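
The staging pattern can be tried in isolation with an in-memory SQLite database from the standard library. This is only a sketch of the idea: the notebook itself targets PostgreSQL through SQLAlchemy, and the two-column `album` table and its rows here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base table with a primary key and one row already stored.
conn.execute("CREATE TABLE album (album_id TEXT PRIMARY KEY, album_name TEXT)")
conn.execute("INSERT INTO album VALUES ('a1', 'Already stored')")

# Staging table with the same columns but no key constraint.
conn.execute("CREATE TEMP TABLE temp_album AS SELECT * FROM album LIMIT 0")
conn.executemany(
    "INSERT INTO temp_album VALUES (?, ?)",
    [("a1", "Already stored"), ("a2", "New album")],
)

# Copy only the rows whose key is not in the base table yet.
conn.execute("""
    INSERT INTO album
    SELECT * FROM temp_album
    WHERE temp_album.album_id NOT IN (SELECT album_id FROM album)
""")
conn.commit()

stored = [row[0] for row in conn.execute("SELECT album_id FROM album ORDER BY album_id")]
print(stored)  # ['a1', 'a2']
conn.close()
```

Inserting the staged rows directly would fail on `a1` with a primary key violation, which is what the `NOT IN` filter avoids.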

In [121]:
# Connect to the PostgreSQL database
engine = create_engine('postgresql+psycopg2://{username}:{password}@{host}/{database}'.format(username=USER, 
                                                                                            password=PASSWORD,
                                                                                            host=HOST,
                                                                                            database='SpotifyListeningHistory'))
conn = engine.raw_connection()
cur = conn.cursor()

In [122]:
%%time
# Album Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_album AS 
    SELECT * FROM album LIMIT 0;
''')
albums_df.to_sql("temp_album", con=engine, if_exists='replace', index=False)
engine.execute('''
    INSERT INTO public.album
    SELECT * FROM temp_album
    WHERE temp_album.album_id NOT IN (SELECT album_id FROM public.album);
''')

Wall time: 327 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495d241f48>

In [123]:
%%time
# Song Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_song AS 
    SELECT * FROM song LIMIT 0;
''')
songs_df.to_sql("temp_song", con=engine, if_exists='replace', index=False)

engine.execute('''
    INSERT INTO public.song
    SELECT * FROM temp_song
    WHERE temp_song.song_id NOT IN (SELECT song_id FROM public.song);
''')

Wall time: 128 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495cdb7348>

In [124]:
%%time
# Artist Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_artist AS 
    SELECT * FROM artist LIMIT 0;
''')
artists_df.to_sql("temp_artist", con=engine, schema='public', if_exists='replace', index=False)
engine.execute('''
    INSERT INTO public.artist
    SELECT * FROM temp_artist
    WHERE temp_artist.artist_id NOT IN (SELECT artist_id FROM public.artist);
''')

Wall time: 124 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495cce0248>

In [125]:
%%time
# Song_Artist Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_song_artists AS 
    SELECT * FROM song_artists LIMIT 0;
''')
song_artists_df.to_sql("temp_song_artists", con=engine, schema='public', if_exists='replace', index=False)
engine.execute('''
    INSERT INTO public.song_artists
    SELECT * FROM temp_song_artists tsa
    WHERE NOT EXISTS (SELECT *
                     FROM song_artists sa
                     WHERE tsa.song_id = sa.song_id AND tsa.artist_id = sa.artist_id);
''')

Wall time: 117 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495cea6348>

In [126]:
%%time
# Time Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_time AS 
    SELECT * FROM Time LIMIT 0;
''')
time_df.to_sql("temp_time", con=engine, schema='public', if_exists='replace', index=False)
engine.execute('''
    INSERT INTO public.time
    SELECT * FROM temp_time
    WHERE temp_time.start_time NOT IN (SELECT start_time FROM public.time);
''')

Wall time: 127 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495cea6488>

In [127]:
%%time
# AudioFeatures Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_audio_feature AS 
    SELECT * FROM audio_feature LIMIT 0;
''')
audio_features_df.to_sql("temp_audio_feature", con=engine, if_exists='replace', index=False)
engine.execute('''
    INSERT INTO public.audio_feature
    SELECT * FROM temp_audio_feature
    WHERE temp_audio_feature.song_id NOT IN (SELECT song_id FROM public.audio_feature);
''')

Wall time: 101 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495cc57708>

In [128]:
%%time
# PlayedSongs Table
cur.execute('''
    CREATE TEMP TABLE IF NOT EXISTS temp_played_song 
    AS SELECT * FROM played_song LIMIT 0;
''')
played_songs_df.to_sql("temp_played_song", con=engine, if_exists='replace', index=False)

engine.execute('''
    INSERT INTO public.played_song
    SELECT * FROM temp_played_song
    WHERE CAST(temp_played_song.played_song_id AS TEXT) NOT IN (SELECT played_song_id FROM public.played_song);
''')

Wall time: 109 ms


<sqlalchemy.engine.result.ResultProxy at 0x2495e5e0048>

In [129]:
# Close connection
cur.close()
conn.close()