# ETL Pipeline

This exercise demonstrates how to build an ETL pipeline: ingesting data from a source, pre-processing, cleaning and transforming it, and then preparing it for an analytics portal or data dashboard used by customers (or business end-users).

## Extract
To start with, I need to download, ingest or import the data. I have set up a free Spotify account and generated some data by creating a playlist of music for this notebook. In the Spotify Developer section you can connect to their API, obtain an OAuth token and choose the type of data you would like access to.

In [1]:
# Generate an OAuth access token for the Spotify API at this URL:  https://developer.spotify.com/console/get-recently-played/
# Check the 'Get Recently Played Tracks' box
# Then click on 'Generate Token' at the bottom of the Spotify developer page
# Finally, copy the curl command containing the API key and paste it below (This key changes each time you request a new one)

!curl -X "GET" "https://api.spotify.com/v1/me/player/recently-played?limit=50" -H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer BQAE6FtRf-slrjFKGVFTvGvpJUtDCNCmPkQirD8neH5DrOPakR0dj9A-seHX6j3kXFo701zdBBfnCin4sJ-kxjuyDn2JdsDQOfFiZsQwJjkV_cg6-juCm5xKs1w5XsDKe_iQN0QfwZWphGgnMVbFnKYgacJbuG--AdP5SZ4nCuNGWAXV_Q"

{
  "items" : [ {
    "track" : {
      "album" : {
        "album_type" : "album",
        "artists" : [ {
          "external_urls" : {
            "spotify" : "https://open.spotify.com/artist/4toEjJSZu1rbfX2hfVdZFA"
          },
          "href" : "https://api.spotify.com/v1/artists/4toEjJSZu1rbfX2hfVdZFA",
          "id" : "4toEjJSZu1rbfX2hfVdZFA",
          "name" : "Boogie Down Productions",
          "type" : "artist",
          "uri" : "spotify:artist:4toEjJSZu1rbfX2hfVdZFA"
        } ],
        "available_markets" : [ "AD", "AE", "AG", "AL", "AM", "AO", "AR", "AT", "AU", "AZ", "BA", "BB", "BD", "BE", "BF", "BG", "BH", "BI", "BJ", "BN", "BO", "BR", "BS", "BT", "BW", "BY", "BZ", "CA", "CD", "CG", "CH", "CI", "CL", "CM", "CO", "CR", "CV", "CW", "CY", "CZ", "DE", "DJ", "DK", "DM", "DO", "DZ", "EC", "EE", "EG", "ES", "FI", "FJ", "FM", "FR", "GA", "GB", "GD", "GE", "GH", "GM", "GN", "GQ", "GR", "GT", "GW", "GY", "HK", "HN", "HR", "HT", "HU", "ID", "IE", "IL", "IN", "IQ", "IS", "IT",

Import the libraries.

In [2]:
import sqlalchemy
import pandas as pd
from sqlalchemy.orm import sessionmaker
import requests
import json
# Import the module itself: the cells below call datetime.datetime and datetime.timedelta
import datetime
import sqlite3

The cleaned data will end up in SQLite: a copy of the dataframe is saved as a table in a SQLite database, which allows queries to be run using the sqlalchemy and sqlite3 packages in Python. The cell below sets the database location, the user ID and the API token.

In [3]:
def run_spotify_etl():
    DATABASE_LOCATION = "sqlite:///my_track_list.sqlite"
    USER_ID = 'lynstanford'
    TOKEN = 'BQAE6FtRf-slrjFKGVFTvGvpJUtDCNCmPkQirD8neH5DrOPakR0dj9A-seHX6j3kXFo701zdBBfnCin4sJ-kxjuyDn2JdsDQOfFiZsQwJjkV_cg6-juCm5xKs1w5XsDKe_iQN0QfwZWphGgnMVbFnKYgacJbuG--AdP5SZ4nCuNGWAXV_Q'

In [4]:
TOKEN = 'BQAE6FtRf-slrjFKGVFTvGvpJUtDCNCmPkQirD8neH5DrOPakR0dj9A-seHX6j3kXFo701zdBBfnCin4sJ-kxjuyDn2JdsDQOfFiZsQwJjkV_cg6-juCm5xKs1w5XsDKe_iQN0QfwZWphGgnMVbFnKYgacJbuG--AdP5SZ4nCuNGWAXV_Q'

if __name__ == "__main__":

    # Extract part of the ETL process
 
    headers = {
        "Accept" : "application/json",
        "Content-Type" : "application/json",
        "Authorization" : "Bearer {token}".format(token=TOKEN)
    }
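Because the token changes every time a new one is requested, pasting it into several cells gets out of date quickly. As a minimal sketch, the headers could be built by a small helper that reads the token from an environment variable. The variable name `SPOTIFY_TOKEN` and the helper `build_headers` are my own assumptions for illustration, not anything defined by Spotify.

```python
import os


def build_headers(token: str) -> dict:
    """Return the request headers used in this notebook for a given bearer token."""
    if not token:
        raise ValueError("An OAuth access token is required")
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }


# SPOTIFY_TOKEN is a hypothetical variable name; fall back to a placeholder if unset.
token = os.environ.get("SPOTIFY_TOKEN") or "placeholder-token"
headers = build_headers(token)
print(sorted(headers))
```

With this in place only the environment variable needs updating when a fresh token is generated.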

In [5]:
# Convert time to a Unix timestamp in milliseconds
today = datetime.datetime.now()
yesterday = today - datetime.timedelta(days=1)
yesterday_unix_timestamp = int(yesterday.timestamp()) * 1000

# Download all songs you've listened to "after yesterday", which means in the last 24 hours
r = requests.get("https://api.spotify.com/v1/me/player/recently-played?after={time}".format(time=yesterday_unix_timestamp),
                 headers = headers)
r.raise_for_status()  # stop here if the token has expired or the request failed

data = r.json()

print(data)

{'items': [{'track': {'album': {'album_type': 'album', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4toEjJSZu1rbfX2hfVdZFA'}, 'href': 'https://api.spotify.com/v1/artists/4toEjJSZu1rbfX2hfVdZFA', 'id': '4toEjJSZu1rbfX2hfVdZFA', 'name': 'Boogie Down Productions', 'type': 'artist', 'uri': 'spotify:artist:4toEjJSZu1rbfX2hfVdZFA'}], 'available_markets': ['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CD', 'CG', 'CH', 'CI', 'CL', 'CM', 'CO', 'CR', 'CV', 'CW', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FJ', 'FM', 'FR', 'GA', 'GB', 'GD', 'GE', 'GH', 'GM', 'GN', 'GQ', 'GR', 'GT', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IS', 'IT', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KR', 'KW', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 
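The `after` parameter in the request above is a Unix timestamp in milliseconds. The sketch below shows one way to compute it with integer arithmetic on a timezone-aware datetime, which avoids any dependence on the machine's local time zone. The names `to_unix_ms` and `recently_played_url` are hypothetical helpers, not part of the notebook or the Spotify API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    # Dividing one timedelta by another with // gives an exact integer.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def recently_played_url(after: datetime) -> str:
    """Build the recently-played URL used in this notebook for a given cut-off time."""
    base = "https://api.spotify.com/v1/me/player/recently-played"
    return f"{base}?after={to_unix_ms(after)}"


# Cut-off 24 hours before a fixed example moment (the play time shown in the output).
example_now = datetime(2022, 11, 7, 21, 23, 1, 787000, tzinfo=timezone.utc)
print(recently_played_url(example_now - timedelta(days=1)))
```

Unlike `int(yesterday.timestamp()) * 1000`, this keeps the millisecond part instead of rounding down to the second, which makes no practical difference for a 24-hour window.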

In [17]:
song_names = []
artist_names = []
played_at_list = []
timestamps = []

# Extracting only the relevant bits of data from the json object      
for song in data["items"]:
    song_names.append(song["track"]["name"])
    artist_names.append(song["track"]["album"]["artists"][0]["name"])
    played_at_list.append(song["played_at"])
    timestamps.append(song["played_at"][0:10])

The 'data' variable holds the parsed JSON response, but it is deeply nested and still incredibly messy. To tidy it up and make it ready to be used as a Pandas dataframe, the lists extracted above are collected into a 'dict' with one key per column.

In [18]:
# Prepare a dictionary in order to turn it into a pandas dataframe below       
song_dict = {
    "song_name" : song_names,
    "artist_name": artist_names,
    "played_at" : played_at_list,
    "timestamp" : timestamps
}

In [19]:
song_df = pd.DataFrame(song_dict, columns = ["song_name", "artist_name", "played_at", "timestamp"])
print(song_df)

  song_name              artist_name                 played_at   timestamp
0    Poetry  Boogie Down Productions  2022-11-07T21:23:01.787Z  2022-11-07
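The extraction steps above can also be sketched as a single function that turns the nested response into one flat record per play. The sample payload below is a hand-trimmed version of the response shown earlier (only the keys the notebook reads), and `extract_rows` is a hypothetical helper name.

```python
# Trimmed-down sample of the API response, keeping only the fields used here.
sample = {
    "items": [
        {
            "track": {
                "name": "Poetry",
                "album": {"artists": [{"name": "Boogie Down Productions"}]},
            },
            "played_at": "2022-11-07T21:23:01.787Z",
        }
    ]
}


def extract_rows(payload: dict) -> list:
    """Flatten the 'items' list into one dictionary per played track."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        rows.append(
            {
                "song_name": track["name"],
                "artist_name": track["album"]["artists"][0]["name"],
                "played_at": item["played_at"],
                "timestamp": item["played_at"][:10],  # date part, YYYY-MM-DD
            }
        )
    return rows


rows = extract_rows(sample)
print(rows)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(rows)`, which is an alternative to building four parallel lists.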


## Transform
This could also be called the 'Validation' stage, as I am confirming that data is present, checking for null or missing values and generally cleaning it up.

In [None]:
import sqlalchemy
import pandas as pd
from sqlalchemy.orm import sessionmaker
import requests
import json
# Import the module itself: the cells below call datetime.datetime and datetime.timedelta
import datetime
import sqlite3

In [None]:
DATABASE_LOCATION = "sqlite:///my_track_list.sqlite"
USER_ID = ''
TOKEN = ''

In [20]:
def check_if_valid_data(df: pd.DataFrame) -> bool:
    # Check if dataframe is empty
    if df.empty:
        print("No songs downloaded. Finishing execution")
        return False

    # Primary key check: every 'played_at' value must be unique
    if not df['played_at'].is_unique:
        raise Exception("Primary Key check is violated")

    # Check for nulls
    if df.isnull().values.any():
        raise Exception("Null values found")

    # Check that all timestamps are of yesterday's date (midnight, local time)
    yesterday = datetime.datetime.now() - datetime.timedelta(days=1)
    yesterday = yesterday.replace(hour=0, minute=0, second=0, microsecond=0)

    timestamps = df["timestamp"].tolist()
    for timestamp in timestamps:
        if datetime.datetime.strptime(timestamp, '%Y-%m-%d') != yesterday:
            raise Exception("At least one of the returned songs does not have yesterday's timestamp")

    return True

if __name__ == "__main__":

    # Extract part of the ETL process
 
    headers = {
        "Accept" : "application/json",
        "Content-Type" : "application/json",
        "Authorization" : "Bearer {token}".format(token=TOKEN)
    }
    
    # Convert time to a Unix timestamp in milliseconds
    today = datetime.datetime.now()
    yesterday = today - datetime.timedelta(days=1)
    yesterday_unix_timestamp = int(yesterday.timestamp()) * 1000

    # Download all songs you've listened to "after yesterday", which means in the last 24 hours      
    r = requests.get("https://api.spotify.com/v1/me/player/recently-played?after={time}".format(time=yesterday_unix_timestamp), headers = headers)

    data = r.json()
    
    print(data)


{'items': [{'track': {'album': {'album_type': 'album', 'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4toEjJSZu1rbfX2hfVdZFA'}, 'href': 'https://api.spotify.com/v1/artists/4toEjJSZu1rbfX2hfVdZFA', 'id': '4toEjJSZu1rbfX2hfVdZFA', 'name': 'Boogie Down Productions', 'type': 'artist', 'uri': 'spotify:artist:4toEjJSZu1rbfX2hfVdZFA'}], 'available_markets': ['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ', 'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR', 'BS', 'BT', 'BW', 'BY', 'BZ', 'CA', 'CD', 'CG', 'CH', 'CI', 'CL', 'CM', 'CO', 'CR', 'CV', 'CW', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DM', 'DO', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FJ', 'FM', 'FR', 'GA', 'GB', 'GD', 'GE', 'GH', 'GM', 'GN', 'GQ', 'GR', 'GT', 'GW', 'GY', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IS', 'IT', 'JM', 'JO', 'JP', 'KE', 'KG', 'KH', 'KI', 'KM', 'KN', 'KR', 'KW', 'KZ', 'LA', 'LB', 'LC', 'LI', 'LK', 'LR', 'LS', 'LT', 'LU', 'LV', 'LY', 'MA', 'MC', 
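One thing to watch in the timestamp check inside `check_if_valid_data`: the `played_at` values end in `Z`, meaning they are in UTC, while `datetime.datetime.now()` returns local time. A request for the last 24 hours can also return tracks played today, which the strict "yesterday only" comparison would reject. The sketch below is one possible variation, not the notebook's method, that does the comparison in UTC and takes the current time as an argument so it is easy to test. The name `played_yesterday_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def played_yesterday_utc(played_at: str, now: datetime) -> bool:
    """Return True if `played_at` falls on the UTC calendar day before `now`."""
    # The first ten characters of the ISO 8601 string are the date, YYYY-MM-DD.
    played_date = datetime.strptime(played_at[:10], "%Y-%m-%d").date()
    yesterday = (now.astimezone(timezone.utc) - timedelta(days=1)).date()
    return played_date == yesterday


# Fixed example moment so the result does not depend on when the cell is run.
now = datetime(2022, 11, 8, 9, 0, tzinfo=timezone.utc)
print(played_yesterday_utc("2022-11-07T21:23:01.787Z", now))
print(played_yesterday_utc("2022-11-08T00:10:00.000Z", now))
```

Whether plays from the current day should be dropped or kept is a design decision for the pipeline; this only makes the rule explicit.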

Copy this output and store it in a separate JSON file so that it can be loaded back into a Python dictionary later. I have decided to name the file 'my_track_list.json'.
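Rather than copying and pasting by hand, the response could be written to the file and read back with the standard `json` module. This is a minimal sketch: the payload is a tiny stand-in for the real response, the helpers `save_json` and `load_json` are hypothetical, and a temporary folder is used so nothing is left behind.

```python
import json
import tempfile
from pathlib import Path


def save_json(payload: dict, path: Path) -> None:
    """Write the payload to `path` as indented JSON."""
    with path.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)


def load_json(path: Path) -> dict:
    """Read a JSON file back into a Python dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Stand-in for the `data` variable returned by r.json().
payload = {"items": [{"played_at": "2022-11-07T21:23:01.787Z"}]}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "my_track_list.json"
    save_json(payload, path)
    loaded = load_json(path)

print(loaded == payload)
```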

Download the dataset so it can be used in Python. The data format plays a big role in how the information is cleaned up and manipulated before it can be presented for analytical use.

There are no error messages, so it's safe to continue. Don't forget to store the dataframe in a SQLite table before running any SQL queries.
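As a rough sketch of that storage step, the example below uses only the standard `sqlite3` module and an in-memory database. The table name `my_played_tracks` is an assumption for illustration, and the single row is the one shown in the dataframe output. With pandas, `DataFrame.to_sql` plays the analogous role.

```python
import sqlite3

# One row per play: song_name, artist_name, played_at, timestamp.
rows = [
    ("Poetry", "Boogie Down Productions", "2022-11-07T21:23:01.787Z", "2022-11-07"),
]

# ":memory:" keeps the example self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS my_played_tracks (
        song_name   TEXT,
        artist_name TEXT,
        played_at   TEXT PRIMARY KEY,
        timestamp   TEXT
    )
    """
)

# INSERT OR IGNORE skips rows whose played_at already exists, so re-running is safe.
insert_sql = "INSERT OR IGNORE INTO my_played_tracks VALUES (?, ?, ?, ?)"
conn.executemany(insert_sql, rows)
conn.executemany(insert_sql, rows)  # second run adds nothing
conn.commit()

result = conn.execute("SELECT song_name, artist_name FROM my_played_tracks").fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM my_played_tracks").fetchone()[0]
print(result, row_count)
conn.close()
```

Using `played_at` as the primary key mirrors the uniqueness check in `check_if_valid_data`.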


In [22]:
song_names = []
artist_names = []
played_at_list = []
timestamps = []

# Extracting only the relevant bits of data from the json object      
for song in data["items"]:
    song_names.append(song["track"]["name"])
    artist_names.append(song["track"]["album"]["artists"][0]["name"])
    played_at_list.append(song["played_at"])
    timestamps.append(song["played_at"][0:10])

In [23]:
# Prepare a dictionary in order to turn it into a pandas dataframe below       
song_dict = {
    "song_name" : song_names,
    "artist_name": artist_names,
    "played_at" : played_at_list,
    "timestamp" : timestamps
}

In [24]:
song_df = pd.DataFrame(song_dict, columns = ["song_name", "artist_name", "played_at", "timestamp"])
print(song_df)

  song_name              artist_name                 played_at   timestamp
0    Poetry  Boogie Down Productions  2022-11-07T21:23:01.787Z  2022-11-07
