## Methodology:

**Python:**
1. Install Spotipy <br> (*documentation*: https://spotipy.readthedocs.io/en/2.24.0/index.html#spotipy.oauth2.SpotifyPKCE)
1. Extract playlist URI from the top 100 trending songs website 
1. Fetch the playlist's tracks as a JSON-structured object
1. Extract only the relevant information and create a pandas DataFrame from it:
    - Album id
    - Album release date
    - Album name
    - Album total_tracks
    - Album top_song
    - Artist name
    - Album external urls
1. Extract the following lists:
    - artists
    - songs
    - albums
1. Create three DataFrames from these lists
1. Drop duplicate rows
1. Convert the release date to a datetime type

**AWS Services used**:
- CloudWatch: setting Billing Alarms 
- AWS S3 buckets: for storing JSON and .csv files 
- AWS Lambda: compute service used for running the transformation jobs and creating event-based triggers. Create the Lambda functions as follows:
> Extraction:<br>
- Before writing code in the Lambda function, set environment variables for storing the *client_id* and *client_secret* keys <br>{ AWS Lambda Function > Configuration > Environment variables }
- Edit the code so that it reads the client ID and client secret from these variables instead of hard-coding them
- Deploy the code first and then configure the test event using "Test" 
- The Lambda Python runtime does not include third-party packages such as `spotipy`. Use a Lambda **Layer** instead: upload a .zip file containing the spotipy installation under Lambda Layers and attach the layer to the function so that the module can be imported in the code.
- Now write the code that dumps files into the S3 bucket (e.g. the spotify-etl-project-dawny bucket) using the **boto3** module, the AWS SDK for Python, which lets the function communicate with the AWS services on your account.
- If the function fails because it times out, raise the function's timeout to 5-6 minutes.
- Ensure the function's execution role has the *AmazonS3FullAccess* IAM policy attached
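
The extraction steps above could be sketched roughly as follows. The environment variable names `client_id` and `client_secret` follow the list above; the key prefix `raw_data/to_processed/` and the helper names `build_raw_key` and `upload_raw` are assumptions made for illustration, and `spotipy` is assumed to be supplied through the attached Lambda layer.

```python
import json
import os
from datetime import datetime, timezone


def build_raw_key(now, prefix="raw_data/to_processed/"):
    """Build a timestamped S3 key (the prefix is a hypothetical layout)."""
    return f"{prefix}spotify_raw_{now:%Y%m%dT%H%M%S}.json"


def upload_raw(s3_client, bucket, key, payload):
    """Serialise the API response and store it as one JSON object in S3."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload))
    return key


def lambda_handler(event, context):
    # Imported lazily: boto3 ships with the Lambda Python runtime,
    # spotipy is assumed to come from the attached layer.
    import boto3
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    # Credentials come from the function's environment variables.
    auth = SpotifyClientCredentials(
        client_id=os.environ["client_id"],
        client_secret=os.environ["client_secret"],
    )
    sp = spotipy.Spotify(client_credentials_manager=auth)
    data = sp.playlist_tracks("5ABHKGoOzxkaa28ttQV9sE")

    key = build_raw_key(datetime.now(timezone.utc))
    upload_raw(boto3.client("s3"), "spotify-etl-project-dawny", key, data)
    return {"statusCode": 200, "body": key}
```

Passing the S3 client into `upload_raw` as an argument keeps the upload logic testable without AWS access; only `lambda_handler` touches real services.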
    
> Transformation and load:<br>
- Now create a new function for transformation and load. In the advanced settings, set the execution role to an existing IAM role and select the role you just created.
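
A minimal sketch of what the transformation-and-load function might do: read one raw JSON object from S3, flatten it into album rows, and write the result back as CSV. The helper names and the object keys passed in are hypothetical. The standard-library `csv` module stands in for pandas here so that the sketch has no extra dependencies; that is a swap for illustration, not the notebook's method.

```python
import csv
import io
import json


def album_rows(playlist_page):
    """Flatten the 'items' of a playlist page into one dict per album."""
    rows = []
    for item in playlist_page["items"]:
        album = item["track"]["album"]
        rows.append({
            "album_id": album["id"],
            "name": album["name"],
            "release_date": album["release_date"],
            "total_tracks": album["total_tracks"],
            "url": album["external_urls"]["spotify"],
        })
    return rows


def rows_to_csv(rows):
    """Render a list of same-shaped dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def transform_and_load(s3_client, bucket, raw_key, csv_key):
    """Read one raw JSON object from S3 and write the album table back as CSV."""
    body = s3_client.get_object(Bucket=bucket, Key=raw_key)["Body"].read()
    rows = album_rows(json.loads(body))
    s3_client.put_object(Bucket=bucket, Key=csv_key, Body=rows_to_csv(rows))
    return len(rows)
```

Inside a Lambda handler this would be called with `boto3.client("s3")` and the bucket name, for example `transform_and_load(s3, "spotify-etl-project-dawny", raw_key, csv_key)`.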

In [1]:
# !pip install spotipy

In [2]:
# Imports:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

### Extraction:

In [3]:
# Read the client ID and client secret from environment variables instead of hard-coding them:
import os

client_credential_manager = SpotifyClientCredentials(client_id = os.environ['SPOTIPY_CLIENT_ID'],
                                                     client_secret = os.environ['SPOTIPY_CLIENT_SECRET'])


In [4]:
# Client object for querying the Spotify Web API:
sp = spotipy.Spotify(client_credentials_manager= client_credential_manager)
playlists = sp.user_playlists('spotify')

In [5]:
# Extract playlist URI from the top 100 trending songs website

playlist_link = 'https://open.spotify.com/playlist/5ABHKGoOzxkaa28ttQV9sE'
playlist_URI = playlist_link.split('/')[-1]

print('Playlist_URI:', playlist_URI)

Playlist_URI: 5ABHKGoOzxkaa28ttQV9sE
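
Links copied from Spotify's share menu often carry a query string such as `?si=...`, in which case `split('/')[-1]` would keep that suffix. A small sketch of a more tolerant extraction follows; `playlist_id_from_link` is a hypothetical helper name and the `si` value is a made-up placeholder.

```python
from urllib.parse import urlparse


def playlist_id_from_link(link):
    """Return the last path segment of a playlist URL, ignoring any query string."""
    path = urlparse(link).path          # e.g. '/playlist/5ABHKGoOzxkaa28ttQV9sE'
    return path.rstrip("/").split("/")[-1]


print(playlist_id_from_link(
    "https://open.spotify.com/playlist/5ABHKGoOzxkaa28ttQV9sE?si=abc123"
))
# prints: 5ABHKGoOzxkaa28ttQV9sE
```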


In [6]:
# Fetch the playlist's tracks as a JSON-like dict:

data = sp.playlist_tracks(playlist_URI)

print(data)

{'href': 'https://api.spotify.com/v1/playlists/5ABHKGoOzxkaa28ttQV9sE/tracks?offset=0&limit=100&additional_types=track', 'items': [{'added_at': '2024-02-15T21:57:53Z', 'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/jonathan.holgersson'}, 'href': 'https://api.spotify.com/v1/users/jonathan.holgersson', 'id': 'jonathan.holgersson', 'type': 'user', 'uri': 'spotify:user:jonathan.holgersson'}, 'is_local': False, 'primary_color': None, 'track': {'preview_url': None, 'available_markets': ['AR', 'AU', 'AT', 'BE', 'BO', 'BR', 'BG', 'CA', 'CL', 'CO', 'CR', 'CY', 'CZ', 'DK', 'DO', 'DE', 'EC', 'EE', 'SV', 'FI', 'FR', 'GR', 'GT', 'HN', 'HK', 'HU', 'IS', 'IE', 'IT', 'LV', 'LT', 'LU', 'MY', 'MT', 'MX', 'NL', 'NZ', 'NI', 'NO', 'PA', 'PY', 'PE', 'PH', 'PL', 'PT', 'SG', 'SK', 'ES', 'SE', 'CH', 'TW', 'TR', 'UY', 'US', 'GB', 'AD', 'LI', 'MC', 'ID', 'JP', 'TH', 'VN', 'RO', 'IL', 'ZA', 'SA', 'AE', 'BH', 'QA', 'OM', 'KW', 'EG', 'MA', 'DZ', 'TN', 'LB', 'JO', 'PS', 'IN', 'BY', 'KZ', 'M

In [7]:
# Dict Object content
for i in data: print(i)

href
items
limit
next
offset
previous
total
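
The `next` key listed above holds the URL of the following page, or `None` on the last page. Each call returns a single page (here `limit=100`, as the `href` in the earlier output shows), so for a longer playlist a loop like the sketch below could collect every page. `fetch_all_items` is a hypothetical helper name, and `sp` is assumed to be the spotipy client created earlier.

```python
def fetch_all_items(sp, playlist_id):
    """Collect the items of every page of a playlist, following 'next' links."""
    page = sp.playlist_tracks(playlist_id)
    items = list(page["items"])
    while page["next"]:            # None on the last page
        page = sp.next(page)       # spotipy helper that fetches the next page
        items.extend(page["items"])
    return items
```

For the 100-track playlist used here the loop body would never run, so `fetch_all_items(sp, playlist_URI)` should match `data['items']`.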


In [8]:
# Fetch only the playlist items (one entry per track, each carrying its album info):
info = data['items']

print('The playlist comprises', len(info), 'tracks.')

The playlist comprises 100 tracks.


### Extracting only the relevant information:
- Album id
- Album release date
- Album name
- Album total_tracks
- Album top_song
- Artist name
- Album external urls


In [9]:
# Album id
info[0]['track']['album']['id']

'4yP0hdKOZPNshxUOjY0cZj'

In [10]:
# Album name
info[0]['track']['album']['name']

'After Hours'

In [11]:
# Album release date
info[0]['track']['album']['release_date']

'2020-03-20'

In [12]:
# Total Tracks
info[0]['track']['album']['total_tracks']

14

In [13]:
# External urls
info[0]['track']['album']['external_urls']['spotify']

'https://open.spotify.com/album/4yP0hdKOZPNshxUOjY0cZj'

In [14]:
# Artist Name
info[0]['track']['album']['artists'][0]['name']

'The Weeknd'

In [15]:
# Top Song from album
info[0]['track']['name']

'Blinding Lights'

In [16]:
# Function for creating a dataframe
def top_100_albums(info):
    """
    Create a DataFrame from a list of JSON dictionaries representing album information.
    
    Parameters:
    info (list): List of dictionaries containing album information.
    
    Returns:
    pd.DataFrame: DataFrame with album information.
    """
    # Extracting data
    data = {
        "album_id": [item['track']['album']['id'] for item in info],
        "album_release_date": [item['track']['album']['release_date'] for item in info],
        "album_name": [item['track']['album']['name'] for item in info],
        "total_tracks": [item['track']['album']['total_tracks'] for item in info],
        "album_top_song": [item['track']['name'] for item in info],
        "artist_name": [item['track']['album']['artists'][0]['name'] for item in info],
        "album_external_urls": [item['track']['album']['external_urls']['spotify'] for item in info]
    }

    # Creating DataFrame
    df = pd.DataFrame(data)
    return df
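
The list comprehensions above assume that every item has a complete `track` entry. Playlist items may occasionally have `track` set to `None` or lack album metadata (for example local files), which would raise an error. A hedged sketch of a pre-filter follows; `usable_items` is a hypothetical helper, not part of spotipy.

```python
def usable_items(items):
    """Keep only playlist items that carry a track with album and artist data."""
    kept = []
    for item in items:
        track = item.get("track")
        if not track:
            continue                      # missing or null track
        album = track.get("album") or {}
        if not album.get("id") or not album.get("artists"):
            continue                      # e.g. local files without album metadata
        kept.append(item)
    return kept
```

It could be applied before building the frame, as in `top_100_albums(usable_items(info))`.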


In [17]:
# Creating dataframe using function 
df = top_100_albums(info)
df.head()

| | album_id | album_release_date | album_name | total_tracks | album_top_song | artist_name | album_external_urls |
|---|---|---|---|---|---|---|---|
| 0 | 4yP0hdKOZPNshxUOjY0cZj | 2020-03-20 | After Hours | 14 | Blinding Lights | The Weeknd | https://open.spotify.com/album/4yP0hdKOZPNshxU... |
| 1 | 3T4tUhGYeRNVUGevb0wThu | 2017-03-03 | ÷ (Deluxe) | 16 | Shape of You | Ed Sheeran | https://open.spotify.com/album/3T4tUhGYeRNVUGe... |
| 2 | 5658aM19fA3JVwTK6eQX70 | 2019-05-17 | Divinely Uninspired To A Hellish Extent | 12 | Someone You Loved | Lewis Capaldi | https://open.spotify.com/album/5658aM19fA3JVwT... |
| 3 | 35s58BRTGAEWztPo9WqCIs | 2018-12-14 | Spider-Man: Into the Spider-Verse (Soundtrack ... | 13 | Sunflower - Spider-Man: Into the Spider-Verse | Various Artists | https://open.spotify.com/album/35s58BRTGAEWztP... |
| 4 | 5r36AJ6VOJtp00oxSkBZ5h | 2022-05-20 | Harry's House | 13 | As It Was | Harry Styles | https://open.spotify.com/album/5r36AJ6VOJtp00o... |


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   album_id             100 non-null    object
 1   album_release_date   100 non-null    object
 2   album_name           100 non-null    object
 3   total_tracks         100 non-null    int64 
 4   album_top_song       100 non-null    object
 5   artist_name          100 non-null    object
 6   album_external_urls  100 non-null    object
dtypes: int64(1), object(6)
memory usage: 5.6+ KB


In [19]:
# Convert release date to datetime column
df['album_release_date'] = pd.to_datetime(df['album_release_date'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   album_id             100 non-null    object        
 1   album_release_date   100 non-null    datetime64[ns]
 2   album_name           100 non-null    object        
 3   total_tracks         100 non-null    int64         
 4   album_top_song       100 non-null    object        
 5   artist_name          100 non-null    object        
 6   album_external_urls  100 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 5.6+ KB


In [20]:
# Drop duplicate rows
df = df.drop_duplicates()

### Extracting relevant information for three tables: songs, artists and albums

In [21]:
# List of albums
album_list = []
for row in data['items']:
    album_id = row['track']['album']['id']
    album_name = row['track']['album']['name']
    album_release_date = row['track']['album']['release_date']
    album_total_tracks = row['track']['album']['total_tracks']
    album_url = row['track']['album']['external_urls']['spotify']
    album_element = {'album_id':album_id,'name':album_name,'release_date':album_release_date,
                        'total_tracks':album_total_tracks,'url':album_url}
    album_list.append(album_element)
    
album_list

[{'album_id': '4yP0hdKOZPNshxUOjY0cZj',
  'name': 'After Hours',
  'release_date': '2020-03-20',
  'total_tracks': 14,
  'url': 'https://open.spotify.com/album/4yP0hdKOZPNshxUOjY0cZj'},
 {'album_id': '3T4tUhGYeRNVUGevb0wThu',
  'name': '÷ (Deluxe)',
  'release_date': '2017-03-03',
  'total_tracks': 16,
  'url': 'https://open.spotify.com/album/3T4tUhGYeRNVUGevb0wThu'},
 {'album_id': '5658aM19fA3JVwTK6eQX70',
  'name': 'Divinely Uninspired To A Hellish Extent',
  'release_date': '2019-05-17',
  'total_tracks': 12,
  'url': 'https://open.spotify.com/album/5658aM19fA3JVwTK6eQX70'},
 {'album_id': '35s58BRTGAEWztPo9WqCIs',
  'name': 'Spider-Man: Into the Spider-Verse (Soundtrack From & Inspired by the Motion Picture)',
  'release_date': '2018-12-14',
  'total_tracks': 13,
  'url': 'https://open.spotify.com/album/35s58BRTGAEWztPo9WqCIs'},
 {'album_id': '5r36AJ6VOJtp00oxSkBZ5h',
  'name': "Harry's House",
  'release_date': '2022-05-20',
  'total_tracks': 13,
  'url': 'https://open.spotify.com/

In [22]:
# List of artists (every artist credited on each track):
artist_list = []
for row in data['items']:
    for artist in row['track']['artists']:
        # Note: 'href' is the Web API endpoint for the artist, as the output below shows
        artist_dict = {'artist_id':artist['id'], 'artist_name':artist['name'], 'external_url': artist['href']}
        artist_list.append(artist_dict)
                
artist_list

[{'artist_id': '1Xyo4u8uXC1ZmMpatF05PJ',
  'artist_name': 'The Weeknd',
  'external_url': 'https://api.spotify.com/v1/artists/1Xyo4u8uXC1ZmMpatF05PJ'},
 {'artist_id': '6eUKZXaKkcviH0Ku9w2n3V',
  'artist_name': 'Ed Sheeran',
  'external_url': 'https://api.spotify.com/v1/artists/6eUKZXaKkcviH0Ku9w2n3V'},
 {'artist_id': '4GNC7GD6oZMSxPGyXy4MNB',
  'artist_name': 'Lewis Capaldi',
  'external_url': 'https://api.spotify.com/v1/artists/4GNC7GD6oZMSxPGyXy4MNB'},
 {'artist_id': '246dkjvS1zLTtiykXe5h60',
  'artist_name': 'Post Malone',
  'external_url': 'https://api.spotify.com/v1/artists/246dkjvS1zLTtiykXe5h60'},
 {'artist_id': '1zNqQNIdeOUZHb8zbZRFMX',
  'artist_name': 'Swae Lee',
  'external_url': 'https://api.spotify.com/v1/artists/1zNqQNIdeOUZHb8zbZRFMX'},
 {'artist_id': '6KImCVD70vtIoJWnq6nGn3',
  'artist_name': 'Harry Styles',
  'external_url': 'https://api.spotify.com/v1/artists/6KImCVD70vtIoJWnq6nGn3'},
 {'artist_id': '1Xyo4u8uXC1ZmMpatF05PJ',
  'artist_name': 'The Weeknd',
  'external_

In [23]:
# Song list:
song_list = []
for row in data['items']:
    song_id = row['track']['id']
    song_name = row['track']['name']
    song_duration = row['track']['duration_ms']
    song_url = row['track']['external_urls']['spotify']
    song_popularity = row['track']['popularity']
    song_added = row['added_at']
    album_id = row['track']['album']['id']
    artist_id = row['track']['album']['artists'][0]['id']
    song_element = {'song_id':song_id,'song_name':song_name,'duration_ms':song_duration,'url':song_url,
                    'popularity':song_popularity,'song_added':song_added,'album_id':album_id,
                    'artist_id':artist_id
                   }
    song_list.append(song_element)

song_list

[{'song_id': '0VjIjW4GlUZAMYd2vXMi3b',
  'song_name': 'Blinding Lights',
  'duration_ms': 200040,
  'url': 'https://open.spotify.com/track/0VjIjW4GlUZAMYd2vXMi3b',
  'popularity': 84,
  'song_added': '2024-02-15T21:57:53Z',
  'album_id': '4yP0hdKOZPNshxUOjY0cZj',
  'artist_id': '1Xyo4u8uXC1ZmMpatF05PJ'},
 {'song_id': '7qiZfU4dY1lWllzX7mPBI3',
  'song_name': 'Shape of You',
  'duration_ms': 233712,
  'url': 'https://open.spotify.com/track/7qiZfU4dY1lWllzX7mPBI3',
  'popularity': 80,
  'song_added': '2024-02-15T21:57:53Z',
  'album_id': '3T4tUhGYeRNVUGevb0wThu',
  'artist_id': '6eUKZXaKkcviH0Ku9w2n3V'},
 {'song_id': '7qEHsqek33rTcFNT9PFqLf',
  'song_name': 'Someone You Loved',
  'duration_ms': 182160,
  'url': 'https://open.spotify.com/track/7qEHsqek33rTcFNT9PFqLf',
  'popularity': 83,
  'song_added': '2024-02-15T21:57:53Z',
  'album_id': '5658aM19fA3JVwTK6eQX70',
  'artist_id': '4GNC7GD6oZMSxPGyXy4MNB'},
 {'song_id': '3KkXRkHbMCARz0aVfEt68P',
  'song_name': 'Sunflower - Spider-Man: Into
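
The song list above keeps only the first album artist per song, while a track can credit several artists. If every song-artist pairing needs to be preserved, one option is a separate link table. The sketch below assumes the same item shape as `data['items']`; `song_artist_links` is a hypothetical helper and the sample IDs are made up.

```python
def song_artist_links(items):
    """Return one {'song_id', 'artist_id'} row per artist credited on each track."""
    links = []
    for row in items:
        track = row["track"]
        for artist in track["artists"]:
            links.append({"song_id": track["id"], "artist_id": artist["id"]})
    return links


# Tiny hand-written sample mirroring the shape of data['items'] (IDs are made up).
sample = [
    {"track": {"id": "song_a", "artists": [{"id": "artist_1"}, {"id": "artist_2"}]}},
    {"track": {"id": "song_b", "artists": [{"id": "artist_1"}]}},
]
print(song_artist_links(sample))
```

Such a table could then be joined to both the song and artist frames on their respective IDs.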

In [24]:
# Create data frames from the list
album_df = pd.DataFrame.from_dict(album_list)
song_df = pd.DataFrame.from_dict(song_list)
artist_df = pd.DataFrame.from_dict(artist_list)

In [25]:
album_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   album_id      100 non-null    object
 1   name          100 non-null    object
 2   release_date  100 non-null    object
 3   total_tracks  100 non-null    int64 
 4   url           100 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.0+ KB


In [26]:
song_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   song_id      100 non-null    object
 1   song_name    100 non-null    object
 2   duration_ms  100 non-null    int64 
 3   url          100 non-null    object
 4   popularity   100 non-null    int64 
 5   song_added   100 non-null    object
 6   album_id     100 non-null    object
 7   artist_id    100 non-null    object
dtypes: int64(2), object(6)
memory usage: 6.4+ KB


In [27]:
artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   artist_id     126 non-null    object
 1   artist_name   126 non-null    object
 2   external_url  126 non-null    object
dtypes: object(3)
memory usage: 3.1+ KB


In [28]:
album_df['release_date'] = pd.to_datetime(album_df['release_date'])
song_df['song_added'] = pd.to_datetime(song_df['song_added'])
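
Spotify album objects also carry a `release_date_precision` field (`'year'`, `'month'` or `'day'`), so `release_date` can be as short as `'2000'`. Depending on the pandas version, a column that mixes these forms may be parsed leniently or rejected. A hedged sketch of normalising the strings first is shown below; `normalise_release_date` is a hypothetical helper, and the notebook itself does not extract the precision field.

```python
from datetime import date


def normalise_release_date(value, precision="day"):
    """Pad 'YYYY' or 'YYYY-MM' release dates to the first day of the period."""
    parts = [int(p) for p in value.split("-")]
    if precision == "year":
        return date(parts[0], 1, 1)
    if precision == "month":
        return date(parts[0], parts[1], 1)
    return date(parts[0], parts[1], parts[2])


print(normalise_release_date("2000", "year"))        # 2000-01-01
print(normalise_release_date("2020-03-20", "day"))   # 2020-03-20
```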

In [29]:
album_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   album_id      100 non-null    object        
 1   name          100 non-null    object        
 2   release_date  100 non-null    datetime64[ns]
 3   total_tracks  100 non-null    int64         
 4   url           100 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 4.0+ KB


In [30]:
song_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   song_id      100 non-null    object             
 1   song_name    100 non-null    object             
 2   duration_ms  100 non-null    int64              
 3   url          100 non-null    object             
 4   popularity   100 non-null    int64              
 5   song_added   100 non-null    datetime64[ns, UTC]
 6   album_id     100 non-null    object             
 7   artist_id    100 non-null    object             
dtypes: datetime64[ns, UTC](1), int64(2), object(5)
memory usage: 6.4+ KB


In [31]:
# Drop duplicates
song_df = song_df.drop_duplicates()
artist_df = artist_df.drop_duplicates()
album_df = album_df.drop_duplicates()
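
Called without arguments, `drop_duplicates()` only removes rows that match in every column. To guarantee one row per ID even when another column differs, the `subset` parameter can be used. The toy frame below is made up purely to show the difference.

```python
import pandas as pd

# Toy frame: the same artist appears twice with a differing URL (values are made up).
artists = pd.DataFrame({
    "artist_id": ["a1", "a1", "a2"],
    "artist_name": ["Artist One", "Artist One", "Artist Two"],
    "external_url": ["https://example.com/a1", "https://example.com/a1?x=1",
                     "https://example.com/a2"],
})

whole_row = artists.drop_duplicates()                      # keeps all 3 rows
by_key = artists.drop_duplicates(subset=["artist_id"])     # keeps 2 rows
print(len(whole_row), len(by_key))
```

Applied to the frames above, this would look like `album_df.drop_duplicates(subset=['album_id'])`.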

In [32]:
song_df

| | song_id | song_name | duration_ms | url | popularity | song_added | album_id | artist_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 0VjIjW4GlUZAMYd2vXMi3b | Blinding Lights | 200040 | https://open.spotify.com/track/0VjIjW4GlUZAMYd... | 84 | 2024-02-15 21:57:53+00:00 | 4yP0hdKOZPNshxUOjY0cZj | 1Xyo4u8uXC1ZmMpatF05PJ |
| 1 | 7qiZfU4dY1lWllzX7mPBI3 | Shape of You | 233712 | https://open.spotify.com/track/7qiZfU4dY1lWllz... | 80 | 2024-02-15 21:57:53+00:00 | 3T4tUhGYeRNVUGevb0wThu | 6eUKZXaKkcviH0Ku9w2n3V |
| 2 | 7qEHsqek33rTcFNT9PFqLf | Someone You Loved | 182160 | https://open.spotify.com/track/7qEHsqek33rTcFN... | 83 | 2024-02-15 21:57:53+00:00 | 5658aM19fA3JVwTK6eQX70 | 4GNC7GD6oZMSxPGyXy4MNB |
| 3 | 3KkXRkHbMCARz0aVfEt68P | Sunflower - Spider-Man: Into the Spider-Verse | 158040 | https://open.spotify.com/track/3KkXRkHbMCARz0a... | 76 | 2024-02-15 21:57:53+00:00 | 35s58BRTGAEWztPo9WqCIs | 0LyfQWJT6nXafLPZqxe9Of |
| 4 | 4Dvkj6JhhA12EX05fT7y2e | As It Was | 167303 | https://open.spotify.com/track/4Dvkj6JhhA12EX0... | 87 | 2024-02-15 21:57:53+00:00 | 5r36AJ6VOJtp00oxSkBZ5h | 6KImCVD70vtIoJWnq6nGn3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 60a0Rd6pjrkxjPbaKzXjfq | In the End | 216880 | https://open.spotify.com/track/60a0Rd6pjrkxjPb... | 82 | 2024-05-16 07:38:07+00:00 | 6hPkbAV3ZXpGZBGUvL6jVM | 6XyY86QOPPrYVGvF9ch6wz |
| 96 | 7DSAEUvxU8FajXtRloy8M0 | Flowers | 200600 | https://open.spotify.com/track/7DSAEUvxU8FajXt... | 85 | 2024-06-04 19:13:07+00:00 | 5DvJgsMLbaR1HmAI6VhfcQ | 5YGY8feqx7naU7z4HrwZM6 |
| 97 | 7miLbD9XU1SOhXxbZPv7dR | Cuidado | 179597 | https://open.spotify.com/track/7miLbD9XU1SOhXx... | 36 | 2024-05-04 09:42:29+00:00 | 6YCqciyc3cT72TTIddvULV | 6PzulX5BJXNwDxql6gslSA |
| 98 | 4CeeEOM32jQcH3eN9Q2dGj | Smells Like Teen Spirit | 301920 | https://open.spotify.com/track/4CeeEOM32jQcH3e... | 77 | 2024-06-06 16:25:20+00:00 | 2UJcKiJxNryhL050F5Z1Fk | 6olE6TJLqED3rqDCT0FyPh |


In [33]:
song_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   song_id      100 non-null    object             
 1   song_name    100 non-null    object             
 2   duration_ms  100 non-null    int64              
 3   url          100 non-null    object             
 4   popularity   100 non-null    int64              
 5   song_added   100 non-null    datetime64[ns, UTC]
 6   album_id     100 non-null    object             
 7   artist_id    100 non-null    object             
dtypes: datetime64[ns, UTC](1), int64(2), object(5)
memory usage: 7.0+ KB


In [34]:
album_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 93 entries, 0 to 99
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   album_id      93 non-null     object        
 1   name          93 non-null     object        
 2   release_date  93 non-null     datetime64[ns]
 3   total_tracks  93 non-null     int64         
 4   url           93 non-null     object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 4.4+ KB


In [35]:
artist_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89 entries, 0 to 125
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   artist_id     89 non-null     object
 1   artist_name   89 non-null     object
 2   external_url  89 non-null     object
dtypes: object(3)
memory usage: 2.8+ KB


In [36]:
song_df.song_added

0    2024-02-15 21:57:53+00:00
1    2024-02-15 21:57:53+00:00
2    2024-02-15 21:57:53+00:00
3    2024-02-15 21:57:53+00:00
4    2024-02-15 21:57:53+00:00
                ...           
95   2024-05-16 07:38:07+00:00
96   2024-06-04 19:13:07+00:00
97   2024-05-04 09:42:29+00:00
98   2024-06-06 16:25:20+00:00
99   2024-06-06 16:25:23+00:00
Name: song_added, Length: 100, dtype: datetime64[ns, UTC]

In [37]:
album_df.release_date

0    2020-03-20
1    2017-03-03
2    2019-05-17
3    2018-12-14
4    2022-05-20
        ...    
95   2000-01-01
96   2023-08-18
97   2024-05-01
98   1991-09-26
99   2021-09-17
Name: release_date, Length: 93, dtype: datetime64[ns]