## MSDS 696 Notebook 1: Spotify Data Pull Using the Spotipy Lightweight API

## Project Title:
Create and Build a Data Engineering Pipeline to Collect, Process, and Store Spotify Data. This is intended to be a fun project that looks at who the most popular artists are, what their most popular tracks are, and some characteristics of the songs.

### Mary J Hollon
### Due 8-22-2024

#### Notebook Purpose:

The purpose of this notebook is to pull data from Spotify using the spotipy library (https://pypi.org/project/spotipy/), a lightweight Python library for the Spotify Web API. Verify that spotipy is installed before you begin. You must also obtain developer credentials from Spotify before running anything; see https://developer.spotify.com/documentation/web-api for how to get them. My personal credentials are stored in a Python file named spotify_credentials.py.
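A minimal sketch of the two prerequisites. The constant names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` are the ones this notebook imports; the helper `is_installed` and the commented file layout are assumptions for illustration only.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Check for spotipy before running any cell that imports it.
if not is_installed("spotipy"):
    print("spotipy is not installed; run: pip install spotipy")

# Assumed layout of spotify_credentials.py (placeholder values only;
# keep the real file out of version control):
# SPOTIFY_CLIENT_ID = "your-client-id"
# SPOTIFY_CLIENT_SECRET = "your-client-secret"
```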

### NOTE: DO NOT EXECUTE WITHOUT Spotify Developer Credentials !!!

In [1]:
# Import necessary libraries
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
from spotify_credentials import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
import random

# Authenticate with Spotify
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET))

# Function to get up to `limit` artists (400 by default, in pages of 50) for a given year
def get_top_artists(year, limit=400):
    all_artists = []
    for offset in range(0, limit, 50):
        results = sp.search(q=f'year:{year}', type='artist', limit=50, offset=offset)
        artists = results['artists']['items']
        if not artists:
            break
        all_artists.extend(artists)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds
    return all_artists

# Function to get up to `limit` tracks (400 by default, in pages of 50) for a given year
def get_top_tracks_for_year(year, limit=400):
    all_tracks = []
    for offset in range(0, limit, 50):
        results = sp.search(q=f'year:{year}', type='track', limit=50, offset=offset)
        tracks = results['tracks']['items']
        if not tracks:
            break
        all_tracks.extend(tracks)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds
    return all_tracks

# Function to fetch and process data for a given year
def fetch_year_data(year):
    artists = get_top_artists(year)
    tracks = get_top_tracks_for_year(year)

    # Process artist data
    artist_data = []
    for artist in artists:
        artist_data.append({
            'id': artist['id'],
            'name': artist['name'],
            'popularity': artist['popularity'],
            'genres': artist['genres'],
            'year': year
        })

    # Process track data
    track_data = []
    for track in tracks:
        release_date = track['album']['release_date']
        if release_date.startswith(str(year)):
            track_data.append({
                'id': track['id'],
                'name': track['name'],
                'artist_id': track['artists'][0]['id'],
                'year': year,
                'popularity': track['popularity'],
                'release_date': release_date
            })

    # Convert to DataFrames
    artists_df = pd.DataFrame(artist_data)
    tracks_df = pd.DataFrame(track_data)

    return artists_df, tracks_df

# Function to save DataFrame to CSV
def save_to_csv(df, filename):
    df.to_csv(filename, index=False)

# Main execution part
def main(years):
    # Start the timer
    start_time = time.time()

    for year in years:
        print(f"Fetching data for the year {year}...")
        artists_df, tracks_df = fetch_year_data(year)
        save_to_csv(artists_df, f'artists_{year}.csv')
        save_to_csv(tracks_df, f'tracks_{year}.csv')
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds between each year

    # End the timer
    end_time = time.time()

    # Calculate and print the elapsed time
    elapsed_time = (end_time - start_time)/60
    print(f"Time to Execute: {elapsed_time} minutes")

# Specify the range of years
years = range(2020, 2025)

# Run the main function
main(years)

# Function to read and print the head of DataFrames
def read_and_print_files(years):
    for year in years:
        artists_filename = f'artists_{year}.csv'
        tracks_filename = f'tracks_{year}.csv'

        try:
            artists_df = pd.read_csv(artists_filename)
            tracks_df = pd.read_csv(tracks_filename)

            print(f"\nArtists DataFrame for {year}:")
            print(artists_df.head())

            print(f"\nTracks DataFrame for {year}:")
            print(tracks_df.head())

        except FileNotFoundError as e:
            print(f"Error: {e}")

# Call the function to read and print the DataFrames
read_and_print_files(years)


Fetching data for the year 2020...
Fetching data for the year 2021...
Fetching data for the year 2022...
Fetching data for the year 2023...
Fetching data for the year 2024...
Time to Execute: 24.177425674597423 minutes

Artists DataFrame for 2020:
                       id           name  popularity  \
0  6JMGrupbzJZ3yuQhTGyeHr      Year 200X          15   
1  06HL4z0CvFAxyc27GXpf02   Taylor Swift         100   
2  3TVXtAsR1Inumwj472S9r4          Drake          95   
3  40ZNYROS4zLfyyBSs2PGe2     Zach Bryan          91   
4  4oUHIQIBe0LHzYfvXNW4QM  Morgan Wallen          91   

                                              genres  year  
0                                      ['scorecore']  2020  
1                                            ['pop']  2020  
2  ['canadian hip hop', 'canadian pop', 'hip hop'...  2020  
3                       ['classic oklahoma country']  2020  
4                           ['contemporary country']  2020  

Tracks DataFrame for 2020:
                     

In [2]:
# Function to read and print the head of DataFrames
def read_and_print_files(years):
    for year in years:
        artists_filename = f'artists_{year}.csv'
        tracks_filename = f'tracks_{year}.csv'

        try:
            artists_df = pd.read_csv(artists_filename)
            tracks_df = pd.read_csv(tracks_filename)

            print(f"\nArtists DataFrame for {year}:")
            print(artists_df.head())

            print(f"\nTracks DataFrame for {year}:")
            print(tracks_df.head())

        except FileNotFoundError as e:
            print(f"Error: {e}")

# Specify the range of years
years = range(2020, 2025)

# Call the function to read and print the DataFrames
read_and_print_files(years)



Artists DataFrame for 2020:
                       id           name  popularity  \
0  6JMGrupbzJZ3yuQhTGyeHr      Year 200X          15   
1  06HL4z0CvFAxyc27GXpf02   Taylor Swift         100   
2  3TVXtAsR1Inumwj472S9r4          Drake          95   
3  40ZNYROS4zLfyyBSs2PGe2     Zach Bryan          91   
4  4oUHIQIBe0LHzYfvXNW4QM  Morgan Wallen          91   

                                              genres  year  
0                                      ['scorecore']  2020  
1                                            ['pop']  2020  
2  ['canadian hip hop', 'canadian pop', 'hip hop'...  2020  
3                       ['classic oklahoma country']  2020  
4                           ['contemporary country']  2020  

Tracks DataFrame for 2020:
                       id                                     name  \
0  0zirWZTcXBBwGsevrsIpvT  Clean Baby Sleep White Noise (Loopable)   
1  3FU6urUVsgXa6RBuV2PdRk          Heartless (feat. Morgan Wallen)   
2  2UikqkwBv7aIvlixeVXHWt     

In [3]:
# Run the main execution function for years 2015 through 2019
def main(years):
    # Start the timer
    start_time = time.time()

    for year in years:
        print(f"Fetching data for the year {year}...")
        artists_df, tracks_df = fetch_year_data(year)
        save_to_csv(artists_df, f'artists_{year}.csv')
        save_to_csv(tracks_df, f'tracks_{year}.csv')
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds between each year

    # End the timer
    end_time = time.time()

    # Calculate and print the elapsed time
    elapsed_time = (end_time - start_time)/60
    print(f"Time to Execute: {elapsed_time} minutes")

# Specify the range of years
years = range(2015, 2020)

# Run the main function
main(years)

Fetching data for the year 2015...
Fetching data for the year 2016...
Fetching data for the year 2017...
Fetching data for the year 2018...
Fetching data for the year 2019...
Time to Execute: 25.33900886774063 minutes


In [4]:
# Specify the range of years
years = range(2015, 2020)

# Call the function to read and print the DataFrames
read_and_print_files(years)



Artists DataFrame for 2015:
                       id            name  popularity  \
0  06HL4z0CvFAxyc27GXpf02    Taylor Swift         100   
1  3TVXtAsR1Inumwj472S9r4           Drake          95   
2  4oUHIQIBe0LHzYfvXNW4QM   Morgan Wallen          91   
3  2YZyLoL8N0Wb9xBt1NhZWg  Kendrick Lamar          92   
4  5K4W6rqBFWDnAN6FQUkS6x      Kanye West          92   

                                              genres  year  
0                                            ['pop']  2015  
1  ['canadian hip hop', 'canadian pop', 'hip hop'...  2015  
2                           ['contemporary country']  2015  
3  ['conscious hip hop', 'hip hop', 'rap', 'west ...  2015  
4                  ['chicago rap', 'hip hop', 'rap']  2015  

Tracks DataFrame for 2015:
                       id                        name               artist_id  \
0  3fqwjXwUGN6vbzIwvyFMhx           Tennessee Whiskey  4YLtscXsxbVgi031ovDDdh   
1  3pXF1nA74528Edde4of9CC                       Don't  2EMAnMvWE2eb56ToJ

### Summary:
This code pulls data from Spotify into two files per year. The first file contains artist id, name, popularity score, associated genres, and year, and is written to a .csv file. The second file contains track information (id, name, artist_id, year, popularity, and release_date) and is also written to a .csv file. Audio features such as energy and danceability are added to the track files in a later step.
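One detail visible in the printed output: after the round trip through .csv, the genres column comes back as text such as `['pop']` rather than as a Python list. A minimal sketch of converting such a cell back into a list, using the standard library's `ast.literal_eval` (the helper name `parse_genres` is hypothetical):

```python
import ast

def parse_genres(cell: str) -> list:
    """Convert a CSV cell such as "['pop', 'rap']" back into a Python list."""
    return ast.literal_eval(cell)

print(parse_genres("['canadian hip hop', 'canadian pop']"))
```

With pandas, the same conversion could be applied at load time by passing `converters={'genres': ast.literal_eval}` to `pd.read_csv`.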

The final artist and track files produced by this code are listed below:
- artists_2015          tracks_2015
- artists_2016          tracks_2016
- artists_2017          tracks_2017
- artists_2018          tracks_2018
- artists_2019          tracks_2019
- artists_2020          tracks_2020
- artists_2021          tracks_2021
- artists_2022          tracks_2022
- artists_2023          tracks_2023
- artists_2024          tracks_2024
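The run times printed above (about 24.2 and 25.3 minutes for the two five-year batches) are mostly sleep time. A rough estimate, assuming every search returns full pages so the loops never break early, and that the API calls themselves take negligible time:

```python
# Back-of-the-envelope estimate of one five-year run.
PAGE_SIZE = 50
LIMIT = 400
SLEEP_MIN, SLEEP_MAX = 5, 30   # bounds passed to random.uniform
YEARS_PER_RUN = 5

pages_per_search = len(range(0, LIMIT, PAGE_SIZE))   # 8 requests per search type
sleeps_per_year = 2 * pages_per_search + 1           # artists + tracks + pause between years
mean_sleep = (SLEEP_MIN + SLEEP_MAX) / 2             # 17.5 s for a uniform draw
expected_minutes = YEARS_PER_RUN * sleeps_per_year * mean_sleep / 60

print(f"{pages_per_search} pages per search, about {expected_minutes:.1f} minutes per run")
# prints "8 pages per search, about 24.8 minutes per run"
```

The estimate sits between the two observed run times, which suggests that shortening the sleep interval is the main lever for a faster pull, at the cost of a higher risk of rate limiting.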


Now that the track data has been pulled, I want to add track audio features to the track files.
I am splitting this into two runs of five years each so that I do not trigger HTTP 429 (Too Many Requests) rate-limit errors.
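Besides splitting the work and sleeping between requests, a common complement is to retry with exponential backoff when a 429 does occur. This is a generic sketch and not what the notebook does: `RateLimitError` and `call_with_backoff` are hypothetical names, and the exception class is a stand-in for whatever the API client raises.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""
    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server, if any

def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on RateLimitError wait and retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # give up after the final attempt
            # Prefer the server's hint; otherwise double the delay each attempt.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

With spotipy, the equivalent would presumably be to catch `spotipy.exceptions.SpotifyException` and check its `http_status` for 429; verify this against the installed version before relying on it.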

In [5]:

# Import necessary libraries
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time
from spotify_credentials import SPOTIFY_CLIENT_ID, SPOTIFY_CLIENT_SECRET
import random

# Authenticate with Spotify
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET))

# Function to get audio features for a list of track IDs
def get_audio_features_for_tracks(track_ids):
    features = []
    for i in range(0, len(track_ids), 100):  # Spotify API allows up to 100 IDs per request
        batch = track_ids[i:i + 100]
        results = sp.audio_features(batch)
        features.extend(results)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds
    return features

# Function to read the existing tracks CSV, fetch audio features, and save the updated DataFrame
def update_tracks_with_audio_features(year):
    # Read the existing tracks CSV
    tracks_filename = f'tracks_{year}.csv'
    try:
        tracks_df = pd.read_csv(tracks_filename)
    except FileNotFoundError:
        print(f"Error: File {tracks_filename} not found.")
        return

    # Extract track IDs
    track_ids = tracks_df['id'].tolist()

    # Fetch audio features
    audio_features = get_audio_features_for_tracks(track_ids)
    audio_features_dict = {feature['id']: feature for feature in audio_features if feature}

    # Append audio features to the track data
    audio_features_data = []
    for track_id in track_ids:
        feature = audio_features_dict.get(track_id)
        if feature:
            audio_features_data.append({
                'id': track_id,
                'energy': feature['energy'],
                'danceability': feature['danceability'],
                'instrumentalness': feature['instrumentalness'],
                'loudness': feature['loudness'],
                'tempo': feature['tempo'],
                'valence': feature['valence']
            })

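    # Name the columns explicitly so the merge below still works if no features were returned
    audio_features_df = pd.DataFrame(
        audio_features_data,
        columns=['id', 'energy', 'danceability', 'instrumentalness', 'loudness', 'tempo', 'valence']
    )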

    # Merge the audio features with the original tracks DataFrame
    updated_tracks_df = tracks_df.merge(audio_features_df, on='id', how='left')

    # Save the updated DataFrame back to the CSV
    updated_tracks_filename = f'updated_tracks_{year}.csv'
    updated_tracks_df.to_csv(updated_tracks_filename, index=False)
    print(f"Updated file saved as {updated_tracks_filename}")

# Main execution part
def main(years):
    # Start the timer
    start_time = time.time()

    for year in years:
        print(f"Updating tracks for the year {year}...")
        update_tracks_with_audio_features(year)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds between each year

    # End the timer
    end_time = time.time()

    # Calculate and print the elapsed time
    elapsed_time = (end_time - start_time)/60
    print(f"Time to Execute: {elapsed_time} minutes")

# Specify the range of years
years = range(2020, 2025)

# Run the main function
main(years)

# Function to read and print the head of updated DataFrames
def read_and_print_files(years):
    for year in years:
        updated_tracks_filename = f'updated_tracks_{year}.csv'
        try:
            updated_tracks_df = pd.read_csv(updated_tracks_filename)
            print(f"\nUpdated Tracks DataFrame for {year}:")
            print(updated_tracks_df.head())
        except FileNotFoundError as e:
            print(f"Error: {e}")

# Call the function to read and print the updated DataFrames
read_and_print_files(years)


Updating tracks for the year 2020...
Updated file saved as updated_tracks_2020.csv
Updating tracks for the year 2021...
Updated file saved as updated_tracks_2021.csv
Updating tracks for the year 2022...
Updated file saved as updated_tracks_2022.csv
Updating tracks for the year 2023...
Updated file saved as updated_tracks_2023.csv
Updating tracks for the year 2024...
Updated file saved as updated_tracks_2024.csv
Time to Execute: 6.686703463395436 minutes

Updated Tracks DataFrame for 2020:
                       id                                     name  \
0  0zirWZTcXBBwGsevrsIpvT  Clean Baby Sleep White Noise (Loopable)   
1  3FU6urUVsgXa6RBuV2PdRk          Heartless (feat. Morgan Wallen)   
2  2UikqkwBv7aIvlixeVXHWt                You Should Probably Leave   
3  3hxIUxnT27p5WcmjGUXNwx                  Shut up My Moms Calling   
4  3jHdKaLCkuNEkWcLVmQPCX                            BEST INTEREST   

                artist_id  year  popularity release_date  energy  \
0  6Cqtx9fpxzggIM

Let's get the track audio data for the remaining years, 2015 through 2019.
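As in the previous cell, track IDs are sent in batches of at most 100 per request. The slicing pattern can be seen in isolation in this small sketch, where `chunked` is a hypothetical helper and the track IDs are made up:

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

track_ids = [f"track_{n}" for n in range(250)]       # made-up IDs for illustration
print([len(batch) for batch in chunked(track_ids)])  # prints [100, 100, 50]
```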

In [6]:
# Authenticate with Spotify
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=SPOTIFY_CLIENT_ID, client_secret=SPOTIFY_CLIENT_SECRET))

# Function to get audio features for a list of track IDs
def get_audio_features_for_tracks(track_ids):
    features = []
    for i in range(0, len(track_ids), 100):  # Spotify API allows up to 100 IDs per request
        batch = track_ids[i:i + 100]
        results = sp.audio_features(batch)
        features.extend(results)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds
    return features

# Function to read the existing tracks CSV, fetch audio features, and save the updated DataFrame
def update_tracks_with_audio_features(year):
    # Read the existing tracks CSV
    tracks_filename = f'tracks_{year}.csv'
    try:
        tracks_df = pd.read_csv(tracks_filename)
    except FileNotFoundError:
        print(f"Error: File {tracks_filename} not found.")
        return

    # Extract track IDs
    track_ids = tracks_df['id'].tolist()

    # Fetch audio features
    audio_features = get_audio_features_for_tracks(track_ids)
    audio_features_dict = {feature['id']: feature for feature in audio_features if feature}

    # Append audio features to the track data
    audio_features_data = []
    for track_id in track_ids:
        feature = audio_features_dict.get(track_id)
        if feature:
            audio_features_data.append({
                'id': track_id,
                'energy': feature['energy'],
                'danceability': feature['danceability'],
                'instrumentalness': feature['instrumentalness'],
                'loudness': feature['loudness'],
                'tempo': feature['tempo'],
                'valence': feature['valence']
            })

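    # Name the columns explicitly so the merge below still works if no features were returned
    audio_features_df = pd.DataFrame(
        audio_features_data,
        columns=['id', 'energy', 'danceability', 'instrumentalness', 'loudness', 'tempo', 'valence']
    )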

    # Merge the audio features with the original tracks DataFrame
    updated_tracks_df = tracks_df.merge(audio_features_df, on='id', how='left')

    # Save the updated DataFrame back to the CSV
    updated_tracks_filename = f'updated_tracks_{year}.csv'
    updated_tracks_df.to_csv(updated_tracks_filename, index=False)
    print(f"Updated file saved as {updated_tracks_filename}")

# Main execution part
def main(years):
    # Start the timer
    start_time = time.time()

    for year in years:
        print(f"Updating tracks for the year {year}...")
        update_tracks_with_audio_features(year)
        time.sleep(random.uniform(5, 30))  # Sleep for a random time between 5 and 30 seconds between each year

    # End the timer
    end_time = time.time()

    # Calculate and print the elapsed time
    elapsed_time = (end_time - start_time)/60
    print(f"Time to Execute: {elapsed_time} minutes")

# Specify the range of years
years = range(2015, 2020)

# Run the main function
main(years)

# Function to read and print the head of updated DataFrames
def read_and_print_files(years):
    for year in years:
        updated_tracks_filename = f'updated_tracks_{year}.csv'
        try:
            updated_tracks_df = pd.read_csv(updated_tracks_filename)
            print(f"\nUpdated Tracks DataFrame for {year}:")
            print(updated_tracks_df.head())
        except FileNotFoundError as e:
            print(f"Error: {e}")

# Call the function to read and print the updated DataFrames
read_and_print_files(years)


Updating tracks for the year 2015...
Updated file saved as updated_tracks_2015.csv
Updating tracks for the year 2016...
Updated file saved as updated_tracks_2016.csv
Updating tracks for the year 2017...
Updated file saved as updated_tracks_2017.csv
Updating tracks for the year 2018...
Updated file saved as updated_tracks_2018.csv
Updating tracks for the year 2019...
Updated file saved as updated_tracks_2019.csv
Time to Execute: 7.36453865369161 minutes

Updated Tracks DataFrame for 2015:
                       id                        name               artist_id  \
0  3fqwjXwUGN6vbzIwvyFMhx           Tennessee Whiskey  4YLtscXsxbVgi031ovDDdh   
1  3pXF1nA74528Edde4of9CC                       Don't  2EMAnMvWE2eb56ToJVfCWs   
2  0QZ5yyl6B6utIWkxeBDxQN            The Night We Met  6ltzsmQQbmdoHHbLZ4ZN25   
3  6K4t31amVTZDgR3sKmwUJJ  The Less I Know The Better  5INjqkS1o8h1imAzPqGZBb   
4  43PuMrRfbyyuz4QpZ3oAwN                    Exchange  2EMAnMvWE2eb56ToJVfCWs   

   year  popularity 

The updated track files contain the following information: id, name, artist_id, year, popularity, release_date, energy, danceability, instrumentalness, loudness, tempo, and valence.
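A quick way to confirm that an updated file has this layout is to inspect only its header row. The sketch below uses the standard library's `csv` module; `missing_columns` is a hypothetical helper, and the column order is assumed from the code in this notebook.

```python
import csv

EXPECTED_COLUMNS = [
    "id", "name", "artist_id", "year", "popularity", "release_date",
    "energy", "danceability", "instrumentalness", "loudness", "tempo", "valence",
]

def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])  # empty file -> empty header
    return [name for name in expected if name not in header]
```

For example, `missing_columns('updated_tracks_2015.csv')` should return an empty list if the file was written as described.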

The final track data which now includes audio feature data is contained in the files:
- updated_tracks_2015
- updated_tracks_2016
- updated_tracks_2017
- updated_tracks_2018
- updated_tracks_2019
- updated_tracks_2020
- updated_tracks_2021
- updated_tracks_2022
- updated_tracks_2023
- updated_tracks_2024

# END of Notebook