## Gathering Data from Spotify Charts and Using the Spotify API

Using data collected from Spotify Charts and the Spotify Web API, we can explore the attributes of popular songs streamed on Spotify. For example, what proportion of top 10 songs are classified as explicit, and how has that proportion changed over the course of a year?

First, we import the necessary libraries. `spotipy` is used to access data from the Spotify Web API, `pandas` handles data wrangling and analysis, `load_dotenv` from the `dotenv` module reads credentials from a `.env` file, and the `os` module helps us manage files, directories, and environment variables.

In [37]:
# Import libraries
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
import pandas as pd
from dotenv import load_dotenv
import os


We use `load_dotenv()` to load the environment variables that hold our Spotify API credentials, then pass them to `SpotifyClientCredentials` so that our requests to the Spotify API are authorized.

In [38]:
# Load credentials from a local .env file into environment variables
load_dotenv()

# Access the variables
SPOTIPY_CLIENT_ID = os.getenv('SPOTIPY_CLIENT_ID')
SPOTIPY_CLIENT_SECRET = os.getenv('SPOTIPY_CLIENT_SECRET')
# Loaded for completeness; not passed to SpotifyClientCredentials below
SPOTIPY_REDIRECT_URI = os.getenv('SPOTIPY_REDIRECT_URI')

# Initialize the Spotify API client with the client credentials
sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id=SPOTIPY_CLIENT_ID,
        client_secret=SPOTIPY_CLIENT_SECRET,
    )
)
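
`os.getenv` returns `None` for a variable that is not set, so it can help to check for missing values before building the client. The following is a minimal sketch and not part of the original notebook; `missing_settings`, `REQUIRED`, and `demo_env` are hypothetical names introduced for illustration.

```python
import os


def missing_settings(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


REQUIRED = ["SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET"]

# Demo with a plain dict standing in for the real environment (placeholder values).
demo_env = {"SPOTIPY_CLIENT_ID": "abc123", "SPOTIPY_CLIENT_SECRET": ""}
demo_missing = missing_settings(REQUIRED, env=demo_env)
print(demo_missing)
```

In the notebook, calling `missing_settings(REQUIRED)` right after `load_dotenv()` and stopping if the list is non-empty would catch a misnamed `.env` entry early.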

We downloaded 52 CSV files from charts.spotify.com, each containing the weekly top 200 songs in the United States, and stored them in a directory. We loop through the files, append the corresponding week as a new `week` column to every row, and then read each modified file into a dataframe.

In [39]:
# Define the directory containing the CSV files
csv_directory = 'data'

# Get a sorted list of all CSV files in the directory
# Sorting keeps the weeks in chronological order, since each file name ends in an ISO date
csv_files = sorted(file for file in os.listdir(csv_directory) if file.endswith('.csv'))

# Initialize an empty list to store the dataframes
dataframes = []

# Loop through the list of CSV files and read each one
for file in csv_files:
    file_path = os.path.join(csv_directory, file)
    
    # Get the week from the file name (not the full path, so hyphens in the directory name cannot interfere)
    week_date = '-'.join(file.split('-')[3:]).split('.')[0]
    
    with open(file_path, 'r') as f:
        songs = f.readlines()
        
    # Add the week to each line
    for i in range(len(songs)):
        songs[i] = songs[i].split('\n')[0]
        if i == 0:
            songs[i] += ',week\n'
        else:
            songs[i] += f',{week_date}\n'

    # Write the modified file content to a new file path and convert to dataframe
    new_file_path = os.path.join(csv_directory, f'new {file}')
    with open(new_file_path, 'w') as f:
        f.writelines(songs)
    df = pd.read_csv(new_file_path)
    os.remove(new_file_path)
    dataframes.append(df)

# Concatenate all dataframes into one
charts = pd.concat(dataframes, ignore_index=True)

# Drop the unneeded source column and display the combined dataframe
charts = charts.drop(columns=['source'])
charts

| | rank | uri | artist_names | track_name | peak_rank | previous_rank | weeks_on_chart | streams | week |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | spotify:track:7K3BhSpAxZBznislvUMVtn | Morgan Wallen | Last Night | 1 | 1 | 21 | 10241241 | 2023-06-22 |
| 1 | 2 | spotify:track:3qQbCzHBycnDpGskqOWY0E | Eslabon Armado, Peso Pluma | Ella Baila Sola | 1 | 2 | 14 | 9427182 | 2023-06-22 |
| 2 | 3 | spotify:track:1BxfuPKGuaTgP7aM0Bbdwr | Taylor Swift | Cruel Summer | 3 | 13 | 32 | 6268357 | 2023-06-22 |
| 3 | 4 | spotify:track:1Lo0QY9cvc8sUB2vnIOxDT | Luke Combs | Fast Car | 4 | 5 | 13 | 6152329 | 2023-06-22 |
| 4 | 5 | spotify:track:7KA4W4McWYRpgf0fWsJZWB | Tyler, The Creator, Kali Uchis | See You Again (feat. Kali Uchis) | 5 | 7 | 103 | 5898838 | 2023-06-22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10395 | 196 | spotify:track:58ge6dfP91o9oXMzq3XkIS | Arctic Monkeys | 505 | 18 | 192 | 178 | 2575446 | 2024-06-13 |
| 10396 | 197 | spotify:track:2ZWlPOoWh0626oTaHrnl2a | Frank Ocean | Ivy | 169 | 183 | 9 | 2574631 | 2024-06-13 |
| 10397 | 198 | spotify:track:53IRnAWx13PYmoVYtemUBS | Chappell Roan | Femininomenon | 198 | -1 | 1 | 2572362 | 2024-06-13 |
| 10398 | 199 | spotify:track:4obHzpwGrjoTuZh2DItEMZ | Morgan Wallen | 7 Summers | 3 | -1 | 68 | 2571426 | 2024-06-13 |

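The week column can also be attached without writing a temporary file, by reading each CSV directly and assigning the date parsed from its name. This is a sketch under assumptions: the file name pattern `regional-us-weekly-YYYY-MM-DD.csv` is inferred from how the loop above splits the name, and `week_from_filename`, `read_week`, and the sample rows are hypothetical.

```python
import io
import re

import pandas as pd

# Assumed pattern: the file name ends in an ISO date followed by ".csv".
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})\.csv$")


def week_from_filename(filename):
    """Extract the YYYY-MM-DD week from a chart file name."""
    match = DATE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No date found in file name: {filename!r}")
    return match.group(1)


def read_week(source, filename):
    """Read one chart CSV and add a constant `week` column taken from its name."""
    return pd.read_csv(source).assign(week=week_from_filename(filename))


# In-memory stand-in for a downloaded file, with made-up rows.
sample_csv = io.StringIO("rank,uri,streams\n1,spotify:track:aaa,100\n2,spotify:track:bbb,90\n")
sample_df = read_week(sample_csv, "regional-us-weekly-2023-06-22.csv")
print(sample_df)
```

With real files, `source` would be the path built by `os.path.join(csv_directory, file)`.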

The `charts` dataframe now holds the data from all 52 weekly files. Since the Spotify Charts data doesn't include every attribute we want, we use `spotipy` to call the Spotify Web API. The `tracks()` method takes a list of track URIs, each of which identifies a unique Spotify song, with a maximum of 50 URIs per call. Our data contains 200 songs for each of the 52 weeks (10,400 rows), so we make separate calls in 50-element intervals. Because the Web API can impose rate limits, we want to gather as much data as possible in each call, so every call to `tracks()` receives a full list of 50 URIs. After executing the cell, `tracks` is a list of dictionaries, one per call, that hold the track information returned by the Web API.

In [40]:
step_size = 50
tracks = []

# Get track info from the Spotify API 50 tracks at a time.
# range() also covers a final, shorter batch if the row count is not a multiple of 50.
for start in range(0, charts.shape[0], step_size):
    uri_list = list(charts['uri'].iloc[start:start + step_size])
    tracks.append(sp.tracks(uri_list))
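
The batching step can also be written as a small helper that keeps a final short batch and skips URIs it has already seen, since many songs chart for several weeks. This is a sketch under assumptions: `batched`, `fetch_in_batches`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sp.tracks` so the example runs without network access.

```python
def batched(items, size=50):
    """Yield consecutive slices of at most `size` items, including a final short one."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(uris, fetch_batch, size=50):
    """Call `fetch_batch` once per batch of unique URIs and collect the responses."""
    unique_uris = list(dict.fromkeys(uris))  # drop repeats, keep first-seen order
    return [fetch_batch(batch) for batch in batched(unique_uris, size)]


def fake_fetch(batch):
    """Offline stand-in that mimics the {'tracks': [...]} response shape."""
    return {"tracks": [{"uri": uri, "explicit": False} for uri in batch]}


# 120 distinct made-up URIs plus one repeat.
demo_uris = [f"spotify:track:{i:03d}" for i in range(120)] + ["spotify:track:000"]
responses = fetch_in_batches(demo_uris, fake_fetch)
batch_sizes = [len(response["tracks"]) for response in responses]
print(batch_sizes)
```

In the notebook, `fetch_in_batches(charts['uri'], sp.tracks)` would play the same role. With the 1186 unique songs reported later in this section, that would mean 24 calls, but the results would then need to be matched back to rows by URI rather than by position.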

The output of `tracks()` is not in the most convenient shape: each call returns a dictionary whose values are lists of dictionaries. These inner dictionaries contain the data we actually want, so we go through each outer dictionary and its lists and append every inner dictionary of track data to a new list.

In [41]:
track_list = []

# Put all track info (dictionaries) into a single list
for dct in tracks:
    for key, val in dct.items():
        for track in val:
            track_list.append(track)
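
An alternative to a flat list is a lookup keyed by URI, which does not depend on the order of the responses. The sketch below uses made-up responses in the `{'tracks': [...]}` shape; `explicit_by_uri` is a hypothetical helper, and skipping `None` entries is an assumption about how a track that cannot be found might be represented.

```python
def explicit_by_uri(responses):
    """Build a {uri: explicit} dictionary from a list of tracks() style responses."""
    lookup = {}
    for response in responses:
        for track in response["tracks"]:
            if track is None:  # assumed placeholder for a track that was not found
                continue
            lookup[track["uri"]] = track["explicit"]
    return lookup


# Made-up responses for illustration only.
demo_responses = [
    {"tracks": [{"uri": "spotify:track:aaa", "explicit": True}, None]},
    {"tracks": [{"uri": "spotify:track:bbb", "explicit": False}]},
]
demo_lookup = explicit_by_uri(demo_responses)
print(demo_lookup)
```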

`track_list` now contains the track data for each song in our `charts` dataframe. Since we are interested in the `explicit` value of each track, we look up the `explicit` key in each track dictionary and save the corresponding boolean to a new list.

In [42]:
# Extract the explicit value from each track and store in a list
exp_list = [track['explicit'] for track in track_list]

`exp_list` contains the `True` or `False` explicit value for each song in `charts`. We can add a new column to `charts` called `explicit` that stores the data in `exp_list`.

In [43]:
# Create a new column for explicit value
charts_info = charts.assign(explicit = exp_list)
charts_info

| | rank | uri | artist_names | track_name | peak_rank | previous_rank | weeks_on_chart | streams | week | explicit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | spotify:track:7K3BhSpAxZBznislvUMVtn | Morgan Wallen | Last Night | 1 | 1 | 21 | 10241241 | 2023-06-22 | True |
| 1 | 2 | spotify:track:3qQbCzHBycnDpGskqOWY0E | Eslabon Armado, Peso Pluma | Ella Baila Sola | 1 | 2 | 14 | 9427182 | 2023-06-22 | False |
| 2 | 3 | spotify:track:1BxfuPKGuaTgP7aM0Bbdwr | Taylor Swift | Cruel Summer | 3 | 13 | 32 | 6268357 | 2023-06-22 | False |
| 3 | 4 | spotify:track:1Lo0QY9cvc8sUB2vnIOxDT | Luke Combs | Fast Car | 4 | 5 | 13 | 6152329 | 2023-06-22 | False |
| 4 | 5 | spotify:track:7KA4W4McWYRpgf0fWsJZWB | Tyler, The Creator, Kali Uchis | See You Again (feat. Kali Uchis) | 5 | 7 | 103 | 5898838 | 2023-06-22 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10395 | 196 | spotify:track:58ge6dfP91o9oXMzq3XkIS | Arctic Monkeys | 505 | 18 | 192 | 178 | 2575446 | 2024-06-13 | False |
| 10396 | 197 | spotify:track:2ZWlPOoWh0626oTaHrnl2a | Frank Ocean | Ivy | 169 | 183 | 9 | 2574631 | 2024-06-13 | True |
| 10397 | 198 | spotify:track:53IRnAWx13PYmoVYtemUBS | Chappell Roan | Femininomenon | 198 | -1 | 1 | 2572362 | 2024-06-13 | True |
| 10398 | 199 | spotify:track:4obHzpwGrjoTuZh2DItEMZ | Morgan Wallen | 7 Summers | 3 | -1 | 68 | 2571426 | 2024-06-13 | False |

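Assigning a list as a column relies on the list being in exactly the same order as the rows. A possible alternative, sketched here with toy data, is to map each row's URI through a lookup dictionary so that alignment is by key. The names `toy_charts`, `toy_lookup`, and `toy_info` are hypothetical, as are the URIs.

```python
import pandas as pd

# Toy stand-in for the charts dataframe; the last URI has no entry in the lookup.
toy_charts = pd.DataFrame({
    "rank": [1, 2, 1, 2],
    "uri": ["spotify:track:aaa", "spotify:track:bbb",
            "spotify:track:aaa", "spotify:track:zzz"],
})

# Toy lookup from URI to explicit flag.
toy_lookup = {"spotify:track:aaa": True, "spotify:track:bbb": False}

# Series.map looks up each URI; URIs missing from the lookup become NaN.
toy_info = toy_charts.assign(explicit=toy_charts["uri"].map(toy_lookup))
print(toy_info)
```

Rows whose URI is absent from the lookup show up as missing values, which makes gaps visible instead of silently shifting later rows.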

Using a `groupby` operation, we can count the number of `True` and `False` explicit values. It turns out that non-explicit entries outnumbered explicit ones in the top 200 over the past 52 weeks (5,509 versus 4,891).

In [44]:
charts_count = charts_info.groupby('explicit').count()
charts_count = charts_count.assign(Count = charts_count.get('rank')).get(['Count'])
charts_count

| explicit | Count |
| --- | --- |
| False | 5509 |
| True | 4891 |

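The same counts can be obtained more directly with `value_counts`, and since the opening question is about proportions, the share is one step further. This is a sketch on toy data (`toy` is a hypothetical dataframe). With the counts shown above, the explicit share works out to 4891 / 10400, or about 47%.

```python
import pandas as pd

# Toy stand-in for charts_info with only the column of interest.
toy = pd.DataFrame({"explicit": [True, False, False, True, False]})

# Count of rows per explicit value.
counts = toy["explicit"].value_counts()

# Share of rows per explicit value (the shares sum to 1).
shares = toy["explicit"].value_counts(normalize=True)

# The mean of a boolean column is the fraction of True values.
explicit_share = toy["explicit"].mean()

print(counts)
print(shares)
print(explicit_share)
```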

Finding the number of unique songs may be more informative, because many songs stay in the weekly top 200 for many weeks, so the same song occurs multiple times. Using a `groupby` on the `uri` and `explicit` columns, we see that the number of rows in the dataframe is reduced significantly, to 1186.

In [45]:
unique_exp_counts = charts_info.groupby(['uri', 'explicit']).count().reset_index()
unique_exp_counts

| | uri | explicit | rank | artist_names | track_name | peak_rank | previous_rank | weeks_on_chart | streams | week |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | spotify:track:003vvx7Niy0yvhvHt4a68B | False | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 |
| 1 | spotify:track:00n83h3zn2IrySO4Q4aTrG | True | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | spotify:track:00syWkRGIVQvYsg2OwfBUw | True | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 3 | spotify:track:01Ho3efkRrIbYnWxISj05V | True | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | spotify:track:01Lr5YepbgjXAWR9iOEyH1 | True | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1181 | spotify:track:7xapw9Oy21WpfEcib2ErSA | False | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| 1182 | spotify:track:7yNf9YjeO5JXUE3JEBgnYc | False | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| 1183 | spotify:track:7yogx3TwxGwSxO2QITsT2q | False | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 1184 | spotify:track:7yq4Qj7cqayVTp3FF9CWbm | False | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |


Taking this a step further, we can apply a second `groupby` to the result of the previous one (after resetting its index), which yields the total number of unique explicit and non-explicit songs.

In [46]:
unique_exp_counts = unique_exp_counts.groupby('explicit').count()
unique_exp_counts = unique_exp_counts.assign(Count = unique_exp_counts.get('rank')).get(['Count'])
unique_exp_counts

| explicit | Count |
| --- | --- |
| False | 604 |
| True | 582 |

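The two-step `groupby` can also be expressed by keeping one row per URI and then counting. This is a sketch on toy data, assuming each URI always carries the same explicit value; `toy` and `unique_counts` are hypothetical names.

```python
import pandas as pd

# Toy chart rows: song "a" appears twice, "b" once, "c" three times.
toy = pd.DataFrame({
    "uri": ["a", "a", "b", "c", "c", "c"],
    "explicit": [True, True, False, False, False, False],
})

# Keep the first row for each URI, then count explicit values among unique songs.
unique_counts = toy.drop_duplicates("uri")["explicit"].value_counts()
print(unique_counts)
```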

It appears there are still more non-explicit songs than explicit songs when looking at unique songs from the Spotify Charts, although the gap is narrower (604 versus 582). We can also see how many non-explicit and explicit songs there are on a weekly basis:

In [47]:
weekly_charts = charts_info.groupby(['week', 'explicit']).count()
weekly_charts = weekly_charts.assign(Count = weekly_charts.get('rank')).get(['Count'])
weekly_charts

| week | explicit | Count |
| --- | --- | --- |
| 2023-06-22 | False | 106 |
| 2023-06-22 | True | 94 |
| 2023-06-29 | False | 98 |
| 2023-06-29 | True | 102 |
| 2023-07-06 | False | 106 |
| ... | ... | ... |
| 2024-05-30 | True | 84 |
| 2024-06-06 | False | 117 |
| 2024-06-06 | True | 83 |
| 2024-06-13 | False | 117 |

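The weekly counts above can also be expressed as a single explicit share per week, and the same pattern restricted to ranks 1 through 10 addresses the top 10 question from the introduction. This is a sketch on toy data; `toy`, `weekly_share`, and `top10_share` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Toy chart rows for two weeks, with ranks inside and outside the top 10.
toy = pd.DataFrame({
    "week": ["2023-06-22"] * 4 + ["2023-06-29"] * 4,
    "rank": [1, 2, 11, 12, 1, 2, 11, 12],
    "explicit": [True, False, True, True, False, False, True, False],
})

# Fraction of explicit songs per week across all ranks.
weekly_share = toy.groupby("week")["explicit"].mean()

# Same fraction, restricted to songs ranked 10 or better.
top10_share = toy[toy["rank"] <= 10].groupby("week")["explicit"].mean()

print(weekly_share)
print(top10_share)
```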

Let's take a deeper look at the differences. 