# Spotify Hit Songs: What non-musical factors make a song successful?

### SoftDes Mid-term Project - Spring 2023 - Olin College of Engineering
### Sparsh Gupta and Sohum Kothavade

## Introduction

The music industry has evolved significantly over the past couple of decades. Digital innovations such as Apple's iTunes (2001) and music streaming services such as Spotify and Apple Music have transformed the industry and made listening to music accessible to almost everyone. As a result, more creators and artists have entered the industry to pursue their passion and release original music. Consequently, the music industry has become extremely competitive for artists who want to make their mark by creating 'hit' songs.

Spotify, a music streaming service founded in 2006, has become the world's largest music streaming provider, with 433 million total users across 184 markets/regions as of 2022. As a result, many songs find a large share of their listeners and streams on this platform. Statistics from Spotify data therefore let us look into interesting patterns and trends that make a 'hit' song.

Now, how do we define a 'hit' song? Spotify's data includes a parameter for every track called 'popularity'. This is a value between 0 and 100, with 100 being the most popular. It is computed by Spotify's algorithm, which uses statistics such as the total number of plays of the track and how recent those plays are. We use this metric to classify songs as 'hit' songs or not, which we explore later in this essay.

In this computational essay, we primarily explore how certain non-musical factors, such as a song's release date and album, are correlated with its 'popularity' and how these factors affect a song's success. To do this, we used Spotify's API to obtain data from 'The Million Playlist' dataset. We process this data to extract statistics for the non-musical factors considered in this study and use visualizations to understand how they relate to the 'popularity' of a song.

## Methodology

The data used in this study comes from Spotify's ['The Million Playlist'](https://research.atspotify.com/2020/09/the-million-playlist-dataset-remastered/) dataset, which contains one million playlists and more than two million unique songs/tracks. We obtained the track data used in this study through Spotify's API.


Spotify's Web API is based on REST principles, and its data resources are accessed via standard HTTPS requests to an API endpoint, which we make from Python. We used a valid access token to authorize our Web API requests and get access to the music statistics data. We also used Python's SpotiPy package to extract data from unique song identifiers. To do this, we import the required modules from SpotiPy:


In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
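
For context, the kind of HTTPS request that SpotiPy issues on our behalf can be sketched with the standard library alone. This is a minimal sketch rather than the code used in this study: the `build_track_request` helper, the track ID, and the token value are hypothetical placeholders, and the request is only constructed, never sent.

```python
from urllib.request import Request

API_BASE = "https://api.spotify.com/v1"


def build_track_request(track_id: str, access_token: str) -> Request:
    """Build (but do not send) a GET request for a single track.

    Hypothetical helper for illustration; SpotiPy handles this internally.
    """
    url = f"{API_BASE}/tracks/{track_id}"
    # The Web API expects the access token as a Bearer token header.
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="GET")


# Placeholder values, not real credentials or a real track ID.
request = build_track_request("EXAMPLE_TRACK_ID", "EXAMPLE_TOKEN")
print(request.get_method(), request.full_url)
```

Using SpotiPy saves us from building these requests and refreshing tokens by hand.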

To access the data, we sign in to our Spotify Developer account to obtain API credentials (a client ID and a client secret). We then use these credentials to authenticate and create our "Spotify" object with the following lines of code:

In [None]:
# Setting up the SpotiPy client with Spotify app credentials
# (client_id and client_secret must be defined before this cell runs)
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
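
The cell above expects `client_id` and `client_secret` to exist already. One common way to supply them without writing secrets into the notebook is to read them from environment variables. The sketch below is an assumption about how this could be done, not a record of our setup: `load_credentials` is a hypothetical helper, and the variable names are chosen to match the ones SpotiPy documents for its own defaults.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of variables.

    Hypothetical helper; the variable names are an assumption of this sketch.
    """
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dictionary standing in for the real environment.
client_id, client_secret = load_credentials(
    {"SPOTIPY_CLIENT_ID": "example-id", "SPOTIPY_CLIENT_SECRET": "example-secret"}
)
```

In real use, `load_credentials()` would be called with no arguments so that it reads from `os.environ`.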

The dataset is formatted like a Python dictionary. We read the desired fields for each track and append them to a list, which holds the track data for our chosen non-musical factors. We then use Pandas to convert this list to a Pandas DataFrame as follows:


In [None]:
import pandas as pd
# Converting list to Pandas DataFrame
data = pd.DataFrame(data)
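
The step of reading fields from a track dictionary and appending them to a list can be sketched as follows. This is an illustration under stated assumptions: `track_to_row` is a hypothetical helper, the example track is hand-written with made-up values, and joining artist names into one string is our guess at a reasonable representation. The dictionary keys follow the field names of a Web API track object.

```python
def track_to_row(playlist_name, track):
    """Flatten one track dictionary into a list of the ten values we keep."""
    album = track["album"]
    return [
        playlist_name,
        track["name"],
        album["name"],
        # Assumption: several artists are stored as one comma-separated string.
        ", ".join(artist["name"] for artist in track["artists"]),
        album["release_date"],
        track["duration_ms"],
        track["popularity"],
        track["explicit"],
        # Store the number of markets rather than the full list of codes.
        len(track["available_markets"]),
        album["album_type"],
    ]


# Hand-written example shaped like a track object (all values are made up).
example_track = {
    "name": "Example Song",
    "album": {
        "name": "Example Album",
        "release_date": "2020-03-20",
        "album_type": "album",
    },
    "artists": [{"name": "Artist A"}, {"name": "Artist B"}],
    "duration_ms": 210000,
    "popularity": 71,
    "explicit": False,
    "available_markets": ["US", "CA", "GB"],
}

rows = [track_to_row("Example Playlist", example_track)]
```

A list of rows built this way has one entry per track and ten values per entry, in the same order as the csv header used in this essay.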

Then, we export this DataFrame to a comma-separated values (csv) file using Pandas, with a header naming our chosen non-musical factors, and store it on our local machine.

In [None]:
# Exporting track data into a csv file
data.to_csv("data.csv",
            header=["playlist_name", "track_name", "track_album",
                    "track_artists", "track_release_date",
                    "track_length", "track_popularity", "track_explicit",
                    "track_markets", "track_album_type"],
            index=False)

In our study, we extract a total of ten data columns: the 'popularity' metric and the nine major non-musical factors that we have identified, listed below:

- **track_popularity**: This is a measure of the track's popularity on the streaming service, usually based on the number of plays or listens.
- **playlist_name**: This is the name of the playlist, which is a collection of tracks organized by a user or the streaming service.
- **track_name**: This is the name of the individual track within the playlist.
- **track_album**: This refers to the album that the track is a part of, which may have multiple tracks.
- **track_artists**: These are the artists who performed the track.
- **track_release_date**: This is the date when the track was officially released by the artist.
- **track_length**: This is the duration of the track in milliseconds.
- **track_explicit**: This indicates whether the track contains explicit lyrics or content.
- **track_markets**: This is the number of geographic regions where the track is available for streaming.
- **track_album_type**: This describes the type of album that the track belongs to, which Spotify reports as an album, a single, or a compilation.
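
Two of these columns usually need light preprocessing before they can be plotted. The sketch below is a suggestion rather than the processing used in this study: both helper names are hypothetical, and it assumes release dates may arrive with year, month, or day precision, which is how Spotify documents them.

```python
def release_year(release_date: str) -> int:
    """Return the year from 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'."""
    # The year is always the first dash-separated component.
    return int(release_date.split("-")[0])


def length_in_minutes(track_length_ms: int) -> float:
    """Convert a duration in milliseconds to minutes."""
    return track_length_ms / 60_000


# Made-up sample values covering the three date precisions.
years = [release_year(d) for d in ["1999", "2015-06", "2020-03-20"]]
minutes = length_in_minutes(210000)
```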


We then use a function, 'load_data', which uses Pandas' 'read_csv' function to load the stored csv file (data.csv) from our local machine into the Python script as a Pandas DataFrame:

In [None]:
# Load the Spotify data from csv file
spotify_data = load_data('data.csv')

We also use another function, 'extract_column', to extract the desired non-musical factor column from this DataFrame in our scripts:

In [None]:
# Extract the desired non-musical factor data column
# Sample: extracting the track release date column
track_release_date = extract_column(spotify_data, 'track_release_date')
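
The bodies of 'load_data' and 'extract_column' are not shown in this essay. A minimal sketch of what they might look like is given below, assuming that each is a thin wrapper around Pandas; the project's own versions may differ. The in-memory csv and its values are made up so that the example runs without data.csv.

```python
import io

import pandas as pd


def load_data(path_or_buffer):
    """Read a csv file (or file-like object) into a Pandas DataFrame."""
    return pd.read_csv(path_or_buffer)


def extract_column(dataframe, column_name):
    """Return one column of the DataFrame as a Pandas Series."""
    return dataframe[column_name]


# Tiny in-memory csv with made-up values, standing in for data.csv.
sample_csv = io.StringIO(
    "track_name,track_release_date,track_popularity\n"
    "Example Song,2020-03-20,71\n"
    "Another Song,1999,45\n"
)
sample_data = load_data(sample_csv)
popularity = extract_column(sample_data, "track_popularity")
```

Keeping these steps in small named functions means every analysis script loads and slices the data in the same way.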

Overall, we use the 'Pandas' module to gather the desired non-musical factors and our popularity metric into a single DataFrame/csv file, which we then use to plot visualizations and understand the data in the rest of this essay.

## Results

## Conclusion