# Project 2 Proposal - Predicting the Next Spotify Hit

##### 67-364 Practical Data Science
##### Austin Leung & Ivana Lin

## Overview of Goals
With our project, we would like to discover what makes a song popular. Over twenty thousand songs are uploaded every day to music streaming services like Spotify, and each one has its own audio features that make it unique. Based on this information, we hope to identify the formula of features that makes a Spotify hit. We want to predict the popularity of any given song from these audio features.

This is a supervised learning problem, since we will train our model on songs whose popularity values are already known. Because popularity is a numeric score, the task is regression: the model predicts a popularity value for each song. Within the audio feature domain, we will also assess each feature's effect on popularity.
Within this domain, we hope to achieve P(T, E+ΔE) < P(T, E), that is, lower error as the model gains more training experience, where
- T = Predicting popularity of songs based on audio features
- P = Error of predicting popularity of songs (e.g. mean squared error, mean absolute error)
- E = Training on Spotify's song audio feature data with known popularity
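
To make the performance measure P concrete, here is a minimal sketch of the two error metrics named above, written in plain Python. The popularity scores and the two sets of predictions are made-up values for illustration only; they are not taken from the dataset.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average squared miss, so large misses cost more."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average miss, in popularity points."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical popularity scores (0 to 100) and two sets of predictions.
actual = [10, 50, 90]
with_less_experience = [40, 50, 60]   # stands in for training on E
with_more_experience = [20, 50, 80]   # stands in for training on E + ΔE

print("MSE:", mse(actual, with_less_experience), "->", mse(actual, with_more_experience))
print("MAE:", mae(actual, with_less_experience), "->", mae(actual, with_more_experience))
```

Both metrics are errors, so an improvement shows up as a smaller number.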

Along the way, we hope to discover the features that are most important for popularity according to our trained model. We're interested in whether our results vary by factors such as genre.

## Motivation
Our project was inspired by a New York Times [article](https://www.nytimes.com/2021/03/22/technology/streaming-music-economics.html) about the economics of music streaming. It detailed how streaming changed the music industry by providing regular monthly revenue, something the industry had never experienced before the likes of Spotify came along. The article goes on to discuss the inequity in how streaming revenue is shared between the platform, record labels, and artists themselves. However, it was clear from the article overall that a lot of money is at stake for both record labels and artists if they can make a hit song that gets a lot of streams. This inspired us to investigate which audio features can lead to a popular song, since such information would massively benefit record labels and artists in terms of generating income from the art they make.

We also read an [article](https://www.buzzfeednews.com/article/blakemontgomery/spotify-billboard-charts) from BuzzFeed News that describes how chart positions, such as on the Billboard charts, are affected by streaming popularity. The more streams a song gets on platforms like Spotify and Apple Music, the more likely the song is to chart higher. Such recognition in turn brings in revenue for record labels, along with branding and exposure for artists trying to make a name for themselves.

Given that Spotify has 345 million users, including 155 million paid subscribers, predicting song popularity on Spotify is massively important for record labels and artists. If it's possible to discern the musical tastes of Spotify users, producers can take advantage of the platform's reach by writing songs they expect to be popular. This gives labels and artists a way to promote their music and increase how much they earn from their songs. The study is especially relevant in light of the pandemic, since COVID-19 has exposed how volatile the economics of the music industry are. Without physical concerts and performances, revenue in the music industry has been gutted, and streaming is what has kept the industry afloat. As such, we plan to explore predicting song popularity from audio features not only out of an interest in increasing revenue for record labels and artists, but also so that they can maintain a steady flow of income in an extremely volatile industry.

## Related Work
Because the dataset is from Kaggle, other projects have already attempted to predict popularity from the other variables in the data. We, however, want to explore which technical aspects and audio features of a song predict popularity, so that they can inform writing a popular and possibly high-earning, high-charting song.

There are also other studies from places like the University of San Francisco on the topic of predicting hit songs from audio features. Two students predicted if a song would be a hit or not a hit based on variables like tempo, key, tune, valence, energy, and acousticity of the song. Our study differs in that we are attempting to see popularity as a continuous target variable. We see whether or not a song is a hit as a scale of popularity rather than a binary outcome. There may be some mild hits where the costs of producing are low enough to still make a profit with mild popularity. As such, we want to examine the degree of popularity and not so strictly if it was a hit or not.

## Dataset
The [dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv) we are using is from Kaggle. It describes songs on Spotify, collected from the Spotify Web API, with release dates ranging from 1921 to 2021. All the data we will use is in a single file containing one table with 174,390 rows. The dataset has 19 columns with variables like popularity (measured on a scale from 0 to 100), the name of the song, and the artist. The dataset also includes audio features of the songs like tempo, liveness, danceability, valence, acousticness, duration, energy, and more. Most of the audio features extracted from Spotify’s API are measured as a value between 0 and 1; the exceptions include loudness, which is measured in decibels, tempo, which is measured in beats per minute, and duration, which is measured in milliseconds.

Given that each variable forms a column and each observation forms a row, we can see that the data is tidy. However, it is not clean. Most of the columns have the correct data type, but artists is stored as a list and is not atomic. We will likely drop this column anyway: the value of our study comes from predicting how to write and produce a popular song, and artist would be a confounding variable, since a song's popularity may come from the artist and not necessarily because people like the sound of the song. The data also has categorical variables, like whether or not a song is explicit, so we will need to one-hot encode them in order to run ML algorithms on the data.
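
To illustrate what "not atomic" means here, the sketch below assumes each artists cell is a string that looks like a Python list literal. That storage format and the sample value are assumptions for illustration; the helper name `parse_artists` is hypothetical.

```python
import ast
from typing import List


def parse_artists(cell: str) -> List[str]:
    """Turn a string such as "['A', 'B']" into a real Python list of names."""
    parsed = ast.literal_eval(cell)  # safely evaluates literals only
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return [str(artist) for artist in parsed]


# Hypothetical cell value: a single row holds several artists at once.
raw_cell = "['Artist A', 'Artist B']"
artists = parse_artists(raw_cell)
print(artists)
print(len(artists), "artists stored in one cell")
```

A cell that holds more than one value like this is what breaks atomicity, which is one more reason dropping the column is the simpler option.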


## Methods
To start off, we will need to do a bit of data cleaning to remove bad data and retain only the columns necessary for our model. Columns like artist will be removed since they're not an audio feature of a song and some artists will be naturally more popular.
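
A minimal sketch of this cleaning step with pandas follows. It assumes that "bad data" means exact duplicate rows and rows with missing values, and the column names in `NON_AUDIO_COLUMNS` are guesses at the identifier-style columns; both assumptions would need checking against the real file.

```python
import pandas as pd

# Assumed names of columns that are not audio features (hypothetical list).
NON_AUDIO_COLUMNS = ["artists", "name", "id", "release_date"]


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, non-audio columns, and rows with missing values."""
    deduped = df.drop_duplicates()  # compare full rows, names included
    audio_only = deduped.drop(columns=NON_AUDIO_COLUMNS, errors="ignore")
    return audio_only.dropna().reset_index(drop=True)


# Toy rows standing in for the real dataset.
tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song B", "Song C"],
    "artists": ["['X']", "['Y']", "['Y']", "['Z']"],
    "danceability": [0.8, 0.5, 0.5, None],
    "energy": [0.7, 0.4, 0.4, 0.9],
    "popularity": [65, 30, 30, 50],
})

cleaned = clean_tracks(tracks)
print(cleaned)
```

Duplicates are removed before the name and artist columns are dropped, so two different songs that happen to share identical audio features are not collapsed into one row.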

Then we will perform initial EDA with basic statistics and visualizations to look for trends. In particular, we want to check whether certain values in a column are over- or under-represented, since a small sample size for some feature values could bias our model.
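
One way to run the representation check is to look at each value's share of the rows. The sketch below uses toy data, and the 5% cutoff and the helper name `rare_values` are arbitrary choices for illustration.

```python
import pandas as pd


def rare_values(column: pd.Series, min_share: float = 0.05) -> pd.Series:
    """Return the values whose share of the rows falls below min_share."""
    shares = column.value_counts(normalize=True)
    return shares[shares < min_share]


# Toy data: 97 non-explicit tracks and 3 explicit ones.
tracks = pd.DataFrame({
    "explicit": [0] * 97 + [1] * 3,
    "danceability": [0.5] * 50 + [0.7] * 50,
})

print(tracks.describe())                 # basic summary statistics
rare = rare_values(tracks["explicit"])   # values with too few examples
print(rare)
```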

Preparing for the machine learning process, we will first encode our categorical data, such as genre and whether or not the song is explicit. Then we will split our data into training, validation, and test sets. We'll also apply feature scaling so that our features are approximately on the same scale, fitting the scaler on the training set only so that no information from the validation or test sets leaks into training.
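
The preparation step could look roughly like the sketch below, which uses pandas and scikit-learn on randomly generated stand-in data. The column names, the 60/20/20 split, and the choice of `StandardScaler` are all assumptions. The scaler is fit on the training portion only, which is one common way to keep validation and test statistics out of training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data; the real columns and values will differ.
rng = np.random.default_rng(0)
n = 100
tracks = pd.DataFrame({
    "tempo": rng.uniform(60, 200, n),
    "loudness": rng.uniform(-30, 0, n),
    "genre": rng.choice(["pop", "rock", "jazz"], n),
    "popularity": rng.integers(0, 101, n),
})

# One-hot encode the categorical column.
X = pd.get_dummies(tracks.drop(columns="popularity"), columns=["genre"])
y = tracks["popularity"]

# 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Learn the scaling parameters from the training set only.
numeric = ["tempo", "loudness"]
scaler = StandardScaler().fit(X_train[numeric])


def scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the training-set scaling to the numeric columns of any split."""
    out = frame.copy()
    out[numeric] = scaler.transform(frame[numeric])
    return out


X_train_scaled, X_val_scaled, X_test_scaled = map(scale, (X_train, X_val, X_test))
print(len(X_train_scaled), len(X_val_scaled), len(X_test_scaled))
```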

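Moving on to training, we'll try various regression methods, including linear regression, decision tree regression, and random forest regression. Logistic regression is a classifier rather than a regressor, so we would use it only as a point of comparison if we bin popularity into classes such as hit and non-hit. We'll compare the models using k-fold cross-validation and the validation dataset, with error metrics such as mean squared error.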
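
As a rough sketch of the comparison, the snippet below scores three scikit-learn regressors with 5-fold cross-validation on synthetic data. The data, the number of folds, and the hyperparameters are placeholders chosen for illustration, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: three features in [0, 1] and a noisy numeric target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * (0.6 * X[:, 0] + 0.3 * X[:, 1]) + rng.normal(0, 5, 200)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

for name, error in sorted(cv_mse.items(), key=lambda item: item[1]):
    print(f"{name}: mean CV MSE = {error:.2f}")
```

The model with the lowest cross-validated error would be the candidate to carry forward.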

Lastly, we'll decide on our strongest model and evaluate it on the test dataset. We can then identify which features were most important according to our model.
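
The final step might look like the sketch below, assuming the chosen model turns out to be a random forest. The data is synthetic and built so that danceability drives the target, purely to show how the importances are read. Impurity-based importance is only one option; permutation importance is a common alternative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: danceability is constructed to matter most.
rng = np.random.default_rng(0)
features = ["danceability", "energy", "speechiness"]
X = pd.DataFrame(rng.uniform(0, 1, size=(300, 3)), columns=features)
y = 80 * X["danceability"] + 10 * X["energy"] + rng.normal(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score once on the held-out test set.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {test_mse:.2f}")

# Impurity-based importances, largest first.
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```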

## Possible Findings and Implications
Our expectation for our model is that it should do quite well given that we perceive the most popular songs as being similar. For example, genre might be an important feature since genres like "Pop" are often played on the radio and are indeed popular. On the other hand, it might be an interesting finding if genre isn't an important feature. In that case, this would help lots of aspiring artists who believe that they need to stick to specific genres to make it big. They could follow their musical passion instead!

Our model could also be very useful for music producers who might feel pressured to churn out a big song for their artist. The model could predict whether the song would be popular before anyone has even heard it. This would be invaluable in fine-tuning the set of attributes, since songwriting is so often constrained by whether the public will appreciate it. For example, we think pumping up the danceability in a track might be key to getting it played in dance clubs. But we're also hoping our model might pick up on unexpected features, such as high speechiness potentially being crucial nowadays for blowing up on TikTok.

Understanding what makes a song popular, and in doing so tapping into our desires as humans, can rejuvenate the music industry. With the rise of streaming services making all types of music accessible in recent years, competition is tougher than ever, but we hope our model can help!
