# Billboard Hot 100 Characteristics
##### Amit Kumar, Omar Salih

In [3]:
%load_ext autoreload
%autoreload 2
import pandas as pd

### Introduction
Since its inception in 1958, the Billboard Hot 100 Chart has been the de facto way to measure a song's popularity in the United States. It combines sales data, radio play time, and online streaming statistics to rank the most popular songs each week. But what is it that makes a song more popular than others? Using historical Billboard charts and song data from Spotify, we sought to answer the question of whether or not there was some defining characteristic that more popular songs shared, and how songs that are popular have changed over time.

### Methodology
To answer this question, we needed to gather data from two sources, described below.

#### Billboard Data Acquisition

The first is Billboard itself. We needed the titles, artists, and peak ranks of every song that has ever appeared on the Billboard Hot 100 so that we knew which songs to analyze. We accomplished this by scraping each week's chart (the current chart can be found [here](https://www.billboard.com/charts/hot-100/)), using the Requests library to fetch the HTML and Beautiful Soup to parse it.
- Data collection from Billboard was done in two phases. The first phase, found in `scraper.py`, simply scraped each chart into a pandas DataFrame and saved it to an individual feather file within the data directory (feather files were chosen for their integration with pandas and quick read/write times).
- The second phase, found in `merging.py`, combined all of those individual dataframes into a single dataframe, then performed some preliminary data cleaning by removing duplicate songs (keeping the week of their peak rank in the dataset) and reformatting some artist names for better compatibility with the Spotify API.
    - The reason for this split is that scraping can take some time. If a connection loss, technical problem, or data inconsistency causes an error, the user can simply change the start date in the scraper to where they left off without losing any progress. In our case, one chart from the 1970s had only 99 songs, and the resulting error cost us all of our progress up to that point. We made the code robust to varying chart lengths, and also took the opportunity to refactor it so that each chart is saved as it is scraped, minimizing any potential data loss.
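
The resumable, save-as-you-go idea described above can be sketched as follows. This is a minimal illustration, not the code in `scraper.py`: the names `chart_weeks`, `scrape_range`, and `fetch_and_save`, as well as the one-file-per-week naming scheme, are assumptions. It is also a variation on the approach described, since it skips weeks whose file already exists instead of requiring the start date to be edited by hand.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Callable, Iterator, List


def chart_weeks(start: date, end: date) -> Iterator[date]:
    """Yield one date per weekly chart from start to end, inclusive."""
    week = start
    while week <= end:
        yield week
        week += timedelta(days=7)


def scrape_range(
    start: date,
    end: date,
    out_dir: Path,
    fetch_and_save: Callable[[date, Path], None],
) -> List[date]:
    """Scrape every week that has no saved file yet; return the weeks scraped.

    fetch_and_save is a hypothetical callable that downloads one chart,
    parses it, and writes it to the given path (for example with
    DataFrame.to_feather). Because each week is written as soon as it is
    scraped, a crash loses at most the week that was in progress.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    scraped = []
    for week in chart_weeks(start, end):
        target = out_dir / f"{week.isoformat()}.feather"  # assumed naming scheme
        if target.exists():
            continue  # already saved by an earlier run
        fetch_and_save(week, target)
        scraped.append(week)
    return scraped
```

Rerunning `scrape_range` with the same arguments after a failure would then pick up at the first week that has no file on disk.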

The result was just under 25,000 individual songs that have appeared on the Billboard Hot 100, each with its artist and peak position (and the week it reached that position). This data was saved to `charts_merged.feather`, a preview of which can be seen below.



In [10]:
merged = pd.read_feather('charts_merged.feather')
merged.head(5)

Unnamed: 0,rank,title,artist,week
0,1,Poor Little Fool,Ricky Nelson,1958-08-04
1,1,Who's That Girl,Madonna,1987-08-17
2,1,I Still Haven't Found What I'm Looking For,U2,1987-08-10
3,1,"Shakedown (From ""Beverly Hills Cop II"")",Bob Seger,1987-07-27
4,1,Alone,Heart,1987-07-20
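
The deduplication step described for `merging.py` could look roughly like the sketch below. It assumes each weekly dataframe has the `rank`, `title`, `artist`, and `week` columns shown in the preview; the function name `merge_charts` and the tie-breaking rule (earliest week at the peak rank) are our assumptions for illustration, not necessarily what the project code does.

```python
import pandas as pd


def merge_charts(weekly_charts):
    """Combine weekly chart dataframes and keep one row per song at its peak rank."""
    combined = pd.concat(weekly_charts, ignore_index=True)
    # Lower rank is better, so after sorting the first row seen for each song
    # is its peak; ties on rank are broken by the earliest week (an assumption).
    combined = combined.sort_values(["rank", "week"], kind="stable")
    peaks = combined.drop_duplicates(subset=["title", "artist"], keep="first")
    return peaks.reset_index(drop=True)
```

Sorting before `drop_duplicates` is what makes `keep="first"` select the peak entry instead of whichever week happened to be scraped first.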


#### Spotify Data Acquisition

With a list of songs in hand, we turned our attention to acquiring the song characteristics for our list of 25,000 songs. To this end, we used [Spotipy](https://spotipy.readthedocs.io), a Python library that provides us access to the [Spotify Web API](https://developer.spotify.com/documentation/web-api/). The code we used for this section can be found in `spotify_api.py`, which contains a class we wrote that creates a pre-authenticated Spotify object as well as wrappers for some of the functions we used, and `spotify_data.py`, which contains the code that requested and merged all of the song characteristics.

Data acquisition from Spotify also had to be conducted in two separate stages, although for a different reason than the Billboard data.

1. First, we needed to find the Spotify track ID for each song in our dataset, since the track ID is required to look up a song's characteristics. The API has a function that takes a song title and artist name and returns a track ID. These IDs were saved as a column in the songs dataframe.
    - Unfortunately, due to naming inconsistencies between Spotify and Billboard, about 6,000 songs (roughly 24%) came back without a matching track ID. With more time and resources, we could have looked for patterns in these inconsistencies and tried to resolve at least some of them programmatically. Nevertheless, this still left us with over 18,000 songs, which we felt was enough to continue and draw conclusions from.
    - This also necessitated an intermediate stage of data cleaning, in which we dropped rows with NaN values from the dataframe and reset the index to run sequentially in preparation for the next phase.
2. Second, we passed each track ID to a different function that returned a dictionary of that song's characteristics. Each characteristic was saved to its own dataframe column, and the result was saved to `charts_clean.feather`. Below is the same preview as above with the song characteristics included (songs dropped for lack of a track ID, such as "Shakedown", no longer appear).
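
The two stages above can be sketched as follows. This is an illustration and not the contents of `spotify_api.py`: the function names `find_track_id` and `fetch_features` are hypothetical, and `sp` is assumed to be an authenticated Spotipy client (for example `spotipy.Spotify(auth_manager=SpotifyClientCredentials())`). The sketch relies on Spotipy's `search` and `audio_features` methods; the batching by 100 reflects the per-request limit of the audio features endpoint.

```python
from typing import Any, Dict, List, Optional


def find_track_id(sp: Any, title: str, artist: str) -> Optional[str]:
    """Return the first matching track ID for a title/artist pair, or None."""
    # Field filters narrow the search to the given title and artist.
    query = f"track:{title} artist:{artist}"
    result = sp.search(q=query, type="track", limit=1)
    items = result["tracks"]["items"]
    return items[0]["id"] if items else None


def fetch_features(
    sp: Any, track_ids: List[str], batch_size: int = 100
) -> Dict[str, Dict[str, Any]]:
    """Look up audio features in batches, keyed by track ID."""
    features: Dict[str, Dict[str, Any]] = {}
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        # audio_features returns one entry per ID; unknown IDs come back as None.
        for track_id, feats in zip(batch, sp.audio_features(batch)):
            if feats is not None:
                features[track_id] = feats
    return features
```

Songs for which `find_track_id` returns `None` correspond to the unmatched rows that were dropped before the second stage.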

In [11]:
charts = pd.read_feather('charts_clean.feather')
charts.head(5)

Unnamed: 0,rank,title,artist,week,trackid,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,1,Poor Little Fool,Ricky Nelson,1958-08-04,5ayybTSXNwcarDtxQKqvWX,0.474,0.338,0.0,-11.528,1.0,0.0299,0.648,0.0,0.13,0.81,154.596,153933.0,4.0
1,1,Who's That Girl,Madonna,1987-08-17,3G0NNqwQ1sqRpySr6soHlH,0.625,0.646,9.0,-13.592,0.0,0.0389,0.214,0.0533,0.0525,0.826,103.893,239173.0,4.0
2,1,I Still Haven't Found What I'm Looking For,U2,1987-08-10,6wpGqhRvJGNNXwWlPmkMyO,0.564,0.774,1.0,-9.424,1.0,0.0368,0.0135,0.00191,0.0861,0.657,100.894,277477.0,4.0
3,1,Alone,Heart,1987-07-20,54b8qPFqYqIndfdxiLApea,0.418,0.452,1.0,-13.099,1.0,0.0356,0.638,0.00026,0.0959,0.168,175.088,218733.0,4.0
4,1,I Wanna Dance With Somebody (Who Loves Me),Whitney Houston,1987-06-29,2tUBqZG2AbRi7Q0BIrVrEj,0.709,0.824,1.0,-8.824,1.0,0.0453,0.207,0.000307,0.0888,0.867,118.818,291293.0,4.0


The characteristics we analyzed were danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration\_ms, and time\_signature. For their official definitions, or anything else related to the Spotify Web API, see its documentation [here](https://developer.spotify.com/documentation/web-api/reference/#/).

### Visualizations and Analysis

Using `matplotlib` and `plotly`, we created several plots to show how song characteristics have changed over time:

- `matplotlib` was used to create a line plot showing how several characteristics changed over the decades. We also created a multi-panel histogram plot to visualize the distributions of the characteristics.
- `plotly` was used to create radar graphs, which allowed us to compare the levels of several characteristics at once.
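
As a rough sketch of the aggregation behind a per-decade line plot, the code below groups songs by the decade of their peak week and averages a few characteristics. The function names, the choice of the mean as the summary statistic, and the subset of features are assumptions made for illustration; the column names match the `charts_clean.feather` preview above.

```python
import pandas as pd

# Illustrative subset; any of the numeric characteristic columns would work.
FEATURES = ["danceability", "energy", "valence"]


def decade_means(charts, features=FEATURES):
    """Average the chosen characteristics per decade of each song's peak week."""
    decade = pd.to_datetime(charts["week"]).dt.year // 10 * 10
    return charts.groupby(decade.rename("decade"))[list(features)].mean()


def plot_decade_means(means):
    """Draw one line per characteristic (pandas uses matplotlib underneath)."""
    ax = means.plot(marker="o")
    ax.set_xlabel("Decade")
    ax.set_ylabel("Mean value")
    ax.set_title("Song characteristics by decade")
    return ax
```

With the cleaned data loaded as `charts`, calling `plot_decade_means(decade_means(charts))` would produce a single figure with one line per characteristic.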