## Billboard Data Analysis

This notebook pairs Spotify's song metrics with Billboard's chart performance data to analyze the Billboard Year-End Top 100 Songs from 2017 and 2020.

In [1]:
# Importing modules
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Loading in 2017 data from CSV
billboard2017_df = pd.read_csv("../data/billboard2017.csv")
billboard2017_df.head()

,Unnamed: 0,Title,Artist,Rank,name,album,artist,release_date,duration_ms,popularity,...,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,positiveness,explicit
0,0,Shape Of You,Ed Sheeran,1,Shape of You,÷ (Deluxe),Ed Sheeran,2017-03-03,233712,86,...,0.825,0.652,0.0,0.0931,-3.183,0.0802,95.977,4,0.931,False
1,1,Despacito,Luis Fonsi & Daddy Yankee Featuring Justin Bieber,2,Despacito - Remix,Despacito Feat. Justin Bieber (Remix),Luis Fonsi,2017-04-17,228826,72,...,0.653,0.816,0.0,0.0967,-4.353,0.167,178.085,4,0.816,False
2,2,That's What I Like,Bruno Mars,3,That's What I Like,24K Magic,Bruno Mars,2016-11-17,206693,82,...,0.853,0.56,0.0,0.0944,-4.961,0.0406,134.066,4,0.86,False
3,3,Humble.,Kendrick Lamar,4,HUMBLE.,DAMN.,Kendrick Lamar,2017-04-14,177000,82,...,0.908,0.621,5.4e-05,0.0958,-6.638,0.102,150.011,4,0.421,True
4,4,Something Just Like This,The Chainsmokers & Coldplay,5,Something Just Like This,Memories...Do Not Open,The Chainsmokers,2017-04-07,247160,83,...,0.617,0.635,1.4e-05,0.164,-6.769,0.0317,103.019,4,0.446,False


In [3]:
# Loading in 2020 data from CSV
billboard2020_df = pd.read_csv("../data/billboard2020.csv")
billboard2020_df.head()

,Unnamed: 0,Title,Artist,Rank,name,album,artist,release_date,duration_ms,popularity,...,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,positiveness,explicit
0,0,Blinding Lights,The Weeknd,1,Blinding Lights,Blinding Lights,The Weeknd,2019-11-29,201573,19,...,0.513,0.796,0.000209,0.0938,-4.075,0.0629,171.017,4,0.345,False
1,1,Circles,Post Malone,2,Circles,Hollywood's Bleeding,Post Malone,2019-09-06,215280,87,...,0.695,0.762,0.00244,0.0863,-3.497,0.0395,120.042,4,0.553,False
2,2,The Box,Roddy Ricch,3,The Box,Please Excuse Me for Being Antisocial,Roddy Ricch,2019-12-06,196652,84,...,0.896,0.586,0.0,0.79,-6.687,0.0559,116.971,4,0.642,True
3,3,Don't Start Now,Dua Lipa,4,Don't Start Now,Don't Start Now,Dua Lipa,2019-10-31,183290,83,...,0.794,0.793,0.0,0.0952,-4.521,0.0842,123.941,4,0.677,False
4,4,Rockstar,DaBaby Featuring Roddy Ricch,5,ROCKSTAR (feat. Roddy Ricch),BLAME IT ON BABY,DaBaby,2020-04-17,181733,84,...,0.746,0.69,0.0,0.101,-7.956,0.164,89.977,4,0.497,True


In [4]:
# Examining DataFrame dimensions
print(billboard2017_df.shape, billboard2020_df.shape)

(100, 22) (100, 22)
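
Matching shapes do not by themselves guarantee matching columns. A minimal sketch of a column-level comparison follows; `compare_columns` is a hypothetical helper, and the two tiny frames are stand-ins (not the real CSVs) so the snippet runs on its own. With the real data, `billboard2017_df` and `billboard2020_df` would be passed instead.

```python
import pandas as pd


def compare_columns(left, right):
    """Return (only_in_left, only_in_right, same_order) for two DataFrames."""
    only_left = [c for c in left.columns if c not in right.columns]
    only_right = [c for c in right.columns if c not in left.columns]
    same_order = list(left.columns) == list(right.columns)
    return only_left, only_right, same_order


# Stand-in frames for illustration only (a few values from the head() outputs)
demo_2017 = pd.DataFrame({
    "Rank": [1, 2],
    "name": ["Shape of You", "Despacito - Remix"],
    "tempo": [95.977, 178.085],
})
demo_2020 = pd.DataFrame({
    "Rank": [1, 2],
    "name": ["Blinding Lights", "Circles"],
    "tempo": [171.017, 120.042],
})

only_2017, only_2020, same_order = compare_columns(demo_2017, demo_2020)
print(only_2017, only_2020, same_order)
```

Two empty lists and `True` would mean both frames carry the same columns in the same order, so one cleaning routine can serve both years.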


Each dataset holds the 100 songs from the Billboard Year-End Top 100 for its year (2017 or 2020), with 22 attributes tracked per song. Before starting any preliminary analysis, it is worth cleaning up the data:

In [5]:
billboard2017_df.columns

Index(['Unnamed: 0', 'Title', 'Artist', 'Rank', 'name', 'album', 'artist',
       'release_date', 'duration_ms', 'popularity', 'danceability',
       'acousticness', 'danceability.1', 'energy', 'instrumentalness',
       'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature',
       'positiveness', 'explicit'],
      dtype='object')
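
Before deciding which columns to drop, it can help to check programmatically which ones overlap. The sketch below is illustrative only: the 'Title' and 'name' values are copied from the 2017 `head()` output, while the two danceability columns are filled with identical placeholder values purely to exercise the check. On the real data, `billboard2017_df` would take the place of `sample`.

```python
import pandas as pd

# Small illustrative sample; danceability values are placeholders (assumption)
sample = pd.DataFrame({
    "Title": ["Shape Of You", "Despacito", "That's What I Like"],
    "name": ["Shape of You", "Despacito - Remix", "That's What I Like"],
    "danceability": [0.825, 0.653, 0.853],
    "danceability.1": [0.825, 0.653, 0.853],
})

# Is 'danceability.1' an exact copy of 'danceability'?
dance_is_duplicate = bool((sample["danceability"] == sample["danceability.1"]).all())

# Case-insensitive comparison of the two title columns
title_matches = sample["Title"].str.lower() == sample["name"].str.lower()
mismatched_titles = sample.loc[~title_matches, ["Title", "name"]]

print(dance_is_duplicate)
print(mismatched_titles)
```

Rows that land in `mismatched_titles` (such as a remix suffix on one side) show where the two title columns disagree, which matters when choosing which one to keep.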

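'Title' and 'name' are largely redundant, as are 'Artist' and 'artist': the capitalized columns appear to come from Billboard and the lowercase ones from Spotify. 'Unnamed: 0' looks like a row index left over from an earlier CSV export, and 'danceability.1' appears to duplicate 'danceability', so both are extraneous. I also like what Srinidhi did to convert 'duration_ms' to seconds. I'm going to make new datasets that drop 'Title', 'Artist', and the two extraneous columns, and add a 'duration_sec' column: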

In [7]:
## 2017 Billboard Data Cleaning ##
# Dropping extraneous columns
billboard2017_clean = billboard2017_df.drop(['Title', 'Artist', 'Unnamed: 0', 'danceability.1'], axis = 1)

# Converting and creating duration column
billboard2017_clean['duration_sec'] = billboard2017_df['duration_ms']/1000

# Export cleaned data to CSV
billboard2017_clean.to_csv('../data/billboard2017cleaned.csv', sep = ',')

billboard2017_clean.sample()

,Rank,name,album,artist,release_date,duration_ms,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,positiveness,explicit,duration_sec
3,4,HUMBLE.,DAMN.,Kendrick Lamar,2017-04-14,177000,82,0.908,0.000282,0.621,5.4e-05,0.0958,-6.638,0.102,150.011,4,0.421,True,177.0


In [8]:
## 2020 Billboard Data Cleaning ##
# Dropping extraneous columns
billboard2020_clean = billboard2020_df.drop(['Title', 'Artist', 'Unnamed: 0', 'danceability.1'], axis = 1)

# Converting and creating duration column
billboard2020_clean['duration_sec'] = billboard2020_df['duration_ms']/1000

# Export cleaned data to CSV
billboard2020_clean.to_csv('../data/billboard2020cleaned.csv', sep = ',')

billboard2020_clean.sample()

,Rank,name,album,artist,release_date,duration_ms,popularity,danceability,acousticness,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature,positiveness,explicit,duration_sec
66,67,All I Want for Christmas Is You,Merry Christmas,Mariah Carey,1994-11-01,241106,88,0.336,0.164,0.627,0.0,0.0708,-7.463,0.0384,150.273,4,0.35,False,241.106
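
The 2017 and 2020 cleaning cells are identical apart from the DataFrame they act on, so the steps could be folded into one function. This is a sketch, not the notebook's actual code: `clean_billboard` is a hypothetical helper, and the one-row `demo` frame is a stand-in built from the HUMBLE. row shown earlier, with only a subset of the columns.

```python
import pandas as pd


def clean_billboard(df, drop_cols=("Title", "Artist", "Unnamed: 0", "danceability.1")):
    """Drop the redundant columns and add the track duration in seconds."""
    # drop() returns a new DataFrame, so the input frame is left unchanged
    cleaned = df.drop(columns=list(drop_cols))
    cleaned["duration_sec"] = cleaned["duration_ms"] / 1000
    return cleaned


# One-row stand-in with a subset of the real columns (illustration only)
demo = pd.DataFrame({
    "Unnamed: 0": [3],
    "Title": ["Humble."],
    "Artist": ["Kendrick Lamar"],
    "Rank": [4],
    "name": ["HUMBLE."],
    "artist": ["Kendrick Lamar"],
    "duration_ms": [177000],
    "danceability": [0.908],
    "danceability.1": [0.908],
})

demo_clean = clean_billboard(demo)
print(demo_clean.columns.tolist())
```

With the real data this would be called as `clean_billboard(billboard2017_df)` and `clean_billboard(billboard2020_df)`. When exporting the result, passing `index=False` to `to_csv` keeps the row index from being written out and reappearing as another unnamed column on the next load.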


#### Question 1: 