# Gathering Data

## Problem Statement

#### When is the best time to release a new song/album on Spotify?

For a major-label artist, a successful release takes more than finding a great single or recording a critically acclaimed album. A strategic rollout is essential to standing out and cutting through the noise in this highly competitive industry. The goal of this project is to pinpoint an effective time to release new music.

To do this, I gathered data from Spotify, the largest music streaming service, along with chart data from the industry publication Billboard, to analyze the songs that perform best on the best-selling and most-listened-to charts.

This notebook walks through gathering that information: requesting Billboard chart data through the billboard.py API wrapper, and combining CSV files of Spotify's most-streamed songs downloaded directly from the web. Sources can be found below.

#### Note
This repository contains Spotify top charts from 2016-12-23 to 2019-05-10.
To re-create the dataset, download the current charts from https://spotifycharts.com/regional/us/weekly/latest
and save them to the data/spotify_streaming_docs folder in your fork of this repository.
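Before running the cells below, it may help to confirm that the weekly CSV files are where the notebook expects them. This is a minimal sketch, assuming the files sit in `data/spotify_streaming_docs/` (the path the later cells read from); the helper name `list_chart_files` is hypothetical.

```python
import glob
import os


def list_chart_files(folder):
    """Return the sorted paths of all .csv files directly inside `folder`."""
    return sorted(glob.glob(os.path.join(folder, "*.csv")))


# An empty list means the folder is missing or holds no CSV files yet.
chart_files = list_chart_files("data/spotify_streaming_docs")
print(f"Found {len(chart_files)} weekly chart files")
```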

In [1]:
# Importing the packages needed to gather the data.
# The Billboard API wrapper and documentation can be found at https://github.com/guoguo12/billboard-charts

import billboard
import pandas as pd
import glob

In [4]:
# Build a DataFrame with one row per chart week and one column per
# Billboard chart position (1-100). Each cell holds a song title.
hot_100 = pd.DataFrame(columns=list(range(1, 101)))

def chart_generator(chart_type, df_name, i):
    """Append the top `i` song titles of each weekly chart to `df_name`, newest week first."""
    chart = billboard.ChartData(chart_type)  # Start at the most recent chart
    while len(df_name) < 123:  # Stop once 123 weekly charts are collected
        seen_weeks = []  # Titles for the current week, in position order
        for n in range(0, i):
            song = chart[n].title
            #artist = chart[n].artist
            seen_weeks.append(song)
        df_name.loc[len(df_name)] = seen_weeks
        # Step back one week
        chart = billboard.ChartData(chart_type, date=chart.previousDate)
    print('Done')

In [5]:
chart_generator('hot-100', hot_100, 100)

Done
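The loop in `chart_generator` walks backwards one week at a time through `previousDate`. The sketch below shows the same idea with the fetch step passed in as a function, so the logic can be tried without network access. The names `collect_weeks` and `fake_fetch` and the stub chart objects are hypothetical and are not part of billboard.py.

```python
from types import SimpleNamespace

import pandas as pd


def collect_weeks(fetch, n_weeks, n_positions):
    """Collect up to `n_weeks` of chart titles, newest first, one row per week.

    `fetch(date)` must return an object with `.entries` (items that have a
    `.title`) and `.previousDate`; `date=None` means the latest chart.
    """
    rows = []
    date = None
    while len(rows) < n_weeks:
        chart = fetch(date)
        titles = [entry.title for entry in chart.entries[:n_positions]]
        # Pad short charts so every row has the same number of columns.
        titles += [None] * (n_positions - len(titles))
        rows.append(titles)
        date = chart.previousDate
        if not date:  # no earlier chart available, stop early
            break
    return pd.DataFrame(rows, columns=list(range(1, n_positions + 1)))


def fake_fetch(date):
    """Stand-in for a real chart request: three fake weeks of two songs."""
    weeks = {None: ("w3", "w2"), "w2": ("w2", "w1"), "w1": ("w1", None)}
    label, previous = weeks[date]
    entries = [SimpleNamespace(title=f"{label}-song{i}") for i in (1, 2)]
    return SimpleNamespace(entries=entries, previousDate=previous)


demo = collect_weeks(fake_fetch, n_weeks=5, n_positions=2)
print(demo)
```

With billboard.py, `fetch` could presumably be something like `lambda d: billboard.ChartData('hot-100', date=d)`, though that is untested here. Unlike the cell above, this version stops when no previous date is available instead of requesting another chart.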


In [6]:
hot_100.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
0,Old Town Road,If I Can't Have You,ME!,Sucker,Homicide,Wow.,Sunflower (Spider-Man: Into The Spider-Verse),Without Me,Bad Guy,7 Rings,...,Middle Child,Shot Clock,Ocean Eyes,Wish You Were Gay,I've Been Waiting,All To Myself,Secreto,Paradise,Die Young,Faucet Failure
1,Old Town Road,ME!,Wow.,Sucker,Sunflower (Spider-Man: Into The Spider-Verse),7 Rings,Without Me,Dancing With A Stranger,Bad Guy,Talk,...,Love Someone,Power Is Power,Faucet Failure,Knockin' Boots,Numb Numb Juice,Love Me Anyway,24/7,Stop Snitching,SOS,Kill This Love
2,Old Town Road,Wow.,Sunflower (Spider-Man: Into The Spider-Verse),7 Rings,Sucker,Without Me,Dancing With A Stranger,Talk,Bad Guy,Middle Child,...,Saturday Nights,Pure Cocaine,Love Someone,This Is It,My Strange Addiction,Shotta Flow,Let Me Down Slowly,Xanny,One That Got Away,ME!
3,Old Town Road,Wow.,Sunflower (Spider-Man: Into The Spider-Verse),7 Rings,Without Me,Sucker,Dancing With A Stranger,Boy With Luv,Bad Guy,Please Me,...,Ocean Eyes,Last Time That I Checc'd,Talk You Out Of It,I'm So Tired...,Make It Right,"Hey Look Ma, I Made It",This Is It,There Was This Girl,Let Me Down Slowly,On My Way To You
4,Old Town Road,Sunflower (Spider-Man: Into The Spider-Verse),Wow.,7 Rings,Without Me,Sucker,Please Me,Better,Middle Child,Happier,...,Faucet Failure,All The Good Girls Go To Hell,Right Back,Talk You Out Of It,Inmortal,Ocean Eyes,Dedication,Let Me Down Slowly,Big Ole Freak,Victory Lap


In [7]:
# Saving to csv for later cleaning

hot_100.to_csv('data/hot_100.csv', index=False)

In [8]:
# Reading in the Spotify CSVs and combining them to prepare to sum stream totals.
# Code adapted from
# https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe

# See note above for current chart data added after 2019-05-02

path = r'data/spotify_streaming_docs/'
all_files = glob.glob(path + "*.csv")  # path already ends with a slash

li = []

for filename in all_files:
    # header=1: the column names are on the second line of each file
    df = pd.read_csv(filename, index_col=None, header=1)
    df.drop(['Position', 'URL'], axis=1, inplace=True)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
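The `header=1` argument tells pandas to take column names from the second line of each file and ignore the line before it. Here is a small sketch with an in-memory file, assuming the downloaded CSVs start with a one-line note above the real header. The track rows reuse values shown in this notebook's output; the note text and URLs are placeholders.

```python
import io

import pandas as pd

# Assumed layout: one note line, then the real header, then the data rows.
sample_csv = io.StringIO(
    ",,,placeholder note line,\n"
    "Position,Track Name,Artist,Streams,URL\n"
    "1,Despacito - Remix,Luis Fonsi,11395183,https://example.com/1\n"
    "4,XO TOUR Llif3,Lil Uzi Vert,9842380,https://example.com/4\n"
    "5,HUMBLE.,Kendrick Lamar,9330372,https://example.com/5\n"
)

# header=1 uses the second line (index 1) as column names and skips line 0.
week = pd.read_csv(sample_csv, header=1)
week = week.drop(columns=["Position", "URL"])
print(week)
```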

In [9]:
# Reading in the Spotify CSVs and combining them to prepare peak positions.

# See note above for current chart data added after 2019-05-02

path = r'data/spotify_streaming_docs/'
all_files = glob.glob(path + "*.csv")  # path already ends with a slash

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=1)
    df.drop(['Artist', 'URL', 'Streams'], axis=1, inplace=True)
    li.append(df)

positions = pd.concat(li, axis=0, ignore_index=True)
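The two cells above read the same files twice and differ only in which columns they drop. One alternative is to read each file once and select columns afterwards. This is a sketch of that approach using a hypothetical helper, `load_weekly_charts`; it is not how the notebook itself is written.

```python
import glob
import os

import pandas as pd


def load_weekly_charts(folder):
    """Read every CSV in `folder` once and stack the rows into one frame.

    Raises ValueError (from pd.concat) if the folder holds no CSV files.
    """
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    weekly = [pd.read_csv(f, header=1) for f in files]
    return pd.concat(weekly, axis=0, ignore_index=True)


# Usage against the notebook's folder (requires the downloaded files):
# charts = load_weekly_charts("data/spotify_streaming_docs")
# frame = charts[["Track Name", "Artist", "Streams"]]
# positions = charts[["Position", "Track Name"]]
```

Because the file list is sorted here, the row order may differ from the frames built above, where `glob` returns files in arbitrary order.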

In [10]:
positions.head()

Unnamed: 0,Position,Track Name
0,1,Despacito - Remix
1,2,"I'm the One (feat. Justin Bieber, Quavo, Chanc..."
2,3,Wild Thoughts (feat. Rihanna & Bryson Tiller)
3,4,XO TOUR Llif3
4,5,HUMBLE.


In [11]:
# Saving to csv for cleaning in next notebook

positions.to_csv('data/spotify_chart_positions.csv', index=False)

In [12]:
frame.head()

Unnamed: 0,Track Name,Artist,Streams
0,Despacito - Remix,Luis Fonsi,11395183
1,"I'm the One (feat. Justin Bieber, Quavo, Chanc...",DJ Khaled,10195398
2,Wild Thoughts (feat. Rihanna & Bryson Tiller),DJ Khaled,10135936
3,XO TOUR Llif3,Lil Uzi Vert,9842380
4,HUMBLE.,Kendrick Lamar,9330372


In [13]:
# Saving to csv for cleaning in next notebook

frame.to_csv('data/spotify_stream_totals.csv', index=False)

### Sources:

https://www.billboard.com/charts/hot-100 <br>
https://github.com/guoguo12/billboard-charts <br>
https://spotifycharts.com/regional/us/weekly/latest