ANALYSIS </br>

Billboard Exploration: </br>
1. How many songs reached the number 1 position during the sample period </br>
   Will looking at only #1 songs be useful for analysis, or should we look at songs that entered the top 5/10? </br>
2. How many weeks did each of those songs appear on the charts? </br>

Spotify Exploration: </br>
1. Of 1,000,000 playlists, how many were updated three or fewer times? </br>

Preparation: </br>
1. Artist Name, Track Title of all #1 songs </br>
2. Match Artist Names and Track Titles to Spotify IDs </br>
3. Iterate over Spotify data to identify playlists that have one or more #1 tracks in the playlist </br>
4. Isolate relevant song lines and write to a new Spotify dataframe/file </br>
5. Of x playlists with 3 or fewer edits, how many had a Billboard Charting song on it? </br>

Analysis: </br>
1. Which #1 songs were most present in playlists? Are there any #1 songs that were not on any playlists? </br>
2. Do the top 10 performing billboard songs appear most on the user playlists? </br>
3. For the 10 number one songs that have the most playlist adds, what was the playlist activity in relation to chart activity?</br>

In [53]:
import json
import pandas as pd
import os
import numpy as np
from datetime import date, timedelta

### Billboard Exploration

In [5]:
# How many songs reached the number 1 position during the sample period?
# Will looking at only #1 songs be useful for analysis, or should we look at songs that entered the top 5/10? all?
billboard = pd.read_csv('../data/Billboard/billboard_chart_data.csv')

In [7]:
billboard.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41700 entries, 0 to 41699
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   week_of            41700 non-null  object
 1   rank_current_week  41700 non-null  int64 
 2   title              41700 non-null  object
 3   artist             41700 non-null  object
 4   rank_prior_week    41700 non-null  object
 5   peak_pos           41700 non-null  int64 
 6   weeks_on_chart     41700 non-null  int64 
 7   base_artist        41700 non-null  object
 8   expand             12075 non-null  object
 9   key                41700 non-null  object
dtypes: int64(3), object(7)
memory usage: 3.2+ MB


In [8]:
billboard['artist_title'] = billboard.artist+ '-' + billboard.title

In [9]:
# total number of songs that entered the billboard top 100 during the sample period
billboard.artist_title.nunique()

3460

In [10]:
# all number 1 position during the sample period
no1 = billboard[billboard.rank_current_week == 1]
no1.to_csv('../data/Billboard/number_1_by_week.csv', index=False)

In [11]:
no1_unique = no1[['week_of','artist','title']].drop_duplicates(subset=['artist','title'], keep='first').reset_index()

no1_unique[['base_artist','expand']] = no1_unique['artist'].str.split(' Featuring ', expand=True)
#no1_unique[['base_artist','expand']] = no1_unique['artist'].str.split(' &', expand=True)
no1_unique['base_artist'] = no1_unique['base_artist'].str.lower()
no1_unique['base_artist'] = no1_unique['base_artist'].replace({'ke$ha':'kesha', 'far*east movement':'far east movement','luis fonsi & daddy yankee':'luis fonsi'})
no1_unique['title'] = no1_unique['title'].str.lower()
no1_unique['title'] = no1_unique['title'].replace('uptown funk!','uptown funk')
no1_unique['key'] = no1_unique['base_artist'] + '_' + no1_unique['title'].str.lower().str.split(r" \(").str[0]
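
The key-building steps in the cell above can also be expressed as a single helper, which makes it easier to apply the same rules to both datasets. This is only a minimal sketch in plain Python: `make_key`, `ARTIST_ALIASES`, and `TITLE_ALIASES` are hypothetical names, and the alias entries simply mirror the replacements already used in this notebook.

```python
import re

# Hypothetical alias tables mirroring the replacements used in this notebook.
ARTIST_ALIASES = {
    "ke$ha": "kesha",
    "far*east movement": "far east movement",
    "luis fonsi & daddy yankee": "luis fonsi",
}
TITLE_ALIASES = {"uptown funk!": "uptown funk"}


def make_key(artist: str, title: str) -> str:
    """Build a lowercase 'artist_title' key from raw Billboard strings."""
    # Keep only the lead artist: drop everything after ' Featuring '.
    base_artist = artist.split(" Featuring ")[0].lower()
    base_artist = ARTIST_ALIASES.get(base_artist, base_artist)
    # Lowercase the title, apply aliases, then drop a parenthetical suffix.
    clean_title = title.lower()
    clean_title = TITLE_ALIASES.get(clean_title, clean_title)
    clean_title = re.split(r" \(", clean_title)[0]
    return f"{base_artist}_{clean_title}"


print(make_key("Ke$ha", "TiK ToK"))
print(make_key("Cardi B", "Bodak Yellow (Money Moves)"))
```

Keeping the rules in one function means a new spelling exception only needs to be added in one place.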

In [314]:
#number of songs that went number one
len(no1_unique)

90

In [315]:
artist_list = no1_unique.base_artist.unique().tolist()

In [14]:
billboard_key = no1_unique['key'].str.lower().tolist()
billboard_key

['kesha_tik tok',
 'the black eyed peas_imma be',
 'taio cruz_break your heart',
 'rihanna_rude boy',
 "b.o.b_nothin' on you",
 'usher_omg',
 'eminem_not afraid',
 'katy perry_california gurls',
 'eminem_love the way you lie',
 'katy perry_teenage dream',
 'bruno mars_just the way you are',
 'far east movement_like a g6',
 'kesha_we r who we r',
 "rihanna_what's my name?",
 'rihanna_only girl',
 'p!nk_raise your glass',
 'katy perry_firework',
 'bruno mars_grenade',
 'britney spears_hold it against me',
 'wiz khalifa_black and yellow',
 'lady gaga_born this way',
 'katy perry_e.t.',
 'rihanna_s&m',
 'adele_rolling in the deep',
 'pitbull_give me everything',
 'lmfao_party rock anthem',
 'katy perry_last friday night',
 'maroon 5_moves like jagger',
 'adele_someone like you',
 'rihanna_we found love',
 'lmfao_sexy and i know it',
 'adele_set fire to the rain',
 'kelly clarkson_stronger',
 'katy perry_part of me',
 'fun._we are young',
 'gotye_somebody that i used to know',
 'carly rae j

In [15]:
title_list = []
base_list = no1_unique.title.unique().tolist()

for x in base_list:
    x = x.split(' (')
    x = x[0]
    title_list.append(x)

In [16]:
len(title_list)

90

In [None]:
# all songs that reached at least the number 5 position during the sample period
top5 = billboard[billboard.rank_current_week <= 5]
top5.artist_title.nunique()

In [None]:
# all songs that reached at least the number 10 position during the sample period
top10 = billboard[billboard.rank_current_week <= 10]
top10.artist_title.nunique()

In [None]:
#How many weeks did each of those songs appear on the charts?
# each row is one chart week for a song, so the group size is the number of weeks on the chart
week_counts = billboard.groupby(['artist','title']).size().reset_index(name='count')
week_counts.sort_values('count', ascending=False).head(10)

### Spotify Exploration

In [20]:
#Of 1,000,000 playlists, how many were updated three or fewer times? 

In [21]:
#file names - create list of file path strings to leverage in loops
spotify_dir = os.path.join('..', 'data', 'Spotify', 'data')       #source folder, built portably (no backslash escapes)
file_list = []                                                    #empty list for strings to land
for file in os.listdir(spotify_dir):                              #for loop to locate each file in source folder
    file_name = os.path.join(spotify_dir, os.fsdecode(file))      #create a file string name to be read in
    file_list.append(file_name)                                   #add name to list
#file_list                                                         #print resulting list

In [22]:
# get playlist IDs where the playlists were updated three or fewer times
pid_list = []
for file in file_list: #for loop to iterate through files
    with open(file) as data_file:
        d = json.load(data_file)
        playlists = pd.json_normalize(d['playlists'])
        edit_reqs = playlists[playlists.num_edits <= 3]
        pid_list.append(edit_reqs.pid.unique())

In [23]:
#list of pid arrays to single list of pids
pid_list = np.concatenate(pid_list).ravel().tolist()


In [24]:
#count of playlists with three or fewer edits
len(pid_list)

174083

In [25]:
#percent of playlists from Spotify's million dataset
round(len(pid_list) / 1000000 * 100,2)

17.41
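
For a quick check on a single slice file without pandas, the same filter can be sketched in plain Python. The field names `playlists`, `pid`, and `num_edits` come from the cells above; `pids_with_few_edits` and the sample values are hypothetical and only illustrate the logic.

```python
def pids_with_few_edits(slice_data: dict, max_edits: int = 3) -> list:
    """Return the pid of every playlist edited max_edits times or fewer."""
    return [
        playlist["pid"]
        for playlist in slice_data["playlists"]
        if playlist["num_edits"] <= max_edits
    ]


# Made-up stand-in for one parsed slice file (json.load output).
sample_slice = {
    "playlists": [
        {"pid": 0, "num_edits": 2},
        {"pid": 1, "num_edits": 7},
        {"pid": 2, "num_edits": 3},
    ]
}

few = pids_with_few_edits(sample_slice)
share = round(len(few) / len(sample_slice["playlists"]) * 100, 2)
print(few, share)
```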

### Preparation

In [316]:
# Artist Name, Track Title of all #1 songs
no1_unique[['artist','title']]

Unnamed: 0,artist,title
0,Ke$ha,tik tok
1,The Black Eyed Peas,imma be
2,Taio Cruz Featuring Ludacris,break your heart
3,Rihanna,rude boy
4,B.o.B Featuring Bruno Mars,nothin' on you
...,...,...
85,Luis Fonsi & Daddy Yankee Featuring Justin Bieber,despacito
86,Taylor Swift,look what you made me do
87,Cardi B,bodak yellow (money moves)
88,Post Malone Featuring 21 Savage,rockstar


#### Uncomment the cells below for first time use
First cell creates an empty dataframe. </br>

Second cell is a loop that iterates through Spotify Million Playlist Dataset.  The result: </br>
<li> Returns only playlists with three or fewer edits </li>
<li> Returns playlist and track information for instances in which a track that reached No. 1 on the Billboard Charts was added to a user playlist during the sample period </li>
Note the cells above must be executed to generate a list of PIDs (playlist identifiers) and tracks that hit No. 1 (billboard_key)

Third cell checks the number of unique tracks in the results; 90 songs reached number one during the period (see `len(no1_unique)` above), so the cell should return 90.

Fourth cell checks the results and prints the title of any track that hit No. 1 on Billboard, but had no occurrences in the Spotify dataframe.

Fifth cell prints results to a .csv


Iterate over Spotify data to identify playlists that have one or more #1 tracks in the playlist </br>
Isolate relevant song lines and write to a new Spotify dataframe/file
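
The matching logic described above can be sketched without pandas on a tiny, made-up slice. This is an illustration under assumptions rather than the notebook's actual loop: `normalize_track_key` and `matching_rows` are hypothetical helpers, the field names follow the dataset fields used in the commented cells below, and the sample playlists are invented.

```python
def normalize_track_key(artist_name: str, track_name: str) -> str:
    """Apply the same clean-up steps as the loop: lowercase, trim suffixes."""
    artist = artist_name.lower().split(",")[0].split(" (")[0]
    track = track_name.lower().split(" (")[0].split(" -")[0]
    return f"{artist}_{track}"


def matching_rows(slice_data: dict, allowed_pids: set, billboard_keys: set) -> list:
    """Collect one row per No. 1 track found in an allowed playlist."""
    rows = []
    for playlist in slice_data["playlists"]:
        if playlist["pid"] not in allowed_pids:
            continue  # playlist has more than three edits, skip it
        for track in playlist["tracks"]:
            key = normalize_track_key(track["artist_name"], track["track_name"])
            if key in billboard_keys:
                rows.append({"pid": playlist["pid"], "key": key,
                             "modified_at": playlist["modified_at"]})
    return rows


# Invented sample data for illustration only.
sample_slice = {
    "playlists": [
        {"pid": 10, "modified_at": 1493683200, "tracks": [
            {"artist_name": "Drake", "track_name": "One Dance"},
            {"artist_name": "Example Artist", "track_name": "Example Song"},
        ]},
        {"pid": 11, "modified_at": 1493683200, "tracks": [
            {"artist_name": "Drake", "track_name": "One Dance"},
        ]},
        {"pid": 12, "modified_at": 1493683200, "tracks": [
            {"artist_name": "The Chainsmokers", "track_name": "Closer - Remix"},
        ]},
    ]
}

rows = matching_rows(sample_slice, allowed_pids={10, 12},
                     billboard_keys={"drake_one dance", "the chainsmokers_closer"})
print(rows)
```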

In [None]:
# #initiate empty dataframe
# top_song_playlists = pd.DataFrame()

In [None]:
# hold = []
# for file in file_list: 
#     with open(file) as data_file:
#         d = json.load(data_file)
#         playlists = pd.json_normalize(d['playlists'])
#         playlists = playlists[playlists.pid.isin(pid_list)]
#         tracks = pd.json_normalize(d, record_path=['playlists','tracks'],meta=[['playlists','pid']])
#         tracks = tracks[tracks['playlists.pid'].isin(pid_list)]
#         tracks['artist_name'] = tracks['artist_name'].str.lower()
#         tracks['artist_name'] = tracks['artist_name'].str.split(',').str[0]
#         tracks['artist_name'] = tracks['artist_name'].str.split(r" \(").str[0]
#         tracks['track_name'] = tracks['track_name'].str.lower()
#         tracks['track_name'] = tracks['track_name'].str.split(r" \(").str[0]
#         tracks['track_name'] = tracks['track_name'].str.split(' -').str[0]
#         tracks['key'] = tracks['artist_name'] + '_' + tracks['track_name'].str.lower()
#         tracks = tracks[tracks['key'].isin(billboard_key)]
#         df = tracks.merge(playlists, how='inner', left_on='playlists.pid', right_on='pid')
#         df = df.drop(columns=['tracks','description','playlists.pid'])
#         df.modified_at = pd.to_datetime(df.modified_at, unit = 's')
#         hold.append(df)
#         print(f'file {file} complete')
        
# top_song_playlists = pd.concat(hold)

In [None]:
# #check to see if all #1 songs are present in the result dataframe
# spotify_tracks = top_song_playlists.track_name.unique().tolist()
# len(spotify_tracks)

In [None]:
tracks.key.unique()

In [None]:
# spotify_tracks is a plain list (from .tolist()), so slice it instead of calling .head()
spotify_tracks[:5]

In [None]:
#check songs present in the billboard top songs list that aren't present in the spotify result set
for x in title_list:
    if x.lower() in spotify_tracks:
        pass
    else:
        print(f'{x} not found')
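
The same check can return the missing titles as a list, which is handy when the result needs to be inspected or saved rather than printed. A minimal sketch, assuming both inputs are plain lists of title strings; `missing_titles` and the sample titles are hypothetical.

```python
def missing_titles(billboard_titles: list, spotify_titles: list) -> list:
    """Return Billboard titles that never appear in the Spotify results."""
    present = {title.lower() for title in spotify_titles}  # set gives fast lookups
    return [title for title in billboard_titles if title.lower() not in present]


# Illustrative inputs only.
example_billboard = ["tik tok", "imma be", "rude boy"]
example_spotify = ["Tik Tok", "rude boy"]
print(missing_titles(example_billboard, example_spotify))
```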

In [None]:
#print results
top_song_playlists.to_csv('../data/Spotify/top_song_playlists_tracks.csv')

### #1 Song Analysis

In [279]:
#read in spotify results csv:
spotify = pd.read_csv('../data/Spotify/top_song_playlists_tracks.csv')
spotify.modified_at = pd.to_datetime(spotify.modified_at)

In [243]:
#convert all billboard data to compatible format
billboard_data = billboard.copy()  #copy so the original billboard dataframe is not modified in place
billboard_data[['base_artist','expand']] = billboard_data['artist'].str.split(' Featuring ', expand=True)
billboard_data['base_artist'] = billboard_data['base_artist'].str.lower()
billboard_data['base_artist'] = billboard_data['base_artist'].replace({'ke$ha':'kesha', 'far*east movement':'far east movement','luis fonsi & daddy yankee':'luis fonsi'})
billboard_data['title'] = billboard_data['title'].replace('Uptown Funk!','uptown funk')
billboard_data['title'] = billboard_data['title'].str.lower()
billboard_data['key'] = billboard_data['base_artist'] + '_' + billboard_data['title'].str.lower().str.split(r" \(").str[0]

billboard_data = billboard_data[billboard_data.key.isin(billboard_key)]

In [244]:
# Of x playlists with 3 or fewer edits, how many had a Billboard Charting song on it?
print('Total playlist count, original dataset: 1000000')
print('Total playlist count, three or fewer edits: ' + str(len(pid_list)))
print('Total playlists with one or more Billboard No. 1 song(s): ' + str(spotify.pid.nunique()))

Total playlist count, original dataset: 1000000
Total playlist count, three or fewer edits: 174083
Total playlists with one or more Billboard No. 1 song(s): 43751


In [245]:
# #Which artists had the most number one songs on the billboard charts?
# billboard_data.groupby(['base_artist'],group_keys=True)[['title']].nunique().sort_values('title', ascending=False) #.head(20)

In [246]:
# #Which artists had the most #1 songs in playlists?
# spotify.groupby(['artist_name'],group_keys=True)[['track_name']].nunique().sort_values('track_name', ascending=False)

In [247]:
#Which #1 songs spent the most weeks on the chart (at any position)?
weeks_on_chart_rank = billboard_data.groupby(['base_artist', 'title', 'key'],group_keys=True)[['week_of']].nunique().sort_values('week_of', ascending=False).reset_index()
weeks_on_chart_rank

Unnamed: 0,base_artist,title,key,week_of
0,lmfao,party rock anthem,lmfao_party rock anthem,68
1,adele,rolling in the deep,adele_rolling in the deep,65
2,gotye,somebody that i used to know,gotye_somebody that i used to know,59
3,john legend,all of me,john legend_all of me,59
4,katy perry,dark horse,katy perry_dark horse,57
...,...,...,...,...
85,baauer,harlem shake,baauer_harlem shake,20
86,taylor swift,look what you made me do,taylor swift_look what you made me do,17
87,britney spears,hold it against me,britney spears_hold it against me,17
88,ed sheeran,perfect,ed sheeran_perfect,16


In [248]:
# #Which #1 songs were most present in playlists? 
# s_track_count = spotify.groupby(['artist_name', 'track_name'],group_keys=True)[['pid']].count().sort_values('pid', ascending=False).reset_index()
# s_track_count
# #Are there any #1 songs that were not on any playlists?
# #No - our count of 90 unique titles shows this.

In [249]:
#10 Top Performing Billboard songs
b_top10 = weeks_on_chart_rank[['base_artist','title','key']].head(10)
b_top10

Unnamed: 0,base_artist,title,key
0,lmfao,party rock anthem,lmfao_party rock anthem
1,adele,rolling in the deep,adele_rolling in the deep
2,gotye,somebody that i used to know,gotye_somebody that i used to know
3,john legend,all of me,john legend_all of me
4,katy perry,dark horse,katy perry_dark horse
5,mark ronson,uptown funk,mark ronson_uptown funk
6,justin timberlake,can't stop the feeling!,justin timberlake_can't stop the feeling!
7,sia,cheap thrills,sia_cheap thrills
8,the chainsmokers,closer,the chainsmokers_closer
9,wiz khalifa,see you again,wiz khalifa_see you again


In [250]:
#Do the top 10 performing billboard songs appear most on the user playlists?
#10 Top Performing (most playlisted) songs
s_top10 = spotify.groupby(['artist_name', 'track_name','key'],group_keys=True)[['pid']].count().sort_values('pid', ascending=False).reset_index().head(10)
s_top10

Unnamed: 0,artist_name,track_name,key,pid
0,drake,one dance,drake_one dance,3789
1,the chainsmokers,closer,the chainsmokers_closer,3219
2,kendrick lamar,humble.,kendrick lamar_humble.,2997
3,mark ronson,uptown funk,mark ronson_uptown funk,2732
4,luis fonsi,despacito,luis fonsi_despacito,2724
5,justin bieber,sorry,justin bieber_sorry,2693
6,the weeknd,the hills,the weeknd_the hills,2658
7,the weeknd,can't feel my face,the weeknd_can't feel my face,2602
8,ed sheeran,shape of you,ed sheeran_shape of you,2540
9,rihanna,work,rihanna_work,2458


In [251]:
b_toplist = b_top10.title.str.lower().tolist()
s_toplist = s_top10.track_name.str.lower().tolist()

b_keys = b_top10.key.str.lower().tolist()
s_keys = s_top10.key.str.lower().tolist()

In [252]:
b_keys

['lmfao_party rock anthem',
 'adele_rolling in the deep',
 'gotye_somebody that i used to know',
 'john legend_all of me',
 'katy perry_dark horse',
 'mark ronson_uptown funk',
 "justin timberlake_can't stop the feeling!",
 'sia_cheap thrills',
 'the chainsmokers_closer',
 'wiz khalifa_see you again']

In [253]:
#which songs were top performers on billboard and spotify?
def both(b_list, s_list):
    return [x for x in b_list if x in s_list]

both(b_toplist, s_toplist)

['uptown funk', 'closer']
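
Comparing titles alone could, in principle, match two different songs that share a name, so the overlap can also be computed on the artist_title keys. The sketch below hard-codes the two top-10 key lists shown in the outputs above; `overlap` is a hypothetical helper.

```python
def overlap(first: list, second: list) -> list:
    """Return items of first that also occur in second, keeping first's order."""
    second_set = set(second)
    return [item for item in first if item in second_set]


# Keys copied from the b_top10 and s_top10 outputs above.
billboard_top = [
    "lmfao_party rock anthem", "adele_rolling in the deep",
    "gotye_somebody that i used to know", "john legend_all of me",
    "katy perry_dark horse", "mark ronson_uptown funk",
    "justin timberlake_can't stop the feeling!", "sia_cheap thrills",
    "the chainsmokers_closer", "wiz khalifa_see you again",
]
spotify_top = [
    "drake_one dance", "the chainsmokers_closer", "kendrick lamar_humble.",
    "mark ronson_uptown funk", "luis fonsi_despacito", "justin bieber_sorry",
    "the weeknd_the hills", "the weeknd_can't feel my face",
    "ed sheeran_shape of you", "rihanna_work",
]

shared = overlap(billboard_top, spotify_top)
combined = set(billboard_top) | set(spotify_top)
print(shared, len(combined))
```

Two shared keys out of twenty leave 18 distinct songs.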

In [254]:
#what are the titles of the top performing songs across billboard and spotify?
top_song_list = list(set(b_toplist + s_toplist))
top_song_keys = list(set(b_keys + s_keys))

In [317]:
len(top_song_list)

18

### Visualization Exports

In [None]:
#billboard data export - top 10 on billboard by weeks on the chart
billboard_data[billboard_data.key.isin(b_keys)].to_csv('../viz exports/billboard_top_performers.csv', index=False)

In [None]:
#spotify data export - top 10 on spotify by playlist count
spotify[spotify.key.isin(s_keys)].to_csv('../viz exports/spotify_top_perfomrers.csv', index=False)

In [230]:
#billboard data export - all 18 top performers
billboard18 =  billboard_data[billboard_data.key.isin(top_song_keys)]
billboard18.to_csv('../viz exports/all_top_performers_billboard.csv', index=False)

In [325]:
#spotify data subset - all 18 top performers (copy so later column assignments do not warn)
spotify18 = spotify[spotify.key.isin(top_song_keys)].copy()

In [312]:
# Append the spotify dataframe with a "week of" grouping to sum playlist modifications over the billboard reporting week.
#function - yields a date for every 7 days from start date through end date
def daterange(start_date, end_date):
    for n in range(0, int((end_date - start_date).days) + 1, 7):
        yield start_date + timedelta(n)

#initialization - list
billboard_weeks = []

#initialization - parameters
start = date(2010,1,9)
end = date(2017,12,30)

# append loop - store each chart week as a 'YYYY-MM-DD' string
for dt in daterange(start, end):
    billboard_weeks.append(dt.strftime('%Y-%m-%d'))

In [289]:
billboard_weeks

['2010-01-09',
 '2010-01-16',
 '2010-01-23',
 '2010-01-30',
 '2010-02-06',
 '2010-02-13',
 '2010-02-20',
 '2010-02-27',
 '2010-03-06',
 '2010-03-13',
 '2010-03-20',
 '2010-03-27',
 '2010-04-03',
 '2010-04-10',
 '2010-04-17',
 '2010-04-24',
 '2010-05-01',
 '2010-05-08',
 '2010-05-15',
 '2010-05-22',
 '2010-05-29',
 '2010-06-05',
 '2010-06-12',
 '2010-06-19',
 '2010-06-26',
 '2010-07-03',
 '2010-07-10',
 '2010-07-17',
 '2010-07-24',
 '2010-07-31',
 '2010-08-07',
 '2010-08-14',
 '2010-08-21',
 '2010-08-28',
 '2010-09-04',
 '2010-09-11',
 '2010-09-18',
 '2010-09-25',
 '2010-10-02',
 '2010-10-09',
 '2010-10-16',
 '2010-10-23',
 '2010-10-30',
 '2010-11-06',
 '2010-11-13',
 '2010-11-20',
 '2010-11-27',
 '2010-12-04',
 '2010-12-11',
 '2010-12-18',
 '2010-12-25',
 '2011-01-01',
 '2011-01-08',
 '2011-01-15',
 '2011-01-22',
 '2011-01-29',
 '2011-02-05',
 '2011-02-12',
 '2011-02-19',
 '2011-02-26',
 '2011-03-05',
 '2011-03-12',
 '2011-03-19',
 '2011-03-26',
 '2011-04-02',
 '2011-04-09',
 '2011-04-

In [326]:
billboard_week_df = pd.DataFrame(billboard_weeks, columns=["week_of"])
billboard_week_df["week_of"] = pd.to_datetime(billboard_week_df["week_of"])
# pair the ISO week with the ISO year so weeks that straddle New Year stay unique
billboard_iso = billboard_week_df["week_of"].dt.isocalendar()
billboard_week_df["Int Week"] = billboard_iso.week
billboard_week_df["Int Year"] = billboard_iso.year

spotify18 = spotify18.copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below
spotify18["modified_at"] = pd.to_datetime(spotify18["modified_at"])
spotify_iso = spotify18["modified_at"].dt.isocalendar()
spotify18["Int Week"] = spotify_iso.week
spotify18["Int Year"] = spotify_iso.year

spotify18 = spotify18.merge(billboard_week_df, on=["Int Week", "Int Year"], how="left")

In [327]:
spotify18.head()

Unnamed: 0.1,Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,key,...,modified_at,num_tracks,num_albums,num_followers,num_edits,duration_ms_y,num_artists,Int Week,Int Year,week_of
0,3,21,lmfao,spotify:track:7mitXLIMCflkhZiD34uEQI,spotify:artist:3sgFRtyBnxXD5ESfmbK4dl,party rock anthem,spotify:album:0D49RvtlLCKyxeDKDnBU2R,262146,Sorry For Party Rocking,lmfao_party rock anthem,...,2015-05-07,80,71,1,3,19156557,56,19,2015,2015-05-09
1,11,9,drake,spotify:track:1xznGGDReH1oQq0xzbwXa3,spotify:artist:3TVXtAsR1Inumwj472S9r4,one dance,spotify:album:3hARKC8cinq3mZLLAEaBh9,173986,Views,drake_one dance,...,2016-09-04,26,24,1,3,5819956,23,35,2016,2016-09-03
2,13,0,ed sheeran,spotify:track:7qiZfU4dY1lWllzX7mPBI3,spotify:artist:6eUKZXaKkcviH0Ku9w2n3V,shape of you,spotify:album:3T4tUhGYeRNVUGevb0wThu,233712,÷,ed sheeran_shape of you,...,2017-05-02,26,25,1,3,5549763,24,18,2017,2017-05-06
3,14,11,sia,spotify:track:27SdWb2rFzO6GWiYDBTD9j,spotify:artist:5WUlDfRSoLAfcVSX1WnrxN,cheap thrills,spotify:album:77jAfTh3KH9K2reMOmTgOh,211666,This Is Acting,sia_cheap thrills,...,2017-05-02,26,25,1,3,5549763,24,18,2017,2017-05-06
4,16,7,drake,spotify:track:1xznGGDReH1oQq0xzbwXa3,spotify:artist:3TVXtAsR1Inumwj472S9r4,one dance,spotify:album:3hARKC8cinq3mZLLAEaBh9,173986,Views,drake_one dance,...,2016-07-21,18,12,1,2,4337022,4,29,2016,2016-07-23
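
The week assignment visible in the rows above can also be reproduced with plain date arithmetic, which is a convenient way to spot-check the merge. This sketch assumes that chart week dates fall on Saturdays (as in the generated list starting 2010-01-09) and that an edit belongs to the Saturday of its Monday-to-Sunday ISO week; `chart_week_for` is a hypothetical helper.

```python
from datetime import date, timedelta


def chart_week_for(day: date) -> date:
    """Return the Saturday of the ISO (Monday-Sunday) week containing day."""
    monday = day - timedelta(days=day.weekday())  # weekday(): Monday == 0
    return monday + timedelta(days=5)             # Monday + 5 days == Saturday


# modified_at dates taken from the spotify18.head() output above.
for modified in (date(2015, 5, 7), date(2016, 9, 4), date(2017, 5, 2), date(2016, 7, 21)):
    print(modified, "->", chart_week_for(modified))
```

The four results agree with the `week_of` values shown in the table.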


In [328]:
spotify18.to_csv('../viz exports/all_top_performers_spotify.csv')

In [358]:
#spotify data read in - all top performers
topspotify = pd.read_csv('../viz exports/all_top_performers_spotify.csv')
topspotify.modified_at = pd.to_datetime(topspotify.modified_at)

In [330]:
#billboard data read in - all top performers
topbillboard = pd.read_csv('../viz exports/all_top_performers_billboard.csv')
topbillboard.week_of = pd.to_datetime(topbillboard.week_of)

In [359]:
spotify_update_count = topspotify[["key", "week_of","modified_at"]]

In [360]:
spotify_update_count = pd.DataFrame(spotify_update_count.groupby(["key", "week_of"])[["modified_at"]].count().reset_index())

In [361]:
spotify_update_count.to_csv('../viz exports/spotify_updates_by_week.csv')
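
The weekly count produced by the groupby above is simply the number of rows per (key, week_of) pair. A small sketch with `collections.Counter` shows the same idea on invented rows; the keys are taken from this notebook, but the rows themselves are illustrative only.

```python
from collections import Counter

# Invented (key, week_of) pairs, one per playlist row.
sample_rows = [
    ("drake_one dance", "2016-09-03"),
    ("drake_one dance", "2016-09-03"),
    ("drake_one dance", "2016-07-23"),
    ("sia_cheap thrills", "2017-05-06"),
]

# Counter tallies identical (key, week_of) tuples, like groupby(...).count().
updates_by_week = Counter(sample_rows)

for (key, week_of), count in sorted(updates_by_week.items()):
    print(key, week_of, count)
```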