## 01. Import dataset

"billboard_dataset.csv" is a csv file originated from Dhruvil Dave's [Billboard "The Hot 100" Songs](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) which contains the collection of all "The Hot 100" charts released since its inception in 1958 until 2021-11-06.

<br>

I made two modifications on the original dataset.

1. I reduced the data range to cover from January 2, 2016, to November 6, 2021, due to Spotify's API rate limits for artist genre queries. This modification, however, does not affect my thesis, which examines changes in popularity before and after COVID-19.

2. I arranged the dates in descending order to better visualize the changes in popularity over time.

In [1]:
import pandas as pd

input_file = "billboard_dataset.csv"

try:
    df = pd.read_csv(input_file, on_bad_lines='skip', encoding='latin1')
except Exception as e:
    print(e)

In [2]:
df.head(20)

Unnamed: 0,date,rank,song,artist,last-week,peak-rank,weeks-on-board
0,2016-01-02,1,Hello,Adele,1.0,1,8
1,2016-01-02,2,Sorry,Justin Bieber,2.0,2,8
2,2016-01-02,3,Hotline Bling,Drake,3.0,2,20
3,2016-01-02,4,Love Yourself,Justin Bieber,5.0,4,5
4,2016-01-02,5,What Do You Mean?,Justin Bieber,4.0,1,16
5,2016-01-02,6,Same Old Love,Selena Gomez,9.0,6,14
6,2016-01-02,7,The Hills,The Weeknd,6.0,1,30
7,2016-01-02,8,Here,Alessia Cara,8.0,8,20
8,2016-01-02,9,Stitches,Shawn Mendes,7.0,4,30
9,2016-01-02,10,Like I'm Gonna Lose You,Meghan Trainor Featuring John Legend,10.0,8,24


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30600 entries, 0 to 30599
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            30600 non-null  object 
 1   rank            30600 non-null  int64  
 2   song            30600 non-null  object 
 3   artist          30600 non-null  object 
 4   last-week       26534 non-null  float64
 5   peak-rank       30600 non-null  int64  
 6   weeks-on-board  30600 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 1.6+ MB


## 02. Preprocess given data

In [4]:
import pandas as pd

output_file = "preprocessed_billboard_data.csv"

# artist column       
df['artist'] = df['artist'].str.strip()
df['artist'] = df['artist'].apply(lambda x: str(x).split('Featuring')[0])
df['artist'] = df['artist'].apply(lambda x: str(x).split('&')[0])
df['artist'] = df['artist'].apply(lambda x: str(x).split('x')[0])
df['artist'] = df['artist'].apply(lambda x: str(x).split('X')[0])
df['artist'] = df['artist'].str.replace('"', '')

# drop unnecessary columns
features_drop = ['last-week', 'peak-rank', 'weeks-on-board']
df.drop(features_drop, axis=1, inplace=True)

In [5]:
df.head(20)

Unnamed: 0,date,rank,song,artist
0,2016-01-02,1,Hello,Adele
1,2016-01-02,2,Sorry,Justin Bieber
2,2016-01-02,3,Hotline Bling,Drake
3,2016-01-02,4,Love Yourself,Justin Bieber
4,2016-01-02,5,What Do You Mean?,Justin Bieber
5,2016-01-02,6,Same Old Love,Selena Gomez
6,2016-01-02,7,The Hills,The Weeknd
7,2016-01-02,8,Here,Alessia Cara
8,2016-01-02,9,Stitches,Shawn Mendes
9,2016-01-02,10,Like I'm Gonna Lose You,Meghan Trainor


## 03. Use Spotify API to get the genre information of each artists.

note: Unfortunately, I could not find a way to get the genre of the song itself. So as an alternative, I used Spotify's API to get the genre information of an aritst.

In [63]:
import requests
import base64
from dotenv import load_dotenv
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy


load_dotenv()

def get_access_token(client_id, client_secret):
    auth_url = 'https://accounts.spotify.com/api/token'
    auth_headers = {
        'Authorization': 'Basic ' + base64.b64encode((client_id + ':' + client_secret).encode()).decode()
    }
    auth_data = {
        'grant_type': 'client_credentials'
    }

    response = requests.post(auth_url, headers=auth_headers, data=auth_data)

    if response.status_code == 200:
        token = response.json()['access_token']
        print(f"Access token generated successfully: {token[:5]}...")
        return token
    else:
        return 'Error:', response.status_code


def spotify_auth(client_id, client_secret):
    access_token = get_access_token(client_id, client_secret)
    try:
        client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
        sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
        print("Spotify authentication successful")
        return (access_token, sp)
    
    except spotipy.oauth2.SpotifyOauthError as e:
        print(f"Error: {e}")
        exit(1)


def get_artist_genres(artist, song, access_token, sp):
    try:
        # search for artist id
        # improved search by enhancing the query
        query = f"track:{song} artist:{artist}"
        result = sp.search(q=query, type='track', limit=1, offset=0)
        artist_id = result['tracks']['items'][0]['artists'][0]['id'] if result['tracks']['items'] else None

        if not artist_id:
             return None
        
        headers = {
            "Authorization": f"Bearer {access_token}"
        }
        # get artist information using artist id
        response = requests.get("https://api.spotify.com/v1/artists/{id}".format(id=artist_id), headers=headers)
        data = response.json()
        return data["genres"]
    except spotipy.SpotifyException as e:
        print(f"Error: {e}")
        return None     

현재 여기까지 진행

많은 artist의 genre 가 빈 list 형태로 나오고 있는중. Spotify API 측에서 변경이 일어난 것으로 확인됨. 

관련 이슈: https://github.com/jpotw/billboard_genre_analysis/issues/1

In [None]:
import pandas as pd
import os


client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')
access_token, sp = spotify_auth(client_id, client_secret)

df['genres'] = None

for artist in df['artist'].unique():
    try:
        song = list(df.loc[df['artist'] == artist, 'song'])[0]
        genres = get_artist_genres(artist, song, access_token, sp)
        if genres:
            df.loc[df['artist'] == artist, 'genres'] = genres * len(df[df['artist'] == artist])
        else:
            df[df['artist'] == artist]['genres'] = None
        print(f"Genres retrieved for {artist}: {genres}")
    except ValueError as e:
        print(f"Error retrieving genres for {artist}: {e}")

Access token generated successfully: BQBos...
Spotify authentication successful
Genres retrieved for Adele: ['soft pop']
Genres retrieved for Justin Bieber: []


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['artist'] == artist]['genres'] = None


Error retrieving genres for Drake: Must have equal len keys and value when setting with an iterable
Genres retrieved for Selena Gomez: ['pop']
Genres retrieved for The Weeknd: []


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['artist'] == artist]['genres'] = None


Genres retrieved for Alessia Cara: []


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['artist'] == artist]['genres'] = None


Genres retrieved for Shawn Mendes: ['pop']


## 04. Simplify Multiple genres into common genres

In [125]:
def convert_song_genres(batch):
    """
    converts the genre of all the songs in a single batch (size of 100) to simplified genres
    """
    batch['genres'] = batch['genres'].apply(simplify_genres)
    return batch

def simplify_genres(genre_str):
    """
    simplify the genre string of a single song to a list of mapped genres
    """
    if 'Need Manual Check' in genre_str:
        return None
    
    else: 
        genre_list = eval(genre_str)
        simplified_genres = set()
        
        genre_mapping = {
            'pop': 'Pop',
            'rock': 'Rock',
            'hip hop': 'Hip Hop',
            'rap': 'Hip Hop',
            'r&b': 'R&B',
            'electronic': 'Electronic/Dance',
            'edm': 'Electronic/Dance',
            'dance': 'Electronic/Dance',
            'country': 'Country',
            'jazz': 'Jazz',
            'reggae': 'Reggae',
            'latin': 'Latin',
            'classical': 'Classical'
        }
        
        for g in genre_list:
            g = g.lower()
            for k, v in genre_mapping.items():
                if k == g:
                    simplified_genres.add(v)
        
        return list(simplified_genres) if simplified_genres else None

In [126]:
input_folder = "output"
output_folder = "simplified_output"

os.makedirs(output_folder, exist_ok=True)

batch_list = os.listdir(input_folder)

for batch in batch_list:
    if batch.endswith(".csv"):
        file_path = os.path.join(input_folder, batch)
        
        if os.path.getsize(file_path) == 0:
            print(f"Empty file: {batch}")
            continue
                
        df = pd.read_csv(file_path)
        new_file_path = os.path.join(output_folder, batch)
        df2 = convert_song_genres(df)
        df2.to_csv(new_file_path, index=False)

print("파일 수정 및 저장이 완료되었습니다.")

파일 수정 및 저장이 완료되었습니다.


In [127]:
df2.head()

Unnamed: 0,date,rank,song_title,artist,genres
0,2017-11-18,1.0,Rockstar,Post Malone,"[Pop, Hip Hop]"
1,2017-11-18,2.0,Havana,Camila Cabello,[Pop]
2,,,Bodak Yellow (Money Moves),Cardi B,
3,2017-11-18,4.0,1-800-273-8255,Logic,[Hip Hop]
4,2017-11-18,5.0,Thunder,Imagine Dragons,"[Rock, Pop]"


In [130]:
# merge all the files into one

import glob

files = os.path.join("simplified_output/", "genres_data_batch_*.csv")
joined_list = glob.glob(files)

joined_df = pd.concat([pd.read_csv(f) for f in joined_list], ignore_index=True)
joined_df

Unnamed: 0,date,rank,song_title,artist,genres
0,2016-01-02,1.0,Hello,Adele,['Pop']
1,2016-01-02,2.0,Sorry,Justin Bieber,['Pop']
2,2016-01-02,3.0,Hotline Bling,Drake,['Hip Hop']
3,2016-01-02,4.0,Love Yourself,Justin Bieber,['Pop']
4,2016-01-02,5.0,What Do You Mean?,Justin Bieber,['Pop']
...,...,...,...,...,...
30595,,,Too Hotty,"Quavo, Takeoff",
30596,2017-11-18,97.0,All The Pretty Girls,Kenny Chesney,['Country']
30597,2017-11-18,98.0,All On Me,Devin Dawson,
30598,2017-11-18,99.0,More Girls Like You,Kip Moore,['Country']


In [131]:
joined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30600 entries, 0 to 30599
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        29953 non-null  object 
 1   rank        29953 non-null  float64
 2   song_title  30600 non-null  object 
 3   artist      30364 non-null  object 
 4   genres      23852 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.2+ MB


## Feature Engineering

Columns that need to be filled: 1) date, 2) rank, 3) artist 4) genres

In [156]:
len(df[joined_df['date'].isnull()]['song_title'].unique().tolist())

63

In [175]:
# rank resets to every 100th song (since the index start from 0)

if any(df[joined_df['date'].isnull()].index % 100 == 0):
    joined_df['date'].bfill(inplace=True)
    joined_df['rank'].bfill(inplace=True)
else:
    joined_df['date'].ffill(inplace=True)
    joined_df['rank'].ffill(inplace=True)

All the unique lists of song turned out to be the song of 'XXXTENTACION'.

According to the [Spotify artist page](https://open.spotify.com/artist/15UsOTVnJzReFVN1VCnxy4), the main genres of 'XXXTENTACION' are 'Hip Hop', 'R&B', and 'Rock'

In [180]:
df[joined_df['artist'].isnull()]['song_title'].unique().tolist()

['Jocelyn Flores',
 'F**k Love',
 'Sad!',
 'Changes',
 'Moonlight',
 'The Remedy For A Broken Heart (Why Am I So In Love)',
 'Numb',
 'Infinity (888)',
 'Going Down!',
 'Everybody Dies In Their Nightmares',
 'Hope',
 'Arms Around You',
 'BAD!',
 'whoa (mind in awe)',
 'Guardian Angel',
 "I Don't Let Go",
 'One Minute',
 'Train Food',
 'What Are You So Afraid Of',
 'Staring At The Sky',
 'Difference (Interlude)',
 'Sauce!',
 'Bad Vibes Forever',
 'Unsteady',
 'Look At Me!',
 'Revenge',
 'Depression & Obsession',
 'Save Me',
 'Carry On']

In [184]:
if any(df[joined_df['artist'].isnull()]['song_title']):
    joined_df['artist'].fillna(value='XXXTENTACION', inplace=True)
    joined_df['genres'].fillna(value=['Hip Hop', 'R&B', 'Rock'], inplace=True)

In [188]:
joined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30600 entries, 0 to 30599
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        30600 non-null  object 
 1   rank        30600 non-null  float64
 2   song_title  30600 non-null  object 
 3   artist      30600 non-null  object 
 4   genres      23852 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.2+ MB


There were still 6748 data that doesn't contain genre information.

In [192]:
df[joined_df['genres'].isnull()]['artist'].unique().tolist()

['Jordan Smith',
 'Silento',
 'Cam',
 'R. City ',
 'Brenda Lee',
 'OMI',
 'iLoveMemphis',
 'Emily Ann Roberts',
 'DLOW',
 'LOCASH',
 'Rudimental ',
 'Rachel Platten',
 'Lil Dicky ',
 'Andy Grammer',
 'Macklemore ',
 'Jordan Smith ',
 'Granger Smith',
 'Dawin',
 'Lukas Graham',
 'Mark Ronson ',
 'Maren Morris',
 'Chris Stapleton',
 'DJ Luke Nasty',
 'Chase Bryant',
 'Cardi B',
 'Portugal. The Man',
 'J Balvin ',
 'Luis Fonsi ',
 'GoldLink ',
 'LANCO',
 nan,
 'MA',
 'Becky G ',
 'Garth Brooks',
 '6i',
 'Lil ',
 'Lil Peep',
 'Machine Gun Kelly, ',
 'Farruko, Nicki Minaj, Bad Bunny, 21 Savage ',
 'Jake Paul ',
 'Devin Dawson',
 'Chloe Kohanski',
 'Nat King Cole',
 'Andy Williams',
 'Burl Ives',
 'Romeo Santos ',
 'Natti Natasha ',
 'Wham!',
 'Ozuna ',
 'Huncho Jack ',
 'Huncho Jack',
 'WALK THE MOON',
 'Keala Settle ',
 'Zac Efron ',
 'Loren Allred',
 'Maluma ',
 'Hugh Jackman, Keala Settle, Zac Efron, Zendaya ',
 'Lil Skies ',
 'Jay Rock, Kendrick Lamar, Future ',
 'Enrique Iglesias ',
 '