<span style="font-family:Helvetica Light">
    
# The Bar Chart Race - Top Artist and Top Song Race

## The goal of this notebook
This notebook aims to:
* visualize, with a racing bar chart, which artists climbed over the course of the year to become my Top Artists of 2021
* visualize, with a racing bar chart, which songs climbed over the course of the year to become my Top Songs of 2021

## About the Bar Chart Race library

Bar Chart Race is one of the libraries in the dexplo suite.
From the <a href="https://www.dexplo.org/bar_chart_race/" target="_blank">official documentation</a> of the library:

<blockquote>
The overall aim of the dexplo suite of libraries is supply a powerful and efficient set of tools for doing data analysis and visualization in Python. [...] Make animated bar chart races in Python with matplotlib.
</blockquote>

#### References:
1. https://www.dexplo.org/bar_chart_race/tutorial/

</span>

<span style="font-family:Helvetica Light">
    
# 1. Set-up & Data Loading
    
Loading the necessary libraries.
    
Loading the streaming history file prepared in the previous step.
    
</span>

In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

import bar_chart_race as bcr

In [8]:
df = pd.read_csv('~/ProjectsDataScience/data_science_environment/data/spotify_my_streaming_history_2021_enriched_w_pod.csv',index_col=0)
df = df[['endTime','artistName','trackName','msPlayed']]
df.head(3)

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2021-01-01 06:01:00,Julia Wieniawa,Niezadowolona (piosenka do filmu „Wszyscy moi ...,44827
1,2021-01-01 06:01:00,Ariana Grande,god is a woman - live,31750
2,2021-01-01 06:04:00,Justin Bieber,Anyone,190779


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45085 entries, 0 to 45084
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     45085 non-null  object
 1   artistName  45085 non-null  object
 2   trackName   45085 non-null  object
 3   msPlayed    45085 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 1.7+ MB


In [10]:
# double checking for duplicates
df.duplicated(subset=['endTime','artistName','trackName','msPlayed']).sum()

0

In [11]:
# remove duplicates
df.drop_duplicates(inplace=True)

# double checking for duplicates
df.duplicated().sum()

0

<span style="font-family:Helvetica Light">

# 2. Dates & Time Transformations
    
In this step I change the _endTime_ column from UTC timezone to Europe/Berlin.
    
Additional columns are created based on the _endTime_ and _msPlayed_ columns:
* hour
* date (with no timestamp)
* week start date
* month start date
* seconds played   
* minutes played   
* hours played   
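
As a minimal sketch of the timezone step, assuming the raw timestamps are naive values recorded in UTC (as described above), the conversion and the week start date can be tried on a toy Series. The name `sample` and its values are made up for illustration.

```python
import pandas as pd

# hypothetical sample: naive timestamps that are known to be in UTC
sample = pd.Series(pd.to_datetime(['2021-01-01 06:01:00', '2021-07-01 06:01:00']))

# mark as UTC, convert to Europe/Berlin, then drop the timezone info again
local = (sample.dt.tz_localize('UTC')
               .dt.tz_convert('Europe/Berlin')
               .dt.tz_localize(None))

# start (Monday) of the week each local timestamp falls into
week_start = local.dt.to_period('W').dt.start_time

print(local.tolist())
print(week_start.tolist())
```

The winter timestamp shifts by one hour and the summer one by two, because Europe/Berlin observes daylight saving time.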
    
</span>   

In [13]:
# change column data type from object to datetime
df.endTime = pd.to_datetime(df.endTime) 

# the raw timestamps are naive but recorded in UTC, so mark them as UTC,
# convert them from UTC to the Europe/Berlin timezone,
# and drop the timezone info again to keep naive local timestamps
df['endTime'] = (df['endTime'].dt.tz_localize('UTC')
                              .dt.tz_convert('Europe/Berlin')
                              .dt.tz_localize(None))

# additional time-related transformations
df['hour'] = df['endTime'].dt.hour
df['date'] = df['endTime'].dt.to_period('D').dt.start_time
df['week'] = df['endTime'].dt.to_period('W').dt.start_time
df['month'] = df['endTime'].dt.to_period('M').dt.start_time

# convert milliseconds played into more readable units
df['sPlayed'] = df['msPlayed'] / 1000
df['mPlayed'] = df['sPlayed'] / 60
df['hPlayed'] = df['sPlayed'] / (60 * 60)

df.head()

Unnamed: 0,endTime,artistName,trackName,msPlayed,hour,date,week,month,sPlayed,mPlayed,hPlayed
0,2021-01-01 07:01:00,Julia Wieniawa,Niezadowolona (piosenka do filmu „Wszyscy moi ...,44827,7,2021-01-01,2020-12-28,2021-01-01,44.827,0.747117,0.012452
1,2021-01-01 07:01:00,Ariana Grande,god is a woman - live,31750,7,2021-01-01,2020-12-28,2021-01-01,31.75,0.529167,0.008819
2,2021-01-01 07:04:00,Justin Bieber,Anyone,190779,7,2021-01-01,2020-12-28,2021-01-01,190.779,3.17965,0.052994
3,2021-01-01 07:05:00,Justin Bieber,Anyone,9140,7,2021-01-01,2020-12-28,2021-01-01,9.14,0.152333,0.002539
4,2021-01-01 07:08:00,Shawn Mendes,Monster (Shawn Mendes & Justin Bieber),178994,7,2021-01-01,2020-12-28,2021-01-01,178.994,2.983233,0.049721


<span style="font-family:Helvetica Light">
    
# 3. Data Prep
    
From the <a href="https://www.dexplo.org/bar_chart_race/tutorial/" target="_blank">official guide</a> of the library:
<blockquote>    
The data you choose to animate as a bar chart race must be provided in a specific format. The data must be within a pandas DataFrame containing 'wide' data where:

* Each row represents a single period of time
* Each column holds the value for a particular category
* The index contains the time component (optional)
</blockquote>
    
The Top Artists and Top Songs are based on the playcount. 

For the purpose of this analysis I excluded all the records with time played under 10 seconds assuming those were just 'skip to the next one' cases.
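
To make the 'wide' format concrete, here is a minimal sketch on made-up data. It is a simplified variant (no skip filter and no top-10 selection), and the frame `plays` with its values is purely hypothetical.

```python
import pandas as pd

# hypothetical long-format plays: one row per listened track
plays = pd.DataFrame({
    'endTime': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-02',
                               '2021-01-08', '2021-01-09']),
    'artistName': ['A', 'A', 'B', 'B', 'B'],
})

# plays per week (weeks end on Sunday) and artist, one column per artist
weekly = (plays.groupby([pd.Grouper(key='endTime', freq='W'), 'artistName'])
               .size()
               .unstack(fill_value=0))

# running total per artist: one row per week, one column per artist
wide = weekly.cumsum()
print(wide)
```

Each row of `wide` is one period, each column one category, and the index carries the time component, which matches the three requirements quoted above.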

    
## 3.1. Top Artists Bar Chart Race
</span>

In [17]:
# based on playcount

# exclude song skips (for clearer results)
df_no_skips = df.loc[df['sPlayed']>10]

# calculate the playcount
weekly_artist = df_no_skips.groupby([pd.Grouper(key='endTime', freq='W'),'artistName'])['trackName'].size().reset_index()
weekly_artist['no_csum'] = weekly_artist.groupby(['artistName'])['trackName'].cumsum()

# keep only the top 10 artists in a given week
weekly_artist_top_10 = weekly_artist.set_index(['endTime', 'artistName']).groupby(level=0, group_keys=False)['no_csum'].nlargest(10)

# reshape into the wide format expected by the bar_chart_race package
weekly_artist_top_10 = weekly_artist_top_10.unstack()
# carry cumulative counts forward, then use 0 before an artist's first top-10 week
weekly_artist_top_10 = weekly_artist_top_10.ffill().fillna(0)
weekly_artist_top_10.head(10)

endTime,Adele,Alessia Cara,Antonio Vivaldi,Ariana Grande,BJ The Chicago Kid,Beyoncé,Biig Piig,Billie Eilish,Calvin Harris,Charlotte Lawrence,...,Selena Gomez,Snow Patrol,Sufjan Stevens,Surf Mesa,Tate McRae,Taylor Swift,The Weeknd,Two Feet,Ty Dolla $ign,ZAYN
2021-01-03,0.0,0.0,0.0,94.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,11.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-01-10,0.0,0.0,0.0,273.0,2.0,0.0,0.0,1.0,0.0,0.0,...,0.0,11.0,8.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0
2021-01-17,0.0,0.0,0.0,417.0,2.0,0.0,0.0,16.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,20.0,22.0,0.0,0.0,39.0
2021-01-24,0.0,0.0,0.0,426.0,2.0,0.0,64.0,278.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,25.0,25.0,0.0,0.0,47.0
2021-01-31,0.0,0.0,29.0,431.0,2.0,0.0,86.0,307.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,25.0,25.0,0.0,0.0,58.0
2021-02-07,0.0,0.0,29.0,598.0,2.0,0.0,86.0,324.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,25.0,58.0,0.0,0.0,58.0
2021-02-14,0.0,0.0,29.0,777.0,2.0,29.0,86.0,324.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,37.0,77.0,0.0,0.0,58.0
2021-02-21,0.0,0.0,29.0,784.0,2.0,57.0,86.0,326.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,37.0,79.0,0.0,169.0,58.0
2021-02-28,0.0,0.0,29.0,788.0,2.0,57.0,86.0,336.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,37.0,85.0,0.0,173.0,58.0
2021-03-07,0.0,0.0,29.0,802.0,2.0,64.0,86.0,468.0,15.0,0.0,...,0.0,11.0,8.0,0.0,0.0,37.0,85.0,0.0,174.0,58.0
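
The order of the two fill steps matters for cumulative counts. A small sketch on a made-up frame (the values are hypothetical) shows why the forward fill comes before the zero fill:

```python
import numpy as np
import pandas as pd

# hypothetical cumulative counts after a top-N filter: NaN marks weeks
# in which the artist did not make the cut
wide = pd.DataFrame(
    {'A': [5.0, np.nan, 9.0], 'B': [np.nan, 4.0, np.nan]},
    index=pd.to_datetime(['2021-01-03', '2021-01-10', '2021-01-17']),
)

# carry the last known total forward, then start not-yet-seen artists at 0
filled = wide.ffill().fillna(0)
print(filled)
```

Filling with 0 first would make a running total drop back to zero in weeks where an artist is missing, which would show up as bars collapsing in the animation.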


In [19]:
bcr.bar_chart_race(df=weekly_artist_top_10, 
                   n_bars=10
                   #filename='medium__artists_og.mp4'
                  )



In [8]:
def initiate_chart(title):
    
    plt.rcParams['font.family'] = 'Helvetica'
    
    #initiate fig
    fig, ax = plt.subplots(figsize=(12,8), facecolor='white', dpi= 80)

    ax.margins(0, 0.01)
    ax.set_axisbelow(True)

    #ticks
    ax.grid(which='major', axis='x', linestyle='-', linewidth=0.2, color='dimgrey')
    ax.tick_params(axis='x', colors='dimgrey', labelsize=12, length=0)
    ax.tick_params(axis='y', colors='dimgrey', labelsize=12, length=0)

    for pos in ['top', 'bottom', 'right', 'left']:
        if pos == 'top':
            ax.spines[pos].set_edgecolor('dimgrey')
        else:
            ax.spines[pos].set_edgecolor('white')

    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    
    ax.set_title(title, fontsize=18, color='dimgrey')
    
    return fig, ax

In [None]:
%%time
# help(bcr.bar_chart_race)

palette = sns.color_palette("summer", 24).as_hex()
title = 'Spotify: The most-listened-to artists in 2021'

#initiate chart
fig, ax = initiate_chart(title)

bcr.bar_chart_race(df=weekly_artist_top_10, 
                   n_bars=10, 
                   fig=fig, 
                   period_length=400, 
                   cmap=palette, 
                   period_fmt='%b %-d, %Y',
                   filter_column_colors=True,
                   filename='medium__artists.mp4')
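
The `period_fmt` string above uses `%-d` (day of month without zero padding). This is a platform-specific `strftime` extension that works on Linux and macOS but not on Windows. As a small standard-library sketch, the helper below (the name `format_period` is hypothetical and not part of bar_chart_race) builds the same kind of label portably, which can help when checking what the period label should look like:

```python
from datetime import date


def format_period(d: date) -> str:
    """Format a date like 'Jan 3, 2021' without relying on '%-d'."""
    # %b is the abbreviated month name (locale-dependent);
    # d.day is a plain integer, so it carries no zero padding
    return f"{d:%b} {d.day}, {d:%Y}"


print(format_period(date(2021, 1, 3)))
```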

<span style="font-family:Helvetica Light">
    
## 3.2. Top Songs Bar Chart Race 

</span>

In [None]:
# based on playcount
# plays for less than 10 seconds are assumed song skips (for clearer results)

df_no_skips = df_no_skips.copy()
df_no_skips['artistTrackName'] = df_no_skips['artistName'] + ' - ' + df_no_skips['trackName']

# weekly play count per song, then the running total across weeks
weekly_song = df_no_skips.groupby([pd.Grouper(key='endTime', freq='W'),'artistTrackName'])['trackName'].size().reset_index()
weekly_song['no_csum'] = weekly_song.groupby(['artistTrackName'])['trackName'].cumsum()

# shortening the track name for clearer representation on the chart
weekly_song['artistTrackName'] = weekly_song['artistTrackName'].str.split('(').str[0]
weekly_song['artistTrackName'] = weekly_song['artistTrackName'].str.replace(' - ', ': \n', regex=False)

# keep only the top 10 songs in a given week
weekly_song_top_10 = weekly_song.set_index(['endTime', 'artistTrackName']).groupby(level=0, group_keys=False)['no_csum'].nlargest(10)

# reshape into the wide format expected by the bar_chart_race package
weekly_song_top_10 = weekly_song_top_10.unstack()
# carry cumulative counts forward, then use 0 before a song's first top-10 week
weekly_song_top_10 = weekly_song_top_10.ffill().fillna(0)


In [None]:
%%time
# help(bcr.bar_chart_race)

palette = sns.color_palette("summer_r", 12).as_hex()
title = 'Spotify: The most-listened-to songs in 2021'

#initiate chart
fig, ax = initiate_chart(title)

bcr.bar_chart_race(df=weekly_song_top_10, 
                   n_bars=10, 
                   fig=fig, 
                   period_length=400, 
                   cmap=palette, 
                   period_fmt='%b %-d, %Y',
                   filter_column_colors=True,
                   filename='medium__songs.mp4')
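
The label shortening used for the song chart can be tried in isolation. The sketch below is a slight variant of the step above: it also strips the space left behind by the split, and the `labels` Series is made up for illustration.

```python
import pandas as pd

# hypothetical labels in the 'artist - track' form used above
labels = pd.Series(['Shawn Mendes - Monster (Shawn Mendes & Justin Bieber)',
                    'Justin Bieber - Anyone'])

# drop everything from the first '(' onward, then trim the leftover space
short = labels.str.split('(').str[0].str.strip()

# put the artist and the track on separate lines for the bar label
short = short.str.replace(' - ', ':\n', regex=False)

print(short.tolist())
```

If two tracks differ only in their parenthetical suffix, they collapse to the same label, and duplicate labels within one week would make `unstack()` fail. Checking `short.duplicated().any()` before reshaping may be worthwhile.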