# Making My Own Spotify Wrapped!
When the yearly Spotify Wrapped drops, you get to see some statistics on your music and podcast streaming. We will explore the data underlying Spotify Wrapped further by digging into our own streaming history for the past year!

The tool we will be using for visualizing our data is plotly, a Python library with a friendly interface for making pretty, interactive graphs. As always when dealing with data, we also need to do some data wrangling. For this we will primarily use the library pandas.


In [1]:
# Import libraries
import sys
import os
import pandas as pd
import plotly.express as px
from datetime import datetime
from skimpy import skim
from Spotify.data import Data_export


## Preparations

### Loading data

The data we are going to use is the streaming history, which goes back about one year. Since we are in February now, January is from 2023 but all other months are from 2022.

In [2]:
df= Data_export.get_historical_data()
df

Unnamed: 0,endTime,artistName,trackName,msPlayed
0,2022-01-31 13:13,La Habitación Roja,No Estuviste Allí,192803
1,2022-02-02 06:27,Toña La Negra,Azul,164500
2,2022-02-02 16:58,Deforme Semanal Ideal Total,La juventud,1339390
3,2022-02-03 07:02,Estirando el chicle,ELIGE TUS BATALLAS con SARA SÁLAMO | Estirando...,267190
4,2022-02-03 12:31,Estirando el chicle,ELIGE TUS BATALLAS con SARA SÁLAMO | Estirando...,2448688
...,...,...,...,...
7552,2023-02-02 17:57,Funambulista,Dos Mares y una Mirada,184933
7553,2023-02-02 18:06,Arde Bogotá,Sin Vergüenza,216325
7554,2023-02-02 18:10,IZAL,Pausa,26569
7555,2023-02-02 18:10,La Maravillosa Orquesta del Alcohol,Catedrales,199760
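
The `Data_export` helper is specific to this project and its code is not shown here. As a rough sketch of what such a loader might do, the snippet below assumes the Spotify export is a folder of JSON files whose names start with `StreamingHistory` and whose records carry the four columns seen above. The function name `load_streaming_history` and the file pattern are assumptions for illustration, not the actual implementation.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

COLUMNS = ["endTime", "artistName", "trackName", "msPlayed"]


def load_streaming_history(folder):
    """Hypothetical loader: combine all StreamingHistory*.json files in `folder`."""
    frames = [
        pd.DataFrame(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(folder).glob("StreamingHistory*.json"))
    ]
    if not frames:
        # No files found: return an empty frame with the expected columns
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)[COLUMNS]


# Tiny demonstration: one row copied from the output above, written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    rows = [
        {"endTime": "2022-02-02 06:27", "artistName": "Toña La Negra",
         "trackName": "Azul", "msPlayed": 164500},
    ]
    Path(folder, "StreamingHistory0.json").write_text(json.dumps(rows), encoding="utf-8")
    demo = load_streaming_history(folder)
```

Sorting the paths keeps the files in a stable order before they are concatenated.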


### Inspecting the data

In [3]:
skim(df)

In [4]:
df.info() # or df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7557 entries, 0 to 7556
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   endTime     7557 non-null   object
 1   artistName  7557 non-null   object
 2   trackName   7557 non-null   object
 3   msPlayed    7557 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 236.3+ KB


### Updating `endTime` to be of data type `datetime`

In [5]:
df["endTime"] = pd.to_datetime(df["endTime"])

In [None]:
df.head(10)

In [7]:
df.dtypes

endTime       datetime64[ns]
artistName            object
trackName             object
msPlayed               int64
dtype: object
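
As a small illustration of the conversion, the sketch below parses two timestamps copied from the table above. Passing an explicit `format` is optional; it is used here on the assumption that every `endTime` value follows the `YYYY-MM-DD HH:MM` pattern seen in the data.

```python
import pandas as pd

# Two timestamps copied from the streaming history shown earlier
sample = pd.DataFrame({"endTime": ["2022-01-31 13:13", "2022-02-02 06:27"]})

# Vectorized conversion of the whole column; the format string is an assumption
sample["endTime"] = pd.to_datetime(sample["endTime"], format="%Y-%m-%d %H:%M")

# Once the column is a datetime, the .dt accessor exposes its parts
first_hour = sample["endTime"].dt.hour.iloc[0]
```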

## Data visualization

- One variable: continuous

In [14]:
px.histogram(df.msPlayed) # or
px.histogram(df,x="msPlayed",nbins=100, title="Milliseconds played") # x: milliseconds played; count: number of rows (streams) in each bin

- One variable: categorical

In [None]:
# How many rows in the data there are for each artistName
listens_by_artists=df['artistName'].value_counts()
listens_by_artists

In [15]:
px.bar(listens_by_artists,title="How many times I listened to each artist")
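
If the history contains many distinct artists, the bar chart can get crowded. Since `value_counts()` returns its result sorted from most to least frequent, one option is to keep only the first few entries before plotting. The sketch below uses a made-up series of artist names to show the idea.

```python
import pandas as pd

# Made-up listening events, one artist name per row
plays = pd.Series(
    ["IZAL", "Funambulista", "IZAL", "Arde Bogotá", "IZAL", "Funambulista"],
    name="artistName",
)

# value_counts() sorts in descending order, so head(n) gives the n most played artists
counts = plays.value_counts()
top_two = counts.head(2)
```

The resulting series can be passed to `px.bar` in the same way as the full one.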

- Two variables: continuous

In [16]:
# When we listened to each artist, and for how many milliseconds.
px.scatter(df,x="endTime",y="msPlayed",hover_data=['artistName'],title="When each artist was listened to and for how long")


Many dots are clustered together, probably because I listened to several tracks on the same day. Since this is the case, we will calculate how many milliseconds I listened per day instead.

In [17]:
# Sum the total msPlayed by date
play_time_per_day = df.groupby(df.endTime.dt.date)["msPlayed"].sum().reset_index() # Total milliseconds played per day.
# Plot two continuous variables just as we did above
px.scatter(play_time_per_day, x="endTime", y="msPlayed",title="Total milliseconds played per day")
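
To make the per-day aggregation concrete, here is a minimal sketch on three made-up rows (the first two values are copied from the table at the top). Grouping by `dt.date` collapses all streams from the same calendar day into one row.

```python
import pandas as pd

plays = pd.DataFrame({
    "endTime": pd.to_datetime(
        ["2022-02-03 07:02", "2022-02-03 12:31", "2022-02-04 09:00"]
    ),
    "msPlayed": [267190, 2448688, 1000],
})

# Group by the calendar date and sum only the msPlayed column
per_day = plays.groupby(plays["endTime"].dt.date)["msPlayed"].sum().reset_index()
```

The two streams on 2022-02-03 end up as a single row holding their combined playtime.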

- Two variables: continuous and categorical

In [19]:
px.bar(df[:20],x="artistName", y="msPlayed",title="Milliseconds played by artist")


In the example above, we are plotting straight from the dataframe. Plotly recognizes this and makes one "sub-bar" per record in the table. Just like earlier, when plotting the distribution of one categorical variable, we might first want to do some aggregation.

For the same type of plot, we can easily show top 10 artists as well, if we first count **how long each artist has been listened to over the year.**

In [20]:
# Get my top ten artists by total playtime (sum of msPlayed, in milliseconds)
playtime_per_artist = df.groupby("artistName", as_index=False)\
  .agg(listeningTime=("msPlayed", "sum"))\
  .sort_values(by="listeningTime", ascending=False)\
  .head(10)



In [22]:
# Do the same plot type of two variables, one categorical and one continuous
px.bar(playtime_per_artist, x="artistName", y="listeningTime",title="How long I listened to my 10 favourite artists")
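
Counting plays and summing playtime can produce different rankings, which matters when the question is how long an artist was listened to. The sketch below uses made-up rows to compare the two with pandas named aggregation: a single long podcast episode outweighs three short songs in total time while having fewer plays.

```python
import pandas as pd

history = pd.DataFrame({
    "artistName": ["Podcast", "Band", "Band", "Band"],
    "trackName": ["Episode 1", "Song A", "Song B", "Song A"],
    "msPlayed": [2400000, 180000, 200000, 180000],
})

# Named aggregation: one output column per (input column, function) pair
summary = history.groupby("artistName", as_index=False).agg(
    plays=("trackName", "count"),      # number of listening events
    listeningTime=("msPlayed", "sum"), # total milliseconds played
)

most_played = summary.sort_values("plays", ascending=False).iloc[0]["artistName"]
longest = summary.sort_values("listeningTime", ascending=False).iloc[0]["artistName"]
```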

In [23]:
# Three variables: one continuous and two categorical
px.bar(df[:10],x="trackName",y="msPlayed",color="artistName")
# px.bar(playtime_per_artist,x="trackName",y="msPlayed",color="artistName")

As you can see, data type matters and there is a bit of a limit to how many variables we can explain in one go. To understand this better, we can look at this masterpiece within the data visualization field, where as many as 5 variables are visualized in the same chart!

[Link to Gapminder World Health Chart.](https://www.gapminder.org/fw/world-health-chart/) 

## Time Series additions

Time series data can be analyzed and visualized in many interesting ways. In order to do that, we group the endTime column into month and week. That way we can later aggregate the data over these groupings.

In [24]:
# Extract month, month name and week
df["monthNumber"] = df["endTime"].dt.month
df["month"] = df["endTime"].dt.month_name()
df["week"] = df["endTime"].dt.isocalendar().week

# Convert milliseconds to minutes
df["minutesPlayed"] = df["msPlayed"] / 1000 / 60
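
One detail worth keeping in mind: `dt.isocalendar().week` returns ISO week numbers, which do not carry the year. The sketch below, on three hand-picked timestamps, shows that days from different years can share a week number, and that 1 January can belong to the last ISO week of the previous year. The extra `isoYear` column is only added here for illustration.

```python
import pandas as pd

stamps = pd.Series(
    pd.to_datetime(["2022-01-31 13:13", "2023-01-01 10:00", "2023-02-02 18:10"])
)

iso = stamps.dt.isocalendar()  # DataFrame with year, week and day columns
parts = pd.DataFrame({
    "monthNumber": stamps.dt.month,
    "month": stamps.dt.month_name(),
    "week": iso["week"],
    "isoYear": iso["year"],
})
```

When a history spans slightly more than a year, grouping on `week` alone therefore merges weeks that share a number across years.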

In [None]:
df.head()

## Total streaming per month

Let's start by aggregating the data by month and then displaying it as a line chart. This is a classic time series visualization: seeing what happened to a value over time.

In [26]:
# Group data per month and sum up the listening time.
# monthNumber is part of the grouping so the months come out in calendar order rather than alphabetically.
music_per_month=df.groupby(["monthNumber","month"],as_index=False)\
    .agg(totalTime=("minutesPlayed","sum"))

In [27]:
# Inspect the grouped dataframe
music_per_month.sort_values(by="totalTime",ascending=False)

Unnamed: 0,monthNumber,month,totalTime
7,8,August,3189.807717
6,7,July,3033.227033
10,11,November,2929.881083
8,9,September,2596.8021
11,12,December,2577.925033
5,6,June,2499.49975
3,4,April,2421.033233
4,5,May,2328.201117
1,2,February,2086.1602
9,10,October,1981.5522


In [28]:
# Plot the total streaming time per month as a line graph
px.line(music_per_month,x="month",y="totalTime",title="Total streaming time per month")
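
Grouping by month name or number puts every January in the same bucket, regardless of year. If the history touches the same month in two different years, an alternative is to group by a year-month period. The sketch below shows this on made-up rows; it is a variation on the approach above, not a replacement for it.

```python
import pandas as pd

history = pd.DataFrame({
    "endTime": pd.to_datetime(
        ["2022-01-31 13:13", "2022-02-02 06:27", "2023-01-15 08:00"]
    ),
    "minutesPlayed": [3.0, 2.5, 4.0],
})

# to_period("M") keeps both year and month, so January 2022 and January 2023 stay apart
per_period = (
    history.groupby(history["endTime"].dt.to_period("M"))["minutesPlayed"]
    .sum()
    .reset_index()
)

# Convert the periods to strings such as "2022-01" so they can be used as axis labels
per_period["endTime"] = per_period["endTime"].astype(str)
```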

Since our time series is exactly one year, like a full cycle, we can make use of the polar chart!

In [29]:
# As a polar chart
px.bar_polar(music_per_month,r="totalTime",theta="month") #check the limitations of this kind of chart.

Furthermore, we can group the data on weeks instead of months to get some more granular values.

In [30]:
df.dtypes

endTime          datetime64[ns]
artistName               object
trackName                object
msPlayed                  int64
monthNumber               int64
month                    object
week                     UInt32
minutesPlayed           float64
dtype: object

In [32]:
# Weekly instead of monthly

# 1. Aggregate data
music_per_week=df.groupby(['week'],as_index=False).agg(totalTime=("minutesPlayed","sum"))

# 2. Convert week from int to string
music_per_week["week"] = music_per_week["week"].astype(str) # week has dtype UInt32

# 3. Visualize it as a polar chart
# px.bar_polar(music_per_week,r="totalTime",theta='week').show() 

# 4. Add log_r=True to play with the scale
px.bar_polar(music_per_week, r='totalTime', theta='week', log_r=True).show() 

## Flower chart displaying monthly listening time for my top 10 artists

Let's not only visualize how much we listened to music and podcasts, but also what we actually listened to.

For any categorical information, such as artists, it's unfortunately hard to include too many categories. We'll therefore start with only our streaming history of the top ten artists.

In [None]:
# Get my top ten artists by number of listening events
# (repeated plays of the same song each count as a separate event)
top_ten = df.groupby("artistName")\
  .agg(listeningEvents=("trackName", "count"))\
  .sort_values(by="listeningEvents", ascending=False)\
  .head(10)
top_ten

In [34]:
# Group my streaming history by artist (in top 10) and month
music_per_month_and_artist = df[df["artistName"].isin(top_ten.index)].groupby(["artistName", "month"], as_index=False)\
    .agg(listeningTime=("msPlayed", "sum"))

In [None]:
music_per_month_and_artist
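
The filtering step relies on the artist names being the index of the top-ten table, so that `isin` can match them against the `artistName` column. A minimal sketch of the same pattern on made-up rows, keeping only the single most played artist:

```python
import pandas as pd

history = pd.DataFrame({
    "artistName": ["IZAL", "IZAL", "Funambulista", "IZAL"],
    "month": ["February", "March", "March", "March"],
    "msPlayed": [1000, 2000, 3000, 4000],
})

# Without as_index=False, the grouped column becomes the index of the result
top = (
    history.groupby("artistName")
    .agg(listeningEvents=("msPlayed", "count"))
    .sort_values(by="listeningEvents", ascending=False)
    .head(1)
)

# Keep only rows whose artist appears in that index, then aggregate per artist and month
kept = history[history["artistName"].isin(top.index)]
per_month = kept.groupby(["artistName", "month"], as_index=False).agg(
    listeningTime=("msPlayed", "sum")
)
```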

In [36]:
# Specify the order of the months
# I put January last since I want it in chronological order and January belongs to 2023.
month_order = {
    "month": 
      ["February", 
      "March", 
      "April", 
      "May", 
      "June", 
      "July", 
      "August", 
      "September", 
      "October", 
      "November", 
      "December",
      "January"]
}
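
The same ordering can also be derived rather than typed out. The sketch below rotates the standard library's month names so that the list starts at a chosen month. It assumes the default locale, where `calendar.month_name` yields English names matching those produced by `dt.month_name()`.

```python
import calendar

# calendar.month_name has an empty string at index 0, so skip it
months = list(calendar.month_name)[1:]

# Rotate the list so it starts at the first month of the streaming history
first_month = "February"
start = months.index(first_month)
rotated_month_order = {"month": months[start:] + months[:start]}
```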

In [40]:
# Visualize as a rose/flower chart
fig = px.bar_polar(music_per_month_and_artist, 
    r='listeningTime', 
    theta='month', 
    log_r=True, 
    color="artistName", 
    barmode="group", 
    category_orders=month_order,
    color_discrete_sequence=px.colors.qualitative.Pastel) # see https://plotly.com/python/discrete-color/ for other color sequences
fig.show()



In [41]:
# Adjust background color
background_color = "white"

fig.update_layout(
    polar = dict(
        bgcolor = background_color,
        radialaxis = dict(showticklabels=False, ticks=''),
    ),
    paper_bgcolor=background_color,
)

In [42]:
# Remove axis
fig.update_polars(
    angularaxis_gridcolor=background_color, 
    radialaxis_gridcolor=background_color, 
    radialaxis_linewidth=0, 
    angularaxis_linewidth=0)

In [43]:
# Remove white lines around the "bars"
fig.update_traces(marker=dict(line=dict(width=0)))

## Flower chart displaying weekly listening time and general diversity of my streaming history

What we did above was visualize three variables: time, listening time and artist. The third variable was categorical. We can also use color to display a continuous variable. An interesting variable, which also doesn't exclude any artists, is simply the general diversity of a period. That is: how many unique artists/podcasts did we listen to during a certain week?

Again, we can easily group and calculate this, and then display it by passing it to the color argument of the chart.

In [44]:
# Group by week, aggregate total time and number of unique artists
music_per_week = df.groupby(["week"], as_index=False)\
  .agg(totalTime=("msPlayed", "sum"), nrArtists=("artistName", "nunique"))
music_per_week["week"] = music_per_week["week"].astype(str)
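
To see what `nunique` contributes, here is a minimal sketch on made-up rows. Within each week it counts distinct artist names, so repeated plays of the same artist add to the total time but not to the diversity measure.

```python
import pandas as pd

history = pd.DataFrame({
    "week": [5, 5, 5, 6],
    "artistName": ["IZAL", "IZAL", "Funambulista", "IZAL"],
    "msPlayed": [1000, 2000, 3000, 4000],
})

per_week = history.groupby("week", as_index=False).agg(
    totalTime=("msPlayed", "sum"),        # total milliseconds in the week
    nrArtists=("artistName", "nunique"),  # distinct artists in the week
)
```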

In [45]:
# Make the polar chart
fig = px.bar_polar(music_per_week, 
    r='totalTime', 
    theta='week', 
    log_r=True, 
    color="nrArtists", 
    barmode="group", 
    category_orders={"week": [str(x) for x in range(1, 53)]},
    color_continuous_scale=['#E71A8F', '#242254'],
    labels={"nrArtists": "Number of artists"})

fig.show()

Finally, let's make this one nice and pretty to post as a story on Instagram or Facebook, just like we normally do with our Spotify Wrapped!

In [46]:
# Set a new background color
background_color = "#deddd9"

In [47]:
# Adjust background, axis, bars and position the color scale
fig.update_layout(
    polar = dict(
        bgcolor = background_color,
        radialaxis = dict(showticklabels=False, ticks=''),
    ),
    paper_bgcolor=background_color,
    height=800,
    width=500,
)

fig.update_polars(angularaxis_gridcolor=background_color, 
    radialaxis_gridcolor=background_color, 
    radialaxis_linewidth=0, 
    angularaxis_linewidth=0)

fig.update_traces(marker=dict(line=dict(width=0)))

fig.update_layout(coloraxis=dict(colorbar=dict(orientation='h', y=0.05, dtick=10)))