# Making My Own Spotify Wrapped!
When the yearly Spotify Wrapped drops, you get to see some statistics on your music and podcast streaming. We will explore the data behind Spotify Wrapped even further by digging into our own streaming history for the past year!

The tool we will be using for visualizing our data is plotly, a Python library which uses a friendly interface to make interactive and pretty graphs. As always when dealing with data we also need to do some data wrangling. For this we will primarily use the library pandas.


In [1]:
# Import libraries
import pandas as pd
import plotly.express as px
from datetime import datetime
from skimpy import skim

## Preparations

### Loading data

The data we are going to use is the streaming history, which goes exactly one year back. Since we are in February now, January is from 2023 while all other months are from 2022.

In [None]:
# Load and inspect data
filename = "/Users/monic/Downloads/MyData/StreamingHistory0.json"
df = pd.read_json(filename)
df.head(10)
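
A longer history may be split across several numbered files in the export. Here is a minimal sketch for reading them all at once, assuming the files follow the `StreamingHistory<N>.json` naming seen above. The helper name `load_streaming_history` and the sample records are made up for illustration, and a temporary folder stands in for the real export:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_streaming_history(folder):
    """Read every StreamingHistory*.json file in `folder` into one dataframe."""
    files = sorted(Path(folder).glob("StreamingHistory*.json"))
    frames = [pd.read_json(f) for f in files]
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample records, written to a temporary folder for the demo
sample_files = {
    "StreamingHistory0.json": [
        {"endTime": "2022-03-01 10:15", "artistName": "Artist A", "trackName": "Song 1", "msPlayed": 180000},
        {"endTime": "2022-03-01 10:19", "artistName": "Artist B", "trackName": "Song 2", "msPlayed": 200000},
    ],
    "StreamingHistory1.json": [
        {"endTime": "2022-04-02 08:00", "artistName": "Artist A", "trackName": "Song 3", "msPlayed": 60000},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    for name, records in sample_files.items():
        (Path(tmp) / name).write_text(json.dumps(records))
    history = load_streaming_history(tmp)

print(history.shape)
```

With a real export, the call would simply point at the unzipped `MyData` folder.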

In [None]:
skim(df)

In [None]:
# Whenever we load new-to-us data, it's good practice to check the data types of the imported columns. It is not unusual for something to have been imported in a silly format.

# Inspect the data types of the dataframe
df.info() # or df.dtypes

In [None]:
# Let's see what we have here. `int64` means integers, `object` means string (text). The dataframe is mostly correct from the start, except for the `endTime` column, which is currently a string. Since it's actually a date, we will convert it to the pandas datetime format.

# Update endTime to be of data type datetime (one vectorized call on the whole column)
df["endTime"] = pd.to_datetime(df["endTime"])
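
`pd.to_datetime` can also be told which format to expect and what to do with values it cannot parse. A small sketch on made-up rows, assuming timestamps in a `YYYY-MM-DD HH:MM` layout (check a few rows of your own file first). With `errors="coerce"`, unparseable values become `NaT` instead of raising an error:

```python
import pandas as pd

# Made-up rows; the last timestamp is deliberately broken
sample = pd.DataFrame({
    "endTime": ["2022-03-01 10:15", "2022-03-01 23:59", "not a date"],
    "msPlayed": [180000, 200000, 1000],
})

sample["endTime"] = pd.to_datetime(
    sample["endTime"], format="%Y-%m-%d %H:%M", errors="coerce"
)

print(sample.dtypes)
print(sample["endTime"].isna().sum())  # number of rows that failed to parse
```

Counting the `NaT` values afterwards is a quick way to see whether the assumed format really matches the data.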

In [None]:
df.head(10)

In [None]:
df.dtypes

In [None]:
## Intro to data visualization

# When visualizing data, what we can do depends entirely on what type of data we are dealing with. In this section we will go through some common ways of classifying data types and how we can visualize them.

# One variable, continuous
px.histogram(df.msPlayed) # or
px.histogram(df,x="msPlayed",nbins=100) # Milliseconds played; the count is the number of plays (rows) in each bin.

In [None]:
# One variable, categorical

# Count how many rows in the data there are for each artistName
listens_by_artists=df['artistName'].value_counts()
listens_by_artists

In [None]:
# Plot
px.bar(listens_by_artists)
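
With many artists, one bar per artist quickly becomes hard to read. A minimal sketch on made-up data showing how to keep only the most frequent categories before plotting. `value_counts` already sorts in descending order, so `head` is enough:

```python
import pandas as pd

# Made-up listening events
sample = pd.DataFrame({"artistName": ["A", "B", "A", "C", "A", "B", "D"]})

# Keep only the two most played artists
top_artists = sample["artistName"].value_counts().head(2)
print(top_artists)
```

The shortened series can be passed to `px.bar` in the same way as the full one.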

In [None]:
# Two variables, continuous
px.scatter(df,x="endTime",y="msPlayed",hover_data=['artistName'])


In [None]:

#*Additional suggestion* Wow, there are a lot of dots in the plot above. This is partly because on many days you listened to multiple tracks, giving you many rows in the dataset for that day. If we want, we can redo the same plot, but this time look at the total time you listened per day.

# Sum msPlayed by date
play_time_per_day = df.groupby(df["endTime"].dt.date)["msPlayed"].sum().reset_index() # Total milliseconds played per day.
# Plot two continuous variables just as we did above
px.scatter(play_time_per_day, x="endTime", y="msPlayed")
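
Grouping by date only returns days that have at least one stream. If days without any listening should show up as zero, `resample` is one option. This is a sketch on made-up data, offered as an alternative rather than the approach used in this notebook:

```python
import pandas as pd

# Made-up data with a gap: nothing was played on 2022-03-02
sample = pd.DataFrame({
    "endTime": pd.to_datetime(["2022-03-01 10:15", "2022-03-01 18:30", "2022-03-03 09:00"]),
    "msPlayed": [180000, 120000, 60000],
})

# Daily totals; days with no rows get a sum of 0
daily = sample.set_index("endTime")["msPlayed"].resample("D").sum().reset_index()
print(daily)
```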



In [None]:
# Two variables, continuous and categorical

px.bar(df[:20],x="artistName", y="msPlayed")


In [None]:
# *Additional suggestion* In the example above, we are plotting straight from the dataframe. Plotly recognizes this and makes one "sub-bar" per record in the table. Just like earlier, when plotting the distribution of one categorical variable, we might first want to do some aggregation here.

# For the same type of plot, we can easily show top 10 artists as well, if we first count how long each artist has been listened to over the year.

# Get my top ten artists by total playtime (sum of msPlayed, in milliseconds)
playtime_per_artist = df.groupby("artistName", as_index=False)\
  .agg(listeningTime=("msPlayed", "sum"))\
  .sort_values(by="listeningTime", ascending=False)\
  .head(10)



In [None]:
# Do the same plot type of two variables, one categorical and one continuous
px.bar(playtime_per_artist, x="artistName", y="listeningTime")
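
Ranking artists by number of plays and ranking them by total playtime can give different results, so it is worth being explicit about which one we aggregate. A small sketch on made-up data, where the column names `plays` and `msTotal` are chosen just for illustration:

```python
import pandas as pd

# Made-up data: artist A has many short plays, artist B one long play
sample = pd.DataFrame({
    "artistName": ["A", "A", "A", "B"],
    "trackName": ["s1", "s2", "s1", "s3"],
    "msPlayed": [30000, 20000, 10000, 240000],
})

per_artist = sample.groupby("artistName", as_index=False).agg(
    plays=("trackName", "count"),  # number of rows (listening events)
    msTotal=("msPlayed", "sum"),   # total playtime in milliseconds
)

by_plays = per_artist.sort_values("plays", ascending=False)
by_time = per_artist.sort_values("msTotal", ascending=False)
print(by_plays)
print(by_time)
```

Here artist A leads when counting plays, while artist B leads on total playtime.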

In [None]:
# Three variables, continuous and categorical
px.bar(df[:10],x="trackName",y="msPlayed",color="artistName")
# px.bar(playtime_per_artist,x="trackName",y="msPlayed",color="artistName")

As you can see, data type matters and there is a bit of a limit to how many variables we can explain in one go. To understand this better, we can look at this masterpiece within the data visualization field, where as many as 5 variables are visualized in the same chart!

[Link to Gapminder World Health Chart.](https://www.gapminder.org/fw/world-health-chart/) 

## Time Series additions

Time series data can be analyzed and visualized in many interesting ways. In order to do that, let's group the endTime column into month and week. That way we can later aggregate the data over these groupings.

In [None]:
# Extract month, month name and week
df["monthNumber"] = df["endTime"].dt.month
df["month"] = df["endTime"].dt.month_name()
df["week"] = df["endTime"].dt.isocalendar().week

# Convert milliseconds to minutes
df["minutesPlayed"] = df["msPlayed"] / 1000 / 60
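
One thing to keep in mind: `dt.isocalendar().week` follows ISO 8601, where a week belongs to the year that contains its Thursday. Days around New Year can therefore land in the last week of the previous ISO year. A short illustration with three dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-12-31", "2023-01-01", "2023-01-02"]))

# isocalendar() returns a dataframe with the columns year, week and day
iso = dates.dt.isocalendar()
print(iso)

# 2023-01-01 is a Sunday and still belongs to ISO week 52 of 2022;
# ISO week 1 of 2023 starts on Monday 2023-01-02.
```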

In [None]:
df.head()

## Total streaming per month

Let's start by aggregating the data by month and then displaying it as a line chart. This is a classic time series visualization: seeing what happened to a value over time.

In [None]:
# Group data per month and sum up the listening time
music_per_month=df.groupby(["monthNumber","month"],as_index=False)\
    .agg(totalTime=("minutesPlayed","sum")) # Grouping by monthNumber as well keeps the rows in calendar order (1-12) instead of alphabetical order.

In [None]:
# Inspect the grouped dataframe
music_per_month.sort_values(by="totalTime",ascending=False)

In [None]:
# Plot the total streaming time per month as a line graph
px.line(music_per_month,x="month",y="totalTime")
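
Since January comes from 2023 and the other months from 2022, sorting by month number puts January first on the x-axis even though it is the most recent month. If a strictly chronological axis is wanted, grouping by year and month together is one option. A sketch on made-up data, where the `yearMonth` column name is an assumption for illustration:

```python
import pandas as pd

# Made-up data spanning the turn of the year
sample = pd.DataFrame({
    "endTime": pd.to_datetime(["2022-02-10", "2022-12-24", "2023-01-05", "2023-01-20"]),
    "minutesPlayed": [30.0, 45.0, 10.0, 5.0],
})

# Monthly periods sort chronologically: 2022-02 < 2022-12 < 2023-01
sample["yearMonth"] = sample["endTime"].dt.to_period("M")
per_month = sample.groupby("yearMonth", as_index=False).agg(
    totalTime=("minutesPlayed", "sum")
)

# Convert to strings such as "2023-01" so the column can serve as axis labels
per_month["yearMonth"] = per_month["yearMonth"].astype(str)
print(per_month)
```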

This is informative and nice, but let's make the chart a bit more fun. Since our time series is exactly one year, like a full cycle, we can make use of the unusual but fun polar chart!

In [None]:
# As a polar chart
px.bar_polar(music_per_month,r="totalTime",theta="month") #check the limitations of this kind of chart.

Furthermore, we can group the data on weeks instead of months to get some more granular values.

In [None]:
df.dtypes

In [None]:
# Weekly instead of monthly

# 1. Aggregate data
music_per_week=df.groupby(['week'],as_index=False).agg(totalTime=("minutesPlayed","sum"))

# 2. Convert week from int to string
music_per_week["week"] = music_per_week["week"].astype(str) # week has dtype UInt32

# 3. Visualize it as a polar chart
px.bar_polar(music_per_week,r="totalTime",theta='week').show()

# 4. Add log_r=True to play with the scale
px.bar_polar(music_per_week, r='totalTime', theta='week', log_r=True).show()

## Flower chart displaying monthly listening time for my top 10 artists

Now it's time to make some truly pretty and fun graphs! Let's not only visualize how much we listened to music and podcasts, but also what we actually listened to! :)

For any categorical information, such as artists, it's unfortunately hard to include too many categories. We'll therefore start with only our streaming history of the top ten artists.

In [None]:
# Get my top ten artists
top_ten = df.groupby("artistName")\
  .agg(listeningEvents=("trackName", "count"))\
  .sort_values(by="listeningEvents", ascending=False)\
  .head(10) # Repeated plays of the same song each count as a separate listening event.
top_ten

In [None]:
# Group my streaming history by artist (in top 10) and month
music_per_month_and_artist = df[df["artistName"].isin(top_ten.index)].groupby(["artistName", "month"], as_index=False)\
    .agg(listeningTime=("msPlayed", "sum"))
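
The filtering step relies on `isin`, which builds a boolean mask that is `True` wherever the artist appears in the index of the top list. A tiny sketch on made-up data, where `top` is a stand-in for a dataframe indexed by artist name:

```python
import pandas as pd

sample = pd.DataFrame({
    "artistName": ["A", "B", "C", "A", "B"],
    "month": ["March", "March", "March", "April", "April"],
    "msPlayed": [1000, 2000, 3000, 4000, 5000],
})

# Stand-in for the top list: a dataframe indexed by artist name
top = pd.DataFrame(
    {"listeningEvents": [2, 2]},
    index=pd.Index(["A", "B"], name="artistName"),
)

mask = sample["artistName"].isin(top.index)  # True where the artist is in the top list
filtered = sample[mask]
print(filtered)
```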

In [None]:
music_per_month_and_artist

In [None]:
# Specify the order of the months
# Put January last if you truly want it in chronological order 
month_order = {
    "month": 
      ["January", 
      "February", 
      "March", 
      "April", 
      "May", 
      "June", 
      "July", 
      "August", 
      "September", 
      "October", 
      "November", 
      "December"]
}
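
If we do want January last, the list can be rotated instead of retyped. A minimal sketch using the standard library, assuming the history starts in February and that Python runs with its default (English) locale. The names `first_month` and `month_order_chronological` are made up for illustration:

```python
import calendar

# English month names in the default locale: "January" ... "December"
months = list(calendar.month_name)[1:]

# Rotate so the list starts in February and ends with January
first_month = 2  # assumption: the history starts in February
rotated = months[first_month - 1:] + months[:first_month - 1]

month_order_chronological = {"month": rotated}
print(month_order_chronological["month"])
```

The resulting dictionary has the same shape as `month_order` and can be passed to `category_orders` in its place.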

In [None]:
# Visualize as a rose/flower chart
fig = px.bar_polar(music_per_month_and_artist, 
    r='listeningTime', 
    theta='month', 
    log_r=True, 
    color="artistName", 
    barmode="group", 
    category_orders=month_order,
    color_discrete_sequence=px.colors.qualitative.Bold)
fig.show()

# check this https://plotly.com/python/discrete-color/

In [None]:
# Adjust background color
background_color = "white"

fig.update_layout(
    polar = dict(
        bgcolor = background_color,
        radialaxis = dict(showticklabels=False, ticks=''),
    ),
    paper_bgcolor=background_color,
)

In [None]:
# Remove axis
fig.update_polars(
    angularaxis_gridcolor=background_color, 
    radialaxis_gridcolor=background_color, 
    radialaxis_linewidth=0, 
    angularaxis_linewidth=0)

In [None]:
# Remove white lines around the "bars"
fig.update_traces(marker=dict(line=dict(width=0)))

## Flower chart displaying monthly listening time and general diversity of my streaming history

What we did above was visualize three variables: time, listening time and artist. The third variable was categorical. We can also use color to display a continuous variable. An interesting variable, which also doesn't exclude any artists, is simply the general diversity of the month. That is: how many unique artists/podcasts did we listen to during a certain month?

Again, we can easily group and calculate this, and then display it by passing it to the color argument of the chart.

In [None]:
# Group by week, aggregate total time and number of unique artists
music_per_week = df.groupby(["week"], as_index=False)\
  .agg(totalTime=("msPlayed", "sum"), nrArtists=("artistName", "nunique"))
music_per_week["week"] = music_per_week["week"].astype(str)
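
The `nunique` aggregation counts each artist once per group, no matter how many times they were played, which is what makes it a measure of diversity. A small sketch on made-up data comparing it with `count`, where the `nrPlays` column is added only for the comparison:

```python
import pandas as pd

sample = pd.DataFrame({
    "week": [1, 1, 1, 2],
    "artistName": ["A", "A", "B", "C"],
    "msPlayed": [1000, 2000, 3000, 4000],
})

per_week = sample.groupby("week", as_index=False).agg(
    totalTime=("msPlayed", "sum"),
    nrArtists=("artistName", "nunique"),  # distinct artists, repeats counted once
    nrPlays=("artistName", "count"),      # every row counted
)
print(per_week)
```

In week 1 there are three plays but only two distinct artists.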

In [None]:
# Make the polar chart
fig = px.bar_polar(music_per_week, 
    r='totalTime', 
    theta='week', 
    log_r=True, 
    color="nrArtists", 
    barmode="group", 
    category_orders={"week": [str(x) for x in range(1, 53)]},
    color_continuous_scale=['#E71A8F', '#242254'],
    labels={"nrArtists": "Number artists"})

fig.show()

Finally, let's make this one nice and pretty to post as a story on Instagram or Facebook, just like we normally do with our Spotify Wrapped!

In [None]:
# Set a new background color
background_color = "#dbc5c3"

In [None]:
# Adjust background, axis, bars and position the color scale
fig.update_layout(
    polar = dict(
        bgcolor = background_color,
        radialaxis = dict(showticklabels=False, ticks=''),
    ),
    paper_bgcolor=background_color,
    height=800,
    width=500,
)

fig.update_polars(angularaxis_gridcolor=background_color, 
    radialaxis_gridcolor=background_color, 
    radialaxis_linewidth=0, 
    angularaxis_linewidth=0)

fig.update_traces(marker=dict(line=dict(width=0)))

fig.update_layout(coloraxis=dict(colorbar=dict(orientation='h', y=0.05, dtick=10)))