My household listens to a lot of Taylor Swift music.
Inspired by this and her Eras tour, I decided to take a look at some of her song data available on Spotify.
In this notebook, I visualize the data with Plotly.
Plotly figures are interactive, but they do not render on GitHub, so I recommend viewing the notebook in nbviewer.

**Follow the link below to view the interactive plots!**

[nbviewer link](https://nbviewer.org/github/jaredcarter/data-science-portfolio/blob/main/spotify-data/Eras.ipynb?flush_cache=True)

First, we import the required packages.

In [23]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode

# Load plotly.js from a CDN so figures render inside the notebook
init_notebook_mode(connected=True)


Next, we need to import the data.
This data was pulled from Spotify using the [Get Spotify Playlist Data](Get%20Spotify%20Playlist%20Data.ipynb) notebook.

In [7]:
df = pd.read_csv("data/Taylor Swift.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Data columns (total 57 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   added_at                            254 non-null    object 
 1   is_local                            254 non-null    bool   
 2   primary_color                       0 non-null      float64
 3   added_by.external_urls.spotify      254 non-null    object 
 4   added_by.href                       254 non-null    object 
 5   added_by.id                         254 non-null    object 
 6   added_by.type                       254 non-null    object 
 7   added_by.uri                        254 non-null    object 
 8   track.album.album_type              254 non-null    object 
 9   track.album.artists                 254 non-null    object 
 10  track.album.available_markets       254 non-null    object 
 11  track.album.external_urls.spotify   254 non-n

The output above shows which columns of data are available to us.
Some of the less obvious column names are explained in [Spotify's API documentation](https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features).
The last bit of pre-processing for now is to convert the `track.album.release_date` column to a datetime type.

In [12]:
df['track.album.release_date'] = pd.to_datetime(df['track.album.release_date'])

Let's look at the album release dates in chronological order.
To achieve this, we will use a scatter plot.
The x-axis will be the release date, and the y-axis will show the average track popularity of each album.
The size of the marker will be proportional to the number of tracks on each album.

In [19]:
date_pop_data = (df.groupby('track.album.name')
                 # aggregate different columns differently
                 # find the median release date
                 .agg({'track.album.release_date': 'median',
                       # find the mean popularity
                       'track.popularity': 'mean',
                       # find the number of tracks
                       'track.id': 'count'})
                 .reset_index())

# Create figure using plotly
figure = px.scatter(date_pop_data,
    # x axis is the median release date
    x='track.album.release_date',
    # y axis is mean popularity
    y='track.popularity',
    # size is count of tracks
    size='track.id',
    # color by album name
    color='track.album.name',
    # replace raw column names with readable labels
    labels={'track.album.release_date': 'Release Date',
            'track.popularity': 'Average Popularity',
            'track.id': 'Count of Album Tracks',
            'track.album.name': 'Album Name'})
figure
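
The aggregation step above can be tried in isolation on a few made-up rows. The sketch below is only an illustration: the `toy` frame and its values are invented, and the output column names (`release_date`, `mean_popularity`, `track_count`) are hypothetical names chosen here, not names used elsewhere in this notebook. It uses pandas named aggregation, which gives each result column a descriptive name instead of reusing the source column name.

```python
import pandas as pd

# Made-up rows standing in for the Spotify export (not real data)
toy = pd.DataFrame({
    "track.album.name": ["A", "A", "B", "B", "B"],
    "track.album.release_date": pd.to_datetime(
        ["2010-01-01", "2010-01-01", "2020-06-01", "2020-06-01", "2020-06-01"]
    ),
    "track.popularity": [40, 60, 70, 80, 90],
    "track.id": ["a1", "a2", "b1", "b2", "b3"],
})

# One row per album: median release date, mean popularity, number of tracks
summary = (
    toy.groupby("track.album.name")
    .agg(
        release_date=("track.album.release_date", "median"),
        mean_popularity=("track.popularity", "mean"),
        track_count=("track.id", "count"),
    )
    .reset_index()
)
print(summary)
```

With named aggregation, the `labels` mapping passed to `px.scatter` would refer to these clearer names rather than to `track.id` for a track count.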

If you are viewing this notebook in nbviewer, feel free to hover over the data points to see more information. You can also click on the legend to show/hide particular albums.

A first impression from this figure is that Taylor Swift's recent albums are more popular than her older music, but that isn't entirely accurate.
Many loyal Taylor Swift fans prefer to stream [Taylor's Version](https://cherokeehighnews.com/2022/03/11/taylors-version-vs-the-original-whats-the-difference/) of her albums, for reasons described in the linked article.
This lowers the average popularity of the original albums that have a corresponding Taylor's Version.

Now I'm curious about the popularity of songs on albums that have a corresponding Taylor's Version.
This will be plotted using a split violin plot in Plotly.

In [44]:
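# Strip the " (Taylor's Version)" and " [Deluxe]" suffixes so both versions of an album share one group name
# (regex=False treats the parentheses and brackets literally on every pandas version)
df['track.album.group'] = (df['track.album.name']
                           .str.replace(" (Taylor's Version)", "", regex=False)
                           .str.replace(" [Deluxe]", "", regex=False))
# Proxy for ownership: treat anything released after 2018-11-01 as owned by Taylor
df['track.taylor_owns'] = df['track.album.release_date'] > '2018-11-01'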
# only look at albums with a Taylor and non-Taylor version
taylor_vs_non = df[df['track.album.group'].isin(['1989', 'Fearless', 'Red', 'Speak Now'])]

# Split the data once instead of repeating the boolean filter
owned = taylor_vs_non['track.taylor_owns']
originals = taylor_vs_non[~owned]
rerecorded = taylor_vs_non[owned]

fig = go.Figure()

# Left half of each violin: tracks Taylor does not own
fig.add_trace(go.Violin(x=originals['track.album.group'],
                        y=originals['track.popularity'],
                        text=originals['track.name'],
                        legendgroup='No', scalegroup='No', name='No', points='all', hoverinfo='text+y',
                        side='negative'))

# Right half of each violin: tracks Taylor owns
fig.add_trace(go.Violin(x=rerecorded['track.album.group'],
                        y=rerecorded['track.popularity'],
                        text=rerecorded['track.name'],
                        legendgroup='Yes', scalegroup='Yes', name='Yes', points='all', hoverinfo='text+y',
                        side='positive'))

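fig.update_layout(
    title_text="Popularity of Taylor Swift Songs by Album and Taylor's Version Status",
    # the traces are named 'No' and 'Yes', so give the legend a title that explains them
    legend_title_text="Owned by Taylor?")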

fig
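
To put numbers next to the violin plot, the same grouping can be summarized as a small table of mean popularity per album group and version. This is a sketch on made-up rows, not the notebook's data: the `toy` frame, the `version` column, and the `comparison` name are hypothetical and exist only for this illustration. It reuses the suffix-stripping and release-date cutoff from the cell above.

```python
import pandas as pd

# Made-up rows standing in for the Spotify export (popularity values are invented)
toy = pd.DataFrame({
    "track.album.name": [
        "Red", "Red",
        "Red (Taylor's Version)", "Red (Taylor's Version)",
        "1989 [Deluxe]", "1989 (Taylor's Version)",
    ],
    "track.album.release_date": pd.to_datetime([
        "2012-10-22", "2012-10-22",
        "2021-11-12", "2021-11-12",
        "2014-10-27", "2023-10-27",
    ]),
    "track.popularity": [50, 60, 80, 90, 70, 85],
})

# Literal (non-regex) replacement so "(", ")", "[" and "]" are not treated as regex syntax
toy["track.album.group"] = (
    toy["track.album.name"]
    .str.replace(" (Taylor's Version)", "", regex=False)
    .str.replace(" [Deluxe]", "", regex=False)
)

# Same release-date cutoff as above, mapped to readable labels
is_owned = toy["track.album.release_date"] > pd.Timestamp("2018-11-01")
toy["version"] = is_owned.map({True: "Taylor's Version", False: "Original"})

# Rows: album group, columns: version, values: mean popularity
comparison = (
    toy.groupby(["track.album.group", "version"])["track.popularity"]
    .mean()
    .unstack("version")
)
print(comparison)
```

Running the same steps on `taylor_vs_non` would show, per album, how far apart the two versions sit on average.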

Finally, a box plot shows the distribution of track popularity for every album.

In [11]:
figure = px.box(df, x="track.album.name", y="track.popularity", points='all', hover_name='track.name')
figure
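
By default the boxes appear in the order the albums occur in the data. One possible refinement, sketched here on a few made-up rows, is to compute the album order from the release dates. The `toy` frame and the `album_order` name are assumptions made for this illustration only.

```python
import pandas as pd

# A few made-up rows; only the album name and release date matter here
toy = pd.DataFrame({
    "track.album.name": ["Lover", "Fearless", "Lover", "Red", "Fearless"],
    "track.album.release_date": pd.to_datetime(
        ["2019-08-23", "2008-11-11", "2019-08-23", "2012-10-22", "2008-11-11"]
    ),
})

# Earliest release date per album, sorted oldest to newest
album_order = (
    toy.groupby("track.album.name")["track.album.release_date"]
    .min()
    .sort_values()
    .index
    .tolist()
)
print(album_order)  # ['Fearless', 'Red', 'Lover']
```

A list like this can be passed to `px.box` through its `category_orders` argument, for example `category_orders={"track.album.name": album_order}`, so that the x-axis reads chronologically.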