spotify_exploratory_data_analysis.qmd

---
title: "Spotify Exploratory Data Analysis - Streaming History"
format: html
---

## Introduction

This is a series of exploratory data analysis (EDA) projects on my Spotify data. The data was downloaded from my Spotify account on July 23rd, 2023. The data is downloaded as a zip file containing several json files and saved on my personal google drive. The json files are then converted into tibbles for analysis using the `jsonlite` package.

This quarto document is the first of several EDA projects. This project focuses on my streaming history. I'm interested in exploring my listening habits across the time period of the data. I'm also interested in exploring my listening habits across the days of the week.

This process is documented in the following sections:

-   Setup and Configuration: Loading packages and googledrive API access

-   Data Loading: How to download and load the data?

-   Data Tidying: Get a tidy dataset

-   Data Cleaning: Ensure variables are in correct formats

-   Data Exploration: Answer one question and come up with two extra ones

Let's start exploring!

## Setup and Configuration

First, let's load in the packages we'll need for this project and authorise access to my google drive.

```{r}
### "Tidyverse"-oriented packages:

# The tidyverse is a collection of R packages designed for data science.
# All packages share a similar design philosophy, grammar, and data structures.
# Tidyverse includes packages such as:
# ggplot2, dplyr, tidyr, readr, purr, tibble, stringr, lubridate, and forcats.
### https://www.tidyverse.org/
library(tidyverse)

# To easily create data visualisations with simple and consistent syntax and grammar.
# https://ggplot2.tidyverse.org/index.html
library(ggplot2)

# To allow interaction between files on Google Drive and R.
# https://googledrive.tidyverse.org/
library(googledrive)


### Other Packages:
# To easily create summary statistics to understand and explore data.
# https://docs.ropensci.org/skimr/
library(skimr)

# A fast JSON parser and generator.
### https://cran.r-project.org/web/packages/jsonlite/index.html
library(jsonlite)

# To easily enable file referencing in project-oriented workflows.
# https://here.r-lib.org/
library(here)

# To easily format and scale data in visualisations.
# https://scales.r-lib.org/
library(scales)


# Google Drive Authentication --------------------------------------------------

# To establish a connection between a Google Drive account and R.
drive_auth()

# Example of how to download from Google Drive
# drive_download(
#   # Where to download file from
#   "https://drive.google.com/file/d/1Fjq1r6016H4isB2Cx2wg-Xm9zY7lHhYV/view?usp=drive_link",
# 
#   # Where to save it locally
#   path = here("foldertest", "text2")
#   )

```

## Data Loading

To access the data, I need to download it from my google drive. The data is requested from Johann's Spotify account and downloaded as a zip file containing several json files. There are several different json files; however, for this analysis I'm only interested in the Streaming History files.

You will only have access if Johann has given you read access to the email you authorised in 0-00_setup_and_configuration.R.

```{r}
# Only download raw data if it hasn't already been downloaded
if(!dir.exists(here("raw_data"))) {
  dir.create(here("raw_data"), showWarnings = FALSE)

  # List contents of Spotify Analysis Folder
  spotify_dribble <- drive_ls("Spotify Analysis")
  
  # Download raw data
  map2(
    spotify_dribble$id,
    spotify_dribble$name,
    ~ drive_download(
      file = as_id(.x),
      path = here("raw_data", .y),
      overwrite = TRUE
    )
  )
}


# Read in individual raw json as nested lists
# JRAW = RAW JSON
# RAW_JSON causes alphabetical ordering inconveniences in R environment.
JRAW_STREAMING_HISTORY_0 <- read_json(
  path = here(
    "raw_data",
    "StreamingHistory0.json"
  )
)

JRAW_STREAMING_HISTORY_1 <- read_json(
  path = here(
    "raw_data",
    "StreamingHistory1.json"
  )
)

JRAW_STREAMING_HISTORY_2 <- read_json(
  path = here(
    "raw_data",
    "StreamingHistory2.json"
  )
)
```

## Data Tidying

These json files are then converted into tibbles for analysis using the `jsonlite` package. The tibbles are then combined into one tibble, as they all have the same columns. I suspect the reason why there are different files is because of the size of the data.

```{r}
RAW_STREAMING_HISTORY_0 <- JRAW_STREAMING_HISTORY_0 %>% 
  bind_rows() %>% 
  as_tibble()

RAW_STREAMING_HISTORY_1 <- JRAW_STREAMING_HISTORY_1 %>% 
  bind_rows() |> 
  as_tibble()

RAW_STREAMING_HISTORY_2 <- JRAW_STREAMING_HISTORY_2 %>% 
  bind_rows() |> 
  as_tibble()

# Combine all streaming history tibbles into one tibble
RAW_STREAMING_HISTORY <- bind_rows(
  RAW_STREAMING_HISTORY_0,
  RAW_STREAMING_HISTORY_1,
  RAW_STREAMING_HISTORY_2
)
```

## Data Cleaning
Let's ensure the variables are in the correct format.

```{r}
CLEANED_STREAMING_HISTORY <- RAW_STREAMING_HISTORY |> 
  mutate(
    # Convert ms to minutes
    min_played = as.numeric(msPlayed / 60000),
    
    # Convert artistName to factor
    artist_name = as.factor(artistName),
    
    track_name = as.character(trackName),
    
    # Convert endTime into lubridate datetime
    streaming_datetime = as_date(endTime, format = "%Y-%m-%d %H:%M")
  ) |> 
  
  # Remove unnecessary columns
  select(
    artist_name,
    track_name,
    streaming_datetime,
    min_played
  )
```

## Data Exploration
This data exploration has two objectives:
1.    To get a sense of the data and to see if there are any issues with the data.
2.    To answer several questions that I have about my listening habits.

### Sanity Checks

There are `r nrow(CLEANED_STREAMING_HISTORY)` rows in the CLEANED_STREAMING_HISTORY tibble, which is the number of songs/podcast episodes that I have listened to between `r min(CLEANED_STREAMING_HISTORY$streaming_datetime)` and `r max(CLEANED_STREAMING_HISTORY$streaming_datetime)`. Let's use the function `skim()` from the skimr package to get a sense check of the data.

```{r}
CLEANED_STREAMING_HISTORY |> 
  skim()
```

There are `r ncol(CLEANED_STREAMING_HISTORY)` columns in the CLEANED_STREAMING_HISTORY tibble. There are `r distinct(CLEANED_STREAMING_HISTORY, artist_name) |> nrow()` unique artists and `r distinct(CLEANED_STREAMING_HISTORY, track_name) |> nrow()` unique tracks in the CLEANED_STREAMING_HISTORY tibble. It is interesting that the shortest `track_name` has a length of `r  min(str_length(CLEANED_STREAMING_HISTORY$track_name))` characters and the longest `track_name` has a length of `r max(str_length(CLEANED_STREAMING_HISTORY$track_name))` characters. Interestingly, the shortest `track_name` has a length of `r min(str_length(CLEANED_STREAMING_HISTORY$artist_name))` characters. I wonder what song that is. The date ranges between `r min(CLEANED_STREAMING_HISTORY$streaming_datetime)` and `r max(CLEANED_STREAMING_HISTORY$streaming_datetime)`.

It seems like the data mostly makes sense and that there are a wide range of song names and artist names.

### Reshape Data: Streaming per day
Let's reshape the data so that we can see how much I have streamed per day.

```{r}
STREAMING_HISTORY_PER_DAY <- CLEANED_STREAMING_HISTORY |> 
  group_by(streaming_datetime) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  )
STREAMING_HISTORY_PER_DAY
```

### What were the top 5 days I listened to music?

Let's now investigate what the top 5 days I listened to music were and include the day of the week.

```{r}
TOP_SONGS <- STREAMING_HISTORY_PER_DAY |>
  mutate(
    day_of_week = wday(streaming_datetime, label = TRUE)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  head(5)
TOP_SONGS
```

It seems like `r TOP_SONGS |> slice(1) |> pull(streaming_datetime)` and `r TOP_SONGS |> slice(2) |> pull(streaming_datetime)` were two days when I listened to a LOT of music.

Let's pull it back and look at the aggregate again; I wonder what the most listened to days are?

```{r}
STREAMING_HISTORY_PER_DAY |> 
  mutate(
    day_of_week = wday(streaming_datetime, label = TRUE)
    ) |> 
  group_by(day_of_week) |>
  summarise(
    total_hours_played = sum(total_hours_played)
  ) |>
  arrange(desc(total_hours_played))
```

Surprisingly, it seems like Mondays are the days where I have listened to the most streamed music. I wonder if this is because I listen to music on my commute to work? Although, I don't think I was really working consistently in 2022-23.

So potentially this is because I listen to music when I was studying? To answer this question and gain more insights, I would need to look at my calendar and see what I was doing on those days.

### How did my streaming time vary by day?

Let's plot the total hours played per day.

```{r}
GGPLOT_HOURS_PLAYED_PER_DAY <- STREAMING_HISTORY_PER_DAY |> 
  ggplot(aes(x = streaming_datetime, y = total_hours_played)) +
  
  geom_point() +
  geom_line() +
  
  labs(
    x = "",
    y = "Hours Played",
    title = "Hours Played Per Day",
    subtitle = "Spotify Streaming History"
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(
      size = 20,
      face = "bold"
    ),
    plot.subtitle = element_text(
      size = 15
    ),
    axis.title = element_text(
      size = 15
    ),
    axis.text = element_text(
      size = 10
    )
  )

GGPLOT_HOURS_PLAYED_PER_DAY
```

There is a high fluctuation in the number of hours played per day with some days, when very little music was played and some days were a lot of music was played. It seems that there are two days in particular, where I have listened to a lot of music. Let's investigate these days further, we know that the days are:  `r TOP_SONGS |> slice(1) |> pull(streaming_datetime)` and `r TOP_SONGS |> slice(2) |> pull(streaming_datetime)`. What did I do on these two days? Let's also include a smoothed line.

```{r}
GGPLOT_HOURS_PLAYED_PER_DAY +
  
  geom_point(aes(
    colour = ifelse(
      streaming_datetime == as.Date("2023-05-30") | 
        streaming_datetime == as.Date("2023-02-18"),
      "red",
      "darkgrey"
    )
    )
  ) +
  geom_line(colour = "darkgrey") +
  geom_smooth() +
  geom_label(
    label = "Flying to Australia",
    x = as.Date("2023-05-30"),
    y = STREAMING_HISTORY_PER_DAY |> 
      filter(streaming_datetime == as.Date("2023-05-30")) |> 
      pull(total_hours_played),
    vjust = -0.5
  ) +
  geom_label(
    label = "Flying to Austria",
    x = as.Date("2023-02-18"),
    y = STREAMING_HISTORY_PER_DAY |> 
      filter(streaming_datetime == as.Date("2023-02-18")) |> 
      pull(total_hours_played),
    vjust = -0.5
  ) +
    expand_limits(
    y = c(0, 20)
  ) +
  scale_color_identity()
```

Flying in the plane and listening to music! That makes sense. The smoothed line suggests that there was more music listened to in the second half of 2022 than the first half of 2023. 

### How did my streaming time vary by month?
Let's investigate this further: what was the total number of hours played per month?

```{r}
STREAMING_HISTORY_PER_MONTH <- CLEANED_STREAMING_HISTORY |> 
  mutate(
    month_floor = floor_date(streaming_datetime, unit = "month"),
    year_floor = floor_date(streaming_datetime, unit = "year")
  ) |> 
  group_by(month_floor, year_floor) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  )
```

Let's plot the total hours played per month.

```{r}
STREAMING_HISTORY_PER_MONTH |> 
  ggplot(aes(x = month_floor, y = total_hours_played)) +
  
  geom_point() +
  geom_line() +
  
  labs(
    x = "",
    y = "Hours Played",
    title = "Hours Played Per Month",
    subtitle = "Spotify Streaming History"
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(
      size = 20,
      face = "bold"
    ),
    plot.subtitle = element_text(
      size = 15
    ),
    axis.title = element_text(
      size = 15
    ),
    axis.text = element_text(
      size = 10
    )
  )
```
There seems to be a bit of a pattern. Before I went backpacking (Jan 2023), I was listening to a lot more music. Let's calculate the total number of hours played in both years and see how different they are.

```{r}
STREAMING_HISTORY_PER_MONTH |> 
  group_by(year_floor) |>
  summarise(
    total_hours_played = sum(total_hours_played)
  )
```

There definitely seems like there is a major difference between the two years. I wonder if this is because I was travelling in 2023 and therefore didn't have as much time to listen to music. Let's investigate this further.

### Who were my top artists?
Let's investigate who my top artists are. We will do this by grouping by artist name and then calculating the total number of hours played.

```{r}
CLEANED_STREAMING_HISTORY |> 
  group_by(artist_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  head(10) |> 
  ggplot(aes(x = reorder(artist_name, total_hours_played), y = total_hours_played)) +
  geom_col(aes(fill = ifelse(total_hours_played > 20, "orange", "grey"))) +
  coord_flip() +
  scale_y_continuous(
    breaks = seq(0, 100, 10)
  ) +
  scale_fill_identity() +
  labs(
    x = "",
    y = "Hours Played",
    title = "Top Artists",
    subtitle = "Spotify Streaming History: July 2022 - July 2023"
  ) +
  theme_minimal()
```

As expected, I'm a massive Parcels fan and the data shows it!
Let's look at my top artists for each month.

```{r}
CLEANED_STREAMING_HISTORY |> 
  mutate(
    month_floor = floor_date(streaming_datetime, unit = "month")
  ) |> 
  group_by(month_floor, artist_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  group_by(month_floor) |> 
  slice(1) |> 
  ggplot(aes(x = month_floor, y = total_hours_played, fill = artist_name)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(
    x = "",
    y = "Hours Played",
    title = "Top Artists Per Month",
    subtitle = "Spotify Streaming History: July 2022 - July 2023"
  ) +
  theme_minimal()

```

Wow, Parcels really was my favourite artist consistently throughout the time range, although from April 2023 onwards, it seems I started listening to more podcasts. A further question for future investigation: How does my podcast listening behaviour change over time.

### What were my top songs?
Let's move onto top songs. We will do this by grouping by track name and then calculating the total number of hours played.

```{r}
CLEANED_STREAMING_HISTORY |> 
  group_by(track_name, artist_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  head(10) |> 
  ggplot(aes(x = reorder(track_name, total_hours_played), y = total_hours_played)) +
  geom_col(aes(fill = ifelse(artist_name == "Parcels", "orange", "grey"))) +
  coord_flip() +
  scale_y_continuous(
    breaks = seq(0, 10, 2)
  ) +
  scale_fill_identity() +
  labs(
    x = "",
    y = "Hours Played",
    title = "Top Songs",
    subtitle = "Spotify Streaming History: July 2022 - July 2023\nOrange = Parcels"
  ) +
  theme_minimal()
```

Five of the top 10 songs were songs from Parcels.

Let's look at the top songs for each month.

```{r}
CLEANED_STREAMING_HISTORY |> 
  mutate(
    month_floor = floor_date(streaming_datetime, unit = "month")
  ) |> 
  group_by(month_floor, track_name, artist_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  group_by(month_floor) |> 
  slice(1) |> 
  mutate(
    fill_colour = case_when(
      track_name == "Lost in Music - Dimitri from Paris Remix" ~ "pink",
      artist_name == "Parcels" ~ "orange",
      .default = "grey"
      )
  ) |> 
  ggplot(aes(x = month_floor, y = total_hours_played, fill = fill_colour)) +
  geom_col() +
  scale_fill_identity() +
  labs(
    x = "",
    y = "Hours Played",
    title = "Top Songs Per Month",
    subtitle = "Spotify Streaming History: July 2022 - July 2023\nOrange = Parcels\nPink = Lost in Music - Dimitri from Paris Remix"
  ) +
  theme_minimal()
```

It seems that I listened to Lost in Music - Dimitri from Paris Remix a lot in July/August 2022. Parcels was my top artist for every month, but it seems that I listened to them a lot more in October 2022 and January/Febuary 2023.

### How did my top 10 songs vary across time?

Let's investigate how my top 10 songs varied across time. We will do this by grouping by track name and then calculating the total number of hours played.

```{r}
top_ten_songs <- CLEANED_STREAMING_HISTORY |> 
  group_by(track_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |> 
  arrange(desc(total_hours_played)) |> 
  head(5) |> 
  pull(track_name)

CLEANED_STREAMING_HISTORY |>
  filter(track_name %in% top_ten_songs) |> 
  mutate(
    month_floor = floor_date(streaming_datetime, unit = "month")
  ) |>
  group_by(month_floor, track_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |>
  ggplot(aes(x = month_floor, y = total_hours_played, colour = track_name)) +
  
  geom_point() +
  geom_line() +
  
  labs(
    x = "",
    y = "Hours Played",
    title = "Top 5 Songs - Hours Played Per Day",
    subtitle = "Spotify Streaming History",
    colour = "Track Name"
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(
      size = 20,
      face = "bold"
    ),
    plot.subtitle = element_text(
      size = 15
    ),
    axis.title = element_text(
      size = 15
    ),
    axis.text = element_text(
      size = 10
    )
  )


```

This is super interesting. It seems that there are some rough patterns in my top 5 songs. For example, "Lost in Music - Dimitri from Paris Remix" was played a lot in the first half of 2022 and then not at all in the first half of 2023. Similarly, "The Girl" has a similar downwards trend. "Tieduprightnow" was played a lot in the new year (2023); however, also dropped. "Free" and "Bitter Sweet Symphony" were almost perfectly positively correlated with each other with the exception of late 2022.

I wonder if I could do this analysis for all of my songs and then create a grouping/cluster analysis to see if there are any temporal patterns in my music listening? Are there some songs that I listen to with other songs? Do these songs group together because I usually listen to them from the same playlist? Can I somehow link/predict my playlist data and my streaming data?

## Moving Forward

There are quite a few questions that I would like to explore in the future. For example:
-   I would like to explore how my podcast listening behaviour change over time.
-   I would like to explore how my top 10 songs varied across time and utilise the `gganimate` package.
-   I would really like to do some time series analysis on my streaming history.
-   I'm curious on linking my streaming history data with my playlist data. I wonder if I can predict my playlist data based on my streaming history data. I think I would typically use Spotify by listening to my playlists, so potentially doing some clustering/grouping analysis on my streaming history data and then linking it to my playlist data would be interesting.

These are all questions that I would like to explore in future! But for now, these were some great first initial data explorations of my Spotify streaming history. I hope you enjoyed reading this post and I hope you learned something new about Spotify streaming history data analysis. If you have any questions or comments, please feel free to reach out to me. I would love to hear from you! :)