---
title: "Spotify Exploratory Data Analysis - Streaming History"
format: html
---
## Introduction
This is the first in a series of exploratory data analysis (EDA) projects on my Spotify data. The data was downloaded from my Spotify account on July 23rd, 2023 as a zip file containing several JSON files, which I saved on my personal Google Drive. The JSON files are read into R with the `jsonlite` package and then converted into tibbles for analysis.
This Quarto document focuses on my streaming history. I'm interested in exploring my listening habits across the time period of the data and across the days of the week.
This process is documented in the following sections:
- Setup and Configuration: Load packages and authorise `googledrive` API access
- Data Loading: Download and load the data
- Data Tidying: Get a tidy dataset
- Data Cleaning: Ensure variables are in correct formats
- Data Exploration: Answer one question and come up with two extra ones
Let's start exploring!
## Setup and Configuration
First, let's load the packages we'll need for this project and authorise access to my Google Drive.
```{r}
### "Tidyverse"-oriented packages:
# The tidyverse is a collection of R packages designed for data science.
# All packages share a similar design philosophy, grammar, and data structures.
# Tidyverse includes packages such as:
# ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and forcats.
### https://www.tidyverse.org/
library(tidyverse)
# To easily create data visualisations with simple and consistent syntax and grammar.
# https://ggplot2.tidyverse.org/index.html
library(ggplot2)
# To allow interaction between files on Google Drive and R.
# https://googledrive.tidyverse.org/
library(googledrive)
### Other Packages:
# To easily create summary statistics to understand and explore data.
# https://docs.ropensci.org/skimr/
library(skimr)
# A fast JSON parser and generator.
### https://cran.r-project.org/web/packages/jsonlite/index.html
library(jsonlite)
# To easily enable file referencing in project-oriented workflows.
# https://here.r-lib.org/
library(here)
# To easily format and scale data in visualisations.
# https://scales.r-lib.org/
library(scales)
# Google Drive Authentication --------------------------------------------------
# To establish a connection between a Google Drive account and R.
drive_auth()
# Example of how to download from Google Drive
# drive_download(
# # Where to download file from
# "https://drive.google.com/file/d/1Fjq1r6016H4isB2Cx2wg-Xm9zY7lHhYV/view?usp=drive_link",
#
# # Where to save it locally
# path = here("foldertest", "text2")
# )
```
## Data Loading
To access the data, I need to download it from my Google Drive. The data was requested from Johann's Spotify account and downloaded as a zip file containing several JSON files. There are several different JSON files; however, for this analysis I'm only interested in the Streaming History files.
You will only have access if Johann has given you read access to the email you authorised in 0-00_setup_and_configuration.R.
```{r}
# Only download raw data if it hasn't already been downloaded
if (!dir.exists(here("raw_data"))) {
  dir.create(here("raw_data"), showWarnings = FALSE)
  # List contents of Spotify Analysis Folder
  spotify_dribble <- drive_ls("Spotify Analysis")
  # Download raw data
  map2(
    spotify_dribble$id,
    spotify_dribble$name,
    ~ drive_download(
      file = as_id(.x),
      path = here("raw_data", .y),
      overwrite = TRUE
    )
  )
}
# Read in individual raw json as nested lists
# JRAW = RAW JSON
# RAW_JSON causes alphabetical ordering inconveniences in R environment.
JRAW_STREAMING_HISTORY_0 <- read_json(
path = here(
"raw_data",
"StreamingHistory0.json"
)
)
JRAW_STREAMING_HISTORY_1 <- read_json(
path = here(
"raw_data",
"StreamingHistory1.json"
)
)
JRAW_STREAMING_HISTORY_2 <- read_json(
path = here(
"raw_data",
"StreamingHistory2.json"
)
)
```
## Data Tidying
These JSON files, already read in with the `jsonlite` package, are then converted into tibbles for analysis. The tibbles are then combined into one tibble, as they all have the same columns. I suspect the export is split across several files because of the size of the data.
```{r}
RAW_STREAMING_HISTORY_0 <- JRAW_STREAMING_HISTORY_0 |>
  bind_rows() |>
  as_tibble()
RAW_STREAMING_HISTORY_1 <- JRAW_STREAMING_HISTORY_1 |>
  bind_rows() |>
  as_tibble()
RAW_STREAMING_HISTORY_2 <- JRAW_STREAMING_HISTORY_2 |>
  bind_rows() |>
  as_tibble()
# Combine all streaming history tibbles into one tibble
RAW_STREAMING_HISTORY <- bind_rows(
RAW_STREAMING_HISTORY_0,
RAW_STREAMING_HISTORY_1,
RAW_STREAMING_HISTORY_2
)
```
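The three near-identical read-and-bind steps above could also be written once and applied to every file. The sketch below is only an illustration under stated assumptions: it writes two tiny made-up files to a temporary folder (`demo_dir`, a hypothetical stand-in for `raw_data`) so that it runs on its own, and the names `history_files` and `demo_streaming_history` are mine, not part of the analysis.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Hypothetical stand-in for the raw_data folder: two tiny made-up files
demo_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(demo_dir, showWarnings = FALSE)

write_json(
  list(list(endTime = "2022-07-24 10:15", artistName = "Parcels",
            trackName = "Free", msPlayed = 180000L)),
  file.path(demo_dir, "StreamingHistory0.json"),
  auto_unbox = TRUE
)
write_json(
  list(list(endTime = "2022-07-25 18:30", artistName = "The Verve",
            trackName = "Bitter Sweet Symphony", msPlayed = 240000L)),
  file.path(demo_dir, "StreamingHistory1.json"),
  auto_unbox = TRUE
)

# Find every streaming history file, however many the export contains
history_files <- list.files(
  demo_dir,
  pattern = "^StreamingHistory\\d+\\.json$",
  full.names = TRUE
)

# Read each file as a nested list, row-bind its records, then stack the files
demo_streaming_history <- history_files |>
  map(read_json) |>
  map(bind_rows) |>
  bind_rows()

demo_streaming_history
```

With this pattern, a larger export that ships a `StreamingHistory3.json` would be picked up without any change to the code.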
## Data Cleaning
Let's ensure the variables are in the correct format.
```{r}
CLEANED_STREAMING_HISTORY <- RAW_STREAMING_HISTORY |>
  mutate(
    # Convert ms to minutes
    min_played = as.numeric(msPlayed / 60000),
    # Convert artistName to factor and trackName to character
    artist_name = as.factor(artistName),
    track_name = as.character(trackName),
    # Convert endTime into a Date; the time of day is dropped, so
    # streaming_datetime identifies the calendar day of each play
    streaming_datetime = as_date(endTime, format = "%Y-%m-%d %H:%M")
  ) |>
  # Remove unnecessary columns
  select(
    artist_name,
    track_name,
    streaming_datetime,
    min_played
  )
```
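Because `as_date()` keeps only the calendar day, the cleaned data cannot answer time-of-day questions. If those ever become interesting, one option is to parse the full timestamp first and derive the date from it. This is a minimal sketch on made-up `endTime` strings in the same format; the column names `streaming_timestamp`, `streaming_date` and `streaming_hour` are hypothetical, and I am assuming UTC because `ymd_hm()` defaults to it (I have not checked which timezone Spotify uses in the export).

```r
library(dplyr)
library(lubridate)

# Made-up endTime values in the same "%Y-%m-%d %H:%M" format as the export
demo_raw <- tibble(
  endTime = c("2022-07-24 10:15", "2022-07-24 23:50", "2022-07-25 00:05")
)

demo_cleaned <- demo_raw |>
  mutate(
    # Full timestamp; ymd_hm() uses UTC unless a tz is supplied
    streaming_timestamp = ymd_hm(endTime),
    # Calendar date only, the same granularity the analysis uses
    streaming_date = as_date(streaming_timestamp),
    # Hour of day, which is only recoverable from the full timestamp
    streaming_hour = hour(streaming_timestamp)
  )

demo_cleaned
```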
## Data Exploration
This data exploration has two objectives:
1. To get a sense of the data and to see if there are any issues with the data.
2. To answer several questions that I have about my listening habits.
### Sanity Checks
There are `r nrow(CLEANED_STREAMING_HISTORY)` rows in the CLEANED_STREAMING_HISTORY tibble, which is the number of song/podcast episode plays recorded between `r min(CLEANED_STREAMING_HISTORY$streaming_datetime)` and `r max(CLEANED_STREAMING_HISTORY$streaming_datetime)`. Let's use the function `skim()` from the skimr package to sense-check the data.
```{r}
CLEANED_STREAMING_HISTORY |>
skim()
```
There are `r ncol(CLEANED_STREAMING_HISTORY)` columns in the CLEANED_STREAMING_HISTORY tibble. There are `r distinct(CLEANED_STREAMING_HISTORY, artist_name) |> nrow()` unique artists and `r distinct(CLEANED_STREAMING_HISTORY, track_name) |> nrow()` unique tracks in the CLEANED_STREAMING_HISTORY tibble. It is interesting that the shortest `track_name` has a length of `r min(str_length(CLEANED_STREAMING_HISTORY$track_name))` characters and the longest `track_name` has a length of `r max(str_length(CLEANED_STREAMING_HISTORY$track_name))` characters. I wonder what song that is. Interestingly, the shortest `artist_name` has a length of `r min(str_length(CLEANED_STREAMING_HISTORY$artist_name))` characters. The dates range between `r min(CLEANED_STREAMING_HISTORY$streaming_datetime)` and `r max(CLEANED_STREAMING_HISTORY$streaming_datetime)`.
It seems like the data mostly makes sense and that there is a wide range of song names and artist names.
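To actually see which track has the shortest name, the rows with the minimum name length can be pulled out directly. The sketch below runs on a small made-up tibble with the same `artist_name` and `track_name` columns; `demo_history`, `shortest_tracks` and `name_length` are names I introduced for illustration.

```r
library(dplyr)
library(stringr)

# Made-up rows with the same columns as the cleaned data
demo_history <- tibble(
  artist_name = factor(c("Parcels", "Parcels", "The Verve")),
  track_name = c("Free", "Tieduprightnow", "Bitter Sweet Symphony")
)

shortest_tracks <- demo_history |>
  # One row per distinct track, so repeated plays do not repeat in the result
  distinct(artist_name, track_name) |>
  mutate(name_length = str_length(track_name)) |>
  # Keeps every track tied for the shortest name
  slice_min(name_length, n = 1)

shortest_tracks
```

Swapping `slice_min()` for `slice_max()` gives the longest names instead.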
### Reshape Data: Streaming per day
Let's reshape the data so that we can see how much I have streamed per day.
```{r}
STREAMING_HISTORY_PER_DAY <- CLEANED_STREAMING_HISTORY |>
group_by(streaming_datetime) |>
summarise(
total_hours_played = sum(min_played / 60)
)
STREAMING_HISTORY_PER_DAY
```
### What were the top 5 days I listened to music?
Let's now find the 5 days on which I listened the most, and include the day of the week.
```{r}
TOP_SONGS <- STREAMING_HISTORY_PER_DAY |>
mutate(
day_of_week = wday(streaming_datetime, label = TRUE)
) |>
arrange(desc(total_hours_played)) |>
head(5)
TOP_SONGS
```
It seems like `r TOP_SONGS |> slice(1) |> pull(streaming_datetime)` and `r TOP_SONGS |> slice(2) |> pull(streaming_datetime)` were two days when I listened to a LOT of music.
Let's pull it back and look at the aggregate again; I wonder which days of the week I listen on the most?
```{r}
STREAMING_HISTORY_PER_DAY |>
mutate(
day_of_week = wday(streaming_datetime, label = TRUE)
) |>
group_by(day_of_week) |>
summarise(
total_hours_played = sum(total_hours_played)
) |>
arrange(desc(total_hours_played))
```
Surprisingly, it seems like Mondays are the days on which I streamed the most. I wonder if this is because I listen to music on my commute to work? Although, I don't think I was really working consistently in 2022-23.
So potentially this is because I listened to music while I was studying? To answer this question and gain more insights, I would need to look at my calendar and see what I was doing on those days.
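One caveat with summed totals is that a weekday can come out on top simply because it appears more often in the date range, or because of one unusually long day. A possible cross-check is the average per observed day. This is a sketch on made-up numbers, not the real data; `demo_per_day` and `demo_by_weekday` are hypothetical names, and note that days with no listening at all are absent from the data, so the mean is over days with at least one play.

```r
library(dplyr)
library(lubridate)

# Made-up daily totals: two Mondays and one Tuesday
demo_per_day <- tibble(
  streaming_datetime = as_date(c("2023-01-02", "2023-01-09", "2023-01-03")),
  total_hours_played = c(2, 4, 5)
)

demo_by_weekday <- demo_per_day |>
  mutate(day_of_week = wday(streaming_datetime, label = TRUE)) |>
  group_by(day_of_week) |>
  summarise(
    days_observed = n(),
    # Average over the days that appear in the data for this weekday
    mean_hours_per_day = mean(total_hours_played),
    total_hours = sum(total_hours_played)
  ) |>
  arrange(desc(mean_hours_per_day))

demo_by_weekday
```

In this toy example Monday has the larger total but Tuesday has the larger average, which is exactly the kind of difference the cross-check is meant to surface.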
### How did my streaming time vary by day?
Let's plot the total hours played per day.
```{r}
GGPLOT_HOURS_PLAYED_PER_DAY <- STREAMING_HISTORY_PER_DAY |>
ggplot(aes(x = streaming_datetime, y = total_hours_played)) +
geom_point() +
geom_line() +
labs(
x = "",
y = "Hours Played",
title = "Hours Played Per Day",
subtitle = "Spotify Streaming History"
) +
theme_minimal() +
theme(
plot.title = element_text(
size = 20,
face = "bold"
),
plot.subtitle = element_text(
size = 15
),
axis.title = element_text(
size = 15
),
axis.text = element_text(
size = 10
)
)
GGPLOT_HOURS_PLAYED_PER_DAY
```
The number of hours played per day fluctuates a lot: on some days very little music was played, and on others a lot. Two days in particular stand out. Let's investigate them further; we know that they are `r TOP_SONGS |> slice(1) |> pull(streaming_datetime)` and `r TOP_SONGS |> slice(2) |> pull(streaming_datetime)`. What did I do on these two days? Let's also include a smoothed line.
```{r}
GGPLOT_HOURS_PLAYED_PER_DAY +
geom_point(aes(
colour = ifelse(
streaming_datetime == as.Date("2023-05-30") |
streaming_datetime == as.Date("2023-02-18"),
"red",
"darkgrey"
)
)
) +
geom_line(colour = "darkgrey") +
geom_smooth() +
geom_label(
label = "Flying to Australia",
x = as.Date("2023-05-30"),
y = STREAMING_HISTORY_PER_DAY |>
filter(streaming_datetime == as.Date("2023-05-30")) |>
pull(total_hours_played),
vjust = -0.5
) +
geom_label(
label = "Flying to Austria",
x = as.Date("2023-02-18"),
y = STREAMING_HISTORY_PER_DAY |>
filter(streaming_datetime == as.Date("2023-02-18")) |>
pull(total_hours_played),
vjust = -0.5
) +
expand_limits(
y = c(0, 20)
) +
scale_color_identity()
```
Flying in the plane and listening to music! That makes sense. The smoothed line suggests that I listened to more music in the second half of 2022 than in the first half of 2023.
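The two highlighted dates are typed into the plotting code by hand, so they would go stale if the data were refreshed. One way to avoid that is to derive them from the daily totals. The sketch below uses the two real dates but invented hour values purely for illustration; `demo_per_day`, `top_two_days` and `point_colour` are hypothetical names.

```r
library(dplyr)

# The two dates come from the analysis above; the hour values are invented
demo_per_day <- tibble(
  streaming_datetime = as.Date(
    c("2023-02-17", "2023-02-18", "2023-05-29", "2023-05-30")
  ),
  total_hours_played = c(1.5, 11, 2, 14)
)

# The two days with the most listening, in descending order
top_two_days <- demo_per_day |>
  slice_max(total_hours_played, n = 2) |>
  pull(streaming_datetime)

# A colour column that scale_color_identity() could use directly
demo_flagged <- demo_per_day |>
  mutate(
    point_colour = if_else(
      streaming_datetime %in% top_two_days, "red", "darkgrey"
    )
  )

demo_flagged
```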
### How did my streaming time vary by month?
Let's investigate this further: what was the total number of hours played per month?
```{r}
STREAMING_HISTORY_PER_MONTH <- CLEANED_STREAMING_HISTORY |>
mutate(
month_floor = floor_date(streaming_datetime, unit = "month"),
year_floor = floor_date(streaming_datetime, unit = "year")
) |>
group_by(month_floor, year_floor) |>
summarise(
total_hours_played = sum(min_played / 60)
)
```
Let's plot the total hours played per month.
```{r}
STREAMING_HISTORY_PER_MONTH |>
ggplot(aes(x = month_floor, y = total_hours_played)) +
geom_point() +
geom_line() +
labs(
x = "",
y = "Hours Played",
title = "Hours Played Per Month",
subtitle = "Spotify Streaming History"
) +
theme_minimal() +
theme(
plot.title = element_text(
size = 20,
face = "bold"
),
plot.subtitle = element_text(
size = 15
),
axis.title = element_text(
size = 15
),
axis.text = element_text(
size = 10
)
)
```
There seems to be a bit of a pattern. Before I went backpacking (Jan 2023), I was listening to a lot more music. Let's calculate the total number of hours played in both years and see how different they are.
```{r}
STREAMING_HISTORY_PER_MONTH |>
group_by(year_floor) |>
summarise(
total_hours_played = sum(total_hours_played)
)
```
There definitely seems to be a major difference between the two years. I wonder if this is because I was travelling in 2023 and therefore didn't have as much time to listen to music. Let's investigate this further.
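Since the export runs from roughly July 2022 to July 2023, the two calendar years cover different numbers of months, so raw yearly totals are not directly comparable. A rough way to put them on the same footing is hours per day with listening. The sketch below is on made-up numbers; `demo_per_day`, `demo_per_year` and the summary column names are my own.

```r
library(dplyr)
library(lubridate)

# Made-up daily totals spanning a year boundary
demo_per_day <- tibble(
  streaming_datetime = as_date(
    c("2022-12-30", "2022-12-31", "2023-01-01", "2023-01-02", "2023-01-03")
  ),
  total_hours_played = c(4, 2, 1, 1, 1)
)

demo_per_year <- demo_per_day |>
  group_by(year = year(streaming_datetime)) |>
  summarise(
    days_with_listening = n(),
    total_hours = sum(total_hours_played),
    # Normalise by the number of days that appear in the data for that year
    hours_per_listening_day = total_hours / days_with_listening
  )

demo_per_year
```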
### Who were my top artists?
Let's investigate who my top artists are. We will do this by grouping by artist name and then calculating the total number of hours played.
```{r}
CLEANED_STREAMING_HISTORY |>
group_by(artist_name) |>
summarise(
total_hours_played = sum(min_played / 60)
) |>
arrange(desc(total_hours_played)) |>
head(10) |>
ggplot(aes(x = reorder(artist_name, total_hours_played), y = total_hours_played)) +
geom_col(aes(fill = ifelse(total_hours_played > 20, "orange", "grey"))) +
coord_flip() +
scale_y_continuous(
breaks = seq(0, 100, 10)
) +
scale_fill_identity() +
labs(
x = "",
y = "Hours Played",
title = "Top Artists",
subtitle = "Spotify Streaming History: July 2022 - July 2023"
) +
theme_minimal()
```
As expected, I'm a massive Parcels fan and the data shows it!
Let's look at my top artists for each month.
```{r}
CLEANED_STREAMING_HISTORY |>
mutate(
month_floor = floor_date(streaming_datetime, unit = "month")
) |>
group_by(month_floor, artist_name) |>
summarise(
total_hours_played = sum(min_played / 60)
) |>
arrange(desc(total_hours_played)) |>
group_by(month_floor) |>
slice(1) |>
ggplot(aes(x = month_floor, y = total_hours_played, fill = artist_name)) +
geom_col() +
scale_fill_viridis_d() +
labs(
x = "",
y = "Hours Played",
title = "Top Artists Per Month",
subtitle = "Spotify Streaming History: July 2022 - July 2023"
) +
theme_minimal()
```
Wow, Parcels really was my favourite artist consistently throughout the time range, although from April 2023 onwards, it seems I started listening to more podcasts. A further question for future investigation: how does my podcast listening behaviour change over time?
### What were my top songs?
Let's move onto top songs. We will do this by grouping by track name and then calculating the total number of hours played.
```{r}
CLEANED_STREAMING_HISTORY |>
group_by(track_name, artist_name) |>
summarise(
total_hours_played = sum(min_played / 60)
) |>
arrange(desc(total_hours_played)) |>
head(10) |>
ggplot(aes(x = reorder(track_name, total_hours_played), y = total_hours_played)) +
geom_col(aes(fill = ifelse(artist_name == "Parcels", "orange", "grey"))) +
coord_flip() +
scale_y_continuous(
breaks = seq(0, 10, 2)
) +
scale_fill_identity() +
labs(
x = "",
y = "Hours Played",
title = "Top Songs",
subtitle = "Spotify Streaming History: July 2022 - July 2023\nOrange = Parcels"
) +
theme_minimal()
```
Five of the top 10 songs were songs from Parcels.
Let's look at the top songs for each month.
```{r}
CLEANED_STREAMING_HISTORY |>
mutate(
month_floor = floor_date(streaming_datetime, unit = "month")
) |>
group_by(month_floor, track_name, artist_name) |>
summarise(
total_hours_played = sum(min_played / 60)
) |>
arrange(desc(total_hours_played)) |>
group_by(month_floor) |>
slice(1) |>
mutate(
fill_colour = case_when(
track_name == "Lost in Music - Dimitri from Paris Remix" ~ "pink",
artist_name == "Parcels" ~ "orange",
.default = "grey"
)
) |>
ggplot(aes(x = month_floor, y = total_hours_played, fill = fill_colour)) +
geom_col() +
scale_fill_identity() +
labs(
x = "",
y = "Hours Played",
title = "Top Songs Per Month",
subtitle = "Spotify Streaming History: July 2022 - July 2023\nOrange = Parcels\nPink = Lost in Music - Dimitri from Paris Remix"
) +
theme_minimal()
```
It seems that I listened to "Lost in Music - Dimitri from Paris Remix" a lot in July/August 2022. Parcels was my top artist for every month, but it seems that I listened to them a lot more in October 2022 and January/February 2023.
### How did my top 5 songs vary across time?
Let's investigate how my top 5 songs varied across time. We will do this by finding the 5 tracks with the most hours played overall and then calculating their hours played in each month.
```{r}
# Names of the 5 tracks with the most hours played overall
top_five_songs <- CLEANED_STREAMING_HISTORY |>
  group_by(track_name) |>
  summarise(
    total_hours_played = sum(min_played / 60)
  ) |>
  arrange(desc(total_hours_played)) |>
  head(5) |>
  pull(track_name)
CLEANED_STREAMING_HISTORY |>
  filter(track_name %in% top_five_songs) |>
mutate(
month_floor = floor_date(streaming_datetime, unit = "month")
) |>
group_by(month_floor, track_name) |>
summarise(
total_hours_played = sum(min_played / 60)
) |>
ggplot(aes(x = month_floor, y = total_hours_played, colour = track_name)) +
geom_point() +
geom_line() +
labs(
x = "",
y = "Hours Played",
title = "Top 5 Songs - Hours Played Per Month",
subtitle = "Spotify Streaming History",
colour = "Track Name"
) +
theme_minimal() +
theme(
plot.title = element_text(
size = 20,
face = "bold"
),
plot.subtitle = element_text(
size = 15
),
axis.title = element_text(
size = 15
),
axis.text = element_text(
size = 10
)
)
```
This is super interesting. There seem to be some rough patterns in my top 5 songs. For example, "Lost in Music - Dimitri from Paris Remix" was played a lot in the second half of 2022 and then not at all in the first half of 2023. "The Girl" shows a similar downward trend. "Tieduprightnow" was played a lot in the new year (2023) but then also dropped off. "Free" and "Bitter Sweet Symphony" look almost perfectly positively correlated with each other, with the exception of late 2022.
I wonder if I could do this analysis for all of my songs and then create a grouping/cluster analysis to see if there are any temporal patterns in my music listening? Are there some songs that I listen to with other songs? Do these songs group together because I usually listen to them from the same playlist? Can I somehow link/predict my playlist data and my streaming data?
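The impression that two songs rise and fall together can be checked numerically by reshaping the monthly totals to one column per track and computing a correlation matrix, which would also be a natural input for a later clustering step. This is only a sketch on invented monthly hours (the track names are real, the numbers are not); `demo_monthly`, `demo_wide` and `demo_correlation` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Invented monthly hours for two tracks, in the same long shape as the plot data
demo_monthly <- tibble(
  month_floor = rep(
    as.Date(c("2022-08-01", "2022-09-01", "2022-10-01", "2022-11-01")),
    times = 2
  ),
  track_name = rep(c("Free", "Bitter Sweet Symphony"), each = 4),
  total_hours_played = c(1.0, 2.0, 3.0, 2.5, 0.5, 1.0, 1.5, 1.0)
)

# One row per month, one column per track; months without plays count as 0
demo_wide <- demo_monthly |>
  pivot_wider(
    names_from = track_name,
    values_from = total_hours_played,
    values_fill = 0
  )

# Pearson correlation between the tracks' monthly listening
demo_correlation <- demo_wide |>
  select(-month_floor) |>
  cor()

demo_correlation
```

With only about a year of monthly values per track, any such correlation rests on very few points, so it is better read as a hint than as evidence.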
## Moving Forward
There are quite a few questions that I would like to explore in the future. For example:
- I would like to explore how my podcast listening behaviour changes over time.
- I would like to explore how my top 10 songs varied across time and utilise the `gganimate` package.
- I would really like to do some time series analysis on my streaming history.
- I'm curious about linking my streaming history data with my playlist data. I wonder if I can predict my playlist data based on my streaming history data. I typically use Spotify by listening to my playlists, so doing some clustering/grouping analysis on my streaming history data and then linking it to my playlist data could be interesting.
These are all questions that I would like to explore in future! But for now, these were some great initial explorations of my Spotify streaming history. I hope you enjoyed reading this post and I hope you learned something new about Spotify streaming history data analysis. If you have any questions or comments, please feel free to reach out to me. I would love to hear from you! :)