# Last Christmas I gave you my heart
> TL;DR: There is no such thing as a summer classic, but there are plenty of Christmas classics!

- toc: true
- branch: master
- badges: false
- comments: true
- author: Mikkel Freltoft Krogsholm
- categories: [spotify, charts, christmas, summer, R, jupyter]

This blog post looks at music.

I have downloaded most of the chart data from [spotifycharts](https://spotifycharts.com/) and stored it in a SQLite database.

Specifically, I will look at "seasonal classics". It turns out that there are no summer classics, only Christmas classics. Summer hits look temporary and fleeting, whereas Christmas songs are more permanent.

### Setup

First I load the needed libraries and create a connection to my local music database.

In [1]:
suppressPackageStartupMessages(library(tidyverse))
library(RSQLite)

options(repr.plot.width=15, repr.plot.height=7.5, scipen = 999)

con <- dbConnect(RSQLite::SQLite(), "spotify.db")

### Collect the data

I will look at three countries: Germany (de), Denmark (dk) and Great Britain (gb). I will then create a plot that shows that many Christmas songs return year after year, while summer songs do not.

In [2]:
# Make a reference to the table in the DB
tblref <- tbl(con, "top200")
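
The reference created above is lazy: no rows are read until `collect()` is called. Here is a minimal sketch of that behaviour, assuming the `dbplyr` package is installed. The in-memory table, its rows and its reduced set of columns are made up for illustration and are not the real `top200` data.

```r
library(dplyr)
library(DBI)

# Hypothetical stand-in for spotify.db: an in-memory SQLite database
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows; only a subset of the columns used in this post
dbWriteTable(demo_con, "top200", data.frame(
  region  = c("dk", "dk", "gb", "us"),
  date    = c("2018-12-01", "2018-12-02", "2018-12-01", "2018-12-01"),
  Streams = c(100, 150, 300, 999)
))

# tbl() only builds a query; nothing is read yet
demo_ref <- tbl(demo_con, "top200")

# The filter and the sum are translated to SQL and run inside SQLite;
# collect() brings the (small) result into R
demo_totals <- demo_ref %>%
  filter(region %in% c("de", "dk", "gb")) %>%
  group_by(region) %>%
  summarise(streams = sum(Streams, na.rm = TRUE)) %>%
  collect()

dbDisconnect(demo_con)
print(demo_totals)
```

Doing the filtering and aggregation before `collect()` keeps the amount of data pulled into R small, which matters when the table holds daily charts for many countries.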

In [None]:
# Collect the data
dfhere <- tblref %>%
    filter(region %in% c("de", "dk", "gb")) %>% # filter relevant countries
    mutate(m = str_sub(date, 6, 7), # create a month column
           y = str_sub(date, 1, 4), # create a year column
           ym = paste(y,m)) %>% # create a yearmonth  column
    group_by(region, y, m, ym, `Track Name`, Artist, URL) %>% # Group by region, track and year month
    summarise(streams = sum(Streams, na.rm = TRUE)) %>% # sum up the streams for a track per year month 
    ungroup() %>% # ungroup
    collect() # Import the data into R's memory

# Let's have a look at the first few rows
head(dfhere)
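
To make the aggregation step concrete, here is a small sketch of the same year-month logic run on an ordinary data frame. The rows are invented and only some of the grouping columns are included; the point is just to show how daily rows collapse into one row per track and month.

```r
library(dplyr)
library(stringr)

# Hypothetical daily chart rows; values are made up for illustration
demo_daily <- tibble(
  region       = c("dk", "dk", "dk", "dk"),
  date         = c("2018-12-01", "2018-12-02", "2019-01-01", "2018-12-01"),
  `Track Name` = c("Song A", "Song A", "Song A", "Song B"),
  Streams      = c(100, 150, 40, 70)
)

demo_monthly <- demo_daily %>%
  mutate(m  = str_sub(date, 6, 7),   # characters 6-7 of "YYYY-MM-DD" are the month
         y  = str_sub(date, 1, 4),   # characters 1-4 are the year
         ym = paste(y, m)) %>%       # e.g. "2018 12"
  group_by(region, y, m, ym, `Track Name`) %>%
  summarise(streams = sum(Streams, na.rm = TRUE), .groups = "drop")

print(demo_monthly)
```

Because `ym` puts the year first and keeps the month zero-padded, sorting it as plain text also sorts it chronologically, which is what the x-axis of the later charts relies on.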

With the monthly data now in memory, I will pick the top 100 tracks per region and month.

In [None]:
top <- dfhere %>%
    group_by(region, y, m, ym) %>%
    top_n(n = 100, wt = streams) %>%
    ungroup()

# yms <- top$ym %>% unique() %>% sort()
# top$ym <- factor(top$ym, levels = yms)

head(top)
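
One detail worth knowing about `top_n()` is that it keeps ties, so a group can end up with more than `n` rows. The sketch below uses made-up numbers to show this, and shows `slice_max()` with `with_ties = FALSE` as one possible alternative if exactly `n` rows per group are wanted. It is an illustration, not a change to the analysis above.

```r
library(dplyr)

# Hypothetical month with a tie for second place
demo_month <- tibble(
  track   = c("A", "B", "C", "D"),
  streams = c(10, 8, 8, 5)
)

# top_n() keeps every row tied with the n-th value
demo_with_ties <- demo_month %>% top_n(n = 2, wt = streams)

# slice_max() can be told to break ties and return exactly n rows
demo_no_ties <- demo_month %>% slice_max(streams, n = 2, with_ties = FALSE)

nrow(demo_with_ties)  # 3 rows: A, B and C
nrow(demo_no_ties)    # 2 rows
```

With real stream counts exact ties are unlikely, so for this data the two approaches should rarely differ.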

Now we need to divide the data into different data frames. We need these to create the visualization.

In [None]:
# First I create a table of repeaters, i.e. songs that appear in the same calendar month in more than one year
repeaters <- top %>%
    group_by(region) %>%
    count(m, `Track Name`, Artist, URL) %>%
    filter(n > 1) %>%
    ungroup() %>%
    distinct(region, URL) # one row per song, even if it repeats in several months

# Then I filter the data to only contain the repeaters
toprep <- top %>%
    inner_join(repeaters, by = c("region", "URL"))

# Then I create a table of the songs whose peak month is December
xmasmax <- toprep %>%
    group_by(region, URL) %>%
    filter(streams == max(streams),
           m == "12") %>% # m is a character column, so compare against "12"
    select(region, URL) %>%
    ungroup()

# And use this table to create a data frame of repeater songs whose
# peak month is not December
noxmas <- toprep %>% anti_join(xmasmax, by = c("region", "URL")) %>% distinct()

# And a data frame of repeater songs whose peak month is December
xmas <- toprep %>% inner_join(xmasmax, by = c("region", "URL")) %>% distinct()

head(xmas)
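
The splitting logic can be checked on a tiny example. The sketch below applies the same idea to three made-up songs: a repeater is a song seen in the same calendar month in more than one year, and a Christmas song is a repeater whose best month is December. The `demo_` names and all values are hypothetical.

```r
library(dplyr)

# Hypothetical monthly top-list rows for one region
demo_top <- tibble(
  region  = "dk",
  y       = c("2017", "2018", "2018", "2018", "2019", "2018", "2018"),
  m       = c("12",   "12",   "11",   "05",   "05",   "06",   "07"),
  URL     = c("x",    "x",    "x",    "y",    "y",    "y",    "z"),
  streams = c(500,    600,    100,    900,    200,    800,    300)
)

# Songs that show up in the same calendar month in more than one year
demo_repeaters <- demo_top %>%
  count(region, m, URL) %>%
  filter(n > 1) %>%
  distinct(region, URL)

demo_toprep <- demo_top %>%
  inner_join(demo_repeaters, by = c("region", "URL"))

# Repeaters whose single best month falls in December
demo_xmasmax <- demo_toprep %>%
  group_by(region, URL) %>%
  filter(streams == max(streams), m == "12") %>%
  ungroup() %>%
  distinct(region, URL)

demo_xmas   <- demo_toprep %>% semi_join(demo_xmasmax, by = c("region", "URL"))
demo_noxmas <- demo_toprep %>% anti_join(demo_xmasmax, by = c("region", "URL"))

unique(demo_xmas$URL)    # "x": peaks in December and repeats
unique(demo_noxmas$URL)  # "y": repeats in May but peaks outside December
```

Song `z` appears only once and is dropped at the repeater step. The sketch uses `semi_join()` instead of `inner_join()` for the final split; with one row per song in the lookup table the two give the same rows.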

### Visualize the data

#### Year on year repeaters
Let's first look at the songs that repeat year on year. As you can see, these songs normally show a more or less sharp rise followed by a long, slow decline. It is this decline that makes it possible for a song to appear in the same month year on year, although on a very different scale.

In [None]:
# I need a little helper to create pretty x-axis on the charts
yms <- top$ym %>% unique() %>% sort()
yms_labs <- str_c(str_sub(yms, -2,-1), " '", str_sub(yms, 3, 4))
yms_df <- top %>% distinct(ym)
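
As a quick illustration of what the label helper produces, here is the same string manipulation applied to two example year-month values (the input values are just examples in the `"YYYY MM"` format used above):

```r
library(stringr)

demo_yms <- c("2017 12", "2018 01")

# Last two characters are the month, characters 3-4 are the short year
demo_labs <- str_c(str_sub(demo_yms, -2, -1), " '", str_sub(demo_yms, 3, 4))

print(demo_labs)  # "12 '17" "01 '18"
```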

In [None]:
ggplot() + 
    # Then I plot the non-xmas data in green
    geom_line(data = noxmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "darkgreen", 
              linetype = "dashed", alpha = .4) + 
    geom_point(data = noxmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "darkgreen", 
               alpha = .4) + 
    
    # And some chart styling
    scale_x_discrete(labels = yms_labs, limits = yms) + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45)) +
    facet_wrap(facets = ~ region, ncol = 1, scales = "free_y") +
    labs(x = "", y = "Streams per month")

#### Christmas repeaters
It looks a bit different if we plot the Christmas songs. It becomes very obvious that the Christmas repeaters are limited to Christmas. They also do not follow the same path as the other songs (sharp increase, long slow decrease). They come to life pretty high on the charts in December (some in November), then quickly die out and wait for Christmas to come again. I have marked December with a red dashed line.

In [None]:
ggplot() + 

    # First I create a vertical red dashed line for every december month
    geom_vline(xintercept = c("2017 12", "2018 12", "2019 12"), linetype = "dashed", color = "red") +
    
    # And the christmas data in blue
    geom_line(data = xmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "blue", 
              alpha = .5) + 
    geom_point(data = xmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "blue", 
               alpha = .9) + 
    
    # And some chart styling
    scale_x_discrete(labels = yms_labs, limits = yms) + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45)) +
    facet_wrap(facets = ~ region, ncol = 1, scales = "free_y") +
    labs(x = "", y = "Streams per month")

#### Plotted together
When plotted together the difference becomes obvious.

In [None]:
ggplot() + 
    # First I create a vertical red dashed line for every december month
    geom_vline(xintercept = c("2017 12", "2018 12", "2019 12"), linetype = "dashed", color = "red") +
    
    # Then I plot the non-xmas data in green
    geom_line(data = noxmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "darkgreen", 
              linetype = "dashed", alpha = .4) + 
    geom_point(data = noxmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "darkgreen", 
               alpha = .4) + 
    
    # And the christmas data in blue
    geom_line(data = xmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "blue", 
              alpha = .5) + 
    geom_point(data = xmas, aes(ym, streams, group = URL), show.legend = FALSE, color = "blue", 
               alpha = .9) + 
    
    # And some chart styling
    scale_x_discrete(labels = yms_labs, limits = yms) + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 45)) +
    facet_wrap(facets = ~ region, ncol = 1, scales = "free_y") +
    labs(x = "", y = "Streams per month")

### Interpretation

#### Non Christmas
We see a few non-Christmas songs that last more than a year, i.e. they appear in the same month in more than one year. But they show a very different trend from the Christmas songs: a period of popularity followed by a fade-out.

#### Christmas
With the Christmas songs we see a completely different pattern. First, there are more of them, i.e. more songs survive from year to year. Second, the shape is quite different: the songs lie dormant pretty much all year and then suddenly explode in December. Some even get an early listen in November, when the most enthusiastic Christmas fans get started.

## Conclusion
There is no such thing as a summer classic, but there are plenty of Christmas classics!