# Spotify Music Data Analysis with ggplot2

This notebook analyzes Spotify song data using R's ggplot2 package, exploring musical features, popularity trends, and decade-based patterns through static visualizations instead of interactive plotly charts.

In [None]:
# Load Required Libraries
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(forcats)
library(scales)
library(GGally)
library(viridis)
library(RColorBrewer)
library(gridExtra)
library(corrplot)

# Suppress warnings
options(warn = -1)

# Set display options
options(digits = 2)
options(width = 120)

# List input files
input_dir <- file.path(getwd(), "..", "..", "data-files", "spotify_data")
file_list <- list.files(input_dir, recursive = TRUE, full.names = TRUE)
print(file_list)

## Import and Clean Data

We'll start by importing the Spotify dataset and cleaning it:
- Converting musical keys (numbers) to standard music notation (C, C#, etc.)
- Converting mode from binary (0/1) to human-readable format (Minor/Major)
- Removing duplicate songs (same artists and name)
- Converting duration from milliseconds to minutes
- Dropping unused columns and keeping only songs of 5 minutes or less
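
The cleaning steps listed above can be sketched on a few made-up rows. The `toy` data frame and its values are assumptions for illustration only; the column names mirror the ones used in this notebook.

```r
# Toy rows standing in for the Spotify data (values are made up)
toy <- data.frame(
  artists = c("['A']", "['A']", "['B']"),
  name = c("Song 1", "Song 1", "Song 2"),
  key = c(0, 0, 11),
  mode = c(1, 1, 0),
  duration_ms = c(180000, 180000, 240000),
  stringsAsFactors = FALSE
)

# Key codes 0..11 map to positions 1..12 of the note names
key_names <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")
toy$key <- key_names[toy$key + 1]

# Named-vector lookup: index by the code converted to character
mode_names <- c("0" = "Minor", "1" = "Major")
toy$mode <- unname(mode_names[as.character(toy$mode)])

# Keep the first occurrence of each (artists, name) pair
toy <- toy[!duplicated(toy[, c("artists", "name")]), ]

# Milliseconds to minutes
toy$duration_min <- toy$duration_ms / 60000
print(toy)
```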

In [None]:
# Read the data
data_file_path <- file.path(getwd(), "..", "..", "data-files", "spotify_data", "data.csv")
df <- read_csv(data_file_path)

# Display data structure
str(df)

# Display first few rows
head(df)

In [None]:
# Define mappings as named vectors
map_key <- c(
  "0" = "C", "1" = "C#", "2" = "D", "3" = "D#", "4" = "E",
  "5" = "F", "6" = "F#", "7" = "G", "8" = "G#", "9" = "A",
  "10" = "A#", "11" = "B"
)

map_mode <- c("1" = "Major", "0" = "Minor")

# Apply the mappings to the 'mode' and 'key' columns
# (unname() drops the lookup names so the columns hold plain character vectors)
df$mode <- unname(map_mode[as.character(df$mode)])
df$key <- unname(map_key[as.character(df$key)])

# Flag duplicates: rows that repeat an earlier (artists, name) pair
is_dup <- duplicated(df[, c("artists", "name")])

# Drop duplicates with a logical mask; unlike df[-which(is_dup), ],
# this keeps every row when no duplicates are found
cat("Before dropping duplicates:", nrow(df), "rows\n")
df <- df[!is_dup, ]
cat("After dropping duplicates:", nrow(df), "rows\n")

# Convert duration from milliseconds to minutes
df$duration_min <- df$duration_ms / 60000

# Drop unused columns
df$release_date <- NULL
df$id <- NULL
df$duration_ms <- NULL

# Filter to songs with duration <= 5 minutes
df <- df %>% filter(duration_min <= 5)

# Summary of cleaned data
summary(df)
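
A note on duplicate removal: dropping rows with a negative index built from `which()` behaves unexpectedly when nothing matches, because an empty negative index selects zero rows. A minimal sketch on made-up data, comparing that pattern with a logical mask:

```r
# Made-up data frame with no duplicated rows
toy <- data.frame(artists = c("A", "B", "C"), name = c("x", "y", "z"))

dup_idx <- which(duplicated(toy))   # integer(0): nothing is duplicated

# Negative indexing with an empty vector selects zero rows
by_index <- toy[-dup_idx, ]

# A logical mask keeps every row when there is nothing to drop
by_mask <- toy[!duplicated(toy), ]

cat("Rows kept by negative index:", nrow(by_index), "\n")
cat("Rows kept by logical mask:", nrow(by_mask), "\n")
```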

## Basic Data Exploration with ggplot2

We use ggplot2 to visualize the distribution of each musical feature in the dataset,
such as valence, acousticness, and danceability, with a histogram and a boxplot per feature.
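
As a numeric companion to the plots, the mean, median, and quartiles that histograms and boxplots display can also be tabulated directly. This is a sketch on simulated values; `toy_features` and its two columns are hypothetical stand-ins for the dataset's feature columns.

```r
set.seed(1)

# Simulated features on a 0-1 scale (not real Spotify values)
toy_features <- data.frame(
  valence = runif(200),
  danceability = rbeta(200, 5, 3)
)

# One row per feature, one column per statistic
feature_summary <- t(sapply(toy_features, function(x) {
  c(
    mean = mean(x),
    median = median(x),
    q25 = unname(quantile(x, 0.25)),
    q75 = unname(quantile(x, 0.75))
  )
}))

print(round(feature_summary, 2))
```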

In [None]:
# Define a theme for consistent visualization
theme_spotify <- function() {
  theme_minimal() +
    theme(
      plot.title = element_text(size = 14, face = "bold"),
      plot.subtitle = element_text(size = 12),
      axis.title = element_text(size = 10),
      legend.position = "right",
      panel.grid.minor = element_blank()
    )
}

# Function to create distribution plots with ggplot2
create_distribution_plot <- function(df, column) {
  values <- df[[column]]
  col_mean <- mean(values, na.rm = TRUE)
  col_median <- median(values, na.rm = TRUE)
  # Tallest bar of a base-R histogram, used as a rough height for the labels
  max_count <- max(hist(values, plot = FALSE)$counts)
  # Horizontal label offset scaled to the data range, so it also suits
  # features that are not on a 0-1 scale (tempo, loudness, duration_min)
  offset <- 0.05 * diff(range(values, na.rm = TRUE))

  # .data[[column]] replaces the deprecated aes_string()
  p1 <- ggplot(df, aes(x = .data[[column]])) +
    geom_histogram(bins = 30, fill = "#1DB954", alpha = 0.7) +
    # Constant intercepts go outside aes(); linewidth replaces size for
    # lines in ggplot2 >= 3.4.0
    geom_vline(xintercept = col_mean,
               color = "#E3CF7A", linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = col_median,
               color = "#F28C28", linetype = "dotted", linewidth = 1) +
    annotate("text", x = col_mean + offset, y = max_count * 0.9,
             label = paste("Mean:", round(col_mean, 2))) +
    annotate("text", x = col_median - offset, y = max_count * 0.8,
             label = paste("Median:", round(col_median, 2))) +
    labs(title = paste("Distribution of", column),
         x = column, y = "Count") +
    theme_spotify()

  p2 <- ggplot(df, aes(y = .data[[column]])) +
    geom_boxplot(fill = "#1DB954", alpha = 0.7) +
    labs(title = paste("Boxplot of", column),
         x = "", y = column) +
    theme_spotify() +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

  # Return both plots to be arranged with grid.arrange
  return(list(p1, p2))
}

# Define the music values to analyze
music_vals <- c(
  "valence", "acousticness", "danceability", "duration_min",
  "energy", "instrumentalness", "liveness", "loudness",
  "popularity", "speechiness", "tempo"
)

# Create and display distribution plots for first few features
for (col in music_vals[1:4]) {
  plots <- create_distribution_plot(df, col)
  grid.arrange(plots[[1]], plots[[2]], ncol = 2, widths = c(2, 1))
}

In [None]:
# Create density plots for key musical features
musical_features <- c("valence", "energy", "danceability", "acousticness")

# Reshape to long format so a single geom_density() covers every listed feature
df %>%
  select(all_of(musical_features)) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value, fill = feature)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Comparison of Key Musical Features",
       x = "Value", y = "Density",
       fill = "Feature") +
  scale_fill_brewer(palette = "Set1") +
  theme_spotify() +
  theme(legend.position = "top")

## Decade Analysis

Categorizing songs by decade to analyze musical trends over time. This helps us understand
how musical preferences and styles have evolved from the 1920s to the 2010s.
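
One way to assign a decade label is integer arithmetic on the year, which places 1930 in the 1930s rather than the 1920s. A minimal sketch; the helper name `decade_label` is hypothetical, and unlike a fixed set of bins ending at 2020 it labels the year 2020 as "2020s".

```r
# Hypothetical helper: floor the year to its decade and append "s"
decade_label <- function(year) {
  paste0((year %/% 10) * 10, "s")
}

years <- c(1921, 1930, 1999, 2000, 2020)
labels <- decade_label(years)
print(setNames(labels, years))
```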

In [None]:
# Create decade bins: left-closed intervals, so 1920-1929 -> "1920s", and so on
decade_breaks <- seq(1920, 2020, by = 10)
decade_labels <- paste0(head(decade_breaks, -1), "s")

# right = FALSE keeps each decade's first year in its own bin;
# include.lowest = TRUE closes the final bin, so 2020 is grouped with the 2010s
df$decade <- cut(df$year, breaks = decade_breaks, labels = decade_labels,
                 right = FALSE, include.lowest = TRUE)

# Display counts by decade
decade_counts <- df %>%
  count(decade) %>%
  arrange(decade)

# Plot songs per decade using ggplot2 (decade labels already sort chronologically)
ggplot(decade_counts, aes(x = decade, y = n)) +
  geom_col(fill = "#1DB954", alpha = 0.8) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Count of Songs per Decade",
       x = "Decade", y = "Number of Songs") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [None]:
# Analyze trends in musical features across decades
decade_features <- df %>%
  group_by(decade) %>%
  summarize(across(
    .cols = all_of(c("valence", "energy", "acousticness", "danceability")),
    .fns = ~ mean(.x, na.rm = TRUE),  # ignore missing values, as in the later cells
    .names = "{.col}"
  ))

# Convert to long format for faceted plotting
decade_features_long <- decade_features %>%
  pivot_longer(cols = c("valence", "energy", "acousticness", "danceability"),
               names_to = "feature", 
               values_to = "value")

# Create faceted plot of features by decade
ggplot(decade_features_long, aes(x = decade, y = value, group = feature)) +
  geom_line(aes(color = feature), linewidth = 1) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(aes(color = feature), size = 3) +
  facet_wrap(~feature, scales = "free_y") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Musical Features Trends Across Decades",
       x = "Decade", y = "Average Value",
       color = "Feature") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [None]:
# Boxplots of popularity by decade
ggplot(df, aes(x = decade, y = popularity, fill = decade)) +
  geom_boxplot() +
  scale_fill_viridis_d() +
  labs(title = "Distribution of Song Popularity by Decade",
       x = "Decade", y = "Popularity") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

## Key and Mode Analysis

This section analyzes the distribution of musical keys and modes across the dataset and their
relationship with song popularity. Musical keys (C, C#, D, etc.) and modes (Major/Minor) are
commonly associated with a song's emotional tone.

In [None]:
# Create key-mode combinations
df$key_mode <- paste(df$key, "-", df$mode)

# Analyze distribution of keys
key_counts <- df %>%
  count(key) %>%
  arrange(desc(n))

# Visualize key distribution
ggplot(key_counts, aes(x = reorder(key, -n), y = n)) +
  geom_bar(stat = "identity", fill = "#1DB954", alpha = 0.8) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Distribution of Musical Keys",
       x = "Key", y = "Number of Songs") +
  theme_spotify()

In [None]:
# Analyze distribution of modes
mode_counts <- df %>%
  count(mode) %>%
  arrange(desc(n))

# Visualize mode distribution
ggplot(mode_counts, aes(x = mode, y = n, fill = mode)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_fill_manual(values = c("Major" = "#1DB954", "Minor" = "#F28C28")) +
  labs(title = "Distribution of Musical Modes",
       x = "Mode", y = "Number of Songs") +
  theme_spotify() +
  theme(legend.position = "none")

In [None]:
# Mode distribution by decade
decade_mode <- df %>%
  count(decade, mode) %>%
  group_by(decade) %>%
  mutate(percentage = n / sum(n) * 100)

# Visualize mode distribution by decade
ggplot(decade_mode, aes(x = decade, y = percentage, fill = mode)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_manual(values = c("Major" = "#1DB954", "Minor" = "#F28C28")) +
  labs(title = "Distribution of Musical Modes by Decade",
       x = "Decade", y = "Percentage", fill = "Mode") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
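
The per-decade percentages are shares within each decade, so they sum to 100 for every decade. The same computation can be cross-checked in base R with `prop.table()`. The toy vectors here are made up for illustration.

```r
# Made-up decade and mode labels for seven songs
decade <- c("1990s", "1990s", "1990s", "2000s", "2000s", "2000s", "2000s")
mode <- c("Major", "Major", "Minor", "Major", "Minor", "Minor", "Minor")

counts <- table(decade, mode)

# margin = 1 normalizes each row (decade) separately
shares <- prop.table(counts, margin = 1) * 100
print(round(shares, 1))
```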

In [None]:
# Key mode vs popularity
key_mode_popularity <- df %>%
  group_by(key_mode) %>%
  summarise(
    avg_popularity = mean(popularity, na.rm = TRUE),
    count = n()
  ) %>%
  filter(count > 20) %>%  # Filter to key-modes with sufficient data
  arrange(desc(avg_popularity))

# Visualize key-mode vs popularity
ggplot(key_mode_popularity, aes(x = reorder(key_mode, avg_popularity), y = avg_popularity)) +
  geom_bar(stat = "identity", aes(fill = avg_popularity)) +
  scale_fill_viridis() +
  coord_flip() +
  labs(title = "Average Popularity by Musical Key and Mode",
       x = "Key-Mode Combination", y = "Average Popularity") +
  theme_spotify() +
  theme(legend.position = "none")

In [None]:
# Create a heatmap of key-mode combinations across decades
key_mode_decade <- df %>%
  count(decade, key_mode) %>%
  group_by(decade) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  ungroup()

# Pivot for heatmap
key_mode_heatmap <- key_mode_decade %>%
  select(decade, key_mode, n) %>%
  pivot_wider(names_from = decade, values_from = n, values_fill = 0)

# Convert back to long form for ggplot2
key_mode_heatmap_long <- key_mode_heatmap %>%
  pivot_longer(cols = -key_mode, names_to = "decade", values_to = "count")

# Create heatmap
ggplot(key_mode_heatmap_long, aes(x = decade, y = key_mode, fill = count)) +
  geom_tile() +
  scale_fill_viridis(name = "Count") +
  labs(title = "Distribution of Key-Mode Combinations Across Decades",
       x = "Decade", y = "Key-Mode") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Artist Popularity Analysis

Extracting artist information and analyzing the most popular artists in the dataset, including 
their song counts and musical characteristics.
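
The `artists` column is assumed here to hold Python-style list strings such as `['Artist A', 'Artist B']`; that format is inferred from the cleaning code in this section, and the sample strings are made up. A sketch of extracting the first artist:

```r
# Made-up examples of the assumed string format
raw_artists <- c("['Artist A', 'Artist B']", "['Solo Artist']")

# Strip brackets and single quotes, split on commas, keep the first entry
first_artist <- vapply(
  strsplit(gsub("\\[|\\]|'", "", raw_artists), ","),
  function(parts) trimws(parts[1]),
  character(1)
)
print(first_artist)
```

Names that themselves contain apostrophes or commas are altered or truncated by this approach, so the result is best treated as an approximation.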

In [None]:
# Extract first artist from the artists list
df$first_artist <- sapply(
  strsplit(gsub("\\[|\\]|'", "", df$artists), ","), 
  function(x) trimws(x[1])
)

# Find most popular artists with at least 45 songs
popularity_check <- df %>%
  group_by(first_artist) %>%
  summarise(
    mean_popularity = mean(popularity, na.rm = TRUE),
    count = n(),
    median_popularity = median(popularity, na.rm = TRUE)
  ) %>%
  filter(count >= 45) %>%  # ">=" matches "at least 45" in the comment and plot subtitle
  arrange(desc(mean_popularity)) %>%
  head(50)

# Display the top 10 most popular artists
head(popularity_check, 10)

In [None]:
# Visualize top 20 artists by popularity
top_20_artists <- head(popularity_check, 20)

ggplot(top_20_artists, aes(x = reorder(first_artist, mean_popularity), y = mean_popularity)) +
  geom_bar(stat = "identity", aes(fill = count)) +
  scale_fill_viridis(name = "Number of Songs") +
  coord_flip() +
  labs(title = "Top 20 Artists by Average Song Popularity",
       subtitle = "For artists with at least 45 songs",
       x = "Artist", y = "Average Popularity") +
  theme_spotify()

In [None]:
# Get songs by top 50 artists
top_50_artists <- df %>%
  filter(first_artist %in% popularity_check$first_artist)

# Analyze musical features by top artists
top_artist_features <- top_50_artists %>%
  group_by(first_artist) %>%
  summarise(
    avg_valence = mean(valence, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_danceability = mean(danceability, na.rm = TRUE),
    avg_acousticness = mean(acousticness, na.rm = TRUE),
    avg_popularity = mean(popularity, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_popularity))

# Select top 10 for visualization
top_10_artist_features <- head(top_artist_features, 10)

# Convert to long format for faceted visualization
top_artist_features_long <- top_10_artist_features %>%
  pivot_longer(
    cols = starts_with("avg_"),
    names_to = "feature",
    values_to = "value"
  ) %>%
  mutate(
    feature = gsub("avg_", "", feature),
    artist = factor(first_artist, levels = top_10_artist_features$first_artist)
  )

# Create faceted plot of musical features for top 10 artists
ggplot(top_artist_features_long, aes(x = artist, y = value, fill = feature)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~feature, scales = "free_y") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Musical Feature Comparison of Top 10 Artists",
       x = "", y = "Average Value") +
  theme_spotify() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

In [None]:
# Distribution of decades for top artists
decades_count <- top_50_artists %>%
  count(decade) %>%
  arrange(decade)

# Visualize decades in top artists' songs
ggplot(decades_count, aes(x = decade, y = n, fill = decade)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_fill_viridis_d() +
  labs(title = "Distribution of Decades in Top Artists' Songs",
       x = "Decade", y = "Number of Songs") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

## Feature Correlation Visualization

Creating visualizations that show correlations between different musical features, helping
us understand how these attributes relate to each other and to song popularity.
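
A correlation matrix has to be reshaped into one row per pair of features before ggplot2 can draw it as tiles. A minimal sketch of that reshaping, using the built-in `mtcars` data as a stand-in for the Spotify features:

```r
# Correlations among three built-in columns (stand-ins for musical features)
corr_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])

# as.table() + as.data.frame() yields one row per (row, column) cell
corr_long <- as.data.frame(as.table(corr_matrix))
names(corr_long) <- c("Feature1", "Feature2", "Correlation")
print(corr_long)
```

A 3 x 3 matrix becomes 9 rows, including the diagonal (each feature with itself) and both orderings of every pair.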

In [None]:
# Select numeric features for correlation analysis
cols <- c(
  "valence", "acousticness", "danceability", "duration_min",
  "energy", "instrumentalness", "liveness", "loudness",
  "year", "speechiness", "tempo", "popularity"
)

# Calculate correlation matrix
corr_matrix <- cor(df[cols], use = "complete.obs")

# Create correlation heatmap using ggplot2
corr_df <- as.data.frame(as.table(corr_matrix))
names(corr_df) <- c("Feature1", "Feature2", "Correlation")

ggplot(corr_df, aes(x = Feature1, y = Feature2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0, limits = c(-1, 1)
  ) +
  geom_text(aes(label = sprintf("%.2f", Correlation)), size = 2.5) +
  labs(title = "Correlation Matrix of Musical Features",
       x = "", y = "") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [None]:
# Create pair plots for selected key features
key_features <- c("popularity", "valence", "energy", "danceability", "year")

# Sample a subset of data for the pair plot (to avoid overplotting)
set.seed(123)
sample_df <- df %>% sample_n(min(5000, nrow(df)))

# Create pair plot with GGally
ggpairs(
  sample_df, 
  columns = key_features,
  upper = list(continuous = "cor"),
  diag = list(continuous = "barDiag"),
  lower = list(continuous = "smooth")
) +
  labs(title = "Pair Plot of Key Musical Features") +
  theme_spotify()

## Music Feature Trends Over Time

Analyzing how musical characteristics have changed over time. The yearly trend plots focus on
1950 onward, where the dataset has better coverage.
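
The trend analysis rests on two steps: averaging a feature per year, then smoothing the yearly averages. A base-R sketch on simulated data follows; the `toy` data frame, its upward trend, and the noise level are assumptions made for illustration, not properties of the Spotify data.

```r
set.seed(42)

# Simulated songs: 10 per year, with a slow upward drift in "energy"
toy <- data.frame(year = rep(1950:2020, each = 10))
toy$energy <- 0.3 + 0.005 * (toy$year - 1950) + rnorm(nrow(toy), sd = 0.1)

# Step 1: average per year
yearly <- aggregate(energy ~ year, data = toy, FUN = mean)

# Step 2: LOESS smooth of the yearly averages (default span)
fit <- loess(energy ~ year, data = yearly)
yearly$smoothed <- predict(fit)

print(head(yearly))
```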

In [None]:
# Analyze musical feature trends over time by year
year_features <- df %>%
  group_by(year) %>%
  filter(n() > 5) %>%  # Filter to years with sufficient data
  summarise(
    avg_valence = mean(valence, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_acousticness = mean(acousticness, na.rm = TRUE),
    avg_danceability = mean(danceability, na.rm = TRUE),
    avg_popularity = mean(popularity, na.rm = TRUE),
    count = n()
  ) %>%
  filter(year >= 1950) # Focus on more recent years with better data

# Convert to long format for plotting
year_features_long <- year_features %>%
  pivot_longer(
    cols = starts_with("avg_"),
    names_to = "feature",
    values_to = "value"
  ) %>%
  mutate(feature = gsub("avg_", "", feature))

# Create time series plots with smoothed trend lines
ggplot(year_features_long, aes(x = year, y = value, color = feature)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = TRUE) +
  facet_wrap(~feature, scales = "free_y") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Musical Feature Trends Over Time (1950-2020)",
       x = "Year", y = "Average Value",
       color = "Feature") +
  theme_spotify() +
  theme(legend.position = "none")

In [None]:
# Analyze the evolution of key preferences over time
key_year <- df %>%
  filter(year >= 1950) %>%
  group_by(year) %>%
  filter(n() > 30) %>%  # Filter to years with sufficient data
  group_by(year, key) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(year) %>%
  mutate(percentage = count / sum(count) * 100)

# Visualize the evolution of key preferences
ggplot(key_year, aes(x = year, y = percentage, fill = key)) +
  geom_area() +
  scale_fill_brewer(palette = "Paired", name = "Musical Key") +  # 12 keys need a 12-color palette; "Spectral" tops out at 11
  labs(title = "Evolution of Musical Key Preferences (1950-2020)",
       x = "Year", y = "Percentage of Songs") +
  theme_spotify()
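
The area chart maps 12 musical keys to fill colors, so a ColorBrewer palette needs at least 12 colors; a smaller palette can leave some keys without a color. The limits can be checked with `brewer.pal.info` from RColorBrewer. The three palettes compared here are just examples.

```r
library(RColorBrewer)

n_keys <- 12
palettes <- c("Spectral", "Set3", "Paired")

# Maximum number of colors each palette provides
max_colors <- setNames(brewer.pal.info[palettes, "maxcolors"], palettes)
print(max_colors)

# Palettes large enough for all 12 keys
print(names(max_colors)[max_colors >= n_keys])
```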

In [None]:
# Analyze the relationship between year and popularity
# (geom_hex() requires the hexbin package to be installed)
ggplot(df, aes(x = year, y = popularity)) +
  geom_hex(bins = 30) +
  scale_fill_viridis(name = "Count") +
  geom_smooth(color = "white", se = FALSE) +
  labs(title = "Relationship Between Release Year and Song Popularity",
       x = "Release Year", y = "Popularity Score") +
  theme_spotify()

In [None]:
# Compare musical characteristics of popular songs across different eras
popularity_threshold <- 75

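# Eras as a factor with explicit levels so the legend and bars follow chronological order
era_levels <- c("Pre-1970s", "1970s-1980s", "1990s-2000s", "2010s+")

popular_songs <- df %>%
  filter(popularity >= popularity_threshold) %>%
  mutate(era = factor(case_when(
    year < 1970 ~ "Pre-1970s",
    year < 1990 ~ "1970s-1980s",
    year < 2010 ~ "1990s-2000s",
    TRUE ~ "2010s+"
  ), levels = era_levels))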

# Calculate average features by era for popular songs
popular_era_features <- popular_songs %>%
  group_by(era) %>%
  summarise(
    avg_valence = mean(valence, na.rm = TRUE),
    avg_energy = mean(energy, na.rm = TRUE),
    avg_acousticness = mean(acousticness, na.rm = TRUE),
    avg_danceability = mean(danceability, na.rm = TRUE),
    song_count = n()
  ) %>%
  arrange(factor(era, levels = c("Pre-1970s", "1970s-1980s", "1990s-2000s", "2010s+")))

# Convert to long format for visualization
popular_era_long <- popular_era_features %>%
  pivot_longer(
    cols = starts_with("avg_"),
    names_to = "feature",
    values_to = "value"
  ) %>%
  mutate(feature = gsub("avg_", "", feature))

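# Wide table (one column per era) for side-by-side inspection of the averages;
# song_count is dropped so each feature collapses to a single row
popular_era_wide <- popular_era_long %>%
  select(-song_count) %>%
  pivot_wider(names_from = era, values_from = value)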

# Create grouped bar chart of features by era
ggplot(popular_era_long, aes(x = feature, y = value, fill = era)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%.2f", value)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  scale_fill_brewer(palette = "Set2", name = "Era") +
  labs(title = paste0("Musical Features of Popular Songs (Popularity ≥ ", popularity_threshold, ")"),
       subtitle = "Comparison Across Different Eras",
       x = "", y = "Average Value") +
  theme_spotify() +
  theme(axis.text.x = element_text(angle = 0))
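
The era boundaries can be checked on a handful of boundary years. This sketch uses base R's `cut()` as an alternative to `case_when()`; with `right = FALSE` each interval includes its lower bound, which matches the `year < ...` conditions. The sample years are made up.

```r
era_levels <- c("Pre-1970s", "1970s-1980s", "1990s-2000s", "2010s+")

# Years on either side of each boundary
years <- c(1969, 1970, 1989, 1990, 2009, 2010)

# Intervals: [-Inf, 1970), [1970, 1990), [1990, 2010), [2010, Inf)
era <- cut(
  years,
  breaks = c(-Inf, 1970, 1990, 2010, Inf),
  labels = era_levels,
  right = FALSE
)

print(data.frame(year = years, era = era))
```

Because `cut()` returns a factor whose levels follow the order of `labels`, plots built on it list the eras chronologically rather than alphabetically.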

## Conclusion

This analysis has explored the Spotify dataset using ggplot2 visualizations, revealing:

1. Distributions and trends in musical features across songs
2. Evolution of musical characteristics over different decades
3. Popularity patterns related to musical keys and modes
4. Notable artists and their musical signatures
5. Correlations between different musical attributes

These insights give a descriptive picture of which musical characteristics accompany
popularity and how the music in this dataset has changed over time; they do not by
themselves establish what makes a song popular. The static ggplot2 visualizations
communicate these patterns clearly.