# Data Visualizations

This notebook visualizes the processed data stored in the SQLite database created in the Data Processing notebook. The visualizations address the leading project question (how do users' listening habits vary based on the time of day?) and help identify trends related to playlist metadata and user preferences.

Using the `Lets-Plot` library, visualizations will be created to provide insights into the relationships between playlist attributes.

## 1. Preparing the Environment

Before visualizing the data, the required libraries must be imported and the processed SQLite database from the previous notebook must be loaded into Pandas DataFrames.

In [28]:
# Library loading
import pandas as pd
from lets_plot import *
import sqlite3

# Removing warnings to reduce visual clutter
import warnings
warnings.filterwarnings("ignore")

# Initialize Lets-Plot
LetsPlot.setup_html()

# Path to the SQLite database (the connection itself is opened in Section 2)
database_path = "../data/processed_playlists.db"

## 2. Data Preparation

To create visualizations, the necessary tables must be loaded into Pandas DataFrames. The data is then transformed as needed for analysis and visualization.

In [3]:
# Connect to the SQLite database
conn = sqlite3.connect(database_path)

# Load tables into DataFrames
playlists_df = pd.read_sql_query("SELECT * FROM playlists;", conn)
tracks_df = pd.read_sql_query("SELECT * FROM tracks;", conn)
artists_df = pd.read_sql_query("SELECT * FROM artists;", conn)
track_artist_df = pd.read_sql_query("SELECT * FROM track_artist;", conn)

# Close connection
conn.close()

# Clean the data

# Drop rows with missing followers or genre
playlists_df.dropna(subset=["followers"], inplace=True)
tracks_df.dropna(subset=["genres"], inplace=True)

# Normalize column names
playlists_df.columns = playlists_df.columns.str.lower()
tracks_df.columns = tracks_df.columns.str.lower()
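
The loading step can also be wrapped in a small helper so that the connection is closed even if a query fails. The following is a minimal sketch rather than the notebook's actual code: `load_tables` is a hypothetical helper, and the in-memory database with a two-row `playlists` table is made up to stand in for `processed_playlists.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_tables(conn, table_names):
    """Read each named table into a DataFrame, keyed by table name."""
    # Table names cannot be bound as SQL parameters, so pass only trusted names.
    return {
        name: pd.read_sql_query(f'SELECT * FROM "{name}";', conn)
        for name in table_names
    }


# Hypothetical stand-in for ../data/processed_playlists.db
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE playlists (playlist_id TEXT, followers INTEGER)")
    demo_conn.executemany(
        "INSERT INTO playlists VALUES (?, ?)",
        [("p1", 100), ("p2", None)],
    )
    tables = load_tables(demo_conn, ["playlists"])

# Same cleaning step as above: drop playlists without a follower count
demo_playlists = tables["playlists"].dropna(subset=["followers"])
print(demo_playlists)
```

`contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only manages the transaction and does not close the connection.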

## 3. Visualizations

### Genre Distribution by Time of Day (Line Graph)
This line graph shows the distribution of the most popular genres across four time periods: morning, afternoon, evening, and night. The genres were selected as the two most common for each time period, with duplicates removed.

- **X-Axis**: Time periods in chronological order.
- **Y-Axis**: Count of tracks for each genre.
- **Legend**: The top genres included in the analysis.

The visualization highlights shifts in genre popularity throughout the day, such as Pop peaking in the evening and Rap gaining prominence at night.

In [59]:
# Merge playlists and tracks on playlist_id
merged_df = pd.merge(tracks_df, playlists_df, on='playlist_id')

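# Explode genres column into individual rows (one row per track-genre pair)
merged_df['genres'] = merged_df['genres'].str.split(',')
merged_df = merged_df.explode('genres')
# Strip whitespace left by the split so that " rock" and "rock" are counted together
merged_df['genres'] = merged_df['genres'].str.strip()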

# Ensure time_of_day is ordered correctly
time_order = ['morning', 'afternoon', 'evening', 'night']
merged_df['time_of_day'] = pd.Categorical(merged_df['time_of_day'], categories=time_order, ordered=True)

# Group and count by time_of_day and genres
agg_data = merged_df.groupby(['time_of_day', 'genres']).size().reset_index(name='count')

# Identify the top 2 genres for each time_of_day
top_genres_by_period = agg_data.sort_values(['time_of_day', 'count'], ascending=[True, False]) \
                                .groupby('time_of_day')['genres'] \
                                .head(2)

# Create a unique set of the top genres
top_genres_set = set(top_genres_by_period)

# Filter the data to include only the top genres
filtered_data = agg_data[agg_data['genres'].isin(top_genres_set)]

# Define a colorblind-friendly palette. Taking the top 2 genres from each of
# the 4 periods can yield up to 8 distinct genres, so 8 colors are provided.
colorblind_palette = [
    '#332288',  # Indigo
    '#44AA99',  # Teal
    '#DDCC77',  # Sand
    '#CC6677',  # Rose
    '#AA4499',  # Purple
    '#88CCEE',  # Cyan
    '#117733',  # Green
    '#882255'   # Wine
]

# Plot
p = (
    ggplot(filtered_data, aes(x='time_of_day', y='count', color='genres', group='genres')) +
    geom_line(size=1.5) +
    geom_point(size=3) +
    ggtitle("Genre Distribution by Time of Day") +
    xlab("Time of Day") +
    ylab("Count of Tracks") +
    scale_x_discrete(limits=time_order) +
    scale_color_manual(values=colorblind_palette) +  # Apply the colorblind-friendly palette
    theme(
        axis_text_x=element_text(angle=45, hjust=1, size=10),  # Adjust axis text
        legend_position="right",  # Place legend to the right for better clarity
        legend_title=element_text(size=10, face="bold"),  # Highlight legend title
        plot_title=element_text(size=14, face="bold", hjust=0.5),  # Centralize and bold title
        panel_background=element_rect(fill="#f9f9f9"),  # Subtle background color
        panel_grid_major=element_line(color="#eaeaea")  # Light grid lines for clarity
    )
)

p

The line graph demonstrates clear shifts in genre popularity throughout the day. Pop music exhibits the highest prominence during the evening, while Rap becomes more significant at night. These trends indicate that listening preferences vary considerably depending on the time of day.
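
The selection of the two most common genres per time period can be checked on a small example. This is a sketch on made-up data: the rows in `toy` are illustrative only, and the comma-separated `genres` format is assumed to match the one in `tracks_df`.

```python
import pandas as pd

# Toy data; the genre strings and periods are made up for illustration
toy = pd.DataFrame({
    "time_of_day": ["morning"] * 4 + ["night"] * 4,
    "genres": [
        "pop, rock", "pop, rock", "pop", "jazz",
        "rap, pop", "rap, pop", "rap", "jazz",
    ],
})

# One row per (track, genre) pair; strip the whitespace left by the split
toy["genres"] = toy["genres"].str.split(",")
toy = toy.explode("genres")
toy["genres"] = toy["genres"].str.strip()

counts = (
    toy.groupby(["time_of_day", "genres"])
    .size()
    .reset_index(name="count")
)

# Keep the two most frequent genres within each period
top2 = (
    counts.sort_values(["time_of_day", "count"], ascending=[True, False])
    .groupby("time_of_day")
    .head(2)
)

# Duplicates across periods collapse when the genres are put in a set
top_genres = set(top2["genres"])
print(top2)
print(top_genres)
```

Here "pop" is in the top two for both periods, so four selected rows reduce to three distinct genres. This is why the number of lines in the graph can be anywhere between two and eight.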

### Genre Diversity by Time of Day (Bar Chart)
This bar chart compares genre diversity across the four time periods. Diversity is measured as the number of unique genres divided by the total number of tracks in each period.

In [56]:
# Count unique genres by time_of_day
unique_genres = merged_df.groupby('time_of_day')['genres'].nunique().reset_index()
unique_genres.rename(columns={'genres': 'unique_genre_count'}, inplace=True)

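# Count rows by time_of_day. After the explode above, each row is one track-genre pair,
# so a track with several genres is counted once per genre.
total_tracks = merged_df.groupby('time_of_day').size().reset_index(name='total_tracks') # reset_index to rename column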

# Merge unique genres and total tracks data
genre_diversity = pd.merge(unique_genres, total_tracks, on='time_of_day')

# Calculate normalized diversity
genre_diversity['diversity_ratio'] = genre_diversity['unique_genre_count'] / genre_diversity['total_tracks']

# Ensure time_of_day is ordered
time_order = ['morning', 'afternoon', 'evening', 'night']
genre_diversity['time_of_day'] = pd.Categorical(genre_diversity['time_of_day'], categories=time_order, ordered=True)

# Plot
p = (
    ggplot(genre_diversity, aes(x='time_of_day', y='diversity_ratio')) +
    geom_bar(stat='identity', fill='#88CCEE', color='#333333', size=0.7) +  # Colorblind-friendly fill
    ggtitle(
        'Genre Diversity by Time of Day',
        subtitle='Genre diversity is the number of unique genres divided by the total number of tracks for each time period.'
    ) +
    xlab('Time of Day') +
    ylab('Diversity Ratio') +  # Simplified y-axis label
    scale_x_discrete(limits=time_order) +
    theme(
        axis_text_x=element_text(angle=45, hjust=1),
        axis_text_y=element_text(size=10),
        plot_title=element_text(size=14, face='bold'),
        plot_subtitle=element_text(size=10, color='#555555', hjust=0.5),  # Style for the subtitle
        legend_position='none',
        panel_background=element_rect(fill='white'),
        panel_grid_major=element_line(color='#D3D3D3')
    )
)
p


The bar chart shows that genre diversity, normalized by the number of tracks, is highest in the afternoon. This suggests that listeners explore a broader variety of genres during this period than at other times of day, although the ratio also depends on how many tracks fall into each period.
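
The diversity ratio is sensitive to what is counted in the denominator. The sketch below uses made-up data to compare two options: dividing by the number of track-genre rows (what `size()` returns on an exploded frame) and dividing by the number of distinct tracks. The `track_id` column and all values are assumptions for illustration.

```python
import pandas as pd

# Toy data: one row per (track, genre) pair, as produced by explode()
rows = pd.DataFrame({
    "time_of_day": ["morning"] * 4 + ["night"] * 2,
    "track_id": ["t1", "t1", "t2", "t3", "t4", "t5"],
    "genres": ["pop", "rock", "pop", "jazz", "rap", "rap"],
})

per_period = rows.groupby("time_of_day").agg(
    unique_genre_count=("genres", "nunique"),
    genre_rows=("genres", "size"),          # track-genre rows
    unique_tracks=("track_id", "nunique"),  # distinct tracks
)

per_period["ratio_by_rows"] = (
    per_period["unique_genre_count"] / per_period["genre_rows"]
)
per_period["ratio_by_tracks"] = (
    per_period["unique_genre_count"] / per_period["unique_tracks"]
)
print(per_period)
```

In this toy case the morning ratio is 0.75 when dividing by rows and 1.0 when dividing by distinct tracks, because track `t1` carries two genres. Which denominator is appropriate depends on the question being asked.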

### Genre Distribution by Time of Day (Stacked Bar Chart)
This stacked bar chart shows the proportion of different genres during each time of day, highlighting the composition of listening habits.

- **X-Axis**: Time periods in chronological order.
- **Y-Axis**: Percentage of total tracks.
- **Legend**: The genres included in the analysis.

In [55]:
# Aggregate total tracks by time_of_day and genre
genre_distribution = merged_df.groupby(['time_of_day', 'genres']).size().reset_index(name='count')

# Normalize counts to percentages
total_tracks_by_time = genre_distribution.groupby('time_of_day')['count'].transform('sum')
genre_distribution['percentage'] = (genre_distribution['count'] / total_tracks_by_time) * 100

# Order time of day properly
time_order = ['morning', 'afternoon', 'evening', 'night']
genre_distribution['time_of_day'] = pd.Categorical(genre_distribution['time_of_day'], categories=time_order, ordered=True)

# Filter to the 10 most common genres overall for clarity.
# Percentages were computed over all genres, so the stacked bars need not sum to 100%.
top_genres = genre_distribution.groupby('genres')['count'].sum().nlargest(10).index # nlargest is a pandas function that returns the n largest values
filtered_data = genre_distribution[genre_distribution['genres'].isin(top_genres)]

colorblind_palette = [
    '#332288',  # Indigo
    '#88CCEE',  # Cyan
    '#44AA99',  # Teal
    '#117733',  # Green
    '#999933',  # Olive
    '#DDCC77',  # Sand
    '#CC6677',  # Rose
    '#882255',  # Wine
    '#AA4499',  # Purple
    '#DDDDDD'   # Pale Grey
]

# Plot
p = (
    ggplot(filtered_data, aes(x='time_of_day', y='percentage', fill='genres')) +
    geom_bar(stat='identity', position='stack') +
    ggtitle(
        'Top 10 Genres Distributed by Time of Day (Percentage)',
        subtitle='This chart shows the percentage contribution of the top 10 genres across time periods.'
    ) +
    xlab('Time of Day') +
    ylab('Percentage of Tracks') +
    scale_x_discrete(limits=time_order) +
    scale_fill_manual(
        values=colorblind_palette,
        name='Genres'
    ) +
    theme(
        axis_text_x=element_text(angle=45, hjust=1),
        axis_text_y=element_text(size=10),
        plot_title=element_text(size=14, face='bold'),
        plot_subtitle=element_text(size=10, color='#555555', hjust=0.5),
        legend_position='right',  # Move the legend to the right
        legend_text=element_text(size=10),
        legend_title=element_text(size=12, face='bold'),
        panel_background=element_rect(fill='white'),
        panel_grid_major=element_line(color='#D3D3D3'),
    )
)
p



The stacked bar chart illustrates the proportional distribution of genres across different time periods. Pop music is dominant in the evening, while genres like Lo-Fi Study are more prevalent during the morning and night. This highlights how listening habits and preferences align with the context and mood of different times of day.
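
The percentage normalization behind the chart can be illustrated on a small example. This sketch uses made-up counts and keeps the top 2 genres instead of the top 10; it shows that when percentages are computed before filtering, the displayed shares in a period may add up to less than 100%.

```python
import pandas as pd

# Made-up genre counts per time period
counts = pd.DataFrame({
    "time_of_day": ["morning", "morning", "morning", "night", "night"],
    "genres": ["pop", "rock", "jazz", "rap", "pop"],
    "count": [6, 3, 1, 4, 1],
})

# Share of each genre within its own time period
period_totals = counts.groupby("time_of_day")["count"].transform("sum")
counts["percentage"] = counts["count"] / period_totals * 100

# Keep only the overall top 2 genres (the notebook keeps the top 10)
top = counts.groupby("genres")["count"].sum().nlargest(2).index
shown = counts[counts["genres"].isin(top)]

# Percentage covered by the displayed genres in each period
covered = shown.groupby("time_of_day")["percentage"].sum()
print(covered)
```

In this example the displayed genres cover 60% of the morning rows and 100% of the night rows, so bar heights differ between periods even though each genre's share is computed relative to its period total.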

### Playlist Followers by Time of Day (Line Graph)

This line graph visualizes the average number of followers for playlists featuring the top 10 most frequently appearing artists across different times of day (morning, afternoon, evening, and night). The visualization helps identify patterns in playlist popularity for top artists based on when these playlists are typically listened to, with each point representing the mean follower count for a given artist during that time period.

- **X-Axis:** Time periods in chronological order.
- **Y-Axis:** Number of followers.
- **Legend:** The most popular artists included in the analysis.

In [60]:
# Merge tracks with track-artist relationships
tracks_with_artists = pd.merge(tracks_df, track_artist_df, on='track_id')

# Add artist names
tracks_with_artists = pd.merge(tracks_with_artists, artists_df.rename(columns={'name': 'artist_name'}), on='artist_id')

# Merge with playlist data
tracks_with_playlists = pd.merge(tracks_with_artists, playlists_df, on='playlist_id')

# Identify the top 10 artists by appearance in playlists
top_artists = (
    tracks_with_playlists
    .groupby('artist_name')
    .size()
    .sort_values(ascending=False)
    .head(10)
    .index
)

# Filter for top artists and aggregate to get average followers per time of day
scatter_data = (
    tracks_with_playlists[tracks_with_playlists['artist_name'].isin(top_artists)]
    .groupby(['time_of_day', 'artist_name'])['followers']
    .mean()
    .reset_index()
)

# Ensure correct time order
time_order = ['morning', 'afternoon', 'evening', 'night']
scatter_data['time_of_day'] = pd.Categorical(
    scatter_data['time_of_day'], 
    categories=time_order, 
    ordered=True
)

# Plot

colorblind_palette = [
    '#332288',  # Indigo
    '#88CCEE',  # Cyan
    '#44AA99',  # Teal
    '#117733',  # Green
    '#999933',  # Olive
    '#DDCC77',  # Sand
    '#CC6677',  # Rose
    '#882255',  # Wine
    '#AA4499',  # Purple
    '#DDDDDD'   # Pale Grey
]

p = (
    ggplot(scatter_data, aes(x='time_of_day', y='followers', color='artist_name', group='artist_name')) +
    geom_line(size=1) +
    geom_point(size=3) +  # Keep points for clarity at each time
    ggtitle('Playlist Followers Throughout the Day (Top 10 Artists)') +
    xlab('Time of Day') +
    ylab('Average Number of Followers') +
    scale_x_discrete(limits=time_order) +  # Keep the periods in chronological order
    scale_color_manual(values=colorblind_palette) +
    labs(color="Artist Name") +
    theme(axis_text_x=element_text(angle=45, hjust=1))
)

p

Morning playlists attract the highest average follower counts (250,000 to 350,000), particularly for Post Malone, The Weeknd, and Harry Styles. Most artists have markedly lower averages during the afternoon and evening, though some (Post Malone, Justin Bieber) see a resurgence at night. A few artists, such as Ed Sheeran and Taylor Swift, maintain steady but lower follower counts across all time periods. These patterns suggest that playlist popularity peaks in the morning, with a smaller surge at night for select artists.
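
When averaging followers over track-level rows, a playlist that contains several tracks by the same artist contributes several times to that artist's mean. The sketch below, on made-up data, contrasts this with averaging over distinct playlists. It is an illustrative alternative under assumed column names, not the method used above.

```python
import pandas as pd

# Toy track-level rows; playlist p1 holds three tracks by the same artist
rows = pd.DataFrame({
    "playlist_id": ["p1", "p1", "p1", "p2"],
    "artist_name": ["Artist A"] * 4,
    "time_of_day": ["morning"] * 4,
    "followers": [100, 100, 100, 500],
})

# Mean over track rows: p1 is counted three times
mean_over_rows = (
    rows.groupby(["time_of_day", "artist_name"])["followers"].mean()
)

# Mean over distinct playlists: each playlist is counted once per artist
per_playlist = rows.drop_duplicates(subset=["playlist_id", "artist_name"])
mean_over_playlists = (
    per_playlist.groupby(["time_of_day", "artist_name"])["followers"].mean()
)

print(mean_over_rows)
print(mean_over_playlists)
```

The two means differ here (200 versus 300), which shows how the weighting choice can shift the values plotted for each artist.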

### Playlist Popularity by Time of Day (Box Plot)
This box plot visualizes the distribution of playlist followers across different times of day (morning, afternoon, evening, and night). The visualization helps identify patterns in overall playlist popularity and engagement levels for each time period.

In [68]:
# Define correct time order (excluding unknown)
time_order = ['morning', 'afternoon', 'evening', 'night']

# Filter out unknown times and aggregate data
time_stats = (
    playlists_df[playlists_df['time_of_day'].isin(time_order)]  # Filter out unknown
    .groupby('time_of_day')
    .agg({
        'followers': ['mean', 'median', 'count', 'std']
    })
    .reset_index()
)

# Ensure correct ordering
playlists_df['time_of_day'] = pd.Categorical(
    playlists_df['time_of_day'],
    categories=time_order,
    ordered=True
)

# Create box plot

p = (
    ggplot(playlists_df[playlists_df['time_of_day'].isin(time_order)], 
        aes(x='time_of_day', y='followers')) +
    geom_boxplot(
        fill='lightblue', 
        alpha=0.7,
        tooltips=layer_tooltips()
            .line("Time of Day: @{time_of_day}")
            .line("Median: @{..middle..}")
            .line("Q1: @{..lower..}")
            .line("Q3: @{..upper..}")
            .line("Min: @{..ymin..}")
            .line("Max: @{..ymax..}")
    ) +
    ggtitle('Distribution of Playlist Followers by Time of Day') +
    xlab('Time of Day') +
    ylab('Number of Followers') +
    scale_x_discrete(limits=time_order) +  # Keep the periods in chronological order
    theme(axis_text_x=element_text(angle=45, hjust=1))
)

p

Morning and night periods demonstrate both higher median follower counts and wider distributions, suggesting these are prime times for playlist engagement. In contrast, afternoon playlists consistently show the lowest follower counts with minimal variation. Evening playlists maintain moderate follower numbers, though some outliers achieve notably high follower counts. Overall, this supports the finding that playlist popularity peaks during morning and night hours, with a significant dip during afternoon periods.
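
The quartiles shown in a box plot can be cross-checked numerically with pandas. This is a minimal sketch on made-up follower counts. Note that pandas uses linear interpolation for quantiles by default, so the values may differ slightly from those a plotting library computes.

```python
import pandas as pd

# Made-up follower counts for two time periods
followers = pd.DataFrame({
    "time_of_day": ["morning"] * 5 + ["afternoon"] * 5,
    "followers": [10, 20, 30, 40, 1000, 5, 6, 7, 8, 9],
})

# Quartiles per period, reshaped to one row per period
summary = (
    followers.groupby("time_of_day")["followers"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)

# Interquartile range: the height of the box
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

In this toy data the morning group has both a higher median and a wider interquartile range than the afternoon group, and the single value of 1000 would appear as an outlier rather than stretching the box.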

## Conclusion
Spotify users' listening habits show distinct patterns throughout the day. Pop music dominates overall listening, particularly during evening hours, while rap gains prominence at night. The afternoon shows the highest genre diversity, suggesting listeners are more experimental during this time. Morning playlists featuring mainstream artists such as Post Malone and The Weeknd have notably high follower counts, while nighttime sees more varied artist engagement. Lo-fi study music is more prevalent during the morning and night, as shown in the stacked bar chart. These patterns reflect how music preferences align with daily activities and energy levels, from energetic mornings to more diverse afternoon listening and specialized nighttime preferences.