# 04 Engagement KPIs

This notebook computes behavioral engagement KPIs from the unified
`listening_events` table. Metrics are derived from normalized Spotify
and Apple Music listening events stored in SQLite.

Focus areas include:
- Listening intensity
- Habit consistency
- Session behavior
- Preference diversity
- Discovery vs repeat listening

Outputs are exported as aggregate CSV tables for Power BI.

In [1]:
import sqlite3
import pandas as pd
from pathlib import Path

In [2]:
DatabasePath = "../data/processed/MusicPlatformInsights.db"

## Step 1 — Load normalized listening events

We load the unified `listening_events` table from SQLite.
At this stage, all events already share a canonical schema and UTC timestamps, which SQLite returns as ISO strings.

In [3]:
connect = sqlite3.connect(DatabasePath)

# Close the connection even if the query raises
try:
    Events = pd.read_sql_query(
        "SELECT * FROM listening_events",
        connect
    )
finally:
    connect.close()

print("Total listening events:", len(Events))
Events.head()

Total listening events: 11362


Unnamed: 0,event_time,platform,artist,track,duration_minutes,session_id
0,2024-08-10 02:45:00+00:00,spotify,Miguel,coffee,3.026717,1
1,2025-01-08 11:54:00+00:00,spotify,Travis Scott,ASTROTHUNDER,2.382817,2
2,2025-01-08 12:00:00+00:00,spotify,SZA,Another Life,2.880433,2
3,2025-01-08 12:03:00+00:00,spotify,BigXthaPlug,Change Me,2.2811,2
4,2025-01-08 12:07:00+00:00,spotify,The Marías,Heavy,4.220217,2
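
Because every later step relies on the canonical schema, it can help to check it right after loading. The following is a minimal sketch, not part of the original pipeline: `load_events` and `EXPECTED_COLUMNS` are hypothetical names, the column set is taken from the preview above, and the demo runs against an in-memory database rather than the project file.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Columns shown in the preview above; treating them as required is an assumption
EXPECTED_COLUMNS = {
    "event_time", "platform", "artist", "track", "duration_minutes", "session_id",
}


def load_events(connection):
    """Read listening_events and fail fast if an expected column is missing."""
    events = pd.read_sql_query("SELECT * FROM listening_events", connection)
    missing = EXPECTED_COLUMNS - set(events.columns)
    if missing:
        raise ValueError(f"listening_events is missing columns: {sorted(missing)}")
    return events


# Demo on an in-memory database so the sketch runs without the project file
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute(
        "CREATE TABLE listening_events ("
        "event_time TEXT, platform TEXT, artist TEXT, track TEXT, "
        "duration_minutes REAL, session_id INTEGER)"
    )
    demo.execute(
        "INSERT INTO listening_events VALUES (?, ?, ?, ?, ?, ?)",
        ("2025-01-08 12:00:00+00:00", "spotify", "SZA", "Another Life", 2.880433, 2),
    )
    demo.commit()
    demo_events = load_events(demo)

print(demo_events.shape)
```

`contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only manages the transaction and does not close the connection.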


## Step 2 — Create time-based features

We derive date-, week-, and month-level fields from `event_time`.
These features support habit consistency, streak analysis, and
longitudinal engagement metrics.

In [8]:
# Normalize event_time so all events are comparable
#
# event_time arrives in several formats depending on the source:
# - Spotify: originally naive timestamps
# - Apple Music: already timezone-aware
# - SQLite: returns mixed ISO strings
#
# format="mixed" infers the format of each value individually (pandas 2.0+)
# utc=True converts every value to UTC so they share one timeline
Events["event_time"] = pd.to_datetime(
    Events["event_time"],
    format="mixed",
    utc=True
)

# Add time-based fields for habit and longitudinal analysis
# These are derived here on purpose and not stored in the DB
Events["event_date"] = Events["event_time"].dt.date
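# Periods cannot carry a timezone, so drop it explicitly (values stay in UTC)
# to avoid the pandas warning about discarded timezone information
EventTimeNaive = Events["event_time"].dt.tz_localize(None)
Events["event_week"] = EventTimeNaive.dt.to_period("W").astype(str)
Events["event_month"] = EventTimeNaive.dt.to_period("M").astype(str)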

Events.head()


Unnamed: 0,event_time,platform,artist,track,duration_minutes,session_id,event_date,event_week,event_month
0,2024-08-10 02:45:00+00:00,spotify,Miguel,coffee,3.026717,1,2024-08-10,2024-08-05/2024-08-11,2024-08
1,2025-01-08 11:54:00+00:00,spotify,Travis Scott,ASTROTHUNDER,2.382817,2,2025-01-08,2025-01-06/2025-01-12,2025-01
2,2025-01-08 12:00:00+00:00,spotify,SZA,Another Life,2.880433,2,2025-01-08,2025-01-06/2025-01-12,2025-01
3,2025-01-08 12:03:00+00:00,spotify,BigXthaPlug,Change Me,2.2811,2,2025-01-08,2025-01-06/2025-01-12,2025-01
4,2025-01-08 12:07:00+00:00,spotify,The Marías,Heavy,4.220217,2,2025-01-08,2025-01-06/2025-01-12,2025-01
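
To make the normalization concrete, here is a small sketch on three hand-written timestamps (sample values chosen for illustration, not taken from the database). It assumes pandas 2.0 or later, where `format="mixed"` is available, and shows how naive and offset-aware strings end up on the same UTC timeline before the week and month labels are derived.

```python
import pandas as pd

raw = pd.Series([
    "2025-01-08 11:54:00",            # naive, interpreted as UTC because utc=True
    "2025-01-08T12:00:00+00:00",      # ISO 8601 with an explicit UTC offset
    "2025-01-08 14:03:00.500+02:00",  # offset-aware, converted to 12:03 UTC
])

# Each value is parsed individually, then everything is expressed in UTC
parsed = pd.to_datetime(raw, format="mixed", utc=True)

# Periods are timezone-free, so drop the timezone before converting
naive_utc = parsed.dt.tz_localize(None)
weeks = naive_utc.dt.to_period("W").astype(str)
months = naive_utc.dt.to_period("M").astype(str)

print(parsed)
print(weeks.unique(), months.unique())
```

The default weekly period runs Monday through Sunday, which is why all three events share the `2025-01-06/2025-01-12` label.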


## Step 3 — Listening intensity

Listening intensity measures total time spent listening,
independent of frequency or session structure.

These metrics establish a baseline for overall engagement
and platform usage.

In [11]:
# Compute total listening time by platform
# This is the baseline engagement metric and answers:
# "How much time do I spend listening, and where?"
ListeningByPlatform = (
    Events
    .groupby("platform", as_index=False)["duration_minutes"]
    .sum()
    .rename(columns={"duration_minutes": "total_minutes"})
    .sort_values("total_minutes", ascending=False)
)

ListeningByPlatform

Unnamed: 0,platform,total_minutes
0,apple_music,18487.8743
1,spotify,10758.034167


In [12]:
# Aggregate listening time at the day level
# Useful for trend charts and habit consistency visuals
DailyListening = (
    Events
    .groupby(["platform", "event_date"], as_index=False)["duration_minutes"]
    .sum()
    .rename(columns={"duration_minutes": "daily_minutes"})
)

DailyListening.head()

Unnamed: 0,platform,event_date,daily_minutes
0,apple_music,2025-09-24,1.988433
1,apple_music,2025-09-28,2.952467
2,apple_music,2025-09-30,103.571067
3,apple_music,2025-10-02,48.0017
4,apple_music,2025-10-03,130.83835
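
`DailyListening` only contains days with at least one event, so a trend chart built from it will skip silent days instead of showing zeros. One possible way to fill those gaps is sketched below. The sample frame and the `fill_missing_days` helper are hypothetical, and the sketch assumes `event_date` has been converted to `datetime64` (for example with `pd.to_datetime`) rather than left as Python `date` objects.

```python
import pandas as pd

# Hypothetical sample shaped like DailyListening
daily = pd.DataFrame({
    "platform": ["spotify", "spotify", "apple_music"],
    "event_date": pd.to_datetime(["2025-10-01", "2025-10-04", "2025-10-02"]),
    "daily_minutes": [30.0, 12.5, 48.0],
})


def fill_missing_days(group):
    """Reindex one platform onto a continuous daily range, filling gaps with 0."""
    full_range = pd.date_range(
        group["event_date"].min(), group["event_date"].max(), freq="D"
    )
    return (
        group.set_index("event_date")["daily_minutes"]
        .reindex(full_range, fill_value=0.0)
        .rename_axis("event_date")
        .reset_index()
    )


filled = pd.concat(
    [fill_missing_days(g).assign(platform=name) for name, g in daily.groupby("platform")],
    ignore_index=True,
)

print(filled)
```

Each platform is filled only between its own first and last active day, so a platform adopted later does not accumulate leading zeros.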


## Step 4 — Habit consistency (active days and weeks)

Habit consistency focuses on how often listening occurs,
not how long each listening session lasts.

We define:
- **Active day**: Any calendar day with at least one listening event
- **Active week**: Any calendar week with at least one listening event

These metrics help distinguish consistent listening habits from
sporadic or binge behavior.

In [13]:
# Identify active listening days
# One row per platform per day where at least one event occurred
ActiveDays = (
    Events
    .groupby(["platform", "event_date"], as_index=False)
    .size()
    .rename(columns={"size": "event_count"})
)

ActiveDays.head()

Unnamed: 0,platform,event_date,event_count
0,apple_music,2025-09-24,1
1,apple_music,2025-09-28,1
2,apple_music,2025-09-30,47
3,apple_music,2025-10-02,14
4,apple_music,2025-10-03,47


In [14]:
# Identify active listening weeks
# Useful for longer-term habit consistency and streak analysis
ActiveWeeks = (
    Events
    .groupby(["platform", "event_week"], as_index=False)
    .size()
    .rename(columns={"size": "event_count"})
)

ActiveWeeks.head()

Unnamed: 0,platform,event_week,event_count
0,apple_music,2025-09-22/2025-09-28,2
1,apple_music,2025-09-29/2025-10-05,108
2,apple_music,2025-10-06/2025-10-12,133
3,apple_music,2025-10-13/2025-10-19,182
4,apple_music,2025-10-20/2025-10-26,173


In [15]:
# Count total active days and weeks per platform
HabitSummary = (
    Events
    .groupby("platform")
    .agg(
        active_days=("event_date", "nunique"),
        active_weeks=("event_week", "nunique")
    )
    .reset_index()
)

HabitSummary

Unnamed: 0,platform,active_days,active_weeks
0,apple_music,80,17
1,spotify,235,43
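
Active-day counts do not show whether those days were consecutive. The streak analysis mentioned in Step 2 could be sketched as follows: a new run starts whenever the gap to the previous active day is not exactly one day, and the longest run is the largest group. The helper name `longest_daily_streak` and the sample dates are hypothetical.

```python
import datetime as dt

import pandas as pd


def longest_daily_streak(dates):
    """Return the longest run of consecutive calendar days in `dates`."""
    days = (
        pd.to_datetime(pd.Series(list(dates), dtype="object"))
        .drop_duplicates()
        .sort_values(ignore_index=True)
    )
    if days.empty:
        return 0
    # A run starts wherever the previous active day is not exactly one day earlier
    starts_new_run = days.diff().dt.days.ne(1)
    run_id = starts_new_run.cumsum()
    return int(run_id.value_counts().max())


# Hypothetical active days, shaped like the ActiveDays table
active_days = pd.DataFrame({
    "platform": ["spotify"] * 4 + ["apple_music"] * 2,
    "event_date": [
        dt.date(2025, 10, 1), dt.date(2025, 10, 2), dt.date(2025, 10, 3),
        dt.date(2025, 10, 5), dt.date(2025, 10, 2), dt.date(2025, 10, 4),
    ],
})

streaks = active_days.groupby("platform")["event_date"].agg(longest_daily_streak)
print(streaks)
```

Duplicate dates are dropped inside the helper, so it can be applied to the event-level `event_date` column as well as to one-row-per-day tables.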


## Step 5 — Session behavior

Sessions provide a more realistic unit of listening behavior
than individual track-level events.

Using precomputed session IDs, we analyze:
- Session length (minutes)
- Number of tracks per session
- Session start and end times

This helps distinguish casual listening from longer, more immersive sessions.

In [16]:
# Aggregate listening behavior at the session level
SessionMetrics = (
    Events
    .groupby(["platform", "session_id"], as_index=False)
    .agg(
        session_minutes=("duration_minutes", "sum"),
        session_events=("track", "count"),
        session_start=("event_time", "min"),
        session_end=("event_time", "max")
    )
)

SessionMetrics.head()

Unnamed: 0,platform,session_id,session_minutes,session_events,session_start,session_end
0,apple_music,1,1.988433,1,2025-09-24 14:10:33.680000+00:00,2025-09-24 14:10:33.680000+00:00
1,apple_music,2,2.952467,1,2025-09-28 02:49:36.263000+00:00,2025-09-28 02:49:36.263000+00:00
2,apple_music,3,23.3427,8,2025-09-30 14:58:17.450000+00:00,2025-09-30 15:05:09.128000+00:00
3,apple_music,4,80.228367,39,2025-09-30 21:57:01.732000+00:00,2025-09-30 23:25:05.490000+00:00
4,apple_music,5,25.32825,10,2025-10-02 01:39:55.123000+00:00,2025-10-02 01:59:21.528000+00:00


In [17]:
# High-level session summaries by platform
SessionSummary = (
    SessionMetrics
    .groupby("platform")
    .agg(
        avg_session_minutes=("session_minutes", "mean"),
        median_session_minutes=("session_minutes", "median"),
        avg_tracks_per_session=("session_events", "mean")
    )
    .reset_index()
)

SessionSummary

Unnamed: 0,platform,avg_session_minutes,median_session_minutes,avg_tracks_per_session
0,apple_music,3.773806,2.627067,1.484589
1,spotify,15.937828,9.116917,6.057778
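
The session IDs used above are precomputed upstream, and this notebook does not show how. For readers who want intuition, a common approach is gap-based sessionization: a new session starts when the pause since the previous event on the same platform exceeds a threshold. The sketch below illustrates that idea only. The 30-minute `SESSION_GAP` and the sample events are assumptions, not the rule actually used to build `session_id`.

```python
import pandas as pd

# Assumed inactivity threshold; the real pipeline may use a different rule
SESSION_GAP = pd.Timedelta(minutes=30)

# Hypothetical events
events = pd.DataFrame({
    "platform": ["spotify", "spotify", "spotify", "apple_music"],
    "event_time": pd.to_datetime(
        [
            "2025-01-08 12:00:00",
            "2025-01-08 12:03:00",
            "2025-01-08 13:30:00",
            "2025-01-08 12:01:00",
        ],
        utc=True,
    ),
})

events = events.sort_values(["platform", "event_time"], ignore_index=True)

# Time since the previous event on the same platform (NaT for the first event)
gap = events.groupby("platform")["event_time"].diff()

# The first event, or any event after a long pause, opens a new session
starts_session = (gap.isna() | (gap > SESSION_GAP)).astype(int)

# Running count of session starts gives a per-platform session number
events["session_id"] = starts_session.groupby(events["platform"]).cumsum()

print(events)
```

Session numbers restart at 1 for each platform here, which is consistent with grouping by both `platform` and `session_id` in the cells above.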


## Step 6 — Preference diversity

Preference diversity measures the breadth of content exposure,
rather than total listening volume.

We focus on:
- Unique tracks listened to per day
- Unique artists listened to over time

These metrics help distinguish repetitive listening behavior
from exploratory or diverse listening patterns.

In [18]:
# Count unique tracks listened to per day
# Deduplication is applied at the KPI level, not globally
UniqueTracksPerDay = (
    Events
    .drop_duplicates(subset=["platform", "event_date", "track"])
    .groupby(["platform", "event_date"], as_index=False)
    .size()
    .rename(columns={"size": "unique_tracks"})
)

UniqueTracksPerDay.head()

Unnamed: 0,platform,event_date,unique_tracks
0,apple_music,2025-09-24,1
1,apple_music,2025-09-28,1
2,apple_music,2025-09-30,25
3,apple_music,2025-10-02,12
4,apple_music,2025-10-03,28


In [19]:
# Count unique artists listened to per week
# Weekly aggregation smooths out day-to-day noise
UniqueArtistsPerWeek = (
    Events
    .drop_duplicates(subset=["platform", "event_week", "artist"])
    .groupby(["platform", "event_week"], as_index=False)
    .size()
    .rename(columns={"size": "unique_artists"})
)

UniqueArtistsPerWeek.head()

Unnamed: 0,platform,event_week,unique_artists
0,apple_music,2025-09-22/2025-09-28,2
1,apple_music,2025-09-29/2025-10-05,21
2,apple_music,2025-10-06/2025-10-12,21
3,apple_music,2025-10-13/2025-10-19,28
4,apple_music,2025-10-20/2025-10-26,24
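
The `drop_duplicates` plus `size` pattern can also be written with `nunique`. The two agree when the counted column has no missing values, but they differ when it does: `drop_duplicates` keeps a missing artist as one distinct row, while `nunique` ignores missing values by default. The sketch below uses a hypothetical four-row sample to show the difference.

```python
import pandas as pd

# Hypothetical sample with one missing artist
events = pd.DataFrame({
    "platform": ["spotify"] * 4,
    "event_week": ["2025-01-06/2025-01-12"] * 4,
    "artist": ["SZA", "SZA", "Miguel", None],
})

# Pattern used in this notebook: deduplicate, then count rows
via_dedup = (
    events
    .drop_duplicates(subset=["platform", "event_week", "artist"])
    .groupby(["platform", "event_week"], as_index=False)
    .size()
    .rename(columns={"size": "unique_artists"})
)

# Alternative: count distinct non-null values directly
via_nunique = (
    events
    .groupby(["platform", "event_week"], as_index=False)["artist"]
    .nunique()
    .rename(columns={"artist": "unique_artists"})
)

print(via_dedup)
print(via_nunique)
```

Which behavior is preferable depends on whether events with an unknown artist should count toward diversity.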


## Step 7 — Discovery vs repeat listening

Discovery metrics capture how often listening activity is driven by
new artists versus returning to familiar ones.

We define:
- **Discovery event**: The first recorded listen to an artist on a given platform
- **Repeat event**: Any subsequent listen to that artist

These metrics help quantify exploration behavior over time.

In [20]:
# Sort events chronologically so first listens are identifiable
Events = Events.sort_values("event_time")

# Identify the first time each artist appears per platform
Events["first_artist_listen"] = (
    Events
    .groupby(["platform", "artist"])["event_time"]
    .transform("min")
)

# Flag discovery events
Events["is_discovery"] = Events["event_time"] == Events["first_artist_listen"]

Events[["platform", "artist", "event_time", "is_discovery"]].head()

Unnamed: 0,platform,artist,event_time,is_discovery
0,spotify,Miguel,2024-08-10 02:45:00+00:00,True
1,spotify,Travis Scott,2025-01-08 11:54:00+00:00,True
2,spotify,SZA,2025-01-08 12:00:00+00:00,True
3,spotify,BigXthaPlug,2025-01-08 12:03:00+00:00,True
4,spotify,The Marías,2025-01-08 12:07:00+00:00,True


In [21]:
# Compute discovery rate per platform:
# the share of events that are the first recorded listen to an artist
DiscoveryRates = (
    Events
    .groupby("platform", as_index=False)["is_discovery"]
    .mean()
    .rename(columns={"is_discovery": "discovery_rate"})
)

DiscoveryRates

Unnamed: 0,platform,discovery_rate
0,apple_music,0.028736
1,spotify,0.122035
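
One caveat with the timestamp comparison: if two events for the same artist share the earliest timestamp (plausible when timestamps have minute resolution), both are flagged as discoveries. A possible alternative, sketched here on hypothetical data, is to flag only the first row per platform and artist after a stable chronological sort.

```python
import pandas as pd

# Hypothetical events; the first two share the same timestamp
events = pd.DataFrame({
    "platform": ["spotify"] * 4,
    "artist": ["SZA", "SZA", "SZA", "Miguel"],
    "event_time": pd.to_datetime(
        [
            "2025-01-08 12:00:00",
            "2025-01-08 12:00:00",
            "2025-01-08 12:05:00",
            "2025-01-08 12:07:00",
        ],
        utc=True,
    ),
})

# A stable sort keeps the original order of tied timestamps
events = events.sort_values("event_time", kind="stable", ignore_index=True)

# Timestamp comparison: every row at the earliest timestamp is flagged
first_listen = events.groupby(["platform", "artist"])["event_time"].transform("min")
by_timestamp = events["event_time"] == first_listen

# Position-based flag: only the first row per platform and artist is flagged
by_order = events.groupby(["platform", "artist"]).cumcount() == 0

print(int(by_timestamp.sum()), int(by_order.sum()))
```

With the position-based flag, the discovery rate equals the number of distinct artists divided by the number of events on each platform.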


## Step 8 — Export KPI tables for Power BI

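The following aggregate tables are exported for visualization:
- Listening intensity: `ListeningByPlatform`
- Habit consistency: `ActiveDays`, `ActiveWeeks`
- Session behavior: `SessionMetrics`
- Preference diversity: `UniqueTracksPerDay`, `UniqueArtistsPerWeek`
- Discovery rates: `DiscoveryRates`

`DailyListening`, `HabitSummary`, and `SessionSummary` are not written by this cell.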

Raw event-level data is intentionally excluded from the dashboard layer.

In [22]:
OutputPath = Path("../data/processed/kpis")
OutputPath.mkdir(parents=True, exist_ok=True)

ListeningByPlatform.to_csv(OutputPath / "listening_by_platform.csv", index=False)
ActiveDays.to_csv(OutputPath / "active_days.csv", index=False)
ActiveWeeks.to_csv(OutputPath / "active_weeks.csv", index=False)
SessionMetrics.to_csv(OutputPath / "session_metrics.csv", index=False)
UniqueTracksPerDay.to_csv(OutputPath / "unique_tracks_per_day.csv", index=False)
UniqueArtistsPerWeek.to_csv(OutputPath / "unique_artists_per_week.csv", index=False)
DiscoveryRates.to_csv(OutputPath / "discovery_rates.csv", index=False)

print("KPI exports complete")

KPI exports complete
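
As the number of KPI tables grows, the export can be driven by a name-to-table mapping instead of one line per table. The sketch below shows the idea on two small frames that reuse values printed earlier in this notebook. It writes to a temporary directory rather than `../data/processed/kpis`, and the `tables` mapping is a hypothetical name.

```python
import tempfile
from pathlib import Path

import pandas as pd

# File stem -> table; values copied from the outputs shown above
tables = {
    "listening_by_platform": pd.DataFrame({
        "platform": ["apple_music", "spotify"],
        "total_minutes": [18487.8743, 10758.034167],
    }),
    "discovery_rates": pd.DataFrame({
        "platform": ["apple_music", "spotify"],
        "discovery_rate": [0.028736, 0.122035],
    }),
}

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "kpis"
    output_dir.mkdir(parents=True, exist_ok=True)

    # One CSV per table, named after its key
    for name, table in tables.items():
        table.to_csv(output_dir / f"{name}.csv", index=False)

    # Read one file back to confirm the layout Power BI will see
    written = sorted(path.name for path in output_dir.glob("*.csv"))
    round_trip = pd.read_csv(output_dir / "listening_by_platform.csv")

print(written)
print(round_trip)
```

Keeping `index=False` matters for the dashboard layer: without it, each CSV gains an unnamed index column.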
