# 02 Apple Music Ingestion, Cleaning, and SQLite Load

This notebook ingests exported Apple Music listening history files, normalizes them into the same canonical event schema used for Spotify, creates session IDs, and appends the results to the SQLite database table `listening_events`.

Notes:
- Apple Music data is appended to the existing event table.
- No KPIs are computed here.

In [1]:
import sqlite3
import pandas as pd
import glob
import json
from pathlib import Path

## Step 1 — Define paths and basic parameters

Expected Apple Music export files:
- CSV files located in `data/raw/apple music/` (the folder name contains a space)

Output:
- SQLite database: `data/processed/MusicPlatformInsights.db`
- Table: `listening_events`

In [2]:
# Parameters
MinPlaySeconds = 30
SessionGapMinutes = 30

# Paths (relative to notebooks/)
RawDataPath = Path("../data/raw/apple music")
DatabasePath = "../data/processed/MusicPlatformInsights.db"

# Quick sanity check
print("Apple Music raw data path:", RawDataPath)
print("Database path:", DatabasePath)

Apple Music raw data path: ../data/raw/apple music
Database path: ../data/processed/MusicPlatformInsights.db


## Step 2 — Find Apple Music files

We look for Apple Music listening history CSV files in the raw data folder.
If no files are found, either the export has not been copied into this folder yet or Apple has not finished processing the data request.

In [3]:
# Find all CSV files in the Apple Music raw data folder
AppleFiles = list(RawDataPath.glob("*.csv"))

print("Apple Music files found:", len(AppleFiles))

# Print a few file names for a quick sanity check
for file in AppleFiles[:5]:
    print(file)

# If no files are found, stop here
if len(AppleFiles) == 0:
    raise FileNotFoundError(
        f"No Apple Music CSV files found in {RawDataPath}")

Apple Music files found: 1
../data/raw/apple music/Apple Music Play Activity.csv
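
The folder name is easy to get wrong because it can be written with a space or with an underscore. The following is a minimal sketch, not part of the original notebook, of a lookup that tries both spellings. The helper name `find_apple_files` and the candidate folder names are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def find_apple_files(base_dir, folder_names=("apple music", "apple_music")):
    """Return the CSV files from the first candidate folder that contains any."""
    base = Path(base_dir)
    for name in folder_names:
        folder = base / name
        if folder.is_dir():
            files = sorted(folder.glob("*.csv"))
            if files:
                return files
    return []


# Demonstration on a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "apple music"
    raw.mkdir()
    (raw / "Apple Music Play Activity.csv").write_text("Song Name\n")
    found_names = [f.name for f in find_apple_files(tmp)]

print(found_names)
```

Returning an empty list instead of raising keeps the decision about how to fail with the caller, as the notebook cell above does with `FileNotFoundError`.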


## Step 3 — Load Apple Music listening events

Apple Music play activity is stored in a single CSV file. In this step, we load
the raw listening events into a DataFrame and inspect the available fields
before normalizing them into the canonical event schema.

In [4]:
# Load Apple Music play activity CSV
ApplePlayFile = AppleFiles[0]

print(f"Loading {ApplePlayFile}...")
apple = pd.read_csv(ApplePlayFile)

print("Rows:", len(apple))
apple.head()

Loading ../data/raw/apple music/Apple Music Play Activity.csv...
Rows: 31291



Unnamed: 0,Age Bucket,Album Name,Apple ID Number,Apple Music Subscription,Auto Play,Build Version,Bundle Version,Camera Option,Carrier Name,Client Build Version,...,Subscription User ID,Transition Type,Translation Displayed,Use Listening History,User's Transition Type,User’s Audio Quality,User’s Playback Format,UTC Offset In Seconds,Vocal Attenuation Duration,Vocal Attenuation Model ID
0,,Envy Me - Single,10079424226,True,,"Music/3.1 iOS/12.3.1 model/iPhone9,4 hwp/t8010...",,,,,...,,,,False,,,,-25200,,
1,35-44,dont smile at me,10079424226,True,AUTO_ON,"Music/3.1 iOS/26.0.1 model/iPhone18,1 hwp/t815...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-25200,0.0,
2,35-44,In The Lonely Hour (10th Anniversary Edition /...,10079424226,True,AUTO_ON,"Music/3.1 iOS/26.0.1 model/iPhone18,1 hwp/t815...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-25200,0.0,
3,35-44,SOS Deluxe: LANA,10079424226,True,AUTO_UNKNOWN,"Music/3.1 iOS/26.1 model/iPhone18,1 hwp/t8150 ...",,,VERIZONUS,,...,1608221000.0,,False,True,,,,-28800,,
4,35-44,I Love You. (10th Anniversary Edition),10079424226,True,AUTO_ON,"Music/3.1 iOS/26.1 model/iPhone18,1 hwp/t8150 ...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-28800,0.0,
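
Wide exports with mixed-type columns can make `pd.read_csv` emit a `DtypeWarning`. One possible way to sidestep this, sketched below on a made-up three-line sample, is to read only the columns needed downstream and pass `low_memory=False`. The sample rows and the `wanted` set are illustrative assumptions, not the real export.

```python
import io

import pandas as pd

# Hypothetical miniature export; the real file has many more columns
sample_csv = io.StringIO(
    "Song Name,Event Timestamp,Play Duration Milliseconds,Carrier Name\n"
    "ocean eyes,2025-09-28T02:49:36.263Z,177148,VERIZONUS\n"
    "CHIHIRO,2025-09-30T15:02:41.553Z,52051,VERIZONUS\n"
)

# Columns required by the canonical schema, compared case-insensitively
wanted = {"song name", "event timestamp", "play duration milliseconds"}

sample = pd.read_csv(
    sample_csv,
    usecols=lambda name: name.strip().lower() in wanted,  # keep matching columns only
    low_memory=False,  # infer each column's dtype from the whole file
)

print(sample.columns.tolist())
```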


## Step 4 — Inspect and standardize column names

Apple Music column names contain spaces and mixed capitalization.
In this step, we lowercase them so they are easier to inspect
and map into the canonical schema.

In [5]:
# Standardize column names to lowercase for easier matching
apple.columns = apple.columns.str.lower()

# Print column names to inspect what Apple provided
print("Apple Music columns:")
for col in apple.columns:
    print(col)

apple.head()

Apple Music columns:
age bucket
album name
apple id number
apple music subscription
auto play
build version
bundle version
camera option
carrier name
client build version
client device name
client ip address
client platform
container album name
container artist name
container global playlist id
container id
container itunes playlist id
container library id
container name
container origin type
container personalized id
container playlist folder id
container playlist id
container radio station id
container radio station version
container season id
container type
contingency
continuity microphone used
device app name
device app version
device identifier
device os name
device os version
device type
display count
display language
display type
end position in milliseconds
end reason type
evaluation variant
event end timestamp
event id
event post date time
event reason hint type
event received timestamp
event start timestamp
event timestamp
event type
feature name
grace period
grouping
house 

Unnamed: 0,age bucket,album name,apple id number,apple music subscription,auto play,build version,bundle version,camera option,carrier name,client build version,...,subscription user id,transition type,translation displayed,use listening history,user's transition type,user’s audio quality,user’s playback format,utc offset in seconds,vocal attenuation duration,vocal attenuation model id
0,,Envy Me - Single,10079424226,True,,"Music/3.1 iOS/12.3.1 model/iPhone9,4 hwp/t8010...",,,,,...,,,,False,,,,-25200,,
1,35-44,dont smile at me,10079424226,True,AUTO_ON,"Music/3.1 iOS/26.0.1 model/iPhone18,1 hwp/t815...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-25200,0.0,
2,35-44,In The Lonely Hour (10th Anniversary Edition /...,10079424226,True,AUTO_ON,"Music/3.1 iOS/26.0.1 model/iPhone18,1 hwp/t815...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-25200,0.0,
3,35-44,SOS Deluxe: LANA,10079424226,True,AUTO_UNKNOWN,"Music/3.1 iOS/26.1 model/iPhone18,1 hwp/t8150 ...",,,VERIZONUS,,...,1608221000.0,,False,True,,,,-28800,,
4,35-44,I Love You. (10th Anniversary Edition),10079424226,True,AUTO_ON,"Music/3.1 iOS/26.1 model/iPhone18,1 hwp/t8150 ...",3.1,,VERIZONUS,,...,1608221000.0,,,False,Smart Transition,HIGH_QUALITY,SPATIAL,-28800,0.0,
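
The header row above mixes straight and curly apostrophes (`user's transition type` versus `user’s audio quality`), which can make exact-name lookups fail. A slightly stricter normalization could look like the sketch below. The helper `normalize_columns` is hypothetical and not part of the notebook.

```python
import pandas as pd


def normalize_columns(columns):
    """Trim whitespace, lowercase, and unify curly apostrophes in column names."""
    return (
        pd.Index(columns)
        .str.strip()
        .str.lower()
        .str.replace("\u2019", "'", regex=False)  # curly apostrophe to straight
    )


demo_cols = normalize_columns(
    ["Song Name ", "User\u2019s Audio Quality", "User's Transition Type"]
)
print(demo_cols.tolist())
```

In the notebook this would be applied as `apple.columns = normalize_columns(apple.columns)`.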


## Step 5 — Rename columns to canonical schema

Apple Music fields are explicitly mapped into the canonical event schema.
Only track-level listening data is retained; all other metadata is ignored.

Canonical fields:
- event_time
- platform
- artist
- track
- duration_minutes

In [6]:
print(apple.columns.tolist())

['age bucket', 'album name', 'apple id number', 'apple music subscription', 'auto play', 'build version', 'bundle version', 'camera option', 'carrier name', 'client build version', 'client device name', 'client ip address', 'client platform', 'container album name', 'container artist name', 'container global playlist id', 'container id', 'container itunes playlist id', 'container library id', 'container name', 'container origin type', 'container personalized id', 'container playlist folder id', 'container playlist id', 'container radio station id', 'container radio station version', 'container season id', 'container type', 'contingency', 'continuity microphone used', 'device app name', 'device app version', 'device identifier', 'device os name', 'device os version', 'device type', 'display count', 'display language', 'display type', 'end position in milliseconds', 'end reason type', 'evaluation variant', 'event end timestamp', 'event id', 'event post date time', 'event reason hint type

In [7]:
# Rename Apple Music telemetry fields into canonical schema
apple = apple.rename(columns={
    "event timestamp": "event_time",
    "song name": "track",
    "play duration milliseconds": "ms_played"
})

# Parse timestamps
apple["event_time"] = pd.to_datetime(
    apple["event_time"],
    errors="coerce",
    utc=True)

# Explicit platform label
apple["platform"] = "apple_music"

# Convert duration from milliseconds to minutes
apple["duration_minutes"] = apple["ms_played"] / 1000 / 60

# Artist intentionally left null (enrichment handled separately)
apple["artist"] = None

# Filter out very short plays (same logic as Spotify)
apple = apple[apple["ms_played"] >= (MinPlaySeconds * 1000)]

# Keep only canonical fields
apple = apple[[
    "event_time",
    "platform",
    "artist",
    "track",
    "duration_minutes"
]]

# Sort chronologically
apple = apple.sort_values("event_time").reset_index(drop=True)

print("Clean Apple Music rows:", len(apple))
apple.head()



Clean Apple Music rows: 6803


Unnamed: 0,event_time,platform,artist,track,duration_minutes
0,2025-09-24 14:10:33.680000+00:00,apple_music,,Moon (feat. Bon Iver),1.988433
1,2025-09-28 02:49:36.263000+00:00,apple_music,,ocean eyes,2.952467
2,2025-09-30 14:58:17.450000+00:00,apple_music,,Nobody Gets Me,3.015033
3,2025-09-30 15:01:54.689000+00:00,apple_music,,Always,3.7551
4,2025-09-30 15:02:41.553000+00:00,apple_music,,CHIHIRO,0.867517
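
Because Apple Music rows are later appended to a table shared with Spotify, it can help to check the frame against the canonical schema before moving on. The sketch below shows one way to do that. `CANONICAL_COLUMNS` and `check_canonical` are hypothetical names, and the two demo rows use illustrative values.

```python
import pandas as pd

CANONICAL_COLUMNS = ["event_time", "platform", "artist", "track", "duration_minutes"]


def check_canonical(df):
    """Raise if a canonical column is missing or event_time is not tz-aware."""
    missing = [c for c in CANONICAL_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing canonical columns: {missing}")
    if not isinstance(df["event_time"].dtype, pd.DatetimeTZDtype):
        raise TypeError("event_time must be a timezone-aware datetime column")
    return df[CANONICAL_COLUMNS]


# Illustrative rows; the milliseconds-to-minutes conversion mirrors the cell above
demo = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2025-09-28T02:49:36.263Z", "2025-09-30T15:02:41.553Z"], utc=True
    ),
    "platform": "apple_music",
    "artist": None,
    "track": ["ocean eyes", "CHIHIRO"],
    "duration_minutes": [177148 / 1000 / 60, 52051 / 1000 / 60],
})

checked = check_canonical(demo)
print(checked.dtypes)
```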


In [8]:
apple["artist"].isna().mean()

np.float64(1.0)

## Step 6 — Quick sanity checks

Before creating session IDs, we run a few basic checks to confirm the
Apple Music data looks reasonable after normalization.

We check:
- overall date range
- missing values in canonical fields

In [9]:
# Check overall date range
print("Date range:")
print(apple["event_time"].min(), "to", apple["event_time"].max())

print("\nNull counts:")
print(
    apple[["event_time", "platform", "artist", "track", "duration_minutes"]]
    .isna()
    .sum())

Date range:
2025-09-24 14:10:33.680000+00:00 to 2026-01-07 22:51:18.410000+00:00

Null counts:
event_time          4735
platform               0
artist              6803
track                  0
duration_minutes       0
dtype: int64
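
The null counts show that a large share of rows have no parsed `event_time` (4735 of 6803). Since session construction depends on timestamps, one option is to separate those rows before continuing. This is a sketch under the assumption that rows without a usable timestamp should be set aside; the notebook itself keeps them, and the sample values are made up.

```python
import pandas as pd

# Made-up raw values: one valid ISO timestamp, one missing, one unparseable
raw_times = pd.Series(["2025-09-28T02:49:36.263Z", None, "not a timestamp"])
parsed = pd.to_datetime(raw_times, errors="coerce", utc=True)

demo = pd.DataFrame({"track": ["ocean eyes", "Always", "CHIHIRO"], "event_time": parsed})

# Split into rows that can be placed on a timeline and rows that cannot
missing_mask = demo["event_time"].isna()
valid = demo[~missing_mask].reset_index(drop=True)
rejected = demo[missing_mask].reset_index(drop=True)

print(f"Rows without a usable timestamp: {missing_mask.mean():.1%}")
```

Keeping `rejected` around, rather than silently dropping it, makes it possible to inspect why the timestamps failed to parse.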


## Step 7 — Create session IDs

We use the same session definition as Spotify.

Definition:
- A new session starts when the gap between consecutive events
  exceeds `SessionGapMinutes`.

In [10]:
# Look at the timestamp of the previous listening event
apple["prev_time"] = apple["event_time"].shift()

# Compute gap between consecutive events (in minutes)
apple["gap_minutes"] = (
    apple["event_time"] - apple["prev_time"]).dt.total_seconds() / 60

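# Start a new session if:
# - the gap is undefined (first event, or a row whose event_time is NaT), or
# - the gap exceeds the session threshold
# Note: rows with a missing event_time sort last and each becomes its own session
apple["new_session"] = (
    apple["gap_minutes"].isna() |
    (apple["gap_minutes"] > SessionGapMinutes))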

# Assign session IDs by cumulatively summing session breaks
apple["session_id"] = apple["new_session"].cumsum()

# Drop intermediate columns used only for session construction
apple = apple.drop(columns=["prev_time", "gap_minutes", "new_session"])

apple.head()

Unnamed: 0,event_time,platform,artist,track,duration_minutes,session_id
0,2025-09-24 14:10:33.680000+00:00,apple_music,,Moon (feat. Bon Iver),1.988433,1
1,2025-09-28 02:49:36.263000+00:00,apple_music,,ocean eyes,2.952467,2
2,2025-09-30 14:58:17.450000+00:00,apple_music,,Nobody Gets Me,3.015033,3
3,2025-09-30 15:01:54.689000+00:00,apple_music,,Always,3.7551,3
4,2025-09-30 15:02:41.553000+00:00,apple_music,,CHIHIRO,0.867517,3
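
The gap-based definition can be illustrated on a toy series. The sketch below is a variant, not the notebook's code: it assumes that rows with a missing timestamp should receive no session ID instead of a session of their own. The function name `assign_sessions` is hypothetical.

```python
import pandas as pd


def assign_sessions(event_times, gap_minutes=30):
    """Return session IDs for non-null timestamps; null timestamps get <NA>."""
    times = event_times.dropna().sort_values()
    gaps = times.diff().dt.total_seconds() / 60
    # Only the first valid event has an undefined gap once nulls are removed
    new_session = gaps.isna() | (gaps > gap_minutes)
    ids = new_session.cumsum()
    # Align back to the original index; dropped rows become <NA>
    return ids.reindex(event_times.index).astype("Int64")


toy_times = pd.Series(pd.to_datetime(
    ["2025-09-30 14:58:17", "2025-09-30 15:01:54", "2025-09-30 16:00:00", None],
    utc=True,
))
session_ids = assign_sessions(toy_times, gap_minutes=30)
print(session_ids.tolist())
```

The first two events are under four minutes apart and share session 1. The third follows a gap of about 58 minutes and opens session 2. The row without a timestamp is left unassigned.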


## Step 8 — Append Apple Music events to SQLite

The Spotify ingestion step created the `listening_events` table.
Apple Music events are appended to the same table so that all existing rows are preserved.

In [11]:
# Connect to the SQLite database
connect = sqlite3.connect(DatabasePath)

# Append Apple Music events to the existing listening_events table
apple.to_sql(
    "listening_events",
    connect,
    if_exists="append",
    index=False
)

connect.commit()
connect.close()

print("Apple Music rows appended:", len(apple))

Apple Music rows appended: 6803
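
With `if_exists="append"`, running the cell a second time inserts the same events again. One possible safeguard, sketched here against an in-memory database, is to delete the platform's existing rows before appending. The helper `replace_platform_rows` is hypothetical, and it assumes the table already exists, as it does after the Spotify load.

```python
import sqlite3

import pandas as pd


def replace_platform_rows(conn, df, platform, table="listening_events"):
    """Delete a platform's existing rows, then append the fresh ones."""
    conn.execute(f"DELETE FROM {table} WHERE platform = ?", (platform,))
    df.to_sql(table, conn, if_exists="append", index=False)
    conn.commit()


conn = sqlite3.connect(":memory:")

# Stand-in for rows loaded earlier from another platform
pd.DataFrame({
    "event_time": ["2025-01-01 10:00:00+00:00"],
    "platform": ["spotify"],
    "track": ["Example Track"],
}).to_sql("listening_events", conn, index=False)

apple_demo = pd.DataFrame({
    "event_time": ["2025-09-30 14:58:17+00:00", "2025-09-30 15:01:54+00:00"],
    "platform": ["apple_music", "apple_music"],
    "track": ["Nobody Gets Me", "Always"],
})

# Running the load twice still leaves exactly two Apple Music rows
replace_platform_rows(conn, apple_demo, "apple_music")
replace_platform_rows(conn, apple_demo, "apple_music")

counts = dict(conn.execute(
    "SELECT platform, COUNT(*) FROM listening_events GROUP BY platform"
).fetchall())
conn.close()
print(counts)
```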


## Step 9 — Verify Apple Music rows in SQLite

In [12]:
connect = sqlite3.connect(DatabasePath)

VerificationQuery = """
SELECT
    platform,
    COUNT(*) AS row_count,
    MIN(event_time) AS min_time,
    MAX(event_time) AS max_time
FROM listening_events
GROUP BY platform;
"""

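# Read the summary before closing the connection
VerificationResult = pd.read_sql_query(VerificationQuery, connect)

connect.close()

# Last expression in the cell, so the notebook displays it
VerificationResult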
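
The verification can also be made self-checking by comparing the stored row count with the number of rows that were written. The sketch below uses an in-memory database and made-up rows. `contextlib.closing` is used because a `sqlite3` connection used directly as a context manager manages transactions but does not close the connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

demo_events = pd.DataFrame({
    "event_time": ["2025-09-30 14:58:17+00:00", "2025-09-30 15:01:54+00:00"],
    "platform": ["apple_music", "apple_music"],
    "track": ["Nobody Gets Me", "Always"],
})

with closing(sqlite3.connect(":memory:")) as conn:
    demo_events.to_sql("listening_events", conn, index=False)
    summary = pd.read_sql_query(
        "SELECT platform, COUNT(*) AS row_count "
        "FROM listening_events GROUP BY platform",
        conn,
    )

# Compare what the database reports with what was written
apple_rows = int(
    summary.loc[summary["platform"] == "apple_music", "row_count"].iloc[0]
)
assert apple_rows == len(demo_events)
print(summary)
```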