# 02 Apple Music Ingestion, Cleaning, and SQLite Load

This notebook ingests exported Apple Music listening history files, normalizes them into the same canonical event schema used for Spotify, creates session IDs, and appends the results to the SQLite database table `listening_events`.

Notes:
- Apple Music data is appended to the existing event table.
- No KPIs are computed here.

In [None]:
import glob
import sqlite3

import pandas as pd

## Step 1, Define paths and basic parameters

Expected Apple Music export files:
- CSV files located in `data/raw/apple_music/`

Output:
- SQLite database: `data/processed/MusicPlatformInsights.db`
- Table: `listening_events`
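
On a fresh checkout `data/processed/` may not exist yet, and `sqlite3.connect` fails if the parent directory is missing. A minimal sketch of building both paths with `pathlib` and creating the output directory up front follows; `resolve_paths` and its `project_root` argument are hypothetical names, not part of this notebook.

```python
from pathlib import Path


def resolve_paths(project_root="."):
    """Return (database_path, raw_glob_pattern) as strings.

    Hypothetical helper: creates data/processed/ if it is missing so that
    sqlite3.connect() does not fail on a missing directory.
    """
    root = Path(project_root)
    processed_dir = root / "data" / "processed"
    processed_dir.mkdir(parents=True, exist_ok=True)

    database_path = processed_dir / "MusicPlatformInsights.db"
    raw_pattern = root / "data" / "raw" / "apple_music" / "*.csv"
    return str(database_path), str(raw_pattern)
```

The returned strings could be assigned to `DatabasePath` and `RawDataPath` in place of the literals.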

In [None]:
DatabasePath = "data/processed/MusicPlatformInsights.db"
RawDataPath = "data/raw/apple_music/*.csv"

SessionGapMinutes = 30

## Step 2, Find Apple Music files

If no files are found yet, Apple has likely not finished processing the data request.
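
An empty result can also mean the folder path is wrong, not that the export is still pending. The sketch below separates the two cases with distinct error messages; `find_export_files` is a hypothetical helper, shown only as one possible approach.

```python
from pathlib import Path


def find_export_files(directory, pattern="*.csv"):
    """Return a sorted list of matching files, or raise with a specific reason."""
    folder = Path(directory)
    if not folder.is_dir():
        # The folder itself is missing: likely a path or working-directory problem
        raise FileNotFoundError(f"Directory does not exist: {folder}")

    matches = sorted(str(path) for path in folder.glob(pattern))
    if not matches:
        # The folder exists but is empty: the export has probably not arrived yet
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder}")
    return matches
```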

In [None]:
files = sorted(glob.glob(RawDataPath))  # sorted for a deterministic load order

print("Files found:", len(files))
for file in files[:5]:
    print(file)

if len(files) == 0:
    raise FileNotFoundError("No Apple Music CSV files found in data/raw/apple_music/")

## Step 3, Load Apple Music files into a DataFrame

Apple Music files are typically already row-based, so we load them directly.
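
When several files are concatenated, it can help to record which file each row came from, so that odd rows can be traced back later. This is a minimal sketch; `load_csv_files` and the `source_file` column are hypothetical and not part of the canonical schema.

```python
from pathlib import Path

import pandas as pd


def load_csv_files(paths):
    """Read each CSV and tag its rows with the file they came from."""
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        frame["source_file"] = Path(path).name  # hypothetical provenance column
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Because Step 5 keeps only the canonical fields, `source_file` would be dropped there unless it is inspected beforehand.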

In [None]:
AppleFrames = []

for file in files:
    print(f"Loading {file}...")
    df = pd.read_csv(file)
    AppleFrames.append(df)

apple = pd.concat(AppleFrames, ignore_index=True)

print("Rows:", len(apple))
apple.head()

## Step 4, Inspect and standardize column names

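Apple Music column names vary between exports, so we normalize them first (trim whitespace, lowercase, and replace spaces with underscores) and then map them to the canonical schema.

In [None]:
# Normalize headers such as "Artist Name" to "artist_name" so the rename map in Step 5 can match them
apple.columns = apple.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)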
print("Columns:")
print(apple.columns)

## Step 5, Rename columns to canonical schema

We map Apple Music fields into:
- event_time
- artist
- track
- duration_minutes (if available)
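
Since the source column names differ between exports, a single hard-coded rename can silently match nothing. One way to handle this is an alias table plus a check that the required fields exist afterwards. In the sketch below, `COLUMN_ALIASES`, `REQUIRED`, and `map_to_canonical` are hypothetical, and the alias names are assumptions to be replaced with the columns actually present in the export.

```python
import pandas as pd

# Hypothetical alias lists; replace with the column names found in the export.
COLUMN_ALIASES = {
    "event_time": ["play_date", "event_start_timestamp"],
    "artist": ["artist_name"],
    "track": ["track_name", "song_name"],
    "duration_minutes": ["play_duration_minutes"],
}
REQUIRED = ["event_time", "artist", "track"]


def map_to_canonical(frame):
    """Rename the first matching alias of each canonical field, then validate."""
    rename_map = {}
    for canonical, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            if alias in frame.columns:
                rename_map[alias] = canonical
                break  # use only the first alias that is present
    renamed = frame.rename(columns=rename_map)

    missing = [name for name in REQUIRED if name not in renamed.columns]
    if missing:
        raise KeyError(f"Missing required columns after rename: {missing}")
    return renamed
```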

In [None]:
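# NOTE: the source column names on the left are assumptions about the export;
# adjust them to match the columns printed in Step 4.
apple = apple.rename(columns={
    "play_date": "event_time",
    "artist_name": "artist",
    "track_name": "track",
    "play_duration_minutes": "duration_minutes"
})

# Fail early with a clear message if a required column was not mapped
MissingColumns = [c for c in ["event_time", "artist", "track"] if c not in apple.columns]
if MissingColumns:
    raise KeyError(f"Columns missing after rename: {MissingColumns}")

# Unparseable timestamps become NaT instead of raising
apple["event_time"] = pd.to_datetime(apple["event_time"], errors="coerce")
apple["platform"] = "apple_music"

# Duration may not exist in Apple exports; NaN keeps the column numeric
if "duration_minutes" not in apple.columns:
    apple["duration_minutes"] = float("nan")

# Keep canonical fields
apple = apple[["event_time", "platform", "artist", "track", "duration_minutes"]]
apple = apple.sort_values("event_time").reset_index(drop=True)

apple.head()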

## Step 6, Quick sanity checks

We check:
- date range
- null counts
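
These checks can be wrapped in a small function so the same summary can be run on any event frame. The sketch below is one possible shape; `summarize_events` is a hypothetical helper, and the exact-duplicate count is an extra check beyond the two listed above.

```python
import pandas as pd


def summarize_events(frame, time_column="event_time"):
    """Return the date range, per-column null counts, and exact-duplicate row count."""
    return {
        "min_time": frame[time_column].min(),  # min/max skip missing timestamps
        "max_time": frame[time_column].max(),
        "null_counts": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }
```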

In [None]:
print("Min event_time:", apple["event_time"].min())
print("Max event_time:", apple["event_time"].max())

print("\nNull counts:")
print(apple.isna().sum())

## Step 7, Create session IDs

Same session definition as Spotify:
- A new session starts when the gap between consecutive events exceeds the threshold.
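
To make the rule concrete, here is a small self-contained sketch on toy timestamps. `assign_session_ids` is a hypothetical helper that mirrors the gap-then-cumulative-sum approach; it assumes the timestamps are already sorted and contain no missing values.

```python
import pandas as pd


def assign_session_ids(event_times, gap_minutes=30):
    """Number sessions from 1; a gap above gap_minutes starts a new one.

    Assumes event_times is sorted ascending and contains no missing values.
    """
    times = pd.Series(pd.to_datetime(event_times)).reset_index(drop=True)
    gaps = times.diff().dt.total_seconds() / 60
    # The first event has no predecessor (NaN gap), so it opens session 1
    starts_new = gaps.isna() | (gaps > gap_minutes)
    return starts_new.cumsum()


example = assign_session_ids([
    "2024-01-01 10:00",
    "2024-01-01 10:10",  # 10 min gap, same session
    "2024-01-01 10:50",  # 40 min gap, new session
    "2024-01-01 11:00",  # 10 min gap, same session
    "2024-01-01 12:00",  # 60 min gap, new session
])
print(example.tolist())  # [1, 1, 2, 2, 3]
```

Because the comparison is strictly greater than the threshold, a gap of exactly 30 minutes stays in the same session.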

In [None]:
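# Rows whose timestamp failed to parse (NaT) cannot be placed in a session;
# without this, each one would be counted as its own session.
apple = apple.dropna(subset=["event_time"]).sort_values("event_time")

apple["prev_time"] = apple["event_time"].shift()
apple["gap_minutes"] = (apple["event_time"] - apple["prev_time"]).dt.total_seconds() / 60

# The first event has no predecessor (NaN gap), so it always opens session 1
apple["new_session"] = apple["gap_minutes"].isna() | (apple["gap_minutes"] > SessionGapMinutes)
# IDs start at 1 and are unique only within this platform's rows
apple["session_id"] = apple["new_session"].cumsum()

apple = apple.drop(columns=["prev_time", "gap_minutes", "new_session"])

apple.head()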

## Step 8, Append Apple Music events to SQLite

Spotify initialized the table.
Apple Music appends rows to the same `listening_events` table.
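
Because `if_exists="append"` adds rows on every run, re-running this notebook would insert the same Apple Music events again. One way to make the load repeatable is to delete this platform's rows before appending. The sketch below assumes `listening_events` already exists with a `platform` column; `replace_platform_rows` is a hypothetical helper.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def replace_platform_rows(database_path, frame, platform, table="listening_events"):
    """Delete existing rows for one platform, then append the new frame."""
    with closing(sqlite3.connect(database_path)) as connection:
        # The table name is interpolated, so it must come from trusted code;
        # the platform value is passed as a bound parameter.
        connection.execute(f"DELETE FROM {table} WHERE platform = ?", (platform,))
        frame.to_sql(table, connection, if_exists="append", index=False)
        connection.commit()
    return len(frame)
```

Called as `replace_platform_rows(DatabasePath, apple, "apple_music")`, this would leave the Spotify rows untouched while keeping a single copy of the Apple Music rows.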

In [None]:
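connect = sqlite3.connect(DatabasePath)

try:
    # Appends on every run: re-running this cell inserts the same rows again
    apple.to_sql(
        "listening_events",
        connect,
        if_exists="append",
        index=False
    )
    connect.commit()
finally:
    # Close the connection even if the insert fails
    connect.close()

print("Apple Music rows loaded:", len(apple))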

## Step 9, Verify Apple Music rows in SQLite
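
The query in this step groups rows by platform for visual inspection. As a complement, a programmatic check could compare the stored row count for one platform with the number of rows that were loaded. This is a minimal sketch; `count_platform_rows` is a hypothetical helper.

```python
import sqlite3


def count_platform_rows(connection, platform, table="listening_events"):
    """Return how many rows the table holds for one platform."""
    # The table name is interpolated, so it must come from trusted code
    query = f"SELECT COUNT(*) FROM {table} WHERE platform = ?"
    (row_count,) = connection.execute(query, (platform,)).fetchone()
    return row_count
```

For example, comparing `count_platform_rows(connect, "apple_music")` with `len(apple)` on an open connection would reveal a load that had been run more than once.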

In [None]:
connect = sqlite3.connect(DatabasePath)

CheckQuery = """
SELECT
    platform,
    COUNT(*) AS RowCount,
    MIN(event_time) AS MinTime,
    MAX(event_time) AS MaxTime
FROM listening_events
GROUP BY platform;
"""

verification = pd.read_sql_query(CheckQuery, connect)
connect.close()

verification