# Notebook 03 — Mart Build (MART Layer)
---
## Goal
Build dashboard-ready, analytics-ready datasets from the CLEAN layer.
This is where business logic and strategic metrics are engineered — no raw cleaning here.

## Inputs
- `data/clean/titles_clean.parquet`
- `data/clean/provider_availability_clean.parquet`

## Outputs
- `data/mart/provider_catalog_mart.parquet`
- `data/mart/provider_market_metrics.parquet`
- `data/mart/overlap_matrix.parquet`
- `data/mart/manifest.json`

## Metrics Engineered Here
- **Catalog size** — unique titles per provider x country, with movie/TV split
- **Weighted rating** — vote_count-weighted average vote_average
- **Recency mix** — % titles in Recent / Mid / Legacy buckets
- **Localization index** — % titles where origin_country matches the market country
- **Recency Intensity Score** — vote_count-weighted freshness score; higher = more investment in recent content
- **Genre Concentration Index** — Herfindahl index on primary_genre shares; higher = more focused catalog
- **International Diversification Score** — % non-local content weighted by breadth of origin countries
- **Jaccard similarity** — catalog overlap between provider pairs per country
---
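
As a small illustration of the weighted rating defined above, the sketch below compares a simple mean with a vote_count-weighted mean. The two-row frame and its values are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy catalog: one niche title with few votes, one widely rated title
toy = pd.DataFrame({
    "vote_average": [9.0, 6.0],
    "vote_count":   [10, 990],
})

# Unweighted mean treats both titles equally
simple_mean = toy["vote_average"].mean()

# Weighted mean: each rating counts in proportion to its number of votes
weighted_mean = (toy["vote_average"] * toy["vote_count"]).sum() / toy["vote_count"].sum()

print(f"simple mean   : {simple_mean:.2f}")    # 7.50
print(f"weighted mean : {weighted_mean:.2f}")  # 6.03
```

The title with only 10 votes pulls the simple mean up to 7.50, while the weighted mean stays close to the rating most voters gave.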

## Imports & Paths

In [1]:
import os
import json
import pandas as pd
import numpy as np
from pathlib import Path

# Project root detection — works whether running from /notebooks or project root
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR
for candidate in [NOTEBOOK_DIR, NOTEBOOK_DIR.parent, NOTEBOOK_DIR.parent.parent]:
    if (candidate / "data").exists() and (candidate / "notebooks").exists():
        PROJECT_ROOT = candidate
        break

DATA_CLEAN_DIR = str(PROJECT_ROOT / "data" / "clean")
DATA_MART_DIR  = str(PROJECT_ROOT / "data" / "mart")
os.makedirs(DATA_MART_DIR, exist_ok=True)

SNAPSHOT_DATE = "2026-02-21"
CURRENT_YEAR  = 2026

print(f"Project root: {PROJECT_ROOT}")
print(f"CLEAN dir   : {DATA_CLEAN_DIR}")
print(f"MART dir    : {DATA_MART_DIR}")

Project root: c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project
CLEAN dir   : c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\clean
MART dir    : c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\mart


## Load CLEAN Files
No transformations here — pure load from the CLEAN layer.

In [2]:
titles       = pd.read_parquet(os.path.join(DATA_CLEAN_DIR, "titles_clean.parquet"))
availability = pd.read_parquet(os.path.join(DATA_CLEAN_DIR, "provider_availability_clean.parquet"))

print(f"titles       : {titles.shape}")
print(f"availability : {availability.shape}")
print()
print("Availability sample:")
print(availability.head(3))
print()
print("Titles sample:")
print(titles.head(3))

titles       : (46973, 17)
availability : (127399, 6)

Availability sample:
   tmdb_id media_type  provider_id  canonical_provider country snapshot_date
0  1168190      movie            9  Amazon Prime Video      US    2026-02-21
1  1317672      movie            9  Amazon Prime Video      US    2026-02-21
2   755898      movie            9  Amazon Prime Video      US    2026-02-21

Titles sample:
   tmdb_id media_type              title release_date  release_year  \
0  1168190      movie  The Wrecking Crew   2026-01-28          2026   
1  1317672      movie    Love Me Love Me   2026-02-12          2026   
2   755898      movie  War of the Worlds   2025-07-29          2025   

   recency_years recency_bucket    primary_genre                       genres  \
0              0         Recent           Action  Action|Comedy|Crime|Mystery   
1              0         Recent          Romance                Romance|Drama   
2              1         Recent  Science Fiction     Science Fiction|Thr

## Build provider_catalog_mart
Joins availability with title metadata. Grain: `(tmdb_id, media_type, canonical_provider, country)`.
This is the base fact table — all other mart tables derive from this.

In [3]:
catalog_mart = availability.merge(
    titles,
    on=["tmdb_id", "media_type"],
    how="left"
)

no_meta = catalog_mart["title"].isna().sum()
print(f"Rows with no metadata match: {no_meta} ({no_meta/len(catalog_mart)*100:.2f}%)")

CATALOG_COLS = [
    "tmdb_id", "media_type", "canonical_provider", "country",
    "title", "release_year", "recency_bucket",
    "primary_genre", "genres",
    "origin_country", "original_language",
    "vote_average", "vote_count", "popularity",
    "runtime_min",
]
catalog_mart = catalog_mart[CATALOG_COLS]

print(f"\nCatalog mart shape: {catalog_mart.shape}")
print("\nRows per provider x country:")
print(
    catalog_mart.groupby(["canonical_provider", "country"])
    .size()
    .unstack(fill_value=0)
    .to_string()
)

Rows with no metadata match: 0 (0.00%)

Catalog mart shape: (127399, 15)

Rows per provider x country:
country               DE    ES    FR     GB    IT    KR     US
canonical_provider                                            
Amazon Prime Video  8026     0     0  13451     0     0  14517
Apple TV Plus        309   309   305    311   314   305    317
Disney Plus         3489  3525  3620   4068  3570  2669   2349
Netflix             8699  9396  8651   8863  9015  8490   7794
Paramount Plus         0     0   878    933     0     0   3226
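
The join above relies on `titles` having at most one row per `(tmdb_id, media_type)`. A minimal sketch of how that assumption could be enforced at merge time uses the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames mirroring the join keys used in this notebook
availability_toy = pd.DataFrame({
    "tmdb_id":            [1, 1, 2, 3],
    "media_type":         ["movie", "movie", "tv", "movie"],
    "canonical_provider": ["Netflix", "Disney Plus", "Netflix", "Netflix"],
    "country":            ["US", "US", "US", "US"],
})
titles_toy = pd.DataFrame({
    "tmdb_id":    [1, 2],
    "media_type": ["movie", "tv"],
    "title":      ["Title A", "Title B"],
})

merged = availability_toy.merge(
    titles_toy,
    on=["tmdb_id", "media_type"],
    how="left",
    validate="many_to_one",  # raises MergeError if titles_toy has duplicate keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows that found no metadata on the right-hand side
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows in: {len(availability_toy)}, rows out: {len(merged)}, unmatched: {unmatched}")
```

With `validate="many_to_one"`, a duplicated title key fails loudly instead of silently multiplying availability rows. The `_merge` column gives the same information as the `title.isna()` check without depending on a specific metadata column.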


## Build provider_market_metrics
Computes all strategic metrics at `(canonical_provider, country)` grain.
Includes three composite scores: Recency Intensity, Genre Concentration, and International Diversification.

In [4]:
metrics = []

for (provider, country), group in catalog_mart.groupby(["canonical_provider", "country"]):

    # Catalog size
    catalog_size  = len(group)
    unique_titles = group["tmdb_id"].nunique()

    # Movie / TV split
    movie_pct = (group["media_type"] == "movie").mean()
    tv_pct    = (group["media_type"] == "tv").mean()

    # Weighted rating — vote_count as weight to avoid low-vote title inflation
    valid = group[group["vote_count"].notna() & group["vote_average"].notna()]
    if valid["vote_count"].sum() > 0:
        weighted_rating = (valid["vote_average"] * valid["vote_count"]).sum() / valid["vote_count"].sum()
    else:
        weighted_rating = None

    # Recency mix
    total = group["recency_bucket"].notna().sum()
    recent_pct = (group["recency_bucket"] == "Recent").sum() / total if total > 0 else None
    mid_pct    = (group["recency_bucket"] == "Mid").sum()    / total if total > 0 else None
    legacy_pct = (group["recency_bucket"] == "Legacy").sum() / total if total > 0 else None

    # Localization index — % titles where origin_country matches the market country
    has_origin = group["origin_country"].notna()
    if has_origin.sum() > 0:
        localization_index = (group.loc[has_origin, "origin_country"] == country).mean()
    else:
        localization_index = None

    # Recency Intensity Score — vote_count-weighted average recency score
    # Recent=3, Mid=2, Legacy=1 — higher = more investment in new content
    recency_map = {"Recent": 3, "Mid": 2, "Legacy": 1}
    group_r = group.copy()
    group_r["recency_score"] = group_r["recency_bucket"].map(recency_map)
    valid_r = group_r[group_r["recency_score"].notna() & group_r["vote_count"].notna()]
    if valid_r["vote_count"].sum() > 0:
        recency_intensity = (valid_r["recency_score"] * valid_r["vote_count"]).sum() / valid_r["vote_count"].sum()
    else:
        recency_intensity = None

    # Genre Concentration Index (Herfindahl) — sum of squared genre shares
    # 1.0 = single genre only | 1/N = evenly spread across N genres
    genre_counts = group["primary_genre"].dropna().value_counts(normalize=True)
    genre_concentration = (genre_counts ** 2).sum() if len(genre_counts) > 0 else None

    # International Diversification Score
    # Combines % non-local content with breadth of distinct origin countries
    has_origin_all = group["origin_country"].dropna()
    if len(has_origin_all) > 0:
        pct_non_local      = (has_origin_all != country).mean()
        distinct_countries = has_origin_all.nunique()
        intl_score = pct_non_local * np.log1p(distinct_countries) / np.log1p(100)
    else:
        intl_score = None

    metrics.append({
        "canonical_provider":         provider,
        "country":                    country,
        "catalog_size":               catalog_size,
        "unique_titles":              unique_titles,
        "movie_pct":                  round(movie_pct, 4),
        "tv_pct":                     round(tv_pct, 4),
        # Compare against None explicitly so a legitimate 0.0 is not turned into None
        "weighted_rating":            round(weighted_rating, 4) if weighted_rating is not None else None,
        "recent_pct":                 round(recent_pct, 4) if recent_pct is not None else None,
        "mid_pct":                    round(mid_pct, 4) if mid_pct is not None else None,
        "legacy_pct":                 round(legacy_pct, 4) if legacy_pct is not None else None,
        "localization_index":         round(localization_index, 4) if localization_index is not None else None,
        "recency_intensity_score":    round(recency_intensity, 4) if recency_intensity is not None else None,
        "genre_concentration_idx":    round(genre_concentration, 4) if genre_concentration is not None else None,
        "intl_diversification_score": round(intl_score, 4) if intl_score is not None else None,
    })

market_metrics = pd.DataFrame(metrics)

print(f"Market metrics shape: {market_metrics.shape}")
print()
print(market_metrics.to_string(index=False))

Market metrics shape: (27, 14)

canonical_provider country  catalog_size  unique_titles  movie_pct  tv_pct  weighted_rating  recent_pct  mid_pct  legacy_pct  localization_index  recency_intensity_score  genre_concentration_idx  intl_diversification_score
Amazon Prime Video      DE          8026           8010     0.8145  0.1855           7.0347      0.2865   0.3317      0.3819              0.1099                   1.6474                   0.1018                      0.8783
Amazon Prime Video      GB         13451          13387     0.7434  0.2566           6.9100      0.2792   0.3581      0.3627              0.0853                   1.6047                   0.1093                      0.9298
Amazon Prime Video      US         14517          14398     0.6888  0.3112           6.7451      0.2336   0.3695      0.3970              0.5336                   1.6478                   0.1095                      0.4703
     Apple TV Plus      DE           309            309     0.3269  0.6731  
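
To make the two composite scores easier to read, here is a worked sketch of the Genre Concentration Index and the International Diversification Score on a single toy group. The four rows and the `market` value are hypothetical. The formulas follow the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy group for one (provider, country) pair
toy = pd.DataFrame({
    "primary_genre":  ["Drama", "Drama", "Comedy", "Action"],
    "origin_country": ["US", "GB", "KR", "US"],
})
market = "US"

# Herfindahl index: sum of squared genre shares (0.5, 0.25, 0.25 here)
shares = toy["primary_genre"].value_counts(normalize=True)
hhi = float((shares ** 2).sum())

# Lower bound for this group: 1 / number of distinct genres
hhi_floor = 1 / shares.size

# International diversification: share of non-local titles, scaled by
# log1p(distinct origin countries) / log1p(100)
origins = toy["origin_country"].dropna()
pct_non_local = float((origins != market).mean())
distinct = int(origins.nunique())
intl_score = pct_non_local * np.log1p(distinct) / np.log1p(100)

print(f"hhi={hhi:.4f} (floor {hhi_floor:.4f}), "
      f"pct_non_local={pct_non_local:.2f}, distinct={distinct}, intl_score={intl_score:.4f}")
```

Two points follow from the formulas. The Herfindahl index cannot fall below 1/N for N genres, so values near 0.10 already indicate a fairly even spread across roughly ten genres. The `log1p(100)` denominator also means the diversification score would exceed `pct_non_local` if a catalog drew on more than 100 origin countries.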

## Build overlap_matrix
Computes Jaccard similarity between every provider pair for each country.
Grain: `(country, provider_a, provider_b)`. Self-comparisons (Jaccard = 1.0) are included for completeness.

In [5]:
overlap_rows = []

for country, group in catalog_mart.groupby("country"):
    provider_sets = {
        provider: set(subgroup["tmdb_id"].unique())
        for provider, subgroup in group.groupby("canonical_provider")
    }

    providers = sorted(provider_sets.keys())

    for i, prov_a in enumerate(providers):
        for prov_b in providers[i:]:  # upper triangle including diagonal
            set_a = provider_sets[prov_a]
            set_b = provider_sets[prov_b]

            intersection = len(set_a & set_b)
            union        = len(set_a | set_b)
            jaccard      = intersection / union if union > 0 else 0.0

            overlap_rows.append({
                "country":      country,
                "provider_a":   prov_a,
                "provider_b":   prov_b,
                "titles_a":     len(set_a),
                "titles_b":     len(set_b),
                "intersection": intersection,
                "union":        union,
                "jaccard":      round(jaccard, 4),
            })

overlap_matrix = pd.DataFrame(overlap_rows)

print(f"Overlap matrix shape: {overlap_matrix.shape}")
print()
us_overlap = overlap_matrix[
    (overlap_matrix["country"] == "US") &
    (overlap_matrix["provider_a"] != overlap_matrix["provider_b"])
][["provider_a", "provider_b", "jaccard"]].sort_values("jaccard", ascending=False)
print("Jaccard similarity — US:")
print(us_overlap.to_string(index=False))

Overlap matrix shape: (68, 8)

Jaccard similarity — US:
        provider_a     provider_b  jaccard
Amazon Prime Video        Netflix   0.0132
Amazon Prime Video Paramount Plus   0.0083
           Netflix Paramount Plus   0.0065
Amazon Prime Video    Disney Plus   0.0041
       Disney Plus        Netflix   0.0039
       Disney Plus Paramount Plus   0.0021
     Apple TV Plus    Disney Plus   0.0004
Amazon Prime Video  Apple TV Plus   0.0002
     Apple TV Plus        Netflix   0.0000
     Apple TV Plus Paramount Plus   0.0000
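
The cell above builds each provider's set from `tmdb_id` alone, while the mart grain also includes `media_type`. The sketch below shows how the two keying choices can differ. It assumes that TMDB numbers movies and TV shows independently, so the same integer can refer to two different titles. The catalogs and the `jaccard` helper are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical catalogs keyed by (tmdb_id, media_type)
provider_x = {(100, "movie"), (200, "movie"), (300, "tv")}
provider_y = {(100, "movie"), (100, "tv"), (300, "tv")}

# Keyed by the full pair: 2 shared titles out of 4 distinct ones
by_pair = jaccard(provider_x, provider_y)

# Keyed by tmdb_id alone: (100, "movie") and (100, "tv") collapse into one id
by_id = jaccard({key[0] for key in provider_x}, {key[0] for key in provider_y})

print(f"by (tmdb_id, media_type): {by_pair:.4f}")  # 0.5000
print(f"by tmdb_id only         : {by_id:.4f}")    # 0.6667
```

If such collisions exist in the real data, keying the sets on `zip(subgroup["tmdb_id"], subgroup["media_type"])` would keep the overlap computation aligned with the mart grain. The small gap between `catalog_size` and `unique_titles` in the metrics table is consistent with that possibility, though it does not prove it.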


## Fix column dtypes before saving to parquet

In [6]:

# provider_catalog_mart — numeric columns stored as object
catalog_numeric = ['vote_average', 'vote_count', 'popularity', 'runtime_min', 'release_year']
for col in catalog_numeric:
    if col in catalog_mart.columns:
        catalog_mart[col] = pd.to_numeric(catalog_mart[col], errors='coerce')

# provider_market_metrics — float scores
metrics_numeric = [
    'catalog_size', 'unique_titles', 'movie_pct', 'tv_pct',
    'weighted_rating', 'recent_pct', 'mid_pct', 'legacy_pct',
    'localization_index', 'recency_intensity_score',
    'genre_concentration_idx', 'intl_diversification_score'
]
for col in metrics_numeric:
    if col in market_metrics.columns:
        market_metrics[col] = pd.to_numeric(market_metrics[col], errors='coerce')

# overlap_matrix
overlap_numeric = ['jaccard', 'intersection', 'union', 'titles_a', 'titles_b']
for col in overlap_numeric:
    if col in overlap_matrix.columns:
        overlap_matrix[col] = pd.to_numeric(overlap_matrix[col], errors='coerce')

print("Dtypes fixed:")
print(catalog_mart[catalog_numeric].dtypes)
print(market_metrics[metrics_numeric].dtypes)
print(overlap_matrix[overlap_numeric].dtypes)

Dtypes fixed:
vote_average    float64
vote_count        Int64
popularity      float64
runtime_min     float64
release_year      Int64
dtype: object
catalog_size                    int64
unique_titles                   int64
movie_pct                     float64
tv_pct                        float64
weighted_rating               float64
recent_pct                    float64
mid_pct                       float64
legacy_pct                    float64
localization_index            float64
recency_intensity_score       float64
genre_concentration_idx       float64
intl_diversification_score    float64
dtype: object
jaccard         float64
intersection      int64
union             int64
titles_a          int64
titles_b          int64
dtype: object
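
Because `errors='coerce'` silently replaces unparseable values with NaN, it can be useful to count how many nulls the conversion introduces. A minimal sketch on a hypothetical object-typed column:

```python
import pandas as pd

# Hypothetical object-typed column: two numeric strings, one bad entry, one existing null
raw = pd.Series(["7.5", "8", "n/a", None], dtype="object")

converted = pd.to_numeric(raw, errors="coerce")

# Nulls introduced by coercion = nulls after conversion minus nulls before
introduced = int(converted.isna().sum() - raw.isna().sum())

print(converted.dtype)                                # float64
print(f"values lost to coercion: {introduced}")       # 1
```

A non-zero count points at values that were present in the CLEAN layer but not numeric, which is usually worth investigating upstream rather than in the mart.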


In [6]:
# ── Quality Assertions ──────────────────────────────────────────────────────

assert not catalog_mart.empty, "provider_catalog_mart is empty"
assert not market_metrics.empty, "provider_market_metrics is empty"
assert not overlap_matrix.empty, "overlap_matrix is empty"

# Critical key columns: fail if fully null or if the null rate reaches 5%
critical_catalog = ["canonical_provider", "country", "tmdb_id", "media_type"]
for col in critical_catalog:
    null_rate = catalog_mart[col].isna().mean()
    assert null_rate < 1.0, f"[catalog_mart] '{col}' is fully null"
    assert null_rate < 0.05, f"[catalog_mart] '{col}' has {null_rate:.1%} nulls (limit: 5%); check pipeline"

critical_metrics = ["canonical_provider", "country", "catalog_size", "weighted_rating"]
for col in critical_metrics:
    null_rate = market_metrics[col].isna().mean()
    assert null_rate < 1.0, f"[market_metrics] '{col}' is fully null"

# Jaccard must be between 0 and 1
assert overlap_matrix["jaccard"].between(0, 1).all(), \
    "jaccard values out of [0, 1] range"

# Providers must match expected set
expected_providers = {
    "Netflix", "Amazon Prime Video", "Disney Plus", "Apple TV Plus", "Paramount Plus"
}
actual_providers = set(catalog_mart["canonical_provider"].unique())
assert actual_providers == expected_providers, \
    f"Provider mismatch — expected: {expected_providers}, got: {actual_providers}"

print("✅ All assertions passed.")
print(f"   catalog_mart      : {len(catalog_mart):,} rows")
print(f"   market_metrics    : {len(market_metrics):,} rows")
print(f"   overlap_matrix    : {len(overlap_matrix):,} rows")

✅ All assertions passed.
   catalog_mart      : 127,399 rows
   market_metrics    : 27 rows
   overlap_matrix    : 68 rows
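
The assertions above do not check that the catalog mart is unique at its stated grain. One possible additional check, sketched on a hypothetical frame, uses `DataFrame.duplicated` on the grain columns. The `GRAIN` constant is an illustrative name, not one defined in this notebook.

```python
import pandas as pd

# Grain stated for provider_catalog_mart
GRAIN = ["tmdb_id", "media_type", "canonical_provider", "country"]

# Hypothetical frame in which the first two rows share the same grain key
toy = pd.DataFrame({
    "tmdb_id":            [1, 1, 2],
    "media_type":         ["movie", "movie", "tv"],
    "canonical_provider": ["Netflix", "Netflix", "Netflix"],
    "country":            ["US", "US", "US"],
})

# duplicated() marks every repeat of a key after its first occurrence
dupes = int(toy.duplicated(subset=GRAIN).sum())
print(f"duplicate grain rows: {dupes}")  # 1

# After de-duplication the uniqueness assertion holds
deduped = toy.drop_duplicates(subset=GRAIN)
assert not deduped.duplicated(subset=GRAIN).any(), "grain is not unique"
```

On the real mart the equivalent line would be `assert not catalog_mart.duplicated(subset=GRAIN).any()`, which guards the size and percentage metrics against double counting.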


## Save MART Files + Manifest

In [7]:
catalog_path  = os.path.join(DATA_MART_DIR, "provider_catalog_mart.parquet")
metrics_path  = os.path.join(DATA_MART_DIR, "provider_market_metrics.parquet")
overlap_path  = os.path.join(DATA_MART_DIR, "overlap_matrix.parquet")

catalog_mart.to_parquet(catalog_path,   index=False)
market_metrics.to_parquet(metrics_path, index=False)
overlap_matrix.to_parquet(overlap_path, index=False)

print(f"Saved: {catalog_path}")
print(f"Saved: {metrics_path}")
print(f"Saved: {overlap_path}")

manifest = {
    "snapshot_date": SNAPSHOT_DATE,
    "current_year": CURRENT_YEAR,
    "source": "TMDB",
    "layer": "mart",
    "row_counts": {
        "provider_catalog_mart":    int(len(catalog_mart)),
        "provider_market_metrics":  int(len(market_metrics)),
        "overlap_matrix":           int(len(overlap_matrix)),
    },
    "providers": sorted(catalog_mart["canonical_provider"].unique().tolist()),
    "countries": sorted(catalog_mart["country"].unique().tolist()),
    "recency_buckets": {
        "Recent": f"release_year >= {CURRENT_YEAR - 5}",
        "Mid":    f"{CURRENT_YEAR - 15} < release_year < {CURRENT_YEAR - 5}",
        "Legacy": f"release_year <= {CURRENT_YEAR - 15}",
    },
    "metrics": [
        "catalog_size", "unique_titles", "movie_pct", "tv_pct",
        "weighted_rating", "recent_pct", "mid_pct", "legacy_pct",
        "localization_index", "recency_intensity_score",
        "genre_concentration_idx", "intl_diversification_score",
        "jaccard_similarity",
    ],
    "notes": [
        "Amazon Prime Video available only for US, GB, DE — TMDB does not map flatrate for IT, FR, ES, KR.",
        "Paramount Plus available only for US, GB, FR.",
        "Jaccard similarity computed on tmdb_id sets per country.",
        "Recency Intensity Score: vote_count-weighted avg (Recent=3, Mid=2, Legacy=1).",
        "Genre Concentration Index: Herfindahl index on primary_genre shares.",
        "International Diversification Score: pct_non_local * log1p(distinct_countries) / log1p(100).",
    ],
}

manifest_path = os.path.join(DATA_MART_DIR, "manifest.json")
with open(manifest_path, "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)

print(f"\nManifest written to: {manifest_path}")
print("\nMart build complete.")

Saved: c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\mart\provider_catalog_mart.parquet
Saved: c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\mart\provider_market_metrics.parquet
Saved: c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\mart\overlap_matrix.parquet

Manifest written to: c:\Users\matt\OneDrive\Desktop\Data_Projects\Streaming_Benchmark_Project\data\mart\manifest.json

Mart build complete.
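
A downstream consumer could use the manifest to confirm that the tables it loads match what this notebook wrote. The sketch below is one way to do that, using only the standard library. The `check_manifest` helper is hypothetical, and the demo writes a temporary manifest containing the row counts printed above instead of reading the real file.

```python
import json
import tempfile
from pathlib import Path

def check_manifest(manifest_path, actual_counts):
    """Return {table: (expected, actual)} for tables whose row counts differ."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    expected = manifest["row_counts"]
    return {
        name: (expected.get(name), actual_counts.get(name))
        for name in sorted(set(expected) | set(actual_counts))
        if expected.get(name) != actual_counts.get(name)
    }

# Round-trip demo against a temporary manifest
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "manifest.json"
    path.write_text(json.dumps({"row_counts": {
        "provider_catalog_mart": 127399,
        "provider_market_metrics": 27,
        "overlap_matrix": 68,
    }}), encoding="utf-8")

    # Counts that agree with the manifest produce an empty result
    ok = check_manifest(path, {"provider_catalog_mart": 127399,
                               "provider_market_metrics": 27,
                               "overlap_matrix": 68})

    # A table that lost a row is reported as (expected, actual)
    bad = check_manifest(path, {"provider_catalog_mart": 127399,
                                "provider_market_metrics": 26,
                                "overlap_matrix": 68})

print(ok)   # {}
print(bad)  # {'provider_market_metrics': (27, 26)}
```

In practice `actual_counts` would come from `len(pd.read_parquet(...))` for each mart file.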
