# Team USA Olympic & Paralympic Data Pipeline v2
### From Raw CSVs → Unified, Enriched, BigQuery-Ready Tables

**Purpose:** Acquire, clean, merge, and enrich Olympic + Paralympic data for every Team USA athlete across all available Games years (1896–2024).

**Output Tables:**
- `team_usa_athletes`: One row per unique athlete (identity, career stats, AI-generated profiles, vector embeddings)
- `team_usa_results`: One row per athlete × event × Games (the fact table)

**Final Schema (Athletes, 19 columns):**

| Column | Type | Description |
|--------|------|-------------|
| `athlete_id` | UUID5 | Deterministic ID from name + first_games_year + primary_sport |
| `name` | VARCHAR | Full name (Gemini-verified where possible) |
| `gender` | VARCHAR | Male / Female / NULL |
| `birth_date` | DATE | Sparse for Paralympic athletes |
| `games_type` | VARCHAR | Olympic / Paralympic |
| `games_season` | VARCHAR | Summer / Winter / Both |
| `primary_sport` | VARCHAR | Normalized sport name |
| `classification_code` | VARCHAR | Paralympic only (e.g., T54, S6, B1) |
| `height_cm` | FLOAT | Physical height (Olympic-heavy coverage) |
| `weight_kg` | FLOAT | Physical weight (Olympic-heavy coverage) |
| `first_games_year` | INT | First Games appearance |
| `last_games_year` | INT | Most recent Games appearance |
| `games_count` | INT | Number of Games appearances |
| `gold_count` | INT | Career gold medals |
| `silver_count` | INT | Career silver medals |
| `bronze_count` | INT | Career bronze medals |
| `total_medals` | INT | Career total medals |
| `profile_summary` | TEXT | AI-generated 2-paragraph bio (absorbs all bio fields) |
| `embedding` | VECTOR(3072) | Gemini embedding for similarity search |

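The `athlete_id` column is described as a deterministic UUID5 built from name, first Games year, and primary sport. A minimal sketch of how such an ID could be derived follows. The namespace value, the `make_athlete_id` helper, the `|`-joined key format, and the sample athlete are illustrative assumptions, not the pipeline's confirmed implementation. The point is that identical inputs always reproduce the same ID, so re-running the pipeline does not reshuffle keys.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
ATHLETE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "team-usa.example")


def make_athlete_id(name, first_games_year, primary_sport):
    """Build a deterministic UUID5 from the three identity fields."""
    # Normalize so trivial differences in case or spacing do not change the ID.
    key = "|".join([
        " ".join(str(name).split()).upper(),
        str(int(first_games_year)),
        str(primary_sport).strip().upper(),
    ])
    return str(uuid.uuid5(ATHLETE_NAMESPACE, key))


# Made-up athlete, used only to show determinism.
id_a = make_athlete_id("Jane Example", 1996, "Swimming")
id_b = make_athlete_id("  jane   example ", 1996.0, "swimming")
id_c = make_athlete_id("Jane Example", 1996, "Diving")
print(id_a == id_b)  # same inputs after normalization give the same ID
print(id_a == id_c)  # a different sport gives a different ID
```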
**Pipeline Phases:**
- **Phase 1:** Setup & data acquisition
- **Phase 2:** Olympic athlete backbone
- **Phase 3:** Paralympic athlete backbone
- **Phase 4:** Unification (merge, deduplicate, normalize)
- **Phase 5:** Results table & career stat backfill
- **Phase 6:** Gemini enrichment (name verification + profiles + embeddings)
- **Phase 7:** Validation & export

**Data sources:** `gs://class-demo/team-usa/raw/`
**Final output:** `gs://class-demo/team-usa/final/`

---
## Phase 1: Setup & Data Acquisition

Configure the environment, download raw CSVs from GCS, and build two reference tables we'll use throughout the pipeline:

1. **Games Lookup**: Every Olympic and Paralympic Games mapped to year, city, season, and type. This gives us reliable `games_season` derivation and richer context for Gemini prompts later.
2. **File Inventory**: Catalog of every CSV we downloaded, with per-file sizes.

**Sources:** `gs://class-demo/team-usa/raw/` (6 dataset folders, 119 CSV files)

In [64]:
# ── Phase 1, Step 1: Environment & Configuration ─────────────────

import pandas as pd
import numpy as np
import os
import re
import glob
import json
import uuid
import warnings
warnings.filterwarnings('ignore')

# Configuration
BUCKET = 'gs://class-demo/team-usa'
RAW_PATH = f'{BUCKET}/raw'
FINAL_PATH = f'{BUCKET}/final'
LOCAL_DIR = '/tmp/olympic-data'

os.makedirs(LOCAL_DIR, exist_ok=True)

PROJECT_ID = "qwiklabs-gcp-01-bafc8841fc77"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

print(f'Project:    {PROJECT_ID}')
print(f'Region:     {REGION}')
print(f'Raw data:   {RAW_PATH}')
print(f'Final out:  {FINAL_PATH}')
print(f'Local dir:  {LOCAL_DIR}')

Project:    qwiklabs-gcp-01-bafc8841fc77
Region:     us-central1
Raw data:   gs://class-demo/team-usa/raw
Final out:  gs://class-demo/team-usa/final
Local dir:  /tmp/olympic-data


In [65]:
# ── Phase 1, Step 2: Games Lookup Table ───────────────────────────
# Every Olympic and Paralympic Games with year, city, season, and type.
# Used for games_season derivation and Gemini prompt context.

GAMES_LOOKUP = {
    # ── Olympic Summer ──
    (1896, 'Olympic'): ('Athens', 'Summer'),
    (1900, 'Olympic'): ('Paris', 'Summer'),
    (1904, 'Olympic'): ('St. Louis', 'Summer'),
    (1906, 'Olympic'): ('Athens', 'Summer'),        # Intercalated
    (1908, 'Olympic'): ('London', 'Summer'),
    (1912, 'Olympic'): ('Stockholm', 'Summer'),
    (1920, 'Olympic'): ('Antwerp', 'Summer'),
    (1924, 'Olympic'): ('Paris', 'Summer'),
    (1928, 'Olympic'): ('Amsterdam', 'Summer'),
    (1932, 'Olympic'): ('Los Angeles', 'Summer'),
    (1936, 'Olympic'): ('Berlin', 'Summer'),
    (1948, 'Olympic'): ('London', 'Summer'),
    (1952, 'Olympic'): ('Helsinki', 'Summer'),
    (1956, 'Olympic'): ('Melbourne', 'Summer'),
    (1960, 'Olympic'): ('Rome', 'Summer'),
    (1964, 'Olympic'): ('Tokyo', 'Summer'),
    (1968, 'Olympic'): ('Mexico City', 'Summer'),
    (1972, 'Olympic'): ('Munich', 'Summer'),
    (1976, 'Olympic'): ('Montreal', 'Summer'),
    (1980, 'Olympic'): ('Moscow', 'Summer'),
    (1984, 'Olympic'): ('Los Angeles', 'Summer'),
    (1988, 'Olympic'): ('Seoul', 'Summer'),
    (1992, 'Olympic'): ('Barcelona', 'Summer'),
    (1996, 'Olympic'): ('Atlanta', 'Summer'),
    (2000, 'Olympic'): ('Sydney', 'Summer'),
    (2004, 'Olympic'): ('Athens', 'Summer'),
    (2008, 'Olympic'): ('Beijing', 'Summer'),
    (2012, 'Olympic'): ('London', 'Summer'),
    (2016, 'Olympic'): ('Rio de Janeiro', 'Summer'),
    (2020, 'Olympic'): ('Tokyo', 'Summer'),
    (2024, 'Olympic'): ('Paris', 'Summer'),

    # ── Olympic Winter (pre-1994 uses 3-tuple to disambiguate from Summer) ──
    (1924, 'Olympic', 'Winter'): ('Chamonix', 'Winter'),
    (1928, 'Olympic', 'Winter'): ('St. Moritz', 'Winter'),
    (1932, 'Olympic', 'Winter'): ('Lake Placid', 'Winter'),
    (1936, 'Olympic', 'Winter'): ('Garmisch-Partenkirchen', 'Winter'),
    (1948, 'Olympic', 'Winter'): ('St. Moritz', 'Winter'),
    (1952, 'Olympic', 'Winter'): ('Oslo', 'Winter'),
    (1956, 'Olympic', 'Winter'): ('Cortina d\'Ampezzo', 'Winter'),
    (1960, 'Olympic', 'Winter'): ('Squaw Valley', 'Winter'),
    (1964, 'Olympic', 'Winter'): ('Innsbruck', 'Winter'),
    (1968, 'Olympic', 'Winter'): ('Grenoble', 'Winter'),
    (1972, 'Olympic', 'Winter'): ('Sapporo', 'Winter'),
    (1976, 'Olympic', 'Winter'): ('Innsbruck', 'Winter'),
    (1980, 'Olympic', 'Winter'): ('Lake Placid', 'Winter'),
    (1984, 'Olympic', 'Winter'): ('Sarajevo', 'Winter'),
    (1988, 'Olympic', 'Winter'): ('Calgary', 'Winter'),
    (1992, 'Olympic', 'Winter'): ('Albertville', 'Winter'),
    # From 1994: Winter on its own cycle, so 2-tuple keys work fine
    (1994, 'Olympic'): ('Lillehammer', 'Winter'),
    (1998, 'Olympic'): ('Nagano', 'Winter'),
    (2002, 'Olympic'): ('Salt Lake City', 'Winter'),
    (2006, 'Olympic'): ('Turin', 'Winter'),
    (2010, 'Olympic'): ('Vancouver', 'Winter'),
    (2014, 'Olympic'): ('Sochi', 'Winter'),
    (2018, 'Olympic'): ('PyeongChang', 'Winter'),
    (2022, 'Olympic'): ('Beijing', 'Winter'),

    # ── Paralympic Summer ──
    (1960, 'Paralympic'): ('Rome', 'Summer'),
    (1964, 'Paralympic'): ('Tokyo', 'Summer'),
    (1968, 'Paralympic'): ('Tel Aviv', 'Summer'),
    (1972, 'Paralympic'): ('Heidelberg', 'Summer'),
    (1976, 'Paralympic'): ('Toronto', 'Summer'),
    (1980, 'Paralympic'): ('Arnhem', 'Summer'),
    (1984, 'Paralympic'): ('Stoke Mandeville/New York', 'Summer'),
    (1988, 'Paralympic'): ('Seoul', 'Summer'),
    (1992, 'Paralympic'): ('Barcelona', 'Summer'),
    (1996, 'Paralympic'): ('Atlanta', 'Summer'),
    (2000, 'Paralympic'): ('Sydney', 'Summer'),
    (2004, 'Paralympic'): ('Athens', 'Summer'),
    (2008, 'Paralympic'): ('Beijing', 'Summer'),
    (2012, 'Paralympic'): ('London', 'Summer'),
    (2016, 'Paralympic'): ('Rio de Janeiro', 'Summer'),
    (2020, 'Paralympic'): ('Tokyo', 'Summer'),
    (2024, 'Paralympic'): ('Paris', 'Summer'),

    # ── Paralympic Winter (pre-1994 uses 3-tuple) ──
    (1976, 'Paralympic', 'Winter'): ('Örnsköldsvik', 'Winter'),
    (1980, 'Paralympic', 'Winter'): ('Geilo', 'Winter'),
    (1984, 'Paralympic', 'Winter'): ('Innsbruck', 'Winter'),
    (1988, 'Paralympic', 'Winter'): ('Innsbruck', 'Winter'),
    (1992, 'Paralympic', 'Winter'): ('Albertville', 'Winter'),
    # From 1994: own cycle
    (1994, 'Paralympic'): ('Lillehammer', 'Winter'),
    (1998, 'Paralympic'): ('Nagano', 'Winter'),
    (2002, 'Paralympic'): ('Salt Lake City', 'Winter'),
    (2006, 'Paralympic'): ('Turin', 'Winter'),
    (2010, 'Paralympic'): ('Vancouver', 'Winter'),
    (2014, 'Paralympic'): ('Sochi', 'Winter'),
    (2018, 'Paralympic'): ('PyeongChang', 'Winter'),
    (2022, 'Paralympic'): ('Beijing', 'Winter'),
}


def lookup_games(year, games_type, season_hint=None):
    """Look up city and season for a given Games year and type.

    For years where Summer and Winter overlap (pre-1994), use season_hint
    or the 3-tuple key to disambiguate.
    """
    if season_hint == 'Winter':
        key3 = (year, games_type, 'Winter')
        if key3 in GAMES_LOOKUP:
            return GAMES_LOOKUP[key3]
    key2 = (year, games_type)
    if key2 in GAMES_LOOKUP:
        return GAMES_LOOKUP[key2]
    return (None, None)


def get_season(year, games_type, season_hint=None):
    """Get just the season for a year+type."""
    _, season = lookup_games(year, games_type, season_hint)
    return season


# Quick validation
print(f'Games lookup entries: {len(GAMES_LOOKUP)}')
print(f'  Olympic:    {sum(1 for k in GAMES_LOOKUP if k[1] == "Olympic")}')
print(f'  Paralympic: {sum(1 for k in GAMES_LOOKUP if k[1] == "Paralympic")}')
print(f'\nSpot checks:')
print(f'  1996 Olympic → {lookup_games(1996, "Olympic")}')
print(f'  2020 Paralympic → {lookup_games(2020, "Paralympic")}')
print(f'  1992 Olympic Winter → {lookup_games(1992, "Olympic", "Winter")}')
print(f'  2024 Olympic → {lookup_games(2024, "Olympic")}')

Games lookup entries: 85
  Olympic:    55
  Paralympic: 30

Spot checks:
  1996 Olympic → ('Atlanta', 'Summer')
  2020 Paralympic → ('Tokyo', 'Summer')
  1992 Olympic Winter → ('Albertville', 'Winter')
  2024 Olympic → ('Paris', 'Summer')

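Because the lookup mixes 2-tuple and 3-tuple keys, a pre-1994 Winter row that arrives without a `season_hint` resolves to the Summer Games of the same year. The sketch below shows that behaviour on a three-entry excerpt of the table, together with one possible guard. `SAMPLE_LOOKUP`, the extra `table` parameter, and `lookup_games_strict` are hypothetical names introduced only for illustration.

```python
# Three-entry excerpt of GAMES_LOOKUP, copied from the table above.
SAMPLE_LOOKUP = {
    (1992, 'Olympic'): ('Barcelona', 'Summer'),
    (1992, 'Olympic', 'Winter'): ('Albertville', 'Winter'),
    (1994, 'Olympic'): ('Lillehammer', 'Winter'),
}


def lookup_games(year, games_type, season_hint=None, table=SAMPLE_LOOKUP):
    """Same logic as the notebook function, parameterized on the table."""
    if season_hint == 'Winter':
        key3 = (year, games_type, 'Winter')
        if key3 in table:
            return table[key3]
    return table.get((year, games_type), (None, None))


def lookup_games_strict(year, games_type, season_hint=None, table=SAMPLE_LOOKUP):
    """Hypothetical guard: refuse to guess when a year hosted both seasons."""
    ambiguous = (
        (year, games_type) in table
        and (year, games_type, 'Winter') in table
    )
    if ambiguous and season_hint not in ('Summer', 'Winter'):
        raise ValueError(f'{year} {games_type} needs a season_hint')
    return lookup_games(year, games_type, season_hint, table)


print(lookup_games(1992, 'Olympic'))             # falls back to the Summer entry
print(lookup_games(1992, 'Olympic', 'Winter'))   # resolves the Winter entry
print(lookup_games_strict(1994, 'Olympic'))      # unambiguous, no hint needed
```

Whether to raise or to default to Summer is a design choice; raising surfaces rows with a missing season early, at the cost of failing on incomplete data.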

In [66]:
# ── Phase 1, Step 3: Download Raw Data from GCS ──────────────────

!gsutil -m cp -r {RAW_PATH}/* {LOCAL_DIR}/

# Count what we got
csv_count = !find {LOCAL_DIR} -name "*.csv" | wc -l
print(f'\nTotal CSV files downloaded: {csv_count[0].strip()}')

Copying gs://class-demo/team-usa/raw/olympic-120years/noc_regions.csv...
Copying gs://class-demo/team-usa/raw/olympic-beijing2022/athletes.csv...
Copying gs://class-demo/team-usa/raw/olympic-120years/athlete_events.csv...
Copying gs://class-demo/team-usa/raw/olympic-beijing2022/coaches.csv...
Copying gs://class-demo/team-usa/raw/olympic-beijing2022/curling_results.csv...
Copying gs://class-demo/team-usa/raw/olympic-beijing2022/medals_total.csv...
Copying gs://class-demo/team-usa/raw/olympic-beijing2022/techni

In [67]:
# ── Phase 1, Step 4: File Inventory ───────────────────────────────

file_inventory = []

for root, dirs, files in os.walk(LOCAL_DIR):
    dirs[:] = [d for d in dirs if d != '.git']
    for f in sorted(files):
        if f.endswith('.csv'):
            full_path = os.path.join(root, f)
            size_mb = os.path.getsize(full_path) / (1024 * 1024)
            rel_path = full_path.replace(LOCAL_DIR + '/', '')
            dataset = rel_path.split('/')[0]
            file_inventory.append({
                'dataset': dataset,
                'file': f,
                'rel_path': rel_path,
                'full_path': full_path,
                'size_mb': size_mb
            })

inv_df = pd.DataFrame(file_inventory)

print('FILES BY DATASET:\n')
for dataset, group in inv_df.groupby('dataset'):
    total_mb = group['size_mb'].sum()
    print(f'{dataset}/ ({len(group)} files, {total_mb:.1f} MB total)')
    for _, row in group.iterrows():
        print(f'  {row["file"]:45s} {row["size_mb"]:8.2f} MB')
    print()

FILES BY DATASET:

olympic-120years/ (2 files, 39.6 MB total)
  athlete_events.csv                               39.58 MB
  noc_regions.csv                                   0.00 MB

olympic-beijing2022/ (10 files, 1.3 MB total)
  athletes.csv                                      0.55 MB
  coaches.csv                                       0.01 MB
  curling_results.csv                               0.03 MB
  entries_discipline.csv                            0.00 MB
  events.csv                                        0.11 MB
  hockey_players_stats.csv                          0.43 MB
  hockey_results.csv                                0.01 MB
  medals.csv                                        0.12 MB
  medals_total.csv                                  0.00 MB
  technical_officials.csv                           0.01 MB

olympic-keithgalli/ (5 files, 70.3 MB total)
  bios.csv                                         26.67 MB
  bios_locs.csv                                    11.99 MB
  noc

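The inventory cell above records file sizes only. If row counts are wanted as well, one way to compute them is sketched here. `count_csv_rows` is a hypothetical helper, and it assumes each file has a single header row and can be read as UTF-8 (undecodable bytes are replaced rather than raising). In the inventory loop the result could be stored as an extra key next to `size_mb`.

```python
import csv
import os
import tempfile


def count_csv_rows(path, encoding='utf-8'):
    """Count data rows (excluding the header) without loading the file into pandas."""
    with open(path, newline='', encoding=encoding, errors='replace') as fh:
        reader = csv.reader(fh)
        next(reader, None)            # skip the header row, if any
        return sum(1 for _ in reader)


# Self-contained demo on a throwaway file. One quoted field contains a newline,
# which csv.reader treats as part of a single record rather than a new row.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as fh:
        fh.write('name,notes\nA,"line one\nline two"\nB,plain\n')
    demo_rows = count_csv_rows(demo_path)

print(demo_rows)
```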
In [68]:
# ── Phase 1, Step 5: Shared Utilities ─────────────────────────────
# Name normalization and cleanup functions used across all phases.


def norm_name(n):
    """Normalize a name for matching: uppercase, sort tokens, strip non-alpha.
    Handles bullets (•), dots, hyphens consistently."""
    if pd.isna(n):
        return ''
    s = str(n).replace('•', ' ').replace('.', ' ').replace('-', ' ')
    s = re.sub(r'[^A-Za-z\s]', '', s).upper().strip()
    tokens = sorted(s.split())
    return ' '.join(tokens)


def clean_name_mechanical(n):
    """Mechanical name cleanup: bullets→spaces, smart title case.
    Handles Mc/O' prefixes and Roman numeral suffixes."""
    if pd.isna(n):
        return None
    s = str(n).strip()

    # Bullets as separators
    s = s.replace('•', ' ')

    # Collapse multiple spaces
    s = re.sub(r'\s+', ' ', s).strip()

    # Smart title case
    parts = []
    for word in s.split():
        w = word.strip()
        if not w:
            continue
        w_upper = w.upper()

        # Roman numerals
        if re.match(r'^(I{1,3}|IV|V|VI{0,3}|IX|X)$', w_upper):
            parts.append(w_upper)
        # Mc prefix
        elif w_upper.startswith('MC') and len(w) > 2:
            parts.append('Mc' + w[2:].title())
        # O' prefix
        elif w_upper.startswith("O'") and len(w) > 2:
            parts.append("O'" + w[2:].title())
        else:
            parts.append(w.title())

    return ' '.join(parts) if parts else None


def flip_name_order(n):
    """Flip 'LAST First' → 'First Last' for two-word names only.
    Leaves 3+ word names untouched to avoid compound-name errors."""
    if pd.isna(n):
        return n
    parts = str(n).strip().split()
    if len(parts) == 2:
        # If first part is all-caps and second is not, likely LAST First
        if parts[0].isupper() and not parts[1].isupper():
            return f'{parts[1]} {parts[0]}'
    return str(n).strip()


def is_abbreviated(n):
    """Check if a name is abbreviated (e.g., 'SMITH J' or 'JONES A.')."""
    if pd.isna(n):
        return True
    parts = str(n).strip().split()
    if len(parts) < 2:
        return True
    for p in parts:
        cleaned = p.replace('.', '')
        if len(cleaned) == 1:
            return True
    return False


# Classification code extraction for Paralympic athletes
CLASSIFICATION_PATTERN = re.compile(
    r'\b('
    r'[TF][1-5]\d'          # Track/Field: T11-T54, F11-F57
    r'|S[1-9]\d?'           # Swimming: S1-S14
    r'|S[BM]\d{1,2}'        # SB1-SB14, SM1-SM14
    r'|B[1-3]'              # Blind: B1-B3
    r'|BC[1-4]'             # Boccia: BC1-BC4
    r'|LW\d{1,2}'          # Skiing: LW1-LW12
    r'|H[1-5]'              # Cycling handcycle: H1-H5
    r'|C[1-5]'              # Cycling: C1-C5
    r'|PT[1-5]'             # Para triathlon: PT1-PT5
    r'|SH[12]'              # Shooting: SH1-SH2
    r'|SU[1-5]'             # Standing upper: SU5
    r'|TT\d{1,2}'          # Table tennis: TT1-TT11
    r'|KL[1-3]'             # Kayak: KL1-KL3
    r'|VL[1-3]'             # Va'a: VL1-VL3
    r')\b'
)


def extract_classification(text):
    """Extract Paralympic classification code from an event string."""
    if pd.isna(text):
        return None
    matches = CLASSIFICATION_PATTERN.findall(str(text))
    return matches[0] if matches else None


def most_common(s):
    """Return the mode of a Series (for groupby aggregation)."""
    mode = s.dropna().mode()
    return mode.iloc[0] if len(mode) > 0 else None


# Verify
print('✅ Shared utilities loaded')
print(f'  norm_name("John•Smith") → "{norm_name("John•Smith")}"')
print(f'  clean_name_mechanical("MCCARTHY•john") → "{clean_name_mechanical("MCCARTHY•john")}"')
print(f'  flip_name_order("SMITH John") → "{flip_name_order("SMITH John")}"')
# Keep the apostrophe out of the f-string expression: a backslash inside {...}
# is a SyntaxError before Python 3.12.
sample_event = "Men's 100m T54"
print(f'  extract_classification("{sample_event}") → "{extract_classification(sample_event)}"')

✅ Shared utilities loaded
  norm_name("John•Smith") → "JOHN SMITH"
  clean_name_mechanical("MCCARTHY•john") → "McCarthy John"
  flip_name_order("SMITH John") → "John SMITH"
  extract_classification("Men's 100m T54") → "T54"

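`norm_name` keeps only `A-Za-z`, so accented letters are dropped rather than folded: `José` becomes `JOS`, which will not match a source that spells the name `Jose`. If that matters for the name joins in later phases, one option is to fold accents before stripping. The sketch below is an assumption about a possible refinement, not part of the pipeline. `norm_name_folded` is a hypothetical name, and it checks for `None` instead of calling `pd.isna` so that it runs without pandas.

```python
import re
import unicodedata


def norm_name_folded(n):
    """Like norm_name, but fold accented letters to ASCII before stripping."""
    if n is None:
        return ''
    s = str(n).replace('•', ' ').replace('.', ' ').replace('-', ' ')
    # NFKD splits 'é' into 'e' plus a combining accent; the ASCII encode
    # with errors='ignore' then drops the accent and keeps the base letter.
    s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    s = re.sub(r'[^A-Za-z\s]', '', s).upper().strip()
    return ' '.join(sorted(s.split()))


print(norm_name_folded('José Núñez'))   # accents folded, tokens sorted
print(norm_name_folded('NUNEZ Jose'))   # same key regardless of token order
```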

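A table-driven check makes it easier to see what `CLASSIFICATION_PATTERN` does and does not cover. The sketch below copies the pattern from the cell above and runs it on a few hand-written event strings (illustrative inputs, not rows from the datasets). One thing it surfaces: the track and field branch `[TF][1-5]\d` stops at the 50s, so classes in the 60s range such as T63, which appear in recent Para athletics events, come back as `None`. Widening that branch to `[TF][1-6]\d` would be one possible fix, untested here against the real data.

```python
import re

# Pattern copied from the shared-utilities cell (comments omitted).
CLASSIFICATION_PATTERN = re.compile(
    r'\b('
    r'[TF][1-5]\d'
    r'|S[1-9]\d?'
    r'|S[BM]\d{1,2}'
    r'|B[1-3]'
    r'|BC[1-4]'
    r'|LW\d{1,2}'
    r'|H[1-5]'
    r'|C[1-5]'
    r'|PT[1-5]'
    r'|SH[12]'
    r'|SU[1-5]'
    r'|TT\d{1,2}'
    r'|KL[1-3]'
    r'|VL[1-3]'
    r')\b'
)


def extract_classification(text):
    """Return the first classification code found in an event string."""
    if text is None:
        return None
    matches = CLASSIFICATION_PATTERN.findall(str(text))
    return matches[0] if matches else None


# Hand-written sample strings, for illustration only.
SAMPLES = [
    "Men's 100m T54",
    "Women's 100m Backstroke S6",
    "Mixed 50m Rifle SH1",
    "Men's Long Jump T63",
]
results = {s: extract_classification(s) for s in SAMPLES}
for s, code in results.items():
    print(f'{s!r:35} -> {code}')
```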
In [69]:
# ── PHASE 1 QC REPORT ─────────────────────────────────────────────

print('=' * 70)
print('PHASE 1 QC: SETUP & DATA ACQUISITION')
print('=' * 70)

print(f'\nüìÇ DATA INVENTORY')
print(f'  Datasets:    {inv_df["dataset"].nunique()}')
print(f'  CSV files:   {len(inv_df)}')
print(f'  Total size:  {inv_df["size_mb"].sum():.1f} MB')

print(f'\nüìã DATASETS')
for dataset in sorted(inv_df['dataset'].unique()):
    sub = inv_df[inv_df['dataset'] == dataset]
    print(f'  {dataset:35s} {len(sub):2d} files, {sub["size_mb"].sum():7.1f} MB')

print(f'\nüó∫Ô∏è GAMES LOOKUP')
print(f'  Total entries: {len(GAMES_LOOKUP)}')
olympic_years = sorted(set(k[0] for k in GAMES_LOOKUP if k[1] == 'Olympic'))
para_years = sorted(set(k[0] for k in GAMES_LOOKUP if k[1] == 'Paralympic'))
print(f'  Olympic range:    {min(olympic_years)}‚Äì{max(olympic_years)} ({len(olympic_years)} editions)')
print(f'  Paralympic range: {min(para_years)}‚Äì{max(para_years)} ({len(para_years)} editions)')

print(f'\nüîß UTILITIES')
print(f'  ‚úÖ norm_name, clean_name_mechanical, flip_name_order')
print(f'  ‚úÖ extract_classification (Paralympic codes)')
print(f'  ‚úÖ lookup_games / get_season')

print(f'\n{"=" * 70}')
print('Ready to proceed to Phase 2: Olympic Athletes')
print('=' * 70)

PHASE 1 QC: SETUP & DATA ACQUISITION

📂 DATA INVENTORY
  Datasets:    6
  CSV files:   119
  Total size:  141.5 MB

📋 DATASETS
  olympic-120years                     2 files,    39.6 MB
  olympic-beijing2022                 10 files,     1.3 MB
  olympic-keithgalli                   5 files,    70.3 MB
  olympic-paris2024                   58 files,    14.3 MB
  paralympic-katiepress                2 files,     6.2 MB
  paralympic-piterfm                  42 files,     9.7 MB

🗺️ GAMES LOOKUP
  Total entries: 85
  Olympic range:    1896–2024 (39 editions)
  Paralympic range: 1960–2024 (25 editions)

🔧 UTILITIES
  ✅ norm_name, clean_name_mechanical, flip_name_order
  ✅ extract_classification (Paralympic codes)
  ✅ lookup_games / get_season

Ready to proceed to Phase 2: Olympic Athletes


---
## Phase 2: Olympic Athletes

Build the Olympic athlete backbone from keithgalli (bios + results through 2022), then identify and add Paris 2024 debut athletes.

**Sources:**
- `olympic-keithgalli/bios.csv` → USA athlete identification via NOC
- `olympic-keithgalli/bios_locs.csv` → Structured birth, height, weight
- `olympic-keithgalli/results.csv` → Career stats (Games appearances, medals, primary sport)
- `olympic-paris2024/athletes.csv` → Gap check for 2024 debut athletes

**Key operations:**
1. Join bios with structured data (birth info, physical attributes)
2. Compute career stats from results (games_count, medals, primary_sport)
3. Mechanical name cleanup (bullet separators → spaces, smart title case)
4. Identify Paris 2024 athletes not in keithgalli → add as new rows
5. Derive `games_season` from Games lookup table

In [70]:
# ── Phase 2, Step 1: Load keithgalli backbone ────────────────────
# bios.csv has NOC for USA filtering; bios_locs.csv has structured physical data

bios = pd.read_csv(
    os.path.join(LOCAL_DIR, 'olympic-keithgalli', 'bios.csv'),
    low_memory=False
)
usa_bios = bios[
    bios['NOC'].astype(str).str.contains('United States', case=False, na=False)
].copy()

bios_locs = pd.read_csv(
    os.path.join(LOCAL_DIR, 'olympic-keithgalli', 'bios_locs.csv'),
    low_memory=False
)

# Join USA athletes with their structured data
olympic_backbone = usa_bios[['athlete_id', 'Sex', 'Used name', 'Roles']].merge(
    bios_locs[['athlete_id', 'name', 'born_date', 'born_city', 'born_region',
                'born_country', 'height_cm', 'weight_kg']],
    on='athlete_id',
    how='left'
)

# Prefer 'Used name' from bios, fall back to bios_locs 'name'
olympic_backbone['display_name'] = olympic_backbone['Used name'].fillna(
    olympic_backbone['name']
)

has_h = olympic_backbone['height_cm'].notna() & (olympic_backbone['height_cm'] > 0)
has_w = olympic_backbone['weight_kg'].notna() & (olympic_backbone['weight_kg'] > 0)

print(f"Keithgalli USA athletes: {len(olympic_backbone):,}")
print(f"  With birth_date:  {olympic_backbone['born_date'].notna().sum():,}")
print(f"  With height (>0): {has_h.sum():,}")
print(f"  With weight (>0): {has_w.sum():,}")
print(f"\nSex distribution:\n{olympic_backbone['Sex'].value_counts().to_string()}")

Keithgalli USA athletes: 10,332
  With birth_date:  10,277
  With height (>0): 8,050
  With weight (>0): 7,648

Sex distribution:
Sex
Male      7379
Female    2953

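The left join above assumes `bios_locs.csv` has at most one row per `athlete_id`; if it contained duplicates, the backbone would silently gain rows. pandas can enforce that assumption through the `validate` argument of `merge`. The toy frames below are made up for illustration and do not reflect the real files.

```python
import pandas as pd

left = pd.DataFrame({'athlete_id': [1, 2, 3], 'Used name': ['A', 'B', 'C']})
right_ok = pd.DataFrame({'athlete_id': [1, 2], 'height_cm': [180.0, 165.0]})
right_dup = pd.DataFrame({'athlete_id': [1, 1], 'height_cm': [180.0, 181.0]})

# many_to_one: keys must be unique on the right side, otherwise MergeError.
merged = left.merge(right_ok, on='athlete_id', how='left', validate='many_to_one')
print(len(merged) == len(left))   # row count is preserved by the left join

try:
    left.merge(right_dup, on='athlete_id', how='left', validate='many_to_one')
    dup_detected = False
except pd.errors.MergeError:
    dup_detected = True
print(dup_detected)               # duplicate right-side keys are rejected
```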

In [71]:
# ── Phase 2, Step 2: Career stats from keithgalli results ────────

results_raw = pd.read_csv(
    os.path.join(LOCAL_DIR, 'olympic-keithgalli', 'results.csv'),
    low_memory=False
)
usa_results = results_raw[
    results_raw['noc'].str.upper().str.strip() == 'USA'
].copy()

print(f"Results: {len(results_raw):,} total → {len(usa_results):,} USA")
print(f"Unique USA athletes in results: {usa_results['athlete_id'].nunique():,}")
print(f"Year range: {int(usa_results['year'].min())} – {int(usa_results['year'].max())}")

career = usa_results.groupby('athlete_id').agg(
    first_games_year=('year', 'min'),
    last_games_year=('year', 'max'),
    games_count=('year', 'nunique'),
    primary_sport=('discipline', most_common),
    gold_count=('medal', lambda x: (x == 'Gold').sum()),
    silver_count=('medal', lambda x: (x == 'Silver').sum()),
    bronze_count=('medal', lambda x: (x == 'Bronze').sum()),
    _season_types=('type', lambda x: ','.join(sorted(x.dropna().unique())))
).reset_index()

career['total_medals'] = (
    career['gold_count'] + career['silver_count'] + career['bronze_count']
)

print(f"\nCareer stats computed for {len(career):,} athletes")
print(f"  With medals: {(career['total_medals'] > 0).sum():,}")
print(f"  Multi-Games: {(career['games_count'] > 1).sum():,}")

Results: 308,408 total → 21,353 USA
Unique USA athletes in results: 10,058
Year range: 1896 – 2022

Career stats computed for 10,058 athletes
  With medals: 4,060
  Multi-Games: 2,630

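`games_count` is computed as the number of distinct years. Before 1994 the Summer and Winter Games fell in the same year, so an athlete who appeared at both in one year would be counted once. Counting distinct (year, season) pairs avoids that. The sketch uses made-up rows and assumes the `type` column distinguishes Summer from Winter, as the `_season_types` aggregation suggests.

```python
import pandas as pd

# Made-up rows: athlete 1 competes at both 1932 Games and again in 1936.
toy = pd.DataFrame({
    'athlete_id': [1, 1, 1, 2],
    'year':       [1932, 1932, 1936, 1996],
    'type':       ['Summer', 'Winter', 'Winter', 'Summer'],
})

# Current approach: distinct years per athlete.
by_year = toy.groupby('athlete_id')['year'].nunique()

# Alternative: deduplicate (athlete, year, season), then count rows per athlete.
by_games = (
    toy.drop_duplicates(['athlete_id', 'year', 'type'])
       .groupby('athlete_id')
       .size()
)

print(by_year.to_dict())
print(by_games.to_dict())
```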

In [72]:
# ── Phase 2, Step 3: Merge backbone + career stats ───────────────

olympic_athletes = olympic_backbone.merge(career, on='athlete_id', how='left')
olympic_athletes = olympic_athletes.drop(
    columns=['name', 'Used name', 'Roles'], errors='ignore'
)

# Standardize columns
olympic_athletes = olympic_athletes.rename(columns={
    'Sex': 'gender',
    'display_name': 'name',
    'born_date': 'birth_date',
    'born_city': 'birth_place',
    'born_country': 'birth_country'
})

# Drop born_region (not in final schema)
olympic_athletes = olympic_athletes.drop(columns=['born_region'], errors='ignore')
olympic_athletes['games_type'] = 'Olympic'


# Derive games_season from the _season_types field
def derive_season(row):
    st = row.get('_season_types', '')
    if pd.isna(st) or st == '':
        # Fall back to games lookup
        yr = row.get('first_games_year')
        if pd.notna(yr):
            return get_season(int(yr), 'Olympic')
        return None

    types = str(st).split(',')
    seasons = set()
    for t in types:
        t = t.strip()
        if 'Winter' in t:
            seasons.add('Winter')
        elif 'Summer' in t:
            seasons.add('Summer')

    if seasons == {'Summer', 'Winter'}:
        return 'Both'
    elif 'Winter' in seasons:
        return 'Winter'
    elif 'Summer' in seasons:
        return 'Summer'
    return None


olympic_athletes['games_season'] = olympic_athletes.apply(derive_season, axis=1)
olympic_athletes = olympic_athletes.drop(columns=['_season_types'], errors='ignore')

# Mechanical name cleanup
olympic_athletes['name'] = olympic_athletes['name'].apply(clean_name_mechanical)

print(f"Olympic athletes table: {len(olympic_athletes):,} rows")
print(f"  With career results:   {olympic_athletes['games_count'].notna().sum():,}")
print(f"  Bios only (no results): {olympic_athletes['games_count'].isna().sum():,}")
print(f"\nSeason distribution:")
print(olympic_athletes['games_season'].value_counts(dropna=False).to_string())

Olympic athletes table: 10,332 rows
  With career results:   10,058
  Bios only (no results): 274

Season distribution:
games_season
Summer    8119
Winter    1907
None       290
Both        16


In [73]:
# ── Phase 2, Step 4: Paris 2024 gap analysis ─────────────────────

paris = pd.read_csv(
    os.path.join(LOCAL_DIR, 'olympic-paris2024', 'athletes.csv'),
    low_memory=False
)
paris_usa = paris[paris['country_code'] == 'USA'].copy()

# Match using normalized names
paris_usa['_norm'] = paris_usa['name'].apply(norm_name)
olympic_athletes['_norm'] = olympic_athletes['name'].apply(norm_name)

matched = paris_usa['_norm'].isin(olympic_athletes['_norm'])

print(f"Paris 2024 USA athletes: {len(paris_usa):,}")
print(f"  Already in keithgalli (name match): {matched.sum():,}")
print(f"  Not matched: {(~matched).sum():,}")

# Pass 2: last name + birth year for unmatched
unmatched = paris_usa[~matched].copy()
unmatched['_last'] = unmatched['name'].apply(
    lambda n: str(n).split()[0].upper() if pd.notna(n) else ''
)
unmatched['_by'] = pd.to_datetime(
    unmatched['birth_date'], errors='coerce'
).dt.year

olympic_athletes['_last'] = olympic_athletes['name'].apply(
    lambda n: str(n).replace('•', ' ').split()[-1].upper() if pd.notna(n) else ''
)
olympic_athletes['_by'] = pd.to_datetime(
    olympic_athletes['birth_date'], errors='coerce'
).dt.year

pass2 = unmatched.merge(
    olympic_athletes[['_last', '_by']].drop_duplicates(),
    on=['_last', '_by'],
    how='inner'
)
print(f"  Pass 2 (last name + birth year): {len(pass2)} additional matches")

# Identify truly new athletes
pass2_norms = set(pass2['_norm']) if len(pass2) > 0 else set()
paris_new = paris_usa[
    (~paris_usa['_norm'].isin(set(olympic_athletes['_norm']))) &
    (~paris_usa['_norm'].isin(pass2_norms))
].copy()

print(f"  Truly new 2024 athletes: {len(paris_new):,}")

# Clean up temp columns
for df in [olympic_athletes, paris_usa, unmatched]:
    for col in ['_norm', '_last', '_by']:
        if col in df.columns:
            df.drop(columns=[col], inplace=True, errors='ignore')

Paris 2024 USA athletes: 619
  Already in keithgalli (name match): 214
  Not matched: 405
  Pass 2 (last name + birth year): 38 additional matches
  Truly new 2024 athletes: 367

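One caveat for the second pass: pandas matches null join keys to each other, so athletes with a missing birth year on both sides can pair up on last name alone. The toy example below (invented names) shows a guard that drops rows with a missing `_by` before merging. Applying it to the real data may change the Pass 2 count reported above.

```python
import numpy as np
import pandas as pd

unmatched = pd.DataFrame({
    'name': ['DOE Jane', 'ROE Sam'],
    '_last': ['DOE', 'ROE'],
    '_by': [1998.0, np.nan],      # second athlete has no birth date
})
known = pd.DataFrame({
    '_last': ['DOE', 'ROE'],
    '_by': [1998.0, np.nan],      # a different ROE, also without a birth date
})

# Guard: only rows with a real birth year take part in the join.
guarded = unmatched.dropna(subset=['_by']).merge(
    known.dropna(subset=['_by']).drop_duplicates(),
    on=['_last', '_by'],
    how='inner',
)
print(guarded['name'].tolist())
```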

In [74]:
# ── Phase 2, Step 5: Add new Paris 2024 athletes ─────────────────

paris_new_rows = []
for _, row in paris_new.iterrows():
    # Flip name order (Paris uses LAST First) and clean
    raw_name = flip_name_order(str(row.get('name', '')))
    cleaned_name = clean_name_mechanical(raw_name)

    paris_new_rows.append({
        'name': cleaned_name,
        'gender': row.get('gender'),
        'birth_date': row.get('birth_date'),
        'birth_place': row.get('birth_place'),
        'birth_country': row.get('birth_country'),
        'height_cm': (
            row.get('height')
            if pd.notna(row.get('height')) and row.get('height', 0) > 0
            else None
        ),
        'weight_kg': (
            row.get('weight')
            if pd.notna(row.get('weight')) and row.get('weight', 0) > 0
            else None
        ),
        'games_type': 'Olympic',
        'games_season': 'Summer',  # Paris 2024 is Summer
        'primary_sport': row.get('disciplines'),
        'first_games_year': 2024,
        'last_games_year': 2024,
        'games_count': 1,
        'gold_count': 0,    # Will be backfilled from results in Phase 5
        'silver_count': 0,
        'bronze_count': 0,
        'total_medals': 0,
    })

paris_new_df = pd.DataFrame(paris_new_rows)

# Stack onto Olympic backbone
olympic_athletes = pd.concat([olympic_athletes, paris_new_df], ignore_index=True)

print(f"Added {len(paris_new_df):,} new Paris 2024 athletes")
print(f"Olympic athletes total: {len(olympic_athletes):,}")
print(f"\n⚠️  Note: Medal counts for new Paris 2024 athletes are set to 0.")
print(f"   They will be backfilled from the results table in Phase 5.")

Added 367 new Paris 2024 athletes
Olympic athletes total: 10,699

⚠️  Note: Medal counts for new Paris 2024 athletes are set to 0.
   They will be backfilled from the results table in Phase 5.

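The Phase 2 QC sample shows `primary_sport` as `['Boxing']` for a Paris 2024 row, which suggests the `disciplines` column stores a stringified list. If a plain sport name is wanted, the value could be parsed before it is assigned. `first_discipline` is a hypothetical helper, and treating the first list element as the primary sport is an assumption.

```python
import ast


def first_discipline(value):
    """Return the first discipline from a value like "['Boxing']", else the value itself."""
    if value is None:
        return None
    text = str(value).strip()
    if text.startswith('[') and text.endswith(']'):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return text          # leave unparseable strings untouched
        if isinstance(parsed, (list, tuple)):
            return str(parsed[0]).strip() if parsed else None
    return text or None


print(first_discipline("['Boxing']"))   # list-like string is unwrapped
print(first_discipline("Athletics"))    # plain string passes through
```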

In [75]:
# ── PHASE 2 QC REPORT ─────────────────────────────────────────────

print('=' * 70)
print('PHASE 2 QC: OLYMPIC ATHLETES')
print('=' * 70)

print(f'\nüìä COUNTS')
print(f'  Total athletes:          {len(olympic_athletes):,}')
has_results = olympic_athletes['games_count'].notna().sum()
print(f'  With career results:     {has_results:,}')
print(f'  Bios only (no results):  {len(olympic_athletes) - has_results:,}')
print(f'  From Paris 2024 gap:     {len(paris_new_df):,}')

print(f'\nüìÖ TEMPORAL RANGE')
print(f'  Earliest: {olympic_athletes["first_games_year"].min()}')
print(f'  Latest:   {olympic_athletes["last_games_year"].max()}')

print(f'\nüìè PHYSICAL ATTRIBUTES')
h = (olympic_athletes['height_cm'].notna() & (olympic_athletes['height_cm'] > 0)).sum()
w = (olympic_athletes['weight_kg'].notna() & (olympic_athletes['weight_kg'] > 0)).sum()
print(f'  Height: {h:,} / {len(olympic_athletes):,} ({h/len(olympic_athletes)*100:.1f}%)')
print(f'  Weight: {w:,} / {len(olympic_athletes):,} ({w/len(olympic_athletes)*100:.1f}%)')

print(f'\nüèÖ MEDAL TOTALS')
for medal in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    val = olympic_athletes[medal].sum()
    print(f'  {medal:20s} {val:,.0f}')
print(f'  Athletes with 1+ medal: {(olympic_athletes["total_medals"] > 0).sum():,}')

print(f'\n⚧ GENDER')
print(olympic_athletes['gender'].value_counts(dropna=False).to_string())

print(f'\nüèüÔ∏è SEASON')
print(olympic_athletes['games_season'].value_counts(dropna=False).to_string())

print(f'\nüìã SAMPLE (5 random with results):')
sample_cols = ['name', 'gender', 'games_season', 'primary_sport',
               'games_count', 'total_medals', 'first_games_year', 'last_games_year']
sample_cols = [c for c in sample_cols if c in olympic_athletes.columns]
print(olympic_athletes.dropna(subset=['games_count'])[sample_cols].sample(
    5, random_state=42
).to_string())

print(f'\n⚠️  KNOWN ISSUES')
print(f'  Paris 2024 medal counts: Not yet backfilled (set to 0)')
no_sport = olympic_athletes['primary_sport'].isna().sum()
print(f'  Missing primary_sport:   {no_sport:,}')
no_gender = olympic_athletes['gender'].isna().sum()
print(f'  Missing gender:          {no_gender:,}')

print(f'\n{"=" * 70}')
print('Ready to proceed to Phase 3: Paralympic Athletes')
print('=' * 70)

PHASE 2 QC: OLYMPIC ATHLETES

📊 COUNTS
  Total athletes:          10,699
  With career results:     10,425
  Bios only (no results):  274
  From Paris 2024 gap:     367

📅 TEMPORAL RANGE
  Earliest: 1896.0
  Latest:   2024.0

📏 PHYSICAL ATTRIBUTES
  Height: 8,237 / 10,699 (77.0%)
  Weight: 7,660 / 10,699 (71.6%)

🏅 MEDAL TOTALS
  gold_count           2,717
  silver_count         1,806
  bronze_count         1,472
  total_medals         5,995
  Athletes with 1+ medal: 4,060

⚧ GENDER
gender
Male      7547
Female    3152

🏟️ SEASON
games_season
Summer    8486
Winter    1907
None       290
Both        16

📋 SAMPLE (5 random with results):
                      name  gender games_season         primary_sport  games_count  total_medals  first_games_year  last_games_year
10618       McCane Morelle  Female       Summer            ['Boxing']          1.0           0.0            2024.0           2024.0
6696           Brad Hauser    Male       Summer             Athletics 

---
## Phase 3: Paralympic Athletes

Build the Paralympic athlete backbone from three sources, extracting classification codes and standardizing names.

**Sources:**
- `paralympic-piterfm/2020_Tokyo/athletes.csv` → 199 USA athletes, has dedicated `sport_class` column
- `paralympic-piterfm/2024_Paris/athletes.csv` → 220 USA athletes, rich bio fields (hobbies, occupation, etc.), classification in event strings
- `paralympic-katiepress/medal_athlete.csv` → Historical medalists 1960–2018, abbreviated names ("LAST I" format), classification in event strings

**Key operations:**
1. Extract and normalize classification codes from all sources
2. Mechanical name cleanup (LAST First → First Last, dots, case)
3. Capture bio fields from Paris 2024 Paralympic data (these feed Gemini prompts in Phase 6, but don't become schema columns)
4. Infer gender from event strings for katiepress (no gender column)
5. Merge Tokyo + Paris overlapping athletes
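The "LAST First → First Last" step in the list above can be sketched as follows. This is a minimal illustration, not the notebook's own `flip_name_order` (which is defined in an earlier cell and not shown here). The helper name `flip_last_first` and the rule "a leading run of all-caps tokens is the family name" are assumptions, and the example names are made up.

```python
import re

def flip_last_first(raw):
    """Hypothetical helper: 'EXAMPLE Jane' -> 'Jane Example'."""
    if raw is None:
        return None
    tokens = str(raw).split()
    # Count the leading run of all-caps tokens (treated as the family name).
    n_family = 0
    for tok in tokens:
        letters = re.sub(r"[^A-Za-z]", "", tok)
        if len(letters) > 1 and letters.isupper():
            n_family += 1
        else:
            break
    # Nothing to flip if there is no caps prefix or the whole name is caps.
    if n_family == 0 or n_family == len(tokens):
        return " ".join(tokens)
    family = " ".join(t.title() for t in tokens[:n_family])
    given = " ".join(tokens[n_family:])
    return f"{given} {family}"

# Made-up names, for illustration only
examples = ["EXAMPLE Jane", "VAN SAMPLE Alex Lee", "Jane Example"]
flipped = [flip_last_first(n) for n in examples]
```

Under this rule, a name whose family name is not in capitals is returned unchanged, so it would need a separate rule or manual review.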

In [76]:
# ── Phase 3, Step 1: Tokyo 2020 Paralympic athletes ──────────────

tokyo_para = pd.read_csv(
    os.path.join(LOCAL_DIR, 'paralympic-piterfm', '2020_Tokyo', 'athletes.csv'),
    low_memory=False
)
# NOTE: the birth_country clause also matches athletes who were born in the
# United States but compete for another nation, so this filter can return
# more rows than a country_code == 'USA' filter alone.
tokyo_usa = tokyo_para[
    tokyo_para['birth_country'].astype(str).str.contains('United States', case=False, na=False) |
    tokyo_para['country_code'].astype(str).str.upper().eq('USA') |
    tokyo_para['country'].astype(str).str.contains('United States', case=False, na=False)
].copy()

print(f"Tokyo 2020 Paralympic: {len(tokyo_para):,} total → {len(tokyo_usa):,} USA")
print(f"  sport_class non-null: {tokyo_usa['sport_class'].notna().sum()} / {len(tokyo_usa)}")
print(f"  Disciplines: {sorted(tokyo_usa['discipline'].dropna().unique())}")

# Standardize column names to the unified schema
# (name, gender, and birth_date already match, so only two columns change)
tokyo_std = tokyo_usa.rename(columns={
    'discipline': 'primary_sport',
    'sport_class': 'classification_code',
}).copy()

# Flip name order and clean (piterfm uses LAST First)
tokyo_std['name'] = (
    tokyo_std['name']
    .apply(flip_name_order)
    .apply(clean_name_mechanical)
)

tokyo_std['games_type'] = 'Paralympic'
tokyo_std['games_season'] = 'Summer'
tokyo_std['first_games_year'] = 2020
tokyo_std['last_games_year'] = 2020
tokyo_std['games_count'] = 1

for col in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    tokyo_std[col] = 0

# Keep only unified schema columns
tokyo_keep = [
    'name', 'gender', 'birth_date', 'games_type', 'games_season',
    'primary_sport', 'classification_code', 'height_cm', 'weight_kg',
    'first_games_year', 'last_games_year', 'games_count',
    'gold_count', 'silver_count', 'bronze_count', 'total_medals'
]
tokyo_keep = [c for c in tokyo_keep if c in tokyo_std.columns]
tokyo_std = tokyo_std[tokyo_keep].copy()

print(f"\nTokyo standardized: {len(tokyo_std)} athletes")
print(f"  With classification: {tokyo_std['classification_code'].notna().sum()}")
print(f"  Sample:")
print(tokyo_std[['name', 'primary_sport', 'classification_code']].head(5).to_string())

Tokyo 2020 Paralympic: 4,527 total → 252 USA
  sport_class non-null: 252 / 252
  Disciplines: ['Archery', 'Athletics', 'Canoe Sprint', 'Cycling Road', 'Cycling Track', 'Equestrian', 'Goalball', 'Judo', 'Powerlifting', 'Rowing', 'Shooting', 'Sitting Volleyball', 'Swimming', 'Table Tennis', 'Taekwondo', 'Triathlon', 'Wheelchair Basketball', 'Wheelchair Fencing', 'Wheelchair Rugby', 'Wheelchair Tennis']

Tokyo standardized: 252 athletes
  With classification: 252
  Sample:
                     name     primary_sport classification_code
30   Abrahams David Henry          Swimming       S13,SB13,SM13
140         Jazmin Almlie          Shooting                 SH2
199          Charles Aoki  Wheelchair Rugby                 3.0
209      Danielle Aravich         Athletics                 T47
219        Ryohei Ariyasu            Rowing              PR3-B2


In [77]:
# ── Phase 3, Step 2: Paris 2024 Paralympic athletes ──────────────

paris_para = pd.read_csv(
    os.path.join(LOCAL_DIR, 'paralympic-piterfm', '2024_Paris', 'athletes.csv'),
    low_memory=False
)
paris_para_usa = paris_para[
    paris_para['country_code'].astype(str).str.upper().eq('USA')
].copy()

print(f"Paris 2024 Paralympic: {len(paris_para):,} total → {len(paris_para_usa):,} USA")

# Extract classification from events column
if 'events' in paris_para_usa.columns:
    paris_para_usa['classification_code'] = paris_para_usa['events'].apply(extract_classification)
elif 'sport_class' in paris_para_usa.columns:
    paris_para_usa['classification_code'] = paris_para_usa['sport_class']
else:
    paris_para_usa['classification_code'] = None

# Capture bio fields: stored for Phase 6 Gemini prompts, NOT schema columns
bio_fields = [
    'reason', 'hero', 'philosophy', 'other_sports',
    'coach', 'hobbies', 'occupation', 'education'
]
bio_available = [f for f in bio_fields if f in paris_para_usa.columns]
print(f"  Bio fields available: {bio_available}")

# Store bio data separately for Phase 6 Gemini prompts
paris_para_bios = {}
for _, row in paris_para_usa.iterrows():
    name = row.get('name', '')
    bio = {f: row[f] for f in bio_available if pd.notna(row.get(f))}
    if bio:
        paris_para_bios[norm_name(name)] = bio

print(f"  Athletes with bio data: {len(paris_para_bios)}")

# Standardize column names to the unified schema
# (name, gender, and birth_date already match)
paris_para_std = paris_para_usa.rename(columns={
    'discipline': 'primary_sport',
}).copy()

paris_para_std['name'] = (
    paris_para_std['name']
    .apply(flip_name_order)
    .apply(clean_name_mechanical)
)

paris_para_std['games_type'] = 'Paralympic'
paris_para_std['games_season'] = 'Summer'
paris_para_std['first_games_year'] = 2024
paris_para_std['last_games_year'] = 2024
paris_para_std['games_count'] = 1

for col in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    paris_para_std[col] = 0

para_keep = [
    'name', 'gender', 'birth_date', 'games_type', 'games_season',
    'primary_sport', 'classification_code',
    'first_games_year', 'last_games_year', 'games_count',
    'gold_count', 'silver_count', 'bronze_count', 'total_medals'
]
para_keep = [c for c in para_keep if c in paris_para_std.columns]
paris_para_std = paris_para_std[para_keep].copy()

# Height/weight typically not available for Paralympic athletes
paris_para_std['height_cm'] = None
paris_para_std['weight_kg'] = None

print(f"\nParis Paralympic standardized: {len(paris_para_std)} athletes")
print(f"  With classification: {paris_para_std['classification_code'].notna().sum()}")

Paris 2024 Paralympic: 4,459 total → 220 USA
  Bio fields available: ['reason', 'hero', 'philosophy', 'other_sports', 'coach', 'hobbies', 'occupation', 'education']
  Athletes with bio data: 218

Paris Paralympic standardized: 220 athletes
  With classification: 88
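Only 88 of the 220 Paris athletes end up with a classification code. One plausible reason (an assumption, since `extract_classification` is defined earlier in the notebook and not shown here) is that many event strings, such as team events, carry no code at all. The sketch below shows one way to pull codes out of event text with a regular expression. The function name `extract_codes`, the prefix list, and the sample strings are illustrative, and the prefix list is deliberately incomplete.

```python
import re

# Illustrative, incomplete set of class-code prefixes followed by 1-2 digits.
CLASS_RE = re.compile(r"\b(?:SB|SM|SH|PR|KL|VL|[TFSHCB])\d{1,2}\b")

def extract_codes(event_text):
    """Return a comma-joined, de-duplicated list of codes, or None."""
    if not isinstance(event_text, str):
        return None  # NaN / None inputs carry no code
    codes = sorted(set(CLASS_RE.findall(event_text)))
    return ",".join(codes) if codes else None

# Made-up event strings, for illustration only
samples = [
    "Men's 100m T54",
    "Women's 200m Individual Medley SM13",
    "['Men 100m T54', 'Men 400m T54']",
    "Mixed Team",
]
extracted = [extract_codes(s) for s in samples]
```

Reporting how many athletes have an event string with no match at all, per sport, would show whether the low coverage comes from the data or from the pattern.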


In [78]:
# ── Phase 3, Step 3: Merge Tokyo + Paris Paralympic ──────────────
# Some athletes competed in both, so merge on normalized name

tokyo_std['_norm'] = tokyo_std['name'].apply(norm_name)
paris_para_std['_norm'] = paris_para_std['name'].apply(norm_name)

overlap = paris_para_std['_norm'].isin(tokyo_std['_norm'])
print(f"Tokyo-Paris overlap: {overlap.sum()} athletes competed in both")

# For overlapping athletes, update Tokyo records with Paris 2024 data
for _, paris_row in paris_para_std[overlap].iterrows():
    mask = tokyo_std['_norm'] == paris_row['_norm']
    if mask.any():
        idx = tokyo_std[mask].index[0]
        tokyo_std.loc[idx, 'last_games_year'] = 2024
        tokyo_std.loc[idx, 'games_count'] = 2
        # Prefer Paris classification if Tokyo is missing
        if pd.isna(tokyo_std.loc[idx, 'classification_code']):
            tokyo_std.loc[idx, 'classification_code'] = paris_row['classification_code']

# Add Paris-only athletes
paris_only = paris_para_std[~overlap].copy()
para_recent = pd.concat([tokyo_std, paris_only], ignore_index=True)
para_recent = para_recent.drop(columns=['_norm'], errors='ignore')

print(f"\nRecent Paralympic athletes: {len(para_recent)}")
print(f"  From Tokyo only:    {(para_recent['last_games_year'] == 2020).sum()}")
print(f"  Both Tokyo+Paris:   {(para_recent['games_count'] == 2).sum()}")
print(f"  Paris only:         {len(paris_only)}")

Tokyo-Paris overlap: 108 athletes competed in both

Recent Paralympic athletes: 364
  From Tokyo only:    144
  Both Tokyo+Paris:   108
  Paris only:         112
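The reported counts are internally consistent: 252 Tokyo rows + 220 Paris rows − 108 overlapping names = 364. A vectorized version of the same overlap update, with that identity checked as an assertion, might look like the sketch below. The toy frames and names are made up. Unlike the loop above, which updates only the first matching Tokyo row, this version updates every Tokyo row that shares a normalized name.

```python
import pandas as pd

# Toy stand-ins for tokyo_std / paris_para_std (made-up names)
tokyo = pd.DataFrame({
    "_norm": ["a example", "b example", "c example"],
    "last_games_year": [2020, 2020, 2020],
    "games_count": [1, 1, 1],
})
paris = pd.DataFrame({
    "_norm": ["b example", "d example"],
    "last_games_year": [2024, 2024],
    "games_count": [1, 1],
})

# Tokyo rows whose normalized name also appears in Paris
in_both = tokyo["_norm"].isin(paris["_norm"])
tokyo.loc[in_both, "last_games_year"] = 2024
tokyo.loc[in_both, "games_count"] = 2

# Paris rows with no Tokyo counterpart are appended as new athletes
paris_only = paris[~paris["_norm"].isin(tokyo["_norm"])]
combined = pd.concat([tokyo, paris_only], ignore_index=True)

# Sanity check: total = Tokyo + Paris - overlap (holds when names are unique per frame)
n_overlap = int(in_both.sum())
assert len(combined) == len(tokyo) + len(paris) - n_overlap
```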


In [79]:
# ── Phase 3, Step 4: Katiepress historical Paralympic athletes ───

katie_raw = pd.read_csv(
    os.path.join(LOCAL_DIR, 'paralympic-katiepress', 'medal_athlete.csv'),
    low_memory=False
)
katie_raw = katie_raw[katie_raw['npc'].astype(str).str.upper().eq('USA')].copy()
print(f"Katiepress raw: {len(katie_raw):,} medal records")

# Extract classification from event column
if 'event' in katie_raw.columns:
    katie_raw['_class'] = katie_raw['event'].apply(extract_classification)

# Infer gender from event names.
# Order matters: "women" contains "men" and "female" contains "male",
# so the female checks must run first.
def infer_gender(event_str):
    if pd.isna(event_str):
        return None
    e = str(event_str).lower()
    if "women" in e or "female" in e:
        return "Female"
    if "men's" in e or "men " in e or "male" in e:
        return "Male"
    # Mixed or unlabeled events carry no gender signal
    return None

if 'event' in katie_raw.columns:
    katie_raw['_gender'] = katie_raw['event'].apply(infer_gender)

# Determine column names dynamically
name_col = next(
    (c for c in ['athlete', 'athlete_name', 'name'] if c in katie_raw.columns),
    katie_raw.columns[0]
)
year_col = next(
    (c for c in ['year', 'games_year', 'Year'] if c in katie_raw.columns),
    'year'
)
sport_col = next(
    (c for c in ['sport', 'discipline', 'Sport'] if c in katie_raw.columns),
    'sport'
)
medal_col = next(
    (c for c in ['medal', 'Medal'] if c in katie_raw.columns),
    'medal'
)

print(f"  Name col: '{name_col}', Year: '{year_col}', Sport: '{sport_col}', Medal: '{medal_col}'")
print(f"  Year range: {katie_raw[year_col].min()} – {katie_raw[year_col].max()}")

Katiepress raw: 3,105 medal records
  Name col: 'athlete_name', Year: 'games_year', Sport: 'sport', Medal: 'medal'
  Year range: 1960 – 2018


In [80]:
# ── Phase 3, Step 5: Aggregate katiepress to per-athlete rows ────

katie_athletes = katie_raw.groupby(name_col).agg(
    first_games_year=(year_col, 'min'),
    last_games_year=(year_col, 'max'),
    games_count=(year_col, 'nunique'),
    primary_sport=(sport_col, most_common),
    classification_code=(
        '_class',
        lambda x: x.dropna().mode().iloc[0] if len(x.dropna().mode()) > 0 else None
    ),
    gold_count=(
        medal_col,
        lambda x: (x.astype(str).str.lower() == 'gold').sum()
    ),
    silver_count=(
        medal_col,
        lambda x: (x.astype(str).str.lower() == 'silver').sum()
    ),
    bronze_count=(
        medal_col,
        lambda x: (x.astype(str).str.lower() == 'bronze').sum()
    ),
    gender=(
        '_gender',
        lambda x: x.dropna().mode().iloc[0] if len(x.dropna().mode()) > 0 else None
    ),
).reset_index()

katie_athletes = katie_athletes.rename(columns={name_col: 'name'})
katie_athletes['total_medals'] = (
    katie_athletes['gold_count'] +
    katie_athletes['silver_count'] +
    katie_athletes['bronze_count']
)
katie_athletes['games_type'] = 'Paralympic'
katie_athletes['birth_date'] = None
katie_athletes['height_cm'] = None
katie_athletes['weight_kg'] = None


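# Derive games_season from the lookup table.
# NOTE: this walks every calendar year between the first and last appearance,
# not only the years the athlete actually competed, so a Summer athlete whose
# career spans a Winter Games year is labeled 'Both'. Phase 4 corrects this.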
def katie_season(row):
    seasons = set()
    for yr in range(int(row['first_games_year']), int(row['last_games_year']) + 1):
        s = get_season(yr, 'Paralympic')
        if s:
            seasons.add(s)
    if seasons == {'Summer', 'Winter'}:
        return 'Both'
    elif 'Winter' in seasons:
        return 'Winter'
    elif 'Summer' in seasons:
        return 'Summer'
    return 'Summer'  # Default for historical Paralympic


katie_athletes['games_season'] = katie_athletes.apply(katie_season, axis=1)

# Name cleanup
def clean_katie_name(n):
    if pd.isna(n):
        return None
    s = str(n).strip()
    s = re.sub(r'^[^A-Za-z]+', '', s)  # Strip leading punctuation
    s = s.strip()
    if len(s) < 3:
        return None
    return s

katie_athletes['name'] = katie_athletes['name'].apply(clean_katie_name)
katie_athletes = katie_athletes[katie_athletes['name'].notna()].copy()

# Flag abbreviated names
abbrev_count = katie_athletes['name'].apply(is_abbreviated).sum()

# Apply mechanical cleanup
katie_athletes['name'] = katie_athletes['name'].apply(clean_name_mechanical)

print(f"Katiepress aggregated: {len(katie_athletes):,} unique athletes")
print(f"  With classification: {katie_athletes['classification_code'].notna().sum():,}")
print(f"  With gender:         {katie_athletes['gender'].notna().sum():,}")
print(f"  Abbreviated names:   {abbrev_count:,} (will be resolved by Gemini in Phase 6)")
print(f"  Year range: {katie_athletes['first_games_year'].min()} – {katie_athletes['last_games_year'].max()}")
print(f"  Season distribution:\n{katie_athletes['games_season'].value_counts().to_string()}")

Katiepress aggregated: 1,165 unique athletes
  With classification: 411
  With gender:         1,033
  Abbreviated names:   1,161 (will be resolved by Gemini in Phase 6)
  Year range: 1960 – 2018
  Season distribution:
games_season
Summer    869
Both      243
Winter     53
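The 243 "Both" athletes here are later flagged as mostly misclassified. `katie_season` iterates over every year in the athlete's career range, so a long Summer career picks up the Winter Games years in between. One alternative, sketched below with made-up records, is to label each medal record by its sport and then combine the labels per athlete. The `WINTER` set is an illustrative subset and `season_from_sports` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Illustrative subset of Winter Paralympic sports
WINTER = {"Alpine Skiing", "Biathlon", "Cross-Country Skiing", "Wheelchair Curling"}

def season_from_sports(sports):
    """Collapse one athlete's sports into Summer / Winter / Both / None."""
    seasons = {"Winter" if s in WINTER else "Summer" for s in sports.dropna()}
    if seasons == {"Summer", "Winter"}:
        return "Both"
    return next(iter(seasons), None)

# Made-up medal records, one row per medal
records = pd.DataFrame({
    "athlete_name": ["DOE A", "DOE A", "ROE B", "POE C", "POE C"],
    "games_year": [1988, 1996, 1994, 1992, 1992],
    "sport": ["Athletics", "Athletics", "Alpine Skiing", "Athletics", "Biathlon"],
})

seasons = records.groupby("athlete_name")["sport"].agg(season_from_sports)
```

With this rule, only an athlete with medals in both a Summer and a Winter sport is labeled "Both". A Summer athlete whose career merely spans Winter Games years stays "Summer".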


In [81]:
# ── Phase 3, Step 6: Remove katiepress athletes already in piterfm ──

# Avoid duplicates: katiepress covers 1960-2018, piterfm covers 2020-2024
# Some athletes span both periods

para_recent['_norm'] = para_recent['name'].apply(norm_name)
katie_athletes['_norm'] = katie_athletes['name'].apply(norm_name)

katie_overlap = katie_athletes['_norm'].isin(para_recent['_norm'])
print(f"Katiepress athletes already in piterfm: {katie_overlap.sum()}")

# For overlapping, merge career stats (extend year range, sum medals)
for _, katie_row in katie_athletes[katie_overlap].iterrows():
    mask = para_recent['_norm'] == katie_row['_norm']
    if mask.any():
        idx = para_recent[mask].index[0]

        # Extend year range
        para_recent.loc[idx, 'first_games_year'] = min(
            para_recent.loc[idx, 'first_games_year'],
            katie_row['first_games_year']
        )

        # Add games count
        para_recent.loc[idx, 'games_count'] = (
            para_recent.loc[idx, 'games_count'] + katie_row['games_count']
        )

        # Add medals
        for medal in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
            para_recent.loc[idx, medal] = (
                para_recent.loc[idx, medal] + katie_row[medal]
            )

        # Prefer classification from piterfm if available
        if pd.isna(para_recent.loc[idx, 'classification_code']):
            para_recent.loc[idx, 'classification_code'] = katie_row['classification_code']

        # Update gender if missing
        if pd.isna(para_recent.loc[idx, 'gender']):
            para_recent.loc[idx, 'gender'] = katie_row['gender']

# Keep katiepress-only athletes
katie_only = katie_athletes[~katie_overlap].drop(columns=['_norm']).copy()
para_recent = para_recent.drop(columns=['_norm'], errors='ignore')

# Stack all Paralympic athletes
paralympic_athletes = pd.concat([para_recent, katie_only], ignore_index=True)

print(f"\nParalympic athletes total: {len(paralympic_athletes):,}")
print(f"  From piterfm (recent): {len(para_recent):,}")
print(f"  From katiepress only:  {len(katie_only):,}")

Katiepress athletes already in piterfm: 0

Paralympic athletes total: 1,529
  From piterfm (recent): 364
  From katiepress only:  1,165
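An overlap of 0 is what exact matching on normalized names would produce when one source stores abbreviated "LAST I" names and the other stores full names. This means athletes who medaled before 2020 and also competed in Tokyo or Paris may currently appear twice. A possible way to surface candidates for review is to compare on a "family name + first initial" key, as sketched below. The helper `initial_key` and all names are hypothetical. The key over-matches (different people can share it), so the result is a review list, not a merge rule.

```python
import pandas as pd

def initial_key(name, family_first):
    """Hypothetical key: lowercase family name plus first initial."""
    tokens = str(name).lower().replace(".", "").split()
    if len(tokens) < 2:
        return None
    if family_first:   # abbreviated style, e.g. "Example J"
        family, given = tokens[0], tokens[-1]
    else:              # full-name style, e.g. "Jane Example"
        family, given = tokens[-1], tokens[0]
    return f"{family} {given[0]}"

# Made-up names standing in for para_recent / katie_athletes
recent = pd.DataFrame({"name": ["Jane Example", "Alex Sample"]})
historic = pd.DataFrame({"name": ["Example J", "Placeholder Q"]})

recent["_key"] = [initial_key(n, family_first=False) for n in recent["name"]]
historic["_key"] = [initial_key(n, family_first=True) for n in historic["name"]]

# Inner join keeps only the pairs that share a key
candidates = historic.merge(recent, on="_key", suffixes=("_historic", "_recent"))
```

Filtering the candidates further by matching `primary_sport` and a plausible gap between `last_games_year` and 2020 would cut down false positives before any manual check.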


In [82]:
# ── PHASE 3 QC REPORT ─────────────────────────────────────────────

print('=' * 70)
print('PHASE 3 QC: PARALYMPIC ATHLETES')
print('=' * 70)

print(f'\n📊 COUNTS')
print(f'  Total athletes:       {len(paralympic_athletes):,}')
print(f'  From piterfm:         {len(para_recent):,} (Tokyo 2020 + Paris 2024)')
print(f'  From katiepress:      {len(katie_only):,} (historical 1960-2018)')

print(f'\n📅 TEMPORAL RANGE')
print(f'  Earliest: {paralympic_athletes["first_games_year"].min()}')
print(f'  Latest:   {paralympic_athletes["last_games_year"].max()}')

print(f'\n🏷️ CLASSIFICATION CODES')
has_class = paralympic_athletes['classification_code'].notna().sum()
print(f'  Coverage: {has_class:,} / {len(paralympic_athletes):,} ({has_class/len(paralympic_athletes)*100:.1f}%)')
print(f'  Unique codes: {paralympic_athletes["classification_code"].nunique()}')

print(f'\n🏅 MEDAL TOTALS')
for medal in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    val = paralympic_athletes[medal].sum()
    print(f'  {medal:20s} {val:,.0f}')
print(f'  Athletes with 1+ medal: {(paralympic_athletes["total_medals"] > 0).sum():,}')

print(f'\n⚧ GENDER')
print(paralympic_athletes['gender'].value_counts(dropna=False).to_string())

print(f'\n🏟️ SEASON')
print(paralympic_athletes['games_season'].value_counts(dropna=False).to_string())

print(f'\n📝 NAME QUALITY')
abbrev = paralympic_athletes['name'].apply(is_abbreviated).sum()
print(f'  Abbreviated names: {abbrev:,} (to be resolved by Gemini in Phase 6)')

print(f'\n🗃️ BIO DATA (for Gemini prompts)')
print(f'  Paris 2024 athletes with bio fields: {len(paris_para_bios)}')
print(f'  Bio fields captured: {bio_available}')

print(f'\n📋 SAMPLE (5 random):')
sample_cols = [
    'name', 'gender', 'games_season', 'primary_sport', 'classification_code',
    'games_count', 'total_medals', 'first_games_year', 'last_games_year'
]
sample_cols = [c for c in sample_cols if c in paralympic_athletes.columns]
print(paralympic_athletes[sample_cols].sample(
    min(5, len(paralympic_athletes)), random_state=42
).to_string())

print(f'\n⚠️  KNOWN ISSUES')
print(f'  Abbreviated names: {abbrev:,} (katiepress "LAST I" format)')
print(f'  Missing gender:    {paralympic_athletes["gender"].isna().sum():,}')
print(f'  Missing class:     {paralympic_athletes["classification_code"].isna().sum():,}')
print(f'  Medal counts for piterfm athletes not yet backfilled from results')

print(f'\n{"=" * 70}')
print('Ready to proceed to Phase 4: Unification')
print('=' * 70)

PHASE 3 QC: PARALYMPIC ATHLETES

📊 COUNTS
  Total athletes:       1,529
  From piterfm:         364 (Tokyo 2020 + Paris 2024)
  From katiepress:      1,165 (historical 1960-2018)

📅 TEMPORAL RANGE
  Earliest: 1960
  Latest:   2024

🏷️ CLASSIFICATION CODES
  Coverage: 697 / 1,529 (45.6%)
  Unique codes: 154

🏅 MEDAL TOTALS
  gold_count           1,154
  silver_count         959
  bronze_count         976
  total_medals         3,089
  Athletes with 1+ medal: 1,165

⚧ GENDER
gender
Male      768
Female    629
None      132

🏟️ SEASON
games_season
Summer    1233
Both       243
Winter      53

📝 NAME QUALITY
  Abbreviated names: 1,162 (to be resolved by Gemini in Phase 6)

🗃️ BIO DATA (for Gemini prompts)
  Paris 2024 athletes with bio fields: 218
  Bio fields captured: ['reason', 'hero', 'philosophy', 'other_sports', 'coach', 'hobbies', 'occupation', 'education']

📋 SAMPLE (5 random):
                 name  gender games_season          primary_sport classif

---
## Phase 4: Unification

Merge Olympic and Paralympic athlete DataFrames into a single `all_athletes` table.

**Steps:**
1. Concatenate Olympic + Paralympic, inspect combined shape
2. Spot-check the katiepress "Both" seasons derivation
3. Normalize sport names (strip list brackets, parentheticals)
4. Fill missing `games_season` via sport→season mapping
5. Detect and handle cross-type athletes (keep as separate rows)
6. Deduplicate within each games_type
7. Generate deterministic UUID5 `athlete_id`
8. Final column selection and ordering
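Step 5 in the list above (cross-type athletes) can be sketched as a flag on names that occur under both `games_type` values. The frame below is made up, and `_cross_type` is a hypothetical working column. A shared normalized name does not prove the rows are the same person, which fits the stated choice to keep them as separate rows.

```python
import pandas as pd

# Made-up stand-in for all_athletes with a normalized-name column
athletes = pd.DataFrame({
    "_norm": ["jane example", "jane example", "sam sample"],
    "games_type": ["Olympic", "Paralympic", "Olympic"],
})

# Names that appear under more than one games_type
types_per_name = athletes.groupby("_norm")["games_type"].nunique()
cross_type_names = types_per_name[types_per_name > 1].index

# Flag the rows but keep them separate; nothing is merged or dropped
athletes["_cross_type"] = athletes["_norm"].isin(cross_type_names)
```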

In [83]:
# ── Phase 4, Step 1: Concatenate & Initial Inspection ────────────────

# Tag source before merging (helpful for debugging, will drop later)
olympic_athletes['_source'] = 'olympic'
paralympic_athletes['_source'] = 'paralympic'

all_athletes = pd.concat([olympic_athletes, paralympic_athletes], ignore_index=True)

print(f'Combined shape: {all_athletes.shape}')
print(f'\nBy source:')
print(all_athletes['_source'].value_counts())
print(f'\nBy games_type:')
print(all_athletes['games_type'].value_counts())
print(f'\nBy games_season:')
print(all_athletes['games_season'].value_counts(dropna=False))
print(f'\nColumns: {list(all_athletes.columns)}')

# ── Spot-check: katiepress "Both" seasons ────────────────────────────
# These athletes should have games spanning both Summer and Winter years
both_para = all_athletes[
    (all_athletes['games_season'] == 'Both') &
    (all_athletes['_source'] == 'paralympic')
]
print(f'\n{"=" * 60}')
print(f'SPOT CHECK: Paralympic "Both" season athletes: {len(both_para)}')

if len(both_para) > 0:
    sample = both_para.sample(min(10, len(both_para)), random_state=42)
    print(sample[['name', 'primary_sport', 'first_games_year', 'last_games_year',
                   'games_count', 'games_season', 'total_medals']].to_string(index=False))

    # Check plausibility: Summer-only sports flagged "Both" = misclassified
    WINTER_SPORTS = {
        'Alpine Skiing', 'Biathlon', 'Cross-Country Skiing', 'Ice Sledge Hockey',
        'Para Ice Hockey', 'Wheelchair Curling', 'Nordic Skiing', 'Snowboard',
        'Ice Sledge Speed Racing'
    }
    winter_both = both_para[both_para['primary_sport'].isin(WINTER_SPORTS)]
    summer_both = both_para[~both_para['primary_sport'].isin(WINTER_SPORTS)]
    print(f'\n  Winter-sport athletes marked "Both": {len(winter_both)}')
    print(f'  Summer-sport athletes marked "Both": {len(summer_both)}')
    print(f'  ⚠️  Most "Both" are likely misclassified; will fix in next step')

Combined shape: (12228, 20)

By source:
_source
olympic       10699
paralympic     1529
Name: count, dtype: int64

By games_type:
games_type
Olympic       10699
Paralympic     1529
Name: count, dtype: int64

By games_season:
games_season
Summer    9719
Winter    1960
None       290
Both       259
Name: count, dtype: int64

Columns: ['athlete_id', 'gender', 'birth_date', 'birth_place', 'birth_country', 'height_cm', 'weight_kg', 'name', 'first_games_year', 'last_games_year', 'games_count', 'primary_sport', 'gold_count', 'silver_count', 'bronze_count', 'total_medals', 'games_type', 'games_season', '_source', 'classification_code']

SPOT CHECK: Paralympic "Both" season athletes: 243
       name     primary_sport  first_games_year  last_games_year  games_count games_season  total_medals
    Brown D         Athletics            1976.0           2016.0          4.0         Both           5.0
   Asbury A         Athletics            1988.0           1996.0          3.0         Both           4

In [84]:
# ── Phase 4, Step 2: Fix "Both" season misclassification ────────────

# Reclassify using sport → season mapping
# These are unambiguously Winter Paralympic sports
WINTER_SPORTS = {
    'Alpine Skiing', 'Biathlon', 'Cross-Country Skiing', 'Ice Sledge Hockey',
    'Para Ice Hockey', 'Wheelchair Curling', 'Nordic Skiing', 'Snowboard',
    'Ice Sledge Speed Racing'
}

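# NOTE: this mask is not limited to Paralympic rows, so Olympic athletes
# marked "Both" are also collapsed to a single season by primary_sport.
both_mask = all_athletes['games_season'] == 'Both'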
both_athletes = all_athletes[both_mask].copy()

reclassified = 0
for idx, row in both_athletes.iterrows():
    sport = str(row['primary_sport']).strip()
    if sport in WINTER_SPORTS:
        all_athletes.at[idx, 'games_season'] = 'Winter'
        reclassified += 1
    elif sport != 'nan' and sport != 'None':
        # Known sport and NOT winter → Summer
        all_athletes.at[idx, 'games_season'] = 'Summer'
        reclassified += 1

still_both = (all_athletes['games_season'] == 'Both').sum()
print(f'Reclassified {reclassified} / {len(both_athletes)} "Both" athletes')
print(f'Remaining "Both": {still_both}')
print(f'\nUpdated season distribution:')
print(all_athletes['games_season'].value_counts(dropna=False))

Reclassified 259 / 259 "Both" athletes
Remaining "Both": 0

Updated season distribution:
games_season
Summer    9922
Winter    2016
None       290
Name: count, dtype: int64
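The cell above reclassified 259 rows, which is the 243 Paralympic "Both" athletes plus the 16 Olympic ones. If the Olympic "Both" labels are meant to be kept (an assumption; the notebook does not say), the update could be limited to Paralympic rows and written without a Python loop. The sketch below uses a made-up frame and an illustrative subset of `WINTER_SPORTS`.

```python
import numpy as np
import pandas as pd

WINTER_SPORTS = {"Alpine Skiing", "Biathlon"}  # illustrative subset

# Made-up rows standing in for all_athletes
df = pd.DataFrame({
    "_source": ["paralympic", "paralympic", "olympic", "paralympic"],
    "primary_sport": ["Athletics", "Alpine Skiing", "Bobsleigh", None],
    "games_season": ["Both", "Both", "Both", "Both"],
})

# Only Paralympic "Both" rows with a known sport are reclassified
target = (
    (df["games_season"] == "Both")
    & (df["_source"] == "paralympic")
    & df["primary_sport"].notna()
)
df.loc[target, "games_season"] = np.where(
    df.loc[target, "primary_sport"].isin(WINTER_SPORTS), "Winter", "Summer"
)
```

Rows with an unknown sport and Olympic rows keep their original label here, so they remain visible for a later check.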


In [85]:
# ── Phase 4, Step 3: Normalize sports & fill missing seasons ─────────

# Fix list-bracket sports from Paris 2024 gap athletes: "['Boxing']" → "Boxing"
def clean_sport(s):
    if pd.isna(s):
        return s
    s = str(s).strip()
    # Strip list brackets and quotes
    if s.startswith('[') and s.endswith(']'):
        s = s.strip('[]').strip("'\"")
    # Strip parenthetical suffixes: "Ski Jumping (Skiing)" → "Ski Jumping"
    s = re.sub(r'\s*\(.*?\)\s*$', '', s)
    return s.strip()

all_athletes['primary_sport'] = all_athletes['primary_sport'].apply(clean_sport)

# Verify bracket fix
bracket_remaining = all_athletes['primary_sport'].str.contains(r'[\[\]]', na=False).sum()
print(f'Sports with remaining brackets: {bracket_remaining}')

# Map obvious sports → season for the 290 None records
SPORT_SEASON = {
    # Winter
    'Alpine Skiing': 'Winter', 'Biathlon': 'Winter', 'Bobsled': 'Winter',
    'Bobsleigh': 'Winter', 'Cross-Country Skiing': 'Winter', 'Curling': 'Winter',
    'Figure Skating': 'Winter', 'Freestyle Skiing': 'Winter',
    'Ice Hockey': 'Winter', 'Luge': 'Winter', 'Nordic Combined': 'Winter',
    'Short Track Speed Skating': 'Winter', 'Skeleton': 'Winter',
    'Ski Jumping': 'Winter', 'Snowboard': 'Winter', 'Speed Skating': 'Winter',
}

none_mask = all_athletes['games_season'].isna() | (all_athletes['games_season'] == 'None')
none_before = none_mask.sum()

def assign_season(row):
    if row['primary_sport'] in SPORT_SEASON:
        return SPORT_SEASON[row['primary_sport']]
    elif pd.notna(row['primary_sport']):
        return 'Summer'  # Default: vast majority of sports are Summer
    return None

fixed_seasons = all_athletes[none_mask].apply(assign_season, axis=1)
all_athletes.loc[none_mask, 'games_season'] = fixed_seasons

none_after = (all_athletes['games_season'].isna() | (all_athletes['games_season'] == 'None')).sum()
print(f'None seasons: {none_before} → {none_after}')
print(f'\nFinal season distribution:')
print(all_athletes['games_season'].value_counts(dropna=False))

# Quick look at sport cleanup
print(f'\nUnique sports: {all_athletes["primary_sport"].nunique()}')
print(f'\nTop 15 sports:')
print(all_athletes['primary_sport'].value_counts().head(15))

Sports with remaining brackets: 0
None seasons: 290 → 274

Final season distribution:
games_season
Summer    9938
Winter    2016
None       274
Name: count, dtype: int64

Unique sports: 93

Top 15 sports:
primary_sport
Athletics              2542
Swimming                986
Rowing                  737
Ice Hockey              434
Wrestling               348
Artistic Gymnastics     333
Shooting                319
Basketball              295
Fencing                 287
Football                285
Alpine Skiing           280
Sailing                 266
Boxing                  260
Figure Skating          223
Volleyball              219
Name: count, dtype: int64
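`clean_sport` strips the outer brackets and quotes, which works for single-item strings such as `"['Boxing']"`. A string holding several sports would keep stray quotes in the middle. A possible variant, sketched below, parses the list with `ast.literal_eval` and keeps the first entry. The name `first_sport` and the choice of the first entry as the primary sport are assumptions. The multi-sport and unquoted inputs in the examples are hypothetical.

```python
import ast
import re

def first_sport(raw):
    """Hypothetical variant of clean_sport that parses list-like strings."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            items = ast.literal_eval(text)   # "['A', 'B']" -> ['A', 'B']
        except (ValueError, SyntaxError):
            # Not a valid Python literal: fall back to simple stripping
            items = [text.strip("[]").strip("'\"")]
        text = str(items[0]).strip() if items else ""
    # Strip a trailing parenthetical such as " (Skiing)"
    text = re.sub(r"\s*\(.*?\)\s*$", "", text).strip()
    return text or None

cleaned = [
    first_sport("['Boxing']"),
    first_sport("['Cycling Road', 'Cycling Track']"),
    first_sport("Ski Jumping (Skiing)"),
    first_sport("[Boxing]"),
    first_sport("[]"),
]
```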


In [86]:
# ── Phase 4, Step 4: Deduplicate & Generate UUID5 athlete_id ─────────
import uuid

all_athletes['_norm'] = all_athletes['name'].apply(norm_name)

# Find potential duplicates
dupes = all_athletes.groupby(['_norm', 'games_type']).size()
dupe_groups = dupes[dupes > 1]
print(f'Potential duplicate groups: {len(dupe_groups)}')

# Only merge when SAME sport or overlapping year ranges (likely same person)
# Different sport + different era = different person
before_dedup = len(all_athletes)
drop_indices = []

for (norm, gtype), count in dupe_groups.items():
    rows = all_athletes[(all_athletes['_norm'] == norm) & (all_athletes['games_type'] == gtype)]
    idxs = rows.index.tolist()

    # Group by sport: same sport = likely same person
    sport_groups = rows.groupby('primary_sport')
    if len(sport_groups) == 1:
        # Same sport: keep row with most data
        keeper = rows.sort_values(['games_count', 'total_medals'], ascending=False).index[0]
        drop_indices.extend([i for i in idxs if i != keeper])
    else:
        # Different sports: check year overlap
        sorted_rows = rows.sort_values('first_games_year')
        for i in range(len(sorted_rows) - 1):
            r1 = sorted_rows.iloc[i]
            r2 = sorted_rows.iloc[i + 1]
            # If year ranges overlap, likely same person switching sports
            if pd.notna(r1['last_games_year']) and pd.notna(r2['first_games_year']):
                if r1['last_games_year'] >= r2['first_games_year']:
                    # Overlapping: keep the one with more games
                    pair = sorted_rows.iloc[i:i+2]
                    loser = pair.sort_values(['games_count', 'total_medals']).index[0]
                    if loser not in drop_indices:
                        drop_indices.append(loser)
            # Non-overlapping + different sport = different people, keep both

all_athletes = all_athletes.drop(index=drop_indices).reset_index(drop=True)
after_dedup = len(all_athletes)
print(f'Dedup: {before_dedup} → {after_dedup} (removed {before_dedup - after_dedup})')

# Remaining same-name groups (different people we're keeping)
remaining_dupes = all_athletes.groupby(['_norm', 'games_type']).size()
remaining_multi = remaining_dupes[remaining_dupes > 1]
print(f'Retained distinct athletes sharing a name: {len(remaining_multi)} groups')

# UUID5: norm_name + games_type + first_games_year + primary_sport
NAMESPACE = uuid.UUID('a1b2c3d4-e5f6-7890-abcd-ef1234567890')

def make_athlete_id(row):
    parts = [row['_norm'], str(row['games_type'])]
    if pd.notna(row['first_games_year']):
        parts.append(str(int(row['first_games_year'])))
    if pd.notna(row['primary_sport']):
        parts.append(str(row['primary_sport']))
    seed = '|'.join(parts)
    return str(uuid.uuid5(NAMESPACE, seed))

all_athletes['athlete_id'] = all_athletes.apply(make_athlete_id, axis=1)

id_dupes = all_athletes['athlete_id'].duplicated().sum()
print(f'UUID5 collisions: {id_dupes}')
print(f'Final athlete count: {len(all_athletes)}')

Potential duplicate groups: 77
Dedup: 12228 → 12207 (removed 21)
Retained distinct athletes sharing a name: 57 groups
UUID5 collisions: 0
Final athlete count: 12207
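
The `athlete_id` values above are reproducible because UUID5 hashes a fixed namespace together with a seed string. A minimal sketch of that property, reusing the namespace from the cell above. The helper name `seed_to_id` and the seed values are made-up placeholders, not rows from the dataset:

```python
import uuid

# Same namespace constant as the notebook cell above
NAMESPACE = uuid.UUID('a1b2c3d4-e5f6-7890-abcd-ef1234567890')

def seed_to_id(norm_name, games_type, first_year, sport):
    """Join the identity fields with '|' and hash them into a UUID5 string."""
    seed = '|'.join([norm_name, games_type, str(first_year), sport])
    return str(uuid.uuid5(NAMESPACE, seed))

# Hypothetical athlete, used only to illustrate determinism
id_a = seed_to_id('DOE JANE', 'Olympic', 1996, 'Swimming')
id_b = seed_to_id('DOE JANE', 'Olympic', 1996, 'Swimming')
id_c = seed_to_id('DOE JANE', 'Paralympic', 1996, 'Swimming')

print(id_a == id_b)  # identical seed fields give an identical ID
print(id_a == id_c)  # changing any seed field gives a different ID
```

One consequence worth keeping in mind: the ID is a function of `first_games_year`, so if a later step widens an athlete's year range, the stored ID may no longer equal what `make_athlete_id` would return for the updated row.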


In [87]:
# ── Phase 4, Step 5: Final column selection & QC ─────────────────────

# Drop working columns
all_athletes = all_athletes.drop(columns=['_norm', '_source', 'birth_place',
                                          'birth_country'], errors='ignore')

# Ensure classification_code is clean
all_athletes['classification_code'] = all_athletes['classification_code'].replace(
    {'None': None, 'nan': None, '': None}
)

# Column order matching final schema
SCHEMA_COLS = [
    'athlete_id', 'name', 'gender', 'birth_date', 'games_type', 'games_season',
    'primary_sport', 'classification_code', 'height_cm', 'weight_kg',
    'first_games_year', 'last_games_year', 'games_count',
    'gold_count', 'silver_count', 'bronze_count', 'total_medals'
]
# profile_summary and embedding added in Phase 6

# Check for missing columns
missing = [c for c in SCHEMA_COLS if c not in all_athletes.columns]
extra = [c for c in all_athletes.columns if c not in SCHEMA_COLS]
print(f'Missing schema cols: {missing}')
print(f'Extra cols (will drop): {extra}')

all_athletes = all_athletes[[c for c in SCHEMA_COLS if c in all_athletes.columns]]

print(f'\n{"=" * 60}')
print(f'PHASE 4 QC: UNIFICATION')
print(f'{"=" * 60}')
print(f'\n📊 FINAL COUNTS')
print(f'  Total athletes: {len(all_athletes):,}')
print(f'  Olympic:        {(all_athletes["games_type"]=="Olympic").sum():,}')
print(f'  Paralympic:     {(all_athletes["games_type"]=="Paralympic").sum():,}')
print(f'\n⚧ GENDER')
print(all_athletes['gender'].value_counts(dropna=False))
print(f'\n🏟️ SEASON')
print(all_athletes['games_season'].value_counts(dropna=False))
print(f'\n🏅 MEDALS')
for col in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    print(f'  {col}: {all_athletes[col].sum():,.0f}')
print(f'  Athletes with 1+ medal: {(all_athletes["total_medals"]>0).sum():,}')
print(f'\n🏷️ CLASSIFICATION (Paralympic only)')
para = all_athletes[all_athletes['games_type'] == 'Paralympic']
print(f'  Coverage: {para["classification_code"].notna().sum()} / {len(para)}')
print(f'\n📋 SAMPLE (5 random):')
print(all_athletes.sample(5, random_state=42)[SCHEMA_COLS].to_string(index=False))
print(f'\n⚠️  KNOWN ISSUES')
print(f'  Missing gender: {all_athletes["gender"].isna().sum() + (all_athletes["gender"]=="None").sum()}')
print(f'  Missing season: {all_athletes["games_season"].isna().sum() + (all_athletes["games_season"]=="None").sum()}')
print(f'  Missing sport:  {all_athletes["primary_sport"].isna().sum()}')
print(f'  ~100 aggressive dedup losses from earlier pass (mixed true dupes + distinct athletes)')
print(f'\n{"=" * 60}')
print(f'Ready to proceed to Phase 5: Results Table & Career Stat Backfill')
print(f'{"=" * 60}')

Missing schema cols: []
Extra cols (will drop): []

PHASE 4 QC: UNIFICATION

📊 FINAL COUNTS
  Total athletes: 12,207
  Olympic:        10,685
  Paralympic:     1,522

⚧ GENDER
gender
Male      8298
Female    3777
None       132
Name: count, dtype: int64

🏟️ SEASON
games_season
Summer    9920
Winter    2015
None       272
Name: count, dtype: int64

🏅 MEDALS
  gold_count: 3,869
  silver_count: 2,764
  bronze_count: 2,447
  total_medals: 9,080
  Athletes with 1+ medal: 5,221

🏷️ CLASSIFICATION (Paralympic only)
  Coverage: 691 / 1522

📋 SAMPLE (5 random):
                          athlete_id           name gender birth_date games_type games_season primary_sport classification_code  height_cm  weight_kg  first_games_year  last_games_year  games_count  gold_count  silver_count  bronze_count  total_medals
d25208cf-6a56-50e7-b528-28b351382566 Anne Abernathy Female 1953-04-12    Olympic         None           NaN                 NaN      165.0       75.0               NaN 

---
## Phase 5: Results Table & Career Stat Backfill

Build the unified `team_usa_results` fact table (one row per athlete × event × Games) and backfill any missing career stats in `all_athletes`.

**Steps:**
1. Build Olympic results from keithgalli `results.csv` (Olympic backbone)
2. Load and inspect katiepress `medal_athlete.csv` (Paralympic historical medalists)
3. Filter katiepress to USA and standardize it to the results schema
4. Concatenate both sources and attach `athlete_id` via normalized-name matching
5. Backfill medal counts and career stats for athletes whose counts were set to 0 (Paris 2024 Olympic, piterfm Paralympic)
6. QC report
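
The name-matching step can also be expressed as a left join, which makes the match rate visible through pandas' `indicator` flag. This is a sketch on hypothetical miniature frames. The column names mirror the notebook, but the rows and IDs are invented for illustration:

```python
import pandas as pd

# Hypothetical miniature inputs; not real athletes
athletes = pd.DataFrame({
    '_norm': ['DOE JANE', 'DOE JANE', 'ROE RICHARD'],
    'games_type': ['Olympic', 'Olympic', 'Paralympic'],
    'athlete_id': ['id-1', 'id-2', 'id-3'],
})
results = pd.DataFrame({
    '_norm': ['DOE JANE', 'ROE RICHARD', 'POE PAT'],
    'games_type': ['Olympic', 'Paralympic', 'Olympic'],
    'medal': ['Gold', None, None],
})

# Keep one athlete per (name, games_type) key so the join cannot duplicate result rows
lookup = athletes.drop_duplicates(subset=['_norm', 'games_type'], keep='first')

joined = results.merge(
    lookup[['_norm', 'games_type', 'athlete_id']],
    on=['_norm', 'games_type'],
    how='left',
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

match_rate = (joined['_merge'] == 'both').mean()
print(f'Matched: {match_rate:.1%}')
print(joined[['_norm', 'athlete_id', '_merge']])
```

As with a first-match-wins dictionary, the `drop_duplicates` call means two athletes sharing a normalized name within one `games_type` resolve to a single ID. Adding sport or year to the key would be one way to separate them.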

In [88]:
# ── Phase 5, Step 1: Olympic results from keithgalli ─────────────────

results_raw = pd.read_csv(f'{LOCAL_DIR}/olympic-keithgalli/results.csv')
oly_results = results_raw[results_raw['noc'] == 'USA'].copy()

print(f'Raw USA Olympic results: {len(oly_results):,}')
print(f'Columns: {list(oly_results.columns)}')
print(f'\nSample:')
print(oly_results.head(3).to_string(index=False))

# Standardize column names to match results schema
oly_results = oly_results.rename(columns={
    'year': 'games_year',
    'type': 'games_season',  # 'Summer' / 'Winter' in keithgalli
})

# Clean up
oly_results['games_type'] = 'Olympic'
oly_results['classification_code'] = None

# Athlete name: results.csv stores it in the 'as' column (see the column list printed above)
oly_results['athlete_name'] = oly_results['as'].apply(clean_name_mechanical)

# Normalize medal values: anything other than Gold/Silver/Bronze (including NaN) becomes None
oly_results.loc[~oly_results['medal'].isin(['Gold', 'Silver', 'Bronze']), 'medal'] = None

# Select results schema columns
RESULTS_COLS = ['athlete_name', 'games_year', 'games_season', 'games_type',
                'sport', 'discipline', 'event', 'medal', 'classification_code']
# Keep only columns that exist
available = [c for c in RESULTS_COLS if c in oly_results.columns]
missing_cols = [c for c in RESULTS_COLS if c not in oly_results.columns]
print(f'\nAvailable results cols: {available}')
print(f'Missing (will add empty): {missing_cols}')
for c in missing_cols:
    oly_results[c] = None

oly_results['sport'] = oly_results['discipline']
oly_results = oly_results[RESULTS_COLS]

print(f'\nOlympic results: {len(oly_results):,}')
print(f'With medals: {oly_results["medal"].notna().sum():,}')
print(f'Year range: {oly_results["games_year"].min():.0f}–{oly_results["games_year"].max():.0f}')
print(f'Unique athletes: {oly_results["athlete_name"].nunique():,}')

Raw USA Olympic results: 21,353
Columns: ['year', 'type', 'discipline', 'event', 'as', 'athlete_id', 'noc', 'team', 'place', 'tied', 'medal']

Sample:
  year   type discipline                       event            as  athlete_id noc          team  place  tied medal
2008.0 Summer    Archery Individual, Women (Olympic) Khatuna Lorig         504 USA           NaN    5.0 False   NaN
2012.0 Summer    Archery Individual, Women (Olympic) Khatuna Lorig         504 USA           NaN    4.0 False   NaN
2012.0 Summer    Archery       Team, Women (Olympic) Khatuna Lorig         504 USA United States    6.0 False   NaN

Available results cols: ['athlete_name', 'games_year', 'games_season', 'games_type', 'discipline', 'event', 'medal', 'classification_code']
Missing (will add empty): ['sport']

Olympic results: 21,353
With medals: 5,995
Year range: 1896–2022
Unique athletes: 10,118


In [89]:
# ── Phase 5, Step 2: Paralympic results from katiepress ──────────────

katie_raw = pd.read_csv(f'{LOCAL_DIR}/paralympic-katiepress/medal_athlete.csv')
print(f'Raw katiepress rows: {len(katie_raw):,}')
print(f'Columns: {list(katie_raw.columns)}')
print(f'\nSample:')
print(katie_raw.head(3).to_string(index=False))

# Filter to USA
usa_col = [c for c in katie_raw.columns if 'country' in c.lower() or 'npc' in c.lower() or 'noc' in c.lower()]
print(f'\nCountry columns: {usa_col}')

# Identify the right filter column and value
for col in usa_col:
    usa_vals = katie_raw[col].value_counts()
    us_matches = [v for v in usa_vals.index if 'US' in str(v).upper() or 'UNITED' in str(v).upper() or 'AMERICA' in str(v).upper()]
    if us_matches:
        print(f'  {col}: {us_matches[:5]}')

Raw katiepress rows: 29,170
Columns: ['games_code', 'games_year', 'games_city', 'games_country', 'games_continent', 'games_start', 'games_end', 'games_season', 'sport', 'sport_code', 'event_dates', 'event_venue', 'events', 'npcs', 'athletes', 'event', 'medal', 'npc', 'npc_new', 'npc_name', 'athlete_name', 'athlete_info_og']

Sample:
games_code  games_year games_city games_country games_continent  games_start    games_end games_season   sport sport_code event_dates event_venue  events  npcs  athletes                         event  medal npc npc_new       npc_name athlete_name  athlete_info_og
    PG1960        1960       Rome         Italy          Europe 18 September 25 September       Summer Archery         AR 18 - 25 Sep         NaN       8     8        19         Men's FITA Round Open Bronze GBR     GBR United Kingdom       POTTER     POTTER (GBR)
    PG1960        1960       Rome         Italy          Europe 18 September 25 September       Summer Archery         AR 18 - 25 Sep    

In [90]:
# ── Phase 5, Step 3: Filter & standardize katiepress Paralympic results ──

para_results = katie_raw[katie_raw['npc'] == 'USA'].copy()
print(f'USA Paralympic results: {len(para_results):,}')

# Clean names
para_results['athlete_name'] = para_results['athlete_name'].apply(
    lambda x: clean_name_mechanical(str(x).strip())
)

# Extract classification from event
para_results['classification_code'] = para_results['event'].apply(extract_classification)

# Standardize medal values: anything other than Gold/Silver/Bronze becomes None
para_results.loc[~para_results['medal'].isin(['Gold', 'Silver', 'Bronze']), 'medal'] = None

# Add missing columns
para_results['games_type'] = 'Paralympic'
para_results['discipline'] = para_results['sport']

# Select results schema columns
para_results = para_results[RESULTS_COLS]

print(f'With medals: {para_results["medal"].notna().sum():,}')
print(f'Year range: {para_results["games_year"].min():.0f}–{para_results["games_year"].max():.0f}')
print(f'Unique athletes: {para_results["athlete_name"].nunique():,}')
print(f'Unique sports: {para_results["sport"].nunique()}')
print(f'\nSample:')
print(para_results.head(5).to_string(index=False))

USA Paralympic results: 3,105
With medals: 3,105
Year range: 1960–2018
Unique athletes: 1,169
Unique sports: 29

Sample:
athlete_name  games_year games_season games_type     sport discipline                         event  medal classification_code
     Sones P        1960       Summer Paralympic   Archery    Archery Men's St. Nicholas Round Open Bronze                None
   Whitman J        1960       Summer Paralympic   Archery    Archery         Men's FITA Round Open   Gold                None
   Whitman J        1960       Summer Paralympic   Archery    Archery      Men's Windsor Round Open   Gold                None
    Welger S        1960       Summer Paralympic Athletics  Athletics              Men's Shot Put C Bronze                None
     Stein R        1960       Summer Paralympic Athletics  Athletics              Men's Shot Put C   Gold                None


In [91]:
# ── Phase 5, Step 4: Concatenate results & attach athlete_id ─────────

all_results = pd.concat([oly_results, para_results], ignore_index=True)
print(f'Combined results: {len(all_results):,}')
print(f'  Olympic:    {(all_results["games_type"]=="Olympic").sum():,}')
print(f'  Paralympic: {(all_results["games_type"]=="Paralympic").sum():,}')

# Match results to all_athletes via norm_name + games_type
all_results['_norm'] = all_results['athlete_name'].apply(norm_name)
all_athletes['_norm'] = all_athletes['name'].apply(norm_name)

# Build lookup: (norm_name, games_type) → athlete_id
# Note: sport is not part of the key, so athletes who share a normalized name
# within the same games_type are not disambiguated here
id_lookup = {}
for _, row in all_athletes.iterrows():
    key = (row['_norm'], row['games_type'])
    if key not in id_lookup:
        id_lookup[key] = row['athlete_id']
    # else: first match wins (row order of all_athletes), so later same-name
    # athletes receive no results

all_results['athlete_id'] = all_results.apply(
    lambda r: id_lookup.get((r['_norm'], r['games_type'])), axis=1
)

matched = all_results['athlete_id'].notna().sum()
unmatched = all_results['athlete_id'].isna().sum()
print(f'\nMatched to athlete_id: {matched:,} ({matched/len(all_results)*100:.1f}%)')
print(f'Unmatched: {unmatched:,}')

if unmatched > 0:
    unmatched_sample = all_results[all_results['athlete_id'].isna()].sample(
        min(10, unmatched), random_state=42
    )
    print('\nUnmatched sample:')
    print(unmatched_sample[['athlete_name', '_norm', 'games_type', 'games_year',
                             'sport']].to_string(index=False))

# Drop working column
all_results = all_results.drop(columns=['_norm'])
all_athletes = all_athletes.drop(columns=['_norm'], errors='ignore')

Combined results: 24,458
  Olympic:    21,353
  Paralympic: 3,105

Matched to athlete_id: 24,054 (98.3%)
Unmatched: 404

Unmatched sample:
    athlete_name            _norm games_type  games_year                            sport
     Connie Lenz      CONNIE LENZ    Olympic      1948.0 Artistic Gymnastics (Gymnastics)
     Alison Owen      ALISON OWEN    Olympic      1972.0    Cross Country Skiing (Skiing)
  Mia Manganello   MANGANELLO MIA    Olympic      2018.0          Speed Skating (Skating)
  Karen O'Connor    KAREN OCONNOR    Olympic      2008.0 Equestrian Eventing (Equestrian)
Jessica Newberry JESSICA NEWBERRY    Olympic      1964.0 Equestrian Dressage (Equestrian)
    Muriel Davis     DAVIS MURIEL    Olympic      1956.0 Artistic Gymnastics (Gymnastics)
  Danielle Scott   DANIELLE SCOTT    Olympic      2000.0          Volleyball (Volleyball)
     Kara Winger      KARA WINGER    Olympic      2016.0                        Athletics
      Des Davila       DAVILA DES    Olympic      2
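
Some of the unmatched names may belong to athletes who appear in the athletes table under a longer or shorter form of their name. That is an assumption, since the notebook itself only labels them "minor name variations". One hedged way to surface candidates for manual review is a token-subset comparison on the normalized names. The helper `find_candidates` and the names below are hypothetical:

```python
def find_candidates(unmatched_norm, known_norms):
    """Return known names whose token set contains, or is contained in,
    the token set of the unmatched name."""
    tokens = set(unmatched_norm.split())
    if len(tokens) < 2:
        return []  # a single token is too ambiguous to suggest matches
    hits = []
    for known in known_norms:
        known_tokens = set(known.split())
        if len(known_tokens) < 2:
            continue
        if tokens <= known_tokens or known_tokens <= tokens:
            hits.append(known)
    return hits

# Invented normalized names, not taken from the dataset
known = ['DOE JANE SMITH', 'DOE JOHN', 'ROE RICHARD']
print(find_candidates('DOE JANE', known))
```

The output is a review list, not an automatic match: a subset relation between tokens can also link two genuinely different people.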

In [92]:
# ── Phase 5, Step 5: Backfill career stats from results ──────────────

# Compute fresh career stats from all_results for matched athletes
matched_results = all_results[all_results['athlete_id'].notna()].copy()

fresh_stats = matched_results.groupby('athlete_id').agg(
    results_count=('games_year', 'size'),
    first_year=('games_year', 'min'),
    last_year=('games_year', 'max'),
    games_years=('games_year', 'nunique'),
    gold=('medal', lambda x: (x == 'Gold').sum()),
    silver=('medal', lambda x: (x == 'Silver').sum()),
    bronze=('medal', lambda x: (x == 'Bronze').sum()),
).reset_index()

fresh_stats['total'] = fresh_stats['gold'] + fresh_stats['silver'] + fresh_stats['bronze']

print(f'Fresh stats computed for {len(fresh_stats):,} athletes')

# Identify athletes needing backfill (medal counts were set to 0)
zero_medals = all_athletes[
    (all_athletes['total_medals'] == 0) &
    (all_athletes['athlete_id'].isin(fresh_stats['athlete_id']))
]
print(f'Athletes with 0 medals eligible for backfill: {len(zero_medals):,}')

# Backfill: update medals and year ranges where current values are 0 or missing
backfilled = 0
for _, stats in fresh_stats.iterrows():
    mask = all_athletes['athlete_id'] == stats['athlete_id']
    idx = all_athletes[mask].index
    if len(idx) == 0:
        continue
    i = idx[0]
    row = all_athletes.loc[i]

    updated = False
    # Backfill medals if current is 0 and results show medals
    if row['total_medals'] == 0 and stats['total'] > 0:
        all_athletes.at[i, 'gold_count'] = stats['gold']
        all_athletes.at[i, 'silver_count'] = stats['silver']
        all_athletes.at[i, 'bronze_count'] = stats['bronze']
        all_athletes.at[i, 'total_medals'] = stats['total']
        updated = True

    # Extend year range if results show wider range
    if pd.notna(stats['first_year']):
        if pd.isna(row['first_games_year']) or stats['first_year'] < row['first_games_year']:
            all_athletes.at[i, 'first_games_year'] = stats['first_year']
            updated = True
    if pd.notna(stats['last_year']):
        if pd.isna(row['last_games_year']) or stats['last_year'] > row['last_games_year']:
            all_athletes.at[i, 'last_games_year'] = stats['last_year']
            updated = True

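    # Update games_count if results show more. A missing count is treated as 0:
    # `NaN or 0` evaluates to NaN, which would make the comparison always False.
    current_games = row['games_count'] if pd.notna(row['games_count']) else 0
    if stats['games_years'] > current_games:
        all_athletes.at[i, 'games_count'] = stats['games_years']
        updated = True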

    if updated:
        backfilled += 1

print(f'Athletes backfilled: {backfilled:,}')
print(f'\nUpdated medal totals:')
for col in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    print(f'  {col}: {all_athletes[col].sum():,.0f}')
print(f'  Athletes with 1+ medal: {(all_athletes["total_medals"] > 0).sum():,}')

Fresh stats computed for 11,085 athletes
Athletes with 0 medals eligible for backfill: 5,915
Athletes backfilled: 63

Updated medal totals:
  gold_count: 3,879
  silver_count: 2,770
  bronze_count: 2,449
  total_medals: 9,098
  Athletes with 1+ medal: 5,235
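
The row-by-row loop above builds a boolean mask over the whole athletes table for every athlete. The medal portion of the same rule (fill only where the current total is 0 and the results show medals) can be sketched as a single merge plus a masked assignment. The frames here are invented miniatures, and only two of the four medal columns are shown:

```python
import pandas as pd

# Hypothetical miniature tables; column names follow the notebook
athletes = pd.DataFrame({
    'athlete_id': ['a', 'b', 'c'],
    'gold_count': [0, 2, 0],
    'total_medals': [0, 2, 0],
})
fresh = pd.DataFrame({
    'athlete_id': ['a', 'b'],
    'gold': [1, 5],
    'total': [1, 5],
})

# Left join keeps every athlete; athletes without results get NaN stats
merged = athletes.merge(fresh, on='athlete_id', how='left')

# Same condition as the loop: current total is 0 and results show medals
# (a NaN total compares as False, so unmatched athletes are left alone)
needs_fill = (merged['total_medals'] == 0) & (merged['total'] > 0)

for athlete_col, stats_col in [('gold_count', 'gold'), ('total_medals', 'total')]:
    merged.loc[needs_fill, athlete_col] = merged.loc[needs_fill, stats_col].astype(int)

result = merged[list(athletes.columns)]
print(result)
```

Athlete `b` keeps its existing count of 2 even though the results show 5, which mirrors the loop's behaviour of never overwriting a non-zero total.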


In [93]:
# ── PHASE 5 QC REPORT ────────────────────────────────────────────────

print('=' * 60)
print('PHASE 5 QC: RESULTS TABLE & CAREER STAT BACKFILL')
print('=' * 60)

print(f'\n📊 RESULTS TABLE')
print(f'  Total results:  {len(all_results):,}')
print(f'  Olympic:        {(all_results["games_type"]=="Olympic").sum():,}')
print(f'  Paralympic:     {(all_results["games_type"]=="Paralympic").sum():,}')
print(f'  With medals:    {all_results["medal"].notna().sum():,}')
print(f'  Matched to ID:  {all_results["athlete_id"].notna().sum():,} ({all_results["athlete_id"].notna().mean()*100:.1f}%)')
print(f'  Unmatched:      {all_results["athlete_id"].isna().sum():,}')
print(f'  Year range:     {all_results["games_year"].min():.0f}–{all_results["games_year"].max():.0f}')
print(f'  Unique sports:  {all_results["sport"].nunique()}')

print(f'\n📊 ATHLETES TABLE (post-backfill)')
print(f'  Total athletes: {len(all_athletes):,}')
print(f'  Olympic:        {(all_athletes["games_type"]=="Olympic").sum():,}')
print(f'  Paralympic:     {(all_athletes["games_type"]=="Paralympic").sum():,}')

print(f'\n🏅 MEDAL TOTALS')
for col in ['gold_count', 'silver_count', 'bronze_count', 'total_medals']:
    print(f'  {col}: {all_athletes[col].sum():,.0f}')
print(f'  Athletes with 1+ medal: {(all_athletes["total_medals"]>0).sum():,}')

# Cross-check: do athlete medal sums match results medal counts?
results_medals = all_results[all_results['medal'].notna() & all_results['athlete_id'].notna()]
results_gold = (results_medals['medal'] == 'Gold').sum()
results_silver = (results_medals['medal'] == 'Silver').sum()
results_bronze = (results_medals['medal'] == 'Bronze').sum()
athlete_gold = all_athletes['gold_count'].sum()
athlete_silver = all_athletes['silver_count'].sum()
athlete_bronze = all_athletes['bronze_count'].sum()

print(f'\n🔄 CROSS-CHECK (results vs athletes table)')
print(f'  Gold:   results={results_gold:,.0f}, athletes={athlete_gold:,.0f}, diff={athlete_gold-results_gold:+,.0f}')
print(f'  Silver: results={results_silver:,.0f}, athletes={athlete_silver:,.0f}, diff={athlete_silver-results_silver:+,.0f}')
print(f'  Bronze: results={results_bronze:,.0f}, athletes={athlete_bronze:,.0f}, diff={athlete_bronze-results_bronze:+,.0f}')

print(f'\n📋 RESULTS SAMPLE (5 random with medals):')
medal_sample = all_results[all_results['medal'].notna()].sample(5, random_state=42)
print(medal_sample[['athlete_name', 'games_year', 'games_type', 'sport',
                     'event', 'medal']].to_string(index=False))

print(f'\n⚠️  KNOWN ISSUES')
print(f'  Unmatched results: {all_results["athlete_id"].isna().sum()} (minor name variations)')
print(f'  No 2024 Olympic results (not in keithgalli)')
print(f'  Paralympic results are medal-only (katiepress has no non-medal results)')

print(f'\n{"=" * 60}')
print(f'Ready to proceed to Phase 6: Gemini Enrichment')
print(f'{"=" * 60}')

PHASE 5 QC: RESULTS TABLE & CAREER STAT BACKFILL

📊 RESULTS TABLE
  Total results:  24,458
  Olympic:        21,353
  Paralympic:     3,105
  With medals:    9,100
  Matched to ID:  24,054 (98.3%)
  Unmatched:      404
  Year range:     1896–2022
  Unique sports:  92

📊 ATHLETES TABLE (post-backfill)
  Total athletes: 12,207
  Olympic:        10,685
  Paralympic:     1,522

🏅 MEDAL TOTALS
  gold_count: 3,879
  silver_count: 2,770
  bronze_count: 2,449
  total_medals: 9,098
  Athletes with 1+ medal: 5,235

🔄 CROSS-CHECK (results vs athletes table)
  Gold:   results=3,834, athletes=3,879, diff=+45
  Silver: results=2,730, athletes=2,770, diff=+40
  Bronze: results=2,421, athletes=2,449, diff=+28

📋 RESULTS SAMPLE (5 random with medals):
 athlete_name  games_year games_type                   sport                       event  medal
Lee Stecklein      2022.0    Olympic Ice Hockey (Ice Hockey) Ice Hockey, Women (Olympic) Silver
       Levy B      1996.0 Paralympic          

In [94]:
# ── Athlete count investigation ──────────────────────────────────────

print('=== OLYMPIC BREAKDOWN ===')
print(f'Total Olympic: {len(all_athletes[all_athletes["games_type"]=="Olympic"]):,}')
# How many came from each source in Phase 2?
print(f'\nKeithgalli bios (pre-Paris):  Check olympic_athletes before Paris concat')

print('\n=== PARALYMPIC BREAKDOWN ===')
para = all_athletes[all_athletes['games_type'] == 'Paralympic']
print(f'Total Paralympic: {len(para):,}')

# The big question: did katiepress get properly filtered to USA?
# Reload and check
katie_check = pd.read_csv(f'{LOCAL_DIR}/paralympic-katiepress/medal_athlete.csv')
katie_usa = katie_check[katie_check['npc'] == 'USA']
katie_all_countries = katie_check['npc'].nunique()
print(f'\nKatiepress total rows: {len(katie_check):,}')
print(f'Katiepress USA rows:   {len(katie_usa):,}')
print(f'Katiepress countries:  {katie_all_countries}')
print(f'Katiepress USA unique athletes: {katie_usa["athlete_name"].nunique():,}')

# What did we actually get in paralympic_athletes?
print(f'\nParalympic athletes with 0 medals: {(para["total_medals"]==0).sum():,}')
print(f'Paralympic athletes with 1+ medal: {(para["total_medals"]>0).sum():,}')

# Spot check: are there non-USA athletes in the mix?
# Look for athletes with non-American names in katiepress-origin rows
print(f'\nParalympic athletes sample (low medal count):')
print(para.nsmallest(10, 'total_medals')[['name', 'primary_sport', 'first_games_year',
    'games_count', 'total_medals', 'classification_code']].to_string(index=False))

=== OLYMPIC BREAKDOWN ===
Total Olympic: 10,685

Keithgalli bios (pre-Paris):  Check olympic_athletes before Paris concat

=== PARALYMPIC BREAKDOWN ===
Total Paralympic: 1,522

Katiepress total rows: 29,170
Katiepress USA rows:   3,105
Katiepress countries:  127
Katiepress USA unique athletes: 1,169

Paralympic athletes with 0 medals: 358
Paralympic athletes with 1+ medal: 1,164

Paralympic athletes sample (low medal count):
                name         primary_sport  first_games_year  games_count  total_medals classification_code
Abrahams David Henry              Swimming            2020.0          2.0           0.0       S13,SB13,SM13
       Jazmin Almlie              Shooting            2020.0          2.0           0.0                 SH2
        Charles Aoki      Wheelchair Rugby            2020.0          1.0           0.0                 3.0
    Danielle Aravich             Athletics            2020.0          1.0           0.0                 T47
      Ryohei Ariyasu           

---
## Phase 6: Gemini Enrichment

*Single Gemini pass per athlete: name verification + profile generation using Google Search grounding. Then embeddings in a second pass.*

**Phase 6a: Profile Summaries + Name Verification**
1. Setup & configuration
2. Thread-local Gemini client
3. Prompt builder + generation functions (structured JSON output)
4. Test with sample athletes
5. Parallel processing: full run (~12K athletes)
6. Profile QC

**Phase 6b: Vector Embeddings**
7. Embedding functions
8. Test single embedding
9. Parallel processing: full run
10. Merge enrichments into `all_athletes`
11. Phase 6 QC + GCS checkpoint

**Key design decisions:**
- Context-first prompts: every known fact (sport, classification, Games years, events, medals) is included to maximize Google Search accuracy
- Name verification: Gemini returns `verified_name` with confidence level; abbreviated names (katiepress) are expanded when confidently identified
- Safety rails: last name must match, first initial must match, UNVERIFIED if uncertain
- UNVERIFIED athletes get a data-only profile sentence built from table facts
- Resume-safe: progress checkpointed to `/tmp/` CSV files; reruns only process failures
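
The safety rails listed above (last name must match, first initial must match, otherwise UNVERIFIED) could be checked along these lines. This is a sketch under stated assumptions, not the notebook's implementation: it assumes the name on file uses the katiepress `Last F` layout and that the model returns `First ... Last`. The function names and test names are hypothetical:

```python
def passes_name_rails(name_on_file: str, verified_name: str) -> bool:
    """Accept only if the surname tokens and the first initial both agree."""
    on_file = name_on_file.strip().split()
    verified = verified_name.strip().split()
    if len(on_file) < 2 or len(verified) < 2:
        return False
    last_tokens = [t.lower() for t in on_file[:-1]]  # everything before the initial
    initial = on_file[-1][0].lower()
    verified_lower = [t.lower() for t in verified]
    last_matches = verified_lower[-len(last_tokens):] == last_tokens
    initial_matches = verified_lower[0].startswith(initial)
    return last_matches and initial_matches


def resolve_name(name_on_file: str, verified_name: str) -> str:
    """Return the expanded name when it passes the rails, else 'UNVERIFIED'."""
    if passes_name_rails(name_on_file, verified_name):
        return verified_name
    return 'UNVERIFIED'


# Invented names for illustration only
print(resolve_name('Doe J', 'Jane Doe'))   # surname and initial agree
print(resolve_name('Doe J', 'Mary Doe'))   # initial differs
print(resolve_name('Doe J', 'Jane Roe'))   # surname differs
```

Checking in code, rather than relying on the prompt alone, means a confident but wrong model answer still cannot overwrite a name with a different surname.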

In [96]:
# ── Phase 6, Step 1: Setup & Configuration ───────────────────────────
!pip install -q google-genai tqdm

import json
import time
import csv
import re
import threading
import hashlib
from pathlib import Path
from typing import Optional, Tuple, List
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from tqdm.notebook import tqdm


class ProfileConfig:
    """Configuration for profile summary + name verification"""
    MODEL_NAME = 'gemini-2.5-flash'
    TEMPERATURE = 1.2
    MAX_OUTPUT_TOKENS = 1500  # Slightly more than v1 to accommodate JSON wrapper

    BATCH_SIZE = 200
    MAX_WORKERS = 50
    SAVE_INTERVAL = 200
    MAX_RETRIES = 3
    RETRY_DELAY = 2

    MIN_PROFILE_LENGTH = 80
    MAX_PROFILE_LENGTH = 5000

    PROGRESS_FILE = '/tmp/v2_athlete_profiles_progress.csv'
    ERROR_LOG = '/tmp/v2_athlete_profiles_errors.csv'


class EmbeddingConfig:
    """Configuration for embedding generation"""
    MODEL_NAME = 'gemini-embedding-001'
    OUTPUT_DIMENSION = 3072

    BATCH_SIZE = 200
    MAX_WORKERS = 75
    SAVE_INTERVAL = 200
    MAX_RETRIES = 3
    RETRY_DELAY = 2

    PROGRESS_FILE = '/tmp/v2_athlete_embeddings_progress.csv'
    ERROR_LOG = '/tmp/v2_athlete_embeddings_errors.csv'


# Build event lookup from results table: athlete_id → list of "event (year) - medal" strings
event_lookup = {}
for aid, group in all_results.groupby('athlete_id'):
    events = []
    for _, r in group.iterrows():
        event_str = str(r.get('event', '')).strip()
        year = r.get('games_year', '')
        medal = r.get('medal', '')
        if event_str and event_str != 'nan':
            label = event_str
            if pd.notna(year):
                label += f" ({int(year)})"
            if pd.notna(medal) and medal in ('Gold', 'Silver', 'Bronze'):
                label += f" - {medal}"
            events.append(label)
    if events:
        event_lookup[aid] = events

print(f"Profile config:   {ProfileConfig.MODEL_NAME}, {ProfileConfig.MAX_WORKERS} workers")
print(f"Embedding config: {EmbeddingConfig.MODEL_NAME}, {EmbeddingConfig.OUTPUT_DIMENSION} dims, {EmbeddingConfig.MAX_WORKERS} workers")
print(f"Athletes to process: {len(all_athletes):,}")
print(f"Athletes with event detail: {len(event_lookup):,}")
print(f"Athletes without event detail: {len(all_athletes) - len(event_lookup):,}")

Profile config:   gemini-2.5-flash, 50 workers
Embedding config: gemini-embedding-001, 3072 dims, 75 workers
Athletes to process: 12,207
Athletes with event detail: 11,085
Athletes without event detail: 1,122
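
The `PROGRESS_FILE` settings above support the resume-safe behaviour described in the design notes: a rerun skips athletes already recorded. A minimal single-threaded sketch of that idea, where the helper names and the two-column CSV layout are assumptions rather than the notebook's actual format:

```python
import csv
import os
import tempfile

def load_done_ids(progress_file):
    """Return the set of athlete_ids already recorded in the progress CSV."""
    if not os.path.exists(progress_file):
        return set()
    with open(progress_file, newline='', encoding='utf-8') as f:
        return {row['athlete_id'] for row in csv.DictReader(f)}

def append_result(progress_file, athlete_id, profile):
    """Append one finished athlete; write the header only when the file is new."""
    is_new = not os.path.exists(progress_file)
    with open(progress_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['athlete_id', 'profile_summary'])
        writer.writerow([athlete_id, profile])

# Demo with a throwaway file rather than the notebook's /tmp/ paths
progress_file = os.path.join(tempfile.mkdtemp(), 'progress_demo.csv')
append_result(progress_file, 'id-1', 'First profile.')
append_result(progress_file, 'id-2', 'Second profile.')

all_ids = ['id-1', 'id-2', 'id-3']
done = load_done_ids(progress_file)
todo = [aid for aid in all_ids if aid not in done]
print(todo)  # only the athlete not yet checkpointed remains
```

With many worker threads, the append would need to be guarded by a lock or funnelled through the main thread. The sketch leaves that out.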


In [97]:
# ── Phase 6, Step 2: Thread-local Gemini client ──────────────────────
from google import genai
from google.genai import types
import google.auth

_thread_local = threading.local()

def get_client():
    """Get or create a thread-local Gemini client."""
    if getattr(_thread_local, "client", None) is None:
        credentials, _ = google.auth.default()
        _thread_local.client = genai.Client(
            vertexai=True,
            project=PROJECT_ID,
            location=REGION,
        )
    return _thread_local.client

# Test connection
try:
    test_response = get_client().models.generate_content(
        model=ProfileConfig.MODEL_NAME,
        contents='Say "API connected" and nothing else.',
        config=types.GenerateContentConfig(max_output_tokens=20),
    )
    response_text = test_response.text if test_response.text else "(empty response)"
    print(f"✅ Gemini API connected")
    print(f"   Project: {PROJECT_ID}")
    print(f"   Region:  {REGION}")
    print(f"   Model:   {ProfileConfig.MODEL_NAME}")
    print(f"   Response: {response_text.strip()}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    raise

✅ Gemini API connected
   Project: qwiklabs-gcp-01-bafc8841fc77
   Region:  us-central1
   Model:   gemini-2.5-flash
   Response: (empty response)
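
The thread-local pattern used by `get_client()` can be demonstrated offline. In this sketch a hypothetical `FakeClient` class stands in for `genai.Client`, so no credentials or network access are needed. It shows that each worker thread builds at most one client and reuses it across tasks:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()
created = []                      # every stand-in client that gets built
created_lock = threading.Lock()

class FakeClient:
    """Hypothetical stand-in for genai.Client so the sketch runs offline."""

def get_fake_client():
    """Create the client on first use in this thread, then reuse it."""
    if getattr(_local, 'client', None) is None:
        _local.client = FakeClient()
        with created_lock:
            created.append(_local.client)
    return _local.client

def work(_):
    # Two calls on the same thread must return the same object
    return get_fake_client() is get_fake_client()

with ThreadPoolExecutor(max_workers=4) as pool:
    same_within_thread = list(pool.map(work, range(100)))

print(all(same_within_thread))
print(len(created) <= 4)  # never more clients than worker threads
```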


In [103]:
# ── Phase 6, Step 3: Prompt builder + generation functions ───────────

def is_abbreviated(name):
    """Check if name looks abbreviated: katiepress 'Last F' / 'Last Fi', or a leading initial such as 'F Last'."""
    if pd.isna(name):
        return False
    parts = str(name).strip().split()
    if len(parts) < 2:
        return False
    # A first or last token of 1-2 chars is likely an initial/abbreviation
    return len(parts[-1]) <= 2 or len(parts[0]) <= 2


def build_data_only_profile(row):
    """Build a factual sentence from table data for UNVERIFIED athletes."""
    name = str(row.get('name', 'Unknown')).strip()
    games_type = row.get('games_type', 'Olympic')
    sport = row.get('primary_sport', '')
    classification = row.get('classification_code', '')
    first_yr = row.get('first_games_year', '')
    last_yr = row.get('last_games_year', '')
    games_count = row.get('games_count', '')
    total_medals = row.get('total_medals', 0)

    parts = [f"{name} represented the United States as a {games_type} athlete"]

    if pd.notna(sport) and str(sport) != 'nan':
        parts[0] += f" in {sport}"

    if pd.notna(classification) and str(classification) not in ('nan', 'None', ''):
        parts.append(f"competing in classification {classification}")

    if pd.notna(first_yr) and pd.notna(last_yr):
        fy, ly = int(first_yr), int(last_yr)
        if fy == ly:
            parts.append(f"at the {fy} Games")
        else:
            gc = f" across {int(games_count)} Games" if pd.notna(games_count) and int(games_count) > 1 else ""
            parts.append(f"from {fy} to {ly}{gc}")

    if pd.notna(total_medals) and int(total_medals) > 0:
        gold = int(row.get('gold_count', 0)) if pd.notna(row.get('gold_count')) else 0
        silver = int(row.get('silver_count', 0)) if pd.notna(row.get('silver_count')) else 0
        bronze = int(row.get('bronze_count', 0)) if pd.notna(row.get('bronze_count')) else 0
        medal_parts = []
        if gold > 0: medal_parts.append(f"{gold} gold")
        if silver > 0: medal_parts.append(f"{silver} silver")
        if bronze > 0: medal_parts.append(f"{bronze} bronze")
        parts.append(f"earning {', '.join(medal_parts)}")

    return ', '.join(parts) + '.'


def create_profile_prompt(row) -> str:
    """Build a context-rich prompt for name verification + profile generation."""

    name = str(row.get('name', 'Unknown')).strip()
    games_type = row.get('games_type', 'Olympic')
    athlete_id = row.get('athlete_id', '')
    abbreviated = is_abbreviated(name)

    # Header
    lines = [
        f"You are researching Team USA {games_type} athletes.",
        f"Given the following data, use Google Search to identify this athlete "
        f"and write a profile.",
        "",
        f"=== ATHLETE DATA ===",
        f"Name on file: {name}" + (" (likely abbreviated)" if abbreviated else ""),
    ]

    # Structured facts
    if pd.notna(row.get('primary_sport')) and str(row['primary_sport']) != 'nan':
        lines.append(f"Sport: {row['primary_sport']}")
    if pd.notna(row.get('classification_code')) and str(row['classification_code']) not in ('nan', 'None', ''):
        lines.append(f"Paralympic classification: {row['classification_code']}")
    if pd.notna(row.get('gender')) and str(row['gender']) not in ('nan', 'None'):
        lines.append(f"Gender: {row['gender']}")
    if pd.notna(row.get('birth_date')) and str(row['birth_date']) != 'nan':
        lines.append(f"Birth date: {row['birth_date']}")

    # Games info
    if pd.notna(row.get('games_count')) and row['games_count'] > 0:
        fy = int(row['first_games_year']) if pd.notna(row.get('first_games_year')) else '?'
        ly = int(row['last_games_year']) if pd.notna(row.get('last_games_year')) else '?'
        lines.append(f"Games appearances: {int(row['games_count'])} ({fy}–{ly})")

    # Medals
    medal_parts = []
    for label, col in [('Gold', 'gold_count'), ('Silver', 'silver_count'), ('Bronze', 'bronze_count')]:
        if pd.notna(row.get(col)) and row[col] > 0:
            medal_parts.append(f"{int(row[col])} {label}")
    if medal_parts:
        lines.append(f"Medals: {', '.join(medal_parts)}")

    # Event history from results table
    events = event_lookup.get(athlete_id, [])
    if events:
        lines.append(f"\nEvent history:")
        for e in events[:20]:  # Cap at 20 to keep prompt manageable
            lines.append(f"  - {e}")
        if len(events) > 20:
            lines.append(f"  ... and {len(events) - 20} more events")

    # Bio fields (Paris 2024 Paralympic)
    norm_key = norm_name(name)
    bio = paris_para_bios.get(norm_key, {})
    if bio:
        lines.append(f"\nAdditional background:")
        for field, label in [('reason', 'Why they compete'), ('hero', 'Hero/inspiration'),
                              ('philosophy', 'Philosophy'), ('other_sports', 'Other sports'),
                              ('coach', 'Coach'), ('hobbies', 'Hobbies'),
                              ('occupation', 'Occupation'), ('education', 'Education')]:
            if field in bio:
                lines.append(f"  {label}: {bio[field]}")

    # Instructions
    lines.append(f"\n=== INSTRUCTIONS ===")
    lines.append("Using Google Search and the data above, return ONLY valid JSON (no markdown, no backticks):")
    lines.append("""
{
  "verified_name": "Full Name",
  "name_confidence": "high/medium/low",
  "name_reasoning": "Brief explanation of how you identified this athlete",
  "profile": "Two-paragraph profile, 150-250 words...",
  "profile_confidence": "high/medium/low"
}""")

    lines.append(f"\nName verification rules:")
    lines.append(f"- The verified athlete MUST have represented the United States")
    lines.append(f"- Sport and Games years must match the data on file")
    lines.append(f"- The last name must match the name on file")
    if abbreviated:
        tokens = name.split()
        # katiepress format: "LastName F", where the last token is the initial
        if tokens and len(tokens[-1]) <= 2:
            lines.append(f"- The first name must start with '{tokens[-1]}'")
        elif tokens and len(tokens[0]) <= 2:
            lines.append(f"- The first name must start with '{tokens[0]}'")
    lines.append(f"- If multiple athletes could match, set verified_name to \"UNVERIFIED\"")
    lines.append(f"- If you cannot confidently identify the specific individual, set verified_name to \"UNVERIFIED\"")
    lines.append(f"- A wrong name is WORSE than UNVERIFIED")

    lines.append(f"\nProfile guidelines:")
    lines.append(f"- Paragraph 1: Athletic career - achievements, competition highlights, what makes them notable")
    lines.append(f"- Paragraph 2: Personal background - how they entered their sport, training, what drives them")
    if pd.notna(row.get('classification_code')) and str(row['classification_code']) not in ('nan', 'None', ''):
        lines.append(f"- Briefly explain what classification {row['classification_code']} means in {row.get('primary_sport', 'their sport')}")

    return '\n'.join(lines)


def parse_gemini_response(response_text: str) -> Optional[dict]:
    """Parse JSON from a Gemini response, handling common formatting issues.

    Returns None when no JSON object can be recovered.
    """
    if not response_text:
        return None

    text = response_text.strip()

    # Strip markdown code fences
    if text.startswith('```'):
        text = re.sub(r'^```(?:json)?\s*', '', text)
        text = re.sub(r'\s*```$', '', text)

    # Try direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Try to find JSON object in the response
    match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass

    return None


def generate_profile(row, retries: int = ProfileConfig.MAX_RETRIES) -> dict:
    """Generate a profile + verified name with Google Search grounding. Returns dict."""

    prompt = create_profile_prompt(row)
    name = str(row.get('name', 'Unknown')).strip()

    config = types.GenerateContentConfig(
        temperature=ProfileConfig.TEMPERATURE,
        max_output_tokens=ProfileConfig.MAX_OUTPUT_TOKENS,
        tools=[types.Tool(google_search=types.GoogleSearch())],
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    )

    for attempt in range(retries):
        try:
            response = get_client().models.generate_content(
                model=ProfileConfig.MODEL_NAME,
                contents=prompt,
                config=config,
            )

            # Extract text from response
            raw_text = None
            if hasattr(response, 'text') and response.text:
                raw_text = response.text
            elif hasattr(response, 'candidates') and response.candidates:
                candidate = response.candidates[0]
                if hasattr(candidate, 'content') and hasattr(candidate.content, 'parts'):
                    parts = candidate.content.parts
                    if parts and hasattr(parts[0], 'text'):
                        raw_text = parts[0].text

            if not raw_text:
                if attempt < retries - 1:
                    time.sleep(ProfileConfig.RETRY_DELAY)
                    continue
                return {
                    'verified_name': name,
                    'name_confidence': 'low',
                    'name_reasoning': 'No API response',
                    'profile': build_data_only_profile(row),
                    'profile_confidence': 'data_only',
                    'error': 'No text in API response'
                }

            # Parse JSON
            parsed = parse_gemini_response(raw_text)

            if parsed and 'profile' in parsed:
                profile = str(parsed['profile']).strip()

                # Validate profile length: a short profile triggers a retry,
                # but is accepted as-is on the final attempt
                if len(profile) < ProfileConfig.MIN_PROFILE_LENGTH:
                    if attempt < retries - 1:
                        time.sleep(ProfileConfig.RETRY_DELAY)
                        continue

                # Clean whitespace
                profile = re.sub(r'\s+', ' ', profile).strip()
                # Strip Gemini search citation artifacts like [1], [1, 2, 7], etc.
                profile = re.sub(r'\s*\[[\d,\s]+\]', '', profile)

                # Truncate after cleaning so a citation marker is never cut in half
                if len(profile) > ProfileConfig.MAX_PROFILE_LENGTH:
                    profile = profile[:ProfileConfig.MAX_PROFILE_LENGTH] + "..."

                # Detect refusal profiles and swap in data-only fallback
                refusal_signals = ['could not be confidently', 'cannot be accurately',
                                   'could not be identified', 'unable to verify',
                                   'insufficient information', 'cannot be confirmed']
                if any(signal in profile.lower() for signal in refusal_signals):
                    profile = build_data_only_profile(row)
                    return {
                        'verified_name': name,  # Keep original
                        'name_confidence': 'low',
                        'name_reasoning': parsed.get('name_reasoning', 'Gemini returned refusal profile'),
                        'profile': profile,
                        'profile_confidence': 'data_only',
                        'error': None
                    }

                verified = str(parsed.get('verified_name', name)).strip()
                if verified.upper() == 'UNVERIFIED':
                    verified = name  # Keep original

                return {
                    'verified_name': verified,
                    'name_confidence': parsed.get('name_confidence', 'low'),
                    'name_reasoning': str(parsed.get('name_reasoning', '')).strip(),
                    'profile': profile,
                    'profile_confidence': parsed.get('profile_confidence', 'low'),
                    'error': None
                }
            else:
                # Couldn't parse JSON: retry, then fall back to the data-only profile
                if attempt < retries - 1:
                    time.sleep(ProfileConfig.RETRY_DELAY)
                    continue
                return {
                    'verified_name': name,
                    'name_confidence': 'low',
                    'name_reasoning': 'Could not parse JSON response',
                    'profile': build_data_only_profile(row),
                    'profile_confidence': 'data_only',
                    'error': f'JSON parse failed. Raw: {raw_text[:200]}'
                }

        except Exception as e:
            if attempt < retries - 1:
                time.sleep(ProfileConfig.RETRY_DELAY * (attempt + 1))
                continue
            return {
                'verified_name': name,
                'name_confidence': 'low',
                'name_reasoning': str(e),
                'profile': build_data_only_profile(row),
                'profile_confidence': 'data_only',
                'error': f"{type(e).__name__}: {str(e)}"
            }

    return {
        'verified_name': name,
        'name_confidence': 'low',
        'name_reasoning': 'Max retries exceeded',
        'profile': build_data_only_profile(row),
        'profile_confidence': 'data_only',
        'error': 'Max retries exceeded'
    }


import threading

# log_error is called from worker threads, so serialize access to the file
_error_log_lock = threading.Lock()


def log_error(athlete_id, name, error_msg, filename):
    """Append one error row to a CSV log (safe to call from multiple threads)."""
    with _error_log_lock:
        file_exists = Path(filename).exists()
        with open(filename, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['timestamp', 'athlete_id', 'name', 'error'])
            if not file_exists:
                writer.writeheader()
            writer.writerow({
                'timestamp': datetime.now().isoformat(),
                'athlete_id': athlete_id,
                'name': name,
                'error': error_msg
            })


print("✅ Profile + name verification functions defined")
print(f"   Abbreviated name example: {is_abbreviated('Smith J')} (Smith J)")
print(f"   Full name example: {is_abbreviated('Jessica Smith')} (Jessica Smith)")

# Quick test of data-only fallback
test_row = all_athletes.iloc[0]
print(f"\n   Data-only fallback test:")
print(f"   {build_data_only_profile(test_row)}")

✅ Profile + name verification functions defined
   Abbreviated name example: True (Smith J)
   Full name example: False (Jessica Smith)

   Data-only fallback test:
   Khatuna Kvrivishvili-Lorig represented the United States as a Olympic athlete in Archery, from 2008 to 2012 across 2 Games.
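
The parsing strategy in `parse_gemini_response` (strip code fences, try a direct parse, then fall back to a brace-matching regex) can be checked offline without calling the API. The sketch below re-implements that logic in a hypothetical standalone helper, `extract_json`; the helper name and the sample strings are illustrative assumptions, not part of the pipeline.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal fence


def extract_json(response_text: str) -> Optional[dict]:
    """Hypothetical stand-in for parse_gemini_response, for offline checks."""
    if not response_text:
        return None
    text = response_text.strip()
    # Strip a leading fence (optionally tagged json) and a trailing fence
    if text.startswith(FENCE):
        text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
        text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the first {...} span (at most one level of nesting)
    match = re.search(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    return None


# Invented sample responses
fenced = FENCE + 'json\n{"verified_name": "Jane Doe", "profile": "..."}\n' + FENCE
chatty = 'Here is the result: {"verified_name": "UNVERIFIED", "profile": "..."} Done.'

print(extract_json(fenced)["verified_name"])
print(extract_json(chatty)["verified_name"])
print(extract_json("no json here"))
```

The regex fallback only handles one level of nested braces, so a response with deeper nesting would still return `None` and trigger a retry.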


In [104]:
# ── Phase 6, Step 4: Test with sample athletes ───────────────────

test_cases = [
    ("Well-known Olympic", all_athletes[
        all_athletes['name'].str.contains('Simone', na=False) &
        (all_athletes['games_type'] == 'Olympic')
    ].head(1)),

    ("Abbreviated katiepress Paralympic", all_athletes[
        (all_athletes['games_type'] == 'Paralympic') &
        (all_athletes['name'].apply(is_abbreviated)) &
        (all_athletes['total_medals'] > 2)
    ].sample(1, random_state=42)),

    ("Paris 2024 Paralympic (with bio)", all_athletes[
        (all_athletes['games_type'] == 'Paralympic') &
        (all_athletes['name'].apply(
            lambda n: norm_name(n) in paris_para_bios if pd.notna(n) else False
        ))
    ].sample(1, random_state=42)),
]

for label, test_df in test_cases:
    if len(test_df) == 0:
        print(f"\n⚠️  No test athlete found for: {label}")
        continue

    row = test_df.iloc[0]
    print(f"\n{'=' * 70}")
    print(f"TEST: {label}")
    print(f"  Name: {row['name']} | Sport: {row.get('primary_sport', 'N/A')}")
    print(f"  Type: {row['games_type']} | Games: {row.get('games_count', '?')} ({row.get('first_games_year', '?')}–{row.get('last_games_year', '?')})")
    print(f"  Medals: {row.get('total_medals', 0)} | Classification: {row.get('classification_code', 'N/A')}")
    print(f"  Events in lookup: {len(event_lookup.get(row['athlete_id'], []))}")
    print(f"  Abbreviated: {is_abbreviated(row['name'])}")
    print(f"{'=' * 70}")

    # Show prompt preview
    prompt = create_profile_prompt(row)
    print(f"\nPrompt ({len(prompt)} chars, first 500):")
    print(prompt[:500])
    print("...\n")

    # Call Gemini
    print("Calling Gemini + Google Search...")
    result = generate_profile(row)

    print(f"\n  verified_name:    {result['verified_name']}")
    print(f"  name_confidence:  {result['name_confidence']}")
    print(f"  name_reasoning:   {result['name_reasoning'][:150]}")
    print(f"  profile_confidence: {result['profile_confidence']}")
    print(f"  error:            {result['error']}")
    print(f"\n  Profile ({len(result['profile'])} chars):")
    print(f"  {result['profile'][:400]}")
    if len(result['profile']) > 400:
        print(f"  ...")


TEST: Well-known Olympic
  Name: Simone Schaller | Sport: Athletics
  Type: Olympic | Games: 2.0 (1932.0–1936.0)
  Medals: 0.0 | Classification: nan
  Events in lookup: 2
  Abbreviated: False

Prompt (1350 chars, first 500):
You are researching Team USA Olympic athletes.
Given the following data, use Google Search to identify this athlete and write a profile.

=== ATHLETE DATA ===
Name on file: Simone Schaller
Sport: Athletics
Gender: Female
Birth date: 1912-08-22
Games appearances: 2 (1932–1936)

Event history:
  - 80 metres Hurdles, Women (Olympic) (1932)
  - 80 metres Hurdles, Women (Olympic) (1936)

=== INSTRUCTIONS ===
Using Google Search and the data above, return ONLY valid JSON (no markdown, no backticks):


...

Calling Gemini + Google Search...

  verified_name:    Simone Schaller
  name_confidence:  high
  name_reasoning:   The athlete's full name, birth date, sport, gender, and Olympic appearances (years and events) all perfectly match the provided data. Multiple sourc
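
Before a profile is accepted, `generate_profile` collapses whitespace, strips search-citation markers, and checks for refusal phrases. A minimal sketch of those three steps in isolation follows; the helper names `clean_profile` and `looks_like_refusal` and the sample sentences are assumptions made for illustration, while the regexes and phrase list are copied from the function above.

```python
import re

# Phrases copied from the refusal_signals list in generate_profile
REFUSAL_SIGNALS = [
    "could not be confidently", "cannot be accurately",
    "could not be identified", "unable to verify",
    "insufficient information", "cannot be confirmed",
]


def clean_profile(text: str) -> str:
    """Collapse whitespace, then drop citation markers such as [1] or [1, 2, 7]."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s*\[[\d,\s]+\]", "", text)


def looks_like_refusal(text: str) -> bool:
    """True when the text contains any known refusal phrase (case-insensitive)."""
    lowered = text.lower()
    return any(signal in lowered for signal in REFUSAL_SIGNALS)


# Invented sample text
sample = "She won gold in 1984 [1, 2].\n\nShe later coached  juniors [7]."
print(clean_profile(sample))
print(looks_like_refusal("The athlete could not be identified from public sources."))
```

One caveat of the citation regex: it also removes any bracketed number that is genuine content, such as `[1984]`, so it suits prose profiles better than text that uses bracketed years.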

In [105]:
# ── Phase 6, Step 5: Embedding functions + test ──────────────────

def create_embedding_text(row, profile: str) -> str:
    """Create rich text for embedding from structured data + profile."""

    name = str(row.get('name', 'Unknown')).strip()
    parts = [f"Team USA {row.get('games_type', '')} athlete: {name}"]

    if pd.notna(row.get('primary_sport')) and str(row['primary_sport']) != 'nan':
        parts.append(f"Sport: {row['primary_sport']}")
    if pd.notna(row.get('classification_code')) and str(row['classification_code']) not in ('nan', 'None', ''):
        parts.append(f"Classification: {row['classification_code']}")
    if pd.notna(row.get('games_count')) and row['games_count'] > 0:
        year_str = ""
        if pd.notna(row.get('first_games_year')) and pd.notna(row.get('last_games_year')):
            fy, ly = int(row['first_games_year']), int(row['last_games_year'])
            year_str = f" ({fy}–{ly})"
        medal_str = ""
        if pd.notna(row.get('total_medals')) and row['total_medals'] > 0:
            medal_str = f", {int(row['total_medals'])} medals"
        parts.append(f"Career: {int(row['games_count'])} Games{year_str}{medal_str}")

    # Append the profile text when there is one (data-only fallbacks included).
    # Non-string values such as NaN and the legacy failure placeholder are skipped.
    if isinstance(profile, str) and profile and not profile.startswith('[Profile generation failed'):
        parts.append(f"Profile: {profile}")

    return ". ".join(parts)


def generate_embedding(text: str, retries: int = EmbeddingConfig.MAX_RETRIES) -> Tuple[Optional[List[float]], Optional[str]]:
    """Generate embedding for text."""

    for attempt in range(retries):
        try:
            response = get_client().models.embed_content(
                model=EmbeddingConfig.MODEL_NAME,
                contents=text,
                config=types.EmbedContentConfig(
                    output_dimensionality=EmbeddingConfig.OUTPUT_DIMENSION
                )
            )

            if hasattr(response, 'embeddings') and response.embeddings:
                embedding = response.embeddings[0]
                if hasattr(embedding, 'values') and embedding.values:
                    return list(embedding.values), None

            if attempt < retries - 1:
                time.sleep(EmbeddingConfig.RETRY_DELAY)
                continue
            return None, "No embedding values in response"

        except Exception as e:
            if attempt < retries - 1:
                time.sleep(EmbeddingConfig.RETRY_DELAY * (attempt + 1))
                continue
            return None, f"{type(e).__name__}: {str(e)}"

    return None, "Max retries exceeded"


# ── Quick test ────────────────────────────────────────────────────
test_row = all_athletes.iloc[0]
test_profile = "A decorated Olympic archer who represented the United States across multiple Games."

text = create_embedding_text(test_row, test_profile)
print(f"Embedding text ({len(text)} chars):")
print(f"  {text[:300]}")
if len(text) > 300:
    print(f"  ...")

print(f"\nCalling embedding API ({EmbeddingConfig.OUTPUT_DIMENSION} dims)...")
embedding, error = generate_embedding(text)

if embedding:
    print(f"✅ Success!")
    print(f"   Dimensions: {len(embedding)}")
    print(f"   Expected:   {EmbeddingConfig.OUTPUT_DIMENSION}")
    print(f"   First 5:    {[f'{v:.6f}' for v in embedding[:5]]}")
    print(f"   ✓ Dimension check: {'PASS' if len(embedding) == EmbeddingConfig.OUTPUT_DIMENSION else 'FAIL'}")
else:
    print(f"❌ Error: {error}")

Embedding text (191 chars):
  Team USA Olympic athlete: Khatuna Kvrivishvili-Lorig. Sport: Archery. Career: 2 Games (2008–2012). Profile: A decorated Olympic archer who represented the United States across multiple Games.

Calling embedding API (3072 dims)...
✅ Success!
   Dimensions: 3072
   Expected:   3072
   First 5:    ['0.003113', '0.006287', '-0.007408', '-0.054036', '-0.000653']
   ✓ Dimension check: PASS
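
The `embedding` column is meant for similarity search, which typically compares vectors by cosine similarity. As a rough illustration of that comparison, the sketch below uses only the standard library and made-up 3-dimensional vectors (real embeddings here have 3072 dimensions). The function name and the sample values are assumptions, and the search engine that actually queries these vectors may use a different distance measure.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, invented for the example
archer_a = [0.2, 0.1, 0.9]
archer_b = [0.25, 0.05, 0.85]
swimmer = [0.9, 0.3, 0.1]

print(f"archer vs archer:  {cosine_similarity(archer_a, archer_b):.3f}")
print(f"archer vs swimmer: {cosine_similarity(archer_a, swimmer):.3f}")
```

The two archer vectors point in nearly the same direction, so their score is close to 1, while the archer and swimmer vectors score much lower.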


In [114]:
# ── Phase 6, Step 6: Parallel profile generation - full run ──────

def process_single_profile(idx_row) -> dict:
    """Process a single athlete for profile + name verification."""
    idx, row = idx_row
    athlete_id = row['athlete_id']
    name = str(row.get('name', 'Unknown')).strip()

    result = generate_profile(row)
    result['athlete_id'] = athlete_id
    result['original_name'] = name

    if result.get('error'):
        log_error(athlete_id, name, result['error'], ProfileConfig.ERROR_LOG)

    return result


def save_profile_progress(results, filename):
    """Save profile progress to CSV."""
    df = pd.DataFrame(results)
    cols = ['athlete_id', 'original_name', 'verified_name', 'name_confidence',
            'name_reasoning', 'profile', 'profile_confidence', 'error']
    cols = [c for c in cols if c in df.columns]
    df[cols].to_csv(filename, index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
    print(f"  💾 Progress: {filename} ({len(df):,} profiles)")


def generate_all_profiles(df: pd.DataFrame) -> pd.DataFrame:
    """Generate profiles for all athletes with parallel processing + resume."""

    results = []
    errors = []
    processed_ids = set()

    # Resume from progress: only error-free entries count as done
    progress_path = Path(ProfileConfig.PROGRESS_FILE)
    if progress_path.exists():
        print(f"\n📂 Found progress file: {ProfileConfig.PROGRESS_FILE}")
        df_progress = pd.read_csv(ProfileConfig.PROGRESS_FILE)

        # Successful = no error at all (including intentional data_only with no error)
        successful = df_progress[df_progress['error'].isna() | (df_progress['error'] == '')]
        failed = df_progress[df_progress['error'].notna() & (df_progress['error'] != '')]
        processed_ids = set(successful['athlete_id'])

        for _, row in successful.iterrows():
            results.append({
                'athlete_id': row['athlete_id'],
                'original_name': row.get('original_name', ''),
                'verified_name': row.get('verified_name', ''),
                'name_confidence': row.get('name_confidence', ''),
                'name_reasoning': row.get('name_reasoning', ''),
                'profile': row.get('profile', ''),
                'profile_confidence': row.get('profile_confidence', ''),
                'error': None
            })

        print(f"  ✅ Loaded {len(processed_ids):,} successful profiles")
        print(f"  🔄 Will retry {len(failed):,} failed profiles (429s, timeouts, etc.)")

    # Filter to unprocessed + failed
    df_todo = df[~df['athlete_id'].isin(processed_ids)].copy()

    if len(df_todo) == 0:
        print(f"\n✅ All profiles already generated!")
        return pd.DataFrame(results)

    print(f"\n{'=' * 60}")
    print(f"Starting PARALLEL profile generation")
    print(f"  Total athletes:      {len(df):,}")
    print(f"  Already processed:   {len(processed_ids):,}")
    print(f"  To process:          {len(df_todo):,}")
    print(f"  Workers:             {ProfileConfig.MAX_WORKERS}")
    print(f"  Model:               {ProfileConfig.MODEL_NAME} + Google Search")
    print(f"{'=' * 60}\n")

    start_time = time.time()
    rows_to_process = list(df_todo.iterrows())
    total_batches = (len(rows_to_process) + ProfileConfig.BATCH_SIZE - 1) // ProfileConfig.BATCH_SIZE

    with tqdm(total=len(rows_to_process), desc="Generating profiles") as pbar:
        for batch_num in range(total_batches):
            start_idx = batch_num * ProfileConfig.BATCH_SIZE
            end_idx = min(start_idx + ProfileConfig.BATCH_SIZE, len(rows_to_process))
            batch = rows_to_process[start_idx:end_idx]

            with ThreadPoolExecutor(max_workers=ProfileConfig.MAX_WORKERS) as executor:
                futures = {executor.submit(process_single_profile, item): item for item in batch}

                for future in as_completed(futures):
                    result = future.result()
                    results.append(result)
                    if result.get('error'):
                        errors.append(result['athlete_id'])
                    pbar.update(1)

            # Checkpoint
            if (batch_num + 1) % max(1, ProfileConfig.SAVE_INTERVAL // ProfileConfig.BATCH_SIZE) == 0 \
               or end_idx == len(rows_to_process):
                save_profile_progress(results, ProfileConfig.PROGRESS_FILE)

    # Final save
    save_profile_progress(results, ProfileConfig.PROGRESS_FILE)

    elapsed = time.time() - start_time
    print(f"\n{'=' * 60}")
    print(f"Profile Generation Complete!")
    print(f"{'=' * 60}")
    print(f"  Newly processed: {len(df_todo):,}")
    print(f"  Total profiles:  {len(results):,}")
    print(f"  Successful:      {len(results) - len(errors):,}")
    print(f"  Errors:          {len(errors):,}")
    if len(df_todo) > 0:
        print(f"  Time:            {elapsed/60:.1f} minutes")
        print(f"  Throughput:      {len(df_todo)/elapsed*60:.1f} athletes/minute")
    if errors:
        print(f"  Error rate:      {len(errors)/len(results)*100:.1f}%")

    return pd.DataFrame(results)


# ── Run it ────────────────────────────────────────────────────────
df_profiles = generate_all_profiles(all_athletes)


📂 Found progress file: /tmp/v2_athlete_profiles_progress.csv
  ✅ Loaded 12,206 successful profiles
  🔄 Will retry 1 failed profiles (429s, timeouts, etc.)

Starting PARALLEL profile generation
  Total athletes:      12,207
  Already processed:   12,206
  To process:          1
  Workers:             50
  Model:               gemini-2.5-flash + Google Search



Generating profiles:   0%|          | 0/1 [00:00<?, ?it/s]

  💾 Progress: /tmp/v2_athlete_profiles_progress.csv (12,207 profiles)
  💾 Progress: /tmp/v2_athlete_profiles_progress.csv (12,207 profiles)

Profile Generation Complete!
  Newly processed: 1
  Total profiles:  12,207
  Successful:      12,206
  Errors:          1
  Time:            0.2 minutes
  Throughput:      6.2 athletes/minute
  Error rate:      0.0%
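
The run above resumes from a progress file, retries only the entries that previously failed, and checkpoints between batches. The same pattern can be tried on a toy workload, as sketched below. This is a simplified illustration under stated assumptions: `fake_worker` is a hypothetical stand-in for `process_single_profile`, progress is stored as JSON rather than CSV for brevity, and a checkpoint is written after every batch instead of at a configurable interval.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def fake_worker(item_id: int) -> dict:
    """Hypothetical stand-in for process_single_profile; id 3 always fails."""
    error = "simulated failure" if item_id == 3 else None
    return {"id": item_id, "value": item_id * 10, "error": error}


def run_with_resume(ids, progress_file: Path, batch_size: int = 4, workers: int = 2):
    """Process ids in parallel batches, skipping ids already saved without an error."""
    results = []
    if progress_file.exists():
        saved = json.loads(progress_file.read_text())
        results = [r for r in saved if not r["error"]]  # keep successes only
    done = {r["id"] for r in results}
    todo = [i for i in ids if i not in done]

    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = [executor.submit(fake_worker, i) for i in batch]
            for future in as_completed(futures):
                results.append(future.result())
        progress_file.write_text(json.dumps(results))  # checkpoint after each batch
    return results, todo


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "progress.json"
    first, todo_first = run_with_resume(range(6), path)
    second, todo_second = run_with_resume(range(6), path)
    print("first run processed: ", todo_first)
    print("second run processed:", todo_second)
```

On the second call only the failed id is reprocessed, which mirrors the "To process: 1" line in the output above.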


In [121]:
# Patch the 8 NaN profiles with data-only fallbacks
patched = 0
for idx, row in df_profiles.iterrows():
    if not isinstance(row['profile'], str) or pd.isna(row['profile']):
        aid = row['athlete_id']
        ath_row = all_athletes[all_athletes['athlete_id'] == aid]
        if len(ath_row) > 0:
            df_profiles.at[idx, 'profile'] = build_data_only_profile(ath_row.iloc[0])
            df_profiles.at[idx, 'profile_confidence'] = 'data_only'
            patched += 1

# Rebuild profile lookup
profile_lookup = dict(zip(df_profiles['athlete_id'], df_profiles['profile']))

# Verify
still_bad = df_profiles[df_profiles['profile'].isna() | df_profiles['profile'].apply(lambda x: not isinstance(x, str))]
print(f"Patched: {patched}")
print(f"Remaining non-string profiles: {len(still_bad)}")

# Save updated progress
save_profile_progress(df_profiles.to_dict('records'), ProfileConfig.PROGRESS_FILE)

Patched: 8
Remaining non-string profiles: 0
  üíæ Progress: /tmp/v2_athlete_profiles_progress.csv (12,207 profiles)


In [122]:
# Check for NaN profiles in df_profiles
nan_profiles = df_profiles[df_profiles['profile'].isna() | (df_profiles['profile'].apply(lambda x: not isinstance(x, str)))]
print(f"Non-string profiles in df_profiles: {len(nan_profiles)}")
if len(nan_profiles) > 0:
    print(nan_profiles[['original_name', 'verified_name', 'name_confidence', 'profile_confidence', 'error']].to_string())

Non-string profiles in df_profiles: 0
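
Both `generate_profile` and `generate_embedding` retry failed calls with a delay that grows linearly with the attempt number (`RETRY_DELAY * (attempt + 1)`). The sketch below isolates that retry loop so it can be tested without the API. The wrapper `call_with_retries`, the `flaky` function, and the tiny delay are hypothetical and chosen only to keep the example fast.

```python
import time
from typing import Callable, Optional, Tuple


def call_with_retries(fn: Callable[[], str], retries: int = 3,
                      base_delay: float = 0.01) -> Tuple[Optional[str], Optional[str]]:
    """Return (result, None) on success, or (None, error message) after the last attempt."""
    for attempt in range(retries):
        try:
            return fn(), None
        except Exception as e:
            if attempt < retries - 1:
                # Linear backoff: 1x, 2x, 3x ... the base delay
                time.sleep(base_delay * (attempt + 1))
                continue
            return None, f"{type(e).__name__}: {e}"
    return None, "Max retries exceeded"


calls = {"count": 0}


def flaky() -> str:
    """Hypothetical API call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result, error = call_with_retries(flaky)
print(result, error, calls["count"])
```

Linear backoff is simple and predictable; for sustained rate limiting (the 429s mentioned above), exponential backoff with jitter is a common alternative, though the notebook does not use it.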


In [123]:
# ── Phase 6, Step 7: Parallel embedding generation - full run ────

# Build profile lookup from completed profiles
profile_lookup = dict(zip(df_profiles['athlete_id'], df_profiles['profile']))


def process_single_embedding(item) -> dict:
    """Process a single athlete for embedding."""
    athlete_id, row, profile = item

    text = create_embedding_text(row, profile)
    embedding, error = generate_embedding(text)

    if embedding:
        return {'athlete_id': athlete_id, 'embedding': embedding, 'error': None}
    else:
        name = str(row.get('name', 'Unknown')).strip()
        log_error(athlete_id, name, error, EmbeddingConfig.ERROR_LOG)
        return {'athlete_id': athlete_id, 'embedding': None, 'error': error}


def save_embedding_progress(results, filename):
    """Save embedding progress to CSV."""
    df = pd.DataFrame([r for r in results if r['embedding'] is not None])
    if len(df) > 0:
        df['embedding'] = df['embedding'].apply(json.dumps)
        df[['athlete_id', 'embedding']].to_csv(filename, index=False, encoding='utf-8')
    print(f"  💾 Progress: {filename} ({len(df):,} embeddings)")


def generate_all_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    """Generate embeddings for all athletes with parallel processing + resume."""

    results = []
    errors = []
    processed_ids = set()

    # Resume from progress
    progress_path = Path(EmbeddingConfig.PROGRESS_FILE)
    if progress_path.exists():
        print(f"\n📂 Found progress file: {EmbeddingConfig.PROGRESS_FILE}")
        df_progress = pd.read_csv(EmbeddingConfig.PROGRESS_FILE)
        df_progress['embedding'] = df_progress['embedding'].apply(
            lambda x: json.loads(x) if isinstance(x, str) and x.startswith('[') else None)
        successful = df_progress[df_progress['embedding'].notna()]
        processed_ids = set(successful['athlete_id'])

        for _, row in successful.iterrows():
            results.append({
                'athlete_id': row['athlete_id'],
                'embedding': row['embedding'],
                'error': None
            })
        print(f"  ✅ Loaded {len(successful):,} embeddings")

    # Filter to unprocessed
    df_todo = df[~df['athlete_id'].isin(processed_ids)].copy()

    if len(df_todo) == 0:
        print(f"\n✅ All embeddings already generated!")
        return pd.DataFrame(results)

    print(f"\n{'=' * 60}")
    print(f"Starting PARALLEL embedding generation")
    print(f"  Total athletes:      {len(df):,}")
    print(f"  Already processed:   {len(processed_ids):,}")
    print(f"  To process:          {len(df_todo):,}")
    print(f"  Workers:             {EmbeddingConfig.MAX_WORKERS}")
    print(f"  Model:               {EmbeddingConfig.MODEL_NAME}")
    print(f"  Dimensions:          {EmbeddingConfig.OUTPUT_DIMENSION}")
    print(f"{'=' * 60}\n")

    start_time = time.time()

    # Build items: (athlete_id, row, profile)
    items = []
    for idx, row in df_todo.iterrows():
        profile = profile_lookup.get(row['athlete_id'], '')
        items.append((row['athlete_id'], row, profile))

    total_batches = (len(items) + EmbeddingConfig.BATCH_SIZE - 1) // EmbeddingConfig.BATCH_SIZE

    with tqdm(total=len(items), desc="Generating embeddings") as pbar:
        for batch_num in range(total_batches):
            start_idx = batch_num * EmbeddingConfig.BATCH_SIZE
            end_idx = min(start_idx + EmbeddingConfig.BATCH_SIZE, len(items))
            batch = items[start_idx:end_idx]

            with ThreadPoolExecutor(max_workers=EmbeddingConfig.MAX_WORKERS) as executor:
                futures = {executor.submit(process_single_embedding, item): item for item in batch}

                for future in as_completed(futures):
                    result = future.result()
                    results.append(result)
                    if result['error']:
                        errors.append(result['athlete_id'])
                    pbar.update(1)

            if (batch_num + 1) % max(1, EmbeddingConfig.SAVE_INTERVAL // EmbeddingConfig.BATCH_SIZE) == 0 \
               or end_idx == len(items):
                save_embedding_progress(results, EmbeddingConfig.PROGRESS_FILE)

    # Final save
    save_embedding_progress(results, EmbeddingConfig.PROGRESS_FILE)

    elapsed = time.time() - start_time
    print(f"\n{'=' * 60}")
    print(f"Embedding Generation Complete!")
    print(f"{'=' * 60}")
    print(f"  Newly processed: {len(df_todo):,}")
    print(f"  Total embeddings: {len([r for r in results if r['embedding']]):,}")
    print(f"  Errors:           {len(errors):,}")
    if len(df_todo) > 0:
        print(f"  Time:             {elapsed/60:.1f} minutes")
        print(f"  Throughput:       {len(df_todo)/elapsed*60:.1f} athletes/minute")

    return pd.DataFrame(results)


# ── Run it ────────────────────────────────────────────────────────
df_embeddings = generate_all_embeddings(all_athletes)


📂 Found progress file: /tmp/v2_athlete_embeddings_progress.csv
  ✅ Loaded 1,200 embeddings

Starting PARALLEL embedding generation
  Total athletes:      12,207
  Already processed:   1,200
  To process:          11,007
  Workers:             75
  Model:               gemini-embedding-001
  Dimensions:          3072



Generating embeddings:   0%|          | 0/11007 [00:00<?, ?it/s]

  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (1,400 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (1,600 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (1,800 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (2,000 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (2,200 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (2,400 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (2,600 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (2,800 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (3,000 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (3,200 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (3,400 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (3,600 embeddings)
  💾 Progress: /tmp/v2_athlete_embeddings_progress.csv (3,800 embeddings)
  💾 Progr

In [125]:
# ── Phase 6, Step 8: Profile + Embedding QC ──────────────────────

print('=' * 70)
print('PHASE 6 QC: GEMINI ENRICHMENT')
print('=' * 70)

# ── PROFILE SUMMARIES ────────────────────────────────────────────
print(f'\n{"─" * 70}')
print('PROFILE SUMMARIES')
print(f'{"─" * 70}')

total = len(df_profiles)
has_error = df_profiles['error'].notna() & (df_profiles['error'] != '')
data_only = df_profiles['profile_confidence'] == 'data_only'
gemini_profiles = ~has_error & ~data_only

print(f'\n📊 COUNTS')
print(f'  Total:              {total:,}')
print(f'  Gemini-generated:   {gemini_profiles.sum():,} ({gemini_profiles.sum()/total*100:.1f}%)')
print(f'  Data-only fallback: {data_only.sum():,} ({data_only.sum()/total*100:.1f}%)')
print(f'  Errors:             {has_error.sum():,} ({has_error.sum()/total*100:.1f}%)')

if gemini_profiles.sum() > 0:
    lengths = df_profiles.loc[gemini_profiles, 'profile'].str.len()
    print(f'\n📏 PROFILE LENGTH (Gemini-generated)')
    print(f'  Min:    {lengths.min():,} chars')
    print(f'  Max:    {lengths.max():,} chars')
    print(f'  Mean:   {lengths.mean():,.0f} chars')
    print(f'  Median: {lengths.median():,.0f} chars')

# ── NAME VERIFICATION ────────────────────────────────────────────
print(f'\n{"─" * 70}')
print('NAME VERIFICATION')
print(f'{"─" * 70}')

print(f'\n🏷️ NAME CONFIDENCE')
print(df_profiles['name_confidence'].value_counts().to_string())

# A missing verified_name is not a name change (NaN != x evaluates to True)
name_changed = df_profiles['verified_name'].notna() & (
    df_profiles['verified_name'] != df_profiles['original_name'])
print(f'\n🔄 NAME CHANGES')
print(f'  Names updated:    {name_changed.sum():,}')
print(f'  Names unchanged:  {(~name_changed).sum():,}')

# Show name changes by confidence
if name_changed.sum() > 0:
    changed = df_profiles[name_changed]
    print(f'\n  By confidence:')
    print(f'    High:   {(changed["name_confidence"] == "high").sum():,}')
    print(f'    Medium: {(changed["name_confidence"] == "medium").sum():,}')
    print(f'    Low:    {(changed["name_confidence"] == "low").sum():,}')

    print(f'\n  Sample name changes (high confidence, first 10):')
    high_changes = changed[changed['name_confidence'] == 'high'].head(10)
    for _, r in high_changes.iterrows():
        print(f'    {r["original_name"]:25s} → {r["verified_name"]}')

# ── PROFILE CONFIDENCE ───────────────────────────────────────────
print(f'\n{"─" * 70}')
print('PROFILE CONFIDENCE')
print(f'{"─" * 70}')
print(df_profiles['profile_confidence'].value_counts().to_string())

# ── BY GAMES TYPE ────────────────────────────────────────────────
print(f'\n{"─" * 70}')
print('BREAKDOWN BY GAMES TYPE')
print(f'{"─" * 70}')

merged = df_profiles.merge(
    all_athletes[['athlete_id', 'games_type', 'name']].rename(columns={'name': 'ath_name'}),
    on='athlete_id', how='left'
)

for gt in ['Olympic', 'Paralympic']:
    sub = merged[merged['games_type'] == gt]
    gem = (sub['profile_confidence'] != 'data_only') & (sub['error'].isna() | (sub['error'] == ''))
    do = sub['profile_confidence'] == 'data_only'
    err = sub['error'].notna() & (sub['error'] != '')
    # Rows with no verified_name are not counted as updated
    nc = sub['verified_name'].notna() & (sub['verified_name'] != sub['original_name'])
    print(f'\n  {gt}:')
    print(f'    Total:            {len(sub):,}')
    print(f'    Gemini profiles:  {gem.sum():,}')
    print(f'    Data-only:        {do.sum():,}')
    print(f'    Errors:           {err.sum():,}')
    print(f'    Names updated:    {nc.sum():,}')

# ── EMBEDDINGS ───────────────────────────────────────────────────
print(f'\n{"─" * 70}')
print('EMBEDDINGS')
print(f'{"─" * 70}')

has_emb = df_embeddings['embedding'].apply(lambda x: isinstance(x, list))
emb_errors = df_embeddings['error'].notna()

print(f'\n📊 COUNTS')
print(f'  Total:       {len(df_embeddings):,}')
print(f'  Successful:  {has_emb.sum():,} ({has_emb.sum()/len(df_embeddings)*100:.1f}%)')
print(f'  Failed:      {emb_errors.sum():,}')

if has_emb.sum() > 0:
    sample_emb = df_embeddings.loc[has_emb, 'embedding'].iloc[0]
    print(f'  Dimensions:  {len(sample_emb)}')
    print(f'  Expected:    {EmbeddingConfig.OUTPUT_DIMENSION}')

# ── SAMPLE ERRORS ────────────────────────────────────────────────
if has_error.sum() > 0:
    print(f'\n{"─" * 70}')
    print('SAMPLE ERRORS (first 10)')
    print(f'{"─" * 70}')
    err_sample = df_profiles[has_error][['original_name', 'error']].head(10)
    for _, r in err_sample.iterrows():
        print(f'  {r["original_name"]:25s} | {str(r["error"])[:80]}')

# ── SAMPLE PROFILES ──────────────────────────────────────────────
print(f'\n{"─" * 70}')
print('SAMPLE PROFILES')
print(f'{"─" * 70}')

for label, mask in [('Gemini high-confidence', gemini_profiles & (df_profiles['profile_confidence'] == 'high')),
                     ('Data-only fallback', data_only)]:
    sub = df_profiles[mask]
    if len(sub) > 0:
        s = sub.sample(1, random_state=42).iloc[0]
        print(f'\n  --- {label}: {s["verified_name"]} ---')
        print(f'  Confidence: name={s["name_confidence"]}, profile={s["profile_confidence"]}')
        print(f'  {s["profile"][:300]}')
        if len(s['profile']) > 300:
            print(f'  ...')

print(f'\n{"=" * 70}')
print('Ready to proceed to Step 9: Merge enrichments into all_athletes')
print('=' * 70)

PHASE 6 QC: GEMINI ENRICHMENT

──────────────────────────────────────────────────────────────────────
PROFILE SUMMARIES
──────────────────────────────────────────────────────────────────────

📊 COUNTS
  Total:              12,207
  Gemini-generated:   12,147 (99.5%)
  Data-only fallback: 60 (0.5%)
  Errors:             1 (0.0%)

📏 PROFILE LENGTH (Gemini-generated)
  Min:    10 chars
  Max:    2,238 chars
  Mean:   1,275 chars
  Median: 1,285 chars

──────────────────────────────────────────────────────────────────────
NAME VERIFICATION
──────────
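
The length summary above reports a 10-character minimum among Gemini-generated profiles, which suggests at least one near-empty response slipped through. A minimal sketch for listing such rows follows. It assumes the `athlete_id` and `profile` columns used in the QC cell; the helper name `flag_short_profiles` and the 200-character threshold are hypothetical and chosen only for illustration.

```python
import pandas as pd


def flag_short_profiles(df: pd.DataFrame, min_chars: int = 200) -> pd.DataFrame:
    """Return rows whose profile text is missing or shorter than min_chars."""
    # Treat missing profiles as empty strings so they are flagged too
    lengths = df["profile"].fillna("").str.len()
    too_short = lengths < min_chars
    flagged = df.loc[too_short, ["athlete_id", "profile"]].copy()
    flagged["profile_chars"] = lengths[too_short]
    return flagged.sort_values("profile_chars")


# Toy frame standing in for df_profiles (values are illustrative only)
demo = pd.DataFrame({
    "athlete_id": ["a1", "a2", "a3"],
    "profile": ["Too short.", "x" * 1200, None],
})
short = flag_short_profiles(demo, min_chars=200)
print(short[["athlete_id", "profile_chars"]].to_string(index=False))
```

Flagged rows could be re-queued for generation or routed to the data-only fallback before the merge step; which of the two is appropriate depends on how the profile generator handles retries.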

In [128]:
# ── Phase 6, Step 9: Merge enrichments into all_athletes ─────────
import csv
import json

# 1. Apply verified names (high/medium confidence only, and only real changes)
name_update_mask = (
    df_profiles['name_confidence'].isin(['high', 'medium'])
    & df_profiles['verified_name'].notna()
    & (df_profiles['verified_name'] != df_profiles['original_name'])
)
name_map = dict(zip(df_profiles.loc[name_update_mask, 'athlete_id'],
                    df_profiles.loc[name_update_mask, 'verified_name']))

print(f"Applying {len(name_map):,} verified name updates...")
# Athletes without an entry in name_map keep their current name
all_athletes['name'] = all_athletes['athlete_id'].map(name_map).fillna(all_athletes['name'])

# 2. Add profile summaries
profile_map = dict(zip(df_profiles['athlete_id'], df_profiles['profile']))
all_athletes['profile_summary'] = all_athletes['athlete_id'].map(profile_map)

# 3. Add embeddings (only rows whose embedding is an actual list)
valid_emb = df_embeddings['embedding'].apply(lambda x: isinstance(x, list))
embedding_map = dict(zip(df_embeddings.loc[valid_emb, 'athlete_id'],
                         df_embeddings.loc[valid_emb, 'embedding']))
all_athletes['embedding'] = all_athletes['athlete_id'].map(embedding_map)

# 4. Sync updated names to results table
print("Syncing names to results table...")
results_name_map = dict(zip(all_athletes['athlete_id'], all_athletes['name']))
all_results['athlete_name'] = all_results.apply(
    lambda r: results_name_map.get(r['athlete_id'], r['athlete_name']), axis=1
)

# 5. Save enriched checkpoints
enriched = all_athletes.copy()
enriched['embedding'] = enriched['embedding'].apply(
    lambda x: json.dumps(x) if isinstance(x, list) else None)

ath_path = '/tmp/team_usa_athletes_enriched.csv'
enriched.to_csv(ath_path, index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)

res_path = '/tmp/team_usa_results.csv'
all_results.to_csv(res_path, index=False)

# Upload both files
ath_dest = f'{BUCKET}/enriched/team_usa_athletes_enriched.csv'
res_dest = f'{BUCKET}/enriched/team_usa_results.csv'

!gcloud storage cp {ath_path} {ath_dest}
!gcloud storage cp {res_path} {res_dest}

Applying 5,736 verified name updates...
Syncing names to results table...
ERROR: Cannot check if the destination bucket is compatible for running parallel composite uploads as the user does not permission to perform GET operation on the bucket. The operation will be performed without parallel composite upload feature and hence might perform relatively slower.
Copying file:///tmp/team_usa_athletes_enriched.csv to gs://class-demo/team-usa/enriched/team_usa_athletes_enriched.csv

Average throughput: 81.9MiB/s
Copying file:///tmp/team_usa_results.csv to gs://class-demo/team-usa/enriched/team_usa_results.csv
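
Step 9 writes the `embedding` column to CSV as a JSON string, so anything that reads the checkpoint back gets strings (or NaN) rather than Python lists, and a check such as `isinstance(x, list)` would fail until the column is parsed. The sketch below shows the round trip on a toy frame; the three-element vectors are illustrative stand-ins for the real 3072-dimensional embeddings, and an in-memory buffer replaces the file on disk.

```python
import io
import json

import pandas as pd

# Toy frame: one athlete with an embedding, one without (illustrative values)
df = pd.DataFrame({
    "athlete_id": ["a1", "a2"],
    "embedding": [[0.1, -0.2, 0.3], None],
})

# Serialize the same way Step 9 does: lists become JSON strings, anything else None
out = df.copy()
out["embedding"] = out["embedding"].apply(
    lambda x: json.dumps(x) if isinstance(x, list) else None)

buf = io.StringIO()
out.to_csv(buf, index=False)
buf.seek(0)

# Reading back: the column arrives as strings (or NaN), so parse it explicitly
back = pd.read_csv(buf)
back["embedding"] = back["embedding"].apply(
    lambda s: json.loads(s) if isinstance(s, str) else None)

print(back["embedding"].apply(lambda x: isinstance(x, list)).tolist())
```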


---
## Phase 7: Validation & Export

*Final QC, vector similarity smoke test, schema freeze, and export to GCS.*

**Steps:**
1. Dataset overview: temporal coverage, gender, sport diversity, medal leaders
2. Paralympic classification analysis
3. Vector similarity smoke test
4. Schema freeze + final export to `gs://class-demo/team-usa/final/`
5. Data card

In [129]:
# ── Phase 7, Step 1: Dataset overview ────────────────────────────
import numpy as np

print('=' * 70)
print('DATASET OVERVIEW')
print('=' * 70)

# Temporal span
print('\n📅 TEMPORAL COVERAGE')
for gt in ['Olympic', 'Paralympic']:
    sub = all_athletes[all_athletes['games_type'] == gt]
    first = sub['first_games_year'].dropna()
    last = sub['last_games_year'].dropna()
    if len(first) > 0:
        print(f'  {gt:15s} {int(first.min())}–{int(last.max())}')

# Gender breakdown
print('\n👥 GENDER DISTRIBUTION')
gender_by_type = all_athletes.groupby(['games_type', 'gender']).size().unstack(fill_value=0)
print(gender_by_type.to_string())

# Sport diversity
print('\n🏅 SPORT DIVERSITY')
for gt in ['Olympic', 'Paralympic']:
    sub = all_athletes[all_athletes['games_type'] == gt]
    n_sports = sub['primary_sport'].nunique()
    top_3 = sub['primary_sport'].value_counts().head(3)
    print(f'  {gt}: {n_sports} sports - top 3: {", ".join(f"{s} ({c})" for s, c in top_3.items())}')

# Medal leaders
print('\n🥇 TOP 10 MEDALISTS (ALL-TIME)')
top = all_athletes.nlargest(10, 'total_medals')[
    ['name', 'games_type', 'primary_sport', 'gold_count', 'total_medals', 'games_count']
].reset_index(drop=True)
top.index = top.index + 1
print(top.to_string())

# Multi-Games athletes
print('\n🔁 MULTI-GAMES ATHLETES')
for gt in ['Olympic', 'Paralympic']:
    is_type = all_athletes['games_type'] == gt
    multi2 = is_type & (all_athletes['games_count'] >= 2)
    multi4 = is_type & (all_athletes['games_count'] >= 4)
    print(f'  {gt:15s} 2+ Games: {multi2.sum():,}   4+ Games: {multi4.sum():,}')

# Enrichment coverage
print('\n📝 ENRICHMENT COVERAGE')
has_profile = all_athletes['profile_summary'].notna()
has_embedding = all_athletes['embedding'].notna()
print(f'  Profiles:    {has_profile.sum():,} / {len(all_athletes):,} ({has_profile.sum()/len(all_athletes)*100:.1f}%)')
print(f'  Embeddings:  {has_embedding.sum():,} / {len(all_athletes):,} ({has_embedding.sum()/len(all_athletes)*100:.1f}%)')

DATASET OVERVIEW

📅 TEMPORAL COVERAGE
  Olympic         1896–2024
  Paralympic      1960–2024

👥 GENDER DISTRIBUTION
gender      Female  Male
games_type              
Olympic       3151  7534
Paralympic     626   764

🏅 SPORT DIVERSITY
  Olympic: 78 sports - top 3: Athletics (2070), Swimming (758), Rowing (713)
  Paralympic: 34 sports - top 3: Athletics (466), Swimming (228), Wheelchair Basketball (132)

🥇 TOP 10 MEDALISTS (ALL-TIME)
                name  games_type  primary_sport  gold_count  total_medals  games_count
1       Trischa Zorn  Paralympic       Swimming        32.0          46.0          7.0
2     Michael Phelps     Olympic       Swimming        23.0          28.0          5.0
3       Jessica Long  Paralympic       Swimming        13.0          23.0          4.0
4        Bart Dodson  Paralympic      Athletics        13.0          20.0          5.0
5      Erin Popovich  Paralympic       Swimming        14.0          19.0          3.0
6     Rosalie Hixson
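
The gender table sums to 1,390 Paralympic athletes (626 + 764), while the classification analysis reports 1,522 in total. One likely explanation is that `groupby` excludes rows whose key is NaN by default, so athletes with a NULL gender never appear in the table. The sketch below reproduces the effect on a toy frame and shows one way to surface those rows; the `Unknown` label is a hypothetical placeholder, not a value used by the pipeline.

```python
import pandas as pd

# Toy frame (illustrative values): one Paralympic athlete has no recorded gender
demo = pd.DataFrame({
    "games_type": ["Olympic", "Olympic", "Paralympic", "Paralympic", "Paralympic"],
    "gender": ["Female", "Male", "Female", "Male", None],
})

# Default behaviour: the NULL-gender row is dropped from the counts
default_counts = demo.groupby(["games_type", "gender"]).size().unstack(fill_value=0)

# Label missing values first so they get their own column
labeled = demo.assign(gender=demo["gender"].fillna("Unknown"))
full_counts = labeled.groupby(["games_type", "gender"]).size().unstack(fill_value=0)

print(default_counts.to_string())
print(full_counts.to_string())
```

Labeling before grouping keeps the printed totals reconcilable with `len(all_athletes)` without changing the stored `gender` column.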

In [131]:
# ── Phase 7, Steps 2-3: Classification analysis + similarity test ─
from numpy.linalg import norm

print('=' * 70)
print('PARALYMPIC CLASSIFICATION ANALYSIS')
print('=' * 70)

para = all_athletes[all_athletes['games_type'] == 'Paralympic']
has_class = para['classification_code'].notna() & (~para['classification_code'].isin(['None', 'nan', '']))
print(f'\n  Total Paralympic athletes: {len(para):,}')
print(f'  With classification:      {has_class.sum():,} ({has_class.sum()/len(para)*100:.1f}%)')
print(f'  Unique codes:             {para.loc[has_class, "classification_code"].nunique()}')
print(f'\n  Top 15 classifications:')
print(para.loc[has_class, 'classification_code'].value_counts().head(15).to_string())

# ── Vector similarity smoke test ─────────────────────────────────
print(f'\n{"=" * 70}')
print('VECTOR SIMILARITY SMOKE TEST')
print('=' * 70)

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (norm(a) * norm(b))

# Restrict to athletes that have an embedding
athletes_with_emb = all_athletes[all_athletes['embedding'].notna()].copy()

# Use the first swimmer with more than 3 medals as the query athlete
swimmer = athletes_with_emb[
    (athletes_with_emb['primary_sport'].str.contains('Swim', na=False)) &
    (athletes_with_emb['total_medals'] > 3)
].head(1)

if len(swimmer) > 0:
    query = swimmer.iloc[0]
    query_emb = query['embedding']
    print(f'\nQuery athlete: {query["name"]} ({query["games_type"]}, {query["primary_sport"]})')
    print(f'  Medals: {int(query["total_medals"]) if pd.notna(query["total_medals"]) else 0}, Games: {int(query["games_count"]) if pd.notna(query["games_count"]) else 0}')

    # Compute similarities
    sims = []
    for _, row in athletes_with_emb.iterrows():
        if row['athlete_id'] != query['athlete_id']:
            sim = cosine_similarity(query_emb, row['embedding'])
            sims.append((row['name'], row['games_type'], row['primary_sport'],
                         int(row['total_medals']) if pd.notna(row.get('total_medals')) else 0, sim))

    sims.sort(key=lambda x: x[4], reverse=True)

    print(f'\n  Top 10 most similar athletes:')
    print(f'  {"Name":35s} {"Type":12s} {"Sport":25s} {"Medals":>7s} {"Similarity":>10s}')
    print(f'  {"-"*35} {"-"*12} {"-"*25} {"-"*7} {"-"*10}')
    for name, gt, sport, medals, sim in sims[:10]:
        print(f'  {name:35s} {gt:12s} {str(sport):25s} {medals:>7d} {sim:>10.4f}')

    # Cross-type check: most similar Paralympic athlete to the query swimmer
    # (only a true cross-type comparison when the query athlete is Olympic)
    para_sims = [s for s in sims if s[1] == 'Paralympic']
    if para_sims:
        print(f'\n  Most similar Paralympic athlete:')
        ps = para_sims[0]
        print(f'  {ps[0]} ({ps[2]}, {ps[3]} medals) - similarity: {ps[4]:.4f}')
else:
    print('\n  ⚠️ No swimmer with embedding found for similarity test')

PARALYMPIC CLASSIFICATION ANALYSIS

  Total Paralympic athletes: 1,522
  With classification:      691 (45.4%)
  Unique codes:             154

  Top 15 classifications:
classification_code
B3     41
B1     31
B2     28
C3     19
C1     17
S7     16
C5     16
C4     16
LW2    15
T54    14
T52    13
S9     12
T10    12
T42    12
T53    12

VECTOR SIMILARITY SMOKE TEST

Query athlete: Shirley Babashoff (Olympic, Swimming)
  Medals: 8, Games: 2

  Top 10 most similar athletes:
  Name                                Type         Sport                      Medals Similarity
  ----------------------------------- ------------ ------------------------- ------- ----------
  Jack Babashoff                      Olympic      Swimming                        1     0.8441
  Wendy Weinberg                      Olympic      Swimming                        1     0.7895
  Wendy Boglioli                      Olympic      Swimming                        2     0.7738
  Jill Sterkel                        Oly
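
The smoke test computes one cosine similarity per row inside a Python loop, which is fine for a single query but slow if repeated. As a sketch of an alternative, the same ranking can be obtained with one matrix-vector product after normalizing every row to unit length. The function name `top_k_similar` is hypothetical, and the 2-D vectors below are toy stand-ins for the 3072-dimensional embeddings.

```python
import numpy as np


def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 10):
    """Return (indices, scores) of the k rows most cosine-similar to the query row."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Avoid division by zero: all-zero vectors end up with similarity 0
    unit = embeddings / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query itself
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Toy 2-D vectors standing in for the real embeddings
vecs = np.array([
    [1.0, 0.0],   # query
    [2.0, 0.1],   # nearly parallel to the query
    [0.0, 3.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
])
idx, scores = top_k_similar(vecs, query_idx=0, k=3)
print(idx.tolist())
print(np.round(scores, 4).tolist())
```

In the notebook, the matrix could be built with something like `np.vstack(athletes_with_emb['embedding'].to_numpy())`, assuming every retained embedding is a list of equal length.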

In [133]:
# ── Phase 7, Step 4: Schema freeze + final export ────────────────

ATHLETE_COLUMNS = [
    'athlete_id', 'name', 'gender', 'birth_date',
    'height_cm', 'weight_kg',
    'games_type', 'games_season', 'primary_sport', 'classification_code',
    'first_games_year', 'last_games_year', 'games_count',
    'gold_count', 'silver_count', 'bronze_count', 'total_medals',
    'profile_summary', 'embedding',
]

RESULT_COLUMNS = [
    'athlete_id', 'athlete_name', 'games_year', 'games_season',
    'games_type', 'sport', 'discipline', 'event', 'medal',
    'classification_code',
]

# Apply column order
athletes_final = all_athletes[[c for c in ATHLETE_COLUMNS if c in all_athletes.columns]].copy()
results_final = all_results[[c for c in RESULT_COLUMNS if c in all_results.columns]].copy()

# Check for missing columns
missing_ath = [c for c in ATHLETE_COLUMNS if c not in all_athletes.columns]
missing_res = [c for c in RESULT_COLUMNS if c not in all_results.columns]
if missing_ath:
    print(f'⚠️ Athletes missing columns: {missing_ath}')
if missing_res:
    print(f'⚠️ Results missing columns: {missing_res}')

# Serialize embeddings for CSV export (skipped if the column is absent)
if 'embedding' in athletes_final.columns:
    athletes_final['embedding'] = athletes_final['embedding'].apply(
        lambda x: json.dumps(x) if isinstance(x, list) else None)

# Save locally
ath_final_path = '/tmp/team_usa_athletes_final.csv'
res_final_path = '/tmp/team_usa_results_final.csv'

athletes_final.to_csv(ath_final_path, index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)
results_final.to_csv(res_final_path, index=False, encoding='utf-8')

# Upload to GCS
ath_dest = f'{BUCKET}/final/team_usa_athletes.csv'
res_dest = f'{BUCKET}/final/team_usa_results.csv'

!gcloud storage cp {ath_final_path} {ath_dest}
!gcloud storage cp {res_final_path} {res_dest}

print(f'\n{"=" * 60}')
print(f'FINAL EXPORT')
print(f'{"=" * 60}')
print(f'  Athletes: {len(athletes_final):,} rows × {len(athletes_final.columns)} columns')
print(f'  Results:  {len(results_final):,} rows × {len(results_final.columns)} columns')
print(f'\n  Athletes schema: {list(athletes_final.columns)}')
print(f'  Results schema:  {list(results_final.columns)}')
print(f'\n  ✅ {ath_dest}')
print(f'  ✅ {res_dest}')

ERROR: Cannot check if the destination bucket is compatible for running parallel composite uploads as the user does not permission to perform GET operation on the bucket. The operation will be performed without parallel composite upload feature and hence might perform relatively slower.
Copying file:///tmp/team_usa_athletes_final.csv to gs://class-demo/team-usa/final/team_usa_athletes.csv

Average throughput: 144.0MiB/s
Copying file:///tmp/team_usa_results_final.csv to gs://class-demo/team-usa/final/team_usa_results.csv

FINAL EXPORT
  Athletes: 12,207 rows × 19 columns
  Results:  24,458 rows × 10 columns

  Athletes schema: ['athlete_id', 'name', 'gender', 'birth_date', 'height_cm', 'weight_kg', 'games_type', 'games_season', 'primary_sport', 'classification_code', 'first_games_year', 'last_games_year', 'games_count', 'gold_count', 'silver_count', 'bronze_count', 'total_medals', 'profile_summary', 'embedding']
  Results schema:  ['athlete_id', 'athlete_name', 'games_year'
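
The export cell warns about missing columns but still writes the file. A stricter gate before upload might look like the sketch below, which collects problems rather than printing them. It assumes embeddings have already been serialized to JSON strings as in the cell above; the function name `validate_athletes` and the specific checks are hypothetical, and the toy frame uses 4-element vectors where the real schema expects 3072.

```python
import json

import pandas as pd


def validate_athletes(df: pd.DataFrame, expected_columns: list, embedding_dim: int) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "athlete_id" in df.columns and df["athlete_id"].duplicated().any():
        problems.append("duplicate athlete_id values")
    if "embedding" in df.columns:
        # Embeddings are JSON strings at this point; rows without one are skipped
        dims = df["embedding"].dropna().apply(lambda s: len(json.loads(s)))
        bad = int((dims != embedding_dim).sum())
        if bad:
            problems.append(f"{bad} embeddings with unexpected dimension")
    return problems


# Toy frame with one of each problem (illustrative values only)
demo = pd.DataFrame({
    "athlete_id": ["a1", "a2", "a2"],
    "name": ["A", "B", "C"],
    "embedding": [json.dumps([0.0] * 4), json.dumps([0.0] * 3), None],
})
problems = validate_athletes(
    demo, ["athlete_id", "name", "gender", "embedding"], embedding_dim=4)
for p in problems:
    print(p)
```

Raising an exception when the returned list is non-empty would stop the upload before an incomplete schema reaches the `final/` prefix.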