# Backfill: transform_word_decisions

Part of the historical backfill pipeline.

- One backfill pipeline run per year
- Work in batches of one month
- For each month:
    - Get the filepaths of puzzles for that month
    - Transform each puzzle into `word_decisions` table rows
    - Write the rows to the bronze table
    - Run validation checks and write audit logs before and after each write operation

In [None]:
%run "./00_setup.ipynb"

In [None]:
from pyspark.sql.types import * 
import pyspark.sql.functions as F
from typing import Any

In [None]:
from src.fileutils import get_latest_wordlist, word_file_to_set, get_puzzle_paths
from src.wordutils import get_letter_set_map, transform_puzzle_to_word_decisions_by_path
from src.bronzeutils import bronze_schema, rows_to_bronze_df

In [None]:
wordlist_filename, wordlist_version = get_latest_wordlist()
wordlist = word_file_to_set(wordlist_filename)
letter_set_map = get_letter_set_map(wordlist)

In [None]:
from pyspark.sql import DataFrame


def process_month(year: int, month: int) -> DataFrame:
    """
    Returns a bronze DataFrame of word_decision rows for all puzzles in the given year/month.
    """
    rows = []
    puzzle_paths = get_puzzle_paths(year, month)
    for puzzle_path in sorted(puzzle_paths):
        curr_rows = transform_puzzle_to_word_decisions_by_path(puzzle_path,
                                                               wordlist,
                                                               letter_set_map,
                                                               wordlist_version)
        rows.extend(curr_rows)

    return rows_to_bronze_df(rows, spark)

In [None]:
# TODO: Parameterize _YEAR
_YEAR = 2024
for month in range(1, 13):
    print(f"Processing year {_YEAR}, month {month}...")
    df = process_month(_YEAR, month)
    print(f"{df.count()} rows")

    # TODO: create db if it doesn't exist, write to table, audit & log, etc.
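
A minimal sketch of one way to handle the `_YEAR` TODO in the loop above, assuming the notebook runs on Databricks with `dbutils` available and falls back to a default when run locally. The function name `resolve_backfill_year` and the widget name `year` are hypothetical, not part of the existing helpers.

```python
from typing import Any, Optional


def resolve_backfill_year(default: int, dbutils_obj: Optional[Any] = None) -> int:
    """Return the backfill year from a notebook widget, or `default` for local runs."""
    if dbutils_obj is None:
        # Local run: no Databricks utilities, so use the default year.
        return default

    # Register a text widget named "year" (hypothetical name), then read back
    # whatever value the job or user supplied. int() raises ValueError on
    # non-numeric input, which fails the run early.
    dbutils_obj.widgets.text("year", str(default))
    return int(dbutils_obj.widgets.get("year"))
```

In the notebook this might be called as `_YEAR = resolve_backfill_year(2024, globals().get("dbutils"))`, so the same cell works in both environments.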
    

In [None]:
# ===== TODOS / notes below this line =======

- Create database if it doesn't exist (parameterized db names?)
- Write to table (using replaceWhere, MERGE, something else??)
- one pipeline to backfill, another for daily ingestion
- backfill runs for a year, one month at a time, with verification and audit steps
- backfill gets the paths for a given month (`glob` locally, `dbutils.fs.ls()` in cloud), reads in each puzzle one at a time into in-memory rows, builds a dataframe from those rows, then writes it using `replaceWhere` with Delta
- daily can use the delete + write pattern (or will `replaceWhere` work for this as well??)
- helper methods: `get_puzzle_by_date`, `ingest_puzzle_by_date` (for daily), `get_puzzle_paths`, `get_puzzle_by_path`, `ingest_puzzle_by_path` (for backfill) 
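
The `replaceWhere` option discussed in the notes above could be sketched as follows. This assumes the bronze table has a date column, here given the hypothetical name `puzzle_date`, and that the table name is passed in by the caller. `month_predicate` and `overwrite_month` are illustrative names, not existing helpers.

```python
from datetime import date
from typing import Any


def month_predicate(year: int, month: int, date_col: str = "puzzle_date") -> str:
    """Build a half-open date-range predicate covering one calendar month."""
    start = date(year, month, 1)
    # Roll over to January of the next year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"{date_col} >= '{start.isoformat()}' "
        f"AND {date_col} < '{end.isoformat()}'"
    )


def overwrite_month(df: Any, table: str, year: int, month: int) -> None:
    """Overwrite only the rows for one month in a Delta table.

    `df` is expected to be a Spark DataFrame holding just that month's rows.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", month_predicate(year, month))
        .saveAsTable(table)
    )
```

Because the write replaces everything matching the predicate, rerunning a month should leave the table in the same state rather than duplicating rows. The same pattern would likely cover the daily case with a single-day predicate, though that is untested here.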

- Backfill script validates as it goes, uses `replaceWhere` with Delta, and runs for a given year only, one chunk at a time
- Daily ingest script that writes one file for a specific day/month/year
- Repurpose helper methods to write to table, create db if it doesn't exist ... again with local and dbx code paths??
- Try to do all writes at once or find a batch size
- Need a way to redo the run, 1 write per puzzle date? Is that efficient??
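
One possible shape for the "validate as it goes" and rerun questions above is a small wrapper that performs one write per month and records an audit entry afterwards. This is a sketch under assumptions: `MonthAudit` and `write_month_with_audit` are hypothetical names, and the write and count steps are injected as callables so the logic does not depend on a particular table layout.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MonthAudit:
    """Audit record for one monthly backfill write."""
    year: int
    month: int
    rows_expected: int
    rows_after: int
    status: str


def write_month_with_audit(
    year: int,
    month: int,
    rows_expected: int,
    write_fn: Callable[[], None],
    count_fn: Callable[[], int],
) -> MonthAudit:
    """Run one monthly write, then compare the table's row count to the expectation."""
    # write_fn is assumed to replace the whole month, so a rerun is safe.
    write_fn()
    # count_fn should count only the rows belonging to this month.
    rows_after = count_fn()
    status = "ok" if rows_after == rows_expected else "mismatch"
    return MonthAudit(year, month, rows_expected, rows_after, status)
```

In the backfill loop, `rows_expected` could come from `df.count()`, `write_fn` from the monthly overwrite, and `count_fn` from a filtered count on the bronze table. One write per month keeps the number of Delta commits small compared with one write per puzzle date, at the cost of redoing a full month on any rerun.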