# 02_transform_word_decisions

Part of the single-file daily ingestion pipeline:
- Read in the puzzle file for the parameterized date
- Find all possible words
- Extract the explicit and implicit decisions about each word
- Write to the bronze table


In [None]:
%run "./00_setup.ipynb"

In [None]:
import pyspark.sql.functions as F

In [None]:
# from src.constants import DATE_FORMAT, RAW_SOLUTIONS_PATH, LOCAL_DATA_LAKE_PREFIX
from src.fileutils import get_latest_wordlist, word_file_to_set
from src.wordutils import get_letter_set_map, transform_puzzle_to_word_decisions_by_date
from src.bronzeutils import bronze_schema, rows_to_bronze_df

In [None]:
# TODO: Parameterize _PUZZLE_DATE
_PUZZLE_DATE = "2024-05-03"

In [None]:
wordlist_filename, wordlist_version = get_latest_wordlist()
wordlist = word_file_to_set(wordlist_filename)
letter_set_map = get_letter_set_map(wordlist)

In [None]:
rows = transform_puzzle_to_word_decisions_by_date(
    _PUZZLE_DATE,
    wordlist,
    letter_set_map,
    wordlist_version,
)

In [None]:
df = rows_to_bronze_df(rows, spark)

In [None]:
print(df.count())
df.printSchema()

In [None]:
# ===== TODOS / notes below this line =======

- Create the database if it doesn't exist (parameterized db names?)
- Write to the table (using `replaceWhere`, `MERGE`, or something else?)
- One pipeline for backfill, another for daily ingestion
- Backfill runs over a year, one month at a time, with verification and audit steps
- Backfill gets the paths for a given month (`glob` locally, `dbutils.fs.ls()` in the cloud), reads each puzzle one at a time into in-memory rows, builds a dataframe from those rows, then writes it using `replaceWhere` with Delta
- Daily can use the delete + write pattern (or will `replaceWhere` work for this as well?)
- Helper methods: `get_puzzle_by_date`, `ingest_puzzle_by_date` (for daily), `get_puzzle_paths`, `get_puzzle_by_path`, `ingest_puzzle_by_path` (for backfill)
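
A rough sketch of the local code path for the backfill path lookup, to make the month-at-a-time idea concrete. The file layout (`YYYY-MM-DD.json` directly under a root folder) and the `month_starts` helper are assumptions for illustration, not the real layout or an existing helper; the cloud path would swap `glob` for `dbutils.fs.ls()`.

```python
import glob
import os
from datetime import date


def month_starts(year: int) -> list[date]:
    """First day of each month in `year`, in order: one entry per backfill chunk."""
    return [date(year, month, 1) for month in range(1, 13)]


def get_puzzle_paths(root: str, year: int, month: int) -> list[str]:
    """Local code path: list the puzzle files for one month, sorted by name.

    Assumes files are named YYYY-MM-DD.json directly under `root`
    (a hypothetical layout; adjust the pattern to the real one).
    """
    pattern = os.path.join(root, f"{year:04d}-{month:02d}-*.json")
    return sorted(glob.glob(pattern))
```

A backfill loop could then iterate over `month_starts(year)`, call `get_puzzle_paths` for each month, and collect rows before doing a single write per month.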

- Backfill script validates as it goes and uses `replaceWhere` with Delta; each run covers a single year, one chunk at a time
- Daily ingest script writes one file for a specific day/month/year
- Repurpose helper methods to write to the table and create the db if it doesn't exist ... again with local and dbx code paths?
- Try to do all writes at once, or find a batch size
- Need a way to redo a run: one write per puzzle date? Is that efficient?
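
One possible shape for a rerunnable write, sketched under assumptions: the bronze table has a date column (called `puzzle_date` here, which is a guess) and is stored as Delta. The `replaceWhere` writer option overwrites only rows matching the predicate, so rerunning the same range should replace rather than duplicate. The helper names below are hypothetical.

```python
from datetime import date


def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Return [start, end) for a calendar month; end is the next month's first day."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def replace_where_predicate(start: date, end: date, column: str = "puzzle_date") -> str:
    """Build a half-open date-range predicate for Delta's replaceWhere option.

    `column` is an assumed name for the date column in the bronze table.
    """
    return f"{column} >= '{start.isoformat()}' AND {column} < '{end.isoformat()}'"


def overwrite_date_range(df, table_name: str, start: date, end: date) -> None:
    """Overwrite only the rows in [start, end) of a Delta table.

    `df` is a Spark DataFrame whose rows should all fall inside the range.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", replace_where_predicate(start, end))
        .saveAsTable(table_name)
    )
```

With this shape, the backfill would call `overwrite_date_range` once per month, and the daily run could in principle pass a one-day range to the same function. That would give one code path for both instead of a separate delete + write, though this is untested here.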