# bootstrap_puzzles_01_insert_word_decisions

Processes past puzzles in raw storage. For each puzzle, writes a `word_decision` record for each word in the puzzle's answers, and for each word in the `bronze.words` table that _could_ have been an official answer (it can be formed from the game's letters) but was implicitly rejected.

- One backfill pipeline run per year
- Get words from bronze.words, transform into letter_set_map
- Work in batches of one month
- For each month:
    - Get the filepaths of puzzles for that month
    - Transform each puzzle into `word_decisions` table rows, using the letter_set_map
    - Write batch of rows to the bronze table
    - TODO: perform validation checks and audit logs before and after each write op
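
The lookup idea behind the steps above can be sketched as follows. This is a minimal illustration, not the implementation in `src.wordutils`: it assumes the map is keyed by each word's sorted distinct letters, and the names `build_letter_set_map`, `formable_words`, and `implicitly_rejected` are hypothetical. Game-specific rules (such as minimum word length) are not modeled.

```python
from collections import defaultdict


def build_letter_set_map(words: list[str]) -> dict[str, list[str]]:
    """Group words by their sorted distinct letters (assumed key format), e.g. 'noon' -> 'no'."""
    letter_set_map: dict[str, list[str]] = defaultdict(list)
    for word in words:
        key = "".join(sorted(set(word)))
        letter_set_map[key].append(word)
    return dict(letter_set_map)


def formable_words(letter_set_map: dict[str, list[str]], puzzle_letters: str) -> list[str]:
    """Return every known word whose letters all appear among the puzzle's letters."""
    allowed = set(puzzle_letters)
    found: list[str] = []
    for key, words in letter_set_map.items():
        if set(key) <= allowed:
            found.extend(words)
    return sorted(found)


def implicitly_rejected(candidates: list[str], answers: set[str]) -> list[str]:
    """Words that could have been formed but are not among the official answers."""
    return [word for word in candidates if word not in answers]


# Hypothetical data, for illustration only
example_map = build_letter_set_map(["noon", "onion", "ion", "zoo"])
example_candidates = formable_words(example_map, "nio")
example_rejected = implicitly_rejected(example_candidates, {"onion"})
```

Grouping by letter set means each puzzle needs one subset check per distinct letter set rather than one per word, which is presumably why the map is built once up front and reused across all months.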

## NOTE:
- Bootstrap for all puzzles up to and including 2025-06-23
- Do not include puzzles beyond this date 
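
One way the cutoff could be enforced in code is sketched below. This is only an illustration under assumptions: the notebook does not show how (or whether) the cutoff is applied, and `CUTOFF`, `months_for_year`, and `is_within_cutoff` are hypothetical names that do not exist in `src`.

```python
from datetime import date

# Cutoff taken from the note above; later puzzles are excluded.
CUTOFF = date(2025, 6, 23)


def months_for_year(year: int, cutoff: date = CUTOFF) -> list[int]:
    """Months of `year` that contain at least one day on or before the cutoff."""
    if year < cutoff.year:
        return list(range(1, 13))
    if year == cutoff.year:
        return list(range(1, cutoff.month + 1))
    return []


def is_within_cutoff(puzzle_date: date, cutoff: date = CUTOFF) -> bool:
    """True if the puzzle date is on or before the cutoff (inclusive)."""
    return puzzle_date <= cutoff
```

With helpers like these, a run for 2025 would iterate over months 1 through 6 only, and puzzles dated after the 23rd in June would still need to be filtered individually.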

## ⚠️ Not working locally? ⚠️

To run this notebook locally, edit the first code cell:

Change:  
`%run "./00_setup"`  
To:  
`%run "./00_setup.ipynb"`

👉 _Please **do not commit** this change — it's only for local execution._

In [0]:
%run "./00_setup"

In [0]:
from src.bronzeutils import rows_to_word_decisions_df, WORD_DECISIONS_PARTITIONS
from src.fileutils import get_puzzle_paths
from src.sparkdbutils import create_db, write_to_table_replace_where
from src.wordutils import get_letter_set_map, transform_puzzle_to_word_decisions_by_path

In [0]:
from typing import Any

In [0]:
# TODO: Parameterize _YEAR, _TARGET_DB_NAME, _TARGET_TABLE_NAME
_YEAR = 2023
_SOURCE_DB_NAME = "bronze"
_SOURCE_TABLE_NAME = "words"
_TARGET_DB_NAME = "bronze"
_TARGET_TABLE_NAME = "word_decisions"

In [0]:
# Get all words and convert to letter_set_map
words_df = spark.sql(f"SELECT word FROM {_SOURCE_DB_NAME}.{_SOURCE_TABLE_NAME}")
# The query already selects only the `word` column, so collect it directly
words_list = sorted(row.word for row in words_df.collect())
letter_set_map = get_letter_set_map(words_list)

In [0]:
from pyspark.sql import DataFrame


def process_month(year: int, month: int, letter_set_map: dict[str, list[Any]]) -> DataFrame:
    """
    Returns a DataFrame of word_decision rows for all puzzles in the given year/month
    """
    rows = []
    puzzle_paths = get_puzzle_paths(year, month)
    for puzzle_path in sorted(puzzle_paths):
        curr_rows = transform_puzzle_to_word_decisions_by_path(puzzle_path, letter_set_map)
        rows.extend(curr_rows)

    return rows_to_word_decisions_df(rows, spark)

In [0]:
# Create db if not done already
create_db(spark, _TARGET_DB_NAME)

In [0]:
total_rows = 0

for month in range(1, 13):
    print(f"Processing year {_YEAR}, month {month}...")
    df = process_month(_YEAR, month, letter_set_map)
    
    curr_count = df.count()
    total_rows += curr_count
    print(f"Writing {curr_count} rows to {_TARGET_DB_NAME}.{_TARGET_TABLE_NAME}")
    replace_where_dict = {
        "year": _YEAR,
        "month": month,
    }
    # Overwrite only the rows matching this year/month
    write_to_table_replace_where(
        spark,
        df,
        _TARGET_DB_NAME,
        _TARGET_TABLE_NAME,
        replace_where_dict,
        WORD_DECISIONS_PARTITIONS,
    )

    # TODO: validation, audit log, etc.
print(f"{total_rows} rows written in total")

In [0]:
# Uncomment for debugging / validation
# df2 = spark.sql(f"SELECT * FROM {_TARGET_DB_NAME}.{_TARGET_TABLE_NAME}")
# print(f"{df2.count()} total rows in table")
# df2.show(10, False)

In [0]:
# Uncomment for debugging / validation
# df2.select(["year", "month"]).distinct().sort(["year", "month"]).show(50, False)