# 02_transform_word_decisions

Part of the single-file daily ingestion pipeline:
- Read in the puzzle file for the parameterized date
- Find all possible words
- Extract the explicit and implicit decisions about each word
- Write the results to the bronze table

## ⚠️ Not working locally? ⚠️

To run this notebook locally, edit the first code cell:

Change:  
`%run "./00_setup"`  
To:  
`%run "./00_setup.ipynb"`

👉 _Please **do not commit** this change — it's only for local execution._


In [0]:
%run "./00_setup"


In [0]:
from src.sparkdbutils import get_or_create_db, write_to_table
from src.fileutils import get_latest_wordlist, word_file_to_set
from src.wordutils import get_letter_set_map, transform_puzzle_to_word_decisions_by_date
from src.bronzeutils import rows_to_word_decisions_df, WORD_DECISIONS_PARTITIONS

In [0]:
# TODO: Parameterize _PUZZLE_DATE
# Placeholder: replace with a real date in YYYY-MM-DD format before running.
_PUZZLE_DATE = "YYYY-MM-DD"
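
Until `_PUZZLE_DATE` is parameterized, the placeholder value will fail later when the date is split into integers. A minimal sketch of an early validation step is shown below. `parse_puzzle_date` is a hypothetical helper, not part of `src`; on Databricks the raw string might come from a notebook widget instead of a literal.

```python
from datetime import date, datetime


def parse_puzzle_date(value: str) -> date:
    """Return `value` as a date, or raise ValueError if it is not YYYY-MM-DD.

    Hypothetical helper for illustration; it is not part of the src package.
    """
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError as exc:
        raise ValueError(
            f"puzzle date must look like YYYY-MM-DD, got {value!r}"
        ) from exc


# Example: a well-formed date parses cleanly.
puzzle_date = parse_puzzle_date("2024-01-15")
print(puzzle_date.isoformat())
```

Failing here, before the word list is loaded, gives a clearer error than an `int("YYYY")` failure further down the notebook.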

In [0]:
wordlist_filename, wordlist_version = get_latest_wordlist()
wordlist = word_file_to_set(wordlist_filename)
letter_set_map = get_letter_set_map(wordlist)

In [0]:
rows = transform_puzzle_to_word_decisions_by_date(_PUZZLE_DATE, 
                                                  wordlist, 
                                                  letter_set_map, 
                                                  wordlist_version)

In [0]:
df = rows_to_word_decisions_df(rows, spark)

In [0]:
print(df.count())
df.printSchema()

In [0]:
df.show(10, truncate=False)

In [0]:
# TODO: Pipeline parameter for db name, table name, puzzle_date, etc.
# TODO: Do not set this as a variable here
_TARGET_DB_NAME = "bronze"
get_or_create_db(spark, _TARGET_DB_NAME)

In [0]:
# TODO: Pipeline parameter for table name
_TABLE_NAME = "word_decisions"

# TODO: Extract to a helper function
year, month, day = _PUZZLE_DATE.split("-")
replace_where_dict = {
    "year": int(year),
    "month": int(month),
    "day": int(day)
}
# Partition columns come from WORD_DECISIONS_PARTITIONS (imported above).

write_to_table(spark, 
               df, 
               _TARGET_DB_NAME, 
               _TABLE_NAME, 
               replace_where_dict, 
               WORD_DECISIONS_PARTITIONS)
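
The `# TODO: Extract to a helper function` note in the cell above could be addressed with something like the sketch below. `replace_where_for_date` is a hypothetical name, and the returned dictionary simply mirrors the `replace_where_dict` built inline; how `write_to_table` consumes it is not shown here.

```python
from typing import Dict


def replace_where_for_date(puzzle_date: str) -> Dict[str, int]:
    """Split a YYYY-MM-DD string into the year/month/day mapping.

    Hypothetical helper: it reproduces the inline dictionary construction
    and raises ValueError if the string does not have three integer parts.
    """
    year, month, day = puzzle_date.split("-")
    return {"year": int(year), "month": int(month), "day": int(day)}


# Example with an illustrative date.
print(replace_where_for_date("2024-01-05"))
```

With a helper like this, the write cell reduces to computing the mapping from `_PUZZLE_DATE` and passing it to `write_to_table`.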

In [0]:
# Read the table back to confirm the write, reusing the names set above.
df2 = spark.sql(f"SELECT * FROM {_TARGET_DB_NAME}.{_TABLE_NAME}")
print(df2.count())
df2.show(10, truncate=False)

In [0]:
df2.select(["year", "month"]).distinct().sort(["year", "month"]).show()