# bootstrap_words_03_word_states

Creates `silver.word_states` table with columns:
- `word`
- `letter_set`
- `frequency`
- `embedding`
- `last_seen_on` (date type, nullable)
- `label` (`0.0` or `1.0`, nullable)
- `batch_id` (string, `"bootstrap_{stage}_{num}"` or `str(puzzle_date)`)
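
The two `batch_id` forms above can be sketched as a pair of small helpers. This is only an illustration: `bootstrap_batch_id` and `puzzle_batch_id` are hypothetical names that do not exist in this repo, and the source does not specify whether `num` is zero-padded, so the sketch formats it as given.

```python
from datetime import date


def bootstrap_batch_id(stage: str, num: int) -> str:
    """Hypothetical helper: batch id for a bootstrap load, e.g. stage="words"."""
    # Padding of `num` is an assumption; it is formatted as passed in.
    return f"bootstrap_{stage}_{num}"


def puzzle_batch_id(puzzle_date: date) -> str:
    """Hypothetical helper: batch id for a daily puzzle load (ISO date string)."""
    return str(puzzle_date)


bootstrap_id = bootstrap_batch_id("words", 1)
daily_id = puzzle_batch_id(date(2024, 1, 5))
```

Both forms are plain strings, so bootstrap and daily batches can share the same `batch_id` column.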

Steps in the process:
- reduce `spark.sql.parquet.columnarReaderBatchSize` to 1024 (Spark's default is 4096) so the vectorized Parquet reader uses less memory
    - TODO: find out whether this should happen only locally or on Databricks too
    - if local only, keep the config change inside the `if not is_databricks_env():` block
- read in `bronze.words` Delta table
- rename `date_added` -> `last_seen_on` (should all be null for bootstrap)
- label column = null (float type)
- batch_id col = `"bootstrap_words_1"`
- drop version column
- save as Delta table `silver.word_states`
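
To make the column changes concrete, here is a plain-Python sketch of their effect on a single row. It is not the Spark code the notebook runs: `to_word_state` is a hypothetical function, the row is an ordinary dict, and the sample values (including the types of `letter_set`, `frequency` and `embedding`) are made up for illustration.

```python
from typing import Any, Dict


def to_word_state(bronze_row: Dict[str, Any], batch_id: str) -> Dict[str, Any]:
    """Apply the bronze -> silver column changes to one row held as a dict."""
    row = dict(bronze_row)                       # copy, leave the input untouched
    row["last_seen_on"] = row.pop("date_added")  # rename date_added -> last_seen_on
    row["label"] = None                          # label is unknown at bootstrap time
    row["batch_id"] = batch_id                   # tag the row with its batch
    row.pop("version", None)                     # drop the version column
    return row


# Made-up sample row; the real column types live in bronze.words.
sample = {
    "word": "example",
    "letter_set": "aelmpx",
    "frequency": 0.5,
    "embedding": [0.1, 0.2],
    "date_added": None,
    "version": 1,
}
state = to_word_state(sample, "bootstrap_words_1")
```

The resulting keys match the column list at the top of this notebook, with `last_seen_on` and `label` both null for every bootstrap row.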

In [None]:
%run './00_setup.ipynb'

In [None]:
import pyspark.sql.functions as F

from src.envutils import is_databricks_env
from src.sparkdbutils import create_db, create_repartitioned_table

In [None]:
# Config for this notebook, possibly local only
if not is_databricks_env():
    print("updating spark config for this notebook")
    spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "1024")

In [None]:
# TODO: Should be pipeline parameters
_SOURCE_DB_NAME = "bronze"
_SOURCE_TABLE_NAME = "words"
_TARGET_DB_NAME = "silver"
_TARGET_TABLE_NAME = "word_states"

In [None]:
# Read in bronze.words table
df = spark.table(f"{_SOURCE_DB_NAME}.{_SOURCE_TABLE_NAME}")

In [None]:
# Rename date_added -> last_seen_on
df = df.withColumnRenamed("date_added", "last_seen_on")

In [None]:
# Add label column (float; will hold 0.0 or 1.0, all null for the bootstrap batch)
df = df.withColumn("label", F.lit(None).cast("float"))

In [None]:
# Add batch id ("bootstrap_words_1" for this batch)
df = df.withColumn("batch_id", F.lit("bootstrap_words_1"))

In [None]:
# Drop version column
df = df.drop("version")

In [None]:
# Save to target db.table
create_db(spark, _TARGET_DB_NAME)
create_repartitioned_table(spark, df, _TARGET_TABLE_NAME, _TARGET_DB_NAME, 10)