v0.2.10 — Direct CSV ingestion via Polars streaming

@kayhendriksen kayhendriksen released this 31 Mar 18:44
· 97 commits to main since this release

Ingest pipeline rewrite

The Databricks ingest step no longer requires a Parquet intermediate layer.
Raw CSVs are now read directly from the Unity Catalog Volume with Polars
scan_csv, collected with engine="streaming", and then written to Delta
tables via Arrow → Spark.

  • Large historical collections (SMN, SMN Precip, SMN Tower) use chunked
    writes to keep peak memory bounded (configurable via --chunk-size)
  • Ingest falls back to the eager parse_csv_bytes path if the streaming
    collect fails because of mixed-type columns
  • Column comments are applied automatically from the English
    descriptions in _meta_parameters.csv
  • Local spark-submit is now supported: Unity Catalog DDL is skipped
    when DATABRICKS_RUNTIME_VERSION is not set
  • Databricks job tasks updated: download now runs with --no-parquet,
    ingest receives --historical for the historical job
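
Two of the behaviours listed above can be sketched with the standard library alone: detecting whether the code runs on Databricks, and splitting work into fixed-size chunks. Both helpers are hypothetical and only mirror the description in these notes. The actual ingest code may chunk by file, by row count, or by something else, and what --chunk-size counts is an assumption here.

```python
import os
from typing import Iterator, Mapping, Optional, Sequence, TypeVar

T = TypeVar("T")


def on_databricks(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when DATABRICKS_RUNTIME_VERSION is set.

    Assumption: Unity Catalog DDL would be skipped when this returns False.
    """
    env = os.environ if env is None else env
    return "DATABRICKS_RUNTIME_VERSION" in env


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunk_size items.

    Writing one slice at a time keeps only that slice in memory.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]
```

For example, `list(iter_chunks(["a.csv", "b.csv", "c.csv"], 2))` gives two chunks, the first with two file names and the second with one.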