# Demonstration: `split_by_date_formats`

This notebook shows how to parse date strings that arrive in several formats
using `spark_fuse.utils.transformations.split_by_date_formats`.


## Prerequisites

* A working PySpark environment (Spark 3.4+ recommended).
* The `spark-fuse` package available on the Python path.


In [6]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("SplitByDateFormatsDemo")
    .master("local[*]")
    .getOrCreate()
)

spark

## Sample data

We will work with a small dataset where the `raw` column contains dates in more than one
format, plus an invalid entry that needs special handling.


In [7]:
from spark_fuse.utils.transformations import split_by_date_formats

data = [
    {"id": 1, "raw": "2023-03-05"},
    {"id": 2, "raw": "05/06/2023"},
    {"id": 3, "raw": "00.00.0000"},  # invalid: day and month 00 are not a real date
]

df = spark.createDataFrame(data)
df.show(truncate=False)

+---+----------+
|id |raw       |
+---+----------+
|1  |2023-03-05|
|2  |05/06/2023|
|3  |00.00.0000|
+---+----------+



## Parsing with multiple formats

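We can provide a list of Spark-compatible date patterns. Rows that match any of the
patterns are parsed into a new `raw_date` column. Rows that match none of them get
`NULL` in `raw_date` and, with `return_unmatched=True`, are also returned in a second
DataFrame so they can be addressed separately.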


In [8]:
parsed_df, unmatched_df = split_by_date_formats(
    df,
    column="raw",
    formats=["yyyy-MM-dd", "MM/dd/yyyy", "dd.MM.yyyy"],
    return_unmatched=True,
)

print("Parsed rows:")
parsed_df.orderBy("id").show(truncate=False)

print("Unmatched rows:")
unmatched_df.show(truncate=False)

Parsed rows:

+---+----------+----------+
|id |raw       |raw_date  |
+---+----------+----------+
|1  |2023-03-05|2023-03-05|
|2  |05/06/2023|2023-05-06|
|3  |00.00.0000|NULL      |
+---+----------+----------+

Unmatched rows:
+---+----------+--------+
|id |raw       |raw_date|
+---+----------+--------+
|3  |00.00.0000|NULL    |
+---+----------+--------+
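
To make the matching logic concrete, here is a minimal plain-Python sketch of the same idea: try each pattern in order and keep the first one that parses. It is not the `spark-fuse` implementation. The helper name `parse_with_formats` is hypothetical, the `strptime` patterns are hand-translated stand-ins for the Spark patterns above, and first-match-wins ordering is an assumption that the notebook does not confirm.

```python
from datetime import date, datetime
from typing import Optional, Sequence


def parse_with_formats(value: str, formats: Sequence[str]) -> Optional[date]:
    """Return the date from the first format that parses `value`, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue  # this pattern did not match; try the next one
    return None


# Hand-translated strptime equivalents of yyyy-MM-dd, MM/dd/yyyy, dd.MM.yyyy
PY_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

rows = [(1, "2023-03-05"), (2, "05/06/2023"), (3, "00.00.0000")]

# Every row is kept; rows that match no pattern carry None as the parsed value
parsed = [(row_id, raw, parse_with_formats(raw, PY_FORMATS)) for row_id, raw in rows]

# The "unmatched" view is just the subset whose parsed value is None
unmatched = [row for row in parsed if row[2] is None]
```

In this sketch the order of the patterns matters for ambiguous inputs: `05/06/2023` becomes 6 May because `%m/%d/%Y` is the only slash-separated pattern in the list, while a day-first pattern listed earlier would read it as 5 June.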



## Applying a default value

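When invalid values should fall back to a known date, set `handle_errors="default"`
and provide `default_value`. With `return_unmatched=True` the unmatched rows are still
returned separately, and their `raw_date` stays `NULL` there, so the fallback does not
hide which inputs failed to parse.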


In [9]:
default_df, default_unmatched = split_by_date_formats(
    df,
    column="raw",
    formats=["yyyy-MM-dd", "MM/dd/yyyy", "dd.MM.yyyy"],
    handle_errors="default",
    default_value="1900-01-01",
    return_unmatched=True,
)

print("Rows with defaults applied:")
default_df.orderBy("id").show(truncate=False)

print("Unmatched rows (still available for auditing):")
default_unmatched.show(truncate=False)

Rows with defaults applied:

+---+----------+----------+
|id |raw       |raw_date  |
+---+----------+----------+
|1  |2023-03-05|2023-03-05|
|2  |05/06/2023|2023-05-06|
|3  |00.00.0000|1900-01-01|
+---+----------+----------+

Unmatched rows (still available for auditing):
+---+----------+--------+
|id |raw       |raw_date|
+---+----------+--------+
|3  |00.00.0000|NULL    |
+---+----------+--------+
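
The output above suggests that the fallback and the audit trail are tracked independently. A minimal plain-Python sketch of that pattern follows. It is an illustration under assumptions, not the library's code: `parse_or_default` and the `matched` flag are hypothetical names, and the `strptime` patterns are hand-translated from the Spark patterns.

```python
from datetime import date, datetime
from typing import Sequence, Tuple


def parse_or_default(
    value: str, formats: Sequence[str], default: date
) -> Tuple[date, bool]:
    """Return (parsed_date, True) on a match, or (default, False) otherwise."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date(), True
        except ValueError:
            pass  # try the next pattern
    return default, False


PY_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]
DEFAULT_DATE = date(1900, 1, 1)

rows = [(1, "2023-03-05"), (2, "05/06/2023"), (3, "00.00.0000")]

with_defaults = []  # every row, with the fallback applied where needed
audit = []          # raw inputs that did not match any pattern
for row_id, raw in rows:
    parsed, matched = parse_or_default(raw, PY_FORMATS, DEFAULT_DATE)
    with_defaults.append((row_id, raw, parsed))
    if not matched:
        audit.append((row_id, raw))
```

Keeping the match flag separate from the parsed value means a genuine `1900-01-01` in the input would not be mistaken for a failed parse.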



## Clean up

Stop the Spark session if you no longer need it.


In [10]:
spark.stop()