# Background
The Delta Lake [`replaceWhere`](https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/) option lets you overwrite only the rows that match a predicate, for example a single partition, instead of rewriting the whole table. On large tables this can be much faster than a full overwrite. This notebook briefly illustrates how to use the `replaceWhere` option. For more details, see:
- [Selectively updating Delta partitions with replaceWhere](https://mungingdata.com/delta-lake/updating-partitions-with-replacewhere/) (this notebook follows the example from this blog post)
- [Selectively overwrite data with Delta Lake](https://docs.databricks.com/delta/selective-overwrite.html)
- [Table batch reads and writes: overwrite](https://docs.delta.io/latest/delta-batch.html#overwrite)

In [1]:
import pyspark
from delta import *
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

io.delta#delta-core_2.12 added as a dependency
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central

## Simple replaceWhere example

In [2]:
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3), ("d", 4)]).toDF(
    "letter", "number"
)

In [4]:
df.write.format("delta").save("tmp/my_data")


In [5]:
spark.read.format("delta").load("tmp/my_data").orderBy(col("number").asc()).show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
|     c|     3|
|     d|     4|
+------+------+



In [6]:
df2 = spark.createDataFrame(
    [
        ("x", 7),
        ("y", 8),
        ("z", 9),
    ]
).toDF("letter", "number")

In [7]:
df2.show()

+------+------+
|letter|number|
+------+------+
|     x|     7|
|     y|     8|
|     z|     9|
+------+------+



In [8]:
(
    df2.write.format("delta")
    .option("replaceWhere", "number > 2")
    .mode("overwrite")
    .save("tmp/my_data")
)

                                                                                

In [9]:
spark.read.format("delta").load("tmp/my_data").orderBy(col("number").asc()).show()

+------+------+
|letter|number|
+------+------+
|     a|     1|
|     b|     2|
|     x|     7|
|     y|     8|
|     z|     9|
+------+------+
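
The effect of the write above can be sketched in plain Python. This is a toy model, not Delta Lake's implementation: the function name `replace_where` and the list-of-tuples "table" are assumptions made for illustration. It drops every existing row that matches the predicate, appends the new rows, and rejects new rows that fall outside the predicate, which, to our understanding of the Delta Lake docs, mirrors the check applied to `replaceWhere` writes by default.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]


def replace_where(
    table: List[Row], new_rows: List[Row], predicate: Callable[[Row], bool]
) -> List[Row]:
    """Toy model of a replaceWhere overwrite on an in-memory table (hypothetical helper)."""
    # New data must satisfy the predicate, otherwise the write is rejected.
    bad = [row for row in new_rows if not predicate(row)]
    if bad:
        raise ValueError(f"rows outside the replaceWhere predicate: {bad}")
    # Keep the rows the predicate does not match, then add the replacement rows.
    kept = [row for row in table if not predicate(row)]
    return kept + new_rows


table = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
new_rows = [("x", 7), ("y", 8), ("z", 9)]
result = replace_where(table, new_rows, lambda row: row[1] > 2)
print(sorted(result, key=lambda row: row[1]))
```

Rows `c` and `d` match `number > 2` and are removed; `a` and `b` survive untouched, as in the table shown above.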



## Simple replaceWhere example with partitions

In [8]:
df = spark.createDataFrame(
    [
        ("aa", 11),
        ("bb", 22),
        ("aa", 33),
        ("cc", 33),
    ]
).toDF("patient_id", "medical_code")

In [19]:
df.write.format("delta").partitionBy("medical_code").save("tmp/patients")

In [20]:
!tree tmp/patients

tmp/patients
├── _delta_log
│   └── 00000000000000000000.json
├── medical_code=11
│   └── part-00002-49a164ed-7590-4d4c-8216-bc1a6947ff3b.c000.snappy.parquet
├── medical_code=22
│   └── part-00004-8364a37a-f5d8-4cfa-8daa-065b5760bedd.c000.snappy.parquet
└── medical_code=33
    ├── part-00007-522512ed-d6ad-4c3f-996d-5a737b12030b.c000.snappy.parquet
    └── part-00009-d708e56b-0d87-4545-b3b7-9fc4d3053560.c000.snappy.parquet

4 directories, 5 files
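
The directory names in the listing above follow the `column=value` convention. As a rough sketch in plain Python (the helper `partition_dirs` is hypothetical and only imitates the grouping, it does not write any files), the rows map to directories like this:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

rows: List[Tuple[str, int]] = [("aa", 11), ("bb", 22), ("aa", 33), ("cc", 33)]


def partition_dirs(rows: List[Tuple[str, int]], column: str) -> Dict[str, List[str]]:
    """Group patient_id values under `column=value` directory names (illustration only)."""
    layout = defaultdict(list)
    for patient_id, medical_code in rows:
        # The partition value becomes part of the directory name;
        # the remaining column is what ends up in the data file.
        layout[f"{column}={medical_code}"].append(patient_id)
    return dict(layout)


layout = partition_dirs(rows, "medical_code")
for directory, patients in sorted(layout.items()):
    print(directory, patients)
```

The number of Parquet files inside each directory is a separate matter: it depends on how Spark distributed the rows when writing, which is presumably why `medical_code=33` holds two files in the listing.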


In [24]:
(
    spark.read.format("delta")
    .load("tmp/patients")
    .orderBy(col("medical_code").asc())
    .show()
)

+----------+------------+
|patient_id|medical_code|
+----------+------------+
|        aa|          11|
|        bb|          22|
|        aa|          33|
|        cc|          33|
+----------+------------+



In [30]:
df2 = spark.createDataFrame(
    [
        ("dd", 33),
        ("f", 33),
    ]
).toDF("patient_id", "medical_code")

In [31]:
(
    df2.write.format("delta")
    .option("replaceWhere", "medical_code = 33")
    .mode("overwrite")
    .partitionBy("medical_code")
    .save("tmp/patients")
)

In [32]:
(
    spark.read.format("delta")
    .load("tmp/patients")
    .orderBy(col("medical_code").asc())
    .show()
)

+----------+------------+
|patient_id|medical_code|
+----------+------------+
|        aa|          11|
|        bb|          22|
|        dd|          33|
|         f|          33|
+----------+------------+



## More complicated example

### Load some data

In [10]:
df = spark.read.options(header="True", charset="UTF8").csv(
    "../../data/people_countries.csv"
)

df.show()

+----------+---------+---------+---------+
|first_name|last_name|  country|continent|
+----------+---------+---------+---------+
|   Ernesto|  Guevara|Argentina|     null|
|     Bruce|      Lee|    China|     null|
|      Jack|       Ma|    China|     null|
|  Wolfgang|   Manche|  Germany|     null|
|    Soraya|     Jala|  Germany|     null|
+----------+---------+---------+---------+



### Partition on Country
Now we'll repartition the DataFrame on `country` and write it to disk in the Delta Lake format, partitioned by `country`.

In [12]:
from pyspark.sql.functions import col

deltaPath = "../../data/people_countries_delta/"

(
    df.repartition(col("country"))
    .write.partitionBy("country")
    .format("delta")
    .mode("overwrite")
    .save(deltaPath)
)

                                                                                

Now we write a function to add `continent` values to a DataFrame based on the value of `country`.

In [13]:
from pyspark.sql.functions import col, when


def withContinent(df):
    return df.withColumn(
        "continent",
        when(col("country") == "Germany", "Europe")
        .when(col("country") == "China", "Asia")
        .when(col("country") == "Argentina", "South America"),
    )

Here's where `replaceWhere` comes in. Suppose we only want to populate the `continent` column when `country == 'China'`.

In [14]:
df = spark.read.format("delta").load(deltaPath)
df = df.where(col("country") == "China").transform(withContinent)

(
    df.write.format("delta")
    .option("replaceWhere", "country = 'China'")
    .mode("overwrite")
    .save(deltaPath)
)

                                                                                

In [15]:
spark.read.format("delta").load(deltaPath).show(truncate=False)

+----------+---------+---------+---------+
|first_name|last_name|country  |continent|
+----------+---------+---------+---------+
|Bruce     |Lee      |China    |Asia     |
|Jack      |Ma       |China    |Asia     |
|Ernesto   |Guevara  |Argentina|null     |
|Wolfgang  |Manche   |Germany  |null     |
|Soraya    |Jala     |Germany  |null     |
+----------+---------+---------+---------+



Let's see what happened by taking a look at the most recent log:

In [16]:
import glob
import json
import os

# Get the path to the latest commit file. Commit files are named by
# zero-padded version number, so the lexicographic maximum is the newest.
path_to_logs = deltaPath + "_delta_log/*.json"
list_of_logs = glob.glob(path_to_logs)
latest_log = max(list_of_logs)

# Print only the add/remove actions from the latest commit.
with open(latest_log, "r") as f:
    for line in f:
        data = json.loads(line)
        if "add" in data or "remove" in data:
            print(json.dumps(data, indent=4))

{
    "add": {
        "path": "country=China/part-00000-adf67d14-a5a1-4f0f-8d7c-99cb5bb8b2dd.c000.snappy.parquet",
        "partitionValues": {
            "country": "China"
        },
        "size": 1002,
        "modificationTime": 1702564443485,
        "dataChange": true,
        "stats": "{\"numRecords\":2,\"minValues\":{\"first_name\":\"Bruce\",\"last_name\":\"Lee\",\"continent\":\"Asia\"},\"maxValues\":{\"first_name\":\"Jack\",\"last_name\":\"Ma\",\"continent\":\"Asia\"},\"nullCount\":{\"first_name\":0,\"last_name\":0,\"continent\":0}}"
    }
}
{
    "remove": {
        "path": "country=China/part-00000-2fae942f-c7e3-450a-aa4f-4fe991d84c5f.c000.snappy.parquet",
        "deletionTimestamp": 1702564441288,
        "dataChange": true,
        "extendedFileMetadata": true,
        "partitionValues": {
            "country": "China"
        },
        "size": 929
    }
}


The log contains one `add` and one `remove` action, and both paths are under `country=China/`. Only the China partition was rewritten; the other partitions were not touched.
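
Commit files in `_delta_log` are named with a 20-digit, zero-padded version number, so the newest commit can also be picked by version. The sketch below assumes that naming scheme; the helper `latest_commit` is hypothetical, and the demo runs against a throwaway directory rather than a real table.

```python
import os
import re
import tempfile
from typing import Optional

# Matches commit files such as 00000000000000000001.json and nothing else.
COMMIT_NAME = re.compile(r"^(\d{20})\.json$")


def latest_commit(log_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered commit file in a _delta_log directory."""
    versions = {}
    for name in os.listdir(log_dir):
        match = COMMIT_NAME.match(name)
        if match:  # skip checkpoints and any other files
            versions[int(match.group(1))] = name
    if not versions:
        return None
    return os.path.join(log_dir, versions[max(versions)])


# Demonstrate on a temporary directory that imitates a _delta_log folder.
with tempfile.TemporaryDirectory() as log_dir:
    for name in (
        "00000000000000000000.json",
        "00000000000000000001.json",
        "_last_checkpoint",
    ):
        open(os.path.join(log_dir, name), "w").close()
    latest_name = os.path.basename(latest_commit(log_dir))
    print(latest_name)
```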

For more details, see the resources listed under Background.

## Update Multiple Partitions
Let's go one step further to see how we can use `replaceWhere` to update rows spread over multiple partitions.

Start by creating a Delta table with multiple countries in the same continent:

In [17]:
df = (
    spark.read.options(header="True", charset="UTF8")
    .csv("../../data/people_countries.csv")
    .withColumn("continent", lit(None).cast(StringType()))
)

df.show()

+----------+---------+---------+---------+
|first_name|last_name|  country|continent|
+----------+---------+---------+---------+
|   Ernesto|  Guevara|Argentina|     null|
|     Bruce|      Lee|    China|     null|
|      Jack|       Ma|    China|     null|
|  Wolfgang|   Manche|  Germany|     null|
|    Soraya|     Jala|  Germany|     null|
+----------+---------+---------+---------+



In [18]:
# add continents to all
from pyspark.sql.functions import col, when


def withContinent(df):
    return df.withColumn(
        "continent",
        when(col("country") == "Germany", "Europe")
        .when(col("country") == "China", "Asia")
        .when(col("country") == "Argentina", "South America"),
    )


df = df.transform(withContinent)

df.show()

+----------+---------+---------+-------------+
|first_name|last_name|  country|    continent|
+----------+---------+---------+-------------+
|   Ernesto|  Guevara|Argentina|South America|
|     Bruce|      Lee|    China|         Asia|
|      Jack|       Ma|    China|         Asia|
|  Wolfgang|   Manche|  Germany|       Europe|
|    Soraya|     Jala|  Germany|       Europe|
+----------+---------+---------+-------------+



In [20]:
from pyspark.sql.functions import col

deltaPath = "../../data/people_countries_delta/"

(
    df.repartition(col("country"))
    .write.partitionBy("country")
    .format("delta")
    .mode("overwrite")
    .save(deltaPath)
)

                                                                                

In [21]:
# read to confirm
spark.read.format("delta").load(deltaPath).show()

+----------+---------+---------+-------------+
|first_name|last_name|  country|    continent|
+----------+---------+---------+-------------+
|   Ernesto|  Guevara|Argentina|South America|
|  Wolfgang|   Manche|  Germany|       Europe|
|    Soraya|     Jala|  Germany|       Europe|
|     Bruce|      Lee|    China|         Asia|
|      Jack|       Ma|    China|         Asia|
+----------+---------+---------+-------------+



Now create a second DataFrame with 3 more entries:

In [22]:
# create a second DataFrame with rows for more countries
df2 = spark.createDataFrame(
    [
        ("Hamed", "Snouba", "Lebanon", "Asia"),
        ("Jasmine", "Terrywin", "Thailand", "Asia"),
        ("Janneke", "Bosma", "Belgium", "Europe"),
    ]
).toDF("first_name", "last_name", "country", "continent")

df2.show()

+----------+---------+--------+---------+
|first_name|last_name| country|continent|
+----------+---------+--------+---------+
|     Hamed|   Snouba| Lebanon|     Asia|
|   Jasmine| Terrywin|Thailand|     Asia|
|   Janneke|    Bosma| Belgium|   Europe|
+----------+---------+--------+---------+



In [23]:
# append new rows
(df2.write.format("delta").mode("append").save(deltaPath))

                                                                                

In [24]:
# read to confirm
df = spark.read.format("delta").load(deltaPath)
df.show()

+----------+---------+---------+-------------+
|first_name|last_name|  country|    continent|
+----------+---------+---------+-------------+
|   Ernesto|  Guevara|Argentina|South America|
|  Wolfgang|   Manche|  Germany|       Europe|
|    Soraya|     Jala|  Germany|       Europe|
|   Jasmine| Terrywin| Thailand|         Asia|
|   Janneke|    Bosma|  Belgium|       Europe|
|     Hamed|   Snouba|  Lebanon|         Asia|
|     Bruce|      Lee|    China|         Asia|
|      Jack|       Ma|    China|         Asia|
+----------+---------+---------+-------------+



In [25]:
# do we still have the correct partitions?
! ls ../../data/people_countries_delta

_delta_log        country=Belgium   country=Germany   country=Thailand
country=Argentina country=China     country=Lebanon


In [26]:
# define function
from pyspark.sql.functions import translate


def anonymizeLastname(df):
    return df.withColumn("last_name", translate("last_name", "aeiou", "12345"))

In [27]:
# select the rows where continent == "Asia" and anonymize their last names
df = df.where(col("continent") == "Asia").transform(anonymizeLastname)
df.show()

+----------+---------+--------+---------+
|first_name|last_name| country|continent|
+----------+---------+--------+---------+
|   Jasmine| T2rryw3n|Thailand|     Asia|
|     Hamed|   Sn45b1| Lebanon|     Asia|
|     Bruce|      L22|   China|     Asia|
|      Jack|       M1|   China|     Asia|
+----------+---------+--------+---------+
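
Spark's `translate` maps each character of its second argument to the character at the same position in its third. Plain Python's `str.translate` behaves the same way for equal-length strings, so the anonymized names above can be checked without Spark:

```python
# Build the same a->1, e->2, i->3, o->4, u->5 mapping used by anonymizeLastname.
mapping = str.maketrans("aeiou", "12345")

names = ["Terrywin", "Snouba", "Lee", "Ma"]
anonymized = [name.translate(mapping) for name in names]
print(anonymized)  # ['T2rryw3n', 'Sn45b1', 'L22', 'M1']
```

The mapping is case-sensitive, so an uppercase vowel would pass through unchanged.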



In [28]:
# (selective) overwrite to disk
(
    df.write.format("delta")
    .option("replaceWhere", "continent = 'Asia'")
    .mode("overwrite")
    .save(deltaPath)
)

                                                                                

In [29]:
df = spark.read.format("delta").load(deltaPath)
df.show()

+----------+---------+---------+-------------+
|first_name|last_name|  country|    continent|
+----------+---------+---------+-------------+
|   Ernesto|  Guevara|Argentina|South America|
|  Wolfgang|   Manche|  Germany|       Europe|
|    Soraya|     Jala|  Germany|       Europe|
|   Jasmine| T2rryw3n| Thailand|         Asia|
|   Janneke|    Bosma|  Belgium|       Europe|
|     Hamed|   Sn45b1|  Lebanon|         Asia|
|     Bruce|      L22|    China|         Asia|
|      Jack|       M1|    China|         Asia|
+----------+---------+---------+-------------+



Great job! 

Let's just check the most recent log to confirm what happened:

In [16]:
# Get the path to the latest commit file. Commit files are named by
# zero-padded version number, so the lexicographic maximum is the newest.
path_to_logs = deltaPath + "_delta_log/*.json"
list_of_logs = glob.glob(path_to_logs)
latest_log = max(list_of_logs)

# Print only the add/remove actions from the latest commit.
with open(latest_log, "r") as f:
    for line in f:
        data = json.loads(line)
        if "add" in data or "remove" in data:
            print(json.dumps(data, indent=4))

{
    "add": {
        "path": "country=Thailand/part-00000-90e36b14-623b-455b-917a-11a6063ecccb.c000.snappy.parquet",
        "partitionValues": {
            "country": "Thailand"
        },
        "size": 1032,
        "modificationTime": 1702406183349,
        "dataChange": true,
        "stats": "{\"numRecords\":1,\"minValues\":{\"first_name\":\"Jasmine\",\"last_name\":\"T2rryw3n\",\"continent\":\"Asia\"},\"maxValues\":{\"first_name\":\"Jasmine\",\"last_name\":\"T2rryw3n\",\"continent\":\"Asia\"},\"nullCount\":{\"first_name\":0,\"last_name\":0,\"continent\":0}}"
    }
}
{
    "add": {
        "path": "country=Lebanon/part-00001-e419556d-7d8d-4263-b6fd-915a4edff62b.c000.snappy.parquet",
        "partitionValues": {
            "country": "Lebanon"
        },
        "size": 1004,
        "modificationTime": 1702406183349,
        "dataChange": true,
        "stats": "{\"numRecords\":1,\"minValues\":{\"first_name\":\"Hamed\",\"last_name\":\"Sn45b1\",\"continent\":\"Asia\"},\"maxVal

Nice work - only the partitions for the countries in Asia were affected by our `replaceWhere` operation.
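
Reading the full JSON by eye gets tedious once a commit touches several partitions. A small stdlib sketch like the one below can summarize the `add` and `remove` actions per partition instead. The helper `touched_partitions` is hypothetical, and the sample lines are hand-written in the shape of the log entries shown above, not copied from a real log.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def touched_partitions(log_lines: Iterable[str]) -> Dict[str, Counter]:
    """Count add/remove actions per partition in one Delta commit file."""
    summary: Dict[str, Counter] = {}
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        for kind in ("add", "remove"):
            if kind in action:
                values = action[kind].get("partitionValues", {})
                key = ",".join(f"{k}={v}" for k, v in sorted(values.items()))
                summary.setdefault(key, Counter())[kind] += 1
    return summary


# Illustrative sample actions (paths shortened on purpose).
sample = [
    '{"add": {"path": "country=China/new.parquet", "partitionValues": {"country": "China"}}}',
    '{"remove": {"path": "country=China/old.parquet", "partitionValues": {"country": "China"}}}',
    '{"commitInfo": {"operation": "WRITE"}}',
]
summary = touched_partitions(sample)
print(summary)
```

With a real table, the same function could be fed the open file handle from the cell above, for example `touched_partitions(open(latest_log))`.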

## Read the full blog
This was just a quick demonstration. For a full walkthrough with detailed explanations, see the resources listed under Background.

In [None]:
(
    df.write.format("delta")
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .saveAsTable("default.people10m")
)
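
The cell above uses `partitionOverwriteMode` set to `dynamic`, which is an alternative to `replaceWhere`: instead of taking a predicate, it overwrites every partition for which the incoming DataFrame has rows and leaves the other partitions alone. The plain-Python sketch below is a toy model of that behaviour under those assumptions; `dynamic_overwrite` and the dictionary "table" are made up for illustration.

```python
from typing import Dict, List

# Partition value -> rows stored in that partition.
Table = Dict[str, List[str]]


def dynamic_overwrite(table: Table, new_data: Table) -> Table:
    """Toy model: replace only the partitions that appear in new_data."""
    result = {partition: list(rows) for partition, rows in table.items()}
    for partition, rows in new_data.items():
        result[partition] = list(rows)  # the whole partition is swapped out
    return result


existing = {
    "China": ["Bruce Lee", "Jack Ma"],
    "Germany": ["Wolfgang Manche", "Soraya Jala"],
}
incoming = {"China": ["Jack Ma"], "Belgium": ["Janneke Bosma"]}
updated = dynamic_overwrite(existing, incoming)
print(updated)
```

Here the China partition is replaced wholesale, Belgium is created, and Germany is left as it was.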