
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Delete user records
Under the European Union General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA),
a user of the health tracker device has the right to request that their data be expunged from the system.
The simplest way to honor such a request is to delete all records associated with that user's device id.

## Notebook Configuration

Before you run this cell, make sure to add a unique user name to the file
<a href="$./includes/configuration" target="_blank">
includes/configuration</a>, e.g.

```
username = "yourfirstname_yourlastname"
```

In [0]:
%run ./includes/configuration

#### Step 1: Delete All Records for Device 4
We use the `.delete()` method of the `DeltaTable` API, the Python counterpart of the Spark SQL `DELETE` command,
to remove all records from the `health_tracker_processed` table that match the given predicate.

In [0]:
from delta.tables import DeltaTable

processedDeltaTable = DeltaTable.forPath(spark, health_tracker + "processed")
processedDeltaTable.delete("p_device_id = 4")

## Recover the Lost Data
In the previous lesson, we deleted all records from the `health_tracker_processed` table
for the health tracker device with id 4.

Suppose that the user did not wish to remove all of their data,
but merely to have their name scrubbed from the system.

In this lesson,
we use the Time Travel capability of Delta Lake to recover everything but the user’s name.

#### Step 1: Prepare New upserts View
We prepare a view for upserting, using Time Travel to read the missing records from an earlier version of the table.
Note that we replace every value in the `name` column with `NULL`.
The `.where()` clause keeps only the records whose `p_device_id` is equal to 4.

In [0]:
# ANSWER
from pyspark.sql.functions import lit

upsertsDF = (
  spark.read
  .option("versionAsOf", 5)
  .format("delta")
  .load(health_tracker + "processed")
  .where("p_device_id = 4")
  .select("dte", "time",
          "heartrate", lit(None).alias("name"), "p_device_id")
)

#### Step 2: Perform Upsert Into the `health_tracker_processed` Table
Once more, we upsert into the `health_tracker_processed` table using the `DeltaTable` command `.merge()`.
Note that it is necessary to define:
1. The reference to the Delta table
1. The insert logic, because the schema has changed.

In the insert dictionary, each key is an original column name and each value is
the expression `"upserts." + columnName`, for example `"upserts.name"`.
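
The mapping described above can also be generated rather than typed by hand. This is a minimal sketch, assuming the column list matches the columns selected into the upserts view in Step 1; the variable name `columns` is illustrative and not part of the notebook.

```python
# Columns of the upserts view (assumed to match the select in Step 1).
columns = ["p_device_id", "heartrate", "name", "time", "dte"]

# Key: target column name. Value: SQL expression that reads the same
# column from the DataFrame aliased as "upserts" in the merge.
insert = {column: f"upserts.{column}" for column in columns}

print(insert["name"])  # prints: upserts.name
```

Generating the dictionary keeps the keys and values in sync if a column is later added to the view.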

In [0]:
# ANSWER
processedDeltaTable = DeltaTable.forPath(spark, health_tracker + "processed")

update_match = """health_tracker.time = upserts.time
                  AND
                  health_tracker.p_device_id = upserts.p_device_id"""
update = {"heartrate" : "upserts.heartrate"}

insert = {
      "p_device_id" : "upserts.p_device_id",
      "heartrate" : "upserts.heartrate",
      "name" : "upserts.name",
      "time" : "upserts.time",
      "dte" : "upserts.dte"
}

(processedDeltaTable.alias("health_tracker")
 .merge(upsertsDF.alias("upserts"), update_match)
 .whenMatchedUpdate(set=update)
 .whenNotMatchedInsert(values=insert)
 .execute())
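
Because the device 4 records were deleted earlier, none of the upsert rows match an existing row, so every one of them goes through the insert branch. The plain-Python model below illustrates that matched/not-matched behavior. It is a sketch of upsert semantics on lists of dictionaries, not Delta Lake's implementation, and the sample rows are invented for illustration.

```python
def merge(target, upserts, keys, update_cols):
    """Model of an upsert: update rows whose keys match, insert the rest."""
    index = {tuple(row[k] for k in keys): row for row in target}
    inserted = updated = 0
    for row in upserts:
        key = tuple(row[k] for k in keys)
        if key in index:
            # Matched: overwrite only the listed columns.
            for col in update_cols:
                index[key][col] = row[col]
            updated += 1
        else:
            # Not matched: insert a copy of the whole row.
            new_row = dict(row)
            target.append(new_row)
            index[key] = new_row
            inserted += 1
    return inserted, updated


# Hypothetical table state after the delete: no device 4 rows remain.
target = [{"time": 1, "p_device_id": 3, "heartrate": 60.0, "name": "A"}]
upserts = [
    {"time": 1, "p_device_id": 4, "heartrate": 70.0, "name": None},
    {"time": 2, "p_device_id": 4, "heartrate": 72.0, "name": None},
]

inserted, updated = merge(
    target, upserts, keys=("time", "p_device_id"), update_cols=("heartrate",)
)
print(inserted, updated)  # prints: 2 0
```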


#### Step 3: Count the Most Recent Version
When we look at the current version, we expect to see:

$$ 5 \text{ devices} \times 24 \text{ hours} \times (31 + 29 + 31) \text{ days} $$

That should give us 10920 records.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note that the range of data includes the month of February during a leap year. That is why there are 29 days in the month.
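
The expected count can be checked with a few lines of Python. This is a small sketch of the arithmetic only: the year 2020 is an assumption (any leap year gives the same result), and the months are taken to be January through March, matching the 31 + 29 + 31 days in the formula.

```python
import calendar

YEAR = 2020  # assumption: a leap year, consistent with the 29-day February
devices = 5
hours_per_day = 24

# Days in January, February, and March of the assumed year.
days = sum(calendar.monthrange(YEAR, month)[1] for month in (1, 2, 3))

expected_records = devices * hours_per_day * days
print(days, expected_records)  # prints: 91 10920
```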

In [0]:
(
  spark.read
  .format("delta")
  .load(health_tracker + "processed")
  .count()
)

#### Step 4: Query Device 4 to Demonstrate Compliance
We query the `health_tracker_processed` table to demonstrate that the name associated with device 4 has indeed been removed.

In [0]:
display(
  spark.read
  .format("delta")
  .load(health_tracker + "processed")
  .where("p_device_id = 4")
)
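
Beyond inspecting the displayed rows, the same check can be expressed as a count. The helper below is a hypothetical addition, not part of the original notebook; it assumes a DataFrame with the `p_device_id` and `name` columns used above and relies only on the `where` and `count` DataFrame methods.

```python
def count_named_rows(df, device_id):
    """Count rows for one device that still carry a non-NULL name."""
    predicate = f"p_device_id = {int(device_id)} AND name IS NOT NULL"
    return df.where(predicate).count()


# Usage inside the notebook (requires the Spark session and table path):
# processed_df = spark.read.format("delta").load(health_tracker + "processed")
# assert count_named_rows(processed_df, 4) == 0
```

A result of 0 means no record for that device in the current version still has a name.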

## Maintaining Compliance with a Vacuum Operation
Unfortunately, because of the Delta Lake Time Travel feature, we are still out of compliance: the table can simply be queried at an earlier version to identify the name of the user associated with device 4.

#### Step 1: Query an Earlier Table Version
We query the `health_tracker_processed` table against an earlier version to demonstrate that it is still possible to retrieve the name associated with device 4.

In [0]:
display(
  spark.read
  .option("versionAsOf", 2)
  .format("delta")
  .load(health_tracker + "processed")
  .where("p_device_id = 4")
)

#### Step 2: Vacuum Table to Remove Old Files
The `VACUUM` Spark SQL command, available in Python as `DeltaTable.vacuum()`, can be used to solve this problem. It recursively vacuums the directories associated with the Delta table, removing files that are no longer in the latest state of the transaction log for that table and that are older than a retention threshold. The default threshold is 7 days (168 hours).
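
The removal rule has two conditions that must both hold, which the following plain-Python sketch makes explicit. It is a simplified model, not Delta Lake's implementation: the file names, timestamps, and the `files_to_vacuum` function are all hypothetical, and the model does not specify which timestamp the real command uses to judge a file's age.

```python
from datetime import datetime, timedelta


def files_to_vacuum(file_times, live_files, now, retention_hours=168):
    """Return files that are not in the latest state and are older than the threshold."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        name
        for name, timestamp in file_times.items()
        if name not in live_files and timestamp < cutoff
    )


now = datetime(2020, 4, 1, 12, 0)
file_times = {
    "part-0.parquet": now - timedelta(days=10),  # old, no longer referenced
    "part-1.parquet": now - timedelta(hours=1),  # recent, no longer referenced
    "part-2.parquet": now - timedelta(days=10),  # old, still in the latest state
}
live_files = {"part-2.parquet"}

print(files_to_vacuum(file_times, live_files, now))
# prints: ['part-0.parquet']
print(files_to_vacuum(file_times, live_files, now, retention_hours=0))
# prints: ['part-0.parquet', 'part-1.parquet']
```

With the default threshold, the recently replaced file survives, which is why an earlier version can remain queryable for a while. With a threshold of 0 hours, every file outside the latest state is removed.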

In [0]:
from pyspark.sql.utils import IllegalArgumentException

try:
  processedDeltaTable.vacuum(0)
except IllegalArgumentException as error:
  print(error)

## Delta Table Retention Period
When we run this command, we receive the error below. The default threshold is in place
to prevent corruption of the Delta table.
```
IllegalArgumentException: requirement failed: Are you sure you would like
to vacuum files with such a low retention period?
If you have writers that are currently writing to this table, there is a risk
that you may corrupt the state of your Delta table.

If you are certain that there are no operations being performed on this table, such as insert/upsert/delete/optimize, then you may turn off this check by setting: spark.databricks.delta.retentionDurationCheck.enabled = false

If you are not sure, please use a value not less than "168 hours".
```

#### Step 3: Set Delta to Allow the Operation
To demonstrate the `VACUUM` command, we set the retention period to 0 hours
so that the questionable files can be removed immediately. This is not a best practice,
and as the error above shows, a safeguard prevents the operation by default.
For demonstration purposes, we configure Delta to allow it.

In [0]:
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", False)

#### Step 4: Vacuum Table to Remove Old Files
With the retention check disabled, we run the vacuum again with a retention period of 0 hours.

In [0]:
processedDeltaTable.vacuum(0)

#### Step 5: Attempt to Query an Earlier Version
Now when we attempt to query an earlier version, an error is thrown.
This error indicates that we are not able to query data from this earlier version because the files have been expunged from the system.

In [0]:
display(
  spark.read
  .option("versionAsOf", 4)
  .format("delta")
  .load(health_tracker + "processed")
  .where("p_device_id = 4")
)


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>