<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Deidentified PII Access

This lesson explores approaches for reducing the risk of PII leakage when working with potentially sensitive information for analytics and reporting.

<img src="https://files.training.databricks.com/images/ade/ADE_arch_user_bins.png" width="60%" />

## Learning Objectives
By the end of this lesson, students will be able to:
- Apply dynamic views to sensitive data to obscure columns containing PII
- Use dynamic views to filter data, only showing relevant rows to relevant audiences
- Create binned tables to generalize data and obscure PII

Begin by running the following cell to set up relevant databases and paths.

In [0]:
%run ../Includes/Classroom-Setup-6.3

## Dynamic Views

Databricks <a href="https://docs.databricks.com/security/access-control/table-acls/object-privileges.html#dynamic-view-functions" target="_blank">dynamic views</a> allow user or group identity ACLs to be applied to data at the column (or row) level.

Database administrators can configure data access privileges to disallow access to a source table and only allow users to query a redacted view. 

Users with sufficient privileges will be able to see all fields, while restricted users will be shown arbitrary results, as defined at view creation.

Consider our **`users`** table with the following columns.

In [0]:
%sql 
DESCRIBE TABLE users

First name, last name, date of birth, and street address are clearly problematic.

We'll also obfuscate zip code, because zip code combined with date of birth can identify an individual with very high confidence.

In [0]:
%sql
CREATE OR REPLACE VIEW users_vw AS
  SELECT
    alt_id,
    CASE 
      WHEN is_member('ade_demo') THEN dob
      ELSE 'REDACTED'
    END AS dob,
    sex,
    gender,
    CASE 
      WHEN is_member('ade_demo') THEN first_name
      ELSE 'REDACTED'
    END AS first_name,
    CASE 
      WHEN is_member('ade_demo') THEN last_name
      ELSE 'REDACTED'
    END AS last_name,
    CASE 
      WHEN is_member('ade_demo') THEN street_address
      ELSE 'REDACTED'
    END AS street_address,
    city,
    state,
    CASE 
      WHEN is_member('ade_demo') THEN zip
      ELSE 'REDACTED'
    END AS zip,
    updated
  FROM users

Now when we query from **`users_vw`**, only members of the group **`ade_demo`** will be able to see results in plain text.

**NOTE**: You may not have privileges to create groups or assign membership. Your instructor should be able to demonstrate how group membership will change query results.

In [0]:
%sql
SELECT * FROM users_vw
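
If you can't change group membership in your workspace, the following minimal sketch imitates in plain Python what the **`CASE WHEN is_member(...)`** logic in **`users_vw`** does to each row. The names **`redact_row`** and **`SENSITIVE_COLUMNS`** and the sample record are hypothetical and exist only for illustration; in Databricks the membership check is evaluated at query time by the view itself.

```python
# Illustrative only: a plain-Python imitation of the column-level CASE logic in users_vw.
# SENSITIVE_COLUMNS mirrors the columns wrapped in CASE expressions in the view.
SENSITIVE_COLUMNS = {"dob", "first_name", "last_name", "street_address", "zip"}

def redact_row(row, user_groups, privileged_group="ade_demo"):
    """Return a copy of row, masking sensitive columns unless the user is in the group."""
    if privileged_group in user_groups:
        return dict(row)
    return {col: ("REDACTED" if col in SENSITIVE_COLUMNS else value)
            for col, value in row.items()}

# Hypothetical sample record (not taken from the lesson's data).
sample = {"alt_id": "a1", "dob": "1980-01-01", "first_name": "Ada",
          "city": "Los Angeles", "zip": "90001"}

member_view = redact_row(sample, {"ade_demo"})      # sees plain text
restricted_view = redact_row(sample, {"analysts"})  # sees masked values
print(member_view)
print(restricted_view)
```

Non-sensitive columns such as **`alt_id`** and **`city`** pass through unchanged for both audiences, which is what keeps the redacted view useful for analytics.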

## Adding Conditional Row Access

Views with **`WHERE`** clauses that filter source data on different conditions are a useful way to give each team in an organization access to only the data it needs. Dynamic views add the option of giving users with elevated privileges full access to the underlying data through the same view.

Note that views can be layered on top of one another; below, the **`users_vw`** from the previous step is modified with conditional access. Users who aren't members of the specified group will only be able to see records from the city of Los Angeles that have been updated after the specified date.

In [0]:
%sql
CREATE OR REPLACE VIEW users_la_vw AS
SELECT * FROM users_vw
WHERE 
  CASE 
    WHEN is_member('ade_demo') THEN TRUE
    ELSE city = 'Los Angeles' AND updated > '2019-12-12'
  END

In [0]:
%sql
SELECT * FROM users_la_vw
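
The row-level predicate can be pictured the same way. This is a minimal sketch in plain Python, assuming **`updated`** is represented as an ISO-formatted date string; the function **`visible_rows`** and the three records are hypothetical and are not part of the lesson's data.

```python
# Illustrative only: a plain-Python imitation of the WHERE ... CASE predicate in users_la_vw.
def visible_rows(rows, user_groups, privileged_group="ade_demo"):
    """Members see every row; everyone else sees only recent Los Angeles rows."""
    if privileged_group in user_groups:
        return list(rows)
    return [r for r in rows
            if r["city"] == "Los Angeles" and r["updated"] > "2019-12-12"]

# Hypothetical records; ISO-formatted date strings compare correctly as text.
rows = [
    {"alt_id": "a1", "city": "Los Angeles", "updated": "2020-01-05"},
    {"alt_id": "a2", "city": "Los Angeles", "updated": "2019-11-30"},
    {"alt_id": "a3", "city": "San Diego", "updated": "2020-02-01"},
]

restricted = visible_rows(rows, {"analysts"})
full = visible_rows(rows, {"ade_demo"})
print([r["alt_id"] for r in restricted])
print(len(full))
```

Only the record that satisfies both conditions is returned to the restricted user, while a member of the privileged group receives all three.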

## Provide Provisional Access to **`user_lookup`** Table

Our **`user_lookup`** table allows our ETL pipelines to match up our various identifiers with our **`alt_id`** and pull demographic information, as necessary.

Most of our team will not need access to our full PII, but may need to use this table to match up various natural keys from different systems.

Define a dynamic view named **`user_lookup_vw`** below that provides conditional access to the **`alt_id`** but full access to the other info in our **`user_lookup`** table.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE VIEW user_lookup_vw AS
SELECT 
  CASE 
    WHEN is_member('ade_demo') THEN alt_id
    ELSE 'REDACTED'
  END AS alt_id,
  device_id, mac_address, user_id
FROM user_lookup

In [0]:
%sql
SELECT * FROM user_lookup_vw

## Generalize PII in Aggregate Tables

Another approach to reducing the chance of exposing PII is to provide access to data only at a less specific level.

In this section, we'll assign users to age bins while maintaining their gender, city, and state information. 

This will provide sufficient demographic information to build comparative dashboards without revealing specific user identity.

Here we're just defining custom logic for replacing values with manually-specified labels.

In [0]:
from pyspark.sql import functions as F

def age_bins(dob_col):
    # Age in whole years: months elapsed since the date of birth, divided by 12 and floored
    age_col = F.floor(F.months_between(F.current_date(), dob_col) / 12)

    # Each bin includes its lower bound and excludes its upper bound,
    # e.g. "18-25" covers ages 18 through 24
    return (F.when((age_col < 18), "under 18")
             .when((age_col >= 18) & (age_col < 25), "18-25")
             .when((age_col >= 25) & (age_col < 35), "25-35")
             .when((age_col >= 35) & (age_col < 45), "35-45")
             .when((age_col >= 45) & (age_col < 55), "45-55")
             .when((age_col >= 55) & (age_col < 65), "55-65")
             .when((age_col >= 65) & (age_col < 75), "65-75")
             .when((age_col >= 75) & (age_col < 85), "75-85")
             .when((age_col >= 85) & (age_col < 95), "85-95")
             .when((age_col >= 95), "95+")
             # A null dob produces a null age, which matches none of the branches above
             .otherwise("invalid age").alias("age"))
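
Because the bin labels share their endpoints (for example **`18-25`** and **`25-35`**), it can help to check the boundaries outside Spark. The following is a minimal plain-Python sketch of the same half-open bins; **`age_in_years`** and **`age_label`** are hypothetical helpers, and the calendar-based age calculation only approximates what **`months_between`** computes in Spark.

```python
from datetime import date

# Lower bounds of the bins between "under 18" and "95+", as used in age_bins above.
LOWER_EDGES = [18, 25, 35, 45, 55, 65, 75, 85]

def age_in_years(dob, today):
    """Whole years elapsed between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

def age_label(age):
    """Map an age to its bin label; each bin includes its lower bound only."""
    if age is None:
        return "invalid age"
    if age < 18:
        return "under 18"
    if age >= 95:
        return "95+"
    for low in reversed(LOWER_EDGES):
        if age >= low:
            high = 25 if low == 18 else low + 10
            return f"{low}-{high}"

# A user moves to the next bin on the day of their 25th birthday.
day_before = age_in_years(date(2000, 6, 15), date(2025, 6, 14))
birthday = age_in_years(date(2000, 6, 15), date(2025, 6, 15))
print(age_label(day_before), age_label(birthday))
```

An age of exactly 25 therefore lands in **`25-35`**, not **`18-25`**.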

Because this aggregate view of demographic information is no longer personally identifiable, we can safely store this using our natural key.

We'll reference our **`user_lookup`** table to match our IDs.

In [0]:
from pyspark.sql import functions as F

users_df = spark.table("users")
lookup_df = spark.table("user_lookup").select("alt_id", "user_id")

bins_df = (users_df
           .join(lookup_df, ["alt_id"], "left")
           .select("user_id", age_bins(F.col("dob")), "gender", "city", "state"))

In [0]:
display(bins_df)

This binned demographic data will be saved to a table for our analysts to reference.

In [0]:
(bins_df.write
        .format("delta")
        .option("path", f"{DA.paths.working_dir}/user_bins")
        .mode("overwrite")
        .saveAsTable("user_bins"))

In [0]:
%sql
SELECT * FROM user_bins

Note that as currently implemented, each time this logic is processed, all records will be overwritten with newly calculated values. To decrease chances of identifying birth date at binned boundaries, random noise could be added to the values used to calculate age bins (generally keeping age bins accurate, but reducing the likelihood of transitioning a user to a new bin on their exact birthday).
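
One possible way to add such noise is sketched below in plain Python. It is an assumption, not the lesson's implementation: instead of drawing fresh random values on every run, it derives a repeatable per-user offset from a hash of the user ID, so a user's bin does not flip back and forth between overwrites. The helper names, the salt value, and the 30-day window are all hypothetical.

```python
import hashlib
from datetime import date, timedelta

def stable_offset_days(user_id, max_days=30, salt="hypothetical-salt"):
    """Derive a repeatable pseudo-random offset in [-max_days, max_days] from the user ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days

def jittered_dob(dob, user_id, max_days=30):
    """Shift the date of birth by the per-user offset before any age bin is computed."""
    return dob + timedelta(days=stable_offset_days(user_id, max_days))

# Hypothetical user: the shifted date, not the true dob, would feed the binning logic.
offset = stable_offset_days("user-123")
shifted = jittered_dob(date(1990, 6, 15), "user-123")
print(offset, shifted)
```

With an offset like this, the day a user changes bins no longer coincides with their actual birthday, while the bin remains accurate to within the chosen window.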

## Data Object Privileges - Manage Object Privileges

**NOTE**: An owner or an administrator of an object can perform **`GRANT`**, **`DENY`**, **`REVOKE`**, and **`SHOW GRANTS`** operations. However, an administrator cannot deny privileges to or revoke privileges from an owner.

A principal that’s not an owner or administrator can perform an operation only if the required privilege has been granted.

To grant, deny, or revoke a privilege for all users, specify the keyword **`users`** after **`TO`**. For example, **`GRANT SELECT ON ANY FILE TO users`**.

In [0]:
%sql
CREATE SCHEMA accounting;
GRANT USAGE ON SCHEMA accounting TO finance;
GRANT CREATE ON SCHEMA accounting TO finance;

GRANT SELECT ON SCHEMA <schema-name> TO `<user>@<domain-name>`;
GRANT SELECT ON ANONYMOUS FUNCTION TO `<user>@<domain-name>`;
GRANT SELECT ON ANY FILE TO `<user>@<domain-name>`;

SHOW GRANTS `<user>@<domain-name>` ON SCHEMA <schema-name>;

DENY SELECT ON <table-name> TO `<user>@<domain-name>`;

REVOKE ALL PRIVILEGES ON SCHEMA default FROM `<user>@<domain-name>`;
REVOKE SELECT ON <table-name> FROM `<user>@<domain-name>`;

GRANT SELECT ON ANY FILE TO users;

-- Dynamic view functions:
-- current_user(): returns the current user name.
-- is_member(): determines whether the current user is a member of a specific Databricks group;
-- returns true if the user is a member and false otherwise.
SELECT
  current_user() AS user,
  -- Check whether the current user is a member of the "Managers" group.
  is_member('Managers') AS admin

### Column-level permissions

In [0]:
%sql
-- Alias the field 'email' to itself (as 'email') to prevent the
-- permission logic from showing up directly in the column name results.
CREATE OR REPLACE VIEW sales_redacted AS
SELECT
  user_id,
  CASE WHEN
    is_member('auditors') THEN email
    ELSE 'REDACTED'
  END AS email,
  country,
  product,
  total
FROM sales_raw

### Row-level permissions

In [0]:
%sql
CREATE OR REPLACE VIEW sales_redacted AS
SELECT
  user_id,
  country,
  product,
  total
FROM sales_raw
WHERE
  CASE
    WHEN is_member('managers') THEN TRUE
    ELSE total <= 1000000
  END;

### Data masking

In [0]:
%sql
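-- The regexp_extract function takes an email address such as
-- user.x.lastname@example.com and extracts 'example.com', allowing
-- analysts to query the domain name without seeing the full address.
-- The CASE expression is aliased as 'email' so the permission logic
-- does not show up in the column name.

CREATE OR REPLACE VIEW sales_redacted AS
SELECT
  user_id,
  region,
  CASE
    WHEN is_member('auditors') THEN email
    ELSE regexp_extract(email, '^.*@(.*)$', 1)
  END AS email
FROM sales_raw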
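
To see what the masking pattern actually captures, here is a minimal sketch using Python's **`re`** module with the same regular expression. The function **`mask_email`** and its **`is_auditor`** flag are hypothetical stand-ins for the view's **`is_member('auditors')`** check. Returning an empty string when nothing matches mirrors the documented behavior of Spark's **`regexp_extract`**.

```python
import re

# Same pattern as in the view: greedily consume up to the last '@', capture the rest.
DOMAIN_PATTERN = re.compile(r"^.*@(.*)$")

def mask_email(email, is_auditor):
    """Return the full address for auditors, otherwise only the part after '@'."""
    if is_auditor:
        return email
    match = DOMAIN_PATTERN.match(email)
    return match.group(1) if match else ""

masked = mask_email("user.x.lastname@example.com", is_auditor=False)
unmasked = mask_email("user.x.lastname@example.com", is_auditor=True)
print(masked)
print(unmasked)
```

The capture group keeps the whole domain, including the top-level domain, so analysts can still group by domain without seeing the local part of the address.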

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>