# Column Level Masking in Databricks

**Welcome back!**
In this session, we will cover how to protect sensitive data using **Column Level Masking**.

### Learning Objectives
1.  Understand what Column Level Masking is and why it is used for PII (Personally Identifiable Information).
2.  **Method 1:** Implement masking using a **Custom Metadata Table** (Dynamic mapping).
3.  **Method 2:** Implement masking using the built-in `IS_MEMBER()` function (Group-based).

### What is Column Level Masking?
Unlike Row Level Security (which hides entire rows), Column Level Masking lets every user see all rows but masks the contents of specific sensitive columns (such as Social Security numbers, phone numbers, or email addresses) based on the user's privileges.

*   **Privileged Users (e.g., Admin/HR):** See the actual data.
*   **Restricted Users:** See redacted data (e.g., `***-***-1234`).
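
To make the two views concrete, here is a minimal pure-Python sketch of the masking rule. It only illustrates the idea and is not how Databricks evaluates masks; the function name `mask_phone` and the `is_privileged` flag are hypothetical.

```python
def mask_phone(phone: str, is_privileged: bool) -> str:
    """Return the raw value for privileged users, otherwise a partially redacted copy."""
    if is_privileged:
        return phone
    if len(phone) <= 4:
        # Too short to keep a visible tail: redact everything.
        return "*" * len(phone)
    # Keep the last four characters and replace every digit before them with '*'.
    head, tail = phone[:-4], phone[-4:]
    return "".join("*" if ch.isdigit() else ch for ch in head) + tail


print(mask_phone("123-456-1234", is_privileged=True))   # 123-456-1234
print(mask_phone("123-456-1234", is_privileged=False))  # ***-***-1234
```

The masking functions later in this notebook take the simpler route of returning a fixed literal for restricted users.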

In [None]:
# Setup: Define Catalog and Schema
# Column masks are a Unity Catalog feature, so run this in a Unity Catalog enabled workspace.
catalog = "dev"
schema = "bronze"
table_name = "customer_raw"

spark.sql(f"USE CATALOG {catalog}")
spark.sql(f"USE SCHEMA {schema}")

# Inspect the original data, in particular the 'c_phone' column, which contains PII
display(spark.sql(f"SELECT * FROM {table_name} LIMIT 10"))

## Method 1: Column Masking using Custom Metadata Table

In this approach, we control access by maintaining a custom table that maps users to specific roles (e.g., `admin`, `user`). We then create a User Defined Function (UDF) that checks this table to decide whether to show the raw data or a mask.

### Step 1: Create the Metadata Table
We will store user emails and their assigned groups.

In [None]:
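-- Make sure the schema that holds the metadata table exists (no-op if it already does)
CREATE SCHEMA IF NOT EXISTS dev.metadata;

-- Create a metadata table to store user groups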
CREATE TABLE IF NOT EXISTS dev.metadata.user_groups (
    user_email STRING,
    group_name STRING
);

-- Clean up previous data for demo purposes
TRUNCATE TABLE dev.metadata.user_groups;

### Step 2: Insert mappings
We map the current user (you) to the `admin` group so that you can see the unmasked data, and a dummy user to the `user` group.

In [None]:
current_user = spark.sql("SELECT current_user()").collect()[0][0]

print(f"Mapping current user ({current_user}) to 'admin' group.")

spark.sql(f"""
    INSERT INTO dev.metadata.user_groups VALUES 
    ('{current_user}', 'admin'),
    ('dummy_user@example.com', 'user')
""")
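
Interpolating values into SQL with an f-string works for a demo but breaks if a value contains a quote. As a sketch, the same insert can use named parameter markers, assuming a runtime where `spark.sql` accepts the `args` argument (PySpark 3.4 or later). The helper name `add_user_mapping` is hypothetical.

```python
def add_user_mapping(spark, user_email: str, group_name: str) -> None:
    """Insert one (user, group) row using named parameter markers instead of string interpolation."""
    spark.sql(
        "INSERT INTO dev.metadata.user_groups VALUES (:user_email, :group_name)",
        # Assumption: this runtime supports the `args` keyword of spark.sql.
        args={"user_email": user_email, "group_name": group_name},
    )
```

In the notebook this would be called as `add_user_mapping(spark, current_user, "admin")`.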

### Step 3: Create the Masking Function (UDF)

This SQL function will:
1.  Check the `user_groups` table for the current user's role.
2.  If the role is `admin`, return the actual column value.
3.  Otherwise, return a masked string (e.g., `***-***-****`).

*Note: We use `MAX()` in the subquery so that it always returns exactly one scalar value, even when the user has no row or several rows in the table.*
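
One caveat with an unfiltered `MAX(group_name)`: on strings, `MAX` returns the greatest value in sort order, so a user mapped to several groups is judged by only one of them. The pure-Python sketch below models that behaviour with hypothetical rows and a hypothetical `effective_group` helper. It assumes plain binary string ordering and is not Databricks code.

```python
from typing import Optional

# Hypothetical contents of the user_groups table: (user_email, group_name).
rows = [
    ("alice@example.com", "admin"),
    ("alice@example.com", "user"),
    ("bob@example.com", "admin"),
]


def effective_group(user_email: str) -> Optional[str]:
    """Model SELECT MAX(group_name) ... WHERE user_email = <user>; None stands in for SQL NULL."""
    groups = [g for (u, g) in rows if u == user_email]
    return max(groups) if groups else None


print(effective_group("alice@example.com"))  # user  ("user" sorts after "admin")
print(effective_group("bob@example.com"))    # admin
print(effective_group("carol@example.com"))  # None, so the CASE falls through to the mask
```

If a user can hold more than one role, filtering on `group_name = 'admin'` inside the subquery avoids this.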

In [None]:
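-- The subquery filters on 'admin' so that a user mapped to several groups
-- is still recognised as an admin; MAX() keeps the result scalar.
CREATE OR REPLACE FUNCTION dev.metadata.phone_mask(col_value STRING)
RETURNS STRING
LANGUAGE SQL
RETURN 
  CASE 
    WHEN (
      SELECT MAX(group_name) 
      FROM dev.metadata.user_groups 
      WHERE user_email = current_user()
        AND group_name = 'admin'
    ) = 'admin' 
    THEN col_value
    ELSE '***-***-****' 
  END;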

In [None]:
-- Test the function manually before applying it to the table
SELECT 
    '123-456-7890' as original, 
    dev.metadata.phone_mask('123-456-7890') as masked_view;

### Step 4: Apply the Mask to the Table

We use the `ALTER TABLE` command with `SET MASK` to bind our function to the specific column (`c_phone`).

In [None]:
ALTER TABLE dev.bronze.customer_raw 
ALTER COLUMN c_phone 
SET MASK dev.metadata.phone_mask;

### Step 5: Verify the Data
Since you are mapped to `admin`, you should see the actual phone numbers.

If you remove yourself from the metadata table or change your group to `user`, you will see the masked value `***-***-****` instead.
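
When testing as a restricted user, eyeballing the output is error-prone. A small pure-Python check such as the one below can confirm that every returned value equals the mask. `MASK` mirrors the literal returned by the masking function, and the helper name `all_masked` is hypothetical.

```python
MASK = "***-***-****"  # the literal the masking function returns for restricted users


def all_masked(values) -> bool:
    """Return True when every value equals the mask literal."""
    return all(v == MASK for v in values)


# Hypothetical values, as they might come back from
# [row.c_phone for row in spark.table("dev.bronze.customer_raw").limit(5).collect()]
print(all_masked(["***-***-****", "***-***-****"]))     # True
print(all_masked(["***-***-****", "25-989-741-2988"]))  # False
```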


In [None]:
print("Viewing data as Admin (Current User):")
display(spark.sql("SELECT c_phone, c_name FROM dev.bronze.customer_raw LIMIT 5"))

### Clean Up Method 1
Let's drop the mask so we can demonstrate the second method.

In [None]:
ALTER TABLE dev.bronze.customer_raw 
ALTER COLUMN c_phone 
DROP MASK;

## Method 2: Column Masking using `IS_MEMBER()`

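This method uses Databricks' built-in group management. Instead of maintaining a custom table, we check whether the user belongs to a specific group (e.g., `admins`, `data_engineers`). `IS_MEMBER()` checks group membership at the workspace level; for account-level groups, Databricks also provides `IS_ACCOUNT_GROUP_MEMBER()`.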

#### `IS_MEMBER('group_name')`
Returns `true` if the current user is a member of the specified group.

In [None]:
-- Check if you are part of the 'admins' group
SELECT current_user() as user, is_member('admins') as is_admin;

### Step 1: Create the UDF using `IS_MEMBER`

In [None]:
CREATE OR REPLACE FUNCTION dev.metadata.phone_mask_member(col_value STRING)
RETURNS STRING
LANGUAGE SQL
RETURN 
  CASE 
    WHEN is_member('admins') THEN col_value
    ELSE '***-***-****' -- Redacted value for non-admins
  END;

### Step 2: Apply the Mask

In [None]:
ALTER TABLE dev.bronze.customer_raw 
ALTER COLUMN c_phone 
SET MASK dev.metadata.phone_mask_member;

### Step 3: Verify Results
If you are in the `admins` group, you see the actual phone numbers.

If you are logged in as a standard user (not in the `admins` group), you see the masked value.

In [None]:
display(spark.sql("SELECT c_phone, c_name FROM dev.bronze.customer_raw LIMIT 5"))

## Summary

1.  **Column Level Masking** hides sensitive data content while keeping the column visible.
2.  **Custom Metadata Approach:** Good for granular, complex logic managed within tables.
3.  **`IS_MEMBER` Approach:** Simpler, leverages native Databricks/AD group management.
4.  Applied using `ALTER TABLE ... ALTER COLUMN ... SET MASK`.
5.  Removed using `DROP MASK`.
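
The two statements in items 4 and 5 can be generated from Python when the same mask has to be applied to several columns. This is a sketch only: the helper names `set_mask_sql` and `drop_mask_sql` are hypothetical, and the identifier check assumes unquoted names made of letters, digits, and underscores.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(*names: str) -> None:
    """Reject anything that is not a dot-separated chain of plain unquoted identifiers."""
    for name in names:
        for part in name.split("."):
            if not _IDENT.match(part):
                raise ValueError(f"unsupported identifier: {name!r}")


def set_mask_sql(table: str, column: str, function: str) -> str:
    """Build the statement that binds a masking function to a column."""
    _check(table, column, function)
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET MASK {function}"


def drop_mask_sql(table: str, column: str) -> str:
    """Build the statement that removes the mask from a column."""
    _check(table, column)
    return f"ALTER TABLE {table} ALTER COLUMN {column} DROP MASK"


print(set_mask_sql("dev.bronze.customer_raw", "c_phone", "dev.metadata.phone_mask"))
print(drop_mask_sql("dev.bronze.customer_raw", "c_phone"))
```

Each returned string can then be passed to `spark.sql(...)`.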