# Unify governance and security for all users and all data

Data governance and security are hard to get right across a complete data platform. SQL GRANTs on tables aren't enough: security must be enforced across multiple kinds of data assets (dashboards, models, files, etc.).

To reduce risk and drive innovation, Emily's team needs to:

- Unify all data assets (Tables, Files, ML models, Features, Dashboards, Queries)
- Onboard data with multiple teams
- Share & monetize assets with external Organizations

### A cluster has been created for this demo
To run this demo, just select the cluster `dbdemos-lakehouse-fsi-credit-junyi_tiong` from the dropdown menu ([open cluster configuration](https://e2-demo-field-eng.cloud.databricks.com/#setting/clusters/0922-083237-e7fg83pu/configuration)). <br />
*Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('lakehouse-fsi-credit')` or re-install the demo: `dbdemos.install('lakehouse-fsi-credit')`*

# Implementing global data governance and security with Unity Catalog

Let's see how the Lakehouse can solve this challenge leveraging Unity Catalog.

Our data has been saved as Delta tables by our Data Engineering team. The next step is to secure this data while allowing other teams to access it. <br>
A typical setup would be the following:

* Data Engineers / Jobs can read and update the main data/schemas (ETL part)
* Data Scientists can read the final tables and update their features tables
* Data Analysts have read access to the Data Engineering and feature tables and can ingest/transform additional data in a separate schema.
* Data is masked/anonymized dynamically based on each user's access level

This is made possible by Unity Catalog. When tables are saved in Unity Catalog, they can be made accessible to the entire organization, across workspaces and across users.

Unity Catalog is key for data governance, including creating data products or organizing teams around a data mesh. Among other things, it provides:

* Fine-grained ACLs,
* Audit logs,
* Data lineage,
* Data exploration & discovery,
* Sharing data with external organizations (Delta Sharing),
* (*coming soon*) Attribute-based access control.

In [0]:
%run ../_resources/00-setup $reset_all_data=false

In [0]:
--CREATE CATALOG IF NOT EXISTS dbdemos;
--USE CATALOG dbdemos;
SELECT CURRENT_CATALOG();

In [0]:
SHOW TABLES;


## Step 1. Access control

In the Lakehouse, you can use simple SQL GRANT and REVOKE statements to define granular access control at the table, schema, or catalog level, irrespective of the data source or format.

In [0]:
-- Let's grant our analysts the SELECT permission:
-- Note: make sure the `jy_analysts` and `jy_dataengineers` groups exist first.
GRANT SELECT ON TABLE jy_demo_catalog.jy_fsi_credit_schema.credit_bureau_gold TO `jy_analysts`;
GRANT SELECT ON TABLE jy_demo_catalog.jy_fsi_credit_schema.customer_gold TO `jy_analysts`;
GRANT SELECT ON TABLE jy_demo_catalog.jy_fsi_credit_schema.fund_trans_gold TO `jy_analysts`;

-- We'll also grant MODIFY to our Data Engineers, at the schema level
GRANT SELECT, MODIFY ON SCHEMA jy_demo_catalog.jy_fsi_credit_schema TO `jy_dataengineers`;
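
The grants above can be checked, and undone, with the statements sketched below. The catalog, schema, and group names are reused from the previous cell. In Unity Catalog a principal also needs `USE CATALOG` and `USE SCHEMA` on the parent objects before a table-level `SELECT` is usable; the first two statements assume those privileges have not been granted yet.

```sql
-- Assumption: the analysts group does not yet have access to the parent catalog and schema.
GRANT USE CATALOG ON CATALOG jy_demo_catalog TO `jy_analysts`;
GRANT USE SCHEMA ON SCHEMA jy_demo_catalog.jy_fsi_credit_schema TO `jy_analysts`;

-- Inspect the privileges currently granted on one of the gold tables.
SHOW GRANTS ON TABLE jy_demo_catalog.jy_fsi_credit_schema.customer_gold;

-- REVOKE removes a privilege again (commented out so the demo grants stay in place).
-- REVOKE SELECT ON TABLE jy_demo_catalog.jy_fsi_credit_schema.fund_trans_gold FROM `jy_analysts`;
```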


## Step 2. PII data masking, row and column-level filtering

In the cells below, we demonstrate how to protect sensitive data by masking a column through a dynamic view.

In [0]:
-- The key is hard-coded here for readability only. Store it as a Databricks secret
-- and load it with secret('<YOUR_SCOPE>', '<YOUR_SECRET_NAME>') instead.
CREATE OR REPLACE VIEW customer_gold_secured AS
SELECT
  c.* EXCEPT (first_name),
  CASE
    WHEN is_member('data_scientists')
    THEN base64(aes_encrypt(c.first_name, 'YOUR_SECRET_FROM_MANAGER')) -- encrypted for members of data_scientists
    ELSE c.first_name
  END AS first_name
FROM
  customer_gold AS c;
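
The key in the view above is a hard-coded literal. A hedged sketch of the same expression with the key loaded through `secret()` is shown below. `<YOUR_SCOPE>` and `<YOUR_SECRET_NAME>` are placeholders, and the secret is assumed to hold a valid AES key (16, 24, or 32 bytes). The last column decrypts the value again to illustrate the round trip.

```sql
-- Assumption: a secret scope and key already exist; the placeholders must be replaced.
WITH encrypted AS (
  SELECT
    cust_id,
    aes_encrypt(first_name, secret('<YOUR_SCOPE>', '<YOUR_SECRET_NAME>')) AS first_name_enc
  FROM customer_gold
  LIMIT 5
)
SELECT
  cust_id,
  base64(first_name_enc) AS first_name_encrypted,  -- what members of data_scientists would see
  CAST(aes_decrypt(first_name_enc, secret('<YOUR_SCOPE>', '<YOUR_SECRET_NAME>')) AS STRING) AS first_name_decrypted
FROM encrypted;
```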

In [0]:
-- CREATE GROUP data_scientists;
ALTER GROUP `data_scientists` ADD USER `quentin.ambard@databricks.com`;

SELECT
  current_user() AS user,
  is_member('data_scientists') AS user_is_data_scientists;

In [0]:
SELECT cust_id, first_name FROM customer_gold_secured;

In [0]:
ALTER GROUP `data_scientists` DROP USER `quentin.ambard@databricks.com`;

In [0]:
%python
import time
time.sleep(60)  # wait so that the ALTER GROUP change is visible to SQL

In [0]:
SELECT cust_id, first_name FROM customer_gold_secured;


As we can observe from the cells above, the ```first_name``` column is masked whenever the current user requesting the data is a member of the ```data_scientists``` group, and is not masked when any other user queries the data.
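
Unity Catalog can also attach a mask directly to a table column, so the rule applies without a separate view. The sketch below is a minimal illustration under two assumptions: the workspace runtime supports column masks, and the function name `first_name_mask` is hypothetical. It redacts the value instead of encrypting it to keep the example short.

```sql
-- Hypothetical masking function: members of data_scientists see a redacted value.
CREATE OR REPLACE FUNCTION first_name_mask(first_name STRING)
RETURNS STRING
RETURN CASE WHEN is_member('data_scientists') THEN '****' ELSE first_name END;

-- Attach the mask to the column; every query on the table now goes through it.
ALTER TABLE customer_gold ALTER COLUMN first_name SET MASK first_name_mask;

-- Remove the mask again when it is no longer needed.
-- ALTER TABLE customer_gold ALTER COLUMN first_name DROP MASK;
```

Because the mask lives on the table itself, it also applies to views built on top of it, so drop it if later notebooks need the clear value.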


## Step 3. (Data and assets) Lineage

Lineage is critical for compliance, auditing, and observability, as well as for data discoverability.

These are three very common scenarios where full data lineage becomes incredibly important:
1. **Explainability** - we need the means to trace the features used in machine learning back to the raw data that created them,
2. Tracing **missing values** in a dashboard or ML model back to their origin,
3. **Finding specific data** - organizations have hundreds or even thousands of data tables and sources. Finding the table or column that contains specific information can be daunting without proper discoverability tools.

In the image below, you can see all the data (both ingested and created internally) in the same lineage graph, irrespective of the data type (stream vs batch), file type (csv, json, xml), language (SQL, python), or tool used (DLT, SQL query, Databricks Feature Store, or a python Notebook).

**Note**: To explore the whole lineage, navigate to the Data Explorer and find the ```customer_gold``` table inside your catalog and database.

<img src="https://raw.githubusercontent.com/borisbanushev/CAPM_Databricks/main/UC.png" />
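
Besides the graph in the UI, lineage can be queried with SQL when system tables are enabled. The sketch below assumes that the `system.access.table_lineage` table is available in your metastore and that you have been granted access to it. The fully qualified table name reuses the catalog and schema from Step 1 and should be adapted to your own.

```sql
-- Assumption: system tables are enabled and readable by the current user.
-- Lists the most recent upstream sources that wrote into customer_gold.
SELECT
  source_table_full_name,
  target_table_full_name,
  entity_type,
  event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'jy_demo_catalog.jy_fsi_credit_schema.customer_gold'
ORDER BY event_time DESC
LIMIT 20;
```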


## Step 4. Secure data sharing

Once our data is ready, we can easily share it using Delta Sharing, an open protocol for sharing your data assets with any customer or partner.

For more details on Delta Sharing, run `dbdemos.install('delta-sharing-airlines')`

In [0]:
CREATE SHARE IF NOT EXISTS dbdemos_credit_decisioning_customer 
  COMMENT 'Sharing the Customer Gold table from the Credit Decisioning Demo.';
 
-- For the demo we'll grant ownership to all users. Typical deployments would have admin groups or similar.
ALTER SHARE dbdemos_credit_decisioning_customer OWNER TO `account users`;

-- Simply add the tables you want to share to your SHARE:
-- ALTER SHARE dbdemos_credit_decisioning_customer  ADD TABLE dbdemos.fsi_credit_decisioning.credit_bureau_gold ;

In [0]:
DESCRIBE SHARE dbdemos_credit_decisioning_customer;
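
A share only becomes visible to a consumer once a recipient has been granted access to it. The sketch below shows that step under stated assumptions: the recipient name `dbdemos_credit_partner` is hypothetical, and creating a recipient without further options is assumed to be acceptable for the demo.

```sql
-- Hypothetical recipient representing an external partner.
CREATE RECIPIENT IF NOT EXISTS dbdemos_credit_partner
  COMMENT 'Example partner consuming the Credit Decisioning share.';

-- Give the recipient read access to everything in the share.
GRANT SELECT ON SHARE dbdemos_credit_decisioning_customer TO RECIPIENT dbdemos_credit_partner;

-- Check what the share contains and who can read it.
SHOW ALL IN SHARE dbdemos_credit_decisioning_customer;
SHOW GRANTS ON SHARE dbdemos_credit_decisioning_customer;
```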

# Next: leverage your data to better serve your customers and reduce credit default risk

Our data is now ingested and secured, and our Data Scientists can access it.

Let's get the maximum value out of the data we ingested: open the [Feature Engineering notebook]($../03-Data-Science-ML/03.1-Feature-Engineering-credit-decisioning) and start creating features for our machine learning models.

Go back to the [Introduction]($../00-Credit-Decisioning).