# Inventory Collector
Collects data on database objects (tables and views) and on the grants on those objects.
Saves all data to a Delta table.

## Library Definitions

In [None]:
from InventoryCollector import InventoryCollector

## Widget Setup
This notebook uses widgets to configure the InventoryCollector and to pick a single database to scan.

In [None]:
%sql
use catalog hive_metastore;
show databases;

databaseName
000_demo_db
01rohitb_retail_dlt_demo
2020_demo
20220204_workshop_satoshiokayamadatabrickscom
20220221_workshop_satoshiokayamadatabrickscom
202203_workshop_satoshikuramitsudatabrickscom
_fivetran_setup_test
_fivetran_staging
_test_bir_db1
_test_bir_db1_main


In [None]:
hmsDatabaseList = [row.databaseName for row in spark.sql('show databases').select('databaseName').collect()]
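# Compare the length of the list itself, not of a string literal
if len(hmsDatabaseList) > 1024:
    print('Warning! More than 1024 HMS databases. Picker widget will only display first 1024')
# Offer at most the first 1024 databases as dropdown choices
dbutils.widgets.dropdown("HMS_Database", hmsDatabaseList[0], hmsDatabaseList[:1024], "HMS Database")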
dbutils.widgets.text("Inventory_Catalog",   "hive_metastore", "Write Catalog for Inventory")
dbutils.widgets.text("Inventory_Database",   "databricks_inventory", "Write Database for Inventory")
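
The cap applied in the cell above can be sketched in plain Python, outside Databricks. The helper name `dropdown_choices` and the generated database names are hypothetical and only illustrate the slicing; the 1024 limit is taken from the warning message in the cell.

```python
def dropdown_choices(names, limit=1024):
    """Return (choices, truncated): at most `limit` names, plus a truncation flag."""
    names = list(names)
    choices = names[:limit]            # slice end is exclusive, so this keeps `limit` items
    truncated = len(names) > limit     # True when some databases will not be shown
    return choices, truncated

# Made-up database names, just to exercise the helper
sample_names = [f"db_{i:04d}" for i in range(1030)]
choices, truncated = dropdown_choices(sample_names)
print(len(choices), truncated)  # 1024 True
```

Note that a slice such as `names[0:1023]` would keep only 1023 entries, one fewer than the limit.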

## Collector Execution
The snippets below show how to use the InventoryCollector.

### Initialization
First, initialize the collector with a location to store its data. The `.initialize()` method creates the schemas and tables that hold that data.
Re-execute this cell every time the code for InventoryCollector is updated.

In [None]:
#Initialize InventoryCollector
collector = InventoryCollector(spark, dbutils.widgets.get("Inventory_Catalog"), dbutils.widgets.get("Inventory_Database"))
collector.initialize()
# display(collector.get_database_inventory_summary())

Will save results to: `databricks_inventory`. Saving to HMS


### Scan of a single database
Note that there are two types of scans: objects and grants.
Each returns a pair: the execution ID and the DataFrame holding the scanned results.
All past scans are saved to an append-only table. The execution_id lets you retrieve a scan as of a certain time.

With the widget code, this cell will automatically be re-run when you change the dropdown at the top.
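
To illustrate how an execution ID picks one scan out of an append-only table, here is a minimal sketch in plain Python. The rows and ID values are made up; only the column names (`inventory_execution_id`, `source_database`, `Principal`) come from the result tables shown later in this notebook. The helper `rows_for_execution` is hypothetical and is not part of InventoryCollector.

```python
# Hypothetical rows standing in for the append-only results table.
# Each scan appends its rows; earlier rows are never overwritten.
history = [
    {"inventory_execution_id": "grants-run-1", "source_database": "db_a", "Principal": "team_x"},
    {"inventory_execution_id": "grants-run-2", "source_database": "db_a", "Principal": "team_x"},
    {"inventory_execution_id": "grants-run-2", "source_database": "db_a", "Principal": "team_y"},
]

def rows_for_execution(rows, execution_id):
    """Return only the rows written by the scan with the given execution id."""
    return [row for row in rows if row["inventory_execution_id"] == execution_id]

first_scan = rows_for_execution(history, "grants-run-1")
second_scan = rows_for_execution(history, "grants-run-2")
print(len(first_scan), len(second_scan))  # 1 2
```

On a Spark DataFrame the same idea would presumably be a filter on the `inventory_execution_id` column.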

In [None]:
selectedDatabase = dbutils.widgets.get("HMS_Database")
(exec_id_objects, objectDF) = collector.scan_database_objects(selectedDatabase)
(exec_id_grants, grantDF) = collector.scan_database_grants(selectedDatabase)

print(f"Finished scanning both grants and objects for {selectedDatabase}. ObjectExId: {exec_id_objects} GrantExId: {exec_id_grants}")

display(grantDF)

Running DB Inventory for dave_carlson_databricks_com_db with exec_id objects-56826228-60ce-4378-8ed9-ce1d1405db14 at time 2023-03-23 03:50:28.645704
dave_carlson_databricks_com_db has 6 objects
 TABLE: dave_carlson_databricks_com_db.descriptions -- TYPE: EXTERNAL
 TABLE: dave_carlson_databricks_com_db.gartner_2020 -- TYPE: EXTERNAL
 TABLE: dave_carlson_databricks_com_db.gartner_2020_featurized -- TYPE: EXTERNAL
 TABLE: dave_carlson_databricks_com_db.wisconsin_boundaries -- ERROR RETRIEVING DETAILS:
ERROR MSG: dbfs:/user/hive/warehouse/dave_carlson_databricks_com_db.db/wisconsin_boundaries doesn't exist;
DescribeRelation true, [col_name#201785, data_type#201786, comment#201787]
+- ResolvedTable com.databricks.sql.managedcatalog.UnityCatalogV2Proxy@6184a58e, dave_carlson_databricks_com_db.wisconsin_boundaries, DeltaTableV2(org.apache.spark.sql.SparkSession@9534f89,dbfs:/user/hive/warehouse/dave_carlson_databricks_com_db.db/wisconsin_boundaries,Some(CatalogTable(
Catalog: hive_metastore
D

### Auto Scan All Databases
Automatically list and scan all databases.

**Parameters:**
*rescan* -- If True, re-scans a database even if inventory data already exists for it. If False, databases that already have inventory data are skipped. Default: False
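
The skip behavior described for *rescan* can be sketched as follows. This is an assumption-based illustration of the documented semantics, not the actual implementation inside InventoryCollector; the function name `databases_to_scan` and its arguments are hypothetical.

```python
def databases_to_scan(all_databases, already_scanned, rescan=False):
    """Pick which databases a full scan would visit.

    all_databases   -- every database name found in the metastore
    already_scanned -- names that already have inventory data
    rescan          -- when True, previously scanned databases are included again
    """
    if rescan:
        return list(all_databases)
    scanned = set(already_scanned)  # set lookup keeps the membership test cheap
    return [db for db in all_databases if db not in scanned]

all_dbs = ["000_demo_db", "2020_demo", "_fivetran_staging"]
done = ["2020_demo"]
print(databases_to_scan(all_dbs, done))               # ['000_demo_db', '_fivetran_staging']
print(databases_to_scan(all_dbs, done, rescan=True))  # ['000_demo_db', '2020_demo', '_fivetran_staging']
```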

In [None]:
collector.scan_all_databases(rescan=False)

First, scanning existing progress
Start inventory of database 000_demo_db. Creating inventory_execution_id: grants-b4aa6e75-56b2-4b0d-b2e4-2c231fd8c1e2 and execution_time: Column<'current_timestamp()'>
2023-03-23 03:26:06.017216 - Finished inventory of database 000_demo_db. No grants found. execution_id: grants-b4aa6e75-56b2-4b0d-b2e4-2c231fd8c1e2. Elapsed: 0:00:00.770412
Finished scanning grants for 000_demo_db. Execution ID: grants-b4aa6e75-56b2-4b0d-b2e4-2c231fd8c1e2
Skipping database 01rohitb_retail_dlt_demo as it has already been scanned and has data in the inventory.
Skipping database 2020_demo as it has already been scanned and has data in the inventory.
Skipping database 20220204_workshop_satoshiokayamadatabrickscom as it has already been scanned and has data in the inventory.
Skipping database 20220221_workshop_satoshiokayamadatabrickscom as it has already been scanned and has data in the inventory.
Skipping database 202203_workshop_satoshikuramitsudatabrickscom as it has alre

## Results Inspection

### Summary of past executions

In [None]:
display(collector.get_execution_history())

inventory_execution_id,execution_time,source_database,data_type
grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,
grants-6afd63ed-6701-498d-a792-13e2d251a2e9,2023-03-23T03:12:42.376+0000,ahecksher,
grants-eae75019-da7c-4ec9-adae-d68f0010fc78,2023-03-23T03:12:34.334+0000,ah_feature_store_taxi_demo,
grants-21bde4bc-e7dd-442f-9db6-6bf95c6c89ba,2023-03-23T03:12:30.690+0000,ag_lab,
grants-414e788f-5412-40ca-944e-a65616bc2a65,2023-03-23T03:12:25.243+0000,ag999_acl_test,
grants-6f629ffc-e058-4601-b415-e08a80e74658,2023-03-23T03:12:21.508+0000,adwpdb01,
grants-c42ddabc-4a9c-45df-b93b-7f54cd4d4fcf,2023-03-23T03:10:45.873+0000,advait_godbole_workshop_db,
grants-24256451-882f-43da-9162-96e2b8e05412,2023-03-23T03:10:37.693+0000,adult_monitor_db,
grants-cd56b143-445b-4d21-a02e-7a088b371263,2023-03-23T03:10:27.856+0000,adss_cp,
grants-cccbefe4-ed63-495f-9a61-c75fae814029,2023-03-23T03:10:24.606+0000,adss,


### Summary of all databases

In [None]:
display(collector.get_database_inventory_summary())

database,grant_last_execution_id,grant_last_execution_time,object_last_execution_id,object_last_execution_time,ERROR,EXTERNAL,MANAGED,VIEW,grant_count
000_demo_db,,,objects-079c219d-93dd-462a-8300-ef1ff61aee98,2023-03-23T03:04:27.835+0000,9.0,0.0,0.0,0.0,
01rohitb_retail_dlt_demo,grants-ad495985-c115-4ca6-b56e-368c4b1583cd,2023-03-23T03:04:37.968+0000,objects-2d4a462e-9931-4804-8b42-f1bb04d4d900,2023-03-23T03:04:32.653+0000,0.0,7.0,0.0,0.0,8.0
2020_demo,grants-414482c1-3a67-49c2-9a2b-18b555dd227a,2023-03-23T03:04:44.180+0000,objects-4edf2443-7a9d-4283-bf88-5b9e9e22b899,2023-03-23T03:04:40.879+0000,1.0,1.0,0.0,0.0,1.0
20220204_workshop_satoshiokayamadatabrickscom,grants-97e26632-437a-422a-9acc-84acf0bc1a5f,2023-03-23T03:04:49.908+0000,objects-21d6cbf7-5115-43b0-9a82-e47751e3339d,2023-03-23T03:04:47.380+0000,1.0,2.0,0.0,0.0,4.0
20220221_workshop_satoshiokayamadatabrickscom,grants-dd900aa4-0dc2-4f6f-b7d9-6a206d52a672,2023-03-23T03:04:55.631+0000,objects-6015af2f-29c1-4760-bdaf-7434ff45d200,2023-03-23T03:04:52.966+0000,1.0,2.0,0.0,0.0,4.0
202203_workshop_satoshikuramitsudatabrickscom,grants-7f8f9e3c-9d8b-4e90-9db7-6daea1093031,2023-03-23T03:05:03.049+0000,objects-dafd0bb2-eb8f-4c2c-994e-6fdaf8df5a06,2023-03-23T03:04:59.918+0000,1.0,3.0,0.0,0.0,5.0
_fivetran_setup_test,grants-c8a66aa3-b9ae-4a82-98b6-828e7d151b35,2023-03-23T03:05:07.665+0000,,,,,,,1.0
_fivetran_staging,grants-9bced3cb-3987-473c-a446-184c92692796,2023-03-23T03:05:11.481+0000,,,,,,,1.0
_test_bir_db1,grants-1ad3a644-4a16-49cf-b406-3b84ee9edc5c,2023-03-23T03:05:16.145+0000,objects-c69e7e8a-3bf2-4b80-8436-a4b1a84fad92,2023-03-23T03:05:14.337+0000,0.0,3.0,0.0,0.0,4.0
_test_bir_db1_main,grants-672c7eb0-1a77-4edc-a6fa-423959c98aa0,2023-03-23T03:05:20.772+0000,objects-afc2892b-5025-4ff8-8671-c9e90eee26b7,2023-03-23T03:05:18.864+0000,0.0,3.0,0.0,0.0,4.0


### Inspect Single Database Results
Two types of results are stored: "grants" and "objects".

In [None]:
# The "grants" result type lists each non-inherited grant on the database and on its tables and views.
db_grants = collector.get_last_results('grants', selectedDatabase)
display(db_grants)

Principal,ActionType,ObjectType,ObjectKey,inventory_execution_id,execution_time,source_database,grant_statement
unity_testers,SELECT,TABLE,`davew`.`products`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT SELECT ON TABLE `davew`.`products` TO `unity_testers`
unity_testers,READ_METADATA,TABLE,`davew`.`activepackagesview`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT READ_METADATA ON TABLE `davew`.`activepackagesview` TO `unity_testers`
group-test,SELECT,DATABASE,`davew`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT SELECT ON DATABASE `davew` TO `group-test`
m.walker@databricks.com,SELECT,TABLE,`davew`.`activepackages`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT SELECT ON TABLE `davew`.`activepackages` TO `m.walker@databricks.com`
unity_testers,MODIFY,TABLE,`davew`.`products`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT MODIFY ON TABLE `davew`.`products` TO `unity_testers`
unity_testers,MODIFY,TABLE,`davew`.`activepackagesview`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT MODIFY ON TABLE `davew`.`activepackagesview` TO `unity_testers`
group-test,MODIFY,DATABASE,`davew`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT MODIFY ON DATABASE `davew` TO `group-test`
group-test,USAGE,DATABASE,`davew`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT USAGE ON DATABASE `davew` TO `group-test`
unity_testers,SELECT,TABLE,`davew`.`activepackagesview`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT SELECT ON TABLE `davew`.`activepackagesview` TO `unity_testers`
david.whitehouse@databricks.com,OWN,TABLE,`davew`.`net_csv`,grants-51e895be-c00b-4463-a7ae-6f82ed8ea7d6,2023-03-23T03:25:39.447+0000,davew,GRANT OWN ON TABLE `davew`.`net_csv` TO `david.whitehouse@databricks.com`


In [None]:
# The "objects" result type lists each table and view along with its type (managed, external, or view).
# If there was an error retrieving details, the error is stored. For a view, the DDL is saved too.
db_objects = collector.get_last_results('objects', selectedDatabase)
display(db_objects)

source_database,table,objectType,location,viewText,errMsg,inventory_execution_id,execution_time
davew,deltademo,EXTERNAL,dbfs:/home/davew/csvDemo/delta,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,activepackagesview,VIEW,,"select Upn, max(event_date), count(*) from activepackages_1 group by Upn",,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,test_table,MANAGED,dbfs:/user/hive/warehouse/davew.db/test_table,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,products,MANAGED,dbfs:/user/hive/warehouse/davew.db/products,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,net_csv1,EXTERNAL,dbfs:/FileStore/tables/net.csv,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,activepackages_1,EXTERNAL,dbfs:/delta/activepackages_1,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,products_new,MANAGED,dbfs:/user/hive/warehouse/davew.db/products_new,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,utas,EXTERNAL,dbfs:/home/davew/utas/delta,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,records_delta,EXTERNAL,dbfs:/mnt/davew/upoc_hybrid/delta_records,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000
davew,activepackages,EXTERNAL,dbfs:/delta/activepackages,,,objects-24e90170-8ecb-42cb-a14b-d8fd3b8162c7,2023-03-23T03:25:35.001+0000


In [None]:
# You can further aggregate the results as well
display(db_objects.groupBy('objectType').count())

objectType,count
VIEW,1
EXTERNAL,10
MANAGED,3


## Resetting State
After changing the scanning code, you may need to reset the state. To do so, uncomment the line in the following cell and run it:

In [None]:
# collector.resetAllData()