# Inventory Collector
Collects data on database objects (tables and views) and on the grants on those objects.
Saves all data to a Delta table.

# Initialization
Run both code cells in this section each time you reconnect to the cluster.

Note: The `from ... import` statement expects a single .py file, as included from GitHub.
If you are not using GitHub repos, create a notebook with the contents of the DbInventoryCollector.py file in it, and change the import line to read:

```%run ./DB-Inventory-Collector```

In [None]:
from DbInventoryCollector import InventoryCollector

In [None]:
#Create Widgets
InventoryCollector.CreateWidgets(dbutils, spark, reset=False)

#Instantiate and initialize collector class
collector = InventoryCollector(spark, dbutils.widgets.get("Inventory_Catalog"), dbutils.widgets.get("Inventory_Database"))
collector.initialize()

#Read the widget values into Python variables.
#Paste these lines into a cell to enable automatic execution on widget change
whichCatalog = dbutils.widgets.get("Scan_Catalog")
sourceDatabase = dbutils.widgets.get("Scan_Database")

# Scanning Databases
Generally you will first run these scan functions to record what objects exist.

## Scan of a single database
Note that there are two types of scans: objects and grants.
Each returns a pair: the execution id and the dataframe holding the scanned results.
All past scans are saved to an append-only table. The execution_id lets you retrieve a scan as it was recorded at a certain time.
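
As a rough illustration of why an execution id is useful on an append-only table, the sketch below uses a plain Python list as a stand-in for the inventory table. The field names (`execution_id`, `execution_time`, `table`) and both helper functions are assumptions made for this example; they are not the collector's actual schema or API.

```python
# Hypothetical in-memory stand-in for the append-only inventory table.
# Field names are assumptions, not the collector's actual schema.
scan_log = []

def append_scan(log, execution_id, execution_time, rows):
    """Append every row of one scan, tagged with its execution id and time."""
    for row in rows:
        log.append({"execution_id": execution_id,
                    "execution_time": execution_time,
                    **row})

def rows_for_execution(log, execution_id):
    """Return only the rows recorded by one specific scan."""
    return [r for r in log if r["execution_id"] == execution_id]

# Two scans of the same database; nothing is overwritten, rows only accumulate.
append_scan(scan_log, "exec-1", "2023-01-01T00:00:00", [{"table": "orders"}])
append_scan(scan_log, "exec-2", "2023-02-01T00:00:00",
            [{"table": "orders"}, {"table": "customers"}])

first_scan = rows_for_execution(scan_log, "exec-1")
second_scan = rows_for_execution(scan_log, "exec-2")
```

Because earlier rows are never modified, filtering on the execution id reproduces exactly what that scan saw.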

With the widget code, this cell will automatically be re-run when you change the dropdown at the top.

In [None]:
(exec_id_objects, objectDF) = collector.scan_database_objects(whichCatalog, sourceDatabase)
print(f"Finished scanning objects for {whichCatalog}.{sourceDatabase}. ObjectExId: {exec_id_objects} ")
display(objectDF)

In [None]:
(exec_id_grants, grantDF) = collector.scan_database_grants(whichCatalog, sourceDatabase)
print(f"Finished scanning grants for {whichCatalog}.{sourceDatabase}. GrantExId: {exec_id_grants}")
display(grantDF)

## Scan All Catalog Functions
Note: this feature is still a work in progress.

In [None]:
collector.scan_catalog_functions(whichCatalog)

## Scan All Databases in Catalog
Automatically list and scan all databases.

**Parameters:**
*rescan* -- If true, re-scans a database even if inventory data already exists for it. If false, databases that already have inventory data are skipped. Default: False
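
The effect of *rescan* can be pictured with the small sketch below. `databases_to_scan` is a hypothetical helper written for this illustration, not part of `InventoryCollector`; the real method may decide what to skip differently.

```python
def databases_to_scan(all_databases, already_scanned, rescan=False):
    """Pick the databases a full-catalog scan would visit.

    Hypothetical helper: with rescan=True every database is visited;
    otherwise databases that already have inventory data are skipped.
    """
    if rescan:
        return list(all_databases)
    seen = set(already_scanned)
    return [db for db in all_databases if db not in seen]

# Example: 'sales' already has inventory data from an earlier scan.
to_scan_default = databases_to_scan(["sales", "hr", "ops"], ["sales"])
to_scan_rescan = databases_to_scan(["sales", "hr", "ops"], ["sales"], rescan=True)
```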

In [None]:
# collector.scan_all_databases(whichCatalog, rescan = False)

# Results Inspection

## Summary of past executions

In [None]:
display(collector.get_execution_history())

## Summary of all databases

In [None]:
dbSummary = collector.get_database_inventory_summary(whichCatalog)
display(dbSummary)

## Inspect Single Database Results
There are two types of results stored: "grants" and "objects".

In [None]:
#the "grants" result type lists out each non-inherited grant on the database and its tables and views.
db_grants = collector.get_last_results('grants', whichCatalog, sourceDatabase)
display(db_grants)

In [None]:
#the "objects" lists out each table and view, along with its type (managed, external, or view). If there was an error retrieving details, the error is stored. For a view the DDL is saved too.
db_objects = collector.get_last_results('objects', whichCatalog, sourceDatabase)
display(db_objects)

In [None]:
#You can further aggregate the results as well
display(db_objects.groupBy('objectType').count())

## Look at all collected grants using SQL

In [None]:
%sql
WITH ranked_grants AS (
  SELECT *,
    RANK() OVER (PARTITION BY source_database ORDER BY execution_time DESC) as rank
  FROM hive_metastore.databricks_inventory.grant_statements
)
SELECT ObjectType, ActionType, ObjectKey, Principal, grant_statement
FROM ranked_grants
WHERE rank = 1
ORDER BY source_database, ObjectType, ObjectKey
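
The query relies on every row of one scan sharing the same `execution_time`, so `rank = 1` keeps all rows of the most recent scan for each database. The same selection is sketched below in plain Python. The sample rows and the ISO-formatted timestamps (which sort correctly as strings) are assumptions for illustration only.

```python
def latest_rows_per_database(rows):
    """Keep only the rows from the newest execution_time of each database."""
    newest = {}
    for row in rows:
        db = row["source_database"]
        ts = row["execution_time"]
        if db not in newest or ts > newest[db]:
            newest[db] = ts
    # Rows tied on the newest timestamp all survive, like RANK() = 1.
    return [r for r in rows
            if r["execution_time"] == newest[r["source_database"]]]

# Made-up sample data: 'sales' was scanned twice, 'hr' once.
sample = [
    {"source_database": "sales", "execution_time": "2023-01-01T00:00:00", "Principal": "alice"},
    {"source_database": "sales", "execution_time": "2023-02-01T00:00:00", "Principal": "alice"},
    {"source_database": "sales", "execution_time": "2023-02-01T00:00:00", "Principal": "bob"},
    {"source_database": "hr", "execution_time": "2023-01-15T00:00:00", "Principal": "carol"},
]
latest = latest_rows_per_database(sample)
```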

## Look at all collected objects using SQL

In [None]:
%sql
WITH ranked_objects AS (
SELECT *, RANK() OVER (PARTITION BY source_catalog, source_database ORDER BY execution_time DESC) as rank
FROM hive_metastore.databricks_inventory.db_objects
WHERE source_catalog = 'hive_metastore'
)
SELECT source_catalog, source_database, `table`, errMsg, execution_time
FROM ranked_objects
WHERE rank = 1 AND objectType = 'ERROR'
ORDER BY source_catalog, source_database

# Resetting State
Upon making changes to the scanning code, you may need to reset the state. Uncomment the following cell to do so:

In [None]:
# collector.resetAllData()

# Generating DDL

In [None]:
sourceCatalog = dbutils.widgets.get("Scan_Catalog")
selectedDatabase = dbutils.widgets.get("Scan_Database")
destCatalog = dbutils.widgets.get("Migration_Catalog")

(ddl_objects, ddl_grants) = collector.generate_migration_ddl(selectedDatabase, destCatalog)

print("Finished Generation of both object and grant DDL")
print(';\n\n'.join(ddl_objects))
print(';\n\n'.join(ddl_grants))

In [None]:
# To execute the generated DDL, uncomment the following lines:
# collector.execute_sql_list(ddl_objects)
# collector.execute_sql_list(ddl_grants)
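
When running a long list of generated statements, it can help to keep going after a failure and review the errors afterwards. The sketch below shows that pattern. `run_statements` is a hypothetical helper and makes no claim about how `execute_sql_list` is implemented; in a notebook, the `execute` callable could be something like `spark.sql`.

```python
def run_statements(statements, execute):
    """Run each statement with `execute`, collecting failures instead of stopping.

    Hypothetical helper: `execute` is any callable taking one SQL string.
    Returns a list of (statement, error message) pairs for failed statements.
    """
    failures = []
    for statement in statements:
        try:
            execute(statement)
        except Exception as exc:
            failures.append((statement, str(exc)))
    return failures

# Stand-in executor for demonstration: records statements, rejects one of them.
executed = []

def fake_execute(statement):
    if "broken_view" in statement:
        raise ValueError("view definition is invalid")
    executed.append(statement)

demo_failures = run_statements(
    ["CREATE TABLE a (id INT)", "CREATE VIEW broken_view AS SELECT", "CREATE TABLE b (id INT)"],
    fake_execute,
)
```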