# Inventory Collector
Collects data on database objects (tables and views) and on the grants on those objects.
Saves all data to a Delta table.

# Initialization
You will have to run all 3 code cells in this section each time you reconnect to the cluster.

## Widget Setup
This notebook uses widgets to initialize the InventoryCollector as well as help scan a particular database easily.
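
As a rough illustration of the dropdown setup in the cell below, this helper caps a list of database names at a fixed number of choices and reports whether anything was cut. The name `cap_dropdown_choices` and its `limit` parameter are hypothetical and are not part of the notebook or of `dbutils`. The default of 1024 simply mirrors the warning that the cell prints.

```python
from typing import List, Tuple


def cap_dropdown_choices(names: List[str], limit: int = 1024) -> Tuple[List[str], bool]:
    """Return at most `limit` names and a flag saying whether the list was truncated.

    Hypothetical helper for illustration only; `limit` is an assumed widget capacity.
    """
    truncated = len(names) > limit
    return names[:limit], truncated


# Example with made-up database names.
example_names = [f"db_{i}" for i in range(1500)]
choices, was_truncated = cap_dropdown_choices(example_names)
print(len(choices), was_truncated)
```

Slicing with `names[:limit]` keeps exactly `limit` entries, and comparing `len(names)` (the list, not a string literal) against the limit decides whether a warning is needed.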

In [None]:
try:
    spark.sql('use catalog hive_metastore;')
except Exception:
    # USE CATALOG may be unsupported (for example, without Unity Catalog); keep the current catalog.
    pass

hmsDatabaseList = [row.databaseName for row in spark.sql('show databases').select('databaseName').collect()]
# Compare the length of the list itself (not of a string literal) against the widget limit.
if len(hmsDatabaseList) > 1024:
    print('Warning! More than 1024 HMS databases. Picker widget will only display first 1024')
# Slice [:1024] so that exactly the first 1024 databases are offered.
dbutils.widgets.dropdown("HMS_Database", hmsDatabaseList[0], hmsDatabaseList[:1024], "HMS Database")
dbutils.widgets.text("Inventory_Catalog", "hive_metastore", "Write Catalog for Inventory")
dbutils.widgets.text("Inventory_Database", "databricks_inventory", "Write Database for Inventory")
dbutils.widgets.text("Migration_Catalog", "CATALOG_HERE", "Migration Target Catalog")

print("Database list:\n")
print('\n'.join(hmsDatabaseList))

## Import Collector Library
Note: `from ... import` expects a single .py file, as included from GitHub.
If you are not using GitHub repos, create a notebook containing the contents of the DbInventoryCollector.py file, and change the import line in the cell below to read:
`%run ./DB-Inventory-Collector`

In [None]:
from DbInventoryCollector import InventoryCollector


## Create Collector
First, we must initialize the collector with a location to store its data. The `.initialize()` method creates the schemas and tables that hold the data.
Remember to re-execute this cell every time the code for InventoryCollector is updated.

In [None]:
#Initialize InventoryCollector
collector = InventoryCollector(spark, dbutils.widgets.get("Inventory_Catalog"), dbutils.widgets.get("Inventory_Database"))
collector.initialize()
# display(collector.get_database_inventory_summary())

# Collector Execution
The snippets below show how to use the InventoryCollector.

## Scanning Databases

### Scan of a single database
Note that there are two types of scans: objects and grants.
Each returns a pair: the execution ID and the DataFrame holding the scan results.
All past scans are saved to an append-only table. The execution_id lets you retrieve a scan as of a certain time.
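
To make the append-only model concrete, here is a minimal in-memory sketch that uses plain Python dictionaries rather than the actual Delta table. The field names (`execution_id`, `execution_time`, `source_database`, `object_name`) and both helper functions are assumptions chosen for illustration. The collector's real schema and API may differ.

```python
from typing import Dict, List, Optional

# Made-up scan log: every scan appends rows and never overwrites earlier ones.
scan_log: List[Dict] = [
    {"execution_id": "run-1", "execution_time": 1, "source_database": "sales", "object_name": "orders"},
    {"execution_id": "run-2", "execution_time": 2, "source_database": "sales", "object_name": "orders"},
    {"execution_id": "run-2", "execution_time": 2, "source_database": "sales", "object_name": "customers"},
]


def rows_for_execution(log: List[Dict], execution_id: str) -> List[Dict]:
    """Return the rows written by one scan (hypothetical helper)."""
    return [row for row in log if row["execution_id"] == execution_id]


def latest_execution_id(log: List[Dict], database: str) -> Optional[str]:
    """Return the execution id of the most recent scan of `database`, or None."""
    candidates = [row for row in log if row["source_database"] == database]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["execution_time"])["execution_id"]


latest = latest_execution_id(scan_log, "sales")
print(latest, len(rows_for_execution(scan_log, latest)))
```

Because earlier rows are never rewritten, filtering on an older execution id (here `"run-1"`) still returns that scan exactly as it was recorded.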

With the widget code, this cell will automatically be re-run when you change the dropdown at the top.

In [None]:
whichDatabase = dbutils.widgets.get("HMS_Database")
(exec_id_objects, objectDF) = collector.scan_database_objects(whichDatabase)
(exec_id_grants, grantDF) = collector.scan_database_grants(whichDatabase)

print(f"Finished scanning both grants and objects for {whichDatabase}. ObjectExId: {exec_id_objects} GrantExId: {exec_id_grants}")

display(objectDF)

### Scan All Catalog Functions
Note: this feature is still a work in progress.

In [None]:
collector.scan_catalog_functions()

### Auto Scan All Databases
Automatically list and scan all databases.

**Parameters:**
*rescan* -- If true, re-scans a database even if inventory data already exists for it. If false, databases that already have inventory data are skipped. Default: False
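
The effect of the *rescan* flag can be pictured as a simple filter over the list of databases. This is a sketch of the behaviour described above, not the collector's actual implementation. The function `databases_to_scan` and its arguments are hypothetical.

```python
from typing import Iterable, List, Set


def databases_to_scan(all_databases: Iterable[str],
                      already_scanned: Set[str],
                      rescan: bool = False) -> List[str]:
    """Pick the databases to scan (hypothetical helper).

    With rescan=True every database is returned; otherwise databases that
    already have inventory data are skipped.
    """
    if rescan:
        return list(all_databases)
    return [db for db in all_databases if db not in already_scanned]


# Example with made-up database names.
all_dbs = ["default", "sales", "marketing"]
scanned = {"sales"}
print(databases_to_scan(all_dbs, scanned))
print(databases_to_scan(all_dbs, scanned, rescan=True))
```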

In [None]:
# collector.scan_all_databases(rescan = False)

## Results Inspection

### Summary of past executions

In [None]:
display(collector.get_execution_history())

### Summary of all databases

In [None]:
dbSummary = collector.get_database_inventory_summary()
display(dbSummary)

### Inspect Single Database Results
Two types of results are stored: "grants" and "objects".

In [None]:
# The "grants" result type lists each non-inherited grant on the database and its tables and views.
db_grants = collector.get_last_results('grants', whichDatabase)
display(db_grants)

In [None]:
# The "objects" result type lists each table and view, along with its type (managed, external, or view).
# If there was an error retrieving details, the error is stored. For a view, the DDL is saved too.
db_objects = collector.get_last_results('objects', whichDatabase)
display(db_objects)

In [None]:
# You can further aggregate the results as well
display(db_objects.groupBy('objectType').count())

### Look at most recent collected grants using SQL

In [None]:
%sql
WITH ranked_grants AS (
  SELECT *,
    RANK() OVER (PARTITION BY source_database ORDER BY execution_time DESC) as rank
  FROM hive_metastore.databricks_inventory.grant_statements
)
SELECT ObjectType, ActionType, ObjectKey, Principal, grant_statement
FROM ranked_grants
WHERE rank = 1
ORDER BY source_database, ObjectType, ObjectKey

## Resetting State
Upon making changes to the scanning code, you may need to reset the state. Uncomment the following cell to do so:

In [None]:
# collector.resetAllData()

## Generating DDL

In [None]:
selectedDatabase = dbutils.widgets.get("HMS_Database")
destCatalog = dbutils.widgets.get("Migration_Catalog")

(ddl_objects, ddl_grants) = collector.generate_migration_ddl(selectedDatabase, destCatalog)

print("Finished Generation of both object and grant DDL")
print(';\n\n'.join(ddl_objects))
print(';\n\n'.join(ddl_grants))

In [None]:
# To execute the generated DDL, uncomment the following lines:
# collector.execute_sql_list(ddl_objects)
# collector.execute_sql_list(ddl_grants)