# Assessment Results Analysis with DuckDB

This notebook demonstrates how to analyze assessment results from the Fabric Assessment Tool using DuckDB. The tool exports data in a hierarchical folder structure that enables efficient querying and analysis.

## Overview

The Fabric Assessment Tool exports assessment data in a structured format with separate folders for:
- **Resources**: Notebooks, jobs/pipelines, clusters/pools
- **Admin**: Administrative components (Synapse only)
- **Data**: Hierarchical data structure with databases → schemas → tables/views

This structure enables granular analysis and better understanding of your data platform.
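
To get a feel for what an export contains before loading anything into DuckDB, the folder structure described above can be summarized with a short helper. This is a minimal sketch, not part of the tool: the function name `count_json_files_by_category` is hypothetical, and it assumes the layout `<assessment root>/<workspace folder>/<category folder>/...` with lower-case category names (`resources`, `admin`, `data`), as suggested by the glob patterns used in the queries in this notebook.

```python
from collections import Counter
from pathlib import Path


def count_json_files_by_category(assessment_root):
    """Count exported JSON files under each category folder.

    Assumes the layout <root>/<workspace>/<category>/.../*.json
    (an assumption inferred from the glob patterns in this notebook).
    """
    root = Path(assessment_root)
    counts = Counter()
    for json_file in root.rglob("*.json"):
        parts = json_file.relative_to(root).parts
        # parts[0] is the per-workspace folder, parts[1] the category folder.
        # Files that sit higher in the tree are ignored.
        if len(parts) >= 3:
            counts[parts[1]] += 1
    return dict(counts)
```

Calling `count_json_files_by_category(assessment_path)` once the path is configured returns a dictionary mapping each category folder to its number of JSON files.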

In [None]:
# Install DuckDB if not already installed
import subprocess
import sys

try:
    import duckdb
    print("DuckDB is already installed")
except ImportError:
    print("Installing DuckDB...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "duckdb"])
    import duckdb
    print("DuckDB installed successfully")

# Initialize DuckDB connection
conn = duckdb.connect()
print("DuckDB connection established")

## Configuration

Set up the path to your assessment results. Update the `assessment_path` variable to point to your exported assessment data.

In [None]:
# Configuration - Update this path to point to your assessment results
assessment_path = "/path/to/your/assessment/results"

# Example paths:
# assessment_path = "/tmp/assessment"
# assessment_path = "C:/assessments/my_workspace"
# assessment_path = "/home/user/fabric_assessment_results"

print(f"Assessment path: {assessment_path}")
print("\nMake sure to update the 'assessment_path' variable above to point to your actual assessment results directory.")
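
Because the placeholder path above does not exist, it can help to fail early with a clear message rather than at the first query. The following is a minimal sketch of such a check; `check_assessment_path` is a hypothetical helper name, and treating each sub-folder of the assessment path as one exported workspace is an assumption.

```python
from pathlib import Path


def check_assessment_path(path_str):
    """Return the sorted names of the folders directly under the assessment path.

    Raises FileNotFoundError if the path is missing or is not a directory.
    """
    root = Path(path_str)
    if not root.is_dir():
        raise FileNotFoundError(
            f"Assessment path does not exist or is not a directory: {root}"
        )
    # Assumption: each sub-folder corresponds to one exported workspace.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Running `check_assessment_path(assessment_path)` right after the configuration cell either lists the exported folders or stops with an explicit error.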

## Synapse Assessment Analysis

In [None]:
# Create Synapse tables for analysis

# Notebooks
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_notebooks AS
SELECT * FROM read_json_auto('{assessment_path}/*/resources/notebooks/*.json');
""")

# SQL Pools
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_sql_pools AS
SELECT * FROM read_json_auto('{assessment_path}/*/resources/sql_pools/*.json');
""")

# Serverless Databases
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_serverless_databases AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/serverless_databases/databases/*/*.json');
""")

# Serverless Schemas
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_serverless_schemas AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/serverless_databases/databases/*/schemas/*/*.json');
""")

# Serverless Tables
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_serverless_tables AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/serverless_databases/databases/*/schemas/*/tables/*.json');
""")

# Dedicated Tables
conn.execute(f"""
CREATE OR REPLACE TABLE synapse_dedicated_tables AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/dedicated_databases/databases/*/schemas/*/tables/*.json');
""")

print("Synapse assessment tables created successfully!")
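
The cell above assumes every glob pattern matches at least one file. When a pattern matches nothing (for example, an export without SQL pools, or a Databricks-only export), `read_json_auto` is expected to raise an error and the cell stops. One possible way to guard against that is sketched below; `create_table_if_files_exist` is a hypothetical helper, not part of the tool, and it uses Python's `glob` module to test the pattern before running the same `CREATE OR REPLACE TABLE` statement.

```python
import glob


def create_table_if_files_exist(conn, table_name, pattern):
    """Create a table from JSON files, skipping it when no file matches.

    conn is any object with an execute(sql) method, such as the
    DuckDB connection created earlier in this notebook.
    Returns True if the table was created, False if it was skipped.
    """
    if not glob.glob(pattern):
        print(f"Skipping {table_name}: no files match {pattern}")
        return False
    conn.execute(
        f"CREATE OR REPLACE TABLE {table_name} AS "
        f"SELECT * FROM read_json_auto('{pattern}');"
    )
    return True
```

For instance, `create_table_if_files_exist(conn, "synapse_sql_pools", f"{assessment_path}/*/resources/sql_pools/*.json")` would create the table only when SQL pool files were exported. Queries against a skipped table will still fail, so the return value should be checked before running them.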

In [None]:
# Analyze Synapse notebooks by language
notebook_analysis = conn.execute("""
SELECT data.language AS language,
       COUNT(*) AS notebook_count
FROM synapse_notebooks
GROUP BY data.language
ORDER BY notebook_count DESC;
""").df()

print("Synapse Notebooks by Language:")
print(notebook_analysis)

In [None]:
# Analyze Synapse dedicated table statistics
dedicated_stats = conn.execute("""
SELECT data.name AS table_name,
       data.database,
       data.schema,
       data.statistics.distribution_policy_name AS distribution_policy,
       data.statistics.table_row_count AS row_count,
       CAST(data.statistics.table_reserved_space_gb AS DECIMAL(18,3)) AS reserved_space_gb,
       CAST(data.statistics.table_data_space_gb AS DECIMAL(18,3)) AS data_space_gb
FROM synapse_dedicated_tables
WHERE type = 'table' AND data.statistics IS NOT NULL
ORDER BY data_space_gb DESC
LIMIT 10;
""").df()

print("Top 10 Synapse Dedicated Tables by Size:")
print(dedicated_stats)

In [None]:
# Aggregate statistics by database and distribution policy
distribution_summary = conn.execute("""
SELECT data.database AS database,
       data.statistics.distribution_policy_name AS distribution_policy,
       SUM(data.statistics.table_row_count) AS total_rows,
       SUM(CAST(data.statistics.table_reserved_space_gb AS DECIMAL(18,3))) AS total_reserved_gb,
       SUM(CAST(data.statistics.table_data_space_gb AS DECIMAL(18,3))) AS total_data_gb,
       COUNT(*) AS table_count
FROM synapse_dedicated_tables
WHERE type = 'table' AND data.statistics IS NOT NULL
GROUP BY database, distribution_policy
ORDER BY total_data_gb DESC;
""").df()

print("Synapse Tables by Database and Distribution Policy:")
print(distribution_summary)

## Databricks Assessment Analysis

In [None]:
# Create Databricks tables for analysis

# Unity Catalog
conn.execute(f"""
CREATE OR REPLACE TABLE databricks_catalogs AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/unity_catalog/catalogs/*/*.json');
""")

conn.execute(f"""
CREATE OR REPLACE TABLE databricks_schemas AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/unity_catalog/catalogs/*/schemas/*/*.json');
""")

conn.execute(f"""
CREATE OR REPLACE TABLE databricks_tables AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/unity_catalog/catalogs/*/schemas/*/tables/*.json', union_by_name=True);
""")

conn.execute(f"""
CREATE OR REPLACE TABLE databricks_volumes AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/unity_catalog/catalogs/*/schemas/*/volumes/*.json');
""")

conn.execute(f"""
CREATE OR REPLACE TABLE databricks_functions AS
SELECT * FROM read_json_auto('{assessment_path}/*/data/unity_catalog/catalogs/*/schemas/*/functions/*.json');
""")

# Clusters and Jobs
conn.execute(f"""
CREATE OR REPLACE TABLE databricks_clusters AS
SELECT * FROM read_json_auto('{assessment_path}/*/resources/clusters/*.json');
""")

conn.execute(f"""
CREATE OR REPLACE TABLE databricks_jobs AS
SELECT * FROM read_json_auto('{assessment_path}/*/resources/jobs/*.json');
""")

print("Databricks assessment tables created successfully!")

In [None]:
# Analyze Databricks Unity Catalog structure
catalog_analysis = conn.execute("""
SELECT data.name AS catalog_name,
       data.comment,
       data.owner,
       data.storage_root
FROM databricks_catalogs
WHERE type = 'unity_catalog';
""").df()

print("Databricks Unity Catalogs:")
print(catalog_analysis)

In [None]:
# Analyze tables by format in Unity Catalog
table_format_analysis = conn.execute("""
SELECT data.catalog AS catalog_name,
       data.format,
       COUNT(*) AS table_count,
       SUM(data.statistics_size_bytes) / (1024 * 1024 * 1024) AS total_size_gigabytes,
       SUM(data.statistics_row_count) / 1000000 AS total_million_row_count
FROM databricks_tables
WHERE type = 'table' AND data.format IS NOT NULL
GROUP BY catalog_name, data.format
ORDER BY catalog_name, table_count DESC;
""").df()

print("Databricks Tables by Format:")
print(table_format_analysis)

In [None]:
# Find tables with many columns
complex_tables = conn.execute("""
SELECT data.name AS table_name,
       data.catalog AS catalog_name,
       data.schema AS schema_name,
       data.columns AS column_count,
       data.statistics_size_bytes / (1024 * 1024) AS size_megabytes
FROM databricks_tables
WHERE type = 'table' AND data.columns > 20
ORDER BY data.columns DESC
LIMIT 10;
""").df()

print("Tables with Most Columns:")
print(complex_tables)

## Summary and Next Steps

This notebook demonstrates how to analyze assessment results using the hierarchical folder structure, which lets each resource type be loaded into its own DuckDB table and queried independently.

### Next Steps

- Customize the queries for your specific analysis needs
- Add visualizations using matplotlib, plotly, or other libraries
- Export results to different formats (CSV, Excel, etc.)
- Create automated reports based on the assessment data
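
As one illustration of the export step listed above, query results can be written to CSV with the standard library alone. This is a minimal sketch: `write_rows_to_csv` is a hypothetical helper, and the column names and rows are assumed to come from a query result.

```python
import csv


def write_rows_to_csv(columns, rows, output_path):
    """Write a header row and data rows to a CSV file.

    columns is a list of column names; rows is a list of tuples.
    Returns the number of data rows written.
    """
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return len(rows)
```

With the connection used in this notebook, the inputs could be obtained from `result = conn.execute("SELECT ...")` as `[col[0] for col in result.description]` and `result.fetchall()`. The DataFrames returned by `.df()` in the cells above can also be saved directly with their `to_csv` method.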

### Additional Resources

- [Query Results Guide](../docs/query_results.md): Comprehensive examples of DuckDB queries
- [DuckDB Documentation](https://duckdb.org/docs/): Official DuckDB documentation
- [Fabric Assessment Tool Repository](https://github.com/microsoft/lakelense): Source code and documentation