## Fabric Semantic Model Audit

### Overview

This notebook audits Fabric semantic models by collecting logs and metadata and tracking them over time. It supports ongoing evaluation of model performance, usage patterns, and metadata changes, which can help you:

- **Identify Unused or Obsolete Columns:** Determine which columns may be removed from the model or underlying Delta tables.
- **Monitor Performance:** Evaluate DAX query performance over time.
- **Track Model Usage:** Collect historical query logs and usage statistics.

### Key Components and Functionality

1. **Initial Setup and Requirements:**
   - **Workspace Monitoring:** The notebook requires that Workspace Monitoring is enabled. See [this blog post](https://blog.fabric.microsoft.com/blog/announcing-public-preview-of-workspace-monitoring) for guidance.
   - **Scheduled Execution:** To capture detailed usage statistics, it is recommended to schedule this notebook to run multiple times per day (e.g., 6 times per day).
   - **Configure Run Parameters:** Configure the run parameters at the top of the notebook based on your models and other requirements.
   - **Logging Datastore:** Attach a lakehouse so log tables can be saved. 

1. **Core Functionality:**
   - **Metadata Capture:** Functions to retrieve and save semantic model objects (columns and measures) along with dependencies using Fabric API calls.
   - **Query Log Collection:** Modules to capture query counts and detailed logs, which help track model usage and performance over specified time intervals.
   - **Unused Columns and Source Mapping:** Compares lakehouse/warehouse metadata with model usage to detect columns that are no longer utilized.
   - **Cold Cache Performance:** Deploys a cloned version of the model to measure cold-cache performance via parallel DAX queries and trace log analysis.
   - **Resident Statistics:** Captures statistics about column residency (e.g., memory load, sizes) to further evaluate model performance.

1. **Star Schema Generation:**
   - The notebook constructs several star schema tables for in-depth analysis:
      - **DIM_ModelObject:** Latest definitions for columns, measures, and unused columns.
      - **DIM_Model:** Basic model details.
      - **DIM_Report:** Report details.
      - **DIM_User:** Standardized user info from logs.
      - **FACT_ModelObjectQueryCount:** Ties query counts to model objects and their dependencies.
      - **FACT_ModelLogs:** Detailed logs for performance tracking.
      - **FACT_ModelObjectStatistics:** Combines daily statistics such as cold cache performance and memory size for columns.

1. **Orchestration and Execution:**
   - The main orchestration function (`collect_model_statistics`) processes each model sequentially, performing all capture steps (metadata, logs, unused columns, cold cache, resident statistics) and finally marking each run as completed or failed.
   - The notebook concludes by writing the star schema tables to Delta format, ready for import into a Fabric semantic model for further analysis.
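
As a rough illustration of the orchestration flow described in item 4, the sketch below processes models one at a time, runs a list of capture steps for each, and records the run as completed or failed. It is a simplified stand-in, not the notebook's `collect_model_statistics`: the `run_all` function, the step functions, and the in-memory `run_log` list are hypothetical, and the sketch assumes that a failure in one model should not stop the remaining models.

```python
from uuid import uuid4


def run_all(models, steps):
    """Run every step for each model in order and record one status per model."""
    run_log = []
    for model in models:
        run = {
            "run_uuid": str(uuid4()),
            "model_name": model["model_name"],
            "status": "started",
        }
        try:
            for step in steps:
                step(model)  # e.g. metadata, logs, unused columns, ...
            run["status"] = "completed"
        except Exception as exc:
            # Assumption: record the failure and move on to the next model.
            run["status"] = "failed"
            run["error"] = str(exc)
        run_log.append(run)
    return run_log


# Hypothetical capture steps used only for this demonstration.
def capture_metadata(model):
    pass


def capture_logs(model):
    if model["model_name"] == "Broken Model":
        raise RuntimeError("log source unavailable")


log = run_all(
    [{"model_name": "Sales Model"}, {"model_name": "Broken Model"}],
    [capture_metadata, capture_logs],
)
print([(r["model_name"], r["status"]) for r in log])
```

Recording the status on a per-run record keeps partially captured data identifiable, so it can be cleaned up on a later run.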

### Usage Notes

- **Scheduling and Monitoring:** To capture granular historical data, consider scheduling this notebook to run at regular intervals throughout the day.
- **Configuration:** Adjust the parameters (e.g., `max_queries_daily`, `max_workers`) to suit your environment and workload.


### Install the Semantic Link Labs package
Check [here](https://pypi.org/project/semantic-link-labs/) to see the latest version.

In [None]:
%pip install semantic-link-labs

### Import Required Packages

In [None]:
# Standard Library Imports
import builtins
import functools
import math
import re
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from contextlib import contextmanager
from datetime import datetime, timedelta
from uuid import uuid4

# Third-Party Libraries
import pandas as pd
import pyspark
from dateutil.relativedelta import relativedelta
from pyspark.sql.functions import col, lit, collect_set, udf
from pyspark.sql.types import StringType

# Microsoft Fabric Libraries (Semantic Link and Semantic Link Labs)
import sempy.fabric as fabric
import sempy_labs as labs

### Set Initial Parameters

In [None]:
# Models to collect statistics for
models = [
    {
        "model_name": "Semantic Model Name", # The name of the target model
        "model_workspace_name": "Semantic Model Workspace Name", # The name of the target model workspace
        "datastore_name": "Semantic Model Datastore Name", # Name of the Fabric Lakehouse or Warehouse behind the model - only needed for Direct Lake models
        "datastore_workspace_name": "Semantic Model Lakehouse Workspace Name", # Name of the workspace that contains that Lakehouse or Warehouse - only needed for Direct Lake models
        "log_analytics_kusto_uri": "",     # Optional: provide your own Kusto URI or leave as empty string (leaving blank will result in using the get_workspace_monitoring_info function)
        "log_analytics_kusto_database_uuid": "", # Optional: provide your own Kusto DB Uuid or leave as empty string (leaving blank will result in using the get_workspace_monitoring_info function)
    },
]

# Settings for collecting cold cache performance measurements.
collect_cold_cache_measurements = True # Only recommended for Direct Lake or small Import models
max_queries_daily = 1                  # Maximum cold cache performance queries per column per day
max_workers = 50                       # Number of concurrent column queries

# Determine target workspace for the cloned model (used in cold cache measurements)
cold_cache_target_workspace_name = fabric.resolve_workspace_name()

# Settings for model query collection (how many days back to collect data)
max_days_ago_to_collect = 30  # Collect data from 1 to 30 days ago (only days with no data are collected)

# Adjustment for recent logs: exclude intervals within this many hours of the as-of datetime for the most recent day
min_hours_before_current = 3

# Principal name collection mode:
#   0 = Keep original ExecutingUser,
#   1 = Anonymize using historical mapping,
#   2 = Always set to "Masked"
collect_principal_names = 0
mask_principal_names_after_days = 30  # Set to 0 if masking is not required

# Define user groups to bucket executing users found in the object count and detailed logs.
user_groups = {  
    "Engineers Example": [
        "engineer1@microsoft.com",
        "engineer2@microsoft.com",
    ],
    "Project Managers Example": [
        "pm1@microsoft.com",
        "pm2@microsoft.com",
    ],
}
default_user_group = "Other Users"

# Delta table names for historical data (new records are appended each run)
historical_table_names = {
    "run_history": "run_history",
    "model_columns": "model_columns",
    "model_measures": "model_measures",
    "object_query_count": "model_object_query_count",
    "detailed_logs": "model_detailed_logs",
    "object_mapping": "model_object_mapping",
    "dependencies": "model_dependencies",
    "unused_columns": "unused_delta_table_columns",
    "source_mapping": "model_column_source_mapping",
    "cold_cache_measurements": "model_column_cold_cache_measurement",
    "resident_statistics": "model_column_resident_statistics",
    "source_reports": "source_reports",
    "source_app_reports": "source_app_reports",
}

# Delta table names for star schema (these tables are overwritten each run)
star_schema_table_names = {
    "dim_model_object": "DIM_ModelObject",
    "dim_model": "DIM_Model",
    "dim_report": "DIM_Report",
    "dim_user": "DIM_User",
    "fact_model_object_query_count": "FACT_ModelObjectQueryCount",
    "fact_detailed_logs": "FACT_ModelLogs",
    "fact_model_statistics": "FACT_ModelObjectStatistics",
}

# Flags for table management
force_delete_historical_tables = False
force_delete_incomplete_runs = True

# Abfss base path
abfss_base_path = "onelake.dfs.fabric.microsoft.com"

# Ensure Spark uses case-sensitive SQL
spark.conf.set("spark.sql.caseSensitive", True)
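
The `user_groups` and `default_user_group` settings above imply a lookup from an executing user to a group label. The sketch below shows one way such a lookup could work. The `resolve_user_group` function and the `example_groups` variable are hypothetical, and the case-insensitive, first-match-wins behavior is an assumption rather than something the notebook states.

```python
def resolve_user_group(user, groups, default_group):
    """Return the first group whose member list contains `user` (case-insensitive)."""
    normalized = user.strip().lower()
    for group_name, members in groups.items():
        if normalized in (member.lower() for member in members):
            return group_name
    # Users that appear in no group fall into the default bucket.
    return default_group


example_groups = {
    "Engineers Example": ["engineer1@microsoft.com", "engineer2@microsoft.com"],
    "Project Managers Example": ["pm1@microsoft.com", "pm2@microsoft.com"],
}

print(resolve_user_group("PM1@microsoft.com", example_groups, "Other Users"))
print(resolve_user_group("someone@contoso.com", example_groups, "Other Users"))
```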

### Helper Functions: Logging, Retry, and Saving DataFrames

In [None]:
# Thread-local storage to track call depth (used for logging indentation)
# This allows each thread to maintain its own "call depth" counter independently.
_thread_local = threading.local()


@contextmanager
def indented_print(indent_level: int):
    """
    A context manager that temporarily replaces the built-in print function.
    It prepends a specific indent (based on indent_level) to every print output,
    which makes nested function calls easier to trace visually.
    """
    # Save the original print function so it can be restored later.
    original_print = builtins.print

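    def custom_print(*args, **kwargs):
        # Create an indent by repeating four spaces per indent level.
        indent = "    " * indent_level
        # Honor a caller-supplied separator; print() treats sep=None as a single space.
        sep = kwargs.pop("sep", None)
        sep = " " if sep is None else sep
        # Call the original print with the indented message.
        original_print(indent + sep.join(map(str, args)), **kwargs)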

    # Replace the built-in print with our custom_print.
    builtins.print = custom_print
    try:
        # Yield control back to the caller.
        yield
    finally:
        # Restore the original print function after exiting the block.
        builtins.print = original_print


def log_function_calls(func):
    """
    Decorator that logs the start and end of a function call using indented printing.
    It uses a thread-local counter to indent log messages, so nested calls are visually offset.
    
    Example:
        @log_function_calls
        def my_func():
            ...
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Initialize call_depth for the thread if not already set.
        if not hasattr(_thread_local, "call_depth"):
            _thread_local.call_depth = 0

        # Capture the current call depth to determine the indent.
        indent = _thread_local.call_depth

        # Log the start message using the indented_print context manager.
        with indented_print(indent):
            print(f"✅ {func.__name__} - Starting")

        # Increase the call depth as we enter the function.
        _thread_local.call_depth += 1
        try:
            # Log any output inside the function with increased indentation.
            with indented_print(_thread_local.call_depth):
                result = func(*args, **kwargs)
        finally:
            # Decrease the call depth on function exit.
            _thread_local.call_depth -= 1
            with indented_print(_thread_local.call_depth):
                print(f"✅ {func.__name__} - Ending")
        return result

    return wrapper


def retry(exceptions, num_retries=3, initial_delay=5, backoff_factor=2, logger=None):
    """
    Decorator factory that returns a decorator to automatically retry a function call if it raises
    one of the specified exceptions. It uses exponential backoff between retries.
    
    Parameters:
        exceptions (tuple or Exception): Exception(s) that trigger a retry.
        num_retries (int): Total number of attempts, including the first call
            (with num_retries=3 the function is retried at most twice).
        initial_delay (int): Initial delay in seconds before the first retry.
        backoff_factor (int): Factor by which the delay is multiplied after each retry.
        logger (callable, optional): Logger function for reporting retries (defaults to print).
    
    Usage:
        @retry((ValueError,), num_retries=3, initial_delay=2, backoff_factor=2)
        def my_func():
            ...
    """
    def decorator_retry(func):
        @functools.wraps(func)
        def wrapper_retry(*args, **kwargs):
            attempts, delay = num_retries, initial_delay
            # Retry loop: try the function until attempts are exhausted.
            while attempts > 1:
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    msg = f"⚠️ {func.__name__} failed with {e}, retrying in {delay} seconds..."
                    if logger:
                        logger(msg)
                    else:
                        print(msg)
                    # Pause execution for 'delay' seconds before retrying.
                    time.sleep(delay)
                    attempts -= 1
                    # Increase the delay for the next attempt.
                    delay *= backoff_factor
            # Final attempt: if previous retries failed, let any exception propagate.
            return func(*args, **kwargs)
        return wrapper_retry
    return decorator_retry


@log_function_calls
def save_dataframe_to_delta_table(data, table_name: str, context: dict, **extra_columns) -> None:
    """
    Appends a pandas or Spark DataFrame to a Delta table with additional contextual columns.
    Extra columns provided as keyword arguments are added to the DataFrame before writing.
    
    Parameters:
        data (pandas.DataFrame or pyspark.sql.DataFrame): The input data.
        table_name (str): The target Delta table name.
        context (dict): A context dictionary that must contain keys:
            - 'as_of_datetime'
            - 'as_of_date'
            - 'run_uuid'
            - 'source_model_uuid'
        **extra_columns: Any additional columns to add to the DataFrame.
    
    The function converts a pandas DataFrame to a Spark DataFrame if necessary and ensures
    that column names have no spaces. It then writes the DataFrame to the specified Delta table.
    """
    # Default columns added to every record from the context.
    default_cols = {
        "AsOfDateTime": context["as_of_datetime"],
        "AsOfDate": context["as_of_date"],
        "RunUuid": context["run_uuid"],
        "ModelUuid": context["source_model_uuid"],
    }
    # Merge default columns with any extra columns provided.
    all_extra_cols = {**default_cols, **extra_columns}

    def add_columns(df, cols: dict):
        """
        Helper function to add extra columns to a DataFrame.
        Works for both pandas and Spark DataFrames.
        """
        for col_name, value in cols.items():
            if isinstance(df, pd.DataFrame):
                # Direct assignment for pandas DataFrame.
                df[col_name] = value
            else:
                # For Spark DataFrame, use the withColumn method and lit() to add constant columns.
                df = df.withColumn(col_name, lit(value))
        return df

    if isinstance(data, pd.DataFrame):
        # For pandas DataFrame, remove spaces in column names for consistency.
        data.columns = data.columns.str.replace(" ", "", regex=True)
        data = add_columns(data, all_extra_cols)
        # Convert the cleaned pandas DataFrame into a Spark DataFrame.
        spark_df = spark.createDataFrame(data)
    elif isinstance(data, pyspark.sql.DataFrame):
        # For Spark DataFrame, rename columns by removing any spaces.
        for c in data.columns:
            data = data.withColumnRenamed(c, c.replace(" ", ""))
        spark_df = add_columns(data, all_extra_cols)
    else:
        # Raise error if data is not a recognized DataFrame type.
        raise TypeError("❌ Unsupported data type. Expected pandas or Spark DataFrame.")

    try:
        # Write the DataFrame to the specified Delta table.
        # The "mergeSchema" option allows the schema to evolve if needed.
        spark_df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(table_name)
        print(f"✅ Table `{table_name}` updated successfully.")
    except Exception as e:
        print(f"❌ Failed to save table `{table_name}`. Error: {e}")
        raise
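
To make the backoff behavior of the `retry` decorator above concrete, the sketch below reproduces only its loop arithmetic and lists the waits between attempts. The `retry_delays` helper is hypothetical and exists purely for illustration; it does not call or wrap any function.

```python
def retry_delays(num_retries=3, initial_delay=5, backoff_factor=2):
    """List the sleep durations (in seconds) between attempts, mirroring the retry loop."""
    delays = []
    attempts, delay = num_retries, initial_delay
    while attempts > 1:
        delays.append(delay)
        attempts -= 1
        delay *= backoff_factor
    return delays


print(retry_delays())         # waits before the 2nd and 3rd attempts
print(retry_delays(5, 2, 3))  # five attempts, starting at 2 s and tripling
```

With the default arguments the decorated function is called at most three times: a wait of 5 seconds after the first failure, 10 seconds after the second, and a final attempt whose exception is allowed to propagate.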


### Run History & Cleanup Functions

In [None]:
@log_function_calls
def record_run_start(context: dict) -> None:
    """
    Records the start of a run by inserting a new record into the run_history table.
    
    It creates a DataFrame that includes the current start time, a placeholder for the end time,
    and a status of 'started'. Additional context data (except for keys that are handled elsewhere)
    is included to help identify the run.
    """
    # Filter out keys that are specific to run identification and handled separately,
    # ensuring that the DataFrame only includes the general context information.
    context_filtered = {
        k: v for k, v in context.items() if k not in ["run_uuid", "source_model_uuid"]
    }
    # Construct a pandas DataFrame with a single row representing the run start.
    # Both StartTime and EndTime are set to the current timestamp; EndTime will be updated upon completion.
    run_start_df = pd.DataFrame(
        [
            {
                **context_filtered,
                "StartTime": datetime.now(),  # Capture the current time as the start time.
                "EndTime": datetime.now(),    # Placeholder for end time; to be updated later.
                "Status": "started",          # Set initial status as 'started'.
            }
        ]
    )
    # Write the run start DataFrame to the Delta table designated for run history.
    save_dataframe_to_delta_table(
        data=run_start_df,
        table_name=historical_table_names["run_history"],
        context=context,
    )
    # Log a confirmation message including the run's unique identifier.
    print(f"✅ Recorded run start for UUID: {context['run_uuid']}")


@log_function_calls
def record_run_completion(context: dict, status: str) -> None:
    """
    Updates the run_history table to mark the run as completed or failed.
    
    It sets the EndTime to the current timestamp and updates the run's Status.
    The function ensures that the run_uuid is present and safely escapes it for SQL usage.
    """
    # Retrieve the unique run identifier from the context.
    run_uuid = context.get("run_uuid")
    if not run_uuid:
        raise ValueError("❌ 'run_uuid' missing from context.")
    # Escape single quotes in the run_uuid and status to prevent SQL injection or syntax issues.
    escaped_uuid = run_uuid.replace("'", "''")
    escaped_status = status.replace("'", "''")
    # Format the current datetime as a string suitable for SQL TIMESTAMP.
    end_time_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Construct the SQL update query to set EndTime and Status for the given run.
    update_query = f"""
        UPDATE {historical_table_names["run_history"]}
        SET EndTime = CAST('{end_time_str}' AS TIMESTAMP),
            Status = '{escaped_status}'
        WHERE RunUuid = '{escaped_uuid}'
    """
    try:
        # Execute the SQL update query using Spark's SQL interface.
        spark.sql(update_query)
        # Log a success message indicating the run has been updated.
        print(f"✅ Run UUID: {run_uuid} updated with status '{status}'.")
    except Exception as e:
        # Log the error if the update fails and re-raise the exception.
        print(f"❌ Failed to update run UUID: {run_uuid}. Error: {e}")
        raise


@log_function_calls
def cleanup_incomplete_runs() -> None:
    """
    Removes records associated with runs that did not complete successfully.
    
    The function performs two main operations:
      1. Deletes records from all historical tables (except run_history) corresponding to incomplete runs.
      2. Updates the run_history table to mark those incomplete runs as 'removed'.
      
    This cleanup helps maintain data consistency by removing or flagging partially recorded runs.
    """
    # Check if the run_history table exists; if not, there is nothing to clean up.
    if not spark.catalog.tableExists(historical_table_names["run_history"]):
        print("✅ run_history table does not exist yet. No cleanup necessary.")
        return

    try:
        # Retrieve all run_uuids from run_history where the Status is not 'completed' or 'removed'.
        incomplete_df = (
            spark.table(historical_table_names["run_history"])
            .filter(~col("Status").isin("completed", "removed"))
            .select("RunUuid")
        )
        # Collect the run uuids from the DataFrame to a list.
        incomplete_uuids = [row["RunUuid"] for row in incomplete_df.collect()]

        # If there are no incomplete runs, log the info and exit.
        if not incomplete_uuids:
            print("✅ No incomplete runs to clean.")
            return

        print(f"✅ Found {len(incomplete_uuids)} incomplete run(s); proceeding with cleanup.")
        # Escape each run_uuid for safe SQL query usage.
        escaped_uuids = [uuid.replace("'", "''") for uuid in incomplete_uuids]
        # Create a comma-separated string of escaped run_uuids for use in SQL IN clause.
        uuid_list_str = ", ".join(f"'{uuid}'" for uuid in escaped_uuids)

        # Iterate over each historical table (except run_history) to remove incomplete run records.
        for logical_name, table in historical_table_names.items():
            if logical_name == "run_history":
                continue  # Skip the run_history table in this deletion loop.
            try:
                # Only attempt deletion if the table exists.
                if not spark.catalog.tableExists(table):
                    print(f"✅ Table {table} not found. Skipping deletion.")
                    continue
                # Construct and execute the deletion query for the current table.
                delete_query = f"DELETE FROM {table} WHERE RunUuid IN ({uuid_list_str})"
                spark.sql(delete_query)
                print(f"✅ Deleted records in table {table} for incomplete runs.")
            except Exception as e:
                # Log the error for the current table and continue with the next one.
                print(f"❌ Failed to clean table {table}. Error: {e}")
                continue

        # After cleaning other tables, update the run_history table to mark incomplete runs as 'removed'.
        update_query = f"""
            UPDATE {historical_table_names["run_history"]}
            SET Status = 'removed'
            WHERE RunUuid IN ({uuid_list_str})
        """
        spark.sql(update_query)
        print(f"✅ Marked {len(incomplete_uuids)} incomplete run(s) as removed.")
    except Exception as e:
        # Log and re-raise any exception encountered during the cleanup process.
        print(f"❌ Cleanup failed. Error: {e}")
        raise


def drop_historical_tables() -> None:
    """
    Drops all historical Delta tables.
    
    Use this function with caution as it permanently deletes all historical audit data
    stored in the tables defined in the 'historical_table_names' mapping.
    """
    # Loop through each historical table and attempt to drop it.
    for logical_name, table in historical_table_names.items():
        try:
            print(f"🗑️ Dropping table: {table}")
            # Execute the DROP TABLE command; IF EXISTS ensures no error is thrown if the table doesn't exist.
            spark.sql(f"DROP TABLE IF EXISTS {table}")
            print(f"✅ Dropped table: {table}")
        except Exception as e:
            # Log any failure to drop a table.
            print(f"❌ Failed to drop table `{table}`. Error: {e}")
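
The cleanup function above builds its `IN (...)` list by doubling single quotes and wrapping each run UUID in quotes. The sketch below isolates that string-building step so it can be checked without Spark. The `build_uuid_in_list` helper and the sample values are hypothetical; where the SQL engine supports parameterized queries, those are generally a safer choice than string interpolation.

```python
def build_uuid_in_list(run_uuids):
    """Quote each value for a SQL IN (...) list, doubling embedded single quotes."""
    escaped = [value.replace("'", "''") for value in run_uuids]
    return ", ".join(f"'{value}'" for value in escaped)


example_in_list = build_uuid_in_list(["a1b2", "c3'd4"])
example_query = f"DELETE FROM model_columns WHERE RunUuid IN ({example_in_list})"
print(example_query)
```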

### Capturing Semantic Model Objects & Dependencies

In [None]:
@log_function_calls
def capture_semantic_model_objects(context: dict) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Retrieves columns and measures for the specified semantic model.

    It uses Fabric API calls to capture:
      - Columns (with extended metadata) from the model.
      - Measures from the model.
    The captured data is saved to Delta tables for historical tracking.

    Returns:
      A tuple (model_columns, model_measures) as pandas DataFrames.
    """
    # Check that the context contains the required keys for API calls.
    for key in ["source_model_uuid", "source_model_workspace_uuid"]:
        if key not in context:
            raise KeyError(f"❌ Missing context key: '{key}'")
    try:
        # Refresh the Tabular Object Model (TOM) cache to ensure up-to-date metadata.
        fabric.refresh_tom_cache(context["source_model_workspace_uuid"])

        # Retrieve model columns with extended metadata via Sempy.
        model_columns = fabric.list_columns(
            dataset=context["source_model_uuid"],
            extended=True,
            workspace=context["source_model_workspace_uuid"],
        )
        # Clean column names by removing spaces for consistency.
        model_columns.columns = model_columns.columns.str.replace(" ", "", regex=True)
        # Save the captured columns to the Delta table for historical tracking.
        save_dataframe_to_delta_table(
            data=model_columns,
            table_name=historical_table_names["model_columns"],
            context=context,
        )

        # Similarly, capture model measures from Sempy.
        model_measures = fabric.list_measures(
            dataset=context["source_model_uuid"],
            workspace=context["source_model_workspace_uuid"],
        )
        # Remove spaces from measure column names.
        model_measures.columns = model_measures.columns.str.replace(" ", "", regex=True)
        # Save the captured measures to the corresponding Delta table.
        save_dataframe_to_delta_table(
            data=model_measures,
            table_name=historical_table_names["model_measures"],
            context=context,
        )
    except Exception as e:
        print(
            f"❌ Failed to capture model objects for model `{context['source_model_uuid']}`. Error: {e}"
        )
        raise
    return model_columns, model_measures


def find_dependencies_recursive(
    dependencies: pd.DataFrame,
    root_table: str,
    root_object: str,
    ref_table: str,
    ref_object: str,
    level: int,
    accum: list,
) -> None:
    """
    Recursively finds dependencies for a measure.

    It traverses the dependency DataFrame (obtained from a DAX query) to:
      - Append each dependency with the current recursion level.
      - Recursively follow further dependencies if the referenced object is a measure.

    Parameters:
      dependencies (pd.DataFrame): DataFrame containing dependency data.
      root_table (str): The table name of the original measure.
      root_object (str): The measure name for which dependencies are being traced.
      ref_table (str): The table name of the current referenced object.
      ref_object (str): The name of the current referenced object.
      level (int): The current recursion depth (starting at 1).
      accum (list): List used to accumulate dependency entries.
    """
    # Filter the dependency DataFrame to only include rows matching the current reference.
    refs = dependencies[
        (dependencies["[TABLE]"] == ref_table) &
        (dependencies["[OBJECT]"] == ref_object)
    ]
    for _, row in refs.iterrows():
        # Append dependency details to the accumulator.
        accum.append(
            {
                "ObjectType": "MEASURE",  # All entries here are measures.
                "TableName": root_table,  # Original measure's table.
                "ObjectName": root_object,  # Original measure's name.
                "ReferencedObjectType": row["[REFERENCED_OBJECT_TYPE]"],
                "ReferencedTableName": row["[REFERENCED_TABLE]"],
                "ReferencedObjectName": row["[REFERENCED_OBJECT]"],
                "Level": level,  # Record the current recursion level.
            }
        )
        # If the referenced object is itself a measure, continue recursion.
        if row["[REFERENCED_OBJECT_TYPE]"] == "MEASURE":
            find_dependencies_recursive(
                dependencies,
                root_table,
                root_object,
                row["[REFERENCED_TABLE]"],
                row["[REFERENCED_OBJECT]"],
                level + 1,  # Increase recursion level.
                accum,
            )


@log_function_calls
def capture_semantic_model_dependencies(
    context: dict, model_measures: pd.DataFrame
) -> None:
    """
    Captures dependencies for model measures.

    It runs a DAX query to retrieve dependency data for measures,
    then recursively traces dependencies for each measure using the provided model_measures DataFrame.
    The complete dependency mapping is saved to the Delta table for dependencies.
    """
    # Execute a DAX query to fetch dependency data for measures.
    model_measure_deps = fabric.evaluate_dax(
        dataset=context["source_model_uuid"],
        workspace=context["source_model_workspace_uuid"],
        dax_string="""
            EVALUATE
            FILTER(
                INFO.CALCDEPENDENCY("OBJECT_TYPE", "MEASURE"),
                [REFERENCED_OBJECT_TYPE] IN { "COLUMN", "MEASURE" }
            )
        """,
    )
    accum = []  # Initialize an empty list to accumulate dependency entries.
    # Iterate over each measure to start the recursive dependency search.
    for _, row in model_measures.iterrows():
        # Begin recursion for each measure using its own table and name as the root.
        find_dependencies_recursive(
            dependencies=model_measure_deps,
            root_table=row["TableName"],
            root_object=row["MeasureName"],
            ref_table=row["TableName"],
            ref_object=row["MeasureName"],
            level=1,  # Start at level 1.
            accum=accum,
        )
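    # Nothing to save if no dependencies were found; Spark cannot infer a schema from an empty list.
    if not accum:
        print("✅ No measure dependencies found. Skipping save.")
        return
    # Convert the accumulated dependency records to a Spark DataFrame and save to Delta table.
    save_dataframe_to_delta_table(
        data=spark.createDataFrame(accum),
        table_name=historical_table_names["dependencies"],
        context=context,
    )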
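
The recursive traversal above can be hard to picture without data. The sketch below applies the same idea to a small list of dictionaries instead of the DAX result: every dependency is recorded against the root measure, and referenced measures are followed one level deeper. The `trace_dependencies` function, the simplified keys (table names are omitted), and the sample measures are hypothetical.

```python
def trace_dependencies(dependencies, root, ref, level, accum):
    """Append every dependency of `ref`, following referenced measures recursively."""
    for dep in dependencies:
        if dep["object"] != ref:
            continue
        accum.append(
            {
                "ObjectName": root,
                "ReferencedObjectType": dep["ref_type"],
                "ReferencedObjectName": dep["ref_object"],
                "Level": level,
            }
        )
        # Only measures can have further dependencies of their own.
        if dep["ref_type"] == "MEASURE":
            trace_dependencies(dependencies, root, dep["ref_object"], level + 1, accum)


# Hypothetical rows: "Margin %" references the measure "Margin" and the column "Amount";
# "Margin" references the columns "Amount" and "Cost".
deps = [
    {"object": "Margin %", "ref_type": "MEASURE", "ref_object": "Margin"},
    {"object": "Margin %", "ref_type": "COLUMN", "ref_object": "Amount"},
    {"object": "Margin", "ref_type": "COLUMN", "ref_object": "Amount"},
    {"object": "Margin", "ref_type": "COLUMN", "ref_object": "Cost"},
]

result = []
trace_dependencies(deps, "Margin %", "Margin %", 1, result)
for row in result:
    print(row["Level"], row["ReferencedObjectType"], row["ReferencedObjectName"])
```

Columns reached through an intermediate measure appear at level 2, which lets later analysis distinguish direct from indirect references.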

### Processing Semantic Model Objects & Saving Mappings

In [None]:
@log_function_calls
def process_semantic_model_objects(
    model_objects: pd.DataFrame, object_type: str
) -> pd.DataFrame:
    """
    Standardizes the metadata for model objects (columns or measures) into a mapping DataFrame.

    For columns, it produces multiple formatting variants (e.g., quoted, unquoted, bracketed)
    to facilitate later matching. For measures, only a single representation is generated.

    Returns:
      A DataFrame with standardized object mapping information, including:
         - TableName: Original table name.
         - ObjectName: The base name (column or measure name).
         - ObjectType: 'COLUMN' or 'MEASURE'.
         - ModelObject: The variant string representation for matching.
    """

    def map_row(row: pd.Series) -> list:
        # Process based on whether the object is a column or a measure.
        if object_type == "COLUMN":
            tbl = row["TableName"]
            col_name = row["ColumnName"]
            # Generate different string variants for the column.
            quoted_variant = f"'{tbl}'[{col_name}]"   # e.g., 'TableName'[ColumnName]
            unquoted_variant = f"{tbl}[{col_name}]"  # Unquoted form, e.g., TableName[ColumnName]
            bracket_variant = f"[{tbl}].[{col_name}]"   # e.g., [TableName].[ColumnName]
            # If the table name includes spaces, avoid the unquoted variant.
            variants = (
                [quoted_variant, bracket_variant]
                if " " in tbl
                else [quoted_variant, unquoted_variant, bracket_variant]
            )
            obj_name = col_name
        else:
            # For measures, generate only one variant.
            measure = row["MeasureName"]
            variants = [f"[{measure}]"]  # Format measure as [MeasureName]
            obj_name = measure

        # Base dictionary holds common mapping fields.
        base = {
            "TableName": row["TableName"],
            "ObjectName": obj_name,
            "ObjectType": object_type,
        }
        # Return a list of dictionaries, one for each variant.
        return [{**base, "ModelObject": variant} for variant in variants]

    # For each row in the input DataFrame, apply map_row to produce a list of mapping dictionaries.
    mapped = [item for _, row in model_objects.iterrows() for item in map_row(row)]
    # Convert the list of mapping dictionaries into a pandas DataFrame.
    return pd.DataFrame(mapped)


@log_function_calls
def save_report_measure_mappings(distinct_objects: set, context: dict) -> None:
    """
    Saves new mappings for REPORT MEASURE objects.

    It parses each report measure string using regular expressions to extract:
      - The table name (from content within square brackets or single quotes).
      - The measure name (from a specific DAX pattern).
    These mappings are then saved to the object_mapping Delta table.
    """
    rows = []
    # Regular expression to capture table names enclosed in either square brackets or single quotes.
    table_pattern = re.compile(r"(?:\[(?P<name_bracket>[^\]]+)\]|'(?P<name_quote>[^']+)')")
    
    # Regular expression to capture the measure name from the expression pattern.
    measure_pattern = re.compile(r"\[([^\]]+)\]\s*=\s*\(\/\* USER DAX BEGIN \*\/")

    # Iterate over each report measure string in the distinct_objects set.
    for model_object in distinct_objects:
        tbl_match = table_pattern.search(model_object)
        measure_match = measure_pattern.search(model_object)
        # Extract the table name from the regex groups if a match is found; otherwise, set as None.
        if tbl_match:
            table_name = tbl_match.group("name_bracket") or tbl_match.group("name_quote")
        else:
            table_name = None

        # Append a mapping dictionary with the extracted values.
        rows.append(
            {
                "TableName": table_name,
                "ObjectName": measure_match.group(1) if measure_match else None,
                "ModelObject": model_object,
                "ObjectType": "REPORT MEASURE",
            }
        )
    # If mappings were found, convert them to a DataFrame and save to the Delta table.
    if rows:
        df_report = pd.DataFrame(rows)
        save_dataframe_to_delta_table(
            data=df_report,
            table_name=historical_table_names["object_mapping"],
            context=context,
        )
        print(f"✅ Saved {len(rows)} REPORT MEASURE mappings.")
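

# Illustrative sketch (not part of the original notebook): how the two patterns above split a
# report-level measure definition. The sample strings in the comments are an assumed shape of
# what the KQL extract_all step returns; the helper name `_sketch_parse_report_measure` is
# hypothetical.
import re


def _sketch_parse_report_measure(model_object: str) -> tuple:
    table_pattern = re.compile(r"(?:\[(?P<name_bracket>[^\]]+)\]|'(?P<name_quote>[^']+)')")
    measure_pattern = re.compile(r"\[([^\]]+)\]\s*=\s*\(/\* USER DAX BEGIN \*/")
    tbl = table_pattern.search(model_object)
    msr = measure_pattern.search(model_object)
    table_name = (tbl.group("name_bracket") or tbl.group("name_quote")) if tbl else None
    return table_name, (msr.group(1) if msr else None)


# _sketch_parse_report_measure("'Sales'[Total Amount] = (/* USER DAX BEGIN */ SUM(Sales[Amount]) /* USER DAX END */)")
# returns ("Sales", "Total Amount").
# Caveat: with an unquoted table name such as "Sales[Total] = (/* USER DAX BEGIN */ ...", the first
# bracketed token is the measure itself, so the table name comes back as "Total".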

### Capture and Process Logs
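
The collector in the cell below queries each block of missing hours with the largest allowed interval first and falls back to smaller intervals when the main query returns a different row count than the test query. The following is a minimal, self-contained sketch of that fallback, not the notebook's implementation: `rows_per_hour` and `row_limit` are hypothetical stand-ins for real Kusto queries, and a query is treated as truncated once its interval would hold more than `row_limit` rows.

```python
from datetime import datetime, timedelta

ALLOWED_INTERVALS = list(range(24, 0, -1))  # interval lengths in hours: 24, 23, ..., 1


def starting_index(block_hours: float) -> int:
    """Return the index of the largest allowed interval that fits into block_hours."""
    for i, hrs in enumerate(ALLOWED_INTERVALS):
        if block_hours >= hrs:
            return i
    return len(ALLOWED_INTERVALS) - 1


def collect(start, end, idx, rows_per_hour, row_limit, collected):
    """Simulate one interval query; split into smaller intervals when it is 'truncated'."""
    hours = (end - start).total_seconds() / 3600.0
    if hours * rows_per_hour <= row_limit:
        collected.append((start, end))  # simulated query succeeded for this interval
        return True
    if idx == len(ALLOWED_INTERVALS) - 1:
        return False  # already at the 1-hour minimum: give up on this interval
    step = timedelta(hours=ALLOWED_INTERVALS[idx + 1])
    ok, current = True, start
    while current < end:
        sub_end = min(current + step, end)
        sub_hours = (sub_end - current).total_seconds() / 3600.0
        # Each sub-interval restarts at the largest granularity that fits its length.
        if not collect(current, sub_end, starting_index(sub_hours),
                       rows_per_hour, row_limit, collected):
            ok = False
        current = sub_end
    return ok


day = datetime(2024, 1, 1)
collected = []
ok = collect(day, day + timedelta(hours=24), 0,
             rows_per_hour=100, row_limit=500, collected=collected)
print(ok, len(collected))  # prints: True 20
```

With these sample numbers the 24-hour block ends up as one 5-hour interval followed by nineteen 1-hour intervals, because each retry shrinks the step by only one hour.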

In [None]:
# Utility Functions

def find_starting_index(block_hours: float) -> int:
    """
    Finds the starting index for a given block length (in hours) based on a descending range.

    This function iterates over the numbers 24 down to 1 and returns the first index
    for which the provided block_hours is greater than or equal to the hour value.
    If no matching value is found, it returns 23 as a fallback.

    Args:
        block_hours (float): The length of the block in hours.

    Returns:
        int: The index corresponding to the block_hours.
    """
    # Loop through hours from 24 to 1 (inclusive), keeping track of the index.
    for i, hrs in enumerate(range(24, 0, -1)):
        # If the current block_hours is at least as large as the current hour value...
        if block_hours >= hrs:
            # ...return this index.
            return i
    # If no hour in the loop was less than or equal to block_hours, return the last index as fallback.
    return 23  


def group_missing_hours(missing_hours: list) -> list:
    """
    Groups contiguous missing hours into intervals.

    This function takes a list of missing hour values and groups them into continuous intervals.
    Each interval is returned as a tuple where the second value is one greater than the last hour
    (to represent an exclusive end).

    Args:
        missing_hours (list): A list of integer hour values that are missing.

    Returns:
        list: A list of tuples, each representing an interval (start_hour, end_hour_exclusive).
    """
    if not missing_hours:
        # If there are no missing hours, return an empty list.
        return []
    intervals = []
    # Initialize start and end with the first missing hour.
    start = missing_hours[0]
    end = missing_hours[0]
    # Loop through the remaining missing hours.
    for h in missing_hours[1:]:
        # If the current hour continues the sequence...
        if h == end + 1:
            # ...extend the current interval.
            end = h
        else:
            # Otherwise, append the current interval (with end as exclusive) and start a new one.
            intervals.append((start, end + 1))
            start = h
            end = h
    # Append the final interval.
    intervals.append((start, end + 1))
    return intervals
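

# Illustrative sketch (not part of the original notebook): the same grouping expressed with
# itertools.groupby. The helper name `_sketch_group_hours` is hypothetical. Unlike
# group_missing_hours, which relies on its input already being in ascending order, this
# variant sorts the hours first.
from itertools import groupby


def _sketch_group_hours(hours: list) -> list:
    intervals = []
    # Consecutive hours share the same (hour - position) difference, so groupby splits the runs.
    for _, run in groupby(enumerate(sorted(hours)), key=lambda pair: pair[1] - pair[0]):
        run_hours = [hour for _, hour in run]
        intervals.append((run_hours[0], run_hours[-1] + 1))  # exclusive end, as above
    return intervals


# _sketch_group_hours([0, 1, 2, 7, 8, 23]) returns [(0, 3), (7, 9), (23, 24)]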


def get_missing_hours_for_day(log_table, day_date) -> list:
    """
    Retrieves the missing hours for a given day from the log table.

    It collects existing hour entries for the given day from the log table and returns a list
    of hour values (0 to 23) that are missing.

    Args:
        log_table: A Spark DataFrame representing the log table.
        day_date: The date for which missing hours should be identified.

    Returns:
        list: A list of hour values (0-23) that are not present in the log_table.
    """
    existing = set()
    if log_table is not None:
        try:
            # Filter the log table for the specified day and extract distinct hours.
            existing = {
                row["AsOfHour"] for row in log_table.filter(col("AsOfDate") == day_date)
                .select("AsOfHour").distinct().collect()
            }
        except Exception as e:
            # Print a warning if there is an error during retrieval.
            print(f"⚠️ Error retrieving existing hours for {day_date}: {e}")
    # Return all hours that are not in the existing set.
    return [hour for hour in range(24) if hour not in existing]


def format_datetime(dt: datetime) -> str:
    """
    Formats a datetime object into a specific string format.

    The formatted string follows the pattern "datetime(YYYY-MM-DDTHH:MM:SS)".

    Args:
        dt (datetime): The datetime object to format.

    Returns:
        str: The formatted datetime string.
    """
    return dt.strftime("datetime(%Y-%m-%dT%H:%M:%S)")


def build_user_groups():
    """
    Constructs dynamic 'let' statements and 'case' conditions for user groups.

    The function iterates over a globally defined 'user_groups' dictionary (assumed to be defined
    elsewhere) and generates:
      - A list of 'let' statements that create dynamic arrays for each group.
      - A list of 'case' conditions to be used in KQL queries for mapping executing users to groups.

    Returns:
        tuple: Two strings, one with all let statements joined by newlines, and another with the case conditions joined by commas and newlines.
    """
    let_statements = []
    case_conditions = []
    # Iterate over each user group and its associated email list.
    for group, emails in user_groups.items():
        # Create a variable-friendly group name (remove spaces/dashes and convert to lowercase).
        grp_var = group.replace(" ", "").replace("-", "").lower()
        # Join the email addresses into a single string, each email wrapped in single quotes.
        emails_str = ", ".join(f"'{email}'" for email in emails)
        # Append the let statement for the current group.
        let_statements.append(f"let {grp_var} = dynamic([{emails_str}]);")
        # Append the corresponding case condition to check if the executing user belongs to this group.
        case_conditions.append(f'ExecutingUser in ({grp_var}), "{group}"')
    # Return the complete let statements and case conditions as joined strings.
    return "\n".join(let_statements), ",\n".join(case_conditions)
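

# Illustrative sketch (not part of the original notebook): a parameterized variant of
# build_user_groups that takes the mapping as an argument instead of reading the global
# `user_groups`, which makes the generated KQL fragments easy to inspect. The helper name
# and the sample group and e-mail values are hypothetical.
def _sketch_build_user_groups(groups: dict) -> tuple:
    let_statements, case_conditions = [], []
    for group, emails in groups.items():
        grp_var = group.replace(" ", "").replace("-", "").lower()
        emails_str = ", ".join(f"'{email}'" for email in emails)
        let_statements.append(f"let {grp_var} = dynamic([{emails_str}]);")
        case_conditions.append(f'ExecutingUser in ({grp_var}), "{group}"')
    return "\n".join(let_statements), ",\n".join(case_conditions)


# For {"Finance Team": ["a@contoso.com", "b@contoso.com"]} the two returned strings are:
#   let financeteam = dynamic(['a@contoso.com', 'b@contoso.com']);
#   ExecutingUser in (financeteam), "Finance Team"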


def get_log_table(table_key: str, filter_expr):
    """
    Retrieve a Spark table using a table key and apply a filter expression.

    The function attempts to load a table (using a global mapping 'historical_table_names' with table_key)
    and apply the provided filter. If the table is not accessible, it returns None.

    Args:
        table_key (str): The key to look up the table name in historical_table_names.
        filter_expr: A Spark SQL filter expression to apply to the table.

    Returns:
        The filtered Spark DataFrame if accessible, otherwise None.
    """
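    try:
        # Resolve the table name via historical_table_names (falling back to the key itself),
        # then load the Spark table.
        table = spark.table(historical_table_names.get(table_key, table_key))
        # Apply the filter only when one is supplied.
        if filter_expr is not None:
            table = table.filter(filter_expr)
        return table
    except Exception as e:
        # Print a warning if the table cannot be accessed.
        print(f"⚠️ Table {table_key} not accessible ({e}); proceeding without prior data.")
        return None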


# Base Query Handler Interface
class BaseQueryHandler:
    """
    Abstract base class for query handlers.

    This class defines the interface for any query handler, requiring the implementation of
    methods to generate test queries, main queries, and process query results.
    """
    def generate_test_query(self, start_ts: datetime, end_ts: datetime) -> str:
        raise NotImplementedError

    def generate_main_query(self, start_ts: datetime, end_ts: datetime) -> str:
        raise NotImplementedError

    def process_result(self, main_result, start_ts: datetime) -> None:
        raise NotImplementedError


# QueryLogCollector Class
class QueryLogCollector:
    """
    Handles the collection and processing of log queries over time intervals.

    This class divides a given time range into intervals, executes queries for each interval,
    and processes the results. It supports retries with smaller intervals in case of mismatches.
    """
    ALLOWED_INTERVALS = list(range(24, 0, -1))  # Allowed query interval lengths (hours)

    def __init__(self, context: dict):
        # Store the execution context which contains configurations and connection details.
        self.context = context

    @staticmethod
    @log_function_calls
    @retry(exceptions=(Exception,), num_retries=2, initial_delay=30, backoff_factor=2, logger=print)
    def execute_query(context: dict, kql_query: str):
        """
        Executes a Kusto Query Language (KQL) query using the provided context.

        The function retrieves an access token, builds a Spark DataFrame query using the KQL query,
        and returns the result. It prints success or failure messages accordingly.

        Args:
            context (dict): Context containing Kusto connection details.
            kql_query (str): The KQL query string to execute.

        Returns:
            The Spark DataFrame containing the query result.
        """
        try:
            # Retrieve the access token for authentication.
            access_token = mssparkutils.credentials.getToken(context["log_analytics_kusto_uri"])
            # Build the Spark DataFrame reader for the Kusto data source with required options.
            result = (spark.read.format("com.microsoft.kusto.spark.synapse.datasource")
                      .option("accessToken", access_token)
                      .option("kustoCluster", context["log_analytics_kusto_uri"])
                      .option("kustoDatabase", context["log_analytics_kusto_database"])
                      .option("kustoQuery", kql_query)
                      .load())
            print("✅ KQL query executed successfully.")
            return result
        except Exception:
            print("❌ Failed to execute KQL query.")
            raise

    def process_sub_intervals(self, start_ts: datetime, end_ts: datetime, new_idx: int,
                              generate_test_query, generate_main_query, process_result) -> bool:
        """
        Processes sub-intervals within a given time range by dividing it into smaller chunks.

        It iterates over the time interval using the granularity defined by ALLOWED_INTERVALS
        at the index new_idx, executes queries for each sub-interval, and processes the results.
        If any sub-interval fails, it returns False.

        Args:
            start_ts (datetime): Start timestamp of the interval.
            end_ts (datetime): End timestamp of the interval.
            new_idx (int): Current index into the ALLOWED_INTERVALS list representing granularity.
            generate_test_query: Function to generate a test query.
            generate_main_query: Function to generate the main query.
            process_result: Function to process the query result.

        Returns:
            bool: True if all sub-interval queries succeed, False otherwise.
        """
        # Get the new granularity (hours) from the allowed intervals list.
        new_granularity = QueryLogCollector.ALLOWED_INTERVALS[new_idx]
        success = True
        current = start_ts
        # Loop over the overall interval in steps of the granularity.
        while current < end_ts:
            # Calculate the end of the current sub-interval.
            sub_end = current + pd.Timedelta(hours=new_granularity)
            # Ensure the sub-interval does not exceed the overall end time.
            if sub_end > end_ts:
                sub_end = end_ts
            # Calculate the length of the current sub-interval in hours.
            sub_length = (sub_end - current).total_seconds() / 3600.0
            # Determine the starting index based on the sub-interval length.
            sub_idx = find_starting_index(sub_length)
            print(f"🔍 Querying sub-interval {current} to {sub_end}...")
            # Attempt to execute the query for the sub-interval.
            if not self.attempt_interval_query(current, sub_end, sub_idx,
                                               generate_test_query, generate_main_query, process_result):
                success = False
            # Move to the next sub-interval.
            current = sub_end
        return success

    def attempt_interval_query(self, start_ts: datetime, end_ts: datetime, current_idx: int,
                               generate_test_query, generate_main_query, process_result) -> bool:
        """
        Attempts to execute queries for a given time interval and verify that the results match.

        It first runs a test query to check if any rows are returned. If rows exist, it then runs
        the main query. If the row count between test and main queries does not match, it retries
        with a smaller interval (if possible). If the minimum granularity is reached and the query still fails,
        it skips the interval.

        Args:
            start_ts (datetime): Start timestamp of the interval.
            end_ts (datetime): End timestamp of the interval.
            current_idx (int): Current index in the allowed intervals indicating query granularity.
            generate_test_query: Function to generate a test query.
            generate_main_query: Function to generate the main query.
            process_result: Function to process the query result.

        Returns:
            bool: True if the query for the interval succeeds, False otherwise.
        """
        try:
            print(f"ℹ️ Sending test query for interval {start_ts} to {end_ts}...")
            # Generate and execute the test query.
            test_query = generate_test_query(start_ts, end_ts)
            test_result = self.execute_query(self.context, test_query)
            # Retrieve the first row of the test result to check the total count.
            test_row = test_result.first()
            test_count = 0 if test_row is None else test_row["totalCount"]

            # If the test query returns zero rows, skip further processing for this interval.
            if test_count == 0:
                print(f"⚠️ Test query returned 0 rows for interval {start_ts} to {end_ts}. Skipping...")
                return True

            print(f"ℹ️ Sending main query for interval {start_ts} to {end_ts}...")
            # Generate and execute the main query.
            main_query = generate_main_query(start_ts, end_ts)
            main_result = self.execute_query(self.context, main_query)
            main_count = main_result.count()

            # Ensure that the number of rows from the main query matches the test query.
            if main_count != test_count:
                raise Exception(f"Query result truncated: main_count ({main_count}) != test_count ({test_count})")
            else:
                print(f"✅ Query results match: {main_count} rows.")
            # Process the successful query result.
            process_result(main_result, start_ts)
            return True
        except Exception as e:
            # Log the error with details about the interval and granularity.
            print(f"❌ Interval {start_ts} to {end_ts} at granularity {QueryLogCollector.ALLOWED_INTERVALS[current_idx]}h failed. Error: {e}")
            # If we are at the minimum granularity, skip this interval.
            if current_idx == len(QueryLogCollector.ALLOWED_INTERVALS) - 1:
                print(f"❌ Minimum granularity reached for {start_ts} to {end_ts}; skipping interval.")
                return False
            else:
                # Otherwise, try with a smaller interval (next index in ALLOWED_INTERVALS).
                new_idx = current_idx + 1
                print(f"⚠️ Retrying with smaller interval ({QueryLogCollector.ALLOWED_INTERVALS[new_idx]}h) for {start_ts} to {end_ts}...")
                return self.process_sub_intervals(start_ts, end_ts, new_idx,
                                                  generate_test_query, generate_main_query, process_result)

    def process_query_intervals(self, log_table, days_range, query_handler: BaseQueryHandler, label: str, apply_cutoff=True):
        """
        Processes multiple query intervals across a range of days.

        For each day in the days_range, this method determines the missing hours based on the log_table,
        groups contiguous missing hours into intervals, and attempts to collect query logs using the provided
        query_handler.

        Args:
            log_table: A Spark DataFrame representing the log table.
            days_range: An iterable of integers representing days ago to process.
            query_handler (BaseQueryHandler): An instance of a query handler to generate and process queries.
            label (str): A label for logging purposes (e.g., "detailed logs").
            apply_cutoff (bool): Whether to apply a cutoff for the most recent day.
        """
        # Process each day in the provided range.
        for days_ago in days_range:
            # Calculate the start of the day (midnight) adjusted for days ago.
            day_start = self.context["as_of_datetime"].replace(hour=0, minute=0, second=0, microsecond=0) - relativedelta(days=days_ago)
            day_date = day_start.date()
            # Determine missing hours for the day.
            missing_hours = get_missing_hours_for_day(log_table, day_date) if log_table is not None else list(range(24))
            # If applying cutoff for the most recent day, adjust missing hours accordingly.
            if apply_cutoff and days_ago == 1:
                cutoff_hour = int(((self.context["as_of_datetime"] - timedelta(hours=min_hours_before_current) - day_start).total_seconds()) // 3600)
                missing_hours = [h for h in missing_hours if h < cutoff_hour]
                if not missing_hours:
                    print(f"✅ All eligible {label} collected for {day_date} after cutoff.")
                    continue
            # If there are no missing hours, move to the next day.
            if not missing_hours:
                print(f"✅ All {label} collected for {day_date}.")
                continue
            # Group missing hours into continuous intervals.
            intervals = group_missing_hours(missing_hours)
            # Process each missing interval.
            for start_hr, end_hr in intervals:
                # Calculate the start and end datetime for the current interval.
                interval_start = day_start + pd.Timedelta(hours=start_hr)
                interval_end = day_start + pd.Timedelta(hours=end_hr)
                # Compute the block length in hours.
                block_length = (interval_end - interval_start).total_seconds() / 3600.0
                # Determine the appropriate index based on the block length.
                idx = find_starting_index(block_length)
                print(f"🔍 Collecting {label} for {day_date} from hour {start_hr} to {end_hr} (block length: {block_length}h)...")
                # Attempt to query the current interval.
                self.attempt_interval_query(interval_start, interval_end, idx,
                                            query_handler.generate_test_query, query_handler.generate_main_query, query_handler.process_result)


# Query Handler Implementations

class ObjectCountQueryHandler(BaseQueryHandler):
    """
    Query handler for collecting query counts for model objects (columns, measures, or REPORT MEASUREs).

    This handler builds a KQL query tailored to the object type, executes the query, saves the results,
    and collects any distinct REPORT MEASURE strings.
    """
    def __init__(self, context: dict, object_type: str, model_objects_df):
        self.context = context
        self.object_type = object_type
        self.model_objects_df = model_objects_df
        # Set to keep track of distinct REPORT MEASURE values encountered.
        self.distinct_report_measures = set()

    def _build_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Builds the KQL query string for collecting query counts.

        Depending on the object type, the query will include different clauses to extract or expand model objects.
        It also integrates user group information into the query.

        Args:
            start_ts (datetime): Start timestamp for the query.
            end_ts (datetime): End timestamp for the query.

        Returns:
            str: The constructed KQL query.
        """
        # Build user groups let statements and case conditions.
        let_statements, case_conditions = build_user_groups()
        if self.object_type == "REPORT MEASURE":
            # For REPORT MEASUREs, extend the query to extract the measure details.
            model_extend = """
            | extend ModelObject = extract_all(@"MEASURE (.*?\\/\\* USER DAX END \\*\\/\\))", EventText)
            | mv-expand ModelObject
            | where isnotempty(ModelObject)
            """
        else:
            # For other object types, if model objects data is provided, add them as a dynamic array.
            if self.model_objects_df is not None:
                objs_list = ", ".join(f'"{obj}"' for obj in self.model_objects_df["ModelObject"])
                let_statements += f"\nlet modelObjects = dynamic([{objs_list}]);"
            # Use mv-apply to match EventText against each model object.
            model_extend = """
            | mv-apply ModelObject = modelObjects to typeof(string) on (
                extend Matched = iff(EventText contains_cs ModelObject, true, false)
                | where Matched
                | project-away Matched
            )
            """
        # Format the start and end timestamps for the query.
        start_str = format_datetime(start_ts)
        end_str = format_datetime(end_ts)
        # Build the complete query string.
        query = f'''
        let model_uuid = "{self.context["source_model_uuid"]}";
        {let_statements}
        SemanticModelLogs
            | where Timestamp between ({start_str} .. {end_str})
            | where ItemId == model_uuid
            | where OperationName == "QueryBegin"
            | extend ReportId = extract_json("$.Sources[0].ReportId", tostring(parse_xml(XmlaProperties)["PropertyList"]["ApplicationContext"]), typeof(string))
            | extend AsOfHour = datetime_part("hour", Timestamp)
            | project AsOfHour, ExecutingUser, EventText, ReportId
            {model_extend}
            | extend ExecutingUserGroup =
                case(
                    {case_conditions},
                    "{default_user_group}"
                )
            | summarize QueryCount = count() by tostring(ModelObject), AsOfHour, ExecutingUserGroup, ReportId
        '''.strip().replace("\n", " ")
        return query

    def generate_test_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Generates a test query that summarizes the total count.

        Args:
            start_ts (datetime): Start timestamp for the test query.
            end_ts (datetime): End timestamp for the test query.

        Returns:
            str: The KQL test query string.
        """
        # Append a summarize clause to count total rows.
        return self._build_query(start_ts, end_ts) + " | summarize totalCount = count()"

    def generate_main_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Generates the main query used to fetch detailed query counts.

        Args:
            start_ts (datetime): Start timestamp for the main query.
            end_ts (datetime): End timestamp for the main query.

        Returns:
            str: The KQL main query string.
        """
        return self._build_query(start_ts, end_ts)

    def process_result(self, main_result, start_ts: datetime) -> None:
        """
        Processes the main query result.

        It saves the result to a Delta table and updates the distinct_report_measures set for REPORT MEASUREs.

        Args:
            main_result: The Spark DataFrame returned from executing the main query.
            start_ts (datetime): The starting timestamp of the query interval.
        """
        save_dataframe_to_delta_table(
            data=main_result,
            table_name=historical_table_names["object_query_count"],
            context=self.context,
            AsOfDate=start_ts.date(),
            AsOfDateTime=start_ts,
            ObjectType=self.object_type,
        )
        # If processing REPORT MEASUREs, collect distinct model objects.
        if self.object_type == "REPORT MEASURE":
            for row in main_result.select("ModelObject").distinct().collect():
                if row["ModelObject"]:
                    self.distinct_report_measures.add(row["ModelObject"])


class DetailedLogsQueryHandler(BaseQueryHandler):
    """
    Query handler for capturing detailed logs from semantic model operations.

    This handler builds queries to fetch both query begin and query end logs,
    joins them together, processes the results (including user name masking), and collects ReportIds.
    """
    def __init__(self, context: dict):
        self.context = context
        # Set to collect unique ReportIds.
        self.report_ids = set()
        # Dictionary to store historical mapping for principal names.
        self.historical_mapping = {}
        # Lock to prevent concurrent modifications of the historical mapping.
        self.mapping_lock = threading.Lock()

    def _build_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Builds the KQL query string for detailed logs.

        The query fetches logs for both "QueryEnd" (and "Error") and "QueryBegin" events,
        then joins them together and applies user group mapping.

        Args:
            start_ts (datetime): Start timestamp for the query.
            end_ts (datetime): End timestamp for the query.

        Returns:
            str: The constructed KQL query string.
        """
        let_statements, case_conditions = build_user_groups()
        # Convert timestamps to ISO format strings.
        start_iso = start_ts.isoformat(timespec="seconds")
        end_iso = end_ts.isoformat(timespec="seconds")
        # For QueryBegin, extend the window to one day before start_ts.
        query_begin_start_iso = (start_ts - timedelta(days=1)).isoformat(timespec="seconds")
        query = f'''
        let model_uuid = "{self.context["source_model_uuid"]}";
        {let_statements}
        let base_data =
            SemanticModelLogs
            | where ItemId == model_uuid;
        let query_end = base_data
            | where Timestamp between (datetime({start_iso}) .. datetime({end_iso}))
            | where OperationName in ("Error", "QueryEnd")
            | project Timestamp, OperationName, OperationDetailName, OperationId, XmlaSessionId, ExecutingUser, DurationMs, CpuTimeMs, EventText, Status, StatusCode;
        let query_begin = base_data
            | where Timestamp between (datetime({query_begin_start_iso}) .. datetime({end_iso}))
            | where OperationName == "QueryBegin"
            | extend ActivityId = tostring(parse_xml(XmlaProperties)["PropertyList"]["DbpropMsmdActivityID"])
            | extend RequestId = tostring(parse_xml(XmlaProperties)["PropertyList"]["DbpropMsmdRequestID"])
            | extend CurrentActivityId = tostring(parse_xml(XmlaProperties)["PropertyList"]["DbpropMsmdCurrentActivityID"])
            | extend ReportId = extract_json("$.Sources[0].ReportId", tostring(parse_xml(XmlaProperties)["PropertyList"]["ApplicationContext"]), typeof(string))
            | distinct ActivityId, RequestId, CurrentActivityId, ReportId, OperationId, XmlaSessionId;
        query_end
        | join kind=leftouter (query_begin) on OperationId, XmlaSessionId
        | extend ExecutingUserGroup =
            case(
                    {case_conditions},
                    "{default_user_group}"
                )
        | extend AsOfHour = datetime_part("hour", Timestamp)
        | project Timestamp, AsOfHour, OperationName, OperationDetailName, ReportId, ExecutingUser, ExecutingUserGroup, DurationMs, CpuTimeMs, EventText, OperationId, XmlaSessionId, ActivityId, RequestId, CurrentActivityId, Status, StatusCode
        '''.strip().replace("\n", " ")
        return query

    def generate_test_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Generates a test query for detailed logs that summarizes the total count.

        Args:
            start_ts (datetime): Start timestamp for the test query.
            end_ts (datetime): End timestamp for the test query.

        Returns:
            str: The KQL test query string.
        """
        return self._build_query(start_ts, end_ts) + " | summarize totalCount = count()"

    def generate_main_query(self, start_ts: datetime, end_ts: datetime) -> str:
        """
        Generates the main query for capturing detailed logs.

        Args:
            start_ts (datetime): Start timestamp for the main query.
            end_ts (datetime): End timestamp for the main query.

        Returns:
            str: The KQL main query string.
        """
        return self._build_query(start_ts, end_ts)

    def process_result(self, main_result, start_ts: datetime) -> None:
        """
        Processes the detailed logs query result.

        Depending on configuration, it may mask user names, save the results to a Delta table,
        and extract distinct ReportIds from the result.

        Args:
            main_result: The Spark DataFrame containing the query results.
            start_ts (datetime): The starting timestamp of the query interval.
        """
        # Optionally mask user names based on the configuration setting.
        if collect_principal_names == 2:
            # Completely mask executing user names.
            main_result = main_result.withColumn("ExecutingUser", lit("Masked"))
        elif collect_principal_names == 1:
            try:
                # Retrieve a list of distinct executing users.
                new_users = [row[0].strip().lower() for row in main_result.select("ExecutingUser").distinct().collect() if row[0]]
            except Exception as e:
                print(f"❌ Error retrieving distinct users: {e}")
                new_users = []
            with self.mapping_lock:
                # Update the historical mapping for any new users.
                for user in new_users:
                    if user and user not in self.historical_mapping:
                        self.historical_mapping[user] = str(uuid4())
            # Broadcast the mapping to all Spark workers.
            broadcast_map = spark.sparkContext.broadcast(self.historical_mapping)
            # Define a UDF to mask user names using the broadcasted mapping.
            def mask_user(actual):
                return broadcast_map.value.get(actual.strip().lower() if actual else None, actual)
            mask_udf = udf(mask_user, StringType())
            # Apply the UDF to mask the ExecutingUser column.
            main_result = main_result.withColumn("ExecutingUser", mask_udf(col("ExecutingUser")))
        # Save the processed detailed logs to the Delta table.
        save_dataframe_to_delta_table(
            data=main_result,
            table_name=historical_table_names["detailed_logs"],
            context=self.context,
            AsOfDate=start_ts.date(),
            AsOfDateTime=start_ts,
        )
        try:
            # Extract distinct ReportIds from the result.
            distinct_ids = set(main_result.select("ReportId").rdd.flatMap(lambda x: x).collect())
            self.report_ids.update(distinct_ids)
            print(f"✅ Collected {len(distinct_ids)} ReportIds for interval starting at {start_ts}.")
        except Exception as e:
            print(f"❌ Failed to extract ReportIds for interval starting at {start_ts}: {e}")


# Generic Function for Processing Query Collections
def process_query_collection(context: dict, table_key: str, query_handler: BaseQueryHandler, label: str, apply_cutoff=True, filter_expr=None, override_context: dict = None):
    """
    Retrieves the appropriate log table, instantiates a QueryLogCollector,
    and processes query intervals using the specified query handler.

    Args:
        context (dict): The context containing configuration and connection details.
        table_key (str): Key to identify the table in historical_table_names.
        query_handler (BaseQueryHandler): An instance of a query handler for building and processing queries.
        label (str): A label used for logging and messaging.
        apply_cutoff (bool): Flag to indicate if cutoff logic should be applied.
        filter_expr: A Spark SQL expression to filter the log table.
        override_context (dict): Optional dictionary to override parts of the context.

    Returns:
        None
    """
    # Create a copy of the context to avoid modifying the original.
    ctx = context.copy()
    if override_context:
        # Update the context with any override values.
        ctx.update(override_context)
    if filter_expr is None:
        # Use a default filter expression based on the source model UUID.
        filter_expr = col("ModelUuid") == ctx["source_model_uuid"]
    # Retrieve the log table using the provided key and filter.
    log_table = get_log_table(historical_table_names[table_key], filter_expr)
    # Instantiate the query log collector.
    qcollector = QueryLogCollector(ctx)
    # Process query intervals over a range of days.
    qcollector.process_query_intervals(log_table, range(1, max_days_ago_to_collect + 1), query_handler, label, apply_cutoff)


# Capture Functions Using the Refactored Logic

@log_function_calls
def capture_query_counts_by_object(context: dict, object_type: str, model_objects_df) -> None:
    """
    Captures query counts for a specific object type by processing the query collection.

    It sets up a filter expression for the model UUID and object type,
    instantiates the ObjectCountQueryHandler, and processes the queries.
    For REPORT MEASURE objects, it also saves any new mappings discovered.

    Args:
        context (dict): The execution context.
        object_type (str): The type of object ("COLUMN", "MEASURE", "REPORT MEASURE").
        model_objects_df: DataFrame containing model objects; can be None for REPORT MEASURE.

    Returns:
        None
    """
    filter_expr = (col("ModelUuid") == context["source_model_uuid"]) & (col("ObjectType") == object_type)
    query_handler = ObjectCountQueryHandler(context, object_type, model_objects_df)
    process_query_collection(context, "object_query_count", query_handler, f"{object_type} query counts", True, filter_expr)
    # For REPORT MEASURE, save new mappings if any were discovered.
    if object_type == "REPORT MEASURE":
        if query_handler.distinct_report_measures:
            print(f"📝 Saving {len(query_handler.distinct_report_measures)} new REPORT MEASURE mappings...")
            try:
                save_report_measure_mappings(query_handler.distinct_report_measures, context)
            except Exception as e:
                print(f"❌ Failed to save REPORT MEASURE mappings: {e}")
        else:
            print("ℹ️ No new REPORT MEASUREs discovered.")


@log_function_calls
def capture_detailed_logs(context: dict) -> set:
    """
    Captures detailed logs from the semantic model and returns a set of ReportIds.

    This function optionally builds a historical mapping for principal names,
    processes detailed logs, and applies masking to older records if configured.

    Args:
        context (dict): The execution context.

    Returns:
        set: A set of ReportIds captured from the detailed logs.
    """
    try:
        # If principal names are pseudonymized (collect_principal_names == 1),
        # rebuild the existing user-to-pseudonym mapping from logs over the past 30 days
        # so that previously seen users keep the same masked value.
        if collect_principal_names == 1:
            start_hist = (context["as_of_datetime"] - relativedelta(days=30)).replace(minute=0, second=0, microsecond=0)
            end_hist = (context["as_of_datetime"] - relativedelta(days=1)).replace(minute=0, second=0, microsecond=0)
            def gen_hist_main(s, e):
                return f"""
                    let startTime = {s.strftime("datetime(%Y-%m-%dT%H:%M:%S)")};
                    let endTime = {e.strftime("datetime(%Y-%m-%dT%H:%M:%S)")};
                    SemanticModelLogs
                    | where Timestamp between (startTime .. endTime)
                    | where OperationName in ("Error", "QueryEnd")
                    | distinct XmlaSessionId, ExecutingUser
                """
            def make_test_query(query_func):
                return lambda s, e: query_func(s, e) + " | summarize totalCount = count()"
            hist_results = []
            def process_hist(main_result, start_ts):
                hist_results.append(main_result.toPandas())
            total_hours = (end_hist - start_hist).total_seconds() / 3600.0
            start_idx = find_starting_index(total_hours)
            qcollector = QueryLogCollector(context)
            qcollector.attempt_interval_query(start_hist, end_hist, start_idx, make_test_query(gen_hist_main), gen_hist_main, process_hist)
            hist_kql_pd = pd.concat(hist_results, ignore_index=True) if hist_results else pd.DataFrame(columns=["XmlaSessionId", "ExecutingUser"])
            try:
                hist_logs_df = spark.table(historical_table_names["detailed_logs"]).select("XmlaSessionId", "ExecutingUser").distinct()
                hist_logs_pd = hist_logs_df.toPandas()
            except Exception:
                print("⚠️ detailed_logs table missing; using empty historical data.")
                hist_logs_pd = pd.DataFrame(columns=["XmlaSessionId", "ExecutingUser"])
            if not hist_kql_pd.empty and not hist_logs_pd.empty:
                hist_kql = hist_kql_pd.rename(columns={"ExecutingUser": "ActualUser"})
                hist_kql["ActualUser"] = hist_kql["ActualUser"].apply(lambda x: x.strip().lower() if x else None)
                hist_logs = hist_logs_pd.rename(columns={"ExecutingUser": "MaskedUser"})
                merged = pd.merge(hist_kql, hist_logs, on="XmlaSessionId", how="inner")
                merged["NormalizedActualUser"] = merged["ActualUser"].apply(lambda x: x.strip().lower() if x else None)
                merged = merged.drop_duplicates(subset=["NormalizedActualUser"])
                historical_mapping = { row["NormalizedActualUser"]: row["MaskedUser"] for _, row in merged.iterrows() if pd.notnull(row["NormalizedActualUser"]) }
            else:
                historical_mapping = {}
        else:
            historical_mapping = {}
    except Exception as e:
        print(f"⚠️ Historical mapping build failed: {e}")
        historical_mapping = {}

    query_handler = DetailedLogsQueryHandler(context)
    query_handler.historical_mapping = historical_mapping
    process_query_collection(context, "detailed_logs", query_handler, "detailed logs")
    # After collecting detailed logs, optionally mask principal names for records older than a configured number of days.
    if mask_principal_names_after_days > 0:
        cutoff_date = context["as_of_date"] - timedelta(days=mask_principal_names_after_days)
        update_query = f"""
            UPDATE {historical_table_names["detailed_logs"]}
            SET ExecutingUser = 'Masked'
            WHERE AsOfDate < '{cutoff_date}'
        """
        try:
            spark.sql(update_query)
            print(f"✅ Masked user names for records older than {mask_principal_names_after_days} days (before {cutoff_date}).")
        except Exception as e:
            print(f"❌ Failed to mask user names for historical detailed logs: {e}")
    print(f"✅ capture_detailed_logs complete. Total ReportIds: {len(query_handler.report_ids)}")
    return query_handler.report_ids


@log_function_calls
def capture_logs_and_mappings(context: dict, model_columns: pd.DataFrame, model_measures: pd.DataFrame) -> set:
    """
    Orchestrates the capture of various logs and object mappings.

    It processes model columns, measures, REPORT MEASUREs, and detailed logs.
    If required context keys are missing, it raises an error.
    Returns a consolidated set of ReportIds collected from the detailed logs.

    Args:
        context (dict): The execution context containing necessary configurations.
        model_columns (pd.DataFrame): DataFrame containing model column definitions.
        model_measures (pd.DataFrame): DataFrame containing model measure definitions.

    Returns:
        set: A set of ReportIds collected from the logs.
    """
    required = {"source_model_uuid", "source_model_workspace_uuid", "source_model_name", "log_analytics_kusto_uri", "log_analytics_kusto_database"}
    missing = required - context.keys()
    if missing:
        raise KeyError(f"❌ Missing context keys: {', '.join(missing)}")
    
    report_ids = set()
    try:
        print("📊 Processing columns...")
        processed_cols = process_semantic_model_objects(model_columns, "COLUMN")
        save_dataframe_to_delta_table(
            data=processed_cols,
            table_name=historical_table_names["object_mapping"],
            context=context,
        )
        capture_query_counts_by_object(context, "COLUMN", processed_cols)
    except Exception as e:
        print(f"❌ Columns processing failed. Error: {e}")
    
    try:
        print("📊 Processing measures...")
        processed_measures = process_semantic_model_objects(model_measures, "MEASURE")
        save_dataframe_to_delta_table(
            data=processed_measures,
            table_name=historical_table_names["object_mapping"],
            context=context,
        )
        capture_query_counts_by_object(context, "MEASURE", processed_measures)
    except Exception as e:
        print(f"❌ Measures processing failed. Error: {e}")
    
    try:
        print("📊 Processing REPORT MEASUREs...")
        capture_query_counts_by_object(context, "REPORT MEASURE", None)
    except Exception as e:
        print(f"❌ REPORT MEASURE processing failed. Error: {e}")
    
    try:
        print("📊 Processing detailed logs...")
        detailed_ids = capture_detailed_logs(context)
        if detailed_ids:
            report_ids.update(detailed_ids)
            print(f"✅ Updated ReportIds with {len(detailed_ids)} detailed ReportIds.")
        else:
            print("⚠️ No detailed ReportIds captured.")
    except Exception as e:
        print(f"❌ Detailed log capture failed. Error: {e}")

    return report_ids

### Capturing Unused Delta Table Columns & Source Mappings
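
The core of this step is a per-table set difference: the columns stored in a Delta table minus the source columns the model references. The snippet below is a minimal sketch in plain Python rather than the notebook's Spark and TOM code; `find_unused_columns` and the sample table and column names are hypothetical and used only for illustration.

```python
def find_unused_columns(delta_columns: dict, model_source_columns: dict) -> list:
    """Return one row per column that exists in the datastore but is not referenced by the model."""
    rows = []
    for table, cols in sorted(delta_columns.items()):
        # Tables the model does not reference at all contribute every column.
        used = model_source_columns.get(table, set())
        for column in sorted(set(cols) - set(used)):
            rows.append({"SourceTableName": table, "SourceColumnName": column})
    return rows


# Hypothetical inputs: columns per Delta table, and source columns per model table.
delta_columns = {"sales": {"id", "amount", "legacy_code"}, "dates": {"date_key"}}
model_source_columns = {"sales": {"id", "amount"}, "dates": {"date_key"}}

unused = find_unused_columns(delta_columns, model_source_columns)
print(unused)
```

In this example only `legacy_code` in `sales` is reported, because every other column is mapped to a model column.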

In [None]:
@log_function_calls
def capture_unused_delta_columns(context: dict) -> None:
    """
    Captures unused columns from the Delta tables by comparing datastore metadata with model usage.

    If datastore information is missing (i.e., no datastore name provided), the function inserts placeholder rows.
    Otherwise, it connects to the datastore to retrieve column metadata from Delta tables, compares it to the
    columns actually used in the model (obtained via the Tabular Object Model (TOM)), and saves the differences
    (unused columns) to designated Delta tables.

    If no unused columns are found, a placeholder N/A record is written to the unused columns table.
    """
    # Define the required context keys for retrieving datastore and model metadata.
    required_keys = [
        "source_datastore_name",
        "source_datastore_workspace_uuid",
        "source_datastore_uuid",
        "source_model_uuid",
        "source_model_workspace_uuid",
    ]
    # Ensure all required keys are present; if not, raise a KeyError.
    for key in required_keys:
        if key not in context:
            raise KeyError(f"❌ Missing required context key: '{key}'")
            
    # If datastore details are missing, insert placeholder rows into the target Delta tables and exit.
    if not context["source_datastore_name"]:
        save_dataframe_to_delta_table(
            data=spark.createDataFrame(
                [
                    {
                        "TableName": "N/A",
                        "SourceTableName": "N/A",
                        "SourceColumnName": "N/A",
                    }
                ]
            ),
            table_name=historical_table_names["unused_columns"],
            context=context,
        )
        save_dataframe_to_delta_table(
            data=spark.createDataFrame(
                [
                    {
                        "TableName": "N/A",
                        "ColumnName": "N/A",
                        "SourceTableName": "N/A",
                        "SourceColumnName": "N/A",
                    }
                ]
            ),
            table_name=historical_table_names["source_mapping"],
            context=context,
        )
        return

    tom_tables_info = []
    # Connect to the semantic model using the TOM API to retrieve table metadata.
    with labs.tom.connect_semantic_model(
        dataset=context["source_model_uuid"],
        readonly=True,
        workspace=context["source_model_workspace_uuid"],
    ) as tom:
        # Iterate over tables in the semantic model.
        for tbl in tom.model.Tables:
            # Get the first partition (if any) to extract source information.
            partition = next(iter(tbl.Partitions), None)
            if partition and partition.Source:
                tom_tables_info.append(
                    {
                        "tom_table": tbl,
                        "schema_name": getattr(partition.Source, "SchemaName", None),
                        "source_table_name": getattr(partition.Source, "EntityName", None),
                    }
                )

    def get_datastore_columns():
        """
        Retrieves columns from datastore Delta tables by manually reading data from known storage paths.
        
        For each table from the TOM metadata, constructs a list of candidate paths (with and without schema)
        and attempts to load the Delta table from each in order, extracting the column names.
        """
        data = []
        base_path = (
            f"abfss://{context['source_datastore_workspace_uuid']}"
            f"@{abfss_base_path}/"
            f"{context['source_datastore_uuid']}/Tables/"
        )
        
        for item in tom_tables_info:
            schema = item["schema_name"]
            entity = item["source_table_name"]
            if not entity:
                continue

            # Build candidate paths: try with schema (if available) first, then without.
            candidate_paths = []
            if schema:
                candidate_paths.append(f"{base_path}{schema}/{entity}/")
            candidate_paths.append(f"{base_path}{entity}/")

            loaded = False
            last_exception = None

            # Try each candidate path until one succeeds.
            for path in candidate_paths:
                try:
                    df = spark.read.format("delta").load(path)
                    # On success, record the column names and break out of the loop.
                    for col_name in df.schema.fieldNames():
                        data.append({"Table Name": entity, "Column Name": col_name})
                    loaded = True
                    break
                except Exception as ex:
                    print(f"⚠️ Unable to read from {path}: {ex}")
                    last_exception = ex

            # If no candidate path was successful, raise the last encountered exception.
            if not loaded:
                raise last_exception

        return spark.createDataFrame(data)

    # Retrieve all columns from the datastore using the defined helper function.
    all_cols_df = get_datastore_columns()

    # Ensure the result is a Spark DataFrame (convert if necessary).
    if not isinstance(all_cols_df, pyspark.sql.DataFrame):
        all_cols_df = spark.createDataFrame(all_cols_df)

    # Group the retrieved columns by table name and collect them into sets.
    grouped_df = (
        all_cols_df.groupBy("Table Name")
        .agg(collect_set("Column Name").alias("columns"))
        .collect()
    )
    # Create a dictionary mapping each table to its set of columns.
    table_columns = {row["Table Name"]: set(row["columns"]) for row in grouped_df}

    remaining_columns = []  # To store columns present in the datastore but unused in the model.
    source_mapping = []     # To store mapping between model columns and source columns as defined in TOM.

    for info in tom_tables_info:
        tbl = info["tom_table"]
        src_table = info["source_table_name"]
        if not src_table:
            continue
        # Get the set of columns available in the datastore for this source table.
        delta_cols = table_columns.get(src_table, set())
        # Get the set of source columns used in the model from TOM metadata.
        # The loop variable is named tom_col so it does not shadow the col() function used elsewhere in this notebook.
        used_cols = {tom_col.SourceColumn for tom_col in tbl.Columns if hasattr(tom_col, "SourceColumn")}
        # Build the source mapping for each column that has a SourceColumn attribute.
        for tom_col in tbl.Columns:
            if hasattr(tom_col, "SourceColumn"):
                source_mapping.append(
                    {
                        "TableName": tbl.Name,
                        "ColumnName": tom_col.Name,
                        "SourceTableName": src_table,
                        "SourceColumnName": tom_col.SourceColumn,
                    }
                )
        # Identify columns in the datastore that are not used in the model.
        for unused in delta_cols - used_cols:
            remaining_columns.append(
                {
                    "TableName": tbl.Name,
                    "SourceTableName": src_table,
                    "SourceColumnName": unused,
                }
            )
    
    # If no unused columns were found, insert a placeholder record.
    if not remaining_columns:
        remaining_columns.append(
            {"TableName": "N/A", "SourceTableName": "N/A", "SourceColumnName": "N/A"}
        )

    # Convert the lists of unused columns and source mappings to Spark DataFrames.
    unused_df = spark.createDataFrame(remaining_columns)
    mapping_df = spark.createDataFrame(source_mapping)
    # Save the unused columns data to the Delta table for unused columns.
    save_dataframe_to_delta_table(
        data=unused_df,
        table_name=historical_table_names["unused_columns"],
        context=context,
    )
    # Save the source mapping data to the Delta table for source mappings.
    save_dataframe_to_delta_table(
        data=mapping_df,
        table_name=historical_table_names["source_mapping"],
        context=context,
    )

### Capturing and Processing Reports
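
Report IDs that appear in the query logs but cannot be found in any workspace are kept as placeholder rows, so later joins on `ReportId` do not silently drop them. The snippet below sketches that filter-then-backfill logic with plain dictionaries instead of pandas; `add_placeholders` and the sample IDs are hypothetical.

```python
PLACEHOLDER_FIELDS = ("ReportName", "WebUrl", "WorkspaceId", "WorkspaceName")


def add_placeholders(found_reports: list, report_ids_to_keep: list) -> list:
    """Keep only the requested reports and append an 'Unknown' row for each missing ReportId."""
    wanted = set(report_ids_to_keep)
    kept = [r for r in found_reports if r["ReportId"] in wanted]
    missing = sorted(wanted - {r["ReportId"] for r in kept})
    placeholders = [
        {"ReportId": rid, **{field: "Unknown" for field in PLACEHOLDER_FIELDS}}
        for rid in missing
    ]
    return kept + placeholders


# Hypothetical data: one report found in a workspace, one ID seen only in the logs.
found = [
    {"ReportId": "r1", "ReportName": "Sales", "WebUrl": "https://example.invalid/r1",
     "WorkspaceId": "w1", "WorkspaceName": "Finance"},
    {"ReportId": "r9", "ReportName": "Other", "WebUrl": "https://example.invalid/r9",
     "WorkspaceId": "w1", "WorkspaceName": "Finance"},
]
result = add_placeholders(found, ["r1", "r2"])
print([(r["ReportId"], r["ReportName"]) for r in result])
```

Here `r9` is dropped because it was not requested, and `r2` is added with `"Unknown"` values because no workspace returned it.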

In [None]:
@log_function_calls
def get_reports(context: dict, all_workspaces: pd.DataFrame, report_ids_to_keep: list) -> None:
    """
    Retrieves report information from each workspace and filters to retain only the specified ReportIds.
    
    For each workspace:
      - Calls the Fabric API to list reports.
      - Selects and renames columns for consistency.
      - Adds workspace identifiers.
      - Filters out reports not in the report_ids_to_keep list.
      - For any missing ReportIds, creates placeholder rows with default "Unknown" values.
    The final combined DataFrame is saved to the Delta table for reports.
    """
    reports_list = []  # Initialize a list to store DataFrames of reports from each workspace.
    try:
        # Iterate over each workspace provided in the all_workspaces DataFrame.
        for _, workspace in all_workspaces.iterrows():
            ws_id = workspace["Id"]
            ws_name = workspace["Name"]
            try:
                # Retrieve the list of reports for the current workspace using the Fabric API.
                reports_df = fabric.list_reports(workspace=ws_id)
            except Exception as e:
                # If report listing fails for a workspace, log the error and skip to the next workspace.
                print(f'❌ Failed to list reports for workspace "{ws_name}" (ID: {ws_id}). Error: {e}')
                continue
            
            # Select only the relevant columns and rename them for consistency.
            reports_df = reports_df[["Id", "Name", "Web Url"]].rename(
                columns={"Id": "ReportId", "Name": "ReportName", "Web Url": "WebUrl"}
            )
            # Add workspace-specific identifiers to the report DataFrame.
            reports_df["WorkspaceId"] = ws_id
            reports_df["WorkspaceName"] = ws_name
            # Filter the reports to keep only those whose ReportId is in the report_ids_to_keep list.
            reports_df = reports_df[reports_df["ReportId"].isin(report_ids_to_keep)]
            # Append the filtered DataFrame to the list.
            reports_list.append(reports_df)
        
        # Combine all report DataFrames into one; if none were found, create an empty DataFrame with the required columns.
        combined_reports = (
            pd.concat(reports_list, ignore_index=True)
            if reports_list
            else pd.DataFrame(
                columns=[
                    "ReportId",
                    "ReportName",
                    "WebUrl",
                    "WorkspaceId",
                    "WorkspaceName",
                ]
            )
        )
        # Determine which ReportIds from the desired list are missing in the combined reports.
        missing_ids = set(report_ids_to_keep) - set(combined_reports["ReportId"])
        if missing_ids:
            # For any missing ReportIds, create placeholder rows with default "Unknown" values.
            missing_df = pd.DataFrame(
                [
                    {
                        "ReportId": rid,
                        "ReportName": "Unknown",
                        "WebUrl": "Unknown",
                        "WorkspaceId": "Unknown",
                        "WorkspaceName": "Unknown",
                    }
                    for rid in missing_ids
                ]
            )
            # Append the placeholder rows to the combined DataFrame.
            combined_reports = pd.concat(
                [combined_reports, missing_df], ignore_index=True
            )
        # Save the final combined reports DataFrame to the Delta table for source reports.
        save_dataframe_to_delta_table(
            data=combined_reports,
            table_name=historical_table_names["source_reports"],
            context=context,
        )
        print(f"✅ Retrieved and saved {len(combined_reports)} reports.")
    except Exception as e:
        print(f"❌ get_reports encountered an error: {e}")
        raise

@log_function_calls
def get_app_reports(context: dict) -> None:
    """
    Retrieves all Power BI apps accessible to the caller, lists the reports
    contained in each app, and saves the combined DataFrame to the Delta
    table for app reports.
    """
    try:
        # List all apps
        client = fabric.PowerBIRestClient()
        apps_response = client.get("https://api.powerbi.com/v1.0/myorg/apps")
        apps_df = pd.DataFrame(apps_response.json().get("value", []))
        
        reports_list = []
        # For each app, list its reports
        for _, app in apps_df.iterrows():
            app_id = app["id"]
            app_name = app["name"]
            try:
                resp = client.get(
                    f"https://api.powerbi.com/v1.0/myorg/apps/{app_id}/reports"
                )
                reports_df = pd.DataFrame(resp.json().get("value", []))
                
                if reports_df.empty:
                    # No reports in this app; skip it.
                    continue

                # Tag each row with its source app
                reports_df["appName"] = app_name

            except Exception as e:
                print(f'❌ Failed to list reports for app "{app_name}" (ID: {app_id}): {e}')
                continue

            reports_list.append(reports_df)

        # Combine or build an empty schema if nothing matched
        if reports_list:
            combined = pd.concat(reports_list, ignore_index=True)
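            # Drop the nested columns if present; errors="ignore" avoids a KeyError when the API omits them.
            combined = combined.drop(columns=["users", "subscriptions"], errors="ignore")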
        else:
            # Ensure consistent schema even if empty
            combined = pd.DataFrame(columns=["id", "reportType", "name", "webUrl", "embedUrl", "isOwnedByMe",
                "datasetId", "appId", "originalReportObjectId", "reportFlags", "appName", "description"])

        # Persist to Delta
        save_dataframe_to_delta_table(
            data=combined,
            table_name=historical_table_names["source_app_reports"],
            context=context,
        )
        print(f"✅ Retrieved and saved {len(combined)} app-report records.")

    except Exception as e:
        print(f"❌ get_app_reports encountered a fatal error: {e}")
        raise


### Cold Cache Helpers: Log Table, Model Refresh, Cache Clear, and Timing Capture
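
Several helpers in this section share one pattern: check a condition, wait, and give up after a fixed number of attempts. The snippet below is a generic sketch of that pattern, not the notebook's implementation; `poll_until` is a hypothetical helper, and the status iterator stands in for a real refresh-status call.

```python
import time


def poll_until(check, max_attempts: int, delay_seconds: float = 0.0) -> int:
    """Call check() until it returns True and return the number of calls made.

    Raises TimeoutError if the condition is still False after max_attempts calls.
    """
    for attempt in range(1, max_attempts + 1):
        if check():
            return attempt
        time.sleep(delay_seconds)
    raise TimeoutError(f"Condition not met after {max_attempts} attempts")


# Stand-in for a refresh that reports a terminal status on the third check.
statuses = iter(["Unknown", "Unknown", "Completed"])
attempts_used = poll_until(lambda: next(statuses) in ("Completed", "Failed"), max_attempts=5)
print(attempts_used)
```

Bounding the loop matters most for steps that depend on an external service, where an unbounded `while` would keep a scheduled run alive indefinitely. A terminal status such as `"Failed"` ends the polling as well, so the caller still has to inspect the final status.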

In [None]:
@log_function_calls
def fetch_log_table(context: dict, table_name: str) -> pyspark.sql.DataFrame:
    """
    Attempts to fetch a log table for today's QueryEnd events for the current model.

    Returns:
      A Spark DataFrame filtered by ModelUuid, AsOfDate, and EventClass,
      or None if the table does not exist.
    """
    try:
        # Read the table from Spark using its name.
        raw_tbl = spark.read.table(table_name)
        # Apply filters: match the current model, today's date, and ensure the event class is "QueryEnd".
        filters = (
            (col("ModelUuid") == context["source_model_uuid"]) &
            (col("AsOfDate") == context["as_of_date"]) &
            (col("EventClass") == "QueryEnd")
        )
        return raw_tbl.filter(filters)
    except Exception:
        # Log informational message if the table is not found; it may be created later.
        print(f"ℹ️ Log table `{table_name}` does not exist. It will be created if needed.")
        return None


@log_function_calls
def wait_for_model_creation() -> None:
    """
    Polls the target workspace until the cloned model is created.

    Continues checking every 5 seconds until the cloned model appears in the dataset list.
    """
    # Continuously check if the cloned model is present in the dataset list from the target workspace.
    while (
        cloned_model_name
        not in fabric.list_datasets(workspace=cold_cache_target_workspace_name, mode="rest")["Dataset Name"].to_list()
    ):
        print("⌛ Waiting for cloned model creation...")
        time.sleep(5)


@log_function_calls
def refresh_dataset(model_name: str, refresh_type: str) -> None:
    """
    Initiates a dataset refresh using the specified refresh type (e.g., "clearValues" or "full").

    Waits until the refresh operation completes with a status in valid_refresh_statuses.
    """
    attempts = 0
    # Start the refresh and keep the returned refresh request ID for polling.
    refresh_status = fabric.refresh_dataset(model_name, refresh_type=refresh_type)
    # Poll until the refresh status indicates completion or failure.
    while (
        fabric.get_refresh_execution_details(model_name, refresh_status).status not in valid_refresh_statuses
    ):
        attempts += 1
        if attempts >= max_attempts:
            raise Exception(f"❌ Refresh did not complete after {attempts} status checks.")
        # Wait briefly before the next check.
        time.sleep(3)


@log_function_calls
def clear_cache(model_name: str) -> None:
    """
    Clears the VertiPaq cache for the specified model.

    It calls a helper function to clear the cache and then verifies the operation
    by executing a trivial DAX query. The process is retried until successful or
    until the maximum number of attempts is reached.
    """
    attempts = 0
    while True:
        try:
            # Attempt to clear the cache via a helper function.
            labs.clear_cache(model_name)
            # Verify the cache clear by executing a simple DAX query.
            fabric.evaluate_dax(model_name, "EVALUATE {1}")
            print(f"✅ Cache cleared for model `{model_name}`.")
            break
        except Exception as e:
            attempts += 1
            print(f"⚠️ Cache clear attempt failed: {e}")
            if attempts >= max_attempts:
                raise Exception("❌ Failed to clear VertiPaq cache.")
            # Refresh the TOM cache for the target workspace to ensure consistency.
            fabric.refresh_tom_cache(cold_cache_target_workspace_name)
            time.sleep(5)


def capture_cold_cache_timings(column_name: str, trace) -> None:
    """
    Executes a DAX query for a specific column to measure cold cache performance.

    After running the query, it checks the trace logs for a QueryEnd event that
    references the given column name. Raises an exception if the expected event
    is not found after a specified number of attempts.
    """
    # Construct a DAX expression to query a sample of the column's values.
    dax_expr = f"EVALUATE TOPN(1, VALUES({column_name}))"
    try:
        # Execute the DAX query on the cloned model.
        fabric.evaluate_dax(cloned_model_name, dax_expr)
    except Exception as e:
        print(f"❌ DAX evaluation error for {column_name}: {e}")
        raise Exception(f"Failed to evaluate DAX for {column_name}: {e}")
    attempts = 0
    while attempts < max_attempts:
        try:
            # Retrieve trace logs from the trace object.
            trace_logs = trace.get_trace_logs()
            if trace_logs is None:
                raise Exception(f"Trace logs are None for column {column_name}")
            # Look for a QueryEnd event that includes the column name in its Text Data.
            matching_logs = trace_logs[
                (trace_logs["Event Class"] == "QueryEnd") &
                (trace_logs["Text Data"].str.contains(re.escape(column_name), na=False))
            ]
            if not matching_logs.empty:
                break  # Expected trace log found; exit loop.
        except Exception as e:
            print(f"❌ Error reading trace logs for {column_name}: {e}")
            raise Exception(f"Failed to access trace logs for {column_name}: {e}")
        attempts += 1
        if attempts >= max_attempts:
            raise Exception(f"❌ Failed after {attempts} attempts for column {column_name}")
        time.sleep(3)
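
The retry loop above is a bounded poll: check for a matching `QueryEnd` event, wait, and give up after a fixed number of attempts. The following is a minimal standard-library sketch of that pattern, not the notebook's implementation. `has_query_end` and `wait_for_query_end` are hypothetical names, and the events are plain dictionaries standing in for rows of the trace log DataFrame.

```python
import re
import time
from typing import Callable, Iterable, Mapping, Optional

Event = Mapping[str, Optional[str]]


def has_query_end(events: Iterable[Event], column_name: str) -> bool:
    """Return True if any QueryEnd event mentions column_name literally."""
    # re.escape matters here: in 'Sales'[Amount] the brackets would otherwise
    # be read as a regex character class instead of literal text.
    pattern = re.compile(re.escape(column_name))
    return any(
        event.get("Event Class") == "QueryEnd"
        and pattern.search(event.get("Text Data") or "") is not None
        for event in events
    )


def wait_for_query_end(
    get_events: Callable[[], Iterable[Event]],
    column_name: str,
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> int:
    """Poll get_events() until a match appears; return the attempt number."""
    for attempt in range(1, max_attempts + 1):
        if has_query_end(get_events(), column_name):
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # Wait before polling again.
    raise TimeoutError(
        f"No QueryEnd event for {column_name} after {max_attempts} attempts"
    )
```

Returning the attempt number is a choice made for this sketch only; it makes the polling behaviour easy to check with a fake event source.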

### Capturing Cold Cache Performance Metrics

In [None]:
@log_function_calls
def capture_cold_cache_performance(context: dict, model_columns: pd.DataFrame) -> set:
    """
    Measures cold cache performance for eligible columns by deploying a cloned model and executing parallel DAX queries.

    Steps:
      1. Set up and deploy a cloned version of the model.
      2. Refresh and clear the cache of the cloned model.
      3. Create a trace to capture QueryEnd events and measure performance.
      4. Execute DAX queries in parallel for each eligible column.
      5. Save the trace logs with performance metrics.

    Returns:
      A set of column identifiers (formatted strings) that were successfully processed.
    """
    global cloned_model_name, valid_refresh_statuses, max_attempts

    # Define the cloned model's name based on the source model.
    cloned_model_name = f"{context['source_model_name']} - Semantic Model Audit"
    # Valid statuses indicating the refresh operation has completed.
    valid_refresh_statuses = ["Completed", "Failed"]
    # Maximum number of attempts for refresh and cache clearing.
    max_attempts = 120

    # Try to get existing cold cache log data, if available.
    log_tbl = fetch_log_table(context, historical_table_names["cold_cache_measurements"])

    # Filter out columns not eligible for cold cache measurement (e.g., those starting with "RowNumber-").
    eligible_df = model_columns[~model_columns["ColumnName"].str.startswith("RowNumber-")]
    # Format each eligible column as a string representation: 'TableName'[ColumnName]
    eligible_columns = [
        f"'{row['TableName']}'[{row['ColumnName']}]"
        for _, row in eligible_df.iterrows()
    ]

    if log_tbl is not None:
        # Group the existing log data by column name and count entries.
        counts_df = log_tbl.groupBy("ColumnName").count()
        # Identify columns that have already reached the maximum queries per day.
        columns_to_skip = {
            row["ColumnName"]
            for row in counts_df.filter(col("count") >= max_queries_daily)
            .select("ColumnName")
            .collect()
        }
        # Exclude columns that should be skipped.
        filtered_columns = [c for c in eligible_columns if c not in columns_to_skip]
        # Count only eligible columns that were actually skipped.
        num_skipped = len(eligible_columns) - len(filtered_columns)
    else:
        filtered_columns = eligible_columns
        num_skipped = 0

    num_query = len(filtered_columns)
    print(f"📊 {num_query} columns to query; {num_skipped} columns skipped.")

    # If cold cache measurements should be collected and there are columns to query.
    if collect_cold_cache_measurements and num_query > 0:
        try:
            # Deploy a cloned version of the model for cold cache testing.
            labs.deploy_semantic_model(
                source_dataset=context["source_model_name"],
                source_workspace=context["source_model_workspace_name"],
                target_dataset=cloned_model_name,
                target_workspace=cold_cache_target_workspace_name,
                refresh_target_dataset=False,
                overwrite=True,
            )
            # Refresh the TOM cache for the target workspace.
            fabric.refresh_tom_cache(cold_cache_target_workspace_name)
            time.sleep(30)  # Wait for the cache refresh to settle.
            wait_for_model_creation()  # Poll until the cloned model is created.
            # Refresh the cloned model's dataset with a "clearValues" and then "full" refresh.
            refresh_dataset(cloned_model_name, "clearValues")
            refresh_dataset(cloned_model_name, "full")
            # Clear the VertiPaq cache for the cloned model.
            clear_cache(cloned_model_name)
            time.sleep(5)  # Short delay after clearing the cache.

            # Set up a trace to capture QueryEnd events for performance metrics.
            trace_name = f"Simple DAX Trace {uuid4()}"
            event_schema = {
                "QueryEnd": ["EventClass", "TextData", "Duration", "CpuTime", "Success"]
            }
            # Open the trace connection in a context manager so it is always closed.
            with fabric.create_trace_connection(
                dataset=cloned_model_name, workspace=cold_cache_target_workspace_name
            ) as trace_conn:
                trace_conn.drop_traces()  # Clear any existing traces.
                with trace_conn.create_trace(event_schema=event_schema, name=trace_name) as trace:
                    trace.start()
                    # Wait until the trace has started.
                    while not trace.is_started:
                        time.sleep(2)
                    print("🔄 Querying columns in parallel...")
                    total_cols = len(filtered_columns)
                    if total_cols == 0:
                        print("ℹ️ No columns to query after filtering.")
                        return set()
                    # Set up progress tracking.
                    progress_interval = math.ceil(total_cols / 10)
                    next_progress = progress_interval
                    completed = 0
                    successful = set()
                    failed = set()
                    # Execute queries in parallel using a thread pool.
                    with ThreadPoolExecutor(max_workers=max_workers) as executor:
                        future_to_col = {
                            executor.submit(capture_cold_cache_timings, col, trace): col
                            for col in filtered_columns
                        }
                        for future in as_completed(future_to_col):
                            col_name = future_to_col[future]
                            try:
                                # This will raise an exception if the query for this column fails.
                                future.result()
                                successful.add(col_name)
                            except Exception as e:
                                print(f"❌ Error processing {col_name}: {e}")
                                failed.add(col_name)
                            # Count failures too, so progress reflects every finished query.
                            completed += 1
                            # Print progress updates at regular intervals.
                            if completed >= next_progress:
                                print(f"✅ {completed / total_cols * 100:.0f}% of columns completed.")
                                next_progress += progress_interval
                    try:
                        # Stop the trace and retrieve the captured trace logs.
                        trace_logs = trace.stop()
                        if trace_logs is not None and not trace_logs.empty:
                            # Extract the column name from the "Text Data" field using a regex.
                            trace_logs["ColumnName"] = trace_logs["Text Data"].str.extract(
                                r"VALUES\s*\(\s*(.+?)\s*\)\s*\)", expand=False
                            )
                            # Save the trace logs to the Delta table for cold cache measurements.
                            save_dataframe_to_delta_table(
                                data=trace_logs,
                                table_name=historical_table_names["cold_cache_measurements"],
                                context=context,
                                QueryUuid=str(uuid4()),
                            )
                    except Exception as e:
                        print(f"❌ Failed to process trace logs: {e}")
                    if failed:
                        print(f"⚠️ {len(failed)} columns failed: {', '.join(failed)}")
                    else:
                        print("✅ All columns processed successfully.")
                    print(f"✅ Cold cache performance capture complete. {len(successful)} columns queried successfully.")
                    return successful
        except Exception as e:
            print(f"❌ Error during cold cache performance capture: {e}")
            return set()
    else:
        print("ℹ️ Cold cache collection is disabled or no columns remain to query; inserting placeholder.")
        try:
            # Insert a placeholder record so the table keeps its expected structure.
            save_dataframe_to_delta_table(
                data=pd.DataFrame({
                    "EventClass": ["N/A"],
                    "TextData": ["N/A"],
                    "Duration": [0],
                    "CpuTime": [0],
                    "Success": ["N/A"],
                    "ColumnName": ["N/A"],
                }),
                table_name=historical_table_names["cold_cache_measurements"],
                context=context,
                QueryUuid=str(uuid4()),
            )
        except Exception as e:
            print(f"❌ Failed to insert placeholder for cold cache measurements: {e}")
        return set()
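
The parallel step in the function above submits one task per column and sorts each column into a success or failure set as its future completes. A minimal sketch of that pattern using only the standard library is shown here. `run_in_parallel` is a hypothetical helper, and `worker` stands in for `capture_cold_cache_timings`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Set, Tuple


def run_in_parallel(
    items: Iterable[str],
    worker: Callable[[str], object],
    max_workers: int = 4,
) -> Tuple[Set[str], Set[str]]:
    """Run worker(item) for every item and return (successful, failed)."""
    successful: Set[str] = set()
    failed: Set[str] = set()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its item so results can be attributed.
        future_to_item = {executor.submit(worker, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                future.result()  # Re-raises any exception from the worker thread.
                successful.add(item)
            except Exception:
                failed.add(item)
    return successful, failed
```

Calling `future.result()` inside the `try` block is what surfaces worker exceptions; without it, a failed query would pass silently.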

### Capturing Resident Column Statistics

In [None]:
@log_function_calls
def capture_resident_statistics(context: dict, queried_columns: set) -> None:
    """
    Captures resident statistics (e.g., whether columns are loaded in memory, sizes) for model columns.
    
    It compares current model columns (using the Fabric API) with historical resident statistics,
    and saves only new records for columns that have not been recorded yet.
    """
    
    def format_column(row: dict) -> str:
        # Standardize the column identifier by combining the table and column names.
        return f"'{row['TableName']}'[{row['ColumnName']}]"
    
    try:
        # Read previously captured resident statistics for the current date from the Delta table.
        existing_stats = (
            spark.read.table(historical_table_names["resident_statistics"])
            .filter(
                (col("ModelUuid") == context["source_model_uuid"]) &
                (col("AsOfDate") == context["as_of_date"])
            )
            .select("TableName", "ColumnName")
            .collect()
        )
        # Create a set of standardized identifiers for the existing resident statistics.
        existing = {format_column(row.asDict()) for row in existing_stats}
    except Exception:
        print("⚠️ Could not read existing resident statistics; proceeding without.")
        existing = set()
    
    # Determine which model to query: if cold cache was measured, use the cloned model; otherwise, use the source model.
    resident_model = (
        cloned_model_name
        if collect_cold_cache_measurements
        else context["source_model_name"]
    )
    resident_workspace = (
        cold_cache_target_workspace_name
        if collect_cold_cache_measurements
        else context["source_model_workspace_name"]
    )
    
    # Retrieve the current list of model columns using the Fabric API.
    model_columns_resident = fabric.list_columns(
        dataset=resident_model,
        extended=True,
        workspace=resident_workspace,
    )
    # Remove spaces from the DataFrame's header names (e.g., "Table Name" becomes "TableName").
    model_columns_resident.columns = model_columns_resident.columns.str.replace(" ", "", regex=False)
    # Build a set of standardized identifiers for the current model columns.
    model_set = {format_column(row) for _, row in model_columns_resident.iterrows()}
    
    # Identify columns that are new (i.e., not present in the historical resident statistics).
    to_capture = model_set - existing
    if collect_cold_cache_measurements:
        # Optionally restrict to only columns that were previously queried for cold cache metrics.
        to_capture = to_capture.intersection(queried_columns)
    to_capture = list(to_capture)
    
    if to_capture:
        # Filter the current model columns DataFrame to include only those columns that are new.
        filtered = [
            row
            for _, row in model_columns_resident.iterrows()
            if format_column(row) in to_capture
        ]
        filtered_df = pd.DataFrame(filtered)
        print(f"📈 Capturing resident statistics for {len(filtered_df)} new columns.")
        # Save the new resident statistics to the designated Delta table.
        save_dataframe_to_delta_table(
            data=filtered_df,
            table_name=historical_table_names["resident_statistics"],
            context=context,
        )
    else:
        print("ℹ️ No new resident statistics to capture for this run.")
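
The selection logic above reduces to two set operations: subtract the columns already recorded for the day, then optionally intersect with the columns that were queried for cold cache metrics. A small sketch under those assumptions follows. `select_columns_to_capture` is a hypothetical name, and columns are passed as `(table, column)` tuples instead of DataFrame rows.

```python
from typing import Iterable, Optional, Set, Tuple


def format_column(table_name: str, column_name: str) -> str:
    """Build the 'TableName'[ColumnName] identifier used throughout the notebook."""
    return f"'{table_name}'[{column_name}]"


def select_columns_to_capture(
    model_columns: Iterable[Tuple[str, str]],
    already_recorded: Iterable[str],
    queried_columns: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Return identifiers that are new and, if given, were also queried."""
    model_set = {format_column(table, column) for table, column in model_columns}
    # Drop anything that already has resident statistics for this date.
    to_capture = model_set - set(already_recorded)
    if queried_columns is not None:
        # Keep only columns that were actually queried against the clone.
        to_capture &= set(queried_columns)
    return to_capture
```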

### Workspace Monitoring Information and Datastore Identification

In [None]:
@log_function_calls
def get_workspace_monitoring_info(workspace: str) -> tuple[str, str]:
    """
    Retrieves the Query Service URI and KQL Database Id for the Monitoring KQL database in the given workspace.
    
    This function queries the list of KQL databases in the workspace using the Fabric API (via labs.list_kql_databases).
    It then filters the result to find the database named "Monitoring KQL database". If such a database is not found,
    it raises a ValueError. Otherwise, it extracts and returns the Query Service URI and the KQL Database Id as a tuple.
    
    Returns:
      A tuple (kusto_uri, kusto_db_uuid) where:
        - kusto_uri: The URI for querying the KQL database.
        - kusto_db_uuid: The unique identifier (GUID) for the KQL database.
    
    Raises:
      ValueError: If no KQL databases or the specific "Monitoring KQL database" is found in the workspace.
    """
    # Retrieve a DataFrame containing all KQL databases for the given workspace.
    df = labs.list_kql_databases(workspace=workspace)
    # Check if the DataFrame is empty; if so, no KQL databases exist in the workspace.
    if df.empty:
        raise ValueError(f"❌ No KQL databases found in workspace `{workspace}`.")
    # Filter the DataFrame to find the row where the KQL Database Name matches "Monitoring KQL database".
    df_monitor = df[df["KQL Database Name"] == "Monitoring KQL database"]
    # If no matching database is found, raise an error.
    if df_monitor.empty:
        raise ValueError(
            f"❌ Monitoring KQL database not found in workspace `{workspace}`."
        )
    # Extract the Query Service URI from the first (and expected only) matching row.
    kusto_uri = df_monitor.iloc[0]["Query Service URI"]
    # Extract the KQL Database Id (GUID) from the same row.
    kusto_db_uuid = df_monitor.iloc[0]["KQL Database Id"]
    # Return the extracted URI and Database Id as a tuple.
    return kusto_uri, kusto_db_uuid


def resolve_datastore_id(data_store_name: str, workspace_name: str) -> str:
    """
    Resolves the ID for a given datastore (lakehouse or warehouse) based on its name and associated workspace.
    
    The function first attempts to resolve the ID as a lakehouse by calling:
        labs.resolve_lakehouse_id(data_store_name, workspace_name)
    If that call fails (either by raising an exception or returning a falsy value), it then attempts
    to resolve the ID as a warehouse using:
        labs.resolve_warehouse_id(data_store_name, workspace_name)
    
    If both attempts fail, a RuntimeError is raised. No error or message is shown if the ID is resolved successfully.
    
    Parameters:
        data_store_name (str): The name of the datastore (lakehouse or warehouse).
        workspace_name (str): The name of the workspace associated with the datastore.
    
    Returns:
        str: The resolved datastore ID.
    
    Raises:
        RuntimeError: If neither a lakehouse nor a warehouse ID can be resolved.
    """
    datastore_id = None

    # Attempt to resolve as a lakehouse
    try:
        datastore_id = labs.resolve_lakehouse_id(data_store_name, workspace_name)
        if datastore_id:
            return datastore_id
    except Exception:
        pass

    # Attempt to resolve as a warehouse if lakehouse resolution fails
    try:
        datastore_id = labs.resolve_warehouse_id(data_store_name, workspace_name)
        if datastore_id:
            return datastore_id
    except Exception:
        pass

    raise RuntimeError(
        f"Failed to resolve a valid ID for datastore '{data_store_name}' in workspace '{workspace_name}'."
    )
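
The lakehouse-then-warehouse lookup above is an instance of a fallback chain: try each resolver in order, treat an exception or an empty result as a miss, and raise only when every resolver has missed. A generic sketch of that idea follows. `resolve_with_fallbacks` is a hypothetical helper, and the resolvers are arbitrary callables, not the `labs` functions.

```python
from typing import Callable, Optional, Sequence

Resolver = Callable[[str, str], Optional[str]]


def resolve_with_fallbacks(
    name: str, workspace: str, resolvers: Sequence[Resolver]
) -> str:
    """Return the first truthy ID produced by the resolvers, tried in order."""
    for resolver in resolvers:
        try:
            resolved = resolver(name, workspace)
        except Exception:
            continue  # A failing resolver counts as a miss; try the next one.
        if resolved:
            return resolved
    raise RuntimeError(
        f"Failed to resolve a valid ID for datastore '{name}' in workspace '{workspace}'."
    )
```

Swallowing every exception keeps the chain simple but hides the reason a resolver failed. Logging the exception before `continue` would be a reasonable refinement.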

### Main Orchestration: Collecting Statistics for Each Semantic Model

In [None]:
@log_function_calls
def collect_model_statistics(models: list) -> None:
    """
    Main orchestration function that processes each semantic model to capture various statistics.
    
    For each model in the provided list, the function performs the following steps:
      1. Clean up incomplete historical data if configured (by dropping tables or removing incomplete runs).
      2. Build a context dictionary containing all necessary model and workspace details.
      3. Record the start of the run in the run_history table.
      4. Retrieve workspace monitoring information (using explicitly configured Kusto details if provided).
      5. Capture model objects (columns and measures) and save them for historical tracking.
      6. Capture measure dependencies via a DAX query.
      7. Capture query counts for various model objects and update mappings accordingly.
      8. Retrieve detailed query logs and extract ReportIds.
      9. Capture unused Delta table columns and cold cache performance.
      10. Capture resident column statistics (e.g., column residency in memory).
      11. Record the run completion status (marking the run as completed or failed).
    
    If any critical step fails for a model, the function marks the run as failed and proceeds with the next model.
    """
    # Step 1: Clean up historical data if configured.
    if force_delete_historical_tables:
        print("⚠️ Force-deleting historical tables. All data will be lost.")
        drop_historical_tables()
    elif force_delete_incomplete_runs:
        print("ℹ️ Removing records for incomplete runs.")
        cleanup_incomplete_runs()

    # Retrieve all available workspaces once; they are used later to look up reports.
    all_workspaces = fabric.list_workspaces()

    # Process each model from the provided models list.
    for model in models:
        now = datetime.now()  # Capture current timestamp for this run.
        # Step 2: Build the context dictionary with required metadata.
        context = {
            "run_uuid": str(uuid4()),
            "as_of_datetime": now,
            "as_of_date": now.date(),
            "source_model_workspace_name": model["model_workspace_name"],
            "source_model_workspace_uuid": fabric.resolve_workspace_id(model["model_workspace_name"]),
            "source_model_name": model["model_name"],
            "source_model_uuid": fabric.resolve_dataset_id(model["model_name"], model["model_workspace_name"]),
        }
        # Include datastore details in the context if they are provided.
        if model["datastore_name"] and model["datastore_workspace_name"]:
            context.update({
                "source_datastore_name": model["datastore_name"],
                "source_datastore_workspace_name": model["datastore_workspace_name"],
                "source_datastore_uuid": resolve_datastore_id(model["datastore_name"], model["datastore_workspace_name"]),
                "source_datastore_workspace_uuid": fabric.resolve_workspace_id(model["datastore_workspace_name"]),
            })
        else:
            # Use empty strings if datastore details are not provided.
            context.update({
                "source_datastore_name": "",
                "source_datastore_workspace_name": "",
                "source_datastore_uuid": "",
                "source_datastore_workspace_uuid": "",
            })
        try:
            print(f"📁 Processing model `{model['model_name']}`")
            # Step 3: Record the start of the run.
            record_run_start(context)

            # Step 4: Determine workspace monitoring info.
            if model["log_analytics_kusto_uri"] or model["log_analytics_kusto_database_uuid"]:
                context["log_analytics_kusto_uri"] = model["log_analytics_kusto_uri"]
                context["log_analytics_kusto_database"] = model["log_analytics_kusto_database_uuid"]
            elif context["source_model_workspace_uuid"]:
                (context["log_analytics_kusto_uri"],
                 context["log_analytics_kusto_database"]) = get_workspace_monitoring_info(context["source_model_workspace_uuid"])
            else:
                context["log_analytics_kusto_uri"] = ""
                context["log_analytics_kusto_database"] = ""

            report_ids = set()  # To accumulate ReportIds from detailed logs.

            try:
                # Step 5: Capture model objects (columns and measures).
                model_columns, model_measures = capture_semantic_model_objects(context)
                # Step 6: Capture measure dependencies.
                capture_semantic_model_dependencies(context, model_measures)
                # Step 7: Capture query counts and update object mappings.
                report_ids = capture_logs_and_mappings(context, model_columns, model_measures)
                if report_ids:
                    # Step 8a: Retrieve and save reports using the collected ReportIds.
                    get_reports(context, all_workspaces, list(report_ids))
                else:
                    print("ℹ️ No ReportIds found to process.")
                # Step 8b: Retrieve and save report app mapping.
                get_app_reports(context)
            except Exception as critical_err:
                print(f"❌ Critical step failed for model `{model['model_name']}`: {critical_err}")
                try:
                    record_run_completion(context, "failed")
                    print(f"🔴 Run UUID: {context['run_uuid']} marked as failed.")
                except Exception as update_err:
                    print(f"❌ Failed to mark run UUID: {context['run_uuid']} as failed. Error: {update_err}")
                continue  # Skip to next model if a critical step fails.

            try:
                # Step 9: Capture unused Delta table columns.
                capture_unused_delta_columns(context)
            except Exception as e:
                print(f"⚠️ Failed to capture unused Delta columns for model `{model['model_name']}`: {e}")

            try:
                # Step 9 (continued): Capture cold cache performance metrics.
                queried_cols = capture_cold_cache_performance(context, model_columns)
            except Exception as e:
                print(f"⚠️ Cold cache performance capture failed for model `{model['model_name']}`: {e}")
                queried_cols = set()

            try:
                # Step 10: Capture resident statistics for columns.
                capture_resident_statistics(context, queried_cols)
            except Exception as e:
                print(f"⚠️ Resident statistics capture failed for model `{model['model_name']}`: {e}")

            try:
                # Step 11: Mark the run as completed.
                record_run_completion(context, "completed")
                print(f"✅ Run UUID: {context['run_uuid']} completed successfully.")
            except Exception as e:
                print(f"❌ Failed to mark run UUID: {context['run_uuid']} as completed. Error: {e}")
        except Exception as e:
            # Log any unexpected error during processing and move to the next model.
            print(f"❌ Unexpected error processing model `{model['model_name']}`: {e}")
            continue  # Continue with next model on error
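
The orchestration above treats its steps in two tiers: a failure in a critical step marks the run as failed and stops work on that model, while a failure in an optional step is only reported and the run still completes. The sketch below illustrates that control flow in isolation. `run_model_steps` and the `(name, callable)` step format are assumptions made for the example and do not appear in the notebook.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], object]]


def run_model_steps(
    critical_steps: Sequence[Step], optional_steps: Sequence[Step]
) -> Tuple[str, List[str]]:
    """Return (status, messages) for one model's run."""
    for name, step in critical_steps:
        try:
            step()
        except Exception as err:
            # A critical failure ends the run immediately.
            return "failed", [f"{name}: {err}"]
    warnings: List[str] = []
    for name, step in optional_steps:
        try:
            step()
        except Exception as err:
            # Optional failures are recorded, but the run continues.
            warnings.append(f"{name}: {err}")
    return "completed", warnings
```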

### Execute the Statistics Collection for All Models

In [None]:
# Execute the main function to process all models defined in the models list.
collect_model_statistics(models)

### Generate Star Schema
### Helper Function for Star Schema: Generate Table Key
Used throughout the star schema creation SQL to produce unique keys.

In [None]:
def generate_table_key(*columns) -> str:
    """
    Generates a SQL expression to produce a unique key from the concatenated values of the given columns.
    
    It uses IFNULL to replace NULLs, CONCAT to join the values, and MD5 to hash them.
    The rightmost 16 hex digits of the hash are converted from base 16 to signed
    base 10 with CONV and cast to BIGINT.
    
    Returns:
      A string containing the SQL expression for the unique key.
    """
    # Create IFNULL expressions for each column to avoid NULL issues.
    ifnull_parts = [f"IFNULL({col}, '')" for col in columns]
    # Build the SQL expression.
    sql_expr = f"""
        CAST(CONV(
            RIGHT(MD5(CONCAT({", ".join(ifnull_parts)})), 16),
            16,
            -10
        ) AS BIGINT)
    """
    return sql_expr.strip()
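
To make the generated SQL expression easier to reason about, here is a rough Python equivalent of the key it computes. It is a sketch resting on two assumptions: that Spark's `MD5` hashes the UTF-8 bytes of the string, and that `CONV(..., 16, -10)` interprets the 64-bit value as a signed integer. `table_key` is a hypothetical name used only for this illustration.

```python
import hashlib
from typing import Optional


def table_key(*values: Optional[str]) -> int:
    """Approximate the BIGINT key produced by the generated SQL expression."""
    # IFNULL(col, ''): treat missing values as empty strings, then CONCAT.
    joined = "".join("" if value is None else str(value) for value in values)
    # MD5 yields 32 hex digits; RIGHT(..., 16) keeps the low 64 bits.
    low_hex = hashlib.md5(joined.encode("utf-8")).hexdigest()[-16:]
    unsigned = int(low_hex, 16)
    # Signed conversion: values at or above 2**63 wrap to negative numbers.
    return unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned
```

Because the values are concatenated without a separator, `table_key("ab", "c")` and `table_key("a", "bc")` produce the same key. If that matters for a given set of columns, adding a delimiter between the parts would avoid it.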

### Create `DIM_ModelObject`
Collects the most recent column, measure, and unused column definitions.
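
The query in this cell repeatedly applies one pattern: for each object key, keep the row with the greatest `AsOfDateTime`. As an illustration only, the same idea in plain Python looks like the sketch below. `latest_per_key` is a hypothetical helper, and rows are dictionaries standing in for table records.

```python
from typing import Any, Dict, Hashable, Iterable, List, Sequence, Tuple


def latest_per_key(
    rows: Iterable[Dict[str, Any]],
    key_fields: Sequence[str],
    ts_field: str = "AsOfDateTime",
) -> List[Dict[str, Any]]:
    """Keep, for each key, the row with the largest timestamp."""
    latest: Dict[Tuple[Hashable, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        # Replace the stored row only when a strictly newer one appears.
        if current is None or row[ts_field] > current[ts_field]:
            latest[key] = row
    return list(latest.values())
```

One difference from the SQL form: joining on `MAX(AsOfDateTime)` keeps every row that ties for the latest timestamp, whereas this sketch keeps only the first such row.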

In [None]:
query_result = spark.sql(f"""
    -- Get the latest report measures
    WITH latest_report_measures AS (
        SELECT
            mapping.TableName,
            mapping.ObjectName,
            mapping.ObjectType,
            mapping.ModelObject AS Expression,
            '' AS Description,
            NULL AS ModifiedDate,
            'True' AS DeletedFromModelFlag,
            query_count.RunUuid,
            query_count.ModelUuid,
            query_count.AsOfDate,
            query_count.AsOfDateTime
        FROM {historical_table_names["object_query_count"]} AS query_count
        JOIN (
            SELECT
                ModelUuid,
                ModelObject,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM {historical_table_names["object_query_count"]}
            GROUP BY ALL
        ) AS latest ON
            latest.ModelObject = query_count.ModelObject
            AND latest.ModelUuid = query_count.ModelUuid
            AND latest.MaxAsOfDateTime = query_count.AsOfDateTime
        LEFT JOIN {historical_table_names["object_mapping"]} AS mapping ON
            mapping.RunUuid = query_count.RunUuid
            AND mapping.ModelUuid = query_count.ModelUuid
            AND mapping.ObjectType = query_count.ObjectType
            AND mapping.ModelObject = query_count.ModelObject
        WHERE query_count.ObjectType = 'REPORT MEASURE'
    ),

    -- Get the latest model columns
    latest_model_columns AS (
        SELECT
            mapping.TableName,
            mapping.ObjectName,
            mapping.ObjectType,
            '' AS Expression,
            model_column.Description,
            CAST(ModifiedTime AS DATE) AS ModifiedDate,
            CASE 
                WHEN MAX(model_column.AsOfDate) OVER() = model_column.AsOfDate THEN 'False'
                ELSE 'True'
            END AS DeletedFromModelFlag,
            model_column.RunUuid,
            model_column.ModelUuid,
            model_column.AsOfDate,
            model_column.AsOfDateTime
        FROM {historical_table_names["model_columns"]} AS model_column
        JOIN (
            SELECT
                ModelUuid,
                TableName,
                ColumnName,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM {historical_table_names["model_columns"]}
            GROUP BY ALL
        ) AS latest ON
            latest.ModelUuid = model_column.ModelUuid
            AND latest.TableName = model_column.TableName
            AND latest.ColumnName = model_column.ColumnName
            AND latest.MaxAsOfDateTime = model_column.AsOfDateTime
        LEFT JOIN {historical_table_names["object_mapping"]} AS mapping ON
            mapping.RunUuid = model_column.RunUuid
            AND mapping.ModelUuid = model_column.ModelUuid
            AND mapping.ObjectType = 'COLUMN'
            AND mapping.TableName = model_column.TableName
            AND mapping.ObjectName = model_column.ColumnName
    ),

    -- Get the latest model measures
    latest_model_measures AS (
        SELECT
            mapping.TableName,
            mapping.ObjectName,
            mapping.ObjectType,
            model_measure.MeasureExpression AS Expression,
            model_measure.MeasureDescription AS Description,
            NULL AS ModifiedDate,
            CASE 
                WHEN MAX(model_measure.AsOfDate) OVER() = model_measure.AsOfDate THEN 'False'
                ELSE 'True'
            END AS DeletedFromModelFlag,
            model_measure.RunUuid,
            model_measure.ModelUuid,
            model_measure.AsOfDate,
            model_measure.AsOfDateTime
        FROM {historical_table_names["model_measures"]} AS model_measure
        JOIN (
            SELECT
                ModelUuid,
                TableName,
                MeasureName,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM {historical_table_names["model_measures"]}
            GROUP BY ALL
        ) AS latest ON
            latest.ModelUuid = model_measure.ModelUuid
            AND latest.TableName = model_measure.TableName
            AND latest.MeasureName = model_measure.MeasureName
            AND latest.MaxAsOfDateTime = model_measure.AsOfDateTime
        LEFT JOIN {historical_table_names["object_mapping"]} AS mapping ON
            mapping.ModelUuid = model_measure.ModelUuid
            AND mapping.RunUuid = model_measure.RunUuid
            AND mapping.ObjectType = 'MEASURE'
            AND mapping.TableName = model_measure.TableName
            AND mapping.ObjectName = model_measure.MeasureName
    ),

    -- Get unused columns
    latest_unused_columns AS (
        SELECT
            unused_column.TableName,
            IFNULL(model_column_with_mapping.ObjectName, unused_column.SourceColumnName) AS ObjectName,
            'COLUMN' AS ObjectType,
            '' AS Expression,
            model_column_with_mapping.Description,
            model_column_with_mapping.ModifiedDate,
            'True' AS DeletedFromModelFlag,
            unused_column.RunUuid,
            unused_column.ModelUuid,
            unused_column.AsOfDate,
            unused_column.AsOfDateTime
        FROM {historical_table_names["unused_columns"]} AS unused_column
        JOIN (
            SELECT
                ModelUuid,
                SourceTableName,
                SourceColumnName,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM {historical_table_names["unused_columns"]}
            GROUP BY ALL
        ) AS latest ON
            latest.ModelUuid = unused_column.ModelUuid
            AND latest.SourceTableName = unused_column.SourceTableName
            AND latest.SourceColumnName = unused_column.SourceColumnName
            AND latest.MaxAsOfDateTime = unused_column.AsOfDateTime
        LEFT JOIN (
            SELECT
                latest_model_columns.ModelUuid,
                latest_model_columns.TableName,
                latest_model_columns.ObjectName,
                latest_model_columns.Description,
                latest_model_columns.ModifiedDate,
                source_mapping.SourceTableName,
                source_mapping.SourceColumnName,
                source_mapping.RunUuid
            FROM latest_model_columns
            LEFT JOIN {historical_table_names["source_mapping"]} AS source_mapping ON
                source_mapping.ModelUuid = latest_model_columns.ModelUuid
                AND source_mapping.RunUuid = latest_model_columns.RunUuid
                AND source_mapping.TableName = latest_model_columns.TableName
                AND source_mapping.ColumnName = latest_model_columns.ObjectName
        ) AS model_column_with_mapping ON
            model_column_with_mapping.ModelUuid = unused_column.ModelUuid
            AND model_column_with_mapping.RunUuid = unused_column.RunUuid
            AND model_column_with_mapping.SourceTableName = unused_column.SourceTableName
            AND model_column_with_mapping.SourceColumnName = unused_column.SourceColumnName
    ),

    -- Union all objects
    union_all_objects AS (
        SELECT * FROM latest_report_measures
        UNION
        SELECT * FROM latest_model_columns
        UNION
        SELECT * FROM latest_model_measures
        UNION
        SELECT * FROM latest_unused_columns
    ),

    -- Enrich and keep the latest records
    keep_latest_record_and_enrich AS (
        SELECT
            union_all_objects.TableName,
            union_all_objects.ObjectName,
            union_all_objects.ObjectType,
            union_all_objects.Expression,
            union_all_objects.Description,
            union_all_objects.ModifiedDate,
            union_all_objects.DeletedFromModelFlag,
            union_all_objects.ModelUuid,
            IFNULL(source_mapping.SourceTableName, 'N/A') AS SourceTableName,
            IFNULL(source_mapping.SourceColumnName, 'N/A') AS SourceColumnName,
            CASE 
                WHEN MAX(union_all_objects.AsOfDate) OVER() = union_all_objects.AsOfDate THEN 'False'
                ELSE 'True'
            END AS DeletedFromDatastoreFlag,
            {
                generate_table_key(
                    "union_all_objects.ModelUuid",
                    "union_all_objects.TableName",
                    '''
                        CASE 
                            WHEN union_all_objects.ObjectType = 'REPORT MEASURE'
                            THEN union_all_objects.Expression
                            ELSE union_all_objects.ObjectName
                        END
                    ''',
                )
            } AS ModelObjectId
        FROM union_all_objects
        JOIN (
            SELECT
                ModelUuid,
                TableName,
                ObjectName,
                Expression,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM union_all_objects
            GROUP BY ALL
        ) AS latest ON
            latest.ModelUuid = union_all_objects.ModelUuid
            AND latest.TableName = union_all_objects.TableName
            AND latest.ObjectName = union_all_objects.ObjectName
            AND latest.Expression = union_all_objects.Expression
            AND latest.MaxAsOfDateTime = union_all_objects.AsOfDateTime
        LEFT JOIN {historical_table_names["source_mapping"]} AS source_mapping ON
            source_mapping.ModelUuid = union_all_objects.ModelUuid
            AND source_mapping.RunUuid = union_all_objects.RunUuid
            AND source_mapping.TableName = union_all_objects.TableName
            AND source_mapping.ColumnName = union_all_objects.ObjectName
    )
    SELECT * FROM keep_latest_record_and_enrich
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["dim_model_object"])

### Create ```DIM_Model```

In [None]:
query_result = spark.sql(f"""
    SELECT
        models.source_model_workspace_name AS WorkspaceName,
        models.source_model_name AS ModelName,
        models.ModelUuid AS ModelUuid,
        models.source_datastore_name AS DatastoreName,
        models.source_datastore_uuid AS DatastoreUuid,
        models.source_datastore_workspace_name AS DatastoreWorkspaceName,
        models.source_datastore_workspace_uuid AS DatastoreWorkspaceUuid
    FROM {historical_table_names["run_history"]} AS models
    JOIN (
        SELECT
            ModelUuid,
            MAX(AsOfDateTime) AS AsOfDateTime
        FROM {historical_table_names["run_history"]}
        WHERE Status = 'completed'
        GROUP BY ModelUuid
    ) AS latest ON
        latest.ModelUuid = models.ModelUuid
        AND latest.AsOfDateTime = models.AsOfDateTime
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["dim_model"])
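
The query above keeps one row per model by joining `run_history` to its own `MAX(AsOfDateTime)` per `ModelUuid`, counting only completed runs. The sketch below reproduces that pattern on a tiny in-memory table so the behavior can be checked without a Spark session. It is only an illustration: Python's built-in `sqlite3` stands in for Spark SQL, and the sample rows and the reduced column list are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (ModelUuid, source_model_name, Status, AsOfDateTime).
SAMPLE_RUNS = [
    ("m1", "Sales v1", "completed", "2025-01-01 06:00:00"),
    ("m1", "Sales v2", "completed", "2025-01-02 06:00:00"),
    ("m1", "Sales v3", "failed", "2025-01-03 06:00:00"),
    ("m2", "Finance", "completed", "2025-01-01 06:00:00"),
]

# Same shape as the DIM_Model query: self-join on the latest completed timestamp.
LATEST_COMPLETED_SQL = """
    SELECT models.ModelUuid, models.source_model_name AS ModelName
    FROM run_history AS models
    JOIN (
        SELECT ModelUuid, MAX(AsOfDateTime) AS AsOfDateTime
        FROM run_history
        WHERE Status = 'completed'
        GROUP BY ModelUuid
    ) AS latest
        ON latest.ModelUuid = models.ModelUuid
        AND latest.AsOfDateTime = models.AsOfDateTime
    ORDER BY models.ModelUuid
"""


def latest_completed_models(rows):
    """Load rows into an in-memory table and return one row per model."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE run_history ("
            "ModelUuid TEXT, source_model_name TEXT, Status TEXT, AsOfDateTime TEXT)"
        )
        conn.executemany("INSERT INTO run_history VALUES (?, ?, ?, ?)", rows)
        return conn.execute(LATEST_COMPLETED_SQL).fetchall()
    finally:
        conn.close()


print(latest_completed_models(SAMPLE_RUNS))
# [('m1', 'Sales v2'), ('m2', 'Finance')]
```

The failed run on 2025-01-03 is ignored, so `m1` resolves to its last completed run. Note that the outer side of the join is not filtered on `Status`, so a non-completed row sharing the exact timestamp of the latest completed run would also be returned.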

### Create ```DIM_Report```

In [None]:
query_result = spark.sql(f"""
    WITH src_reports AS (
        SELECT
            sr.ReportId,
            sr.ReportName,
            sr.WebUrl,
            sr.WorkspaceId AS WorkspaceUuid,
            sr.WorkspaceName,
            sr.AsOfDateTime
        FROM {historical_table_names["source_reports"]} AS sr
        JOIN (
            SELECT
                ReportId,
                MAX(AsOfDateTime) AS MaxAsOfDateTime
            FROM {historical_table_names["source_reports"]}
            GROUP BY ReportId
        ) AS latest_reports
            ON sr.ReportId = latest_reports.ReportId
            AND sr.AsOfDateTime = latest_reports.MaxAsOfDateTime
    ),
    final_reports AS (
        SELECT DISTINCT
            src.ReportId AS ReportUuid,
            COALESCE(mapping_data.MappedReportName, src.ReportName) AS ReportName,
            COALESCE(mapping_data.MappedWebUrl, src.WebUrl) AS WebUrl,
            COALESCE(mapping_data.MappedWorkspaceUuid, src.WorkspaceUuid) AS WorkspaceUuid,
            COALESCE(mapping_data.MappedWorkspaceName, src.WorkspaceName) AS WorkspaceName,
            CASE WHEN mapping_data.AppId IS NOT NULL THEN TRUE ELSE FALSE END AS IsAppReport,
            COALESCE(mapping_data.OriginalReportObjectId, src.ReportId) AS OriginalReportUuid
        FROM src_reports AS src
        LEFT JOIN (
            SELECT DISTINCT
                app.id AS AppId,
                app.originalReportObjectId AS OriginalReportObjectId,
                mapped.ReportName AS MappedReportName,
                mapped.WebUrl AS MappedWebUrl,
                mapped.WorkspaceUuid AS MappedWorkspaceUuid,
                mapped.WorkspaceName AS MappedWorkspaceName
            FROM {historical_table_names["source_app_reports"]} AS app
            LEFT JOIN src_reports AS mapped
                ON app.originalReportObjectId = mapped.ReportId
        ) AS mapping_data
            ON src.ReportId = mapping_data.AppId
    )
    SELECT
        ReportUuid,
        CASE WHEN ReportUuid = '' THEN 'Non-Report' ELSE ReportName END AS ReportName,
        WebUrl,
        WorkspaceUuid,
        WorkspaceName,
        IsAppReport,
        OriginalReportUuid
    FROM final_reports
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["dim_report"])
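
The `final_reports` step above resolves app reports to the report they were published from: when a report ID matches an app report ID, the name, URL, and workspace of the original report are preferred, and `COALESCE` falls back to the row's own values otherwise. A plain-Python sketch of that fallback logic follows. The dictionaries, the reduced attribute set, and the function names are hypothetical and only mirror the SQL for illustration.

```python
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((value for value in values if value is not None), None)


def resolve_reports(src_reports, app_to_original):
    """Resolve each report to display attributes, preferring the original report.

    src_reports: {ReportId: {"ReportName": ..., "WorkspaceName": ...}}
    app_to_original: {app report id: originalReportObjectId}
    """
    resolved = {}
    for report_id, attrs in src_reports.items():
        original_id = app_to_original.get(report_id)
        # An unknown or missing original behaves like an unmatched LEFT JOIN.
        mapped = src_reports.get(original_id, {})
        resolved[report_id] = {
            "ReportName": coalesce(mapped.get("ReportName"), attrs.get("ReportName")),
            "WorkspaceName": coalesce(
                mapped.get("WorkspaceName"), attrs.get("WorkspaceName")
            ),
            "IsAppReport": report_id in app_to_original,
            "OriginalReportUuid": coalesce(original_id, report_id),
        }
    return resolved


# Hypothetical sample: "a1" is the app copy of workspace report "r1".
src_reports = {
    "r1": {"ReportName": "Sales Overview", "WorkspaceName": "Analytics"},
    "a1": {"ReportName": "Sales Overview (app copy)", "WorkspaceName": None},
}
resolved = resolve_reports(src_reports, {"a1": "r1"})
print(resolved["a1"])
```

In this sample the app copy inherits the name and workspace of `r1` and is flagged as an app report, while `r1` keeps its own values and points to itself as the original.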

### Create ```DIM_User```

In [None]:
query_result = spark.sql(f"""
    WITH union_data AS (
        SELECT
            'Masked' AS ExecutingUser,
            ExecutingUserGroup
        FROM {historical_table_names["object_query_count"]}
        UNION ALL
        SELECT
            ExecutingUser,
            ExecutingUserGroup
        FROM {historical_table_names["detailed_logs"]}
    ),
    remove_duplicates_and_add_key AS (
        SELECT DISTINCT
            ExecutingUser,
            ExecutingUserGroup,
            {generate_table_key("ExecutingUser", "ExecutingUserGroup")} AS UserId
        FROM union_data
    )
    SELECT * FROM remove_duplicates_and_add_key
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["dim_user"])
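
`DIM_User` adds a `'Masked'` user for every user group found in the aggregated query counts, because those rows carry no individual user. The fact table for query counts later builds its `UserId` from the same literal, so the two keys only line up if both sides feed identical parts into the key function. The sketch below illustrates that idea. `sketch_key` is a hypothetical stand-in (a SHA-256 hash of the joined parts) and is not the notebook's `generate_table_key`, whose actual implementation lives elsewhere in the notebook.

```python
import hashlib


def sketch_key(*parts):
    """Hypothetical surrogate key: hash of the '|'-joined parts (None -> '')."""
    joined = "|".join("" if part is None else str(part) for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def build_dim_user(query_count_groups, detailed_log_users):
    """Union masked users (one per group) with real users and key each pair."""
    pairs = {("Masked", group) for group in query_count_groups}
    pairs |= set(detailed_log_users)
    return {sketch_key(user, group): (user, group) for user, group in pairs}


# Hypothetical sample data.
dim_user = build_dim_user(
    query_count_groups=["Finance", "Sales"],
    detailed_log_users=[("alice@example.com", "Finance")],
)

# Fact side: aggregated counts have no user, so the key uses the same literal.
fact_user_id = sketch_key("Masked", "Sales")
print(dim_user[fact_user_id])
# ('Masked', 'Sales')
```

If the dimension and the fact table ever pass different literals or a different argument order to the key function, the join between them silently returns no match, which is why both queries use the same `'Masked'` value.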

### Create ```FACT_ModelObjectQueryCount```
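Maps query counts back to model objects. Objects referenced directly by a query get `DirectReferenceFlag = 'True'`; objects reached through the dependency table are added with `'False'`.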

In [None]:
query_result = spark.sql(f"""
    -- Map query counts with model objects
    WITH query_counts_with_mapping AS (
        SELECT
            query_count.AsOfDate,
            query_count.AsOfHour,
            query_count.ExecutingUserGroup,
            IFNULL(query_count.ReportId, 'N/A') AS ReportUuid,
            query_count.QueryCount,
            query_count.RunUuid,
            query_count.ObjectType,
            query_count.ModelUuid,
            query_count.ModelObject,
            mapping.TableName,
            mapping.ObjectName,
            {
                generate_table_key(
                    "query_count.ModelUuid",
                    "mapping.TableName",
                    '''
                        CASE 
                            WHEN query_count.ObjectType = 'REPORT MEASURE'
                            THEN query_count.ModelObject
                            ELSE mapping.ObjectName
                        END
                    ''',
                )
            } AS ModelObjectId
        FROM
            {historical_table_names["object_query_count"]} AS query_count
        LEFT JOIN
            {historical_table_names["object_mapping"]} AS mapping ON
                mapping.ModelUuid = query_count.ModelUuid
                AND mapping.RunUuid = query_count.RunUuid
                AND mapping.ModelObject = query_count.ModelObject
    ),

    -- Identify dependencies
    dependencies AS (
        SELECT DISTINCT
            ModelUuid,
            TableName,
            ObjectName,
            ReferencedTableName,
            ReferencedObjectName,
            RunUuid,
            ObjectType
        FROM {historical_table_names["dependencies"]}
    ),

    -- Join dependencies with query counts
    dependencies_join_query_count AS (
        SELECT
            dependencies.ModelUuid,
            query_count.AsOfDate,
            query_count.AsOfHour,
            {
                generate_table_key(
                    "dependencies.ModelUuid",
                    "dependencies.ReferencedTableName",
                    "dependencies.ReferencedObjectName",
                )
            } AS ModelObjectId,
            IFNULL(query_count.ReportUuid, 'N/A') AS ReportUuid,
            query_count.ExecutingUserGroup,
            query_count.QueryCount
        FROM
            dependencies
        -- Inner join: only dependencies of objects that were actually queried are kept
        JOIN
            query_counts_with_mapping AS query_count ON
                query_count.ModelUuid = dependencies.ModelUuid
                AND query_count.RunUuid = dependencies.RunUuid
                AND query_count.TableName = dependencies.TableName
                AND query_count.ObjectName = dependencies.ObjectName
    ),

    -- Union data to create the final fact table
    union_model_and_report_objects AS (
        SELECT
            ModelUuid,
            AsOfDate,
            AsOfHour,
            ModelObjectId,
            ReportUuid,
            {generate_table_key("'Masked'", "ExecutingUserGroup")} AS UserId,
            'True' AS DirectReferenceFlag,
            QueryCount
        FROM
            query_counts_with_mapping
        UNION ALL
        SELECT
            ModelUuid,
            AsOfDate,
            AsOfHour,
            ModelObjectId,
            ReportUuid,
            {generate_table_key("'Masked'", "ExecutingUserGroup")} AS UserId,
            'False' AS DirectReferenceFlag,
            QueryCount
        FROM
            dependencies_join_query_count
    )

    -- Select all records from the final union
    SELECT * FROM union_model_and_report_objects
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["fact_model_object_query_count"])
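
The fact table above combines two kinds of rows: counts for objects that a query referenced directly, and the same counts copied onto every object those queried objects depend on. The following plain-Python sketch shows that expansion on a small example. The data shapes and names are hypothetical and leave out the date, hour, report, and user columns of the real query.

```python
from collections import namedtuple

FactRow = namedtuple("FactRow", ["object_name", "direct_reference", "query_count"])


def expand_with_dependencies(direct_counts, dependencies):
    """Return direct rows plus one indirect row per dependency of a queried object.

    direct_counts: {object_name: query_count}
    dependencies: list of (object_name, referenced_object_name) pairs
    """
    rows = [FactRow(name, True, count) for name, count in direct_counts.items()]
    for object_name, referenced_name in dependencies:
        # Only dependencies of objects that were queried contribute rows.
        if object_name in direct_counts:
            rows.append(FactRow(referenced_name, False, direct_counts[object_name]))
    return rows


# Hypothetical sample: one queried measure and two known dependencies.
rows = expand_with_dependencies(
    direct_counts={"Total Sales": 10},
    dependencies=[("Total Sales", "Sales[Amount]"), ("Margin", "Sales[Cost]")],
)
for row in rows:
    print(row)
```

Here `Sales[Amount]` receives the 10 queries of `Total Sales` as an indirect reference, while `Sales[Cost]` gets no row because `Margin` was never queried. One consequence of this design is that a single query is counted once per object it touches, so summing `QueryCount` across objects does not give the total number of queries.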

### Create ```FACT_ModelLogs```
Stores detailed DAX query logs for performance analysis.

In [None]:
query_result = spark.sql(f"""
    SELECT
        AsOfDate,
        OperationName,
        OperationDetailName,
        ReportId AS ReportUuid,
        ModelUuid,
        Timestamp AS DateTime,
        {generate_table_key("ExecutingUser", "ExecutingUserGroup")} AS UserId,
        DurationMs,
        CpuTimeMs,
        EventText,
        OperationId,
        XmlaSessionId,
        ActivityId,
        RequestId,
        CurrentActivityId,
        StatusCode,
        Status
    FROM
        {historical_table_names["detailed_logs"]} AS detailed_log
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["fact_detailed_logs"])
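
As an illustration of the kind of performance analysis this table is meant to support, the sketch below summarizes `DurationMs` per model. It works on plain Python tuples rather than on the Delta table, and the sample values and function name are hypothetical. In practice the same aggregation would more likely be done in Spark SQL or in a report.

```python
from statistics import median


def summarize_durations(log_rows):
    """Summarize query durations per model.

    log_rows: iterable of (ModelUuid, DurationMs); rows without a duration are skipped.
    """
    durations_by_model = {}
    for model_uuid, duration_ms in log_rows:
        if duration_ms is None:
            continue
        durations_by_model.setdefault(model_uuid, []).append(duration_ms)
    return {
        model_uuid: {
            "count": len(durations),
            "median_ms": median(durations),
            "max_ms": max(durations),
        }
        for model_uuid, durations in durations_by_model.items()
    }


# Hypothetical sample log rows.
summary = summarize_durations(
    [("m1", 100), ("m1", 300), ("m1", 200), ("m2", 50), ("m2", None)]
)
print(summary["m1"])
# {'count': 3, 'median_ms': 200, 'max_ms': 300}
```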

### Create ```FACT_ModelObjectStatistics```
Combines cold-cache timings, column residency, and column size statistics for each model column and date.

In [None]:
query_result = spark.sql(f"""
    -- Create distinct combinations of object and date
    WITH distinct_object_date_combo AS (
        SELECT DISTINCT
            AsOfDate,
            ModelUuid,
            TableName,
            ColumnName AS ObjectName,
            {generate_table_key("ModelUuid", "TableName", "ColumnName")} AS ModelObjectId
        FROM
            {historical_table_names["model_columns"]}
    ),

    -- Map cold cache with object mapping
    cold_cache_with_mapping AS (
        SELECT
            mapping.TableName,
            mapping.ObjectName,
            cold_cache.ModelUuid,
            cold_cache.AsOfDate,
            cold_cache.Duration,
            cold_cache.CpuTime
        FROM
            {historical_table_names["cold_cache_measurements"]} AS cold_cache
        LEFT JOIN
            {historical_table_names["object_mapping"]} AS mapping ON
                mapping.RunUuid = cold_cache.RunUuid
                AND mapping.ModelObject = cold_cache.ColumnName
        WHERE 
            cold_cache.Success = 'Success'
            AND cold_cache.ColumnName IS NOT NULL
    ),

    -- Join facts and aggregate metrics
    join_facts AS (
        SELECT
            combos.ModelUuid,
            combos.AsOfDate,
            combos.ModelObjectId,
            COUNT(residency.IsResident) AS ColumnResidencyMeasuredCount,
            SUM(CASE WHEN residency.IsResident = True THEN 1 ELSE 0 END) AS ColumnResidencyTrueCount,
            AVG(data_size.TotalSize) AS TotalSize,
            AVG(data_size.DataSize) AS DataSize,
            AVG(data_size.DictionarySize) AS DictionarySize,
            AVG(data_size.HierarchySize) AS HierarchySize,
            AVG(cold_cache.Duration) AS DurationTime,
            AVG(cold_cache.CpuTime) AS CpuTime
        FROM
            distinct_object_date_combo AS combos
        LEFT JOIN
            {historical_table_names["model_columns"]} AS residency ON
                residency.ModelUuid = combos.ModelUuid
                AND residency.TableName = combos.TableName
                AND residency.ColumnName = combos.ObjectName
                AND residency.AsOfDate = combos.AsOfDate
        LEFT JOIN
            {historical_table_names["resident_statistics"]} AS data_size ON
                data_size.ModelUuid = combos.ModelUuid
                AND data_size.TableName = combos.TableName
                AND data_size.ColumnName = combos.ObjectName
                AND data_size.AsOfDate = combos.AsOfDate
        LEFT JOIN
            cold_cache_with_mapping AS cold_cache ON
                cold_cache.ModelUuid = combos.ModelUuid
                AND cold_cache.TableName = combos.TableName
                AND cold_cache.ObjectName = combos.ObjectName
                AND cold_cache.AsOfDate = combos.AsOfDate
        GROUP BY ALL
    )

    -- Select all results from the final join
    SELECT * FROM join_facts
""")

query_result.write.mode("overwrite").format("delta").option(
    "overwriteSchema", "true"
).saveAsTable(star_schema_table_names["fact_model_statistics"])
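
The two residency columns written above, `ColumnResidencyMeasuredCount` and `ColumnResidencyTrueCount`, are most useful as a ratio: the share of samples in which a column was loaded in memory. The sketch below computes that ratio and lists the least resident columns. The helper names and sample rows are hypothetical, and the input is a list of dictionaries rather than the Delta table.

```python
def residency_ratio(true_count, measured_count):
    """Share of residency samples in which the column was resident.

    Returns None when the column was never sampled, to avoid dividing by zero.
    """
    if not measured_count:
        return None
    return true_count / measured_count


def least_resident(rows, limit=5):
    """Return (ratio, ModelObjectId) pairs with the lowest residency first."""
    scored = [
        (
            residency_ratio(
                row["ColumnResidencyTrueCount"], row["ColumnResidencyMeasuredCount"]
            ),
            row["ModelObjectId"],
        )
        for row in rows
    ]
    measured = [item for item in scored if item[0] is not None]
    return sorted(measured)[:limit]


# Hypothetical sample rows shaped like the fact table output.
sample_rows = [
    {"ModelObjectId": "col-a", "ColumnResidencyTrueCount": 6, "ColumnResidencyMeasuredCount": 6},
    {"ModelObjectId": "col-b", "ColumnResidencyTrueCount": 1, "ColumnResidencyMeasuredCount": 4},
    {"ModelObjectId": "col-c", "ColumnResidencyTrueCount": 0, "ColumnResidencyMeasuredCount": 0},
]
print(least_resident(sample_rows))
# [(0.25, 'col-b'), (1.0, 'col-a')]
```

Working with the ratio rather than the raw counts has a practical benefit: if the joins in `join_facts` return several rows per residency sample, both counts are multiplied by the same factor, so the ratio is unaffected even though the counts themselves are inflated.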

In [None]:
# Stop the Spark session to release compute once every table has been written
mssparkutils.session.stop()