# Intro
This notebook takes files uploaded from Mythic, processes them, and writes them to the bronze layer as Delta tables. It then moves the files to a processed folder so that the uploader folder is clear for the next batch of files.



## Change History

<style>
  table {margin-left: 0 !important;}
</style>

| Date    | Author | Description |
| :-------- | :------- | :------- | 
|2025-02-12 | Mclain R |  Create Date|

# Code

## Imports

###### notebookutils
- **mssparkutils**: A utility module in Microsoft Fabric that provides functions for handling file operations, secrets, and other notebook-related tasks within the Spark environment.

###### python
- **re**: The regular expressions module used for pattern matching, text parsing, and string manipulation in Python.

###### sempy
- **fabric**: Part of the Semantic Link (SemPy) library. `sempy.fabric` provides tools for interacting with Microsoft Fabric, such as querying semantic models, working with datasets, and integrating with Fabric's data services. This notebook uses it only to look up the current workspace ID and name.

In [None]:
from notebookutils import mssparkutils
import re
import sempy
import sempy.fabric as fabric

## Define Parameters
- none

Note: the following is a parameter cell and will be interpreted by Pipelines as such.

## Reused Functions
- none

## Define Fields

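- **workspace_name**: name of the current Fabric workspace, resolved from the workspace ID
- **client_name**: name of the client, used to build the ADLS shortcut folder name (`adls_{client_name}`)
- **adls_folder**: name of the folder under the ADLS shortcut that receives uploaded files
- **destination_folder**: name of the folder that processed files are moved to
- **directory_path**: full OneLake path of the upload folder that is scanned for files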


In [None]:
# Get the current workspace ID
workspace_id = fabric.get_workspace_id()
print(f"Workspace ID: {workspace_id}")

# Get the workspace name from the workspace ID
workspace_name = fabric.resolve_workspace_name(workspace_id)
print(f"Workspace Name: {workspace_name}")

In [None]:
client_name = "mythic"
adls_folder = "uploader"
destination_folder = "processed"

In [None]:
# Define the directory path in your ADLS Gen2 storage
directory_path = f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/bronze_lakehouse.Lakehouse/Files/adls_{client_name}/{adls_folder}"
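
The path pattern used above can be captured in a small helper so that the same layout is reused for files and tables. This is only a sketch: `onelake_path` is a hypothetical name that does not appear in the notebook, and the workspace name in the usage line is illustrative.

```python
def onelake_path(workspace: str, lakehouse: str, *parts: str) -> str:
    """Build an abfss:// URI for an item inside a Fabric lakehouse.

    Hypothetical helper: it mirrors the f-string layout used in this notebook.
    """
    base = f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse"
    return "/".join([base, *parts])


# Illustrative values only; the notebook resolves the workspace name at runtime.
example_path = onelake_path("Mythic", "bronze_lakehouse", "Files", "adls_mythic", "uploader")
print(example_path)
```

Building every path through one function keeps the workspace and lakehouse segments consistent between the upload folder, the Delta table location, and the processed folder.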

## Process Data

3/26/24 MR: added a header check to verify that each file matches the expected schema.

##### Dynamically find the matching CSV files in the uploader folder and write them to bronze Delta tables, then move the files from the uploader folder to the processed folder.
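
As a minimal sketch of the file-matching step, the function below shows how a file name maps to a table name, using the same prefix and suffix rules as the loop in this section. `table_name_for` and the sample file names are hypothetical and are not part of the notebook.

```python
from typing import Optional

# Prefixes accepted by the notebook's file filter.
PREFIXES = (
    "ICRM Additional Account Fields_",
    "DCRM Additional Account Fields_",
)


def table_name_for(file_name: str) -> Optional[str]:
    """Return the table name for a matching CSV file, or None if it is skipped."""
    if not (file_name.endswith(".csv") and file_name.startswith(PREFIXES)):
        return None
    # Everything before the first underscore, lowercased, with spaces replaced.
    return file_name.split("_")[0].lower().replace(" ", "_")


# Hypothetical file names, for illustration only.
for name in ["ICRM Additional Account Fields_batch1.csv", "notes.txt"]:
    print(name, "->", table_name_for(name))
```

Files that do not match are skipped and stay in the uploader folder.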

In [None]:
# Get list of files in the directory
file_list = mssparkutils.fs.ls(directory_path)

for file in file_list:
    # Check if it's a CSV file and starts with the desired prefix
    if file.name.endswith(".csv") and (file.name.startswith("ICRM Additional Account Fields_") or file.name.startswith("DCRM Additional Account Fields_")):
        # Extract table name from file name and convert to lowercase
        table_name = file.name.split("_")[0].lower()

        # Replace spaces with underscores in the table name
        table_name = table_name.replace(" ", "_")
        
        # Read the CSV file, using the first row as headers and inferring column types
        df = spark.read.csv(
            file.path, 
            header=True, 
            inferSchema=True
        )
        
        # Clean up headers: lowercase them, replace separators (& space . - / , ;)
        # with underscores, and drop parentheses
        df = df.toDF(*(
            re.sub(r"[()]", "", re.sub(r"[& .\-/,;]", "_", c.lower()))
            for c in df.columns
        ))
        
        # Define expected cleaned schema for each file type
        expected_schemas = {
            'icrm_additional_account_fields': [
                'bac_code', 'parent_bac', 'parent_account', 'pdn',
                'account_dba_name', 'account_id', 'dealer_group_code', 'dealer_group_name', 'dealer_status',
                'garage_package', 'garage_current_provider', 'f_i_relationship', 'f_i_dir_of_sales',
                'region_description', 'uniqueaccountid', 'slp', 'primary_vsc_provider', 'floorplan', 'af_retail',
                'vsc_vol__3mo', 'gap_vol__3mo', 'otherfi_vol__3mo', 'product_density_3mo', '31_day_avg__vol',
                'cur__31_day_vol', '%_diff_', 'call_reports', 'most_recent_call_date', 'slc_report', 'most_recent_slc_date',
                'product_density', 'floorplan_lost', 'paid_off_cash_advance', 'incentive', 'reinsurance', 'risk_score'
            ],
            
            'dcrm_additional_account_fields': [
                'oe_id', 'account_name', 'account_id', 'parent_account', 'group_id',
                'dealer_status', 'group_name', 'privileges', 'sub_privileges', 'uniqueaccountid'
            ]
        }

        # Get expected columns for this table
        expected_columns = expected_schemas.get(table_name)
        actual_columns = df.columns

        # Fail if actual schema doesn't match expected
        if set(expected_columns) != set(actual_columns):
            raise Exception(
                f"Schema mismatch in file {file.name}.<br><br>"
                f"Expected: {expected_columns}.<br><br>"
                f"Found: {actual_columns}."
            )

        # Define Delta table path
        delta_table_path = f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/bronze_lakehouse.Lakehouse/Tables/nucleus__{table_name}"

        # Write DataFrame to Delta table
        df.write.mode("overwrite") \
            .option("overwriteSchema", "true") \
            .format("delta") \
            .save(delta_table_path)

        display(df.limit(10))

        # Define the path the uploaded file is moved to (same workspace as the source)
        destination_path = f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/bronze_lakehouse.Lakehouse/Files/{destination_folder}/{file.name}"

        # Copy the file to the processed folder
        mssparkutils.fs.cp(file.path, destination_path, recurse=True)

        # Delete the original so the uploader folder is clear for the next batch
        mssparkutils.fs.rm(file.path, recurse=True)
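
The header cleanup and schema check above can be tried outside Spark with plain Python. The sketch below applies the same character rules as the cell and reports which columns are missing or unexpected, which is easier to read than two full column lists. `normalize_column`, `schema_diff`, and the raw header strings are hypothetical and only illustrate the rules; the real source headers are not shown in this notebook.

```python
import re


def normalize_column(name: str) -> str:
    """Lowercase a header, turn separators into underscores, drop parentheses."""
    name = re.sub(r"[& .\-/,;]", "_", name.lower())
    return re.sub(r"[()]", "", name)


def schema_diff(expected, actual):
    """Return (missing, unexpected) column names, each sorted alphabetically."""
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    return missing, unexpected


# Assumed raw headers, for illustration only.
raw_headers = ["OE ID", "Account Name", "Group Name"]
actual = [normalize_column(h) for h in raw_headers]
expected = ["oe_id", "account_name", "group_id"]

missing, unexpected = schema_diff(expected, actual)
if missing or unexpected:
    print(f"Schema mismatch. Missing: {missing}. Unexpected: {unexpected}.")
```

Like the check in the cell, this comparison is set-based, so it ignores column order and duplicate names.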
