# Intro

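This notebook dynamically discovers every HubSpot table that Fivetran has landed in the bronze lakehouse, cleans each table (drops rows flagged as deleted and renames `fivetran` columns to `interloop`), and writes the result to the silver lakehouse.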


## Change History

<style>
  table {margin-left: 0 !important;}
</style>

| Date       | Author   | Description  |
| :--------- | :------- | :----------- |
| 2024-10-14 | Mclain R | Created Date |

# Code

## Imports

In [None]:
# Fabric notebook utilities (file system access) and semantic link (workspace lookup)
from notebookutils import mssparkutils
import sempy.fabric as fabric

# Spark column helper used for filtering and renaming
from pyspark.sql.functions import col

## Define Parameters
- none

Note: the following is a parameter cell and will be interpreted by Pipelines as such.

## Reused Functions
- none

## Define Fields

- **workspace_id**: ID of the current workspace
- **workspace_name**: name of the current workspace, resolved from `workspace_id`

In [None]:
# Get the current workspace ID
workspace_id = fabric.get_workspace_id()
print(f"Workspace ID: {workspace_id}")

# Get the workspace name from the workspace ID
workspace_name = fabric.resolve_workspace_name(workspace_id)
print(f"Workspace Name: {workspace_name}")

## Process Data

##### `table_paths` excludes any table whose name contains 'sandbox' or 'explore', so only production data is processed. Change that filter if sandbox data is needed.
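
The name filter can be tried in isolation before running it against the lakehouse. The following is a minimal sketch in plain Python: the helper `is_production_hubspot_table` and the sample folder names are hypothetical and only mirror the conditions used in the cell below.

```python
def is_production_hubspot_table(name: str) -> bool:
    """Return True for HubSpot table folders that are neither sandbox nor explore tables.

    Hypothetical helper: it mirrors the name conditions used to build table_paths.
    """
    return (
        name.startswith("hubspot")
        and "sandbox" not in name
        and "explore" not in name
    )


# Hypothetical folder names, for illustration only
sample_names = [
    "hubspot_contact",
    "hubspot_sandbox_contact",
    "hubspot_explore_deal",
    "salesforce_account",
]

production_names = [name for name in sample_names if is_production_hubspot_table(name)]
print(production_names)
```

To include sandbox tables, the `"sandbox" not in name` condition would be the one to drop.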

In [None]:
# Base path containing the folders
base_path = f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/bronze_lakehouse.Lakehouse/Tables"

# List all items in the base path
items = mssparkutils.fs.ls(base_path)

# Get table paths that start with 'hubspot' and contain neither 'sandbox' nor 'explore'
table_paths = [
    item.path
    for item in items
    if item.isDir
    and item.name.startswith('hubspot')
    and 'sandbox' not in item.name
    and 'explore' not in item.name
]

# Process each table
for table_path in table_paths:
    # Read Delta Tables from the folder
    df = spark.read.format("delta").load(table_path)

    # Only filter when the table actually has a '_fivetran_deleted' column
    if "_fivetran_deleted" in df.columns:
        # Keep rows where _fivetran_deleted is false (0); rows where the flag is null are also dropped
        df = df.filter(col("_fivetran_deleted") == 0)

    # Log the name of the table currently being processed
    table_name = table_path.split('/')[-1]
    print(table_name)

    # Rename columns: replace "fivetran" with "interloop" in every column name that contains it
    rename_expr = [
        col(column).alias(column.replace("fivetran", "interloop")) if "fivetran" in column else col(column)
        for column in df.columns
    ]
    df = df.select(*rename_expr)

    # display(df.head(10))  # Uncomment to preview the cleaned rows

    # Write to the silver lakehouse as a Delta table, replacing existing data and schema
    delta_table_path = f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/silver_lakehouse.Lakehouse/Tables/{table_name}"
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(delta_table_path)
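
The column renaming and the silver path construction are both plain string operations, so they can be checked without a Spark session. The sketch below assumes two hypothetical helpers, `rename_fivetran_columns` and `silver_table_path`, plus made-up column and workspace names. It reproduces the string logic of the loop above and is not part of the notebook itself.

```python
def rename_fivetran_columns(columns):
    """Map each column name to its silver name, replacing 'fivetran' with 'interloop'.

    Hypothetical helper: names without 'fivetran' map to themselves.
    """
    return {column: column.replace("fivetran", "interloop") for column in columns}


def silver_table_path(workspace_name, bronze_table_path):
    """Build the silver Delta path from the last segment of a bronze table path.

    Hypothetical helper mirroring how delta_table_path is built in the loop.
    """
    table_name = bronze_table_path.split("/")[-1]
    return (
        f"abfss://{workspace_name}@onelake.dfs.fabric.microsoft.com/"
        f"silver_lakehouse.Lakehouse/Tables/{table_name}"
    )


# Made-up inputs, for illustration only
sample_columns = ["id", "email", "_fivetran_synced", "_fivetran_deleted"]
renamed = rename_fivetran_columns(sample_columns)
print(renamed)

sample_bronze_path = (
    "abfss://my_workspace@onelake.dfs.fabric.microsoft.com/"
    "bronze_lakehouse.Lakehouse/Tables/hubspot_contact"
)
target_path = silver_table_path("my_workspace", sample_bronze_path)
print(target_path)
```

Because the write uses `mode("overwrite")` with `overwriteSchema`, each run replaces the silver table at that path rather than appending to it.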