
### **HIGH-LEVEL PURPOSE**

This script bulk-loads CSV files from a "Raw" landing zone into MANAGED Delta tables
in the current Lakehouse database. It:
1) Reads CSVs for two groups of datasets: reference and operations.
2) Uses wildcard (glob) paths to sweep all countries/folders at once.
3) Derives a 'Source' column (country) from each file's path using regex on input_file_name().
4) Writes the results directly as MANAGED Delta tables via saveAsTable(), partitioned by 'Source'.
5) Overwrites tables each run (idempotent full refresh).

In [None]:
from pyspark.sql import functions as F
import re

# Root folder where raw CSVs land, organized by country folders (e.g., Files/Raw/USA/...).
raw_root = "Files/Raw"

# Current database (Lakehouse DB) where MANAGED Delta tables will be created/overwritten.
db = spark.catalog.currentDatabase()

# Lists of expected CSV filenames. The script will build one table per filename (minus ".csv").
reference_files = [
    "Customers.csv", "Employees.csv", "Machines.csv",
    "Mills.csv", "Products.csv", "Warehouses.csv"
]

operations_files = [
    "InventorySnapshots.csv", "MachineDowntime.csv", "MachineSensors.csv",
    "Orders.csv", "PlannedProductions.csv", "Shipments.csv", "StockMovements.csv"
]



#### **HOW 'Source' (country) IS EXTRACTED**

Example path: Files/Raw/Brazil/reference/Customers.csv
We want to capture "Brazil" regardless of which subfolder comes next.
   - re.escape(raw_root) safely treats any special chars in raw_root as literals.
   - r"/([^/]+)/" captures the segment immediately after raw_root (the country).

Concretely, for:
```
input_file_name = ".../Files/Raw/Brazil/reference/Customers.csv"
regex group(1)  = "Brazil"
```
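
The pattern can be checked without a Spark session. The sketch below applies the same regex with Python's `re` module; this simple pattern should behave the same way in the Java regex engine that `regexp_extract` uses. The helper name `extract_country` and the sample paths are hypothetical. Returning an empty string on a miss mirrors `regexp_extract`, which yields `""` when the pattern does not match.

```python
import re

raw_root = "Files/Raw"  # same value as in the notebook
country_regex = re.escape(raw_root) + r"/([^/]+)/"


def extract_country(path: str) -> str:
    """Return the folder right after raw_root, or "" when the pattern does not match."""
    m = re.search(country_regex, path)
    return m.group(1) if m else ""


# Hypothetical paths, for illustration only.
samples = [
    "abfss://workspace@host/lakehouse/Files/Raw/Brazil/reference/Customers.csv",
    "abfss://workspace@host/lakehouse/Files/Raw/USA/2025-08-01/Orders.csv",
    "abfss://workspace@host/lakehouse/Files/Other/Customers.csv",
]

for p in samples:
    print(repr(extract_country(p)))
# 'Brazil'
# 'USA'
# ''
```

Rows whose path does not contain `raw_root` end up with an empty `Source`, so they would land in an empty-string partition rather than fail the load.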

In [None]:
from pyspark.sql import functions as F
import re


country_regex = re.escape(raw_root) + r"/([^/]+)/"

def load_to_managed(files, folder_structure):
    """
    For each CSV filename in `files`:
      - Build a glob path using `folder_structure` (e.g., Files/Raw/*/reference/{file})
      - Read all matching CSVs across countries into a single DataFrame.
      - Add 'Source' (country) by regex against input_file_name().
      - Save as a MANAGED Delta table in `db`, partitioned by 'Source', overwriting any prior table.
    """
    for fn in files:
        # Derive a table name by dropping ".csv".
        table = fn[:-4]
        table_qual = f"{db}.{table}"

        # Turn a pattern like "Files/Raw/*/reference/{file}" into a concrete glob.
        #   e.g., "Files/Raw/*/reference/Customers.csv"
        src_glob = folder_structure.format(file=fn)

        try:
            # ---------------------------
            # LOAD ALL MATCHING FILES
            # ---------------------------
            # - .option("header","true"): Use the first row as column names.
            # - .option("inferSchema","true"): Let Spark guess column types from the data (fast setup, but
            #   consider specifying an explicit schema for stability/production).
            # - .load(src_glob): Reads every file that matches the wildcard pattern.
            # - .withColumn("Source", ...): Pull the country from the file path.
            df = (
                spark.read.format("csv")
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .load(src_glob)
                    .withColumn("Source", F.regexp_extract(F.input_file_name(), country_regex, 1))
            )

            # ---------------------------
            # WRITE AS MANAGED DELTA TABLE
            # ---------------------------
            # saveAsTable(table_qual):
            #   - Creates (or replaces) a MANAGED table stored in the Lakehouse's managed location.
            #   - We use mode("overwrite") + option("overwriteSchema","true") for idempotent, full-refresh loads
            #     that also adapt to schema changes.
            # partitionBy("Source"):
            #   - Physically partitions table data by the country. This speeds up queries that filter on Source,
            #     and makes it easy to manage per-country data.
            (df.write.format("delta")
               .mode("overwrite")
               .option("overwriteSchema", "true")
               .partitionBy("Source")
               .saveAsTable(table_qual))

            print(f"✅ {table_qual}: loaded as managed Delta table")
        except Exception as e:
            # Any read/write issue (e.g., missing files, malformed CSV) is reported
            # with the fully qualified table name, and the loop continues to the next table.
            print(f"⚠️ {table_qual}: skipped ({e})")
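
The inline comment on `inferSchema` suggests an explicit schema for production use. One possible way to do that, sketched here, is to keep a DDL-formatted schema string per table and fall back to inference when none is registered; `DataFrameReader.schema()` accepts such a string. The column names and types below are hypothetical placeholders, not taken from the actual CSVs, and `SCHEMAS` and `schema_ddl` are illustrative names.

```python
from typing import Optional

# HYPOTHETICAL column definitions: replace with the real columns of each CSV.
SCHEMAS = {
    "Customers": [
        ("CustomerID", "INT"),
        ("CustomerName", "STRING"),
        ("Country", "STRING"),
    ],
}


def schema_ddl(table: str) -> Optional[str]:
    """Return a DDL schema string for `table`, or None if no schema is registered."""
    columns = SCHEMAS.get(table)
    if columns is None:
        return None
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


print(schema_ddl("Customers"))  # CustomerID INT, CustomerName STRING, Country STRING
print(schema_ddl("Orders"))     # None

# Inside load_to_managed, the reader could then be built roughly like this
# (left as comments because it needs a live Spark session):
#   reader = spark.read.format("csv").option("header", "true")
#   ddl = schema_ddl(table)
#   reader = reader.schema(ddl) if ddl else reader.option("inferSchema", "true")
#   df = reader.load(src_glob)
```

With an explicit schema Spark skips the extra pass over the data that inference needs, and column types stay stable even when a country's file contains unusual values.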



#### **INVOCATION / FOLDER PATTERNS**

In [None]:

# Reference data lives under: <country>/reference/<file>.csv
#   e.g., Files/Raw/USA/reference/Customers.csv
# The glob "*": sweep all countries without enumerating them.
load_to_managed(reference_files,  f"{raw_root}/*/reference/{{file}}")

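# Operational data lives under: <country>/<some_subfolder>/<file>.csv,
# where the subfolder is often date-stamped.
#   e.g., Files/Raw/USA/2025-08-01/Orders.csv
# The glob "*/*": any country, then exactly one subfolder level beneath it.
# Files nested more than one level below the country folder are not matched.
# Note: "*/*" also matches <country>/reference/, which is harmless here because
# the operations filenames do not overlap with the reference filenames.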
load_to_managed(operations_files, f"{raw_root}/*/*/{{file}}")


✅ lakehouse_bz.Customers: loaded as managed Delta table
✅ lakehouse_bz.Employees: loaded as managed Delta table
✅ lakehouse_bz.Machines: loaded as managed Delta table
✅ lakehouse_bz.Mills: loaded as managed Delta table
✅ lakehouse_bz.Products: loaded as managed Delta table
✅ lakehouse_bz.Warehouses: loaded as managed Delta table
✅ lakehouse_bz.InventorySnapshots: loaded as managed Delta table
✅ lakehouse_bz.MachineDowntime: loaded as managed Delta table
✅ lakehouse_bz.MachineSensors: loaded as managed Delta table
✅ lakehouse_bz.Orders: loaded as managed Delta table
✅ lakehouse_bz.PlannedProductions: loaded as managed Delta table
✅ lakehouse_bz.Shipments: loaded as managed Delta table
✅ lakehouse_bz.StockMovements: loaded as managed Delta table
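
The folder patterns passed to `load_to_managed` combine an f-string with a later `.format()` call, which can be hard to read at first. The sketch below shows the two steps in isolation. It uses `PurePosixPath.match` only as a rough stand-in for the storage layer's glob matching (an assumption made for illustration; Spark resolves globs through its own file system layer), and the sample paths and the `build_glob` helper are hypothetical.

```python
from pathlib import PurePosixPath

raw_root = "Files/Raw"

# Inside an f-string, "{{file}}" is an escaped brace pair, so the result still
# contains the literal placeholder "{file}" for the later .format() call.
reference_pattern = f"{raw_root}/*/reference/{{file}}"
operations_pattern = f"{raw_root}/*/*/{{file}}"


def build_glob(folder_structure: str, filename: str) -> str:
    """Fill the {file} placeholder, as load_to_managed does."""
    return folder_structure.format(file=filename)


ref_glob = build_glob(reference_pattern, "Customers.csv")
ops_glob = build_glob(operations_pattern, "Orders.csv")
print(ref_glob)  # Files/Raw/*/reference/Customers.csv
print(ops_glob)  # Files/Raw/*/*/Orders.csv

# Hypothetical paths: exactly one subfolder level below the country matches,
# deeper or shallower nesting does not.
for path in [
    "Files/Raw/USA/2025-08-01/Orders.csv",
    "Files/Raw/USA/2025/08/Orders.csv",
    "Files/Raw/USA/Orders.csv",
]:
    print(path, PurePosixPath(path).match(ops_glob))
# Files/Raw/USA/2025-08-01/Orders.csv True
# Files/Raw/USA/2025/08/Orders.csv False
# Files/Raw/USA/Orders.csv False
```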


In [None]:
delete_tables = False  # <-- set True to drop the tables (for managed tables this also removes their data)

if delete_tables:
    # same list you used when creating
    tables = [
        "Customers","Employees","Machines","Mills","Products","Warehouses",
        "InventorySnapshots","MachineDowntime","MachineSensors",
        "Orders","PlannedProductions","Shipments","StockMovements"
    ]

    # Drop from the current Lakehouse database (what Fabric shows as the schema)
    db = spark.catalog.currentDatabase()
    print(f"Dropping from database: {db}")

    for t in tables:
        spark.sql(f"DROP TABLE IF EXISTS {db}.{t}")
        print(f"🗑️  Dropped: {db}.{t}")
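
The hard-coded table list in the cleanup cell repeats the filenames from the first cell, so the two can drift apart. A minimal sketch of one alternative derives the table names from the same file lists and builds the `DROP` statements up front, which also allows a dry run before anything is deleted. The helper `table_names` is hypothetical, and `db` is set to the name seen in the output above; in the notebook it would come from `spark.catalog.currentDatabase()`.

```python
reference_files = [
    "Customers.csv", "Employees.csv", "Machines.csv",
    "Mills.csv", "Products.csv", "Warehouses.csv",
]
operations_files = [
    "InventorySnapshots.csv", "MachineDowntime.csv", "MachineSensors.csv",
    "Orders.csv", "PlannedProductions.csv", "Shipments.csv", "StockMovements.csv",
]


def table_names(*file_lists):
    """Strip the '.csv' suffix from every filename, preserving order."""
    names = []
    for files in file_lists:
        for fn in files:
            names.append(fn[:-4] if fn.lower().endswith(".csv") else fn)
    return names


db = "lakehouse_bz"  # assumption: in the notebook, use spark.catalog.currentDatabase()
statements = [
    f"DROP TABLE IF EXISTS {db}.{t}"
    for t in table_names(reference_files, operations_files)
]

# Dry run: print the statements instead of executing them.
for stmt in statements:
    print(stmt)

# To actually drop the tables in the notebook:
#   for stmt in statements:
#       spark.sql(stmt)
```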
