# **Bronze Layer - Raw Data Ingestion**
This notebook ingests the raw Online Retail II dataset from a managed volume into the Databricks Lakehouse.  
It uses Pandas to read the Excel file, adds audit metadata, and saves the raw data as a Delta table (`bronze_sales`) for further processing in the Silver layer.

In [0]:
# Install openpyxl so that pandas can read .xlsx files
%pip install openpyxl

# Import pandas to read the Excel file
import pandas as pd

# Full path to the uploaded Excel file in the volume
file_path = "/Volumes/workspace/default/ecommerce_demo/online_retail_II.xlsx"

# Read the Excel file (pandas loads only the first worksheet by default)
pdf = pd.read_excel(file_path)

# Show the first 5 rows of data
pdf.head()
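
`pd.read_excel` reads only the first worksheet unless `sheet_name` says otherwise. If the workbook turns out to hold more than one sheet, something like the sketch below could load them all. Passing `sheet_name=None` returns a dict of DataFrames keyed by sheet name; the helper `combine_sheets` and the `source_sheet` column are illustrative names, not part of the original notebook.

```python
import pandas as pd


def combine_sheets(sheets: dict) -> pd.DataFrame:
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    A hypothetical `source_sheet` column records which sheet each row came from.
    """
    frames = [
        frame.assign(source_sheet=name)  # assign returns a copy; inputs stay unchanged
        for name, frame in sheets.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Usage inside the notebook (assumes `file_path` from the cell above):
# sheets = pd.read_excel(file_path, sheet_name=None)  # dict of all worksheets
# pdf = combine_sheets(sheets)
```

Keeping the sheet name as a column preserves the origin of each row, which fits the Bronze layer's goal of traceability.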


## Step 2: Conversion to Spark DataFrame

The raw Excel file has been successfully loaded into a Pandas DataFrame (`pdf`).  
It is now converted into a Spark DataFrame (`df_raw`) to enable distributed data processing using Apache Spark.  
This transformation prepares the dataset for downstream ETL operations and storage in Delta Lake format as part of the Bronze layer.


In [0]:
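# Cast every column to string to avoid mixed-type errors during conversion.
# Note: numbers, dates, and missing values all become text here
# (in pandas 2.x and earlier, NaN becomes the literal string "nan").
pdf_clean = pdf.astype(str)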

# Now safely convert to Spark DataFrame
df_raw = spark.createDataFrame(pdf_clean)

# Preview the result
df_raw.display()
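
Casting with `astype(str)` also turns missing cells into text (the literal string "nan" in pandas 2.x and earlier), which later layers then have to recognise and undo. One possible alternative, sketched below under the assumption that nulls should survive into the Bronze table, stringifies only the non-missing cells. `stringify_preserving_nulls` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd


def stringify_preserving_nulls(pdf: pd.DataFrame) -> pd.DataFrame:
    """Convert every non-missing cell to str and leave missing cells missing."""

    def convert(value):
        # pd.isna covers None, NaN, and NaT for scalar cells
        return None if pd.isna(value) else str(value)

    # apply() walks the columns; map() converts each cell in a column
    return pdf.apply(lambda column: column.map(convert))


# Usage in the notebook (assumes `pdf` from the first cell):
# pdf_clean = stringify_preserving_nulls(pdf)
```

Whether Spark then stores those cells as nulls is worth verifying on the cluster, since schema inference can behave differently for columns that are entirely empty.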


## Step 3: Add Ingestion Audit Column

To support data traceability and lineage tracking, an additional column named `ingested_at` is appended to the dataset.  
This column records when the data was ingested into the Databricks environment. Because Spark evaluates `current_timestamp()` each time a query runs, the value stored in the table is the time at which the write in Step 4 executes, not the time shown when this cell is previewed.  
Including such metadata is standard practice in data engineering pipelines, as it aids debugging, auditing, and historical tracking.


In [0]:
from pyspark.sql.functions import current_timestamp

# Add an ingestion timestamp column to the Spark DataFrame
df_bronze = df_raw.withColumn("ingested_at", current_timestamp())

# Display the result with the audit column
df_bronze.display()


## ✏️ Step 3.1: Adjust Column Names for Delta Compatibility

Delta Lake rejects column names that contain spaces or certain special characters unless column mapping is enabled.  
This step replaces spaces with underscores (e.g., `Customer ID` → `Customer_ID`) so that the table can be written; no other characters are changed.

It is important to note that no data values were altered during this process.  
The dataset remains an accurate representation of the raw input, preserving the integrity of the Bronze layer while enabling reliable storage in Delta format.


In [0]:
# Replace spaces in column names with underscores
# (the loop variable is called `name` so it cannot shadow pyspark.sql.functions.col)
df_bronze = df_bronze.toDF(*[name.replace(" ", "_") for name in df_bronze.columns])

# Display to confirm updated column names
df_bronze.display()
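
The cell above handles spaces only. Without column mapping enabled, Delta Lake also rejects names containing any of the characters `,;{}()=` as well as tabs and newlines. A more general sketch could replace every one of them; `sanitize_column_name` and `sanitize_all` are illustrative helpers, and the uniqueness check is an added safeguard rather than something the original notebook does.

```python
import re

# Characters Delta Lake rejects in column names when column mapping is not enabled
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name: str) -> str:
    """Replace each disallowed character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


def sanitize_all(names: list) -> list:
    """Sanitize a list of names and fail loudly if two names collide afterwards."""
    cleaned = [sanitize_column_name(n) for n in names]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError(f"Sanitized column names are not unique: {cleaned}")
    return cleaned


# Usage in the notebook (assumes `df_bronze` from the cells above):
# df_bronze = df_bronze.toDF(*sanitize_all(df_bronze.columns))
```

For a name such as `Customer ID` the result is the same as the simple `replace` call, so the rename stays a pure relabelling with no effect on data values.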

## Step 4: Save as Bronze Delta Table

The Spark DataFrame, now carrying the `ingested_at` timestamp column, is written to persistent storage as a Delta table named `bronze_sales`.  
This completes the Bronze layer, where the raw data is captured without filtering or cleansing (only string casting and column renaming have been applied) for auditability and reproducibility.  
Stored in Delta Lake format, the table supports versioned history, schema enforcement, and time travel.


In [0]:
# Save the final DataFrame as a Delta table (Bronze layer).
# mode("overwrite") replaces the table's current contents on every run.
df_bronze.write.format("delta").mode("overwrite").saveAsTable("bronze_sales")
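
After the write, a quick row-count comparison can confirm that the table holds what was sent to it. The sketch below assumes the notebook's `spark` session and `df_bronze` DataFrame; `verify_row_count` is a hypothetical helper, not part of the original notebook.

```python
def verify_row_count(spark, table_name: str, expected_rows: int) -> int:
    """Read the saved table back and compare its row count with the expected one.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    """
    actual_rows = spark.read.table(table_name).count()
    if actual_rows != expected_rows:
        raise ValueError(
            f"{table_name} has {actual_rows} rows, expected {expected_rows}"
        )
    return actual_rows


# Usage in the notebook:
# verify_row_count(spark, "bronze_sales", df_bronze.count())
```

The table's version history, which is what time travel relies on, can be inspected with the Delta SQL command `DESCRIBE HISTORY bronze_sales`.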
