## Bronze Ingestion

### Spark Session Initialization

This section initializes a Spark session configured with Delta Lake support.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.appName("Bronze Ingestion") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("WARN")

### Raw Data Configuration

Define the base path for the raw source files.
These files feed the **Bronze layer**, where data is ingested in its original format with minimal transformation.

In [3]:
raw_data = '../data/raw_csvs'
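Before reading anything, it can be useful to confirm that the expected files are actually present under `raw_data`. The sketch below is one possible check, not part of the original notebook: the helper name `missing_raw_files` is hypothetical, and the file list is copied from the reads in the ingestion cell.

```python
from pathlib import Path

# File names as used by the ingestion cell in this notebook.
EXPECTED_FILES = [
    "olist_orders_dataset.csv",
    "olist_customers_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_payments_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_products_dataset.csv",
    "olist_sellers_dataset.csv",
]


def missing_raw_files(base_path, expected=EXPECTED_FILES):
    """Return the expected file names that are not found under base_path."""
    base = Path(base_path)
    return [name for name in expected if not (base / name).is_file()]
```

Calling `missing_raw_files(raw_data)` returns an empty list when every file is in place, so a non-empty result can be surfaced before Spark is asked to read anything.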

### Source Data Ingestion

Load source datasets from CSV files into Spark DataFrames.
Schemas are inferred automatically to simplify ingestion at the Bronze layer.

In [4]:
orders_df = (
    spark.read
    .csv(f"{raw_data}/olist_orders_dataset.csv",
         header=True,
         inferSchema=True)
)

customers_df = (
    spark.read
    .csv(f"{raw_data}/olist_customers_dataset.csv",
         header=True,
         inferSchema=True)
)

order_items_df = (
    spark.read
    .csv(f"{raw_data}/olist_order_items_dataset.csv",
         header=True,
         inferSchema=True)
)

order_payments_df = (
    spark.read
    .csv(f"{raw_data}/olist_order_payments_dataset.csv",
         header=True,
         inferSchema=True)
)

order_reviews_df = (
    spark.read
    .csv(f"{raw_data}/olist_order_reviews_dataset.csv",
         header=True,
         inferSchema=True)
)

products_df = (
    spark.read
    .csv(f"{raw_data}/olist_products_dataset.csv",
         header=True,
         inferSchema=True)
)

sellers_df = (
    spark.read
    .csv(f"{raw_data}/olist_sellers_dataset.csv",
         header=True,
         inferSchema=True)
)
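Schema inference is what this notebook uses. As an alternative, a schema can be passed explicitly, which avoids the extra pass over the file that inference needs. The sketch below is an assumption about how that could look for the orders file: the helper names `to_ddl` and `read_orders_with_schema` are hypothetical, and the column types are copied from the orders schema printed in the validation step. It relies on `spark.read.csv` accepting a DDL-formatted string for its `schema` argument.

```python
# Column names and types taken from the orders schema printed by printSchema().
ORDERS_COLUMNS = {
    "order_id": "STRING",
    "customer_id": "STRING",
    "order_status": "STRING",
    "order_purchase_timestamp": "TIMESTAMP",
    "order_approved_at": "TIMESTAMP",
    "order_delivered_carrier_date": "TIMESTAMP",
    "order_delivered_customer_date": "TIMESTAMP",
    "order_estimated_delivery_date": "TIMESTAMP",
}


def to_ddl(columns):
    """Build a DDL schema string such as 'a STRING, b INT' from a name -> type dict."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_orders_with_schema(spark, base_path):
    """Read the orders CSV with an explicit schema instead of inferSchema (hypothetical helper)."""
    return spark.read.csv(
        f"{base_path}/olist_orders_dataset.csv",
        header=True,
        schema=to_ddl(ORDERS_COLUMNS),
    )
```

With the session and path defined earlier, `read_orders_with_schema(spark, raw_data)` would stand in for the inferred read of the orders file.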

### Data Validation

Perform an initial inspection of the ingested datasets to validate structure and data types.

In [11]:
print('Orders DF:')
orders_df.printSchema()
print('Customers DF:')
customers_df.printSchema()

Orders DF:
root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)

Customers DF:
root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)



In [12]:
print('Order items DF:')
order_items_df.printSchema()
print('Order payments DF:')
order_payments_df.printSchema()

Order items DF:
root
 |-- order_id: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- product_id: string (nullable = true)
 |-- seller_id: string (nullable = true)
 |-- shipping_limit_date: timestamp (nullable = true)
 |-- price: double (nullable = true)
 |-- freight_value: double (nullable = true)

Order payments DF:
root
 |-- order_id: string (nullable = true)
 |-- payment_sequential: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- payment_installments: integer (nullable = true)
 |-- payment_value: double (nullable = true)



In [13]:
print('Order reviews DF:')
order_reviews_df.printSchema()
print('Products DF:')
products_df.printSchema()
print('Sellers DF:')
sellers_df.printSchema()

Order reviews DF:
root
 |-- review_id: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- review_score: string (nullable = true)
 |-- review_comment_title: string (nullable = true)
 |-- review_comment_message: string (nullable = true)
 |-- review_creation_date: string (nullable = true)
 |-- review_answer_timestamp: string (nullable = true)

Products DF:
root
 |-- product_id: string (nullable = true)
 |-- product_category_name: string (nullable = true)
 |-- product_name_lenght: integer (nullable = true)
 |-- product_description_lenght: integer (nullable = true)
 |-- product_photos_qty: integer (nullable = true)
 |-- product_weight_g: integer (nullable = true)
 |-- product_length_cm: integer (nullable = true)
 |-- product_height_cm: integer (nullable = true)
 |-- product_width_cm: integer (nullable = true)

Sellers DF:
root
 |-- seller_id: string (nullable = true)
 |-- seller_zip_code_prefix: integer (nullable = true)
 |-- seller_city: string (nullable = true)
 |-- sel
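Reading the printed schemas by eye works for a handful of tables; the same check can also be expressed in code. The sketch below compares a DataFrame's `dtypes` (a list of `(column, type)` pairs) against a dict of expected types. The helper name `dtype_mismatches` and the expected types passed to it are assumptions for illustration. Note that `dtypes` reports short type names such as `int`, whereas `printSchema()` prints `integer`.

```python
def dtype_mismatches(actual_dtypes, expected):
    """Compare (column, type) pairs, as returned by DataFrame.dtypes, with expected types.

    Returns a dict mapping each problem column to (expected_type, actual_type);
    actual_type is None when the column is missing.
    """
    actual = dict(actual_dtypes)
    problems = {}
    for column, wanted in expected.items():
        found = actual.get(column)
        if found != wanted:
            problems[column] = (wanted, found)
    return problems
```

For example, `dtype_mismatches(order_reviews_df.dtypes, {"review_score": "int"})` would flag `review_score`, since the output above shows it was inferred as a string.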

### Partitioning Strategy

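Derive `year`, `month`, and `day` columns from the purchase timestamp (orders) and the shipping limit date (order items). These columns are used to partition the Delta tables on write, so that queries filtering on date can skip files in irrelevant partitions.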

In [14]:
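# Partition columns for orders, derived from the purchase timestamp.
orders_df = (
    orders_df
    .withColumn("year", F.year("order_purchase_timestamp"))
    .withColumn("month", F.month("order_purchase_timestamp"))
    .withColumn("day", F.dayofmonth("order_purchase_timestamp"))
)
# Partition columns for order items, derived from the shipping limit date.
# F.dayofmonth is used here as well, for consistency with orders_df.
order_items_df = (
    order_items_df
    .withColumn("year", F.year("shipping_limit_date"))
    .withColumn("month", F.month("shipping_limit_date"))
    .withColumn("day", F.dayofmonth("shipping_limit_date"))
)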
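To make the effect of these three columns concrete, the sketch below mirrors the derivation in plain Python and shows the Hive-style directory name a row would typically land in once the table is written with `partitionBy("year", "month", "day")`. The helper `partition_dir` and the sample timestamp are hypothetical, and the directory layout is an assumption about the default on-disk structure rather than something shown in this notebook.

```python
from datetime import datetime


def partition_dir(ts):
    """Return the year=/month=/day= directory suffix for a timestamp (illustrative only)."""
    # Integer partition values are not zero-padded, matching F.year / F.month / F.dayofmonth.
    return f"year={ts.year}/month={ts.month}/day={ts.day}"


# Hypothetical sample timestamp, not taken from the dataset.
print(partition_dir(datetime(2017, 10, 2, 10, 56, 33)))  # year=2017/month=10/day=2
```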

### Bronze Layer Path

Initialize the variable that holds the path to the Bronze layer.

In [15]:
bronze = '../delta/01_bronze'

### Bronze Layer Completion

The two partitioned tables are written separately from the others because they are more likely to fail. Keeping them in their own cells makes errors easier to isolate and debug.

In [16]:
# "inferSchema" is a CSV read option and has no effect on a Delta write, so it is omitted.
(
    orders_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .save(f"{bronze}/orders")
)

In [17]:
# "inferSchema" is a CSV read option and has no effect on a Delta write, so it is omitted.
(
    order_items_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .save(f"{bronze}/order_items")
)

In [18]:
# Unpartitioned tables. "inferSchema" is a CSV read option and has no effect
# on a Delta write, so it is omitted.
(
    customers_df.write
    .format("delta")
    .mode("overwrite")
    .save(f"{bronze}/customers")
)

(
    order_payments_df.write
    .format("delta")
    .mode("overwrite")
    .save(f"{bronze}/order_payments")
)

(
    order_reviews_df.write
    .format("delta")
    .mode("overwrite")
    .save(f"{bronze}/order_reviews")
)

(
    products_df.write
    .format("delta")
    .mode("overwrite")
    .save(f"{bronze}/products")
)

(
    sellers_df.write
    .format("delta")
    .mode("overwrite")
    .save(f"{bronze}/sellers")
)

26/01/11 22:10:56 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
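The write cells above repeat the same chain for every table. If that repetition becomes a maintenance burden, the pattern could be folded into a small helper. The sketch below is one way to do it, not the notebook's own approach: the name `write_delta` and its `partition_cols` parameter are hypothetical, and it uses only the writer calls already shown above.

```python
def write_delta(df, path, partition_cols=None):
    """Overwrite a Delta table at path, optionally partitioned by the given columns."""
    writer = df.write.format("delta").mode("overwrite")
    if partition_cols:
        # Only partitioned tables (orders, order_items) take this branch.
        writer = writer.partitionBy(*partition_cols)
    writer.save(path)
```

Under these assumptions, `write_delta(orders_df, f"{bronze}/orders", ["year", "month", "day"])` would correspond to the partitioned write, and `write_delta(customers_df, f"{bronze}/customers")` to an unpartitioned one. Calling it once per cell still keeps the partitioned writes isolated for debugging.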


### Ingestion Complete

The Bronze ingestion process is complete.
All raw datasets are now stored as Delta tables and are ready for cleansing and transformation in the Silver layer.