# Data Transformation Notebook

This notebook transforms order data through three layers: raw CSV (bronze), named columns (silver), and split product fields (gold). Each transformation step is followed by a testing step that checks the resulting column names.

## 1. Bronze to Silver Transformation

In this step, we load the raw CSV data (bronze), which has no header row, and assign column names to produce the silver DataFrame.

In [None]:
# Load the raw data (bronze)
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Bronze to Silver Transformation") \
    .getOrCreate()

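# Load the dataset (Bronze Layer)
# The file has no header row, so Spark names the columns _c0, _c1, ...
# and, with no schema given, reads every value as a string
df_bronze = spark.read.csv("datasets/2021.csv", header=False)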
df_bronze.show(5)

In [None]:
# Perform the transformation to silver
def bronze_to_silver(df):
    # Column names only: toDF() renames the columns and leaves their types unchanged
    columns = ["OrderID", "OrderLine", "OrderDate", "CustomerName", "Email", "Product", "Quantity", "Price", "Tax"]
    # Rename the columns positionally; this fails if the column count differs
    df = df.toDF(*columns)
    return df

df_silver = bronze_to_silver(df_bronze)
df_silver.show(5)
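
The silver step above only names the columns; because the CSV is read without a schema, every column is still a string. One possible follow-up is to cast the columns to concrete types. The sketch below builds Spark SQL `CAST` expressions from a column-to-type mapping. The mapping `SILVER_TYPES` and the helper `build_cast_exprs` are hypothetical, and the chosen types are assumptions about the data, not something this notebook specifies.

```python
# Hypothetical column-to-type mapping; the types are assumptions about
# datasets/2021.csv, so adjust them to the actual data.
SILVER_TYPES = {
    "OrderID": "STRING",
    "OrderLine": "INT",
    "OrderDate": "DATE",
    "CustomerName": "STRING",
    "Email": "STRING",
    "Product": "STRING",
    "Quantity": "INT",
    "Price": "DOUBLE",
    "Tax": "DOUBLE",
}


def build_cast_exprs(types):
    """Return one Spark SQL expression per column, preserving mapping order."""
    return [f"CAST(`{name}` AS {dtype}) AS `{name}`" for name, dtype in types.items()]


cast_exprs = build_cast_exprs(SILVER_TYPES)
print(cast_exprs[6])  # CAST(`Quantity` AS INT) AS `Quantity`
```

Passing the result to `df_silver.selectExpr(*cast_exprs)` would return a typed DataFrame. With ANSI mode off, values that cannot be parsed become null rather than raising an error, so a null check after casting is worth considering.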

## 2. Testing the Bronze to Silver Transformation

In this step, we check that the silver DataFrame has exactly the expected column names, in the expected order. This test does not inspect the row values.

In [None]:
# Validate the transformation
expected_schema = ["OrderID", "OrderLine", "OrderDate", "CustomerName", "Email", "Product", "Quantity", "Price", "Tax"]
assert df_silver.columns == expected_schema, 'Schema does not match!'
print('Bronze to Silver transformation is valid.')
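
When the assertion above fails, the message 'Schema does not match!' does not say which columns differ. A minimal sketch of a more informative check follows; `describe_column_mismatch` is a hypothetical helper, not part of PySpark, and the sample column lists are made up for illustration.

```python
def describe_column_mismatch(actual, expected):
    """Return a readable description of how two column lists differ ('' if equal)."""
    if list(actual) == list(expected):
        return ""
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    problems = []
    if missing:
        problems.append(f"missing: {missing}")
    if unexpected:
        problems.append(f"unexpected: {unexpected}")
    if not missing and not unexpected:
        problems.append("same column names, but order or count differs")
    return "; ".join(problems)


# Hypothetical example: the last column kept Spark's default name
actual = ["OrderID", "OrderLine", "_c2"]
expected = ["OrderID", "OrderLine", "OrderDate"]
print(describe_column_mismatch(actual, expected))
# missing: ['OrderDate']; unexpected: ['_c2']
```

In the notebook, the helper could supply the assertion message, for example `assert df_silver.columns == expected_schema, describe_column_mismatch(df_silver.columns, expected_schema)`.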

## 3. Silver to Gold Transformation

In this step, we split the Product column of the silver data into ProductName and ProductDetails to produce the gold DataFrame.

In [None]:
# Perform the transformation to gold
from pyspark.sql.functions import col, split

def silver_to_gold(df):
    # Split Product on commas once; element 0 becomes ProductName, element 1 ProductDetails
    parts = split(col("Product"), ",")
    df = df.withColumn("ProductName", parts[0]) \
           .withColumn("ProductDetails", parts[1])
    # Drop the original Product column
    df = df.drop("Product")
    return df

df_gold = silver_to_gold(df_silver)
df_gold.show(5)
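
The split on "," has a few edge cases that are easier to see on single values. The sketch below is a pure-Python mirror of the per-row logic, for illustration only; `split_product` is a hypothetical helper and the sample values are invented, not taken from the dataset.

```python
def split_product(product):
    """Pure-Python mirror of the gold split for a single value (illustration only)."""
    parts = product.split(",")
    name = parts[0]
    # With ANSI mode off, Spark yields null for a missing second element;
    # None stands in for that null here.
    details = parts[1] if len(parts) > 1 else None
    return name, details


# Hypothetical sample values
print(split_product("Widget,Large"))      # ('Widget', 'Large')
print(split_product("Widget, Large"))     # ('Widget', ' Large')
print(split_product("Widget"))            # ('Widget', None)
print(split_product("Widget,Large,Red"))  # ('Widget', 'Large')
```

If the real values carry a space after the comma, wrapping each part in `trim` from `pyspark.sql.functions` would be one way to remove it. Anything after a second comma is discarded by this approach.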

## 4. Testing the Silver to Gold Transformation

In this step, we check that the gold DataFrame has the expected columns: Product is gone, and ProductName and ProductDetails are appended at the end.

In [None]:
# Validate the transformation
expected_schema = ["OrderID", "OrderLine", "OrderDate", "CustomerName", "Email", "Quantity", "Price", "Tax", "ProductName", "ProductDetails"]
assert df_gold.columns == expected_schema, 'Schema does not match!'
print('Silver to Gold transformation is valid.')