# Gold Layer Transformation

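This notebook transforms data from the Silver layer into business-ready Gold layer datasets optimized for analytics and reporting.

## 1. Environment Setup

Set `JAVA_HOME`, import the required libraries, and start the Spark session.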

In [116]:
# Set JAVA_HOME environment variable
import os
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jre1.8.0_451'

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, DoubleType, DateType, StringType
from pyspark.sql.window import Window
from datetime import datetime
import logging
import glob
import pandas as pd
import numpy as np

# Initialize Spark session
try:
    # Try to get an existing Spark session
    spark = SparkSession.builder \
        .appName("Gold Layer Transformation") \
        .getOrCreate()
    
    # Test if the session is active
    spark.sparkContext.getConf().getAll()
    print("Using existing Spark session")
except Exception:
    # Retry once; getOrCreate() builds a new session when none is active
    print("Creating new Spark session")
    spark = SparkSession.builder \
        .appName("Gold Layer Transformation") \
        .getOrCreate()

Using existing Spark session
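
The hard-coded `JAVA_HOME` above is specific to one machine. A minimal sketch of a more defensive lookup follows, assuming a hypothetical helper `pick_java_home` and an illustrative candidate list (neither is part of the notebook): an already-set `JAVA_HOME` wins, otherwise the first candidate directory that exists is used.

```python
import os
import tempfile

def pick_java_home(candidates, env=None):
    """Return a usable Java home, or None if nothing is found.

    Hypothetical helper: an existing JAVA_HOME takes priority, then the
    first candidate path that is an existing directory.
    """
    env = os.environ if env is None else env
    current = env.get("JAVA_HOME")
    if current and os.path.isdir(current):
        return current
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Demo: a temporary directory stands in for a real JRE installation.
with tempfile.TemporaryDirectory() as fake_jre:
    chosen = pick_java_home(["/no/such/jre", fake_jre], env={})
    found_fake = chosen == fake_jre

# No JAVA_HOME and no existing candidate: the helper reports None.
nothing = pick_java_home(["/no/such/jre"], env={})
```

In the notebook, the result could be assigned to `os.environ['JAVA_HOME']` before the Spark session is built, with an explicit error raised when the helper returns `None`.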


## 2. Path Configuration and Logging Setup

Define paths for Silver and Gold layers, and configure logging.

In [117]:
# Define paths - use absolute paths to avoid confusion
# __file__ is not defined inside a notebook, so start from the working directory
# (the notebook's folder when the kernel is started there)
current_dir = os.getcwd()
base_dir = os.path.dirname(current_dir)  # Go up one level to reach the root directory

# Base directories of the silver and gold layers
silver_base_path = os.path.join(base_dir, "output", "silverLayer")
gold_base_path = os.path.join(base_dir, "output", "goldLayer")

# Use current date for gold layer
date_str = datetime.now().strftime("%Y-%m-%d")

# Log path organized by date
log_dir = os.path.join(base_dir, "logs", "gold_layer", date_str)
log_path = os.path.join(log_dir, "gold_layer.log")

# Ensure the logs directory exists
os.makedirs(log_dir, exist_ok=True)

# Configure logging to write to the log file
logging.basicConfig(filename=log_path, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

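def log_message(message, level="info"):
    """Logs a message at the given level and echoes it to stdout."""
    log_functions = {
        "debug": logging.debug,
        "info": logging.info,
        "warning": logging.warning,
        "error": logging.error,
    }
    # Unknown levels fall back to INFO instead of being dropped from the log file
    log_functions.get(level, logging.info)(message)
    print(message)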

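# Find the latest date folder in silver layer
# (folder names are YYYY-MM-DD, so the lexicographic maximum is the most recent date)
silver_date_folders = [p for p in glob.glob(os.path.join(silver_base_path, "*")) if os.path.isdir(p)]
if not silver_date_folders:
    raise FileNotFoundError(f"No date folders found in {silver_base_path}")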

latest_silver_folder = max(silver_date_folders)
silver_path = latest_silver_folder
gold_path = os.path.join(gold_base_path, date_str)

# Ensure the gold layer directory exists
os.makedirs(gold_path, exist_ok=True)

print(f"Silver path: {silver_path}")
print(f"Gold path: {gold_path}")

Silver path: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\silverLayer\2025-05-08
Gold path: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\goldLayer\2025-05-08
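
Picking the latest Silver folder with `max()` relies on every entry being named `YYYY-MM-DD`. The following is a minimal sketch of a stricter selection, assuming a hypothetical helper `latest_date_folder` that parses the folder names as dates and ignores everything else; the temporary directory only simulates the Silver layer layout.

```python
import os
import tempfile
from datetime import datetime

def latest_date_folder(base_path, fmt="%Y-%m-%d"):
    """Return the sub-directory of base_path named after the most recent date.

    Files and directories whose names do not parse as dates are ignored.
    """
    dated = []
    for name in os.listdir(base_path):
        full = os.path.join(base_path, name)
        if not os.path.isdir(full):
            continue
        try:
            dated.append((datetime.strptime(name, fmt), full))
        except ValueError:
            continue  # not a date-named folder
    if not dated:
        raise FileNotFoundError(f"No date folders found in {base_path}")
    return max(dated)[1]

# Demo: two dated folders plus a stray folder and a stray file.
with tempfile.TemporaryDirectory() as base:
    for name in ("2025-05-07", "2025-05-08", "tmp"):
        os.makedirs(os.path.join(base, name))
    open(os.path.join(base, "zz_notes.txt"), "w").close()
    latest_name = os.path.basename(latest_date_folder(base))
```

With a plain `max()` over all directory entries, the stray `zz_notes.txt` would sort last and be chosen; parsing the names avoids that.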


## 3. Data Loading

Read data from the Silver layer and display initial statistics.

In [118]:
# Read the transformed data from silver layer
try:
    silver_df = spark.read.parquet(os.path.join(silver_path, "mrp_production.parquet"))
    log_message(f"Successfully read data from {os.path.join(silver_path, 'mrp_production.parquet')}")
    print(f"Row count: {silver_df.count()}")
    print(f"Column count: {len(silver_df.columns)}")
    
    # Print column names to verify
    print("\nColumn names in the dataset:")
    for col_name in silver_df.columns:
        print(f"- {col_name}")
    
    # Display sample data
    display(silver_df.limit(5).toPandas())
except Exception as e:
    log_message(f"Error reading silver layer data: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

Successfully read data from c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\silverLayer\2025-05-08\mrp_production.parquet
Row count: 1734
Column count: 26

Column names in the dataset:
- Product
- Reference
- Responsible
- Start
- End
- Deadline
- State
- Quantity_Producing
- Quantity_To_Produce
- Total_Quantity
- Product_Cost
- Product_Sales_Price
- Product_Code
- Product_Name
- Production_Efficiency
- Profit_Margin_Percent
- Start_Year
- Start_Month
- Start_Day
- Start_Weekday
- Start_Quarter
- Start_Month_Name
- Production_Duration_Days
- Product_Category
- Price_Category
- Margin_Category


| | Product | Reference | Responsible | Start | End | Deadline | State | Quantity_Producing | Quantity_To_Produce | Total_Quantity | ... | Start_Year | Start_Month | Start_Day | Start_Weekday | Start_Quarter | Start_Month_Name | Production_Duration_Days | Product_Category | Price_Category | Margin_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [FURN_8855] Drawer | WH/MO/01679 | Department 1 | 2024-01-03 01:00:00 | 2024-02-19 01:00:00 | 2024-01-25 01:00:00 | Cancelled | 12.0 | 17.0 | 17.0 | ... | 2024 | 1 | 3 | 4 | 1 | January | 47 | Furniture - Storage | Standard | Low Margin |
| 1 | [FURN_8522] Table Top | WH/MO/07299 | Department 3 | 2024-01-03 01:00:00 | 2024-02-27 01:00:00 | 2024-01-22 01:00:00 | Cancelled | 4.0 | 13.0 | 13.0 | ... | 2024 | 1 | 3 | 4 | 1 | January | 55 | Furniture - Tables | Premium | High Margin |
| 2 | [FURN_9666] Table | WH/MO/02264 | Department 4 | 2024-01-06 01:00:00 | 2024-02-06 01:00:00 | 2024-03-29 01:00:00 | Unknown | 19.0 | 19.0 | 19.0 | ... | 2024 | 1 | 6 | 7 | 1 | January | 31 | Furniture - Tables | Luxury | High Margin |
| 3 | [FURN_7023] Wood Panel | WH/MO/03734 | Department 2 | 2024-01-06 01:00:00 | 2024-02-17 01:00:00 | 2024-03-20 01:00:00 | Unknown | 10.0 | 10.0 | 10.0 | ... | 2024 | 1 | 6 | 7 | 1 | January | 42 | Furniture - Components | Standard | Medium Margin |
| 4 | [FURN_7023] Wood Panel | WH/MO/03614 | Department 2 | 2024-01-07 01:00:00 | 2024-02-03 01:00:00 | 2024-03-11 01:00:00 | Unknown | 15.0 | 15.0 | 15.0 | ... | 2024 | 1 | 7 | 1 | 1 | January | 27 | Furniture - Components | Standard | Medium Margin |


## 4. Examine Silver Layer Data

Check the column names and data structure in the silver layer data.

In [119]:
# Examine the standardized column names from the silver layer
try:
    # Verify the column names
    print("Column names in the silver layer:")
    for col_name in silver_df.columns:
        print(f"- {col_name}")
    
    # Display sample data
    display(silver_df.select("Reference", "Total_Quantity", "Product_Cost", "Product_Sales_Price").limit(5).toPandas())
except Exception as e:
    log_message(f"Error examining silver layer data: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

Column names in the silver layer:
- Product
- Reference
- Responsible
- Start
- End
- Deadline
- State
- Quantity_Producing
- Quantity_To_Produce
- Total_Quantity
- Product_Cost
- Product_Sales_Price
- Product_Code
- Product_Name
- Production_Efficiency
- Profit_Margin_Percent
- Start_Year
- Start_Month
- Start_Day
- Start_Weekday
- Start_Quarter
- Start_Month_Name
- Production_Duration_Days
- Product_Category
- Price_Category
- Margin_Category


| | Reference | Total_Quantity | Product_Cost | Product_Sales_Price |
|---|---|---|---|---|
| 0 | WH/MO/01679 | 17.0 | 100.0 | 110.5 |
| 1 | WH/MO/07299 | 13.0 | 240.0 | 380.0 |
| 2 | WH/MO/02264 | 19.0 | 290.0 | 520.0 |
| 3 | WH/MO/03734 | 10.0 | 80.0 | 100.0 |
| 4 | WH/MO/03614 | 15.0 | 80.0 | 100.0 |
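
Printing the column names relies on a visual check. A minimal sketch of an automated check follows, assuming a hypothetical helper `missing_columns`; the required set lists the columns that the Gold models in this notebook read. In the notebook the first argument would be `silver_df.columns`, which is a plain list of strings.

```python
# Columns read by the Gold layer models in this notebook.
REQUIRED_COLUMNS = {
    "Reference", "Responsible", "Total_Quantity", "Product_Code",
    "Product_Name", "Product_Category", "Product_Cost",
    "Product_Sales_Price", "Production_Efficiency",
    "Profit_Margin_Percent", "Production_Duration_Days",
    "Start_Year", "Start_Month", "Start_Month_Name",
}

def missing_columns(actual_columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from actual_columns, sorted."""
    return sorted(set(required) - set(actual_columns))

# Demo with a deliberately incomplete column list.
sample_columns = ["Reference", "Responsible", "Total_Quantity"]
gaps = missing_columns(sample_columns, required={"Reference", "Product_Code", "State"})
```

Raising an error when the returned list is non-empty would stop the run before any Gold model is built on an unexpected schema.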


## 5. Gold Layer Model 1: Production Summary by Department

Create a summary of production metrics aggregated by department. This model provides insights into departmental performance, efficiency, and profitability.

In [120]:
# Create a production summary by department
try:
    dept_summary = silver_df.groupBy("Responsible") \
        .agg(
            F.count("Reference").alias("Total_Productions"),
            F.sum("Total_Quantity").alias("Total_Units"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days"),
            F.round(F.avg("Production_Efficiency"), 1).alias("Avg_Efficiency_Percent"),
            F.round(F.avg("Profit_Margin_Percent"), 1).alias("Avg_Profit_Margin_Percent")
        ) \
        .orderBy(F.desc("Total_Productions"))
    
    # Display the department summary
    display(dept_summary.toPandas())
    
    # Save to gold layer
    dept_summary.write.mode("overwrite").parquet(os.path.join(gold_path, "department_production_summary.parquet"))
    dept_summary.toPandas().to_csv(os.path.join(gold_path, "department_production_summary.csv"), index=False)
    log_message("Department production summary saved to gold layer")
    
except Exception as e:
    log_message(f"Error creating department summary: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

| | Responsible | Total_Productions | Total_Units | Avg_Production_Days | Avg_Efficiency_Percent | Avg_Profit_Margin_Percent |
|---|---|---|---|---|---|---|
| 0 | Department 5 | 368 | 3897.0 | 27.2 | 95.3 | 46.1 |
| 1 | Department 2 | 360 | 3604.0 | 27.3 | 95.4 | 46.5 |
| 2 | Department 1 | 349 | 3613.0 | 26.3 | 95.4 | 43.9 |
| 3 | Department 3 | 335 | 3462.0 | 26.9 | 94.1 | 44.2 |
| 4 | Department 4 | 322 | 3253.0 | 27.4 | 93.5 | 43.5 |


Department production summary saved to gold layer
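
To make the aggregation concrete, here is a plain-Python stand-in for the `groupBy("Responsible").agg(...)` call, applied to the five sample rows displayed in the Data Loading section. It is only an illustration of what the group-by computes, not how Spark executes it. One difference to keep in mind: `F.count("Reference")` counts non-null values only, whereas this sketch counts every row.

```python
from collections import defaultdict

# (Responsible, Total_Quantity, Production_Duration_Days) from the sample rows
rows = [
    ("Department 1", 17.0, 47),
    ("Department 3", 13.0, 55),
    ("Department 4", 19.0, 31),
    ("Department 2", 10.0, 42),
    ("Department 2", 15.0, 27),
]

# Collect the rows of each department.
groups = defaultdict(list)
for dept, qty, days in rows:
    groups[dept].append((qty, days))

# One summary record per department: count, sum, and rounded average.
summary = {
    dept: {
        "Total_Productions": len(items),
        "Total_Units": sum(q for q, _ in items),
        "Avg_Production_Days": round(sum(d for _, d in items) / len(items), 1),
    }
    for dept, items in groups.items()
}
```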


## 6. Gold Layer Model 2: Product Category Performance

Analyze performance metrics by product category. This model helps identify which product categories are most profitable, efficient, and in demand.

In [121]:
# Create a product category performance summary
try:
    category_performance = silver_df.groupBy("Product_Category") \
        .agg(
            F.countDistinct("Product_Code").alias("Unique_Products"),
            F.count("Reference").alias("Total_Productions"),
            F.sum("Total_Quantity").alias("Total_Units"),
            F.round(F.avg("Product_Cost"), 2).alias("Avg_Cost"),
            F.round(F.avg("Product_Sales_Price"), 2).alias("Avg_Sales_Price"),
            F.round(F.avg("Profit_Margin_Percent"), 1).alias("Avg_Profit_Margin_Percent"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days")
        ) \
        .orderBy(F.desc("Total_Units"))
    
    # Display the category performance
    display(category_performance.toPandas())
    
    # Save to gold layer
    category_performance.write.mode("overwrite").parquet(os.path.join(gold_path, "product_category_performance.parquet"))
    category_performance.toPandas().to_csv(os.path.join(gold_path, "product_category_performance.csv"), index=False)
    log_message("Product category performance saved to gold layer")
    
except Exception as e:
    log_message(f"Error creating product category performance: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

| | Product_Category | Unique_Products | Total_Productions | Total_Units | Avg_Cost | Avg_Sales_Price | Avg_Profit_Margin_Percent | Avg_Production_Days |
|---|---|---|---|---|---|---|---|---|
| 0 | Furniture - Tables | 2 | 672 | 6857.0 | 266.12 | 453.13 | 69.3 | 27.0 |
| 1 | Furniture - Components | 1 | 386 | 3916.0 | 80.0 | 100.0 | 25.0 | 26.9 |
| 2 | Furniture - Office | 1 | 368 | 3867.0 | 300.0 | 450.0 | 50.0 | 27.3 |
| 3 | Furniture - Storage | 1 | 308 | 3189.0 | 100.0 | 110.5 | 10.5 | 27.1 |


Product category performance saved to gold layer
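
Because `F.avg` runs over production rows, a category average is weighted by how many productions each product has. The sketch below rechecks the "Furniture - Tables" row from the two products in that category, using the production counts, costs, and prices shown in the product-level output of the Top Performing Products section. Two assumptions are made here: cost and price are constant per product, and `Profit_Margin_Percent` equals `(price - cost) / cost * 100`, which is consistent with the displayed figures but is defined in the Silver layer, not in this notebook. The rounding helper mimics the half-up rounding of Spark's `F.round`; Python's built-in `round` uses half-to-even instead.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round half up, like Spark's F.round (hypothetical helper)."""
    quantum = Decimal(10) ** -places
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))

# (Production_Count, Avg_Cost, Avg_Sales_Price) per product in the category
tables_products = [
    (351, 290.0, 520.0),  # FURN_9666 Table
    (321, 240.0, 380.0),  # FURN_8522 Table Top
]

total = sum(n for n, _, _ in tables_products)
avg_cost = sum(n * c for n, c, _ in tables_products) / total
avg_price = sum(n * p for n, _, p in tables_products) / total
# Assumed margin definition: markup over cost, in percent.
avg_margin = sum(n * (p - c) / c * 100 for n, c, p in tables_products) / total

category_row = (
    total,
    round_half_up(avg_cost, 2),
    round_half_up(avg_price, 2),
    round_half_up(avg_margin, 1),
)
```

Under these assumptions the tuple reproduces the `Total_Productions`, `Avg_Cost`, `Avg_Sales_Price`, and `Avg_Profit_Margin_Percent` values of the category row.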


## 7. Gold Layer Model 3: Monthly Production Trends

Analyze production metrics over time by month. This model helps identify seasonal patterns and trends in production volume, efficiency, and profitability.

In [122]:
# Create monthly production trends
try:
    monthly_trends = silver_df.groupBy("Start_Year", "Start_Month", "Start_Month_Name") \
        .agg(
            F.count("Reference").alias("Production_Count"),
            F.sum("Total_Quantity").alias("Total_Units"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days"),
            F.round(F.avg("Production_Efficiency"), 1).alias("Avg_Efficiency_Percent"),
            F.round(F.avg("Profit_Margin_Percent"), 1).alias("Avg_Profit_Margin_Percent")
        ) \
        .orderBy("Start_Year", "Start_Month")
    
    # Display the monthly trends
    display(monthly_trends.toPandas())
    
    # Save to gold layer
    monthly_trends.write.mode("overwrite").parquet(os.path.join(gold_path, "monthly_production_trends.parquet"))
    monthly_trends.toPandas().to_csv(os.path.join(gold_path, "monthly_production_trends.csv"), index=False)
    log_message("Monthly production trends saved to gold layer")
    
except Exception as e:
    log_message(f"Error creating monthly trends: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

| | Start_Year | Start_Month | Start_Month_Name | Production_Count | Total_Units | Avg_Production_Days | Avg_Efficiency_Percent | Avg_Profit_Margin_Percent |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 1 | January | 141 | 1485.0 | 28.1 | 93.7 | 44.3 |
| 1 | 2024 | 2 | February | 175 | 1811.0 | 26.1 | 96.9 | 45.2 |
| 2 | 2024 | 3 | March | 181 | 1833.0 | 26.1 | 96.4 | 46.8 |
| 3 | 2024 | 4 | April | 160 | 1601.0 | 27.4 | 93.8 | 46.5 |
| 4 | 2024 | 5 | May | 189 | 1945.0 | 27.4 | 94.3 | 46.1 |
| 5 | 2024 | 6 | June | 195 | 2028.0 | 25.7 | 95.8 | 45.2 |
| 6 | 2024 | 7 | July | 169 | 1805.0 | 28.2 | 95.6 | 45.7 |
| 7 | 2024 | 8 | August | 177 | 1899.0 | 27.8 | 94.0 | 43.2 |
| 8 | 2024 | 9 | September | 170 | 1728.0 | 26.4 | 93.5 | 42.8 |
| 9 | 2024 | 10 | October | 177 | 1694.0 | 27.6 | 93.4 | 43.0 |


Monthly production trends saved to gold layer
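
One possible way to read trends out of this table is the month-over-month change in `Total_Units`. The sketch below computes it in plain Python from the values displayed above; it illustrates a downstream use of the model and is not part of the notebook's pipeline.

```python
# (Start_Month_Name, Total_Units) taken from the monthly trends output
monthly_units = [
    ("January", 1485.0), ("February", 1811.0), ("March", 1833.0),
    ("April", 1601.0), ("May", 1945.0), ("June", 2028.0),
    ("July", 1805.0), ("August", 1899.0), ("September", 1728.0),
    ("October", 1694.0),
]

# Percentage change of each month relative to the previous one.
changes = [
    (month, (units - prev_units) / prev_units * 100)
    for (_, prev_units), (month, units) in zip(monthly_units, monthly_units[1:])
]

# Months in which output fell compared with the month before.
declining_months = [month for month, pct in changes if pct < 0]
```

With only ten months of data from a single year, such changes describe short-term movement; they are not enough to establish a seasonal pattern.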


## 8. Gold Layer Model 4: Top Performing Products

Identify the most profitable and high-performing products. This model helps focus on products that generate the most revenue and profit for the business.

In [123]:
# Create top performing products summary
try:
    # Group by product and calculate metrics
    product_performance = silver_df.groupBy("Product_Code", "Product_Name", "Product_Category") \
        .agg(
            F.count("Reference").alias("Production_Count"),
            F.sum("Total_Quantity").alias("Total_Units"),
            F.round(F.avg("Product_Cost"), 2).alias("Avg_Cost"),
            F.round(F.avg("Product_Sales_Price"), 2).alias("Avg_Sales_Price"),
            F.round(F.avg("Profit_Margin_Percent"), 1).alias("Avg_Profit_Margin"),
            F.round(F.avg("Production_Efficiency"), 1).alias("Avg_Efficiency"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days")
        )
    
    # Estimate total revenue and profit from the per-product averages
    # (exact only when cost and sales price are constant for a product)
    product_performance = product_performance.withColumn(
        "Total_Revenue", 
        F.round(F.col("Total_Units") * F.col("Avg_Sales_Price"), 2)
    )
    
    product_performance = product_performance.withColumn(
        "Total_Profit", 
        F.round(F.col("Total_Units") * (F.col("Avg_Sales_Price") - F.col("Avg_Cost")), 2)
    )
    
    # Get top 20 products by profit
    top_products = product_performance.orderBy(F.desc("Total_Profit")).limit(20)
    
    # Display the top products
    display(top_products.toPandas())
    
    # Save to gold layer
    product_performance.write.mode("overwrite").parquet(os.path.join(gold_path, "product_performance.parquet"))
    product_performance.toPandas().to_csv(os.path.join(gold_path, "product_performance.csv"), index=False)
    
    top_products.write.mode("overwrite").parquet(os.path.join(gold_path, "top_products.parquet"))
    top_products.toPandas().to_csv(os.path.join(gold_path, "top_products.csv"), index=False)
    log_message("Product performance and top products saved to gold layer")
    
except Exception as e:
    log_message(f"Error creating product performance: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

| | Product_Code | Product_Name | Product_Category | Production_Count | Total_Units | Avg_Cost | Avg_Sales_Price | Avg_Profit_Margin | Avg_Efficiency | Avg_Production_Days | Total_Revenue | Total_Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FURN_9666 | Table | Furniture - Tables | 351 | 3688.0 | 290.0 | 520.0 | 79.3 | 95.5 | 27.5 | 1917760.0 | 848240.0 |
| 1 | FURN_7800 | Desk Combination | Furniture - Office | 368 | 3867.0 | 300.0 | 450.0 | 50.0 | 94.5 | 27.3 | 1740150.0 | 580050.0 |
| 2 | FURN_8522 | Table Top | Furniture - Tables | 321 | 3169.0 | 240.0 | 380.0 | 58.3 | 95.4 | 26.4 | 1204220.0 | 443660.0 |
| 3 | FURN_7023 | Wood Panel | Furniture - Components | 386 | 3916.0 | 80.0 | 100.0 | 25.0 | 94.1 | 26.9 | 391600.0 | 78320.0 |
| 4 | FURN_8855 | Drawer | Furniture - Storage | 308 | 3189.0 | 100.0 | 110.5 | 10.5 | 94.6 | 27.1 | 352384.5 | 33484.5 |


Product performance and top products saved to gold layer
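
`Total_Revenue` is computed as `Total_Units * Avg_Sales_Price`. That matches the row-level sum only when a product's price never changes, which appears to hold in this dataset. The sketch below uses hypothetical order lines (not taken from the data) to show how the two formulas diverge once the price varies, and rechecks one displayed figure.

```python
# Hypothetical production rows for one product whose price changed:
# (Total_Quantity, Product_Sales_Price)
orders = [(10.0, 100.0), (30.0, 200.0)]

total_units = sum(q for q, _ in orders)
avg_price = sum(p for _, p in orders) / len(orders)

# Formula used in the notebook: units times the unweighted average price.
approx_revenue = total_units * avg_price

# Row-level alternative: sum of quantity times price per production.
exact_revenue = sum(q * p for q, p in orders)

# Recheck of the displayed FURN_9666 row, where the price is constant.
furn_9666_revenue = 3688.0 * 520.0
```

In Spark, the row-level variant could be written inside the `agg` call as `F.sum(F.col("Total_Quantity") * F.col("Product_Sales_Price"))`, should prices start to vary within a product.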


## 9. Gold Layer Model 5: Production Efficiency Dashboard

Create a detailed view of production efficiency by department and product category. This model helps identify which departments excel at producing specific product categories.

In [124]:
# Create production efficiency dashboard data
try:
    # Department efficiency by product category
    dept_category_efficiency = silver_df.groupBy("Responsible", "Product_Category") \
        .agg(
            F.count("Reference").alias("Production_Count"),
            F.round(F.avg("Production_Efficiency"), 1).alias("Avg_Efficiency"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days")
        ) \
        .orderBy("Responsible", F.desc("Avg_Efficiency"))
    
    # Display the department category efficiency
    display(dept_category_efficiency.toPandas())
    
    # Save to gold layer
    dept_category_efficiency.write.mode("overwrite").parquet(os.path.join(gold_path, "dept_category_efficiency.parquet"))
    dept_category_efficiency.toPandas().to_csv(os.path.join(gold_path, "dept_category_efficiency.csv"), index=False)
    log_message("Department category efficiency saved to gold layer")
    
    # Monthly efficiency by department
    monthly_dept_efficiency = silver_df.groupBy("Start_Year", "Start_Month", "Start_Month_Name", "Responsible") \
        .agg(
            F.count("Reference").alias("Production_Count"),
            F.round(F.avg("Production_Efficiency"), 1).alias("Avg_Efficiency"),
            F.round(F.avg("Production_Duration_Days"), 1).alias("Avg_Production_Days")
        ) \
        .orderBy("Start_Year", "Start_Month", "Responsible")
    
    # Display the monthly department efficiency
    display(monthly_dept_efficiency.limit(10).toPandas())
    
    # Save to gold layer
    monthly_dept_efficiency.write.mode("overwrite").parquet(os.path.join(gold_path, "monthly_dept_efficiency.parquet"))
    monthly_dept_efficiency.toPandas().to_csv(os.path.join(gold_path, "monthly_dept_efficiency.csv"), index=False)
    log_message("Monthly department efficiency saved to gold layer")
    
except Exception as e:
    log_message(f"Error creating efficiency dashboard data: {str(e)}", level="error")
    import traceback
    log_message(traceback.format_exc(), level="error")

| | Responsible | Product_Category | Production_Count | Avg_Efficiency | Avg_Production_Days |
|---|---|---|---|---|---|
| 0 | Department 1 | Furniture - Tables | 122 | 97.1 | 26.4 |
| 1 | Department 1 | Furniture - Office | 74 | 96.3 | 25.7 |
| 2 | Department 1 | Furniture - Components | 89 | 94.2 | 26.3 |
| 3 | Department 1 | Furniture - Storage | 64 | 92.6 | 27.0 |
| 4 | Department 2 | Furniture - Tables | 148 | 96.6 | 25.9 |
| 5 | Department 2 | Furniture - Office | 82 | 95.6 | 28.8 |
| 6 | Department 2 | Furniture - Storage | 60 | 94.1 | 28.8 |
| 7 | Department 2 | Furniture - Components | 70 | 94.1 | 27.5 |
| 8 | Department 3 | Furniture - Storage | 61 | 96.9 | 26.8 |
| 9 | Department 3 | Furniture - Tables | 129 | 95.1 | 27.6 |


Department category efficiency saved to gold layer


| | Start_Year | Start_Month | Start_Month_Name | Responsible | Production_Count | Avg_Efficiency | Avg_Production_Days |
|---|---|---|---|---|---|---|---|
| 0 | 2024 | 1 | January | Department 1 | 32 | 92.9 | 27.2 |
| 1 | 2024 | 1 | January | Department 2 | 31 | 97.8 | 27.5 |
| 2 | 2024 | 1 | January | Department 3 | 27 | 90.5 | 28.8 |
| 3 | 2024 | 1 | January | Department 4 | 22 | 98.4 | 31.2 |
| 4 | 2024 | 1 | January | Department 5 | 29 | 89.7 | 26.8 |
| 5 | 2024 | 2 | February | Department 1 | 37 | 95.6 | 26.7 |
| 6 | 2024 | 2 | February | Department 2 | 39 | 100.0 | 26.3 |
| 7 | 2024 | 2 | February | Department 3 | 27 | 96.3 | 23.3 |
| 8 | 2024 | 2 | February | Department 4 | 40 | 93.8 | 27.7 |
| 9 | 2024 | 2 | February | Department 5 | 32 | 98.8 | 25.3 |


Monthly department efficiency saved to gold layer
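
To answer "which category does each department produce most efficiently", the department-by-category rows can be reduced to one best row per department. The following plain-Python sketch does this for the Department 1 and Department 2 rows shown above; only those two departments are included because they are the ones whose four categories all appear in the displayed output.

```python
# (Responsible, Product_Category, Avg_Efficiency) from the displayed rows
efficiency_rows = [
    ("Department 1", "Furniture - Tables", 97.1),
    ("Department 1", "Furniture - Office", 96.3),
    ("Department 1", "Furniture - Components", 94.2),
    ("Department 1", "Furniture - Storage", 92.6),
    ("Department 2", "Furniture - Tables", 96.6),
    ("Department 2", "Furniture - Office", 95.6),
    ("Department 2", "Furniture - Storage", 94.1),
    ("Department 2", "Furniture - Components", 94.1),
]

# Keep the highest-efficiency category seen so far for each department.
best = {}
for dept, category, efficiency in efficiency_rows:
    if dept not in best or efficiency > best[dept][1]:
        best[dept] = (category, efficiency)
```

Differences of a point or two between averages built on 60 to 150 productions may not be meaningful, so the ranking is better treated as a starting point for investigation than as a conclusion.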


## 10. Summary

Summarize the Gold Layer transformation process and results.

In [125]:
# Print summary of gold layer transformation
print("\nGold Layer Transformation Summary:")
print("----------------------------------------")
print(f"Input data from: {silver_path}")
print(f"Output data to: {gold_path}")
print("\nGold Layer Models Created:")
print("1. Department Production Summary")
print("2. Product Category Performance")
print("3. Monthly Production Trends")
print("4. Product Performance and Top Products")
print("5. Production Efficiency Dashboard")
print("\nAll gold layer models have been saved in both Parquet and CSV formats.")
print("Gold layer transformation completed successfully!")


Gold Layer Transformation Summary:
----------------------------------------
Input data from: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\silverLayer\2025-05-08
Output data to: c:\Users\oussema\OneDrive\Bureau\FSEGN\DW-Odoo\output\goldLayer\2025-05-08

Gold Layer Models Created:
1. Department Production Summary
2. Product Category Performance
3. Monthly Production Trends
4. Product Performance and Top Products
5. Production Efficiency Dashboard

All gold layer models have been saved in both Parquet and CSV formats.
Gold layer transformation completed successfully!
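
The summary cell prints its success message unconditionally, even if an earlier cell logged an error and wrote nothing. A minimal sketch of a check that could run first follows, assuming a hypothetical helper `missing_outputs`; the model names are the file stems used in the cells above, and the temporary directory only simulates a Gold folder.

```python
import os
import tempfile

# File stems written by the Gold layer cells in this notebook.
MODEL_NAMES = [
    "department_production_summary",
    "product_category_performance",
    "monthly_production_trends",
    "product_performance",
    "top_products",
    "dept_category_efficiency",
    "monthly_dept_efficiency",
]

def missing_outputs(gold_path, names=MODEL_NAMES):
    """Return the expected Parquet and CSV outputs not found under gold_path."""
    expected = [f"{name}{ext}" for name in names for ext in (".parquet", ".csv")]
    return [f for f in expected if not os.path.exists(os.path.join(gold_path, f))]

# Demo: only one model has been written to the simulated Gold folder.
with tempfile.TemporaryDirectory() as demo_gold:
    # Spark writes each Parquet output as a directory, pandas writes a CSV file.
    os.makedirs(os.path.join(demo_gold, "top_products.parquet"))
    open(os.path.join(demo_gold, "top_products.csv"), "w").close()
    still_missing = missing_outputs(demo_gold)
    demo_complete = missing_outputs(demo_gold, names=["top_products"])
```

In the notebook, the success message could be printed only when `missing_outputs(gold_path)` returns an empty list.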


## 11. Cleanup

Clean up resources by stopping the Spark session.

In [126]:
# Safely stop Spark session
try:
    # Check if spark session is active before stopping
    spark.sparkContext.getConf().getAll()
    spark.stop()
    print("Spark session stopped.")
except Exception:
    print("No active Spark session to stop.")

Spark session stopped.
