<center><a href="https://ilum.cloud"><img src="../logo.svg" alt="ILUM Logo"></a></center>

<center><h1 style="padding-left: 32px;">Silver to Gold</h1></center>
<center>Welcome to the Ilum Interactive Capabilities Tutorial! In this section, you will transform data from the Silver Layer so that it meets the requirements of the Gold Layer. Let's dive in!</center>
<br>

# The Gold Layer

The **Gold Layer** is the topmost tier of the Medallion architecture, delivering highly refined, business-ready datasets optimized for analytics, reporting, and advanced applications. Building upon the structured and cleansed data from the Silver Layer, it further enriches and aggregates data according to specific business requirements, providing end-users with reliable, performance-optimized, and actionable datasets.

## Key Capabilities

- **Business-Optimized Data Modeling:**  
  Organizes data into user-friendly analytical models, such as star schemas or wide tables, specifically optimized for business intelligence (BI), analytics, and machine learning (ML). Data structures here simplify queries, reduce complexity, and enhance ease of use for analysts and decision-makers.

- **Data Aggregation and Summarization:**  
  Provides pre-aggregated and summarized datasets aligned with key business KPIs. Data is grouped by dimensions such as time, geography, or business units to significantly accelerate query performance and analytical efficiency.

- **Enrichment and Business Logic Application:**  
  Integrates complex business rules, calculated fields, and derived metrics—such as profitability, customer lifetime value, or risk scoring—aligning datasets closely with organizational objectives and delivering rich business insights.

- **High-Performance Query Optimization:**  
  Implements indexing, partitioning, caching, and denormalization strategies to ensure low-latency access and rapid querying capabilities. Datasets in the Gold Layer are optimized to facilitate quick, efficient, and interactive data exploration.

- **Data Governance and Lineage:**  
  Ensures strict governance policies, detailed lineage tracking, and metadata management. This provides transparency about how data is transformed from its raw state, supporting auditability, compliance, and data trust.

## Why Use the Gold Layer?

The Gold Layer provides significant advantages by delivering trusted, refined datasets specifically tailored for business needs. Its core benefits include:

- **Reliable and Consistent Reporting:**  
  Delivers standardized datasets, ensuring consistency across dashboards, reports, and analytics. It serves as the single source of truth, eliminating conflicting metrics and increasing trust in the data.

- **Optimized Performance for Analytics:**  
  Pre-aggregated and optimized datasets reduce query complexity and execution time, enabling rapid analytics and improved responsiveness in dashboards and reports.

- **Business User Accessibility:**  
  Presents data in intuitive, easily understandable structures, empowering analysts, business users, and decision-makers to perform self-service analytics without advanced technical knowledge.

- **Enhanced Strategic Decision-Making:**  
  Facilitates deeper, data-driven analyses, supporting strategic planning, operational decisions, forecasting, and advanced analytics (such as customer segmentation and predictive modeling).

- **Regulatory Compliance and Governance:**  
  Robust data governance, lineage tracking, and metadata management simplify compliance with regulations (e.g., GDPR, HIPAA, SOX), easing audits and ensuring data governance standards are consistently met.

## Summary

The **Gold Layer** is the ultimate refinement stage within the Medallion architecture, transforming structured data from the Silver Layer into highly optimized, business-ready datasets. It acts as the single source of truth, driving reliable insights, effective analytics, and strategic business decisions. By ensuring datasets are consistently accurate, enriched, and user-focused, the Gold Layer empowers organizations to maximize the value of their data assets across all analytical and operational activities.


As a next step, let’s walk through an example of transforming data into the Gold Layer.

---

## Example: Creating the Gold Layer from the Silver Layer

In this example, we will transform data from the Silver Layer into the Gold Layer by applying business-specific transformations, aggregations, and enrichment.

### **Step 1: Set Up the Environment**
To begin, we need to make sure the environment is ready for data processing and Delta-Hive integration. This includes applying the necessary configuration and importing the required libraries.

<div class="alert alert-info" role="alert">
  <h4 class="alert-heading">Before running your Delta notebook</h4>
  <p>Please ensure your environment is properly configured for Delta Lake and Hive integration.</p>
  <ul>
    <li>
      <strong>Global Delta Capabilities:</strong> Ensure that your cluster or global Spark configuration includes the following settings:
      <table class="table table-bordered" style="text-align: left;">
        <thead>
          <tr>
            <th>key</th>
            <th>value</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>spark.sql.extensions</td>
            <td>io.delta.sql.DeltaSparkSessionExtension</td>
          </tr>
          <tr>
            <td>spark.sql.catalog.spark_catalog</td>
            <td>org.apache.spark.sql.delta.catalog.DeltaCatalog</td>
          </tr>
           <tr>
            <td>spark.databricks.delta.catalog.update.enabled</td>
            <td>true</td>
          </tr>
          <tr>
            <td>spark.kubernetes.container.image</td>
            <td>ilum/spark:3.5.3-delta</td>
          </tr>
        </tbody>
      </table>
    </li>
    <li>
      <strong>Hive Integration Requirements:</strong> This notebook is integrated with Hive. To properly support Hive, you must enable Hive in your environment. For detailed instructions, please refer to <a href="https://ilum.cloud/resources/getting-started" target="_blank" rel="noopener noreferrer">this guide</a>. Also, add the following properties to your cluster configuration:
      <table class="table table-bordered" style="text-align: left;">
        <thead>
          <tr>
            <th>key</th>
            <th>value</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>spark.hadoop.hive.metastore.uris</td>
            <td>thrift://ilum-hive-metastore:9083</td>
          </tr>
          <tr>
            <td>spark.sql.catalogImplementation</td>
            <td>hive</td>
          </tr>
          <tr>
            <td>spark.sql.warehouse.dir</td>
            <td>s3a://ilum-data/</td>
          </tr>
        </tbody>
      </table>
    </li>
    <li>
      <strong>Session-Specific Delta-Hive Capabilities:</strong> If Delta and Hive are only required for a specific session, configure the necessary properties and dependencies on a per-session basis. For example:
      <pre><code>{"conf": {"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension", "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog", "spark.sql.warehouse.dir": "s3a://ilum-data/", "spark.kubernetes.container.image": "ilum/spark:3.5.3-delta", "spark.databricks.delta.catalog.update.enabled": "true", "spark.hadoop.hive.metastore.uris": "thrift://ilum-hive-metastore:9083", "spark.sql.catalogImplementation": "hive"}, "driverMemory": "1000M", "executorCores": 2}</code></pre>
      This configuration prepares your session for Delta operations without affecting other workflows.
    </li>
  </ul>
</div>
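
The single-line JSON above is hard to read and easy to mistype. As a minimal sketch, the same session properties can be assembled in a Python dictionary and serialized with the standard `json` module. The variable names (`delta_conf`, `hive_conf`, `session_properties`) are illustrative only; the keys and values are copied from the tables above.

```python
import json

# Delta Lake settings, copied from the first table above.
delta_conf = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    "spark.databricks.delta.catalog.update.enabled": "true",
    "spark.kubernetes.container.image": "ilum/spark:3.5.3-delta",
}

# Hive settings, copied from the second table above.
hive_conf = {
    "spark.hadoop.hive.metastore.uris": "thrift://ilum-hive-metastore:9083",
    "spark.sql.catalogImplementation": "hive",
    "spark.sql.warehouse.dir": "s3a://ilum-data/",
}

# Merge both groups and add the resource settings from the example above.
session_properties = {
    "conf": {**delta_conf, **hive_conf},
    "driverMemory": "1000M",
    "executorCores": 2,
}

# Serialize to a single JSON string.
session_json = json.dumps(session_properties)
print(session_json)
```

The printed string can then be used wherever the session properties JSON is expected. Keeping the Delta and Hive settings in separate dictionaries makes it easier to see which group a key belongs to.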

First, load the sparkmagic extension by running the following command:

In [None]:
%load_ext sparkmagic.magics

Ilum's bundled Jupyter works out of the box and has a predefined endpoint address that points to `livy-proxy`.

Use **%manage_spark** to create a new session.

Choose between Scala or Python, adjust Spark settings if necessary, and then click the `Create Session` button. As simple as that. 

The following example is written in `Python`.

In [None]:
%manage_spark

Before we start processing, we need to import the necessary libraries.

In [None]:
%%spark

    from pyspark.sql.functions import sort_array, collect_list, concat_ws, count

**Creating a Dedicated Database for the Use Case**

A good practice in data engineering is to separate data within dedicated databases for specific use cases. This approach helps maintain data organization and makes it easier to manage, query, and scale.

For this use case, we will create a database named `example_gold`. This will ensure that all data related to this use case is stored in a structured and isolated manner.

To create the database, we use the following command:

In [None]:
%%spark

    # IF NOT EXISTS keeps the cell from failing when the notebook is run again
    spark.sql("CREATE DATABASE IF NOT EXISTS example_gold")

### **Step 2: Load Data from the Silver Layer**
The first stage of processing in this layer is to read data from the silver layer.

In [None]:
%%spark

    animals_df = spark.read.table("example_silver.animals")
    owners_silver_df = spark.read.table("example_silver.owners")

### **Step 3: Transform Data for Business Needs**
One of the business requirements is to count the animals each owner has and to list their names in a single column.

In [None]:
%%spark

    animals_count = (
        animals_df.groupby("owner_id")
        .agg(
            concat_ws(", ", sort_array(collect_list("animal_name"))).alias("animals_names"),
            count("animal_name").alias("animals_qty"),
        )
    )
    
    animals_count.sort("owner_id").show(5)
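
To make the aggregation easier to follow, here is a plain-Python sketch of what it computes, run on a few hypothetical rows (the sample names and IDs are invented for illustration and are not taken from `example_silver.animals`). It mirrors the Spark functions step by step but does not need a Spark session.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for example_silver.animals.
animals = [
    {"owner_id": 1, "animal_name": "Rex"},
    {"owner_id": 2, "animal_name": "Milo"},
    {"owner_id": 1, "animal_name": "Bella"},
]

# Equivalent of groupby("owner_id") + collect_list("animal_name").
names_by_owner = defaultdict(list)
for row in animals:
    names_by_owner[row["owner_id"]].append(row["animal_name"])

# Equivalent of concat_ws(", ", sort_array(...)) and count("animal_name").
animals_count = [
    {
        "owner_id": owner_id,
        "animals_names": ", ".join(sorted(names)),
        "animals_qty": len(names),
    }
    for owner_id, names in sorted(names_by_owner.items())
]

for row in animals_count:
    print(row)
```

Note that in Spark both `collect_list` and `count` on a column skip null values, a case this sketch leaves out because the sample rows contain no missing names.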

Next, let's join the aggregated counts with the owner data to build the result table.

In [None]:
%%spark

    owners_df = (
        owners_silver_df.join(animals_count, animals_count.owner_id == owners_silver_df.owner_id, "right")
        .select(
            owners_silver_df.owner_id,
            owners_silver_df.first_name,
            owners_silver_df.last_name,
            animals_count.animals_names,
            animals_count.animals_qty,
            owners_silver_df.mobile,
            owners_silver_df.email,
        )
        .sort("owner_id")
    )

    owners_df.show(5)
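
The join above uses the `"right"` type, so it keeps every row of `animals_count` and drops owners that have no animals. The plain-Python sketch below illustrates that behavior on hypothetical data (all names and IDs are invented): owner 2 has no animals, and owner 3 has an animal but no row in the owners table.

```python
# Hypothetical sample data. Owner 2 has no animals; owner 3 has an animal
# but no row in the owners table.
owners = {
    1: {"first_name": "Ann", "last_name": "Lee"},
    2: {"first_name": "Bob", "last_name": "Ray"},
}
animals_count = [
    {"owner_id": 1, "animals_names": "Bella, Rex", "animals_qty": 2},
    {"owner_id": 3, "animals_names": "Coco", "animals_qty": 1},
]

# Right join: iterate over the right side and attach the owner if one matches.
result = []
for agg in animals_count:
    owner = owners.get(agg["owner_id"])  # None when no owner matches
    result.append(
        {
            # owner_id is read from the owners side, mirroring the select above
            "owner_id": agg["owner_id"] if owner is not None else None,
            "first_name": owner["first_name"] if owner is not None else None,
            "last_name": owner["last_name"] if owner is not None else None,
            "animals_names": agg["animals_names"],
            "animals_qty": agg["animals_qty"],
        }
    )

for row in result:
    print(row)
```

In this sketch Bob does not appear in the result, and the row for Coco has empty owner fields, including `owner_id`. If every owner should appear even without animals, a left join would be the alternative; which one fits depends on the business requirement.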

### **Step 4: Save Data to the Gold Layer**
Save the transformed and enriched data to the Gold Layer in Delta format. \
Delta keeps a transaction log alongside the data, which gives access to the table's history of changes, and it stores the data as compressed Parquet files, which keeps storage use low.

In [None]:
%%spark

    animals_df.write.format("delta").saveAsTable("example_gold.animals")
    owners_df.write.format("delta").saveAsTable("example_gold.owners")
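
Once the tables are saved, the history of changes mentioned above can be inspected with Delta Lake SQL. The sketch below only builds the SQL strings, so it runs without a Spark session; the helper names `history_query` and `version_query` are hypothetical, and the `VERSION AS OF` syntax is assumed to be available in the Delta version shipped with your image.

```python
def history_query(table_name: str) -> str:
    """Build the Delta Lake SQL statement that lists a table's commit history."""
    return f"DESCRIBE HISTORY {table_name}"


def version_query(table_name: str, version: int) -> str:
    """Build a time-travel query that reads a table as of a given version."""
    return f"SELECT * FROM {table_name} VERSION AS OF {int(version)}"


print(history_query("example_gold.owners"))
print(version_query("example_gold.owners", 0))
```

If these helpers are defined inside a `%%spark` cell, the resulting strings can be passed to `spark.sql(...)`, for example `spark.sql(history_query("example_gold.owners")).show()`.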

### **Summary**
In this example:

 - **We loaded data** from the Silver Layer.
 - **We transformed and enriched the data** by applying business-specific aggregations and calculations.
 - **We saved the final data** to the Gold Layer in Delta format, making it ready for business consumption. 

By structuring data in the Gold Layer, businesses can leverage it for trend analysis, customer behavior insights, financial forecasting, and more, enabling smarter, data-driven decisions.

### Cleaning up

Now that you’re done, clean up your Spark sessions to free the resources they hold.
Open the session manager below and click the Delete button next to each session.

![Ilum session clean](../../images/clean_ilum_jupyter_session.png)

In [None]:
%manage_spark