<center><a href="https://ilum.cloud"><img src="../logo.svg" alt="ILUM Logo"></a></center>

<center><h1 style="padding-left: 32px;">Bronze to Silver</h1></center>
<center>Welcome to the Ilum Interactive Capabilities Tutorial! In this section, you will transform data from the Bronze Layer so that it meets the requirements of the Silver Layer. Let's dive in!</center>
<br>

### The Silver Layer

The **Silver Layer** is the middle layer of the Medallion architecture. It stores cleansed and conformed data from the Bronze Layer, making it ready for analysis and other downstream applications.

#### **Typical Data in the Silver Layer**
The data stored in the Silver Layer often includes:
- **Cleansed Data**: Data from the Bronze Layer that has been cleansed of errors and inconsistencies.
- **Conformed Data**: Data standardized to a common schema across various sources.
- **Enriched Data**: Data enhanced with additional information, such as historical or demographic data.

This data is typically stored in relational databases, data warehouses, or cloud-based data lakes.

#### **Purpose of the Silver Layer**
The Silver Layer serves several important purposes:
- **Clean and Conformed View**: Provides a consistent and error-free representation of the data.
- **Accessibility**: Makes data easily available for analysis and downstream applications.
- **Foundation for the Gold Layer**: Acts as a starting point for further transformations into refined datasets.

#### **Applications of Silver Layer Data**
Data from the Silver Layer can be used for:
- Trend and customer behavior analysis.
- Identifying opportunities to improve efficiency.
- Supporting business decision-making processes.

#### **Summary**
The **Silver Layer** plays a critical role in the Medallion architecture by refining and enriching raw data from the Bronze Layer. Through processes like deduplication, filtering, and transformation, the Silver Layer produces clean, structured, and analyzable datasets. This layer bridges the gap between raw data and actionable insights, ensuring consistency and reliability for downstream applications.

As a continuation, let's now walk through an example of processing and enriching data in the Silver Layer.

---

### Example: Transforming Data into the Silver Layer

In this example, we will demonstrate how to transform data from the Bronze Layer into the Silver Layer by cleansing, conforming, and enriching it for analytical use cases.

#### **Step 1: Set Up the Environment**


First, we need to load the sparkmagic extension. You can do this by running the following command:

In [None]:
%load_ext sparkmagic.magics

Ilum's Bundled Jupyter is ready to work out of the box and has a predefined endpoint address, which points to ```livy-proxy```. 

Use **%manage_spark** to create a new session.

Choose between Scala or Python, adjust Spark settings if necessary, and then click the `Create Session` button. As simple as that. 

The following example is written in `Python`.

In [None]:
%manage_spark

Before we start processing, we need to import the necessary libraries.

In [None]:
%%spark

    from pyspark.sql.functions import to_date, col
    from pyspark.sql.types import IntegerType, StringType, LongType, StructType, StructField

**Creating a Dedicated Database for the Use Case**

A good practice in data engineering is to separate data within dedicated databases for specific use cases. This approach helps maintain data organization and makes it easier to manage, query, and scale.

For this use case, we will create a database named `example_silver`. This will ensure that all data related to this use case is stored in a structured and isolated manner.

To create the database, we use the following command:

In [None]:
%%spark

    # IF NOT EXISTS keeps this cell safe to re-run.
    spark.sql("CREATE DATABASE IF NOT EXISTS example_silver")

#### **Step 2: Load Data from the Bronze Layer**

The second stage is to read the data from the Bronze Layer, set the correct data types, and reject invalid rows. The same steps are repeated for each dataset:

 - #### **animals**
We start by reading the `animals` table from the `example_bronze` database. To ensure data cleanliness, we use the `dropna()` method to remove rows with null values.

In [None]:
%%spark 

    animals_bronze_df = spark.read.table("example_bronze.animals").dropna()
    animals_bronze_df.printSchema()
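
Because `dropna()` discards rows silently, it can help to know how many were rejected. The following is a minimal sketch: `count_rejected_rows` is a hypothetical helper, not part of the tutorial, and it relies only on the standard `DataFrame.count()` and `DataFrame.dropna()` methods.

```python
def count_rejected_rows(df):
    """Return (total, kept, rejected) row counts for a plain dropna() pass.

    `df` is assumed to be a Spark DataFrame, such as a Bronze table
    read with spark.read.table(...) before dropna() is applied.
    """
    total = df.count()
    kept = df.dropna().count()  # rows with no null in any column
    return total, kept, total - kept
```

In a `%%spark` cell, this could be called as `count_rejected_rows(spark.read.table("example_bronze.animals"))`. Each `count()` triggers a Spark job, so the check is best reserved for exploration.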

##### **Define and Enforce a Strict Schema**
We define a strict schema using `StructType` to ensure that all columns have the correct data types. This step validates the data and makes the schema consistent across the pipeline.

In [None]:
%%spark

    animals_schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("owner_id", IntegerType(), False),
        StructField("specie_id", IntegerType(), False),
        StructField("animal_name", StringType(), False),
        StructField("gender", StringType(), False),
        StructField("birth_date", StringType(), False),
        StructField("color", StringType(), False),
        StructField("size", StringType(), False),
        StructField("weight", StringType(), False)
    ])

    animals_df = spark.createDataFrame(animals_bronze_df.rdd, schema=animals_schema)
    animals_df.printSchema()
    animals_df.show(5)

The resulting `animals_df` contains data that adheres to the specified schema. This ensures consistency and reliability for downstream processing.
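
One caveat: as far as we know, `spark.createDataFrame` with an explicit schema checks that each value already has the declared type and does not convert values. If the Bronze table stores numeric columns as strings (an assumption, since the Bronze schema is not shown here), an explicit cast would be needed first. A minimal sketch, where `cast_expressions` and `animals_types` are hypothetical names:

```python
def cast_expressions(column_types):
    """Build Spark SQL CAST expressions from a {column: sql_type} mapping."""
    return [
        f"CAST(`{name}` AS {sql_type}) AS `{name}`"
        for name, sql_type in column_types.items()
    ]


# Hypothetical target types, chosen to mirror animals_schema.
animals_types = {
    "id": "INT",
    "owner_id": "INT",
    "specie_id": "INT",
    "animal_name": "STRING",
    "gender": "STRING",
    "birth_date": "STRING",
    "color": "STRING",
    "size": "STRING",
    "weight": "STRING",
}

for expression in cast_expressions(animals_types):
    print(expression)
```

In a `%%spark` cell, the expressions could be applied with `animals_bronze_df.selectExpr(*cast_expressions(animals_types))`. Depending on the `spark.sql.ansi.enabled` setting, values that cannot be cast either become null or raise an error, so a second `dropna()` after casting may be worthwhile.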

---

- #### **owners**
This time, we will walk through the entire process for the `owners` table, including data reading, schema refinement, and preparing it for future processing.

In [None]:
%%spark

    owners_bronze_df = spark.read.table("example_bronze.owners").dropna()
    owners_bronze_df.printSchema()

##### **Define and Enforce a Strict Schema**
We define a strict schema using `StructType` to ensure that all columns have the correct data types. This step validates the data and makes the schema consistent across the pipeline.

In [None]:
%%spark

    owners_schema = StructType([
                        StructField("owner_id", IntegerType(), False),
                        StructField("first_name", StringType(), False),
                        StructField("last_name", StringType(), False),
                        StructField("mobile", LongType(), False),
                        StructField("email", StringType(), False)
                        ])

    owners_df = spark.createDataFrame(owners_bronze_df.rdd, schema=owners_schema)
    owners_df.printSchema()
    owners_df.show(5)

---

 - #### **species**
We repeat the same process for the `species` table: reading the data, refining the schema, and preparing it for future processing.

In [None]:
%%spark 

    species_bronze_df = spark.read.table("example_bronze.species").dropna()
    species_bronze_df.printSchema()

##### **Define and Enforce a Strict Schema**
We define a strict schema using `StructType` to ensure that all columns have the correct data types. This step validates the data and makes the schema consistent across the pipeline.

In [None]:
%%spark

    species_schema = StructType([
                        StructField("specie_id", IntegerType(), False),
                        StructField("specie_name", StringType(), False)
                        ])

    species_df = spark.createDataFrame(species_bronze_df.rdd, schema=species_schema)
    species_df.printSchema()
    species_df.show(5)
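
Since the same read, clean, and enforce-schema steps are repeated for all three tables, they could be wrapped in a small helper. This is only a sketch: `load_bronze_table` is a hypothetical name, and `spark` is assumed to be the session object available inside a `%%spark` cell.

```python
def load_bronze_table(spark, table_name, schema):
    """Read a Bronze table, drop rows containing nulls, and apply `schema`.

    Mirrors the three steps used above for animals, owners, and species.
    """
    bronze_df = spark.read.table(table_name).dropna()
    return spark.createDataFrame(bronze_df.rdd, schema=schema)
```

For example, `species_df = load_bronze_table(spark, "example_bronze.species", species_schema)` would replace the two species cells above.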

#### **Step 3: Transform and Cleanse Data**
The third stage is to combine the data read from the Bronze Layer into a result table and format it.
Below, the `animals` and `species` DataFrames are joined to link each animal to its corresponding species, and `birth_date` is converted from a `MM/dd/yyyy` string to a date.

In [None]:
%%spark 

    # A left join keeps every animal, even when no matching species exists.
    animals_df = (
        animals_df
        .join(species_df, animals_df["specie_id"] == species_df["specie_id"], "left")
        .select(
            animals_df["id"],
            animals_df["owner_id"],
            species_df["specie_name"],
            animals_df["animal_name"],
            # Parse the MM/dd/yyyy string into a proper date column.
            to_date(animals_df["birth_date"], "MM/dd/yyyy").alias("birth_date"),
            animals_df["gender"],
            animals_df["size"],
            animals_df["color"],
            animals_df["weight"],
        )
    )
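
A left join keeps animals whose `specie_id` has no match, leaving `specie_name` null, and `to_date` may yield null for strings that do not fit the pattern (depending on Spark's ANSI settings). A quick check after the join can surface both cases. A minimal sketch, with `count_nulls` as a hypothetical helper:

```python
def count_nulls(df, column):
    """Count the rows of a Spark DataFrame in which `column` is null."""
    # DataFrame.filter accepts a SQL expression string.
    return df.filter(f"`{column}` IS NULL").count()
```

In a `%%spark` cell, `count_nulls(animals_df, "specie_name")` and `count_nulls(animals_df, "birth_date")` should both return zero if every animal matched a species and every birth date was parsed.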

#### **Step 4: Save Data to the Silver Layer**
Save the cleansed and conformed data to the Silver Layer in Delta format. \
The Delta format keeps a history of changes to each table and stores the data as compressed Parquet files, which keeps storage usage low.

In [None]:
%%spark

    animals_df.write.format("delta").saveAsTable("example_silver.animals")
    owners_df.write.format("delta").saveAsTable("example_silver.owners")
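
Two follow-ups may be useful here. Without an explicit write mode, `saveAsTable` fails if the table already exists, so re-running the cell raises an error. In addition, the change history mentioned above can be inspected with Delta Lake's `DESCRIBE HISTORY` command. The sketch below assumes `spark` is the session available inside a `%%spark` cell; `save_silver_table` and `table_history` are hypothetical helper names.

```python
def save_silver_table(df, table_name, mode="overwrite"):
    """Write `df` as a Delta table; "overwrite" replaces an existing table."""
    df.write.format("delta").mode(mode).saveAsTable(table_name)


def table_history(spark, table_name):
    """Return the Delta change history of `table_name` as a DataFrame."""
    return spark.sql(f"DESCRIBE HISTORY {table_name}")
```

For example, `save_silver_table(animals_df, "example_silver.animals")` could be followed by `table_history(spark, "example_silver.animals").select("version", "timestamp", "operation").show()` to list the recorded versions.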

#### **Summary**
In this example:

 - **We loaded data** from the Bronze Layer.
 - **We transformed the data** by cleansing it of errors and conforming it to a consistent schema.
 - **We joined the `animals` and `species` DataFrames**, enriching each animal record with its species name.
 - **We saved the processed data** to the Silver Layer in Delta format for easy accessibility.

This structured approach ensures that data is ready for analysis and supports efficient business decision-making processes.

### Cleaning up

Now that you’re done with your work, clean up your Spark sessions to free resources that are no longer in use.
Simply click the Delete buttons!

![Ilum session clean](../../images/clean_ilum_jupyter_session.png)

In [None]:
%manage_spark

#### [Click here to proceed to the "Silver to gold" section.](3_Silver_to_gold.ipynb)