# 1. Order producer

> N.B. To run this notebook you must change the `%iam_role` magic command. Each account has its own role: in the AWS console, go to IAM / Roles / LabRole and copy the ARN, as shown in the attached image.
>  

In [5]:
%iam_role arn:aws:iam::484183516222:role/LabRole
%region us-east-1
%number_of_workers 2
%idle_timeout 60

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 1.0.6 and you have 1.0.4 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Current iam_role is None
iam_role has been set to arn:aws:iam::484183516222:role/LabRole.
Previous region: None
Setting new region to: us-east-1
Region is set to: us-east-1
Previous number of workers: None
Setting new number of workers to: 2
Current idle_timeout is None minutes.
idle_timeout has been set to 60 minutes.


In [1]:
spark

Trying to create a Glue session for the kernel.
Session Type: etl
Worker Type: G.1X
Number of Workers: 2
Session ID: 340fd606-57aa-4f26-98d1-70898bfc2aba
Applying the following default arguments:
--glue_kernel_version 1.0.4
--enable-glue-datacatalog true
Waiting for session 340fd606-57aa-4f26-98d1-70898bfc2aba to get into ready status...
Session 340fd606-57aa-4f26-98d1-70898bfc2aba has been created.
<pyspark.sql.session.SparkSession object at 0x7fce02d35d80>


# 1. Imports & Constants

In [14]:
import os
import pyspark.sql.types as t
import pyspark.sql.functions as f




In [15]:
BUCKET_NAME = "s3://vrpoptimiserplatform"
RAW = "raw"
ORDERS = "orders"

BRONZE = "bronze"
SILVER = "silver"
GOLD = "gold"

ADDRESS_DATA = "address_data.json"
CLIENTS_DATA = "client_data.json"

RAW_ADDRESS_PATH = os.path.join(BUCKET_NAME, RAW, ADDRESS_DATA)
RAW_CLIENTS_PATH = os.path.join(BUCKET_NAME, RAW, CLIENTS_DATA)
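
The paths above are built with `os.path.join`, which uses the separator of the host operating system. That yields forward slashes on the Linux workers of a Glue session, but it would produce backslashes if the same constants were built on Windows. A minimal sketch of an OS-independent alternative, assuming a hypothetical helper named `s3_join` (not part of the original notebook) and the standard-library `posixpath` module:

```python
import posixpath


def s3_join(bucket_uri, *parts):
    """Join S3 key segments with forward slashes, regardless of the host OS.

    `s3_join` is a hypothetical helper introduced only for illustration.
    """
    # Trim stray slashes so segments never produce "//" or reset the path.
    cleaned = (part.strip("/") for part in parts)
    return posixpath.join(bucket_uri.rstrip("/"), *cleaned)


raw_address_path = s3_join("s3://vrpoptimiserplatform", "raw", "address_data.json")
raw_clients_path = s3_join("s3://vrpoptimiserplatform", "raw", "client_data.json")
```

The resulting strings can be passed to `spark.read` in the same way as the constants defined in the cell above.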




In [20]:
# Expected value: "s3://vrpoptimiserplatform/raw/address_data.json"
RAW_ADDRESS_PATH

's3://vrpoptimiserplatform/raw/address_data.json'


In [5]:
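def read_json_to_df(file_path, schema=None, multi_line=False):
    """
    Read JSON file into DataFrame.
    
    :param file_path: Path to the JSON file.
    :param schema: Optional schema to enforce while reading.
    :param multi_line: Set to True when the file is a single JSON document
        spanning several lines rather than one record per line (JSON Lines).
    :return: DataFrame
    """
    reader = spark.read.option("multiLine", multi_line)
    if schema:
        reader = reader.schema(schema)
    return reader.json(file_path)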

def write_df_to_parquet(df, file_path, partition_by=None, mode="overwrite"):
    """
    Write DataFrame to Parquet file.
    
    :param df: DataFrame to be written.
    :param file_path: Path where the Parquet file will be saved.
    :param partition_by: Column(s) to partition by.
    :param mode: Write mode, default is 'overwrite'.
    """
    if partition_by:
        df.write.mode(mode).partitionBy(partition_by).parquet(file_path)
    else:
        df.write.mode(mode).parquet(file_path)

def transform_clients_bronze_to_silver(clients_df):
    """
    Transform Bronze (raw) Clients DataFrame to Silver (cleaned) DataFrame.
    
    :param clients_df: Clients DataFrame.
    :return: Cleaned Clients DataFrame.
    """
    # Example transformations: Filtering active clients, renaming columns, etc.
    clients_silver_df = clients_df.filter(f.col("status") == "active")
    return clients_silver_df

def transform_addresses_bronze_to_silver(addresses_df):
    """
    Transform Bronze (raw) Addresses DataFrame to Silver (cleaned) DataFrame.
    
    :param addresses_df: Addresses DataFrame.
    :return: Cleaned Addresses DataFrame.
    """
    # Assuming some transformations on addresses, e.g., removing empty house numbers
    addresses_silver_df = addresses_df.filter(f.col("house_number") != "")
    return addresses_silver_df

def transform_clients_addresses_silver_to_gold(clients_silver_df, addresses_silver_df):
    """
    Transform Silver (cleaned) Clients and Addresses DataFrames to Gold (aggregated/enriched) DataFrame.
    
    :param clients_silver_df: Cleaned Clients DataFrame.
    :param addresses_silver_df: Cleaned Addresses DataFrame.
    :return: Enriched DataFrame combining both clients and addresses.
    """
    # Example enrichment: join clients with their addresses on client_id
    gold_df = clients_silver_df.join(addresses_silver_df, on="client_id", how="inner")
    return gold_df






# 2. Medallion Architecture

## 2.1 Bronze Layer

In [22]:
# Define the schema for the clients JSON structure
clients_schema = t.StructType([
    t.StructField("client_id", t.StringType(), True),
    t.StructField("first_name", t.StringType(), True),
    t.StructField("last_name", t.StringType(), True),
    t.StructField("email", t.StringType(), True),
    t.StructField("phone_number", t.StringType(), True),
    t.StructField("date_of_birth", t.StringType(), True),
    t.StructField("gender", t.StringType(), True),
    t.StructField("occupation", t.StringType(), True),
    t.StructField("created_at", t.StringType(), True),
    t.StructField("updated_at", t.StringType(), True),
    t.StructField("status", t.StringType(), True)
])

# Define the schema for the addresses JSON structure
addresses_schema = t.StructType([
    t.StructField("client_id", t.StringType(), True),
    t.StructField("address_id", t.StringType(), True),
    t.StructField("neighborhood", t.StringType(), True),
    t.StructField("coordinates", t.ArrayType(t.DoubleType()), True),
    t.StructField("road", t.StringType(), True),
    t.StructField("house_number", t.StringType(), True),
    t.StructField("suburb", t.StringType(), True),
    t.StructField("city_district", t.StringType(), True),
    t.StructField("state", t.StringType(), True),
    t.StructField("postcode", t.StringType(), True),
    t.StructField("country", t.StringType(), True),
    t.StructField("lat", t.StringType(), True),
    t.StructField("lon", t.StringType(), True)
])




In [23]:
df_address_bronze = read_json_to_df(RAW_ADDRESS_PATH, addresses_schema)




In [24]:
df_address_bronze.show()

+---------+----------+------------+-----------+----+------------+------+-------------+-----+--------+-------+----+----+
|client_id|address_id|neighborhood|coordinates|road|house_number|suburb|city_district|state|postcode|country| lat| lon|
+---------+----------+------------+-----------+----+------------+------+-------------+-----+--------+-------+----+----+
|     null|      null|        null|       null|null|        null|  null|         null| null|    null|   null|null|null|
|     null|      null|        null|       null|null|        null|  null|         null| null|    null|   null|null|null|
|     null|      null|        null|       null|null|        null|  null|         null| null|    null|   null|null|null|
|     null|      null|        null|       null|null|        null|  null|         null| null|    null|   null|null|null|
|     null|      null|        null|       null|null|        null|  null|         null| null|    null|   null|null|null|
|     null|      null|        null|     

In [27]:
(
    spark.read
    .format("json")
    .option("multiLine", "True")
    .option("mode", "PERMISSIVE")
    .load(RAW_ADDRESS_PATH)
).printSchema()

root
 |-- address_id: string (nullable = true)
 |-- city_district: string (nullable = true)
 |-- client_id: string (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- country: string (nullable = true)
 |-- house_number: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- lon: string (nullable = true)
 |-- neighborhood: string (nullable = true)
 |-- postcode: string (nullable = true)
 |-- road: string (nullable = true)
 |-- state: string (nullable = true)
 |-- suburb: string (nullable = true)
