![Verne Academy Summit 2024](https://github.com/javendia/verne-academy-summit-2024/blob/main/header.png?raw=true)

## Step 1: Data ingestion
- We read the **Orders.csv** file, specifying the format (CSV) and indicating that the file contains a header row with the column names
- We create the destination Delta table if it does not already exist

In [1]:
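# Explicit imports instead of wildcard imports, so the origin of each name is clear
from pyspark.sql.types import DateType, DecimalType, IntegerType, StringType, TimestampType
from delta.tables import DeltaTable
from pyspark.sql.functions import regexp_replace

# Read the source file; header=True takes the column names from the first row
df = spark.read.load('Files/wwi/Orders.csv',
    format='csv',
    header=True
)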

DeltaTable.createIfNotExists(spark) \
     .tableName("Orders") \
     .addColumn("OrderID", IntegerType()) \
     .addColumn("OrderLineID", IntegerType()) \
     .addColumn("CustomerID", IntegerType()) \
     .addColumn("ContactPersonID", IntegerType()) \
     .addColumn("SalespersonPersonID", IntegerType()) \
     .addColumn("CustomerPurchaseOrderNumber", StringType()) \
     .addColumn("OrderDate", DateType()) \
     .addColumn("StockItemID", IntegerType()) \
     .addColumn("Description", StringType()) \
     .addColumn("PackageTypeID", StringType()) \
     .addColumn("Quantity", IntegerType()) \
     .addColumn("UnitPrice", DecimalType(18,2)) \
     .addColumn("TaxRate", DecimalType(18,3)) \
     .addColumn("LastUpdated", TimestampType()) \
     .execute()

deltaTable = DeltaTable.forPath(spark, 'Tables/orders')


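Because the read above sets neither `inferSchema` nor an explicit schema, Spark's CSV reader returns every column as a string. The plain-Python sketch below shows the same idea with the standard-library `csv` module. The sample rows and the `read_with_header` helper are made up for illustration and are not part of the notebook.

```python
import csv
import io

# Hypothetical sample data; the real Orders.csv has more columns
SAMPLE = "OrderID,OrderLineID,Quantity\n1,1,10\n1,2,5\n"


def read_with_header(text):
    """Parse CSV text, using the first row as the column names."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_with_header(SAMPLE)
print(rows[0])  # every value is a str, including the numeric-looking ones
```

This is why the numeric columns still need an explicit cast before they are written to the typed destination table.
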
## Step 2: Transformation
- We normalize the decimal separator, replacing **','** with **'.'**, and cast the columns to the data types of the destination table

In [2]:
df = df.withColumn('UnitPrice', regexp_replace('UnitPrice', ',', '.').cast(DecimalType(18,2))) \
        .withColumn('TaxRate', regexp_replace('TaxRate', ',', '.').cast(DecimalType(18,3)))


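To see what the replacement and cast do to a single value, here is a minimal plain-Python sketch using `decimal.Decimal`. The `to_decimal` helper is hypothetical, and half-up rounding is an assumption made here to approximate the cast to `DecimalType`.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal(text, scale):
    """Swap a decimal comma for a point and round to `scale` decimal places."""
    # Decimal(1).scaleb(-2) is 1E-2, i.e. a quantum of two decimal places
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(text.replace(",", ".")).quantize(quantum, rounding=ROUND_HALF_UP)


unit_price = to_decimal("13,50", 2)   # comparable to DecimalType(18,2)
tax_rate = to_decimal("15", 3)        # comparable to DecimalType(18,3)
print(unit_price, tax_rate)
```

Note that the replacement touches every comma, so a value that also used commas as thousands separators would not parse cleanly.
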
## Step 3: Creating a temporary view
- We create the temporary view **vw_orders** to expose the data transformed in the previous step

In [3]:
df.createOrReplaceTempView("vw_orders")

## Step 4: MERGE statement

- If a row with the same **OrderID** and **OrderLineID** exists in the destination table **(MATCHED)** and any other column differs from the existing row, we update the row
- If no row with the same **OrderID** and **OrderLineID** exists in the destination table **(NOT MATCHED)**, we insert a new row
- If a row in the destination table does not exist in the source file **(NOT MATCHED BY SOURCE)**, we delete that row

In [4]:
%%sql

MERGE INTO orders AS target
USING vw_orders AS source
ON target.OrderID = source.OrderID AND target.OrderLineID = source.OrderLineID
WHEN MATCHED AND 
(
        target.CustomerID <> source.CustomerID
        OR target.ContactPersonID <> source.ContactPersonID
        OR target.SalespersonPersonID <> source.SalespersonPersonID
        OR target.CustomerPurchaseOrderNumber <> source.CustomerPurchaseOrderNumber
        OR target.OrderDate <> source.OrderDate
        OR target.StockItemID <> source.StockItemID
        OR target.Description <> source.Description
        OR target.PackageTypeID <> source.PackageTypeID
        OR target.Quantity <> source.Quantity
        OR target.UnitPrice <> source.UnitPrice
        OR target.TaxRate <> source.TaxRate
)
THEN
UPDATE SET
        target.CustomerID = source.CustomerID
        ,target.ContactPersonID = source.ContactPersonID
        ,target.SalespersonPersonID = source.SalespersonPersonID
        ,target.CustomerPurchaseOrderNumber = source.CustomerPurchaseOrderNumber
        ,target.OrderDate = source.OrderDate
        ,target.StockItemID = source.StockItemID
        ,target.Description = source.Description
        ,target.PackageTypeID = source.PackageTypeID
        ,target.Quantity = source.Quantity
        ,target.UnitPrice = source.UnitPrice
        ,target.TaxRate = source.TaxRate
        ,target.LastUpdated = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (OrderID, OrderLineID, CustomerID, ContactPersonID, SalespersonPersonID, CustomerPurchaseOrderNumber, 
        OrderDate, StockItemID, Description, PackageTypeID, Quantity, UnitPrice, TaxRate, LastUpdated)
VALUES (source.OrderID, source.OrderLineID, source.CustomerID, source.ContactPersonID, source.SalespersonPersonID, source.CustomerPurchaseOrderNumber, 
        source.OrderDate, source.StockItemID, source.Description, source.PackageTypeID, source.Quantity, source.UnitPrice, source.TaxRate, CURRENT_TIMESTAMP())
WHEN NOT MATCHED BY SOURCE THEN
DELETE;

<Spark SQL result set with 1 rows and 4 fields>
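
The three branches of the MERGE can be sketched in plain Python with dictionaries keyed by `(OrderID, OrderLineID)`. This is only an illustration of the logic, not how Delta Lake implements it, and `merge_orders` is a hypothetical helper. One difference worth keeping in mind: Python's `!=` treats `None` as an ordinary value, whereas SQL `<>` yields NULL when either side is NULL, so the SQL statement would not update a row whose only change involves a NULL.

```python
from datetime import datetime, timezone


def merge_orders(target, source, now=None):
    """Apply the update / insert / delete rules to `target` in place.

    Both arguments map (OrderID, OrderLineID) -> dict of the remaining columns.
    Returns how many rows each branch touched.
    """
    now = now or datetime.now(timezone.utc)
    counts = {"updated": 0, "inserted": 0, "deleted": 0}

    for key, row in source.items():
        existing = target.get(key)
        if existing is None:
            # NOT MATCHED: insert the source row
            target[key] = {**row, "LastUpdated": now}
            counts["inserted"] += 1
        elif any(existing.get(col) != val for col, val in row.items()):
            # MATCHED and at least one column differs: update the row
            target[key] = {**row, "LastUpdated": now}
            counts["updated"] += 1

    # NOT MATCHED BY SOURCE: delete target rows missing from the source
    for key in [k for k in target if k not in source]:
        del target[key]
        counts["deleted"] += 1

    return counts


# Made-up rows, with a single column for brevity
target = {(1, 1): {"Quantity": 5}, (1, 2): {"Quantity": 3}, (2, 1): {"Quantity": 7}}
source = {(1, 1): {"Quantity": 5}, (1, 2): {"Quantity": 4}, (3, 1): {"Quantity": 1}}
counts = merge_orders(target, source)
print(counts)
```

Running the same merge a second time with an unchanged source touches no rows, which is the point of the extra change-detection condition on the MATCHED branch.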

## Step 5: Dropping the temporary view

In [5]:
spark.catalog.dropTempView("vw_orders")

True